2025 04 04

From Text to Graph: Leveraging Graph Neural Networks for Enhanced Explainability in NLP

Fabio Yáñez-Romero,Andrés Montoyo,Armando Suárez,Yoan Gutiérrez,Ruslan Mitkov

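Task: Propose a novel method that makes natural language processing tasks explainable by automatically converting sentences into graph structures.

Motivation: Transformer-type models perform well, but their large size makes them computationally expensive, and their token-based explanations lack semantic coherence, which makes it hard to explain their decisions from inside the model.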

Details

Method: Sentences are converted into graphs whose nodes and relations express fundamental linguistic concepts, which preserves semantics and lets the resulting knowledge be reused in downstream tasks.

Result: The experiments indicate that the method can identify the parts of the text structure that matter most for a given classification.

Conclusion: The method offers an efficient and explainable approach to NLP tasks and helps show how a model associates elements of the text with the task.

Abstract: Researchers have relegated natural language processing tasks to Transformer-type models, particularly generative models, because these models exhibit high versatility when performing generation and classification tasks. As the size of these models increases, they achieve outstanding results. Given their widespread use, many explainability techniques are developed based on these models. However, this process becomes computationally expensive due to the large size of the models. Additionally, transformers interpret input information through tokens that fragment input words into sequences lacking inherent semantic meaning, complicating the explanation of the model from the very beginning. This study proposes a novel methodology to achieve explainability in natural language processing tasks by automatically converting sentences into graphs and maintaining semantics through nodes and relations that express fundamental linguistic concepts. It also allows this knowledge to be exploited in subsequent tasks, making it possible to obtain trends and understand how the model associates the different elements inside the text with the explained task. The experiments delivered promising results in determining the most critical components within the text structure for a given classification.

Increasing happiness through conversations with artificial intelligence

Joseph Heffner,Chongyu Qin,Martin Chadwick,Chris Knutsen,Christopher Summerfield,Zeb Kurth-Nelson,Robb B. Rutledge

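Task: Study how conversations with AI chatbots affect subjective well-being.

Motivation: Examine how conversing with an AI chatbot influences happiness, especially when the topic is negative.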

Details

Method: Participants either conversed with an AI chatbot or wrote journal entries, then reported their happiness; large language models were used for sentiment analysis.

Result: Happiness was higher after AI chatbot conversations than after journaling, especially for negative topics, where participants' sentiment gradually aligned with the AI's positivity.

Conclusion: AI conversations influence happiness through sentiment prediction errors, underscoring the effect AI interactions can have on human well-being.

Abstract: Chatbots powered by artificial intelligence (AI) have rapidly become a significant part of everyday life, with over a quarter of American adults using them multiple times per week. While these tools offer potential benefits and risks, a fundamental question remains largely unexplored: How do conversations with AI influence subjective well-being? To investigate this, we conducted a study where participants either engaged in conversations with an AI chatbot (N = 334) or wrote journal entries (N = 193) on the same randomly assigned topics and reported their momentary happiness afterward. We found that happiness after AI chatbot conversations was higher than after journaling, particularly when discussing negative topics such as depression or guilt. Leveraging large language models for sentiment analysis, we found that the AI chatbot mirrored participants' sentiment while maintaining a consistent positivity bias. When discussing negative topics, participants gradually aligned their sentiment with the AI's positivity, leading to an overall increase in happiness. We hypothesized that the history of participants' sentiment prediction errors, the difference between expected and actual emotional tone when responding to the AI chatbot, might explain this happiness effect. Using computational modeling, we found that the history of these sentiment prediction errors over the course of a conversation predicts greater post-conversation happiness, demonstrating a central role of emotional expectations during dialogue. Our findings underscore the effect that AI interactions can have on human well-being.
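
The abstract does not give the form of the computational model, so the following is only a minimal sketch of one way a "history of sentiment prediction errors" could feed a happiness prediction. The exponential decay, the linear readout, and every name and parameter here (`discounted_history`, `baseline`, `weight`, `decay`) are assumptions made for illustration, not the authors' model.

```python
def prediction_errors(expected, actual):
    """Per-turn sentiment prediction error: actual minus expected tone."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    return [a - e for e, a in zip(expected, actual)]


def discounted_history(errors, decay):
    """Exponentially discounted sum of errors; recent turns weigh most.

    ASSUMPTION: the decay form is illustrative, not taken from the paper.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    total = 0.0
    for err in errors:
        total = decay * total + err
    return total


def predicted_happiness(expected, actual, baseline, weight, decay):
    """Hypothetical linear readout of post-conversation happiness."""
    errors = prediction_errors(expected, actual)
    return baseline + weight * discounted_history(errors, decay)


if __name__ == "__main__":
    # Toy conversation: the reply tone is consistently better than expected.
    expected_tone = [-0.6, -0.4, -0.2]
    actual_tone = [-0.1, 0.1, 0.3]
    print(predicted_happiness(expected_tone, actual_tone,
                              baseline=50.0, weight=10.0, decay=0.7))
```

Under this sketch, a run of positive prediction errors raises the predicted happiness, and `decay` controls how quickly earlier turns stop mattering.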

ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation

Xiao Wang,Daniil Larionov,Siwei Wu,Yiqi Liu,Steffen Eger,Nafise Sadat Moosavi,Chenghua Lin

Task: Introducing ContrastScore, a contrastive evaluation metric for assessing the quality of generated text in NLG tasks.

Motivation: Conventional reference-based metrics and smaller LLM-based metrics correlate only weakly with human evaluations and exhibit biases.

Details

Method: ContrastScore is designed as a contrastive evaluation metric to improve alignment with human judgments, tested on machine translation and summarization tasks.

Result: ContrastScore achieves stronger correlation with human judgments than baselines, outperforms larger models like Qwen 7B, and mitigates common evaluation biases.

Conclusion: ContrastScore provides a higher-quality, less biased, and more efficient method for NLG evaluation.

Abstract: Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.

Language Models at the Syntax-Semantics Interface: A Case Study of the Long-Distance Binding of Chinese Reflexive ziji

Xiulin Yang

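Task: Explore whether language models can resolve the complex binding patterns of the Mandarin Chinese reflexive ziji, which are constrained by both syntactic and semantic factors.

Motivation: Test whether language models handle ziji the way humans do, respecting complex syntactic and semantic constraints.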

Details

Method: A dataset of 240 synthetic sentences and 320 natural sentences was built; 21 language models were evaluated on it and compared with native-speaker judgments.

Result: No existing model consistently replicates human judgments; models rely mainly on sequential cues, overlook subtle syntactic and semantic constraints, and are more sensitive to noun-related than verb-related semantics.

Conclusion: Existing language models remain limited in handling ziji and need further improvement to capture complex linguistic constraints.

Abstract: This paper explores whether language models can effectively resolve the complex binding patterns of the Mandarin Chinese reflexive ziji, which are constrained by both syntactic and semantic factors. We construct a dataset of 240 synthetic sentences using templates and examples from syntactic literature, along with 320 natural sentences from the BCC corpus. Evaluating 21 language models against this dataset and comparing their performance to judgments from native Mandarin speakers, we find that none of the models consistently replicates human-like judgments. The results indicate that existing language models tend to rely heavily on sequential cues, though not always favoring the closest strings, and often overlook subtle semantic and syntactic constraints. They tend to be more sensitive to noun-related than verb-related semantics.

LSC-ADL: An Activity of Daily Living (ADL)-Annotated Lifelog Dataset Generated via Semi-Automatic Clustering

Minh-Quan Ho-Le,Duy-Khang Ho,Van-Tu Ninh,Cathal Gurrin,Minh-Triet Tran

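Task: Introduce LSC-ADL, a dataset that supports activity-level lifelog retrieval.

Motivation: Existing methods overlook activity-level annotations, which capture temporal relationships and enrich semantic understanding.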

Details

Method: A semi-automatic approach combines the HDBSCAN algorithm for intra-class clustering with human verification to produce accurate ADL annotations.

Result: LSC-ADL fills a gap in existing research by offering a more context-aware representation of daily life.

Conclusion: The dataset is expected to advance research in lifelog retrieval, activity recognition, and egocentric vision, improving the accuracy and interpretability of retrieved content.

Abstract: Lifelogging involves continuously capturing personal data through wearable cameras, providing an egocentric view of daily activities. Lifelog retrieval aims to search and retrieve relevant moments from this data, yet existing methods largely overlook activity-level annotations, which capture temporal relationships and enrich semantic understanding. In this work, we introduce LSC-ADL, an ADL-annotated lifelog dataset derived from the LSC dataset, incorporating Activities of Daily Living (ADLs) as a structured semantic layer. Using a semi-automatic approach featuring the HDBSCAN algorithm for intra-class clustering and human-in-the-loop verification, we generate accurate ADL annotations to enhance retrieval explainability. By integrating action recognition into lifelog retrieval, LSC-ADL bridges a critical gap in existing research, offering a more context-aware representation of daily life. We believe this dataset will advance research in lifelog retrieval, activity recognition, and egocentric vision, ultimately improving the accuracy and interpretability of retrieved content. The ADL annotations can be downloaded at https://bit.ly/lsc-adl-annotations.

Overcoming Vocabulary Constraints with Pixel-level Fallback

Jonas F. Lotz,Hendra Setiawan,Stephan Peitz,Yova Kementchedjhieva

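Task: Propose a pixel-based, vocabulary-free encoder that strengthens the multilingual capabilities of pretrained language models.

Motivation: Subword tokenization must balance computational efficiency against vocabulary coverage, which leads to poor performance on languages and scripts that were not prioritized during training.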

Details

Method: Input embeddings are generated from text rendered as pixels rather than from subword tokens.

Result: Experiments show substantially better machine translation performance and effective cross-lingual transfer, outperforming tokenizer-based methods, with lower decoding latency.

Conclusion: Pixel-based representations outperform byte-level approaches and standard vocabulary expansion, and they enhance the multilingual capabilities of monolingual models without extensive retraining.

Abstract: Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.

Aligned Better, Listen Better for Audio-Visual Large Language Models

Yuxin Guo,Shuailei Ma,Shijie Ma,Xiaoyi Bao,Chen-Wei Xie,Kecheng Zheng,Tingyu Weng,Siyang Sun,Yun Zheng,Wei Zou

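Task: Propose a fine-grained audio-visual large language model (Dolphin) and an audio-visual dataset (AVU) to address how poorly existing video LLMs exploit audio information.

Motivation: Audio is essential for multimodal video understanding, yet existing models exploit it poorly, which leads to weak understanding and hallucinations.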

Details

Method: (1) On the architecture side, Dolphin aligns audio and vision spatially through an audio-visual multi-scale adapter and temporally through audio-visual interleaved merging. (2) On the data side, the AVU dataset comprises 5.2 million diverse data tuples and introduces a new data partitioning strategy.

Result: Experiments show that Dolphin performs strongly on audio-visual understanding and mitigates hallucinations.

Conclusion: Dolphin and AVU provide an effective solution for audio-visual understanding and markedly improve model performance.

Abstract: Audio is essential for multimodal video understanding. On the one hand, video inherently contains audio, which supplies complementary information to vision. On the other hand, video large language models (Video-LLMs) can encounter many audio-centric settings. However, existing Video-LLMs and Audio-Visual Large Language Models (AV-LLMs) exhibit deficiencies in exploiting audio information, leading to weak understanding and hallucinations. To address these issues, we delve into the model architecture and dataset. (1) From the architectural perspective, we propose a fine-grained AV-LLM, namely Dolphin. The concurrent alignment of audio and visual modalities in both temporal and spatial dimensions ensures a comprehensive and accurate understanding of videos. Specifically, we devise an audio-visual multi-scale adapter for multi-scale information aggregation, which achieves spatial alignment. For temporal alignment, we propose audio-visual interleaved merging. (2) From the dataset perspective, we curate an audio-visual caption and instruction-tuning dataset, called AVU. It comprises 5.2 million diverse, open-ended data tuples (video, audio, question, answer) and introduces a novel data partitioning strategy. Extensive experiments show our model not only achieves remarkable performance in audio-visual understanding, but also mitigates potential hallucinations.

One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image

Ezzeldin Shereen,Dan Ristea,Burak Hasircioglu,Shae McFadden,Vasilios Mavroudis,Chris Hicks

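Task: Study poisoning attacks against multimodal retrieval augmented generation (M-RAG) systems, specifically in visual document retrieval applications.

Motivation: M-RAG uses a knowledge base (KB) to curb hallucinations in large multimodal models (LMMs), but it also opens a new attack surface: adversaries can disrupt the system by injecting malicious entries.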

Details

Method: A poisoning attack crafts a single malicious image that is retrieved for many different queries and influences the generator's output, yielding a denial-of-service (DoS) attack.

Result: The attack is effective against a range of widely used, state-of-the-art retrievers (embedding models) and generators (LMMs), but ineffective against some robust embedding models.

Conclusion: The attack exposes the vulnerability of M-RAG pipelines to poisoning and points to a weakness that may limit their performance even in benign settings.

Abstract: Multimodal retrieval augmented generation (M-RAG) has recently emerged as a method to inhibit hallucinations of large multimodal models (LMMs) through a factual knowledge base (KB). However, M-RAG also introduces new attack vectors for adversaries that aim to disrupt the system by injecting malicious entries into the KB. In this work, we present a poisoning attack against M-RAG targeting visual document retrieval applications, where the KB contains images of document pages. Our objective is to craft a single image that is retrieved for a variety of different user queries, and consistently influences the output produced by the generative model, thus creating a universal denial-of-service (DoS) attack against the M-RAG system. We demonstrate that while our attack is effective against a diverse range of widely-used, state-of-the-art retrievers (embedding models) and generators (LMMs), it can also be ineffective against robust embedding models. Our attack not only highlights the vulnerability of M-RAG pipelines to poisoning attacks, but also sheds light on a fundamental weakness that potentially hinders their performance even in benign settings.

FreSca: Unveiling the Scaling Space in Diffusion Models

Chao Huang,Susan Liang,Yunlong Tang,Li Ma,Yapeng Tian,Chenliang Xu

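Task: Examine the frequency characteristics of noise predictions in diffusion models and propose FreSca, a method that adjusts different frequency bands independently to improve image editing and understanding tasks.

Motivation: Diffusion models offer control over image tasks through noise predictions and classifier-free guidance, but the potential of the resulting scaling space for fine-grained semantic manipulation remains underexplored.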

Details

Method: A Fourier analysis of noise predictions shows that their low- and high-frequency components evolve differently during diffusion; FreSca builds on this by scaling guidance independently for each frequency band.

Result: FreSca improves existing image editing methods and yields quantitative gains on image understanding tasks such as depth estimation.

Conclusion: Frequency analysis reveals new properties of noise predictions and suggests a new direction for improving controllability in diffusion models.

Abstract: Diffusion models offer impressive controllability for image tasks, primarily through noise predictions that encode task-specific information and classifier-free guidance enabling adjustable scaling. This scaling mechanism implicitly defines a "scaling space" whose potential for fine-grained semantic manipulation remains underexplored. We investigate this space, starting with inversion-based editing where the difference between conditional/unconditional noise predictions carries key semantic information. Our core contribution stems from a Fourier analysis of noise predictions, revealing that their low- and high-frequency components evolve differently throughout diffusion. Based on this insight, we introduce FreSca, a straightforward method that applies guidance scaling independently to different frequency bands in the Fourier domain. FreSca demonstrably enhances existing image editing methods without retraining. Excitingly, its effectiveness extends to image understanding tasks such as depth estimation, yielding quantitative gains across multiple datasets.
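
The idea of scaling guidance separately per frequency band can be illustrated on a toy 1-D signal. This is a minimal sketch, not the authors' implementation: it assumes the scaled quantity is the difference between the conditional and unconditional noise predictions, uses a naive DFT from the standard library instead of a 2-D FFT over image latents, and the names `frequency_scaled_guidance`, `low_scale`, `high_scale`, and `cutoff` are hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform (fine for short toy signals)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Inverse DFT; returns the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]


def frequency_scaled_guidance(eps_cond, eps_uncond, low_scale, high_scale, cutoff):
    """Scale low and high frequency bands of the guidance term separately.

    ASSUMPTION: the guidance term is eps_cond - eps_uncond, and a single
    cutoff index splits the spectrum into two bands.
    """
    n = len(eps_cond)
    delta = [c - u for c, u in zip(eps_cond, eps_uncond)]
    spectrum = dft(delta)
    scaled = []
    for k, coeff in enumerate(spectrum):
        # Bins k and n - k hold the same frequency, so both get the same
        # scale; this keeps the reconstructed signal real.
        freq = min(k, n - k)
        scale = low_scale if freq <= cutoff else high_scale
        scaled.append(scale * coeff)
    guided = idft(scaled)
    return [u + g for u, g in zip(eps_uncond, guided)]
```

When `low_scale` equals `high_scale`, the function reduces to ordinary classifier-free guidance with a single scale, which is a convenient sanity check for the band split.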

LL4G: Self-Supervised Dynamic Optimization for Graph-Based Personality Detection

Lingzhi Shen,Yunfei Long,Xiaohao Cai,Guanming Chen,Yuhan Wang,Imran Razzak,Shoaib Jameel

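Task: Propose LL4G, a self-supervised framework that uses large language models to optimize graph neural networks (GNNs) for graph-based personality detection.

Motivation: Current methods struggle with sparse or noisy data and rely on static graphs, which limits their ability to capture dynamic changes in nodes and relationships.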

Details

Method: Large language models (LLMs) extract semantic features to build node representations and infer explicit and implicit relationships; the graph structure is adjusted dynamically, and a GNN is trained jointly on the result.

Result: Experiments on the Kaggle and Pandora datasets show that LL4G outperforms state-of-the-art models.

Conclusion: By combining semantic and structural information, LL4G produces more robust personality profiles.

Abstract: Graph-based personality detection constructs graph structures from textual data, particularly social media posts. Current methods often struggle with sparse or noisy data and rely on static graphs, limiting their ability to capture dynamic changes between nodes and relationships. This paper introduces LL4G, a self-supervised framework leveraging large language models (LLMs) to optimize graph neural networks (GNNs). LLMs extract rich semantic features to generate node representations and to infer explicit and implicit relationships. The graph structure adaptively adds nodes and edges based on input data, continuously optimizing itself. The GNN then uses these optimized representations for joint training on node reconstruction, edge prediction, and contrastive learning tasks. This integration of semantic and structural information generates robust personality profiles. Experimental results on Kaggle and Pandora datasets show LL4G outperforms state-of-the-art models.

UAVTwin: Neural Digital Twins for UAVs using Gaussian Splatting

Jaehoon Choi,Dongki Jung,Yonghan Lee,Sungmin Eum,Dinesh Manocha,Heesung Kwon

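Task: Propose UAVTwin, a method for creating digital twins from real-world environments and using them for data augmentation to train downstream models on unmanned aerial vehicles (UAVs).

Motivation: Dynamic objects and appearance variations in UAV perception introduce artifacts in 3D Gaussian Splatting (3DGS) modeling; addressing them should improve downstream models.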

Details

Method: 3D Gaussian Splatting (3DGS) reconstructs the background and controllable synthetic human models are added; a novel appearance modeling strategy and a mask refinement module improve 3DGS training.

Result: PSNR improves by 1.23 dB in neural rendering quality, and mAP improves by 2.5% to 13.7% on human detection.

Conclusion: UAVTwin is presented as the first 3DGS-based approach to high-fidelity digital twins for UAV perception, and it improves data augmentation and downstream model performance.

Abstract: We present UAVTwin, a method for creating digital twins from real-world environments and facilitating data augmentation for training downstream models embedded in unmanned aerial vehicles (UAVs). Specifically, our approach focuses on synthesizing foreground components, such as various human instances in motion within complex scene backgrounds, from UAV perspectives. This is achieved by integrating 3D Gaussian Splatting (3DGS) for reconstructing backgrounds along with controllable synthetic human models that display diverse appearances and actions in multiple poses. To the best of our knowledge, UAVTwin is the first approach for UAV-based perception that is capable of generating high-fidelity digital twins based on 3DGS. The proposed work significantly enhances downstream models through data augmentation for real-world environments with multiple dynamic objects and significant appearance variations, both of which typically introduce artifacts in 3DGS-based modeling. To tackle these challenges, we propose a novel appearance modeling strategy and a mask refinement module to enhance the training of 3D Gaussian Splatting. We demonstrate the high quality of neural rendering by achieving a 1.23 dB improvement in PSNR compared to recent methods. Furthermore, we validate the effectiveness of data augmentation by showing a 2.5% to 13.7% improvement in mAP for the human detection task.
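
To put the reported 1.23 dB PSNR gain in context, the sketch below uses the standard PSNR definition, 10 * log10(MAX^2 / MSE), on flat lists of pixel values. The function names and the assumption of a fixed peak value are illustrative; the abstract does not describe the evaluation code.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(rendered):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain at a fixed peak value."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
    print(mse_ratio_for_gain(1.23))
```

Under this definition, a 1.23 dB gain corresponds to a mean squared error roughly 25% lower than the baseline's.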

Subasa -- Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala

Shanilka Haturusinghe,Tharindu Cyril Weerasooriya,Marcos Zampieri,Christopher M. Homan,S. R. Liyanage

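Task: Improve offensive language detection in a low-resource language, Sinhala.

Motivation: Performance on offensive language detection differs sharply between high- and low-resource languages, so new fine-tuning strategies are needed for the latter.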

Details

Method: Four models are introduced: Subasa-XLM-R, which adds an intermediate pre-finetuning step, and variants of Subasa-Llama and Subasa-Mistral, which are task-specific fine-tunes of Llama and Mistral.

Result: All models outperform existing baselines on the SOLD benchmark; Subasa-XLM-R achieves the highest Macro F1 score (0.84), surpassing large language models such as GPT-4o evaluated under zero-shot settings.

Conclusion: The approach improves Sinhala offensive language detection, and the models and code are publicly available.

Abstract: Accurate detection of offensive language is essential for a number of applications related to social media safety. There is a sharp contrast in performance in this task between low and high-resource languages. In this paper, we adapt fine-tuning strategies that have not been previously explored for Sinhala in the downstream task of offensive language detection. Using this approach, we introduce four models: "Subasa-XLM-R", which incorporates an intermediate pre-finetuning step using Masked Rationale Prediction, and two variants of "Subasa-Llama" and "Subasa-Mistral", which are fine-tuned versions of Llama (3.2) and Mistral (v0.3), respectively, with a task-specific strategy. We evaluate our models on the SOLD benchmark dataset for Sinhala offensive language detection. All our models outperform existing baselines. Subasa-XLM-R achieves the highest Macro F1 score (0.84), surpassing state-of-the-art large language models like GPT-4o when evaluated on the same SOLD benchmark dataset under zero-shot settings. The models and code are publicly available.
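
Macro F1, the headline metric here, averages per-class F1 scores without weighting by class frequency, so the minority class counts as much as the majority class. The following is a small stdlib-only sketch of that computation; the label strings and the toy predictions are illustrative and are not drawn from the SOLD data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy, imbalanced example with hypothetical labels.
    gold = ["OFF", "OFF", "NOT", "NOT", "NOT", "NOT"]
    pred = ["OFF", "NOT", "NOT", "NOT", "NOT", "OFF"]
    print(macro_f1(gold, pred))
```

In the toy example the classifier is right on four of six items, yet its Macro F1 is lower than its accuracy because the minority class is predicted poorly.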

Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

Shaojin Wu,Mengqi Huang,Wenxu Wu,Yufeng Cheng,Fei Ding,Qian He

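Task: Propose a highly consistent data synthesis pipeline and the UNO model to address data scalability and subject expansibility in multi-subject generation.

Motivation: Subject-driven generation is widely used in image generation but still faces challenges in data scalability and subject expansibility, particularly when scaling from single-subject to multi-subject datasets and when handling multi-subject scenarios.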

Details

Method: The in-context generation ability of diffusion transformers is used to build a pipeline that synthesizes high-consistency multi-subject paired data; the UNO model adds progressive cross-modal alignment and universal rotary position embedding.

Result: Experiments show high consistency and controllability in both single-subject and multi-subject driven generation.

Conclusion: The approach addresses data scalability and subject expansibility in multi-subject generation and has practical value.

Abstract: Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still faces challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making them hard to apply to multi-subject scenarios. In this study, we propose a highly consistent data synthesis pipeline to tackle this challenge. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. Additionally, we introduce UNO, which consists of progressive cross-modal alignment and universal rotary position embedding. It is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.

LLMs as Deceptive Agents: How Role-Based Prompting Induces Semantic Ambiguity in Puzzle Tasks

Seunghyun Yoo

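Task: Study how a large language model (LLM), acting as an autonomous agent, exploits semantic ambiguity to generate deceptive puzzles.

Motivation: Explore the agentic behaviors LLMs show in adversarial settings, especially how they use linguistic ambiguity to mislead human users.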

Details

Method: Puzzles produced by zero-shot prompting, role-injected adversarial prompts, and human authors are compared systematically; semantic ambiguity is quantified with HateBERT and complemented by subjective human evaluation.

Result: Explicit adversarial agent behavior significantly increases semantic ambiguity, raising cognitive load and reducing fairness in puzzle solving.

Conclusion: The study characterizes emergent agentic behavior in LLMs and highlights ethical considerations for safely deploying autonomous language systems in educational technology and entertainment.

Abstract: Recent advancements in Large Language Models (LLMs) have not only showcased impressive creative capabilities but also revealed emerging agentic behaviors that exploit linguistic ambiguity in adversarial settings. In this study, we investigate how an LLM, acting as an autonomous agent, leverages semantic ambiguity to generate deceptive puzzles that mislead and challenge human users. Inspired by the popular puzzle game "Connections", we systematically compare puzzles produced through zero-shot prompting, role-injected adversarial prompts, and human-crafted examples, with an emphasis on understanding the underlying agent decision-making processes. Employing computational analyses with HateBERT to quantify semantic ambiguity, alongside subjective human evaluations, we demonstrate that explicit adversarial agent behaviors significantly heighten semantic ambiguity, thereby increasing cognitive load and reducing fairness in puzzle solving. These findings provide critical insights into the emergent agentic qualities of LLMs and underscore important ethical considerations for evaluating and safely deploying autonomous language systems in both educational technologies and entertainment.

MDP: Multidimensional Vision Model Pruning with Latency Constraint

Xinglong Sun,Barath Lakshmanan,Maying Shen,Shiyi Lan,Jingde Chen,Jose M. Alvarez

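Task: Propose Multi-Dimensional Pruning (MDP), a new paradigm that addresses the limits of existing structural pruning methods in pruning granularity and latency modeling.

Motivation: Existing methods fall short in pruning granularity and latency modeling, which limits aggressive parameter reduction and accurate latency optimization, especially for transformers.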

Details

Method: MDP jointly optimizes several pruning granularities (channels, query, key, heads, embeddings, and blocks), uses an advanced latency model, and reformulates pruning as a Mixed-Integer Nonlinear Program (MINLP).

Result: MDP significantly outperforms prior methods at high pruning ratios; for ResNet50 on ImageNet it achieves a 28% speed increase with a +1.4 Top-1 accuracy improvement over prior work such as HALP.

Conclusion: MDP is a general framework for CNNs and transformers that balances latency and accuracy and outperforms existing methods.

Abstract: Current structural pruning methods face two significant limitations: (i) they often limit pruning to finer-grained levels like channels, making aggressive parameter reduction challenging, and (ii) they focus heavily on parameter and FLOP reduction, with existing latency-aware methods frequently relying on simplistic, suboptimal linear models that fail to generalize well to transformers, where multiple interacting dimensions impact latency. In this paper, we address both limitations by introducing Multi-Dimensional Pruning (MDP), a novel paradigm that jointly optimizes across a variety of pruning granularities, including channels, query, key, heads, embeddings, and blocks. MDP employs an advanced latency modeling technique to accurately capture latency variations across all prunable dimensions, achieving an optimal balance between latency and accuracy. By reformulating pruning as a Mixed-Integer Nonlinear Program (MINLP), MDP efficiently identifies the optimal pruned structure across all prunable dimensions while respecting latency constraints. This versatile framework supports both CNNs and transformers. Extensive experiments demonstrate that MDP significantly outperforms previous methods, especially at high pruning ratios. On ImageNet, MDP achieves a 28% speed increase with a +1.4 Top-1 accuracy improvement over prior work like HALP for ResNet50 pruning. Against the latest transformer pruning method, Isomorphic, MDP delivers an additional 37% acceleration with a +0.7 Top-1 accuracy improvement.
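
The shape of the optimization problem, integer choices per prunable dimension, a latency function in which dimensions interact, and a latency budget, can be shown with a toy exhaustive search. This is only a sketch: it enumerates candidates instead of using an MINLP solver, and the dimension names, the `importance` and `latency` functions, and the budget are all hypothetical stand-ins rather than the paper's formulation.

```python
from itertools import product


def best_pruned_structure(options, importance, latency, budget):
    """Return (score, config) maximizing importance subject to latency <= budget.

    options maps each prunable dimension to the integer sizes it may keep.
    Exhaustive enumeration is only viable for tiny toy search spaces.
    """
    names = list(options)
    best = None
    for choice in product(*(options[name] for name in names)):
        config = dict(zip(names, choice))
        if latency(config) > budget:
            continue  # violates the latency constraint
        score = importance(config)
        if best is None or score > best[0]:
            best = (score, config)
    return best


# Hypothetical toy model: latency depends on the PRODUCT of two dimensions,
# which is the kind of interaction a linear latency model cannot express.
def toy_latency(config):
    return 0.5 * config["heads"] * config["blocks"]


def toy_importance(config):
    return config["heads"] + 2 * config["blocks"]


if __name__ == "__main__":
    search_space = {"heads": [2, 4, 8], "blocks": [6, 12]}
    print(best_pruned_structure(search_space, toy_importance, toy_latency, budget=30))
```

Because the toy latency multiplies `heads` by `blocks`, keeping every head and every block exceeds the budget, and the search has to trade one dimension against the other.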

State-of-the-Art Translation of Text-to-Gloss using mBART : A case study of Bangla

Sharif Md. Abdullah,Abhijit Paul,Shebuti Rayana,Ahmedul Kabir,Zarif Masud

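Task: Study how to translate Bangla text into gloss representations for Bangla Sign Language (BdSL).

Motivation: Despite a deaf and mute population of 1.7 million, BdSL remains understudied, and there is no prior work on the Bangla text-to-gloss translation task.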

Details

Method: A dataset is built with grammar-rule-based gloss generation (adapted from work on German and American Sign Language) and LLM-generated synthetic data, then augmented with back-translation and text generation. Pretrained mBART-50 and mBERT-multiclass-uncased models are fine-tuned, and GRU, RNN, and a novel seq-to-seq model are trained.

Result: mBART-50 performs strongly (SacreBLEU = 79.53) and reaches state-of-the-art performance on the PHOENIX-14T benchmark (SacreBLEU = 63.89).

Conclusion: mBART models are well suited to text-to-gloss tasks, and a rule-based synthetic dataset substantially helps the BdSL task.

Abstract: Despite a large deaf and dumb population of 1.7 million, Bangla Sign Language (BdSL) remains an understudied domain. Specifically, there are no works on the Bangla text-to-gloss translation task. To address this gap, we begin by addressing the dataset problem. We take inspiration from grammatical rule-based gloss generation used for German and American Sign Language (ASL) and adapt it for BdSL. We also leverage an LLM to generate synthetic data and use back-translation and text generation for data augmentation. With the dataset prepared, we started experimentation. We fine-tuned the pretrained mBART-50 and mBERT-multiclass-uncased models on our dataset. We also trained GRU, RNN, and a novel seq-to-seq model with multi-head attention. We observed significantly high performance (SacreBLEU = 79.53) when fine-tuning the pretrained mBART-50 multilingual model from Facebook. We then explored why we observe such high performance with mBART. We noticed an interesting property of mBART: it was trained on shuffled and masked text data. As we know, gloss form has a shuffling property, so we hypothesize that mBART is inherently good at text-to-gloss tasks. To test this hypothesis, we trained mBART-50 on the PHOENIX-14T benchmark and compared it with existing literature. Our fine-tuned mBART-50 demonstrated state-of-the-art performance on the PHOENIX-14T benchmark, far outperforming existing models in all 6 metrics (SacreBLEU = 63.89, BLEU-1 = 55.14, BLEU-2 = 38.07, BLEU-3 = 27.13, BLEU-4 = 20.68, COMET = 0.624). Based on the results, this study proposes a new paradigm for the text-to-gloss task using mBART models. Additionally, our results show that the BdSL text-to-gloss task can greatly benefit from a rule-based synthetic dataset.

Foreground Focus: Enhancing Coherence and Fidelity in Camouflaged Image Generation

Pei-Chi Chen,Yi Yao,Chan-Feng Hsu,HongXia Xie,Hung-Jen Chen,Hong-Han Shuai,Wen-Huang Cheng

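Task: Propose a Foreground-Aware Camouflaged Image Generation (FACIG) model to address weak foreground-background integration and low foreground fidelity in existing methods.

Motivation: Existing camouflaged image generation methods integrate background and foreground features poorly and do not preserve foreground fidelity, which degrades image quality.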

Details

Method: A Foreground-Aware Feature Integration Module (FAFIM) strengthens the integration of foreground features with background knowledge, and a Foreground-Aware Denoising Loss enhances supervision of foreground reconstruction.

Result: Experiments show the method outperforms prior methods in camouflaged image quality and foreground fidelity.

Conclusion: FACIG improves the quality and foreground fidelity of camouflaged image generation.

Abstract: Camouflaged image generation is emerging as a solution to data scarcity in camouflaged vision perception, offering a cost-effective alternative to data collection and labeling. Recently, the state-of-the-art approach successfully generates camouflaged images using only foreground objects. However, it faces two critical weaknesses: 1) the background knowledge does not integrate effectively with foreground features, resulting in a lack of foreground-background coherence (e.g., color discrepancy); 2) the generation process does not prioritize the fidelity of foreground objects, which leads to distortion, particularly for small objects. To address these issues, we propose a Foreground-Aware Camouflaged Image Generation (FACIG) model. Specifically, we introduce a Foreground-Aware Feature Integration Module (FAFIM) to strengthen the integration between foreground features and background knowledge. In addition, a Foreground-Aware Denoising Loss is designed to enhance foreground reconstruction supervision. Experiments on various datasets show our method outperforms previous methods in overall camouflaged image quality and foreground fidelity.

Measurement of LLM's Philosophies of Human Nature

Minheng Ni,Ennan Wu,Zidong Gong,Zhengyuan Yang,Linjie Li,Chung-Ching Lin,Kevin Lin,Lijuan Wang,Wangmeng Zuo

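Task: Design and validate M-PHNS, a standardized psychological scale for large language models (LLMs) that assesses their attitudes toward human nature, and propose a mental loop learning framework to optimize their value systems.

Motivation: The widespread use of AI, together with reports of conflicts or violations involving AI, has raised societal concern about interacting with AI systems, so a way to assess and improve AI attitudes toward humans is needed.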

Details

Method: M-PHNS is designed on the basis of Wrightsman's Philosophies of Human Nature Scale (PHNS), and a mental loop learning framework, which constructs moral scenarios, is used to optimize the LLM's value system.

Result: Current LLMs show a systemic lack of trust in humans, and model intelligence correlates significantly and negatively with trust in humans; mental loop learning significantly increases that trust.

Conclusion: M-PHNS offers a potential approach to diagnosing cognitive biases and supporting ethical learning in LLMs, showing the potential of human-based psychological assessment for AI.

Abstract: The widespread application of artificial intelligence (AI) in various tasks, along with frequent reports of conflicts or violations involving AI, has sparked societal concerns about interactions with AI systems. Based on Wrightsman's Philosophies of Human Nature Scale (PHNS), a scale empirically validated over decades to effectively assess individuals' attitudes toward human nature, we design a standardized psychological scale specifically targeting large language models (LLMs), named the Machine-based Philosophies of Human Nature Scale (M-PHNS). By evaluating LLMs' attitudes toward human nature across six dimensions, we reveal that current LLMs exhibit a systemic lack of trust in humans, and there is a significant negative correlation between the model's intelligence level and its trust in humans. Furthermore, we propose a mental loop learning framework, which enables an LLM to continuously optimize its value system during virtual interactions by constructing moral scenarios, thereby improving its attitude toward human nature. Experiments demonstrate that mental loop learning significantly enhances LLMs' trust in humans compared to persona or instruction prompts. This finding highlights the potential of human-based psychological assessments for LLMs, which can not only diagnose cognitive biases but also provide a potential solution for ethical learning in artificial intelligence. We release the M-PHNS evaluation code and data at https://github.com/kodenii/M-PHNS.

ESC: Erasing Space Concept for Knowledge Deletion

Tae-Young Lee,Sundong Park,Minwoo Jeon,Hyoseok Hwang,Gyeong-Moon Park

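Task: Introduce the concept of Knowledge Deletion (KD) and design a training-free erasing method (ESC) and a learnable-mask variant (ESC-T) to address privacy concerns in deep learning.

Motivation: Existing methods do not fully meet users' demand for complete knowledge erasure and risk leaking personal knowledge through embedding features.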

Details

Method: ESC restricts the important subspace of the knowledge to be forgotten by eliminating the relevant activations in the feature; ESC-T uses a learnable mask to balance forgetting against preserving knowledge.

Result: Experiments on various datasets and models show that the methods are the fastest, achieve state-of-the-art performance, and apply to diverse forgetting scenarios.

Conclusion: ESC and ESC-T are efficient and general for knowledge deletion and offer a new approach to privacy in deep learning.

Abstract: As concerns regarding privacy in deep learning continue to grow, individuals are increasingly apprehensive about the potential exploitation of their personal knowledge in trained models. Despite several research efforts to address this, they often fail to consider the real-world demand from users for complete knowledge erasure. Furthermore, our investigation reveals that existing methods have a risk of leaking personal knowledge through embedding features. To address these issues, we introduce a novel concept of Knowledge Deletion (KD), an advanced task that considers both concerns, and provide an appropriate metric, named Knowledge Retention score (KR), for assessing knowledge retention in feature space. To achieve this, we propose a novel training-free erasing approach named Erasing Space Concept (ESC), which restricts the important subspace for the forgetting knowledge by eliminating the relevant activations in the feature. In addition, we suggest ESC with Training (ESC-T), which uses a learnable mask to better balance the trade-off between forgetting and preserving knowledge in KD. Our extensive experiments on various datasets and models demonstrate that our proposed methods achieve the fastest and state-of-the-art performance. Notably, our methods are applicable to diverse forgetting scenarios, such as facial domain setting, demonstrating the generalizability of our methods. The code is available at http://github.com/KU-VGI/ESC.

Improving Harmful Text Detection with Joint Retrieval and External Knowledge

Zidong Yu,Shuo Wang,Nan Jiang,Weiqiang Huang,Xu Han,Junliang Du

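Task: Propose a joint retrieval framework that combines pre-trained language models with knowledge graphs to improve the accuracy and robustness of harmful text detection.

Motivation: As AI-generated content expands across digital platforms, harmful text detection has become critical, and traditional detection models have limitations.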

Details

Method: A joint retrieval approach that combines pre-trained language models with knowledge graphs and draws on external contextual information. Result: Experiments show that the method significantly outperforms single-model baselines in low-resource training and multilingual settings. Conclusion: The method provides more reliable content moderation for AI safety; future work should optimize computational efficiency, improve model interpretability, and extend to multimodal detection. Abstract: Harmful text detection has become a crucial task in the development and deployment of large language models, especially as AI-generated content continues to expand across digital platforms. This study proposes a joint retrieval framework that integrates pre-trained language models with knowledge graphs to improve the accuracy and robustness of harmful text detection. Experimental results demonstrate that the joint retrieval approach significantly outperforms single-model baselines, particularly in low-resource training scenarios and multilingual environments. The proposed method effectively captures nuanced harmful content by leveraging external contextual information, addressing the limitations of traditional detection models. Future research should focus on optimizing computational efficiency, enhancing model interpretability, and expanding multimodal detection capabilities to better tackle evolving harmful content patterns. This work contributes to the advancement of AI safety, ensuring more trustworthy and reliable content moderation systems.

Geospatial Artificial Intelligence for Satellite-based Flood Extent Mapping: Concepts, Advances, and Future Perspectives

Hyunho Lee,Wenwen Li

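Task: Use Geospatial Artificial Intelligence (GeoAI) with satellite data for flood extent mapping, in order to identify flood events and assess their impacts.

Motivation: Support disaster management and spatial decision-making.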

Details

Method: Systematic integration of artificial intelligence techniques with satellite data. Result: Flood extent maps that delineate affected areas, plus additional analytical outputs such as uncertainty estimation and change detection. Conclusion: GeoAI has significant application value for flood extent mapping and disaster management. Abstract: Geospatial Artificial Intelligence (GeoAI) for satellite-based flood extent mapping systematically integrates artificial intelligence techniques with satellite data to identify flood events and assess their impacts, for disaster management and spatial decision-making. The primary output often includes flood extent maps, which delineate the affected areas, along with additional analytical outputs such as uncertainty estimation and change detection.

CoTAL: Human-in-the-Loop Prompt Engineering, Chain-of-Thought Reasoning, and Active Learning for Generalizable Formative Assessment Scoring

Clayton Cohn,Nicole Hutchins,Ashwin T S,Gautam Biswas

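Task: Study how chain-of-thought (CoT) prompting combined with active learning (CoTAL) can improve the scoring performance of large language models (LLMs) on formative assessments across domains.

Motivation: LLMs have great potential in education, but how well existing methods generalize to cross-domain assessment has not been adequately validated.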

Details

Method: Develop curriculum-aligned formative assessments and rubrics following Evidence-Centered Design (ECD) principles, automate scoring with human-in-the-loop prompt engineering, and iteratively refine assessment questions, rubrics, and prompts using teacher and student feedback. Result: CoTAL improves GPT-4's scoring performance by up to 24.5% over a non-prompt-engineered baseline, and both teachers and students rate its scoring and explanations as effective. Conclusion: CoTAL is an effective approach to automated scoring of formative assessments across domains; human-in-the-loop refinement improves both scoring accuracy and explanation quality. Abstract: Large language models (LLMs) have created new opportunities to assist teachers and support student learning. Methods such as chain-of-thought (CoT) prompting enable LLMs to grade formative assessments in science, providing scores and relevant feedback to students. However, the extent to which these methods generalize across curricula in multiple domains (such as science, computing, and engineering) remains largely untested. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) principles to develop curriculum-aligned formative assessments and rubrics, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates teacher and student feedback to iteratively refine assessment questions, grading rubrics, and LLM prompts for automated grading. Our findings demonstrate that CoTAL improves GPT-4's scoring performance, achieving gains of up to 24.5% over a non-prompt-engineered baseline. Both teachers and students view CoTAL as effective in scoring and explaining student responses, each providing valuable refinements to enhance grading accuracy and explanation quality.

AC-LoRA: Auto Component LoRA for Personalized Artistic Style Image Generation

Zhipu Cui,Andong Tian,Zhi Ying,Jialiang Lu

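Task: Propose AC-LoRA, a method that automatically separates the signal and noise components of the LoRA matrices for efficient personalized artistic style image generation.

Motivation: Remove the need to manually tune the rank parameter in LoRA methods and improve the efficiency and quality of personalized image generation.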

Details

Method: Based on Singular Value Decomposition (SVD) and dynamic heuristics that update hyperparameters during training, the method automatically separates the signal and noise components of the LoRA matrices. Result: An average improvement of 9% across FID, CLIP, DINO, and ImageReward, outperforming existing methods. Conclusion: AC-LoRA effectively addresses model underfitting and overfitting and improves personalized image generation. Abstract: Personalized image generation allows users to preserve styles or subjects of a provided small set of images for further image generation. With the advancement in large text-to-image models, many techniques have been developed to efficiently fine-tune those models for personalization, such as Low Rank Adaptation (LoRA). However, LoRA-based methods often face the challenge of adjusting the rank parameter to achieve satisfactory results. To address this challenge, AutoComponent-LoRA (AC-LoRA) is proposed, which is able to automatically separate the signal component and noise component of the LoRA matrices for fast and efficient personalized artistic style image generation. This method is based on Singular Value Decomposition (SVD) and dynamic heuristics to update the hyperparameters during training. Superior performance over existing methods in overcoming model underfitting or overfitting problems is demonstrated. The results were validated using FID, CLIP, DINO, and ImageReward, achieving an average of 9% improvement.

LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models

Weibin Liao,Xin Gao,Tianyu Jia,Rihong Qiu,Yifan Zhu,Yang Lin,Xu Chu,Junfeng Zhao,Yasha Wang

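Task: Propose LearNAT, a new framework that uses task decomposition and reinforcement learning to improve the performance of open-source LLMs on complex NL2SQL tasks.

Motivation: Existing NL2SQL methods rely on closed-source LLMs or on open-source LLMs that require fine-tuning, yet open-source LLMs perform poorly on complex tasks because users express query objectives indirectly and a semantic gap separates user queries from database schemas.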

Details

Method: LearNAT comprises three key components: AST-guided decomposition synthesis, margin-aware reinforcement learning, and adaptive demonstration reasoning. Result: On the Spider and BIRD benchmarks, LearNAT brings a 7B-parameter open-source LLM close to GPT-4 performance while improving efficiency and accessibility. Conclusion: Through task decomposition and reinforcement learning, LearNAT substantially improves open-source LLMs on complex NL2SQL tasks. Abstract: Natural Language to SQL (NL2SQL) has emerged as a critical task for enabling seamless interaction with databases. Recent advancements in Large Language Models (LLMs) have demonstrated remarkable performance in this domain. However, existing NL2SQL methods predominantly rely on closed-source LLMs leveraging prompt engineering, while open-source models typically require fine-tuning to acquire domain-specific knowledge. Despite these efforts, open-source LLMs struggle with complex NL2SQL tasks due to the indirect expression of user query objectives and the semantic gap between user queries and database schemas. Inspired by the application of reinforcement learning in mathematical problem-solving to encourage step-by-step reasoning in LLMs, we propose LearNAT (Learning NL2SQL with AST-guided Task Decomposition), a novel framework that improves the performance of open-source LLMs on complex NL2SQL tasks through task decomposition and reinforcement learning. LearNAT introduces three key components: (1) a Decomposition Synthesis Procedure that leverages Abstract Syntax Trees (ASTs) to guide efficient search and pruning strategies for task decomposition, (2) Margin-aware Reinforcement Learning, which employs fine-grained step-level optimization via DPO with AST margins, and (3) Adaptive Demonstration Reasoning, a mechanism for dynamically selecting relevant examples to enhance decomposition capabilities. Extensive experiments on two benchmark datasets, Spider and BIRD, demonstrate that LearNAT enables a 7B-parameter open-source LLM to achieve performance comparable to GPT-4, while offering improved efficiency and accessibility.
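
The abstract mentions "DPO with AST margins" without giving the objective. As a rough illustration only, the sketch below shows a generic DPO loss for one preference pair with an additive margin term. The function name, the `beta` default, and the idea that the margin is a plain scalar subtracted from the reward gap are assumptions made here; how LearNAT actually derives margins from ASTs is not described in the abstract.

```python
import math

def margin_dpo_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected,
                    beta=0.1, margin=0.0):
    """Generic DPO loss for one (chosen, rejected) pair with an additive margin.

    `margin` is a hypothetical per-pair scalar; a larger margin asks the policy
    to separate the two responses more strongly before the loss becomes small.
    """
    # Implicit rewards: scaled log-ratio of the policy to the frozen reference.
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    logit = reward_chosen - reward_rejected - margin
    # -log(sigmoid(logit)) written as a numerically stable softplus(-logit).
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))
```

With identical log-probabilities and zero margin the loss equals log 2; raising the margin raises the loss for the same pair, which is the intended pressure toward a wider gap.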

SocialGesture: Delving into Multi-person Gesture Understanding

Xu Cao,Pranav Virupaksha,Wenqi Jia,Bolin Lai,Fiona Ryan,Sangmin Lee,James M. Rehg

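Task: Build SocialGesture, the first large-scale dataset for multi-person gesture analysis, and design a visual question answering task to evaluate vision language models.

Motivation: Existing gesture recognition research overlooks multi-person interaction, which makes it hard to understand the social context of natural gestures and to align gestures with other modalities such as language and speech.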

Details

Method: Introduce the SocialGesture dataset, which covers diverse natural scenarios and supports video-based recognition and temporal localization, and propose a visual question answering task to evaluate models. Result: The study identifies limitations of current gesture recognition models and points to directions for improvement. Conclusion: SocialGesture is a valuable resource for studying gestures in complex social interactions and exposes the shortcomings of current models. Abstract: Previous research in human gesture recognition has largely overlooked multi-person interactions, which are crucial for understanding the social context of naturally occurring gestures. This limitation in existing datasets presents a significant challenge in aligning human gestures with other modalities like language and speech. To address this issue, we introduce SocialGesture, the first large-scale dataset specifically designed for multi-person gesture analysis. SocialGesture features a diverse range of natural scenarios and supports multiple gesture analysis tasks, including video-based recognition and temporal localization, providing a valuable resource for advancing the study of gesture during complex social interactions. Furthermore, we propose a novel visual question answering (VQA) task to benchmark vision language models' (VLMs) performance on social gesture understanding. Our findings highlight several limitations of current gesture recognition models, offering insights into future directions for improvement in this field. SocialGesture is available at huggingface.co/datasets/IrohXu/SocialGesture.

The quasi-semantic competence of LLMs: a case study on the part-whole relation

Mattia Proietti,Alessandro Lenci

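Task: Study how well large language models (LLMs) semantically understand the part-whole relation (meronymy).

Motivation: The part-whole relation is central to lexical organization but understudied, so LLMs' competence with it needs to be assessed.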

Details

Method: Using ConceptNet relations and human-generated semantic feature norms, evaluate LLMs with three approaches: behavioral testing, sentence probability scoring, and concept representation analysis. Result: LLMs' grasp of the part-whole relation is only partial; they have merely a "quasi-semantic" competence and fail to fully capture deep inferential properties. Conclusion: LLMs' semantic competence on the part-whole relation is limited, and further work is needed to improve their deep inferential abilities. Abstract: Understanding the extent and depth of the semantic competence of Large Language Models (LLMs) is at the center of the current scientific agenda in Artificial Intelligence (AI) and Computational Linguistics (CL). We contribute to this endeavor by investigating their knowledge of the part-whole relation, a.k.a. meronymy, which plays a crucial role in lexical organization but is significantly understudied. We used data from ConceptNet relations \citep{speer2016conceptnet} and human-generated semantic feature norms \citep{McRae:2005} to explore the abilities of LLMs to deal with part-whole relations. We employed several methods based on three levels of analysis: (i) behavioral testing via prompting, where we directly queried the models on their knowledge of meronymy, (ii) sentence probability scoring, where we tested models' abilities to discriminate correct (real) and incorrect (asymmetric counterfactual) part-whole relations, and (iii) concept representation analysis in vector space, where we proved the linear organization of the part-whole concept in the embedding and unembedding spaces. These analyses present a complex picture that reveals that the LLMs' knowledge of this relation is only partial. They have just a "quasi-semantic" competence and still fall short of capturing deep inferential properties.

Re-thinking Temporal Search for Long-Form Video Understanding

Jinhui Ye,Zihan Wang,Haosen Sun,Keshigeyan Chandrasegaran,Zane Durante,Cristobal Eyzaguirre,Yonatan Bisk,Juan Carlos Niebles,Ehsan Adeli,Li Fei-Fei,Jiajun Wu,Manling Li

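Task: Study temporal search in long videos and propose T*, a lightweight keyframe search framework.

Motivation: Long-form video understanding remains challenging in computer vision, and existing methods show significant shortcomings in temporal search.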

Details

Method: Recast temporal search as a spatial search problem and introduce an adaptive zooming-in mechanism. Result: T* significantly improves the performance of existing methods on long-form video understanding. Conclusion: T* offers an efficient solution for temporal search in long videos, and experiments confirm its effectiveness. Abstract: Efficient understanding of long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding, studying a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). In particular, our contributions are two-fold: First, we formulate temporal search as a Long Video Haystack problem, i.e., finding a minimal set of relevant frames (typically one to five) among tens of thousands of frames from real-world long videos given specific queries. To validate our formulation, we create LV-Haystack, the first benchmark containing 3,874 human-annotated instances with fine-grained evaluation metrics for assessing keyframe search quality and computational efficiency. Experimental results on LV-Haystack highlight a significant research gap in temporal search capabilities, with SOTA keyframe selection methods achieving only 2.1% temporal F1 score on the LVBench subset. Next, inspired by visual search in images, we re-think temporal searching and propose a lightweight keyframe searching framework, T*, which casts the expensive temporal search as a spatial search problem. T* leverages superior visual localization capabilities typically used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Our extensive experiments show that when integrated with existing methods, T* significantly improves SOTA long-form video understanding performance. Specifically, under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-72B's performance from 56.5% to 62.4% on LongVideoBench XL subset. Our PyTorch code, benchmark dataset and models are included in the Supplementary material.

Scaling Analysis of Interleaved Speech-Text Language Models

Gallil Maimon,Michael Hassid,Amit Roth,Yossi Adi

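Task: Analyze how the scaling efficiency of interleaved speech language models (SLMs) differs from that of textless SLMs.

Motivation: Prior work suggests that SLMs need far more compute and data than text models; modern SLMs are initialized from text models with speech-text interleaving to enable knowledge transfer, but their scaling efficiency is unclear.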

Details

Method: Train a range of interleaved SLMs and analyze their scaling trends, studying compute budget allocation, synthetic data, and the role of TextLM model families. Result: Interleaved SLMs scale more efficiently with compute than textless SLMs and show markedly different scaling dynamics, suggesting that more of the compute budget should go to model size than to training tokens. Conclusion: Interleaved SLMs match leading models while using less compute and data, demonstrating their potential for efficient scaling. Abstract: Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. It predicts that SLMs require much more compute and data compared to text, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question - Do interleaved SLMs scale more efficiently than textless-SLMs? In this paper we answer with a resounding yes! We conduct scaling analysis of interleaved SLMs by training several dozen and analysing the scaling trends. We see that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the scaling-dynamics are significantly different than textless-SLMs, suggesting one should allocate notably more of the compute budget for increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential. Results suggest that our scaled-up model achieves comparable performance with leading models on speech semantic metrics while using less compute and data than other approaches. We open source models, samples, and data - https://pages.cs.huji.ac.il/adiyoss-lab/sims.

WonderTurbo: Generating Interactive 3D World in 0.72 Seconds

Chaojun Ni,Xiaofeng Wang,Zheng Zhu,Weijie Wang,Haoyun Li,Guosheng Zhao,Jie Li,Wenkang Qin,Guan Huang,Wenjun Mei

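Task: Propose WonderTurbo, a real-time interactive 3D scene generation framework that generates novel views of a 3D scene within 0.72 seconds.

Motivation: Current 3D generation techniques struggle to achieve real-time interactivity, which limits immersive virtual experiences.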

Details

Method: StepSplat builds an efficient 3D geometric representation through dynamic updates, QuickDepth supplies depth input, and FastPaint provides fast appearance modeling. Result: WonderTurbo is 15 times faster than baseline methods while maintaining high-quality output and spatial consistency. Conclusion: WonderTurbo addresses the challenge of real-time interactive 3D generation and offers an efficient solution for immersive experiences. Abstract: Interactive 3D generation is gaining momentum and capturing extensive attention for its potential to create immersive virtual experiences. However, a critical challenge in current 3D generation technologies lies in achieving real-time interactivity. To address this issue, we introduce WonderTurbo, the first real-time interactive 3D scene generation framework capable of generating novel perspectives of 3D scenes within 0.72 seconds. Specifically, WonderTurbo accelerates both geometric and appearance modeling in 3D scene generation. In terms of geometry, we propose StepSplat, an innovative method that constructs efficient 3D geometric representations through dynamic updates, each taking only 0.26 seconds. Additionally, we design QuickDepth, a lightweight depth completion module that provides consistent depth input for StepSplat, further enhancing geometric accuracy. For appearance modeling, we develop FastPaint, a 2-step diffusion model tailored for instant inpainting, which focuses on maintaining spatial appearance consistency. Experimental results demonstrate that WonderTurbo achieves a remarkable 15X speedup compared to baseline methods, while preserving excellent spatial consistency and delivering high-quality output.

DaKultur: Evaluating the Cultural Awareness of Language Models for Danish with Native Speakers

Max Müller-Eberstein,Mike Zhang,Elisa Bassignana,Peter Brunsgaard Trolle,Rob van der Goot

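Task: Evaluate the cultural awareness of large language models in Danish.

Motivation: Although LLMs interact with users in many languages, they lack cultural awareness, and their responses to non-English communities are often anglocentric or inappropriate.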

Details

Method: Native Danish speakers interact with different models, yielding 1,038 interactions that are analyzed for cultural adaptation problems. Result: Automatically translated data are insufficient to train or measure cultural adaptation, whereas training on native-speaker data can more than double response acceptance rates. Conclusion: The study highlights open challenges in cultural adaptation and releases DaKultur, the first native Danish cultural awareness dataset. Abstract: Large Language Models (LLMs) have seen widespread societal adoption. However, while they are able to interact with users in languages beyond English, they have been shown to lack cultural awareness, providing anglocentric or inappropriate responses for underrepresented language communities. To investigate this gap and disentangle linguistic versus cultural proficiency, we conduct the first cultural evaluation study for the mid-resource language of Danish, in which native speakers prompt different models to solve tasks requiring cultural awareness. Our analysis of the resulting 1,038 interactions from 63 demographically diverse participants highlights open challenges to cultural adaptation: Particularly, how currently employed automatically translated data are insufficient to train or measure cultural adaptation, and how training on native-speaker data can more than double response acceptance rates. We release our study data as DaKultur - the first native Danish cultural awareness dataset.

MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception

Wenzhuo Liu,Wenshuo Wang,Yicheng Qiao,Qiannan Guo,Jiayin Zhu,Pengfei Li,Zilong Chen,Huiming Yang,Zhiwei Li,Lening Wang,Tiao Tan,Huaping Liu

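Task: Propose MMTL-UniAD, a unified multimodal multi-task learning framework that simultaneously recognizes driver behavior, driver emotion, vehicle behavior, and traffic context.

Motivation: Existing work overlooks the potential benefits of jointly learning driver state and traffic context, and multi-task learning suffers from negative transfer.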

Details

Method: Introduce a multi-axis region attention network to extract global context-sensitive features, and a dual-branch multimodal embedding to learn task-shared and task-specific features. Result: On the AIDE dataset, MMTL-UniAD outperforms existing methods on all four tasks. Conclusion: By mitigating negative transfer, MMTL-UniAD improves multi-task learning performance. Abstract: Advanced driver assistance systems require a comprehensive understanding of the driver's mental/physical state and traffic context, but existing works often neglect the potential benefits of joint learning between these tasks. This paper proposes MMTL-UniAD, a unified multi-modal multi-task learning framework that simultaneously recognizes driver behavior (e.g., looking around, talking), driver emotion (e.g., anxiety, happiness), vehicle behavior (e.g., parking, turning), and traffic context (e.g., traffic jam, traffic smooth). A key challenge is avoiding negative transfer between tasks, which can impair learning performance. To address this, we introduce two key components into the framework: one is the multi-axis region attention network to extract global context-sensitive features, and the other is the dual-branch multimodal embedding to learn multimodal embeddings from both task-shared and task-specific features. The former uses a multi-attention mechanism to extract task-relevant features, mitigating negative transfer caused by task-unrelated features. The latter employs a dual-branch structure to adaptively adjust task-shared and task-specific parameters, enhancing cross-task knowledge transfer while reducing task conflicts. We assess MMTL-UniAD on the AIDE dataset, using a series of ablation studies, and show that it outperforms state-of-the-art methods across all four tasks. The code is available at https://github.com/Wenzhuo-Liu/MMTL-UniAD.

AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology

Xiang Feng,Wentao Jiang,Zengmao Wang,Yong Luo,Pingbo Xu,Baosheng Yu,Hua Jin,Bo Du,Jing Zhang

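Task: Systematically evaluate the reasoning capabilities of large language models (LLMs) in anesthesiology and analyze the key factors that influence their performance.

Motivation: Although LLMs in medicine have drawn wide attention, their reasoning capabilities in specialized domains such as anesthesiology remain underexplored.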

Details

Method: Introduce AnesBench, a cross-lingual benchmark that assesses anesthesiology-related reasoning at three levels (factual retrieval, hybrid reasoning, and complex decision-making), and analyze experimentally the effects of model characteristics, training strategies, and reasoning techniques. Result: The experiments examine how model scale, chain-of-thought length, and language transferability affect reasoning performance, and assess the effectiveness of different training strategies and reasoning techniques. Conclusion: AnesBench and its associated datasets and code will be publicly released to support research on LLMs in anesthesiology. Abstract: The application of large language models (LLMs) in the medical field has gained significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. In this paper, we systematically evaluate the reasoning capabilities of LLMs in anesthesiology and analyze key factors influencing their performance. To this end, we introduce AnesBench, a cross-lingual benchmark designed to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Through extensive experiments, we first explore how model characteristics, including model scale, Chain of Thought (CoT) length, and language transferability, affect reasoning performance. Then, we further evaluate the effectiveness of different training strategies, leveraging our curated anesthesiology-related dataset, including continuous pre-training (CPT) and supervised fine-tuning (SFT). Additionally, we investigate how test-time reasoning techniques, such as Best-of-N sampling and beam search, influence reasoning performance, and assess the impact of reasoning-enhanced model distillation, specifically DeepSeek-R1. We will publicly release AnesBench, along with our CPT and SFT training datasets and evaluation code at https://github.com/MiliLab/AnesBench.

MinkOcc: Towards real-time label-efficient semantic occupancy prediction

Samuel Sze,Daniele De Martini,Lars Kunze

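Task: Develop MinkOcc, a multi-modal 3D semantic occupancy prediction framework that reduces reliance on dense 3D annotations.

Motivation: Dense 3D annotation is costly and resource-intensive, which calls for label-efficient or label-free approaches.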

Details

Method: A two-step semi-supervised training procedure that combines a small set of 3D annotations with LiDAR sweeps and images labelled by vision foundation models. Result: MinkOcc reduces manual labeling by 90% while maintaining competitive accuracy and supporting real-time prediction. Conclusion: With its efficiency in both supervision and computation, MinkOcc could help bring 3D semantic occupancy prediction to broader use in autonomous driving. Abstract: Developing 3D semantic occupancy prediction models often relies on dense 3D annotations for supervised learning, a process that is both labor and resource-intensive, underscoring the need for label-efficient or even label-free approaches. To address this, we introduce MinkOcc, a multi-modal 3D semantic occupancy prediction framework for cameras and LiDARs that proposes a two-step semi-supervised training procedure. Here, a small dataset of explicit 3D annotations warm-starts the training process; then, the supervision is continued by simpler-to-annotate accumulated LiDAR sweeps and images -- semantically labelled through vision foundational models. MinkOcc effectively utilizes these sensor-rich supervisory cues and reduces reliance on manual labeling by 90% while maintaining competitive accuracy. In addition, the proposed model incorporates information from LiDAR and camera data through early fusion and leverages sparse convolution networks for real-time prediction. With its efficiency in both supervision and computation, we aim to extend MinkOcc beyond curated datasets, enabling broader real-world deployment of 3D semantic occupancy prediction in autonomous driving.

Adapting Large Language Models for Multi-Domain Retrieval-Augmented-Generation

Alexandre Misrahi,Nadezhda Chirkova,Maxime Louis,Vassilina Nikoulina

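Task: Introduce a diverse benchmark and systematically test the out-of-domain generalization of RAG tuning strategies.

Motivation: Address the lack of diverse benchmarks and the poor out-of-domain generalization in multi-domain applications.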

Details

Method: Use a diverse benchmark drawn from 8 sources and covering 13 domains, and apply sequence-level distillation with teacher-generated labels. Result: Standard fine-tuning generalizes poorly across domains, whereas sequence-level distillation improves out-of-domain performance. Conclusion: Sequence-level distillation is a key strategy for improving multi-domain RAG robustness. Abstract: Retrieval-Augmented Generation (RAG) enhances LLM factuality, but multi-domain applications face challenges like lack of diverse benchmarks and poor out-of-domain generalization. The first contribution of this work is to introduce a diverse benchmark comprising a variety of question-answering tasks from 8 sources and covering 13 domains. Our second contribution consists in systematically testing out-of-domain generalization for typical RAG tuning strategies. While our findings reveal that standard fine-tuning fails to generalize effectively, we show that sequence-level distillation with teacher-generated labels improves out-of-domain performance by providing more coherent supervision. Our findings highlight key strategies for improving multi-domain RAG robustness.

Generative Classifier for Domain Generalization

Shaocong Long,Qianyu Zhou,Xiangtai Li,Chenhao Ying,Yunhai Tong,Lizhuang Ma,Yuan Luo,Dacheng Tao

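Task: Study how a generative classifier-driven approach can improve the generalization of computer vision models under distribution shift.

Motivation: Existing domain generalization methods focus on domain invariance and overlook the potential value of domain-specific information, so they perform poorly when that information is multi-modal.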

Details

Method: Propose Generative Classifier-driven Domain Generalization (GCDG), built on Gaussian Mixture Models (GMMs), with three modules: Heterogeneity Learning Classifier (HLC), Spurious Correlation Blocking (SCB), and Diverse Component Balancing (DCB). Result: GCDG is competitive on five domain generalization benchmarks and one face anti-spoofing dataset, and integrates seamlessly into existing methods with consistent improvements. Conclusion: By capturing the diverse distributions of domain-specific information, GCDG reduces the target risk and encourages flat minima, which improves generalization. Abstract: Domain generalization (DG) aims to improve the generalizability of computer vision models toward distribution shifts. The mainstream DG methods focus on learning domain invariance; however, such methods overlook the potential inherent in domain-specific information. While the prevailing practice of discriminative linear classifier has been tailored to domain-invariant features, it struggles when confronted with diverse domain-specific information, e.g., intra-class shifts, that exhibits multi-modality. To address these issues, we explore the theoretical implications of relying on domain invariance, revealing the crucial role of domain-specific information in mitigating the target risk for DG. Drawing from these insights, we propose Generative Classifier-driven Domain Generalization (GCDG), introducing a generative paradigm for the DG classifier based on Gaussian Mixture Models (GMMs) for each class across domains. GCDG consists of three key modules: Heterogeneity Learning Classifier (HLC), Spurious Correlation Blocking (SCB), and Diverse Component Balancing (DCB). Concretely, HLC attempts to model the feature distributions and thereby capture valuable domain-specific information via GMMs. SCB identifies the neural units containing spurious correlations and perturbs them, mitigating the risk of HLC learning spurious patterns. Meanwhile, DCB ensures a balanced contribution of components in HLC, preventing the underestimation or neglect of critical components. In this way, GCDG excels in capturing the nuances of domain-specific information characterized by diverse distributions. GCDG demonstrates the potential to reduce the target risk and encourage flat minima, improving the generalizability. Extensive experiments show GCDG's comparable performance on five DG benchmarks and one face anti-spoofing dataset, seamlessly integrating into existing DG methods with consistent improvements.
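
To make the idea of a GMM-based generative classifier concrete, here is a minimal sketch on one-dimensional features. It is not the GCDG implementation: the mixture parameters are hand-picked rather than learned, class priors are assumed equal, and the names `CLASS_MIXTURES` and `classify` are invented for illustration. It only shows why a per-class mixture can represent a class whose features form several modes (for example, one per domain), which a single linear boundary handles poorly.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_likelihood(x, components):
    """Likelihood of x under a mixture given as (weight, mean, std) triples."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)

def classify(x, class_mixtures):
    """Pick the class whose mixture assigns x the highest likelihood
    (equal class priors assumed)."""
    return max(class_mixtures,
               key=lambda label: mixture_likelihood(x, class_mixtures[label]))

# Hypothetical toy parameters: "cat" features form two modes, "dog" sits between them.
CLASS_MIXTURES = {
    "cat": [(0.5, -2.0, 0.5), (0.5, 2.0, 0.5)],
    "dog": [(1.0, 0.0, 0.5)],
}
```

Here `classify(-2.1, CLASS_MIXTURES)` and `classify(1.9, CLASS_MIXTURES)` both select "cat" even though the two points lie on opposite sides of the "dog" mode.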

Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation

Chuanqi Cheng,Jian Guan,Wei Wu,Rui Yan

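Task: Propose ViLaMP, a hierarchical video-language model for efficiently processing long videos.

Motivation: Existing methods often sacrifice critical temporal dependencies or semantic information when processing long videos, which degrades performance.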

Details

Method: Apply the differential distillation principle, combining differential keyframe selection with differential feature merging to process video at mixed precision. Result: ViLaMP performs strongly on four video understanding benchmarks, particularly on long-form content, with high computational efficiency. Conclusion: ViLaMP markedly improves the efficiency of long-video processing while maintaining high performance. Abstract: Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLaMP, a hierarchical video-language model that processes hour-long videos at "mixed precision" through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLaMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLaMP's superior performance across four video understanding benchmarks, particularly on long-form content. Notably, ViLaMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance.
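
The abstract describes keyframe selection that "maximizes query relevance while maintaining temporal distinctiveness" but does not give the procedure. One simple way to realize that trade-off is a greedy pass: visit frames in order of query relevance and keep a frame only if it is not too similar to any frame already kept. The sketch below is an assumption-laden illustration of that greedy idea, not ViLaMP's algorithm; `select_keyframes`, the cosine measure, and the `max_similarity` threshold are all hypothetical choices.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_keyframes(frame_feats, query_feat, max_keyframes, max_similarity=0.9):
    """Greedy selection: most query-relevant frames first, skipping near-duplicates.

    Returns the chosen frame indices in temporal order.
    """
    by_relevance = sorted(range(len(frame_feats)),
                          key=lambda i: cosine(frame_feats[i], query_feat),
                          reverse=True)
    selected = []
    for i in by_relevance:
        if len(selected) == max_keyframes:
            break
        # Keep the frame only if it differs enough from every frame kept so far.
        if all(cosine(frame_feats[i], frame_feats[j]) < max_similarity
               for j in selected):
            selected.append(i)
    return sorted(selected)
```

For example, with frames `[[1, 0], [0.99, 0.05], [0, 1], [0.7, 0.7]]`, query `[1, 0.2]`, and a budget of two, the near-duplicate pair (frames 0 and 1) contributes only one keyframe, and the second slot goes to frame 3.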

Beyond Conventional Transformers: The Medical X-ray Attention (MXA) Block for Improved Multi-Label Diagnosis Using Knowledge Distillation

Amit Rand,Hadi Ibrahim

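Task: Propose the MXA block, an attention mechanism specialized for X-ray abnormality detection, and embed it in the EfficientViT architecture to improve multi-label classification.

Motivation: Medical images such as X-rays often require detecting several abnormalities at once, so multi-label classification is crucial clinically, yet existing methods do not adequately address the challenges specific to X-ray detection.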

Details

Method: Design the MXA block, which augments conventional Multi-Head Self Attention (MHSA) to capture both local detail and global context; embed it in EfficientViT and apply knowledge distillation. Result: On CheXpert, the model reaches an AUC of 0.85, an absolute gain of 0.19 over the baseline (AUC = 0.66), which the authors describe as an approximately 233% relative improvement over random guessing (AUC = 0.5). Conclusion: Combining the MXA block with EfficientViT substantially improves multi-label X-ray classification and offers a more effective tool for clinical diagnosis. Abstract: Medical imaging, particularly X-ray analysis, often involves detecting multiple conditions simultaneously within a single scan, making multi-label classification crucial for real-world clinical applications. We present the Medical X-ray Attention (MXA) block, a novel attention mechanism tailored specifically to address the unique challenges of X-ray abnormality detection. The MXA block enhances traditional Multi-Head Self Attention (MHSA) by integrating a specialized module that efficiently captures both detailed local information and broader global context. To the best of our knowledge, this is the first work to propose a task-specific attention mechanism for diagnosing chest X-rays, as well as to attempt multi-label classification using an Efficient Vision Transformer (EfficientViT). By embedding the MXA block within the EfficientViT architecture and employing knowledge distillation, our proposed model significantly improves performance on the CheXpert dataset, a widely used benchmark for multi-label chest X-ray abnormality detection. Our approach achieves an area under the curve (AUC) of 0.85, an absolute improvement of 0.19 compared to our baseline model's AUC of 0.66, corresponding to a substantial, approximately 233% relative improvement over random guessing (AUC = 0.5).

Cognitive Memory in Large Language Models

Lianlei Shan,Shixian Luo,Zezhou Zhu,Yu Yuan,Yong Wu

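Task: Analyze memory mechanisms in large language models (LLMs), including their taxonomy, implementation, and management strategies.

Motivation: Examine why memory mechanisms matter for LLMs in producing context-rich responses, reducing hallucinations, and improving efficiency.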

Details

Method: Categorize memory into sensory, short-term, and long-term memory, and discuss in detail how memory is implemented and managed through text, the KV cache, parameters, and hidden states. Result: A comprehensive analysis of LLM memory mechanisms that shows their role in improving model performance. Conclusion: The paper underscores the importance of memory mechanisms and outlines future research directions. Abstract: This paper examines memory mechanisms in Large Language Models (LLMs), emphasizing their importance for context-rich responses, reduced hallucinations, and improved efficiency. It categorizes memory into sensory, short-term, and long-term, with sensory memory corresponding to input prompts, short-term memory processing immediate context, and long-term memory implemented via external databases or structures. The text-based memory section covers acquisition (selection and summarization), management (updating, accessing, storing, and resolving conflicts), and utilization (full-text search, SQL queries, semantic search). The KV cache-based memory section discusses selection methods (regularity-based summarization, score-based approaches, special token embeddings) and compression techniques (low-rank compression, KV merging, multimodal compression), along with management strategies like offloading and shared attention mechanisms. Parameter-based memory methods (LoRA, TTT, MoE) transform memories into model parameters to enhance efficiency, while hidden-state-based memory approaches (chunk mechanisms, recurrent transformers, Mamba model) improve long-text processing by combining RNN hidden states with current methods. Overall, the paper offers a comprehensive analysis of LLM memory mechanisms, highlighting their significance and future research directions.

Trung Thanh Nguyen,Yasutomo Kawanishi,Vijay John,Takahiro Komamizu,Ichiro Ide

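Task: Propose a Transformer-based multi-modal multi-view sensor fusion method (MultiTSF) for action recognition.

Motivation: Existing methods struggle with diverse real-world environmental conditions, strict sensor synchronization requirements, and the need for fine-grained annotations.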

Details

Method: Uses a Transformer to dynamically model inter-view relationships and capture temporal dependencies across views, and introduces a human detection module that generates pseudo-labels to improve feature learning. Result: Experiments on the MultiSensor-Home and MM-Office datasets show that MultiTSF outperforms existing methods in both video sequence-level and frame-level action recognition. Conclusion: Through dynamic modeling and pseudo-label optimization, MultiTSF substantially improves multi-modal multi-view action recognition. Abstract: Action recognition from multi-modal and multi-view observations holds significant potential for applications in surveillance, robotics, and smart environments. However, existing methods often fall short of addressing real-world challenges such as diverse environmental conditions, strict sensor synchronization, and the need for fine-grained annotations. In this study, we propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF). The proposed method leverages a Transformer-based fusion mechanism to dynamically model inter-view relationships and capture temporal dependencies across multiple views. Additionally, we introduce a Human Detection Module to generate pseudo-ground-truth labels, enabling the model to prioritize frames containing human activity and enhance spatial feature learning. Comprehensive experiments conducted on our in-house MultiSensor-Home dataset and the existing MM-Office dataset demonstrate that MultiTSF outperforms state-of-the-art methods in both video sequence-level and frame-level action recognition settings.

Inference-Time Scaling for Generalist Reward Modeling

Zijun Liu,Peiyi Wang,Runxin Xu,Shirong Ma,Chong Ruan,Peng Li,Yang Liu,Yu Wu

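Task: Study how improved reward modeling (RM) and learning methods can raise the inference-time scalability and performance of large language models (LLMs).

Motivation: Reinforcement learning (RL) is widely used in LLM post-training, but obtaining accurate reward signals for LLMs across domains remains a key challenge.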

Details

Method: Adopts pointwise generative reward modeling (GRM) and Self-Principled Critique Tuning (SPCT), combined with online RL and parallel sampling. Result: The proposed DeepSeek-GRM models perform well on multiple RM benchmarks, outperforming existing methods without severe biases. Conclusion: SPCT significantly improves the quality and scalability of GRMs; generalist reward systems need further study to address the remaining challenges in some tasks. Abstract: Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that proper learning methods could enable effective inference-time scalability. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e., the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in DeepSeek-GRM models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide the voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling. DeepSeek-GRM still meets challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models will be released and open-sourced.
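The parallel-sampling and voting step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: `toy_grm` and `toy_meta_rm` are hypothetical stand-ins for a generative reward model sample and a meta RM, and the rule "keep the samples the meta RM rates highest, then sum their pointwise scores" is an assumed reading of "guide the voting process", not the authors' exact procedure.

```python
import random
from typing import Callable, List, Sequence


def vote_rewards(
    sample_rewards: Callable[[random.Random], List[int]],
    meta_score: Callable[[Sequence[int]], float],
    num_samples: int = 8,
    keep: int = 4,
    seed: int = 0,
) -> List[int]:
    """Sum pointwise rewards over the `keep` samples the meta RM rates highest."""
    rng = random.Random(seed)
    # Parallel sampling: each call stands in for one independent GRM generation.
    samples = [sample_rewards(rng) for _ in range(num_samples)]
    # Meta-RM guidance (assumed form): keep only the top-rated samples.
    kept = sorted(samples, key=meta_score, reverse=True)[:keep]
    # Voting: add up the scores each response received across the kept samples.
    return [sum(sample[i] for sample in kept) for i in range(len(kept[0]))]


def toy_grm(rng: random.Random) -> List[int]:
    """Hypothetical GRM sample: noisy scores for two candidate responses."""
    return [8 + rng.randint(-2, 2), 3 + rng.randint(-2, 2)]


def toy_meta_rm(sample: Sequence[int]) -> float:
    """Hypothetical meta RM: prefers samples that separate the responses clearly."""
    return float(abs(sample[0] - sample[1]))


totals = vote_rewards(toy_grm, toy_meta_rm)
best_response = max(range(len(totals)), key=totals.__getitem__)
```

Raising `num_samples` spends more inference compute on the same query, which is the scaling axis the abstract refers to; `keep` controls how aggressively the assumed meta-RM filter discards samples.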

Moment Quantization for Video Temporal Grounding

Xiaolong Sun,Le Wang,Sanping Zhou,Liushuai Shi,Kun Xia,Mengnan Liu,Yabing Wang,Gang Hua

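Task: Propose a moment-quantization based video temporal grounding method (MQVTG) that sharpens the distinction between relevant and irrelevant moments.

Motivation: Existing methods discriminate weakly between foreground and background features; MQVTG addresses this by quantizing the video into discrete vectors.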

Details

Method: MQVTG maintains a learnable moment codebook and matches video moments to codewords, treating the matching as a clustering process to avoid the information loss of direct hard quantization. It also uses prior-initialization and joint-projection strategies to improve the codebook. Result: Significantly outperforms existing methods on six benchmarks; qualitative analysis shows that it groups relevant features and separates irrelevant ones. Conclusion: MQVTG is a simple, efficient plug-and-play component that significantly improves video temporal grounding. Abstract: Video temporal grounding is a critical video understanding task, which aims to localize moments relevant to a language description. The challenge of this task lies in distinguishing relevant and irrelevant moments. Previous methods focused on learning continuous features exhibit weak differentiation between foreground and background features. In this paper, we propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG), which quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments. Specifically, MQVTG maintains a learnable moment codebook, where each video moment matches a codeword. Considering the visual diversity, i.e., various visual expressions for the same moment, MQVTG treats moment-codeword matching as a clustering process without using discrete vectors, avoiding the loss of useful information from direct hard quantization. Additionally, we employ effective prior-initialization and joint-projection strategies to enhance the maintained moment codebook. With its simple implementation, the proposed method can be integrated into existing temporal grounding models as a plug-and-play component. Extensive experiments on six popular benchmarks demonstrate the effectiveness and generalizability of MQVTG, significantly outperforming state-of-the-art methods. Further qualitative analysis shows that our method effectively groups relevant features and separates irrelevant ones, aligning with our goal of enhancing discrimination.
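To make the contrast between hard quantization and clustering-style matching concrete, here is a minimal sketch. It assumes a softmax over negative squared distances as the soft assignment; the abstract does not specify the matching rule, so `soft_assign`, `soft_quantize`, and the `temperature` parameter are illustrative names rather than the authors' implementation.

```python
import math
from typing import List, Sequence


def soft_assign(
    moment: Sequence[float],
    codebook: Sequence[Sequence[float]],
    temperature: float = 1.0,
) -> List[float]:
    """Soft assignment of one moment feature to every codeword (weights sum to 1)."""
    logits = [
        -sum((m - c) ** 2 for m, c in zip(moment, code)) / temperature
        for code in codebook
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_quantize(
    moment: Sequence[float],
    codebook: Sequence[Sequence[float]],
    temperature: float = 1.0,
) -> List[float]:
    """Blend codewords by their soft weights instead of snapping to the nearest one."""
    weights = soft_assign(moment, codebook, temperature)
    return [
        sum(w * code[d] for w, code in zip(weights, codebook))
        for d in range(len(moment))
    ]


codebook = [[0.0, 0.0], [1.0, 1.0]]
weights = soft_assign([0.9, 0.9], codebook)
blended = soft_quantize([0.9, 0.9], codebook)
```

A hard quantizer would replace the moment with `codebook[1]` outright; the soft version keeps a small contribution from the other codeword, which is one way to retain information that hard assignment would discard.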

UNDO: Understanding Distillation as Optimization

Kushal Jain,Piyushi Goyal,Kumar Shridhar

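Task: Propose UNDO, an iterative knowledge distillation framework that improves how well the student model learns.

Motivation: Standard one-shot knowledge distillation underperforms because teacher-generated rationales do not match the student's needs.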

Details

Method: Iteratively identifies the student's errors and prompts the teacher to adjust its explanations, directly targeting the student's learning deficiencies. Result: On mathematical and commonsense reasoning tasks, UNDO improves performance by up to 20% over standard methods, and the teacher-generated data remains effective for other student models. Conclusion: Reframes knowledge distillation as an iterative teacher-student interaction that achieves better knowledge transfer through dynamic refinement. Abstract: Knowledge distillation has emerged as an effective strategy for compressing large language models' (LLMs) knowledge into smaller, more efficient student models. However, standard one-shot distillation methods often produce suboptimal results due to a mismatch between teacher-generated rationales and the student's specific learning requirements. In this paper, we introduce the UNDO: UNderstanding Distillation as Optimization framework, designed to bridge this gap by iteratively identifying the student's errors and prompting the teacher to refine its explanations accordingly. Each iteration directly targets the student's learning deficiencies, motivating the teacher to provide tailored and enhanced rationales that specifically address these weaknesses. Empirical evaluations on various challenging mathematical and commonsense reasoning tasks demonstrate that our iterative distillation method, UNDO, significantly outperforms standard one-step distillation methods, achieving performance gains of up to 20%. Additionally, we show that teacher-generated data refined through our iterative process remains effective even when applied to different student models, underscoring the broad applicability of our approach. Our work fundamentally reframes knowledge distillation as an iterative teacher-student interaction, effectively leveraging dynamic refinement by the teacher for better knowledge distillation.
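The iterative loop described above (train the student, collect its errors, ask the teacher to refine) can be sketched in a few lines. The three callables are hypothetical interfaces introduced for illustration; the abstract does not define how the teacher is prompted or how errors are detected, so treat this as a skeleton rather than the UNDO implementation.

```python
from typing import Callable, Dict, List, Optional


def iterative_distillation(
    questions: List[str],
    teacher: Callable[[str, Optional[str]], str],
    train_student: Callable[[Dict[str, str]], None],
    student_errors: Callable[[List[str]], Dict[str, str]],
    max_rounds: int = 3,
) -> Dict[str, str]:
    """Refine teacher rationales until the student stops making errors."""
    # Round 0 is ordinary one-shot distillation: no student feedback yet.
    rationales = {q: teacher(q, None) for q in questions}
    for _ in range(max_rounds):
        train_student(rationales)
        # Maps each question the student still gets wrong to its wrong answer.
        errors = student_errors(questions)
        if not errors:
            break
        # Only the failing questions are sent back to the teacher for refinement.
        for question, wrong_answer in errors.items():
            rationales[question] = teacher(question, wrong_answer)
    return rationales
```

Setting `max_rounds=0` recovers the one-shot baseline the abstract compares against, since the initial rationales are returned without any refinement.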

Trung Thanh Nguyen,Yasutomo Kawanishi,Vijay John,Takahiro Komamizu,Ichiro Ide

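Task: Propose a Transformer-based multi-modal multi-view sensor fusion method (MultiTSF) and introduce the MultiSensor-Home dataset for comprehensive action recognition in home environments.

Motivation: Current datasets fail to address real-world challenges (such as wide-area environmental conditions, asynchronous data streams, and the lack of frame-level annotations), and existing methods have difficulty modeling inter-view relationships and enhancing spatial feature learning.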

Details

Method: MultiTSF uses a Transformer-based fusion mechanism to dynamically model inter-view relationships and integrates an external human detection module to enhance spatial feature learning. Result: Experiments on the MultiSensor-Home and MM-Office datasets show that MultiTSF outperforms existing methods. Conclusion: MultiTSF is effective in advancing real-world multi-modal multi-view action recognition. Abstract: Multi-modal multi-view action recognition is a rapidly growing field in computer vision, offering significant potential for applications in surveillance. However, current datasets often fail to address real-world challenges such as wide-area environmental conditions, asynchronous data streams, and the lack of frame-level annotations. Furthermore, existing methods face difficulties in effectively modeling inter-view relationships and enhancing spatial feature learning. In this study, we propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method and introduce the MultiSensor-Home dataset, a novel benchmark designed for comprehensive action recognition in home environments. The MultiSensor-Home dataset features untrimmed videos captured by distributed sensors, providing high-resolution RGB and audio data along with detailed multi-view frame-level action labels. The proposed MultiTSF method leverages a Transformer-based fusion mechanism to dynamically model inter-view relationships. Furthermore, the method also integrates an external human detection module to enhance spatial feature learning. Experiments on MultiSensor-Home and MM-Office datasets demonstrate the superiority of MultiTSF over the state-of-the-art methods. The quantitative and qualitative results highlight the effectiveness of the proposed method in advancing real-world multi-modal multi-view action recognition.

Leveraging LLM For Synchronizing Information Across Multilingual Tables

Siddharth Khincha,Tushar Kataria,Ankita Anand,Dan Roth,Vivek Gupta

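Task: Explore large language models (LLMs) for multilingual information synchronization, using zero-shot prompting as a scalable solution.

Motivation: Online information is concentrated in high-resource languages (such as English and French), leaving Wikipedia content in low-resource languages outdated or incomplete; existing rule-based methods struggle with complexity and generalization.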

Details

Method: Introduces the Information Updation dataset, which simulates the process of updating outdated Wikipedia tables, and proposes a task decomposition strategy to improve coherence and accuracy. Result: The proposed method outperforms existing baselines in Information Updation (1.79%) and Information Addition (20.58%), showing the model's strength in dynamically updating and enriching data. Conclusion: LLMs perform well at multilingual information synchronization, and the task decomposition strategy significantly improves performance. Abstract: The vast amount of online information today poses challenges for non-English speakers, as much of it is concentrated in high-resource languages such as English and French. Wikipedia reflects this imbalance, with content in low-resource languages frequently outdated or incomplete. Recent research has sought to improve cross-language synchronization of Wikipedia tables using rule-based methods. These approaches can be effective, but they struggle with complexity and generalization. This paper explores large language models (LLMs) for multilingual information synchronization, using zero-shot prompting as a scalable solution. We introduce the Information Updation dataset, simulating the real-world process of updating outdated Wikipedia tables, and evaluate LLM performance. Our findings reveal that single-prompt approaches often produce suboptimal results, prompting us to introduce a task decomposition strategy that enhances coherence and accuracy. Our proposed method outperforms existing baselines, particularly in Information Updation (1.79%) and Information Addition (20.58%), highlighting the model's strength in dynamically updating and enriching data across architectures.

OmniCam: Unified Multimodal Video Generation via Camera Control

Xiaoda Yang,Jiayang Xu,Kaixuan Luan,Xinyu Zhan,Hongshun Qiu,Shijun Shi,Hao Li,Shuai Yang,Li Zhang,Checheng Yu,Cewu Lu,Lixin Yang

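Task: Propose OmniCam, a unified multimodal camera control framework for generating spatio-temporally consistent videos.

Motivation: Existing camera control methods suffer from complex interaction and limited control capabilities.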

Details

Method: Combines large language models and video diffusion models, supporting various combinations of input modalities (text, video, images, and so on) as camera path guidance and content reference. Result: Experiments show that OmniCam achieves state-of-the-art performance in high-quality camera-controlled video generation. Conclusion: OmniCam achieves precise control over camera motion through multimodal input, addressing the limitations of existing methods. Abstract: Camera control, which achieves diverse visual effects by changing camera position and pose, has attracted widespread attention. However, existing methods face challenges such as complex interaction and limited control capabilities. To address these issues, we present OmniCam, a unified multimodal camera control framework. Leveraging large language models and video diffusion models, OmniCam generates spatio-temporally consistent videos. It supports various combinations of input modalities: the user can provide text or video with expected trajectory as camera path guidance, and image or video as content reference, enabling precise control over camera motion. To facilitate the training of OmniCam, we introduce the OmniTr dataset, which contains a large collection of high-quality long-sequence trajectories, videos, and corresponding descriptions. Experimental results demonstrate that our model achieves state-of-the-art performance in high-quality camera-controlled video generation across various metrics.

Language Models reach higher Agreement than Humans in Historical Interpretation

Fabio Celli,Georgios Spathulas

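Task: Compare how humans and large language models perform at historical annotation.

Motivation: Examine the differences in cultural bias and consensus between humans and large language models in historical annotation.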

Details

Method: Compares the annotations and interpretations of historical facts produced by humans and by large language models. Result: Large language models reach higher consensus when interpreting historical facts from short texts, but they skip information or hallucinate; humans disagree because of personal biases. Conclusion: The study opens up large-scale annotation and quantitative analysis for digital humanities and encourages critical thinking about historical interpretation and bias. Abstract: This paper compares historical annotations by humans and Large Language Models. The findings reveal that both exhibit some cultural bias, but Large Language Models achieve a higher consensus on the interpretation of historical facts from short texts. While humans tend to disagree on the basis of their personal biases, Large Language Models disagree when they skip information or produce hallucinations. These findings have significant implications for digital humanities, enabling large-scale annotation and quantitative analysis of historical data. This offers new educational and research opportunities to explore historical interpretations from different Language Models, fostering critical thinking about bias.

ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation

Yuan Zhou,Shilong Jin,Litao Hua,Wanjun Lv,Haoran Duan,Jungong Han

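Task: Propose the ConsDreamer framework to address view bias in zero-shot text-to-3D generation.

Motivation: Existing methods use 3D Gaussian Splatting and score distillation but are limited by the view biases of text-to-image models, which lead to inconsistent 3D generation, notably the multi-face Janus problem.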

Details

Method: ConsDreamer refines the conditional and unconditional terms in the score distillation process through a View Disentanglement Module (VDM) and a similarity-based partial order loss. Result: Experiments show that ConsDreamer effectively mitigates the multi-face Janus problem and outperforms existing methods in visual quality and consistency. Conclusion: By eliminating view bias, ConsDreamer improves the consistency and quality of text-to-3D generation. Abstract: Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent view biases in T2I priors. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel framework that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise camera parameters; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer effectively mitigates the multi-face Janus problem in text-to-3D generation, outperforming existing methods in both visual quality and consistency.
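One possible reading of "aligning cosine similarities with azimuth relationships" is a ranking constraint: a view closer in azimuth to a reference view should be at least as similar to it as a view farther away. The sketch below implements that reading as a hinge loss over view pairs. The function names, the choice of a single reference view, and the `margin` parameter are all assumptions made for illustration; the abstract does not give the actual loss.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def angular_distance(a: float, b: float) -> float:
    """Smallest azimuth difference in degrees, in the range [0, 180]."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def partial_order_loss(
    features: Sequence[Sequence[float]],
    azimuths: Sequence[float],
    ref: int = 0,
    margin: float = 0.0,
) -> float:
    """Hinge penalty whenever a farther view is more similar to `ref` than a nearer one."""
    sims = [cosine(features[ref], f) for f in features]
    dists = [angular_distance(azimuths[ref], a) for a in azimuths]
    loss, pairs = 0.0, 0
    for i in range(len(features)):
        for j in range(len(features)):
            if i == ref or j == ref or dists[i] >= dists[j]:
                continue
            # View i is nearer to the reference than view j, so sims[i] should win.
            loss += max(0.0, margin + sims[j] - sims[i])
            pairs += 1
    return loss / pairs if pairs else 0.0
```

The loss is zero when similarity falls off monotonically with azimuth distance and grows when, for example, a back view looks as much like the front view as a side view does, which is the symptom of the Janus problem.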

LexPam: Legal Procedure Awareness-Guided Mathematical Reasoning

Kepu Zhang,Guofu Xie,Weijie Yu,Mingyue Xu,Xu Tang,Yaxin Li,Jun Xu

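Task: Propose LexNum, the first Chinese legal mathematical reasoning dataset, use it to test existing legal LLMs and reasoning LLMs, and introduce the LexPam algorithm to strengthen LLMs' mathematical reasoning in legal scenarios.

Motivation: Existing legal LLMs lack legal mathematical reasoning ability, and there are no datasets to validate and improve it, which limits their credibility in real legal scenarios.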

Details

Method: Builds the LexNum dataset, covering three common legal mathematical reasoning scenarios; tests the performance of existing models; and proposes LexPam, a reinforcement learning algorithm guided by legal procedural awareness, to train LLMs. Result: Existing legal LLMs and reasoning models perform poorly on legal mathematical reasoning tasks, and LexPam significantly improves LLMs on these tasks. Conclusion: LexNum and LexPam fill a gap in legal mathematical reasoning and provide effective tools for making LLMs more credible in legal scenarios. Abstract: The legal mathematical reasoning ability of LLMs is crucial when applying them to real-world scenarios, as it directly affects the credibility of the LLM. While existing legal LLMs can perform general judicial question answering, their legal mathematical reasoning capabilities have not been trained. Open-domain reasoning models, though able to generate detailed calculation steps, do not follow the reasoning logic required for legal scenarios. Additionally, there is currently a lack of legal mathematical reasoning datasets to help validate and enhance LLMs' reasoning abilities in legal contexts. To address these issues, we propose the first Chinese legal Mathematical Reasoning Dataset, LexNum, which includes three common legal mathematical reasoning scenarios: economic compensation, work injury compensation, and traffic accident compensation. Based on LexNum, we tested the performance of existing legal LLMs and reasoning LLMs, and introduced LexPam, a reinforcement learning algorithm guided by legal procedural awareness to train LLMs, enhancing their mathematical reasoning abilities in legal scenarios. Experiments on tasks in the three legal scenarios show that the performance of existing legal LLMs and reasoning models in legal mathematical reasoning tasks is unsatisfactory. LexPam can enhance the LLM's ability in these tasks.

X-Capture: An Open-Source Portable Device for Multi-Sensory Learning

Samuel Clarke,Suzannah Wistreich,Yanjie Ze,Jiajun Wu

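Task: Develop X-Capture, a low-cost portable device for multi-sensory data collection, and use it to build a diverse multi-sensory dataset.

Motivation: Existing datasets are mostly limited to controlled environments or simulated objects and lack real-world diversity and multi-sensory correlation, which limits multi-sensory learning in AI and robotic systems.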

Details

Method: Designs and builds the X-Capture device, which synchronously captures RGBD images, tactile data, and impact audio, and collects a sample dataset of 3,000 data points on 500 everyday objects in real-world environments. Result: Experiments show that data collected with X-Capture is valuable for pretraining and fine-tuning multi-modal representations, particularly for tasks such as cross-sensory retrieval and reconstruction. Conclusion: X-Capture gives AI systems a scalable, accessible, real-world approach to multi-sensory data collection and advances human-like multi-sensory representations. Abstract: Understanding objects through multiple sensory modalities is fundamental to human perception, enabling cross-sensory integration and richer comprehension. For AI and robotic systems to replicate this ability, access to diverse, high-quality multi-sensory data is critical. Existing datasets are often limited by their focus on controlled environments, simulated objects, or restricted modality pairings. We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection, capable of capturing correlated RGBD images, tactile readings, and impact audio. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we curate a sample dataset of 3,000 total points on 500 everyday objects from diverse, real-world environments, offering both richness and variety. Our experiments demonstrate the value of both the quantity and the sensory breadth of our data for both pretraining and fine-tuning multi-modal representations for object-centric tasks such as cross-sensory retrieval and reconstruction. X-Capture lays the groundwork for advancing human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability.

LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect

Hedi Naouara,Jean-Pierre Lorré,Jérôme Louradour

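Task: Develop automatic speech recognition (ASR) systems for the Tunisian Arabic dialect.

Motivation: The linguistic complexity of the Tunisian Arabic dialect and the scarcity of annotated speech datasets make ASR systems hard to develop.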

Details

Method: Proposes the LinTO audio and textual datasets, which capture the phonological and lexical features of the Tunisian Arabic dialect and include diverse texts and real-world audio samples. Result: The LinTO datasets provide high-quality audio with precise transcriptions, supplying material for building and benchmarking ASR systems for the Tunisian Arabic dialect. Conclusion: The LinTO datasets are an effective resource for addressing data scarcity in Tunisian Arabic ASR. Abstract: Developing Automatic Speech Recognition (ASR) systems for Tunisian Arabic Dialect is challenging due to the dialect's linguistic complexity and the scarcity of annotated speech datasets. To address these challenges, we propose the LinTO audio and textual datasets -- comprehensive resources that capture phonological and lexical features of Tunisian Arabic Dialect. These datasets include a variety of texts from numerous sources and real-world audio samples featuring diverse speakers and code-switching between Tunisian Arabic Dialect and English or French. By providing high-quality audio paired with precise transcriptions, the LinTO audio and textual datasets aim to provide qualitative material to build and benchmark ASR systems for the Tunisian Arabic Dialect. Keywords -- Tunisian Arabic Dialect, Speech-to-Text, Low-Resource Languages, Audio Data Augmentation

Refining CLIP's Spatial Awareness: A Visual-Centric Perspective

Congpei Qiu,Yanhao Wu,Wei Ke,Xiuxiu Bai,Tong Zhang

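Task: Propose a framework called Spatial Correlation Distillation (SCD) to strengthen CLIP's spatial awareness in dense prediction tasks.

Motivation: CLIP excels at global alignment between language and images but falls short on tasks that require precise spatial understanding; in particular, its spatial awareness degrades markedly after Region-Language Alignment (RLA) fine-tuning.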

Details

Method: Proposes the SCD framework to preserve CLIP's inherent spatial structure and introduces a lightweight Refiner that extracts high-quality dense features from CLIP. Result: The method achieves state-of-the-art results on open-vocabulary dense prediction benchmarks. Conclusion: The SCD framework combines vision-language and visual-centric improvements and significantly improves CLIP on dense tasks. Abstract: Contrastive Language-Image Pre-training (CLIP) excels in global alignment with language but exhibits limited sensitivity to spatial information, leading to strong performance in zero-shot classification tasks but underperformance in tasks requiring precise spatial understanding. Recent approaches have introduced Region-Language Alignment (RLA) to enhance CLIP's performance in dense multimodal tasks by aligning regional visual representations with corresponding text inputs. However, we find that CLIP ViTs fine-tuned with RLA suffer from notable loss in spatial awareness, which is crucial for dense prediction tasks. To address this, we propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates the above degradation. To further enhance spatial correlations, we introduce a lightweight Refiner that extracts refined correlations directly from CLIP before feeding them into SCD, based on an intriguing finding that CLIP naturally captures high-quality dense features. Together, these components form a robust distillation framework that enables CLIP ViTs to integrate both visual-language and visual-centric improvements, achieving state-of-the-art results across various open-vocabulary dense prediction benchmarks.
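The idea of distilling spatial correlations rather than raw features can be illustrated with a small sketch: build a patch-to-patch cosine similarity matrix for the teacher and for the student, then penalize their difference. The mean-squared-error objective and the function names are assumptions for illustration; the abstract does not state the exact form of the SCD loss or of the Refiner.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def correlation_matrix(patches: Sequence[Sequence[float]]) -> Matrix:
    """Cosine similarity between every pair of patch features."""
    unit = []
    for patch in patches:
        norm = math.sqrt(sum(x * x for x in patch)) or 1.0
        unit.append([x / norm for x in patch])
    return [[sum(a * b for a, b in zip(p, q)) for q in unit] for p in unit]


def correlation_distillation_loss(
    student_patches: Sequence[Sequence[float]],
    teacher_patches: Sequence[Sequence[float]],
) -> float:
    """Mean squared difference between student and teacher correlation matrices."""
    student = correlation_matrix(student_patches)
    teacher = correlation_matrix(teacher_patches)
    n = len(teacher)
    return sum(
        (student[i][j] - teacher[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)
```

Because only the correlation matrices are compared, the student's feature dimension does not have to match the teacher's, and rescaling features leaves the loss unchanged; only the spatial structure (which patches resemble which) is constrained.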

LLM for Complex Reasoning Task: An Exploratory Study in Fermi Problems

Zishuo Liu,Carlos Rabat Villarreal,Mostafa Rahgouy,Amit Das,Zheng Zhang,Chang Ren,Dongji Feng

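Task: Explore the capabilities and limitations of large language models (LLMs) in solving Fermi Problems (FPs).

Motivation: Fermi problems involve real-world impracticalities or ambiguous concepts, which makes them challenging for both humans and AI, yet LLM performance on such tasks has not been studied in depth.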

Details

Method: Evaluates three advanced LLMs on a publicly available FP dataset, with prompts designed according to the TELeR taxonomy, including a zero-shot scenario. Result: All LLMs score below 0.5 on fp_score, indicating the inherent difficulty of these tasks; LLMs perform better on standard FPs than on specific FPs. Conclusion: LLMs have limited ability to solve Fermi problems, especially specific ones, but do relatively better on standard ones. Abstract: Fermi Problems (FPs) are mathematical reasoning tasks that require human-like logic and numerical reasoning. Unlike other reasoning questions, FPs often involve real-world impracticalities or ambiguous concepts, making them challenging even for humans to solve. Despite advancements in AI, particularly with large language models (LLMs) in various reasoning tasks, FPs remain relatively under-explored. This work conducted an exploratory study to examine the capabilities and limitations of LLMs in solving FPs. We first evaluated the overall performance of three advanced LLMs using a publicly available FP dataset. We designed prompts according to the recently proposed TELeR taxonomy, including a zero-shot scenario. Results indicated that all three LLMs achieved an fp_score (ranging from 0 to 1) below 0.5, underscoring the inherent difficulty of these reasoning tasks. To further investigate, we categorized FPs into standard and specific questions, hypothesizing that LLMs would perform better on standard questions, which are characterized by clarity and conciseness, than on specific ones. Comparative experiments confirmed this hypothesis, demonstrating that LLMs performed better on standard FPs in terms of both accuracy and efficiency.
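The abstract gives only the range of fp_score, not its definition. The sketch below assumes the order-of-magnitude definition used in earlier Fermi problem benchmark work, where an answer loses one third of a point per factor of ten it is off and scores zero beyond three orders of magnitude. If this study uses a different formula, the code does not apply.

```python
import math


def fp_score(predicted: float, gold: float) -> float:
    """Assumed definition: max(0, 1 - |log10(predicted / gold)| / 3)."""
    if predicted <= 0 or gold <= 0:
        # The log ratio is undefined for non-positive values; score as a miss.
        return 0.0
    return max(0.0, 1.0 - abs(math.log10(predicted / gold)) / 3.0)


exact = fp_score(100.0, 100.0)        # correct answer
off_by_ten = fp_score(1000.0, 100.0)  # one order of magnitude too high
far_off = fp_score(1e6, 1.0)          # six orders of magnitude too high
```

Under this assumed definition, a score below 0.5 means the estimate is typically more than one and a half orders of magnitude (a factor of about 32) away from the reference answer.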

Evaluating and Enhancing Segmentation Model Robustness with Metamorphic Testing

Seif Mzoughi,Mohamed Elshafeia,Foutse Khomh

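Task: Propose SegRMT, a method that uses a genetic algorithm to optimize sequences of spatial and spectral transformations, generating adversarial examples to test the robustness of image segmentation models.

Motivation: Image segmentation models are critical in applications such as medical imaging and augmented reality, but they lack robustness and are vulnerable to adversarial perturbations.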

Details

Method: Uses a genetic algorithm (GA) to optimize transformation sequences while preserving image fidelity through a predefined PSNR threshold, generating adversarial examples. Result: SegRMT reduces DeepLabV3's mIoU to 6.4%, outperforming other baselines (8.5%-21.7%), and improves model performance when used for adversarial training (mIoU improvements of up to 73%). Conclusion: SegRMT not only simulates realistic image distortions but also strengthens segmentation model robustness, making it suitable for safety-critical applications. Abstract: Image segmentation is critical for applications such as medical imaging, augmented reality, and video surveillance. However, segmentation models often lack robustness, making them vulnerable to adversarial perturbations from subtle image distortions. In this work, we propose SegRMT, a metamorphic testing approach that leverages genetic algorithms (GA) to optimize sequences of spatial and spectral transformations while preserving image fidelity via a predefined PSNR threshold. Using the Cityscapes dataset, our method generates adversarial examples that effectively challenge the DeepLabV3 segmentation model. Our experiments show that SegRMT reduces DeepLabV3's mean Intersection over Union (mIoU) to 6.4%, outperforming other adversarial baselines that decrease mIoU to between 8.5% and 21.7%. Furthermore, when used for adversarial training, SegRMT boosts model performance, achieving mIoU improvements up to 73% on dedicated adversarial datasets and increasing cross-adversarial mIoU to 53.8%, compared to only 2%-10% for other methods. These findings demonstrate that SegRMT not only simulates realistic image distortions but also enhances the robustness of segmentation models, making it a valuable tool for ensuring reliable performance in safety-critical applications.
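The fidelity constraint mentioned above relies on peak signal-to-noise ratio, PSNR = 10 * log10(MAX^2 / MSE). A minimal sketch of how a genetic algorithm's fitness function might gate candidates on PSNR follows. The `fitness` form (reward low mIoU, reject low-fidelity images) and the 20 dB default threshold are assumptions for illustration; the abstract does not state the threshold or the fitness definition.

```python
import math
from typing import Sequence


def psnr(
    original: Sequence[float],
    transformed: Sequence[float],
    max_value: float = 255.0,
) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(original) != len(transformed):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, transformed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def fitness(
    original: Sequence[float],
    transformed: Sequence[float],
    miou: float,
    psnr_threshold: float = 20.0,  # hypothetical value, not from the paper
) -> float:
    """Assumed GA fitness: lower mIoU is better, but only above the PSNR threshold."""
    if psnr(original, transformed) < psnr_threshold:
        return 0.0  # too distorted to count as a realistic test case
    return 1.0 - miou
```

Gating on PSNR keeps the search from drifting toward heavily corrupted images, which would lower mIoU trivially without telling us anything about robustness to subtle distortions.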

Limitations of Religious Data and the Importance of the Target Domain: Towards Machine Translation for Guinea-Bissau Creole

Jacqueline Rowe,Edward Gow-Smith,Mark Hepple

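Task: Build and evaluate a new dataset for machine translation of Guinea-Bissau Creole (Kiriol).

Motivation: Address data scarcity in machine translation for low-resource languages such as Kiriol, and explore how to transfer from religious-domain data to the general domain.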

Details

Method: Trains several Transformer-based models on about 40 thousand parallel sentences (mostly from religious texts, with a small amount from a general-domain dictionary) to study domain transfer. Result: Adding a small amount of target-domain data (as few as 300 sentences) substantially improves translation performance; Portuguese-to-Kiriol translation performs best, which may relate to morphological complexity and lexical overlap. Conclusion: Highlights the importance of even small-scale data collection for low-resource languages and hopes to stimulate research on machine translation for creole languages. Abstract: We introduce a new dataset for machine translation of Guinea-Bissau Creole (Kiriol), comprising around 40 thousand parallel sentences to English and Portuguese. This dataset is made up of predominantly religious data (from the Bible and texts from the Jehovah's Witnesses), but also a small amount of general domain data (from a dictionary). This mirrors the typical resource availability of many low-resource languages. We train a number of transformer-based models to investigate how to improve domain transfer from religious data to a more general domain. We find that adding even 300 sentences from the target domain when training substantially improves the translation performance, highlighting the importance and need for data collection for low-resource languages, even on a small scale. We additionally find that Portuguese-to-Kiriol translation models perform better on average than other source and target language pairs, and investigate how this relates to the morphological complexity of the languages involved and the degree of lexical overlap between creoles and lexifiers. Overall, we hope our work will stimulate research into Kiriol and into how machine translation might better support creole languages in general.

LPA3D: 3D Room-Level Scene Generation from In-the-Wild Images

Ming-Jia Yang,Yu-Xiao Guo,Yang Liu,Bin Zhou,Xin Tong

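Task: Generate semantically plausible, richly detailed indoor scenes from a single RGB image.

Motivation: Existing NeRF-based scene generation methods need extra information such as multiple views, depth images, or semantic guidance and cannot rely on a single RGB image alone, mainly because camera pose estimation is difficult.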

Details

Method: Proposes the Local-Pose-Alignment (LPA) framework and builds on it LPA-GAN, a NeRF-based generative method that combines camera pose estimation with scene generation. Result: Experiments show that LPA-GAN outperforms other methods in view-to-view consistency and semantic plausibility. Conclusion: By redefining global poses and co-optimizing pose prediction and scene generation, LPA-GAN addresses the challenge of generating indoor scenes from a single RGB image. Abstract: Generating realistic, room-level indoor scenes with semantically plausible and detailed appearances from in-the-wild images is crucial for various applications in VR, AR, and robotics. The success of NeRF-based generative methods indicates a promising direction to address this challenge. However, unlike their success at the object level, existing scene-level generative methods require additional information, such as multiple views, depth images, or semantic guidance, rather than relying solely on RGB images. This is because NeRF-based methods necessitate prior knowledge of camera poses, which is challenging to approximate for indoor scenes due to the complexity of defining alignment and the difficulty of globally estimating poses from a single image, given the unseen parts behind the camera. To address this challenge, we redefine global poses within the framework of Local-Pose-Alignment (LPA) -- an anchor-based multi-local-coordinate system that uses a selected number of anchors as the roots of these coordinates. Building on this foundation, we introduce LPA-GAN, a novel NeRF-based generative approach that incorporates specific modifications to estimate the priors of camera poses under LPA. It also co-optimizes the pose predictor and scene generation processes. Our ablation study and comparisons with straightforward extensions of NeRF-based object generative methods demonstrate the effectiveness of our approach. Furthermore, visual comparisons with other techniques reveal that our method achieves superior view-to-view consistency and semantic normality.

The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context

Nikhil Verma,Manasa Bharadwaj

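Task: Analyze the effect of alignment in large language models in multilingual settings and the distributional shifts it causes.

Motivation: Current alignment methods focus mainly on English, so the effect of alignment in multilingual settings is unclear and a monolingual bias arises.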

Details

Method: Systematically analyzes distributional shifts in the LLM embedding space before and after alignment, uses the alignment-induced separation in safety space as a quantitative tool, and evaluates seven LLMs on balanced toxicity datasets and parallel text-detoxification benchmarks. Result: Reveals substantial disparities in the latent representation space between high-resource and low-resource languages. Conclusion: Language-specific fine-tuning is needed to ensure fair, reliable, and robust multilingual alignment, laying a foundation for truly safe multilingual LLMs. Abstract: Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. However, despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings. To address this, we conduct a systematic analysis of distributional shifts in the embedding space of LLMs before and after alignment, uncovering its impact on model behavior across diverse languages. We leverage the alignment-induced separation in safety space as a quantitative tool to measure how alignment enforces safety constraints. Our study evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, revealing substantial disparities in the latent representation space between high-resource and low-resource languages. These findings underscore the need for language-specific fine-tuning to ensure fair, reliable and robust multilingual alignment. Our insights provide a foundation for developing truly safe multilingual LLMs, emphasizing the urgency of addressing alignment gaps in underrepresented languages.

SemiISP/SemiIE: Semi-Supervised Image Signal Processor and Image Enhancement Leveraging One-to-Many Mapping sRGB-to-RAW

Masakazu Yoshimura,Junji Otsuka,Radu Berdan,Takeshi Ohashi

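Task: Realize semi-supervised learning for Image Signal Processor (ISP) and image enhancement (IE) tasks.

Motivation: Creating training data is costly and personalization needs vary, which makes semi-supervised learning a potential solution.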

Details

Method: Proposes an improved sRGB-to-RAW method and combines it with semi-supervised learning for ISP and IE tasks. Result: The proposed method improves the image quality of various models on various datasets. Conclusion: Semi-supervised learning combined with the improved sRGB-to-RAW method is an effective solution for ISP and IE tasks. Abstract: DNN-based methods have been successful in Image Signal Processor (ISP) and image enhancement (IE) tasks. However, the cost of creating training data for these tasks is considerably higher than for other tasks, making it difficult to prepare large-scale datasets. Also, creating personalized ISP and IE with minimal training data can lead to new value streams since preferred image quality varies depending on the person and use case. While semi-supervised learning could be a potential solution in such cases, it has rarely been utilized for these tasks. In this paper, we realize semi-supervised learning for ISP and IE leveraging a RAW image reconstruction (sRGB-to-RAW) method. Although existing sRGB-to-RAW methods can generate pseudo-RAW image datasets that improve the accuracy of RAW-based high-level computer vision tasks such as object detection, their quality is not sufficient for ISP and IE tasks that require precise image quality definition. Therefore, we also propose an sRGB-to-RAW method that can improve the image quality of these tasks. The proposed semi-supervised learning with the proposed sRGB-to-RAW method successfully improves the image quality of various models on various datasets.

ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization

Kehua Feng,Keyan Ding,Jing Yu,Menghan Li,Yuhao Wang,Tong Xu,Xinda Wang,Qiang Zhang,Huajun Chen

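Task: Propose Ex-Ante Reasoning Preference Optimization (ERPO), a new safety alignment framework that strengthens the safety of large language models.

Motivation: Existing alignment methods fall short across diverse safety scenarios and remain vulnerable to adversarial attacks.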

Details

Method: Three stages: (1) equip the model with Ex-Ante reasoning through supervised fine-tuning (SFT); (2) enhance safety, usefulness, and efficiency via Direct Preference Optimization (DPO); (3) reduce inference latency with a length-controlled iterative preference optimization strategy. Result: Experiments show that ERPO significantly improves safety while maintaining response efficiency. Conclusion: ERPO is an effective safety alignment framework that significantly improves model safety without sacrificing efficiency. Abstract: Recent advancements in large language models (LLMs) have accelerated progress toward artificial general intelligence, yet their potential to generate harmful content poses critical safety challenges. Existing alignment methods often struggle to cover diverse safety scenarios and remain vulnerable to adversarial attacks. In this work, we propose Ex-Ante Reasoning Preference Optimization (ERPO), a novel safety alignment framework that equips LLMs with explicit preemptive reasoning through Chain-of-Thought and provides clear evidence for safety judgments by embedding predefined safety rules. Specifically, our approach consists of three stages: first, equipping the model with Ex-Ante reasoning through supervised fine-tuning (SFT) using a constructed reasoning module; second, enhancing safety, usefulness, and efficiency via Direct Preference Optimization (DPO); and third, mitigating inference latency with a length-controlled iterative preference optimization strategy. Experiments on multiple open-source LLMs demonstrate that ERPO significantly enhances safety performance while maintaining response efficiency.
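
The second stage of ERPO relies on Direct Preference Optimization. The abstract does not give ERPO's exact objective, so the following is only a minimal sketch of the standard DPO loss for a single preference pair; the function name, argument names, and the default beta are illustrative assumptions, and the inputs are assumed to be summed sequence log-probabilities.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All four inputs are sequence log-probabilities; beta=0.1 is an
    illustrative default, not a value taken from the ERPO paper.
    """
    # How much more the policy favours each response than the frozen reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# The loss is log(2) when the policy equals the reference and shrinks as the
# policy shifts probability toward the chosen response.
print(dpo_loss(-3.0, -3.0, -3.0, -3.0))
print(dpo_loss(-1.0, -5.0, -3.0, -3.0))
```

In practice the log-probabilities would come from the model being trained and a frozen reference copy, averaged over a batch of preference pairs.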

Agglomerating Large Vision Encoders via Distillation for VFSS Segmentation

Chengxi Zeng,Yuxuan Jiang,Fan Zhang,Alberto Gambaruto,Tilo Burghardt

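Task: Propose a new framework that improves low-complexity models via knowledge distillation from multiple large medical foundation models.

Motivation: Medical foundation models perform well in medical imaging, but their training and inference complexity is high, and lightweight variants are limited in performance.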

Details

Method: Improve low-complexity models via knowledge distillation from multiple large medical foundation models (e.g., MedSAM, RAD-DINO, MedCLIP). Result: The proposed model generalizes better across 12 segmentation tasks, with an average gain of 2% in Dice coefficient. Conclusion: The method achieves a better tradeoff between complexity and performance in medical image segmentation. Abstract: The deployment of foundation models for medical imaging has demonstrated considerable success. However, their training overheads associated with downstream tasks remain substantial due to the size of the image encoders employed, and the inference complexity is also significantly high. Although lightweight variants have been obtained for these foundation models, their performance is constrained by their limited model capacity and suboptimal training strategies. In order to achieve an improved tradeoff between complexity and performance, we propose a new framework to improve the performance of low complexity models via knowledge distillation from multiple large medical foundation models (e.g., MedSAM, RAD-DINO, MedCLIP), each specializing in different vision tasks, with the goal to effectively bridge the performance gap for medical image segmentation tasks. The agglomerated model demonstrates superior generalization across 12 segmentation tasks, whereas specialized models require explicit training for each task. Our approach achieved an average performance gain of 2% in Dice coefficient compared to simple distillation.
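
The reported gain is measured in Dice coefficient. As a reminder of what that metric computes, here is a minimal sketch for two binary masks given as flat 0/1 sequences; the function name and the empty-mask convention are assumptions for illustration, not details from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks (flat 0/1 sequences)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(dice_coefficient(pred, target))  # 2*2 / (3+3), about 0.667
```

A 2% gain on this scale means, for example, moving from an average Dice of 0.80 to 0.82 if the figure is in absolute points; the abstract does not say whether the gain is absolute or relative.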

Why do LLMs attend to the first token?

Federico Barbero,Álvaro Arroyo,Xiangming Gu,Christos Perivolaropoulos,Michael Bronstein,Petar Veličković,Razvan Pascanu

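Task: Investigate why attention sinks form in large language models (LLMs) and what role they play.

Motivation: Although many works have studied the attention sink phenomenon and its effects, why it forms and how it is used remain poorly understood.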

Details

Method: Through theoretical and empirical analysis, examine how attention sinks help LLMs avoid over-mixing information, and run experiments on how factors such as context length and model depth affect sink behaviour. Result: The experiments validate the theoretical intuitions and show how attention sinks are used in practice in LLMs. Conclusion: The study offers a new perspective on how attention sinks form and the role they play during training. Abstract: Large Language Models (LLMs) tend to attend heavily to the first token in the sequence -- creating a so-called attention sink. Many works have studied this phenomenon in detail, proposing various ways to either leverage or alleviate it. Attention sinks have been connected to quantisation difficulties, security issues, and streaming attention. Yet, while many works have provided conditions in which they occur or not, a critical question remains shallowly answered: Why do LLMs learn such patterns and how are they being used? In this work, we argue theoretically and empirically that this mechanism provides a method for LLMs to avoid over-mixing, connecting this to existing lines of work that study mathematically how information propagates in Transformers. We conduct experiments to validate our theoretical intuitions and show how choices such as context length, depth, and data packing influence the sink behaviour. We hope that this study provides a new practical perspective on why attention sinks are useful in LLMs, leading to a better understanding of the attention patterns that form during training.

All-day Depth Completion via Thermal-LiDAR Fusion

Janghyun Kim,Minseong Kweon,Jinsun Park,Ukcheol Shin

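Task: Perform depth completion from thermal and LiDAR data to improve performance in harsh conditions such as low light and rain.

Motivation: Existing methods perform poorly in harsh environments and ground-truth depth maps suffer from missing measurements, while thermal cameras are reliable in such conditions but remain under-studied.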

Details

Method: Propose the COPS framework, which combines contrastive learning and pseudo-supervision and leverages a depth foundation model to sharpen depth boundaries and improve completion accuracy. Result: Extensive benchmarks on the MS$^2$ and ViViD datasets confirm the feasibility and robustness of the approach. Conclusion: COPS addresses key challenges in thermal-LiDAR depth completion and points to directions for future research. Abstract: Depth completion, which estimates dense depth from sparse LiDAR and RGB images, has demonstrated outstanding performance in well-lit conditions. However, due to the limitations of RGB sensors, existing methods often struggle to achieve reliable performance in harsh environments, such as heavy rain and low-light conditions. Furthermore, we observe that ground truth depth maps often suffer from large missing measurements in adverse weather conditions such as heavy rain, leading to insufficient supervision. In contrast, thermal cameras are known for providing clear and reliable visibility in such conditions, yet thermal-LiDAR depth completion remains underexplored. Moreover, the characteristics of thermal images, such as blurriness, low contrast, and noise, bring unclear depth boundary problems. To address these challenges, we first evaluate the feasibility and robustness of thermal-LiDAR depth completion across diverse lighting (e.g., well-lit, low-light), weather (e.g., clear-sky, rainy), and environment (e.g., indoor, outdoor) conditions, by conducting extensive benchmarks on the MS$^2$ and ViViD datasets. In addition, we propose a framework that utilizes COntrastive learning and Pseudo-Supervision (COPS) to enhance depth boundary clarity and improve completion accuracy by leveraging a depth foundation model in two key ways. First, COPS enforces a depth-aware contrastive loss between different depth points by mining positive and negative samples using a monocular depth foundation model to sharpen depth boundaries. Second, it mitigates the issue of incomplete supervision from ground truth depth maps by leveraging foundation model predictions as dense depth priors. We also provide in-depth analyses of the key challenges in thermal-LiDAR depth completion to aid in understanding the task and encourage future research.

Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study

Aryan Agrawal,Lisa Alazraki,Shahin Honarvar,Marek Rei

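Task: Study how to make large language models (LLMs) more robust to character- and word-level perturbations of task instructions.

Motivation: Existing methods focus mainly on perturbed data samples, while robustness to perturbed task instructions has received little attention.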

Details

Method: Apply techniques including self-denoising and representation alignment, testing different models (Llama 3 and Flan-T5), datasets (CoLA, QNLI, SST-2), and instructions (task-oriented and role-oriented). Result: Self-denoising, whether performed by a frozen LLM or a fine-tuned model, yields substantially higher average performance gains than the other strategies. Conclusion: Self-denoising is an effective way to make LLMs more robust to perturbed task instructions. Abstract: Large Language Models (LLMs) are highly vulnerable to input perturbations, as even a small prompt change may result in a substantially different output. Existing methods to enhance LLM robustness are primarily focused on perturbed data samples, whereas improving resiliency to perturbations of task-level instructions has remained relatively underexplored. In this work, we focus on character- and word-level edits of task-specific instructions, which substantially degrade downstream performance. We experiment with a variety of techniques to enhance the robustness of LLMs, including self-denoising and representation alignment, testing different models (Llama 3 and Flan-T5), datasets (CoLA, QNLI, SST-2) and instructions (both task-oriented and role-oriented). We find that, on average, self-denoising -- whether performed by a frozen LLM or a fine-tuned model -- achieves substantially higher performance gains than alternative strategies, including more complex baselines such as ensembling and supervised methods.
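
To make "character- and word-level edits" concrete, the sketch below applies one edit of each kind to a task instruction. The abstract does not specify the perturbation procedure used in the study, so the two edit types, the function names, and the example instruction are assumptions chosen for illustration.

```python
import random

def swap_adjacent_chars(text, rng):
    """Character-level edit: swap one randomly chosen pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def drop_word(text, rng):
    """Word-level edit: delete one randomly chosen word."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)

rng = random.Random(0)  # fixed seed so the perturbations are reproducible
instruction = "Classify the sentiment of the sentence as positive or negative."
print(swap_adjacent_chars(instruction, rng))
print(drop_word(instruction, rng))
```

A self-denoising setup would then ask a model to reconstruct the clean instruction from such a perturbed version before solving the task.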

Brightness Perceiving for Recursive Low-Light Image Enhancement

Haodian Wang,Long Peng,Yuejin Sun,Zengyu Wan,Yang Wang,Yang Cao

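Task: Propose a brightness-perceiving recursive enhancement framework for high dynamic range low-light image enhancement.

Motivation: Because real low-light scenes span a wide dynamic range, existing end-to-end methods struggle to enhance low-light images to normal exposure.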

Details

Method: Use a recursive enhancement framework with two parallel sub-networks (ACT-Net and BP-Net), combined with an unsupervised training strategy. Result: Achieves SOTA performance on six reference and no-reference metrics, with a PSNR improvement of 0.9 dB. Conclusion: The proposed method effectively addresses low-light image enhancement and outperforms existing methods. Abstract: Due to the wide dynamic range in real low-light scenes, there will be large differences in the degree of contrast degradation and detail blurring of captured images, making it difficult for existing end-to-end methods to enhance low-light images to normal exposure. To address the above issue, we decompose low-light image enhancement into a recursive enhancement task and propose a brightness-perceiving-based recursive enhancement framework for high dynamic range low-light image enhancement. Specifically, our recursive enhancement framework consists of two parallel sub-networks: Adaptive Contrast and Texture enhancement network (ACT-Net) and Brightness Perception network (BP-Net). The ACT-Net is proposed to adaptively enhance image contrast and details under the guidance of the brightness adjustment branch and gradient adjustment branch, which are proposed to perceive the degradation degree of contrast and details in low-light images. To adaptively enhance images captured under different brightness levels, BP-Net is proposed to control the recursive enhancement times of ACT-Net by exploring the image brightness distribution properties. Finally, in order to coordinate ACT-Net and BP-Net, we design a novel unsupervised training strategy to facilitate the training procedure. To further validate the effectiveness of the proposed method, we construct a new dataset with a broader brightness distribution by mixing three low-light datasets. Compared with eleven existing representative methods, the proposed method achieves new SOTA performance on six reference and no-reference metrics. Specifically, the proposed method improves the PSNR by 0.9 dB compared to the existing SOTA method.
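
To put the 0.9 dB figure in context, the following sketch computes PSNR from its usual definition, 10·log10(MAX² / MSE), for two equal-length pixel sequences. The function name and the 8-bit default peak value are assumptions for illustration; the paper's evaluation code is not described in the abstract.

```python
import math

def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(enhanced):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Because PSNR is logarithmic in the MSE, a 0.9 dB gain at a fixed peak value
# corresponds to dividing the mean squared error by 10 ** (0.9 / 10).
print(round(10 ** (0.9 / 10), 3))  # 1.23
```

In other words, under this definition a 0.9 dB improvement means the mean squared error drops to roughly 81% of the previous best.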

MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

Jaap Jumelet,Leonie Weissweiler,Arianna Bisazza

Task: Introduce MultiBLiMP 1.0, a multilingual benchmark for evaluating linguistic minimal pairs across 101 languages.

Motivation: To assess the abilities of large language models (LLMs) at a multilingual scale and identify limitations in modeling low-resource languages.

Details

Method: Utilize an automated pipeline leveraging Universal Dependencies and UniMorph to create over 125,000 minimal pairs covering 6 linguistic phenomena. Result: MultiBLiMP 1.0 provides a comprehensive benchmark, revealing shortcomings in current LLMs for low-resource languages. Conclusion: The benchmark highlights the need for improved modeling techniques for low-resource languages. Abstract: We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages, 6 linguistic phenomena and containing more than 125,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
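
A minimal-pair benchmark is typically scored by checking whether a model assigns a higher probability to the grammatical sentence than to its ungrammatical counterpart. The abstract does not state the exact scoring rule used for MultiBLiMP 1.0, so the sketch below assumes that convention; the scorer is a toy lookup table standing in for a real language model, and all names and scores are hypothetical.

```python
def minimal_pair_accuracy(pairs, log_prob):
    """Fraction of (grammatical, ungrammatical) pairs where the grammatical
    sentence receives the strictly higher score."""
    correct = sum(1 for good, bad in pairs if log_prob(good) > log_prob(bad))
    return correct / len(pairs)

# Toy stand-in scorer (hypothetical values): a real evaluation would sum
# token log-probabilities from a language model.
toy_scores = {
    "The dogs bark.": -4.0,
    "The dogs barks.": -6.5,
    "She walks.": -3.0,
    "She walk.": -2.5,
}
pairs = [("The dogs bark.", "The dogs barks."), ("She walks.", "She walk.")]
print(minimal_pair_accuracy(pairs, toy_scores.get))  # 0.5
```

The toy scorer prefers the correct sentence in the first pair only, so the accuracy is 0.5; chance level for this kind of metric is also 0.5.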

VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

Kim Sung-Bin,Jeongsoo Choi,Puyuan Peng,Joon Son Chung,Tae-Hyun Oh,David Harwath

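Task: Propose VoiceCraft-Dub, an automated video dubbing method that synthesizes high-quality speech from text and facial cues.

Motivation: The task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals.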

Details

Method: Extend Neural Codec Language Models (NCLMs) by incorporating video features so that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. Adapters align facial features with the NCLM token space, and audio-visual fusion layers merge the information within the NCLM framework. Result: The model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. Conclusion: VoiceCraft-Dub is effective for video dubbing and applicable to a range of scenarios. Abstract: We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.

A Framework for Robust Cognitive Evaluation of LLMs

Karin de Langis,Jong Inn Park,Bin Hu,Khanh Chi Le,Andreas Schramm,Michael C. Mensink,Andrew Elfenbein,Dongyeop Kang

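Task: Develop CognitivEval, a framework for systematically evaluating the artificial cognitive capabilities of large language models (LLMs).

Motivation: Cognitive abilities in LLMs have been widely observed, but their nature and mechanisms remain unclear, and standardized evaluation methods are lacking.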

Details

Method: CognitivEval features automatic prompt permutations and testing that gathers both generations and model probability estimates. Result: Experiments show that the framework yields more robust experimental outcomes and replicates five classic cognitive science experiments. Conclusion: CognitivEval generalizes across experimental tasks for evaluating LLM cognition and will be released publicly to foster collaboration. Abstract: Emergent cognitive abilities in large language models (LLMs) have been widely observed, but their nature and underlying mechanisms remain poorly understood. A growing body of research draws on cognitive science to investigate LLM cognition, but standard methodologies and experimental pipelines have not yet been established. To address this gap we develop CognitivEval, a framework for systematically evaluating the artificial cognitive capabilities of LLMs, with a particular emphasis on robustness in response collection. The key features of CognitivEval include: (i) automatic prompt permutations, and (ii) testing that gathers both generations and model probability estimates. Our experiments demonstrate that these features lead to more robust experimental outcomes. Using CognitivEval, we replicate five classic experiments in cognitive science, illustrating the framework's generalizability across various experimental tasks and obtaining a cognitive profile of several state-of-the-art LLMs. CognitivEval will be released publicly to foster broader collaboration within the cognitive science community.
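
The abstract does not say which parts of a prompt CognitivEval permutes. As one plausible reading, the sketch below generates every ordering of a multiple-choice question's answer options, so that a model's response can be checked for sensitivity to option order; the function name, prompt layout, and example question are hypothetical and not taken from the framework.

```python
from itertools import permutations

def permute_options(question, options):
    """Yield one prompt string per ordering of the answer options."""
    for ordering in permutations(options):
        # Label options A, B, C, ... in their permuted order.
        lines = [question] + [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(ordering)]
        yield "\n".join(lines)

prompts = list(permute_options("Which word is a verb?", ["run", "table", "blue"]))
print(len(prompts))  # 6
print(prompts[0])
```

Aggregating a model's answers over all such variants gives a result that does not depend on any single prompt layout.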

Marine Saliency Segmenter: Object-Focused Conditional Diffusion with Region-Level Semantic Knowledge Distillation

Laibin Chang,Yunke Wang,JiaXing Huang,Longxiang Deng,Bo Du,Chang Xu

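Task: Propose DiffMSS, a diffusion-based marine saliency segmenter that addresses the object mislocalization and imprecise boundaries of existing techniques in complex underwater environments.

Motivation: Existing marine segmentation techniques suffer from object mislocalization and imprecise boundaries due to the complex underwater environment; diffusion models perform well in visual segmentation, yet contextual semantics could be leveraged further to enhance feature learning of region-level salient objects.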

Details

Method: Design a region-word similarity matching mechanism that identifies salient terms in text descriptions, and use semantic knowledge distillation to guide the conditional feature learning network in generating accurate diffusion conditions; also develop consensus deterministic sampling to refine the segmentation of fine-grained structures. Result: DiffMSS outperforms state-of-the-art methods in both quantitative and qualitative evaluations. Conclusion: Through semantic knowledge distillation and consensus deterministic sampling, DiffMSS markedly improves the accuracy and boundary clarity of marine saliency segmentation. Abstract: Marine Saliency Segmentation (MSS) plays a pivotal role in various vision-based marine exploration tasks. However, existing marine segmentation techniques face the dilemma of object mislocalization and imprecise boundaries due to the complex underwater environment. Meanwhile, despite the impressive performance of diffusion models in visual segmentation, there remains potential to further leverage contextual semantics to enhance feature learning of region-level salient objects, thereby improving segmentation outcomes. Building on this insight, we propose DiffMSS, a novel marine saliency segmenter based on the diffusion model, which utilizes semantic knowledge distillation to guide the segmentation of marine salient objects. Specifically, we design a region-word similarity matching mechanism to identify salient terms at the word level from the text descriptions. These high-level semantic features guide the conditional feature learning network in generating salient and accurate diffusion conditions with semantic knowledge distillation. To further refine the segmentation of fine-grained structures in unique marine organisms, we develop dedicated consensus deterministic sampling to suppress overconfident missegmentations. Comprehensive experiments demonstrate the superior performance of DiffMSS over state-of-the-art methods in both quantitative and qualitative evaluations.

Zhuohan Ge,Nicole Hu,Darian Li,Yubo Wang,Shihao Qi,Yuming Xu,Han Shi,Jason Zhang

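Task: Explore the potential of large language models (LLMs) for detecting mental health problems in social media data analysis.

Motivation: Social media data is an important resource for mental health research, but using LLMs to detect mental health problems remains challenging.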

Details

Method: Summarize LLM application methods along several dimensions, such as text data analysis and detection of mental disorders, and analyze the major challenges and shortcomings of current research. Result: Highlights the great potential of LLMs in mental health detection and provides an overview of popular datasets and evaluation metrics. Conclusion: The survey gives mental health researchers a comprehensive frame of reference and shows the potential for further application of LLMs in future mental health interventions. Abstract: The detection and intervention of mental health issues represent a critical global research focus, and social media data has been recognized as an important resource for mental health research. However, how to utilize Large Language Models (LLMs) for mental health problem detection on social media poses significant challenges. Hence, this paper aims to explore the potential of LLM applications in social media data analysis, focusing not only on the most common psychological disorders such as depression and anxiety but also incorporating psychotic disorders and externalizing disorders, summarizing the application methods of LLM from different dimensions, such as text data analysis and detection of mental disorders, and revealing the major challenges and shortcomings of current research. In addition, the paper provides an overview of popular datasets and evaluation metrics. The survey in this paper provides a comprehensive frame of reference for researchers in the field of mental health, while demonstrating the great potential of LLMs in mental health detection to facilitate the further application of LLMs in future mental health interventions.

Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval

Boseung Jeong,Jicheol Park,Sungyeon Kim,Suha Kwak

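Task: Propose AVIGATE, a novel video-text retrieval framework that leverages audio cues through a gated attention mechanism and improves video-text alignment with an adaptive margin-based contrastive loss.

Motivation: Existing methods rely mainly on visual and textual features and ignore audio, while traditional models that do use audio apply it blindly, resulting in suboptimal video representations.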

Details

Method: Propose the AVIGATE framework, which uses a gated attention mechanism to selectively filter out uninformative audio signals and an adaptive margin-based contrastive loss to handle the unclear positive-negative relationship between video and text. Result: AVIGATE achieves state-of-the-art performance on public benchmarks. Conclusion: By leveraging audio effectively and improving alignment, AVIGATE markedly improves video-text retrieval performance. Abstract: Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, although it helps enhance overall comprehension of video content. Moreover, traditional models that incorporate audio blindly utilize the audio input regardless of whether it is useful or not, resulting in suboptimal video representation. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), that effectively leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better video-text alignment. Our extensive experiments demonstrate that AVIGATE achieves state-of-the-art performance on all the public benchmarks.

MegaMath: Pushing the Limits of Open Math Corpora

Fan Zhou,Zengzhi Wang,Nikhil Ranjan,Zhoujun Cheng,Liping Tang,Guowei He,Zhengzhong Liu,Eric P. Xing

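Task: Build MegaMath, an open, large-scale, high-quality dataset for math pre-training.

Motivation: Mathematical reasoning is a core capability for artificial intelligence, but a large-scale, high-quality dataset suited to math pre-training is still lacking.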

Details

Method: Build the dataset with three strategies: re-extracting web data, selecting high-quality math-related code, and synthesizing data. Result: MegaMath provides 371B tokens, the largest and highest-quality open math pre-training dataset to date. Conclusion: MegaMath fills the gap in math pre-training datasets and provides an important resource for research on mathematical reasoning. Abstract: Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: We re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fasttext-based filtering and deduplication, all for acquiring higher-quality data on the Internet. (2) Recalling Math-related code data: We identified high-quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity. (3) Exploring Synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens with the largest quantity and top quality among existing open math pre-training datasets.

Hyperspectral Remote Sensing Images Salient Object Detection: The First Benchmark Dataset and Baseline

Peifu Liu,Huiyan Bai,Tingfa Xu,Jihui Wang,Huan Chen,Jianan Li

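Task: Hyperspectral remote sensing image salient object detection (HRSI-SOD), which aims to identify objects or regions with distinct spectral contrast against the background.

Motivation: The area holds significant promise for practical applications, but a lack of dedicated datasets and methods has limited progress.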

Details

Method: Introduce HRSSD, the first HRSI-SOD dataset, and design the Deep Spectral Saliency Network (DSSN), built around the Cross-level Saliency Assessment Block and the High-resolution Fusion Module. Result: Experiments validate the superiority of DSSN on HRSSD and demonstrate generalization on other datasets. Conclusion: Highlights the importance of dedicated datasets and methods in this domain; the dataset and code are publicly available. Abstract: The objective of hyperspectral remote sensing image salient object detection (HRSI-SOD) is to identify objects or regions that exhibit distinct spectrum contrasts with the background. This area holds significant promise for practical applications; however, progress has been limited by a notable scarcity of dedicated datasets and methodologies. To bridge this gap and stimulate further research, we introduce the first HRSI-SOD dataset, termed HRSSD, which includes 704 hyperspectral images and 5327 pixel-level annotated salient objects. The HRSSD dataset poses substantial challenges for salient object detection algorithms due to large scale variation, diverse foreground-background relations, and multi-salient objects. Additionally, we propose an innovative and efficient baseline model for HRSI-SOD, termed the Deep Spectral Saliency Network (DSSN). The core of DSSN is the Cross-level Saliency Assessment Block, which performs pixel-wise attention and evaluates the contributions of multi-scale similarity maps at each spatial location, effectively reducing erroneous responses in cluttered regions and emphasizing salient regions across scales. Additionally, the High-resolution Fusion Module combines bottom-up fusion strategy and learned spatial upsampling to leverage the strengths of multi-scale saliency maps, ensuring accurate localization of small objects. Experiments on the HRSSD dataset robustly validate the superiority of DSSN, underscoring the critical need for specialized datasets and methodologies in this domain. Further evaluations on the HSOD-BIT and HS-SOD datasets demonstrate the generalizability of the proposed method. The dataset and source code are publicly available at https://github.com/laprf/HRSSD.

Generative Evaluation of Complex Reasoning in Large Language Models

Haowei Lin,Xiangyu Wang,Ruilin Yan,Baizhou Huang,Haotian Ye,Jianhua Zhu,Zihao Wang,James Zou,Jianzhu Ma,Yitao Liang

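Task: Evaluate whether large language models (LLMs) genuinely reason rather than merely recalling answers memorized from training data.

Motivation: Once existing benchmarks are incorporated into LLM training data, contamination undermines their reliability, so a new evaluation framework is needed to test genuine reasoning ability.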

Details

Method: Propose KUMO, a framework that combines LLMs with symbolic engines to dynamically generate diverse multi-turn reasoning tasks and continuously produces new tasks through an automated pipeline. Result: Evaluating 23 state-of-the-art LLMs on 5,000 tasks shows that many LLMs surpass university-level performance on easy reasoning tasks, while reasoning-scaled LLMs reach university-level performance on complex tasks. Conclusion: KUMO effectively assesses the genuine reasoning ability of LLMs, and performance on it correlates strongly with real-world reasoning benchmarks, supporting its robustness and durability as an evaluation tool. Abstract: With powerful large language models (LLMs) demonstrating superhuman reasoning capabilities, a critical question arises: Do LLMs genuinely reason, or do they merely recall answers from their extensive, web-scraped training datasets? Publicly released benchmarks inevitably become contaminated once incorporated into subsequent LLM training sets, undermining their reliability as faithful assessments. To address this, we introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in LLMs. KUMO synergistically combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization. We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO, benchmarking their reasoning abilities against university students. Our findings reveal that many LLMs have outperformed university-level performance on easy reasoning tasks, and reasoning-scaled LLMs reach university-level performance on complex reasoning challenges. Moreover, LLM performance on KUMO tasks correlates strongly with results on newly released real-world reasoning benchmarks, underscoring KUMO's value as a robust, enduring assessment tool for genuine LLM reasoning capabilities.

Leveraging Static Relationships for Intra-Type and Inter-Type Message Passing in Video Question Answering

Lili Liang,Guanglu Sun

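Task: Propose a reasoning method based on intra-type and inter-type message passing over static relationships to improve video question answering accuracy.

Motivation: Existing methods based on static relationship reasoning are deficient in recognizing and representing static relationships, and do not fully exploit static relationship information in videos for in-depth reasoning.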

Details

Method: Construct a dual graph for intra-type message passing reasoning and a heterogeneous graph based on static relationships for inter-type message passing reasoning, then combine the clues from both to infer the answer. Result: Experiments on the ANetQA and Next-QA datasets demonstrate the effectiveness of the method. Conclusion: Intra-type and inter-type message passing over static relationships improves video question answering accuracy. Abstract: Video Question Answering (VideoQA) is an important research direction in the field of artificial intelligence, enabling machines to understand video content and perform reasoning and answering based on natural language questions. Although methods based on static relationship reasoning have made certain progress, there are still deficiencies in the accuracy of static relationship recognition and representation, and they have not fully utilized the static relationship information in videos for in-depth reasoning and analysis. Therefore, this paper proposes a reasoning method for intra-type and inter-type message passing based on static relationships. This method constructs a dual graph for intra-type message passing reasoning and builds a heterogeneous graph based on static relationships for inter-type message passing reasoning. The intra-type message passing reasoning model captures the neighborhood information of targets and relationships related to the question in the dual graph, updating the dual graph to obtain intra-type clues for answering the question. The inter-type message passing reasoning model captures the neighborhood information of targets and relationships from different categories related to the question in the heterogeneous graph, updating the heterogeneous graph to obtain inter-type clues for answering the question. Finally, the answers are inferred by combining the intra-type and inter-type clues based on static relationships. Experimental results on the ANetQA and Next-QA datasets demonstrate the effectiveness of this method.

LLMs Working in Harmony: A Survey on the Technological Aspects of Building Effective LLM-Based Multi Agent Systems

R. M. Aratchige,W. M. K. S. Ilmini

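Task: Survey the key technologies behind large language model (LLM)-based multi-agent systems.

Motivation: Optimize the performance of multi-agent systems in collaborative, dynamic environments.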

Details

Method: Analyze recent advances and their limitations in four key areas: architecture, memory, planning, and frameworks. Result: Summarizes the strengths and challenges of current technologies and offers recommendations for improving system scalability, collaboration, and adaptability. Conclusion: Provides a roadmap for future research, supporting the construction of more efficient and robust multi-agent systems. Abstract: This survey investigates foundational technologies essential for developing effective Large Language Model (LLM)-based multi-agent systems. Aiming to answer how best to optimize these systems for collaborative, dynamic environments, we focus on four critical areas: Architecture, Memory, Planning, and Technologies/Frameworks. By analyzing recent advancements and their limitations, such as scalability, real-time response challenges, and agent coordination constraints, we provide a detailed view of the technological landscape. Frameworks like the Mixture of Agents architecture and the ReAct planning model exemplify current innovations, showcasing improvements in role assignment and decision-making. This review synthesizes key strengths and persistent challenges, offering practical recommendations to enhance system scalability, agent collaboration, and adaptability. Our findings provide a roadmap for future research, supporting the creation of robust, efficient multi-agent systems that advance both individual agent performance and collective system resilience.

OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication

Zhongjian Wang,Peng Zhang,Jinwei Qi,Guangyuan Wang,Sheng Xu,Bang Zhang,Liefeng Bo

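Task: Propose OmniTalker, an end-to-end unified framework that generates synchronized speech and talking head video from text and a reference video in real time.

Motivation: Address the system complexity, latency, asynchronous audiovisual output, and stylistic mismatch between speech and visual expression found in existing methods.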

Details

Method: Use a dual-branch diffusion transformer architecture with an audio branch and a visual branch, plus an audio-visual fusion module and an in-context reference learning module. Result: OmniTalker surpasses existing methods in generation quality, particularly in style preservation and audio-video synchronization, with real-time inference at 25 FPS. Conclusion: OmniTalker is the first unified framework to jointly model speech style and facial style in a zero-shot setting, combining efficiency with high-quality generation. Abstract: Recent years have witnessed remarkable advances in talking head generation, owing to its potential to revolutionize human-AI interaction from text interfaces into realistic video chats. However, text-driven talking head generation remains underexplored, with existing methods predominantly adopting a cascaded pipeline that combines TTS systems with audio-driven talking head models. This conventional pipeline not only introduces system complexity and latency overhead but also fundamentally suffers from asynchronous audiovisual output and stylistic discrepancies between generated speech and visual expressions. To address these limitations, we introduce OmniTalker, an end-to-end unified framework that simultaneously generates synchronized speech and talking head videos from text and reference video in real-time zero-shot scenarios, while preserving both speech style and facial styles. The framework employs a dual-branch diffusion transformer architecture: the audio branch synthesizes mel-spectrograms from text, while the visual branch predicts fine-grained head poses and facial dynamics. To bridge modalities, we introduce a novel audio-visual fusion module that integrates cross-modal information to ensure temporal synchronization and stylistic coherence between audio and visual outputs. Furthermore, our in-context reference learning module effectively captures both speech and facial style characteristics from a single reference video without introducing an extra style extracting module. To the best of our knowledge, OmniTalker presents the first unified framework that jointly models speech style and facial style in a zero-shot setting, achieving real-time inference speed of 25 FPS. Extensive experiments demonstrate that our method surpasses existing approaches in generation quality, particularly excelling in style preservation and audio-video synchronization.

Urban Computing in the Era of Large Language Models

Zhonghang Li,Lianghao Xia,Xubin Ren,Jiabin Tang,Tianyi Chen,Yong Xu,Chao Huang

Task: Examine the application and potential of large language models (LLMs) in urban computing.

Motivation: Traditional approaches to urban computing are limited in generalization, scalability, and contextual understanding; LLMs offer new ways to address these limitations.

Details

Method: Surveys the core technologies of LLMs and their applications in urban computing, including data processing, decision support, and citizen engagement. Result: Summarizes the functional roles and implementation patterns of LLMs in key domains such as transportation, public safety, and environmental monitoring, and proposes potential solutions. Conclusion: Discusses the limitations of current approaches and outlines future directions for LLMs in urban computing. Abstract: Urban computing has emerged as a multidisciplinary field that harnesses data-driven technologies to address challenges and improve urban living. Traditional approaches, while beneficial, often face challenges with generalization, scalability, and contextual understanding. The advent of Large Language Models (LLMs) offers transformative potential in this domain. This survey explores the intersection of LLMs and urban computing, emphasizing the impact of LLMs in processing and analyzing urban data, enhancing decision-making, and fostering citizen engagement. We provide a concise overview of the evolution and core technologies of LLMs. Additionally, we survey their applications across key urban domains, such as transportation, public safety, and environmental monitoring, summarizing essential tasks and prior works in various urban contexts, while highlighting LLMs' functional roles and implementation patterns. Building on this, we propose potential LLM-based solutions to address unresolved challenges. To facilitate in-depth research, we compile a list of available datasets and tools applicable to diverse urban scenarios. Finally, we discuss the limitations of current approaches and outline future directions for advancing LLMs in urban computing.

SkyReels-A2: Compose Anything in Video Diffusion Transformers

Zhengcong Fei,Debang Li,Di Qiu,Jiahua Wang,Yikun Dou,Rui Wang,Jingtao Xu,Mingyuan Fan,Guibin Chen,Yang Li,Yahui Zhou

Task: Present SkyReels-A2, a framework for elements-to-video (E2V) generation from text prompts and reference images.

Motivation: Address the challenges of preserving reference-element fidelity, scene coherence, and natural output in video generation.

Details

Method: Designs a data pipeline to construct training triplets, proposes an image-text joint embedding model, optimizes the inference pipeline, and creates the A2 Bench benchmark. Result: Experiments show that SkyReels-A2 generates diverse, high-quality videos and performs favorably against closed-source commercial models. Conclusion: SkyReels-A2 is the first open-source commercial-grade E2V model and is expected to advance creative applications of controllable video generation. Abstract: This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements (e.g., characters, objects, backgrounds) into synthesized videos based on textual prompts while maintaining strict consistency with reference images for each element. We term this task elements-to-video (E2V), whose primary challenges lie in preserving the fidelity of each reference element, ensuring coherent composition of the scene, and achieving natural outputs. To address these, we first design a comprehensive data pipeline to construct prompt-reference-video triplets for model training. Next, we propose a novel image-text joint embedding model to inject multi-element representations into the generative process, balancing element-specific consistency with global coherence and text alignment. We also optimize the inference pipeline for both speed and output stability. Moreover, we introduce a carefully curated benchmark for systematic evaluation, i.e., A2 Bench. Experiments demonstrate that our framework can generate diverse, high-quality videos with precise element control. SkyReels-A2 is the first open-source commercial-grade model for the generation of E2V, performing favorably against advanced closed-source commercial models. We anticipate SkyReels-A2 will advance creative applications such as drama and virtual e-commerce, pushing the boundaries of controllable video generation.

Self-Resource Allocation in Multi-Agent LLM Systems

Alfonso Amayuelas,Jingbo Yang,Saaket Agashe,Ashwin Nagarajan,Antonis Antoniades,Xin Eric Wang,William Wang

Task: Explore how LLMs can effectively allocate computational tasks in multi-agent systems.

Motivation: As LLMs develop into agents, multi-agent systems play a growing role in task assignment and coordination, which motivates studying how to optimize resource allocation for efficiency and performance.

Details

Method: Compares the effectiveness of LLMs acting as orchestrators and as planners in task assignment, and validates their performance experimentally. Result: LLMs achieve high validity and accuracy in resource allocation tasks; the planner method outperforms the orchestrator method in handling concurrent actions, and explicit information about worker capabilities improves allocation strategies. Conclusion: LLMs acting as planners allocate tasks more efficiently in multi-agent systems, and capability information further improves performance, particularly with suboptimal workers. Abstract: With the development of LLMs as agents, there is a growing interest in connecting multiple agents into multi-agent systems to solve tasks concurrently, focusing on their role in task assignment and coordination. This paper explores how LLMs can effectively allocate computational tasks among multiple agents, considering factors such as cost, efficiency, and performance. In this work, we address key questions, including the effectiveness of LLMs as orchestrators and planners, comparing their effectiveness in task assignment and coordination. Our experiments demonstrate that LLMs can achieve high validity and accuracy in resource allocation tasks. We find that the planner method outperforms the orchestrator method in handling concurrent actions, resulting in improved efficiency and better utilization of agents. Additionally, we show that providing explicit information about worker capabilities enhances the allocation strategies of planners, particularly when dealing with suboptimal workers.

MonoGS++: Fast and Accurate Monocular RGB Gaussian SLAM

Renwu Li,Wenjing Ke,Dong Li,Lu Tian,Emad Barsoum

Task: Present MonoGS++, a fast and accurate SLAM method based on 3D Gaussian representations that requires only RGB input.

Motivation: Reduce dependence on depth sensors by requiring only RGB input and generating sparse point clouds through online visual odometry.

Details

Method: Introduces dynamic 3D Gaussian insertion, a clarity-enhancing Gaussian densification module, and planar regularization. Result: Achieves precise camera tracking on the synthetic Replica and real-world TUM-RGBD datasets, comparable to the state of the art, with a 5.57x improvement in frames per second over MonoGS. Conclusion: MonoGS++ reduces hardware dependency while significantly improving speed and reconstruction quality. Abstract: We present MonoGS++, a novel fast and accurate Simultaneous Localization and Mapping (SLAM) method that leverages 3D Gaussian representations and operates solely on RGB inputs. While previous 3D Gaussian Splatting (GS)-based methods largely depended on depth sensors, our approach reduces the hardware dependency and only requires RGB input, leveraging online visual odometry (VO) to generate sparse point clouds in real-time. To reduce redundancy and enhance the quality of 3D scene reconstruction, we implemented a series of methodological enhancements in 3D Gaussian mapping. Firstly, we introduced dynamic 3D Gaussian insertion to avoid adding redundant Gaussians in previously well-reconstructed areas. Secondly, we introduced a clarity-enhancing Gaussian densification module and planar regularization to handle texture-less areas and flat surfaces better. We achieved precise camera tracking results both on the synthetic Replica and real-world TUM-RGBD datasets, comparable to those of the state-of-the-art. Additionally, our method realized a significant 5.57x improvement in frames per second (fps) over the previous state-of-the-art, MonoGS.

TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining

Jeffrey Li,Mohammadreza Armandpour,Iman Mirzadeh,Sachin Mehta,Vaishaal Shankar,Raviteja Vemulapalli,Samy Bengio,Oncel Tuzel,Mehrdad Farajtabar,Hadi Pouransari,Fartash Faghri

Task: Investigate evaluation strategies and update methods for large language models (LLMs) as new data becomes available.

Motivation: LLMs trained on historical web data inevitably become outdated, so effective ways of updating them on new data are needed.

Details

Method: Introduces a web-scale dataset derived from 114 dumps of Common Crawl (CC), designs time-stratified evaluations, and compares continual learning methods. Result: On general CC data, autoregressive meta-schedules combined with fixed-ratio replay of older data reach held-out loss comparable to re-training from scratch while requiring 2.6x less computation. Conclusion: The best balance between new and old data depends on the domain: replay is crucial to avoid forgetting on generic web data but less so on specific domains. Abstract: Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) - orders of magnitude larger than previous continual language modeling benchmarks. We also design time-stratified evaluations across both general CC data and specific domains (Wikipedia, StackExchange, and code documentation) to assess how well various continual learning methods adapt to new data while retaining past knowledge. Our findings demonstrate that, on general CC data, autoregressive meta-schedules combined with a fixed-ratio replay of older data can achieve comparable held-out loss to re-training from scratch, while requiring significantly less computation (2.6x). However, the optimal balance between incorporating new data and replaying old data differs as replay is crucial to avoid forgetting on generic web data but less so on specific domains.
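The idea of fixed-ratio replay can be illustrated with a small sketch. This is not the benchmark's implementation: the function name `mixed_batch`, the `replay_ratio` parameter, and sampling with replacement are all assumptions made for the example. It only shows how each training batch might reserve a fixed share of slots for older data.

```python
import random


def mixed_batch(new_data, old_data, batch_size, replay_ratio, rng):
    """Build one batch with a fixed fraction of replayed (older) examples.

    Hypothetical helper: replay_ratio is the share of the batch drawn from
    old_data; the remainder is drawn from new_data. Sampling is with
    replacement to keep the sketch short.
    """
    if not 0.0 <= replay_ratio <= 1.0:
        raise ValueError("replay_ratio must be in [0, 1]")
    n_old = round(batch_size * replay_ratio)
    n_new = batch_size - n_old
    batch = [rng.choice(old_data) for _ in range(n_old)]
    batch += [rng.choice(new_data) for _ in range(n_new)]
    rng.shuffle(batch)  # avoid a fixed old-then-new ordering inside the batch
    return batch


rng = random.Random(0)
batch = mixed_batch(["new"] * 10, ["old"] * 10, batch_size=8, replay_ratio=0.5, rng=rng)
```

With `replay_ratio=0.5` and `batch_size=8`, four slots are filled from the older data and four from the newest data; lowering the ratio would correspond to the domain-specific settings where the abstract reports replay matters less.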

HGFormer: Topology-Aware Vision Transformer with HyperGraph Learning

Hao Wang,Shuo Zhang,Biao Leng

Task: Present HGFormer, a topology-aware vision transformer that addresses the shortcomings of conventional vision transformers in modeling regional context and spatial topology.

Motivation: The permutation invariance and fully-connected interactions of conventional vision transformers disrupt regional context and spatial topology, departing from the principle of perceptual organization.

Details

Method: Proposes the CS-KNN algorithm for semantic guidance during hypergraph construction, together with a topology-aware HyperGraph Attention mechanism. Result: HGFormer is competitive on multiple visual benchmarks and yields distinct, detailed scene depictions. Conclusion: Through its topology-aware hypergraph mechanism, HGFormer improves the performance and expressiveness of vision transformers. Abstract: The computer vision community has witnessed an extensive exploration of vision transformers in the past two years. Drawing inspiration from traditional schemes, numerous works focus on introducing vision-specific inductive biases. However, the implicit modeling of permutation invariance and fully-connected interaction with individual tokens disrupts the regional context and spatial topology, further hindering higher-order modeling. This deviates from the principle of perceptual organization that emphasizes the local groups and overall topology of visual elements. Thus, we introduce the concept of hypergraph for perceptual exploration. Specifically, we propose a topology-aware vision transformer called HyperGraph Transformer (HGFormer). Firstly, we present a Center Sampling K-Nearest Neighbors (CS-KNN) algorithm for semantic guidance during hypergraph construction. Secondly, we present a topology-aware HyperGraph Attention (HGA) mechanism that integrates hypergraph topology as perceptual indications to guide the aggregation of global and unbiased information during hypergraph messaging. Using HGFormer as visual backbone, we develop an effective and unitive representation, achieving distinct and detailed scene depictions. Empirical experiments show that the proposed HGFormer achieves competitive performance compared to the recent SoTA counterparts on various visual benchmarks. Extensive ablation and visualization studies provide comprehensive explanations of our ideas and contributions.

Exploring LLM Reasoning Through Controlled Prompt Variations

Giannis Chatziveroglou,Richard Yun,Maura Kelleher

Task: Study the reasoning robustness of large language models (LLMs) on mathematical problem-solving tasks under systematic input perturbations.

Motivation: Assess how well current state-of-the-art models maintain logical consistency and correctness under different kinds of prompt perturbation, in order to expose potential vulnerabilities in real-world use.

Details

Method: Uses the GSM8K dataset as a controlled testbed and evaluates 13 open-source and closed-source LLMs under four categories of prompt perturbation. Result: Introducing irrelevant context significantly degrades performance, and the degradation is not strictly correlated with task complexity or model size; some perturbations inadvertently trigger chain-of-thought-like reasoning. Conclusion: Current LLMs struggle to separate essential from extraneous information and need greater robustness to noisy, misleading, and contextually dense inputs for reliable real-world use. Abstract: This study investigates the reasoning robustness of large language models (LLMs) on mathematical problem-solving tasks under systematically introduced input perturbations. Using the GSM8K dataset as a controlled testbed, we evaluate how well state-of-the-art models maintain logical consistency and correctness when confronted with four categories of prompt perturbations: irrelevant context, pathological instructions, factually relevant but non-essential context, and a combination of the latter two. Our experiments, conducted on thirteen open-source and closed-source LLMs, reveal that introducing irrelevant context within the model's context window significantly degrades performance, suggesting that distinguishing essential from extraneous details remains a pressing challenge. Surprisingly, performance regressions are relatively insensitive to the complexity of the reasoning task, as measured by the number of steps required, and are not strictly correlated with model size. Moreover, we observe that certain perturbations inadvertently trigger chain-of-thought-like reasoning behaviors, even without explicit prompting. Our findings highlight critical vulnerabilities in current LLMs and underscore the need for improved robustness against noisy, misleading, and contextually dense inputs, paving the way for more resilient and reliable reasoning in real-world applications.

ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer

Jiayi Gao,Zijin Yin,Changcheng Hua,Yuxin Peng,Kongming Liang,Zhanyu Ma,Jun Guo,Yang Liu

Task: Present ConMo, a zero-shot framework that improves the accuracy and diversity of motion transfer in multi-subject videos.

Motivation: Current methods struggle to transfer the motion of a specific subject in multi-subject videos and fail to preserve the diversity and accuracy of motion.

Details

Method: Disentangles and recomposes the motion cues of subjects and background, and uses soft guidance to control how much of the original motion is retained, enabling more precise motion control. Result: ConMo significantly outperforms existing methods in motion fidelity and semantic consistency. Conclusion: ConMo offers a more flexible and accurate solution for motion transfer in multi-subject videos. Abstract: The development of Text-to-Video (T2V) generation has made motion transfer possible, enabling the control of video motion based on existing footage. However, current methods have two limitations: 1) they struggle to handle multi-subject videos, failing to transfer specific subject motion; 2) they struggle to preserve the diversity and accuracy of motion when transferring to subjects with varying shapes. To overcome these, we introduce ConMo, a zero-shot framework that disentangles and recomposes the motions of subjects and camera movements. ConMo isolates individual subject and background motion cues from complex trajectories in source videos using only subject masks, and reassembles them for target video generation. This approach enables more accurate motion control across diverse subjects and improves performance in multi-subject scenarios. Additionally, we propose soft guidance in the recomposition stage which controls the retention of original motion to adjust shape constraints, aiding subject shape adaptation and semantic transformation. Unlike previous methods, ConMo unlocks a wide range of applications, including subject size and position editing, subject removal, semantic modifications, and camera motion simulation. Extensive experiments demonstrate that ConMo significantly outperforms state-of-the-art methods in motion fidelity and semantic consistency. The code is available at https://github.com/Andyplus1/ConMo.

Achieving Unanimous Consensus in Decision Making Using Multi-Agents

Apurba Pokharel,Ram Dantu,Shakila Zaman,Sirisha Talapuru,Vinh Quach

Task: Propose a deliberation-based consensus mechanism built on large language models (LLMs) for decision-making in blockchain networks.

Motivation: Traditional PoW and PoS consensus mechanisms adapt poorly to decisions where every participant's opinion matters, rather than a simple majority or weighted consensus.

Details

Method: Uses LLMs as rational agents and combines graded consensus with a multi-round deliberation process to reach unanimous consensus and graded confidence. Result: The blockchain properties of consistency, agreement, liveness, and determinism are shown to be maintained, and experiments demonstrate the feasibility and accuracy of the approach. Conclusion: The deliberation-based mechanism offers a new approach to decision-making on blockchain networks and addresses challenges such as degeneration of thoughts, hallucinations, and malicious models. Abstract: Blockchain consensus mechanisms have relied on algorithms such as Proof-of-Work (PoW) and Proof-of-Stake (PoS) to ensure network functionality and integrity. However, these approaches struggle with adaptability for decision-making where the opinions of each participant matter rather than reaching an agreement based on honest majority or weighted consensus. This paper introduces a novel deliberation-based consensus mechanism where Large Language Models (LLMs) act as rational agents engaging in structured discussions to reach a unanimous consensus. By leveraging graded consensus and a multi-round deliberation process, our approach ensures both unanimous consensus for definitive problems and graded confidence for prioritized decisions and policies. We provide a formalization of our system and use it to show that the properties of blockchains: consistency, agreement, liveness, and determinism are maintained. Moreover, experimental results demonstrate our system's feasibility, showcasing how our deliberation method's convergence, block properties, and accuracy enable decision-making on blockchain networks. We also address key challenges with this novel approach such as degeneration of thoughts, hallucinations, malicious models and nodes, resource consumption, and scalability.

Taylor Series-Inspired Local Structure Fitting Network for Few-shot Point Cloud Semantic Segmentation

Changshuo Wang,Shuting He,Xiang Fang,Meiqing Wu,Siew-Kei Lam,Prayag Tiwari

Task: Propose TaylorSeg, a pretraining-free local structure fitting network for few-shot point cloud semantic segmentation.

Motivation: Pretraining-based methods incur excessive time overhead and overlook the local structure representation of point clouds.

Details

Method: Inspired by the Taylor series, treats local structure representation of point clouds as a polynomial fitting problem, proposes the TaylorConv convolution, and builds two variants, TaylorSeg-NN and TaylorSeg-PN. Result: Under the 2-way 1-shot setting, TaylorSeg-PN improves mIoU by 2.28% on S3DIS and 4.37% on ScanNet. Conclusion: TaylorSeg performs strongly on few-shot point cloud semantic segmentation without requiring pretraining. Abstract: Few-shot point cloud semantic segmentation aims to accurately segment "unseen" new categories in point cloud scenes using limited labeled data. However, pretraining-based methods not only introduce excessive time overhead but also overlook the local structure representation among irregular point clouds. To address these issues, we propose a pretraining-free local structure fitting network for few-shot point cloud semantic segmentation, named TaylorSeg. Specifically, inspired by Taylor series, we treat the local structure representation of irregular point clouds as a polynomial fitting problem and propose a novel local structure fitting convolution, called TaylorConv. This convolution learns the low-order basic information and high-order refined information of point clouds from explicit encoding of local geometric structures. Then, using TaylorConv as the basic component, we construct two variants of TaylorSeg: a non-parametric TaylorSeg-NN and a parametric TaylorSeg-PN. The former can achieve performance comparable to existing parametric models without pretraining. For the latter, we equip it with an Adaptive Push-Pull (APP) module to mitigate the feature distribution differences between the query set and the support set. Extensive experiments validate the effectiveness of the proposed method. Notably, under the 2-way 1-shot setting, TaylorSeg-PN achieves improvements of +2.28% and +4.37% mIoU on the S3DIS and ScanNet datasets respectively, compared to the previous state-of-the-art methods. Our code is available at https://github.com/changshuowang/TaylorSeg.

Towards Interpretable Soft Prompts

Oam Patel,Jason Wang,Nikhil Shivakumar Nayak,Suraj Srinivas,Himabindu Lakkaraju

Task: Propose a new theoretical framework for evaluating the interpretability of trainable prompts.

Motivation: Soft prompts and similar methods improve task performance but remain hard to interpret, with no directly interpretable connection to prompting.

Details

Method: Builds a theoretical framework on two desiderata, faithfulness and scrutability, and designs new interpretability-oriented objective functions, tested on two prompt tuners, PEZ and RLPrompt. Result: Experiments reveal a fundamental trade-off between the interpretability and the task performance of trainable prompts, and expose odd behavior that arises when optimizing for an interpretability proxy. Conclusion: The study shows how hard the soft prompt interpretability problem is and points toward future interpretability-oriented prompting methods. Abstract: Soft prompts have been popularized as a cheap and easy way to improve task-specific LLM performance beyond few-shot prompts. Despite their origin as an automated prompting method, however, soft prompts and other trainable prompts remain a black-box method with no immediately interpretable connections to prompting. We create a novel theoretical framework for evaluating the interpretability of trainable prompts based on two desiderata: faithfulness and scrutability. We find that existing methods do not naturally satisfy our proposed interpretability criterion. Instead, our framework inspires a new direction of trainable prompting methods that explicitly optimizes for interpretability. To this end, we formulate and test new interpretability-oriented objective functions for two state-of-the-art prompt tuners: Hard Prompts Made Easy (PEZ) and RLPrompt. Our experiments with GPT-2 demonstrate a fundamental trade-off between interpretability and the task-performance of the trainable prompt, explicating the hardness of the soft prompt interpretability problem and revealing odd behavior that arises when one optimizes for an interpretability proxy.

CornerPoint3D: Look at the Nearest Corner Instead of the Center

Ruixiao Zhang,Runwei Guan,Xiangyu Chen,Adam Prugel-Bennett,Xiaohao Cai

Task: Study localization accuracy in cross-domain 3D object detection and propose new evaluation metrics and methods.

Motivation: LiDAR captures only the near side of objects, so center-based detectors localize poorly in cross-domain tasks, and existing evaluation metrics are prone to overfitting.

Details

Method: Proposes two new metrics that evaluate a model's ability to detect the surfaces of objects closest to the LiDAR sensor, and designs the EdgeHead refinement head and the CornerPoint3D detector. Result: The new methods outperform the traditional center-based detector in cross-domain tasks, balancing overall detection quality against near-side localization accuracy. Conclusion: The new metrics and detectors yield a more practical and robust solution for cross-domain 3D object detection. Abstract: 3D object detection aims to predict object centers, dimensions, and rotations from LiDAR point clouds. Despite its simplicity, LiDAR captures only the near side of objects, making center-based detectors prone to poor localization accuracy in cross-domain tasks with varying point distributions. Meanwhile, existing evaluation metrics designed for single-domain assessment also suffer from overfitting due to dataset-specific size variations. A key question arises: Do we really need models to maintain excellent performance in the entire 3D bounding boxes after being applied across domains? Actually, one of our main focuses is on preventing collisions between vehicles and other obstacles, especially in cross-domain scenarios where correctly predicting the sizes is much more difficult. To address these issues, we rethink cross-domain 3D object detection from a practical perspective. We propose two new metrics that evaluate a model's ability to detect objects' closer surfaces to the LiDAR sensor. Additionally, we introduce EdgeHead, a refinement head that guides models to focus more on learnable closer surfaces, significantly improving cross-domain performance under both our new and traditional BEV/3D metrics. Furthermore, we argue that predicting the nearest corner rather than the object center enhances robustness. We propose a novel 3D object detector, coined as CornerPoint3D, which is built upon CenterPoint and uses heatmaps to supervise the learning and detection of the nearest corner of each object. Our proposed methods realize a balanced trade-off between the detection quality of entire bounding boxes and the locating accuracy of closer surfaces to the LiDAR sensor, outperforming the traditional center-based detector CenterPoint in multiple cross-domain tasks and providing a more practically reasonable and robust cross-domain 3D object detection solution.

Neural Style Transfer for Synthesising a Dataset of Ancient Egyptian Hieroglyphs

Lewis Matheson Creed

Task: Propose a new method that uses Neural Style Transfer (NST) to generate a dataset of ancient Egyptian hieroglyphs.

Motivation: Training data for low-resource languages such as ancient Egyptian is scarce, which limits the use of machine learning techniques.

Details

Method: Generates a dataset of ancient Egyptian hieroglyphs by applying NST to a digital typeface. Result: Models trained on NST-generated examples and models trained on real photographs perform equally well on classification and transfer to unseen real images of hieroglyphs. Conclusion: NST is an effective way to address data scarcity for low-resource languages. Abstract: The limited availability of training data for low-resource languages makes applying machine learning techniques challenging. Ancient Egyptian is one such language with few resources. However, innovative applications of data augmentation methods, such as Neural Style Transfer, could overcome these barriers. This paper presents a novel method for generating datasets of ancient Egyptian hieroglyphs by applying NST to a digital typeface. Experimental results found that image classification models trained on NST-generated examples and photographs demonstrate equal performance and transferability to real unseen images of hieroglyphs.

Semantic segmentation of forest stands using deep learning

Håkon Næss Sandum,Hans Ole Ørka,Oliver Tomic,Erik Næsset,Terje Gobakken

Task: Propose a multiclass segmentation approach based on a U-Net deep learning framework to automate the delineation of forest stand borders.

Motivation: Manual interpretation is time-consuming and subjective, which limits operational efficiency and introduces inconsistencies, yet existing automation efforts have not displaced it.

Details

Method: Frames stand delineation as a multiclass segmentation task and applies a U-Net based deep learning framework, trained and evaluated on multispectral images, ALS data, and a stand map drawn by an expert interpreter. Result: The model reaches an overall accuracy of 0.73 on independent data, indicating the potential of deep learning for automated stand delineation. Conclusion: Deep learning shows promise for automated stand delineation, but challenges remain in complex forest environments. Abstract: Forest stands are the fundamental units in forest management inventories, silviculture, and financial analysis within operational forestry. Over the past two decades, a common method for mapping stand borders has involved delineation through manual interpretation of stereographic aerial images. This is a time-consuming and subjective process, limiting operational efficiency and introducing inconsistencies. Substantial effort has been devoted to automating the process, using various algorithms together with aerial images and canopy height models constructed from airborne laser scanning (ALS) data, but manual interpretation remains the preferred method. Deep learning (DL) methods have demonstrated great potential in computer vision, yet their application to forest stand delineation remains unexplored in published research. This study presents a novel approach, framing stand delineation as a multiclass segmentation problem and applying a U-Net based DL framework. The model was trained and evaluated using multispectral images, ALS data, and an existing stand map created by an expert interpreter. Performance was assessed on independent data using overall accuracy, a standard metric for classification tasks that measures the proportion of correctly classified pixels. The model achieved an overall accuracy of 0.73. These results demonstrate strong potential for DL in automated stand delineation. However, a few key challenges were noted, especially for complex forest environments.
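Overall accuracy, as described in the abstract, is the proportion of correctly classified pixels. A minimal sketch of that metric follows; the function name and the nested-list label maps are illustrative assumptions, not the study's evaluation code.

```python
def overall_accuracy(predicted, reference):
    """Fraction of pixels whose predicted class equals the reference class.

    Both arguments are 2D label maps (lists of rows) with identical shape.
    """
    total = 0
    correct = 0
    for pred_row, ref_row in zip(predicted, reference):
        if len(pred_row) != len(ref_row):
            raise ValueError("label maps must have the same shape")
        for p, r in zip(pred_row, ref_row):
            total += 1
            correct += int(p == r)
    if total == 0:
        raise ValueError("label maps are empty")
    return correct / total


# Toy 2x2 example: three of four pixels agree.
pred = [[0, 1], [2, 2]]
ref = [[0, 1], [1, 2]]
print(overall_accuracy(pred, ref))  # 0.75
```

Because every pixel counts equally, overall accuracy can hide poor performance on rare classes, so it is often read alongside per-class metrics.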

Jacy Reese Anthis,Ryan Liu,Sean M. Richardson,Austin C. Kozlowski,Bernard Koch,James Evans,Erik Brynjolfsson,Michael Bernstein

Task: Examine how addressing five tractable challenges can make large language model (LLM) simulations of human behavior accurate and verifiable.

Motivation: LLM simulations of human research subjects are a potential data source for understanding human behavior and training new AI systems, but results so far are limited and few social scientists have adopted them.

Details

Method: Based on a literature survey, analyzes empirical comparisons between LLMs and human research subjects, related commentaries, and related work, and identifies directions involving prompting, fine-tuning, and complementary methods. Result: LLM social simulations can already be used for exploratory research, such as pilot experiments in psychology, economics, sociology, and marketing, and wider use may become possible as LLM capabilities advance. Conclusion: Researchers should prioritize developing conceptual models and evaluations that keep pace with AI advances, so that LLM social simulations can be used more widely. Abstract: Accurate and verifiable large language model (LLM) simulations of human research subjects promise an accessible data source for understanding human behavior and training new AI systems. However, results to date have been limited, and few social scientists have adopted these methods. In this position paper, we argue that the promise of LLM social simulations can be achieved by addressing five tractable challenges. We ground our argument in a literature survey of empirical comparisons between LLMs and human research subjects, commentaries on the topic, and related work. We identify promising directions with prompting, fine-tuning, and complementary methods. We believe that LLM social simulations can already be used for exploratory research, such as pilot experiments for psychology, economics, sociology, and marketing. More widespread use may soon be possible with rapidly advancing LLM capabilities, and researchers should prioritize developing conceptual models and evaluations that can be iteratively deployed and refined at pace with ongoing AI advances.

MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities

Bizhu Wu,Jinheng Xie,Keming Shen,Zhe Kong,Jianfeng Ren,Ruibin Bai,Rong Qu,Linlin Shen

Task: Present MG-MotionLLM, a unified motion-language model for multi-granular motion comprehension and generation.

Motivation: Existing methods focus on coarse-grained motion-text modeling and cannot handle fine-grained motion tasks such as understanding and controlling the movements of specific body parts.

Details

Method: Introduces a multi-granularity training scheme with auxiliary tasks, including localizing the temporal boundaries of motion segments and detailed motion captioning, to support motion-text modeling at different levels of granularity. Result: MG-MotionLLM performs strongly on classical text-to-motion and motion-to-text tasks and shows potential on fine-grained motion comprehension and editing tasks. Conclusion: Multi-granular modeling substantially improves MG-MotionLLM's motion comprehension and generation. Abstract: Recent motion-aware large language models have demonstrated promising potential in unifying motion comprehension and generation. However, existing approaches primarily focus on coarse-grained motion-text modeling, where text describes the overall semantics of an entire motion sequence in just a few words. This limits their ability to handle fine-grained motion-relevant tasks, such as understanding and controlling the movements of specific body parts. To overcome this limitation, we pioneer MG-MotionLLM, a unified motion-language model for multi-granular motion comprehension and generation. We further introduce a comprehensive multi-granularity training scheme by incorporating a set of novel auxiliary tasks, such as localizing temporal boundaries of motion segments via detailed text as well as motion detailed captioning, to facilitate mutual reinforcement for motion-text modeling across various levels of granularity. Extensive experiments show that our MG-MotionLLM achieves superior performance on classical text-to-motion and motion-to-text tasks, and exhibits potential in novel fine-grained motion comprehension and editing tasks. Project page: CVI-SZU/MG-MotionLLM

Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data

Waris Gill,Justin Cechmanek,Tyler Hutcherson,Srijith Rajamohan,Jen Agarwal,Muhammad Ali Gulzar,Manvinder Singh,Benoit Dion

Task: Investigate how specialized, fine-tuned embedding models can make semantic caching more effective.

Motivation: Semantic caching relies on embedding similarity rather than exact key matching, which creates distinctive challenges in balancing precision, query latency, and computational efficiency.

Details

Method: Proposes small, domain-specific embedding models fine-tuned on real-world and synthetically generated datasets. Result: Compact embedding models fine-tuned on specialized datasets significantly surpass existing open-source and proprietary alternatives in precision and recall. Conclusion: The approach balances computational overhead and accuracy, giving a viable and efficient strategy for practical semantic caching. Abstract: This report investigates enhancing semantic caching effectiveness by employing specialized, fine-tuned embedding models. Semantic caching relies on embedding similarity rather than exact key matching, presenting unique challenges in balancing precision, query latency, and computational efficiency. We propose leveraging smaller, domain-specific embedding models, fine-tuned with targeted real-world and synthetically generated datasets. Our empirical evaluations demonstrate that compact embedding models fine-tuned for just one epoch on specialized datasets significantly surpass both state-of-the-art open-source and proprietary alternatives in precision and recall. Moreover, we introduce a novel synthetic data generation pipeline for the semantic cache that mitigates the challenge of limited domain-specific annotated data, further boosting embedding performance. Our approach effectively balances computational overhead and accuracy, establishing a viable and efficient strategy for practical semantic caching implementations.
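To make "embedding similarity rather than exact key matching" concrete, here is a minimal sketch of a semantic cache lookup. Everything in it is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for the fine-tuned embedding model the report describes, and the class name, linear scan, and 0.8 cosine threshold are hypothetical choices rather than the authors' design.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: word counts. A real cache would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Returns a cached response when a stored query is similar enough."""

    def __init__(self, embed=toy_embed, threshold=0.8):
        self.embed = embed
        self.threshold = threshold  # hypothetical similarity cut-off
        self.entries = []  # list of (embedding, response) pairs

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

    def get(self, query):
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        # Below the threshold counts as a miss: the caller should query the LLM.
        return best_response if best_score >= self.threshold else None


cache = SemanticCache()
cache.put("how do i reset my password", "Use the reset link.")
hit = cache.get("how do i reset my password please")  # near-duplicate wording
miss = cache.get("what is the weather today")         # unrelated query
```

The threshold is where the precision/recall balance in the abstract shows up: raising it reduces wrong cache hits at the cost of more misses, and a better embedding model makes that trade-off less sharp.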

Graph Attention-Driven Bayesian Deep Unrolling for Dual-Peak Single-Photon Lidar Imaging

Kyungmin Choi,JaKeoung Koo,Stephen McLaughlin,Abderrahim Halimi

Task: Propose a deep unrolling algorithm for dual-peak single-photon Lidar imaging.

Motivation: Address the difficulty of single-photon Lidar in noisy environments and multi-target scenes by combining the strengths of statistical methods and deep learning.

Details

Method: Uses a hierarchical Bayesian model and a deep unrolling neural network, together with a dual depth map representation and geometric deep learning. Result: Shows competitive performance against existing methods on synthetic and real data while also providing uncertainty information. Conclusion: The method combines the strengths of statistical and deep learning approaches, performing well in both accuracy and uncertainty quantification. Abstract: Single-photon Lidar imaging offers a significant advantage in 3D imaging due to its high resolution and long-range capabilities; however, it is challenging to apply in noisy environments with multiple targets per pixel. To tackle these challenges, several methods have been proposed. Statistical methods demonstrate interpretability on the inferred parameters, but they are often limited in their ability to handle complex scenes. Deep learning-based methods have shown superior performance in terms of accuracy and robustness, but they lack interpretability or they are limited to a single peak per pixel. In this paper, we propose a deep unrolling algorithm for dual-peak single-photon Lidar imaging. We introduce a hierarchical Bayesian model for multiple targets and propose a neural network that unrolls the underlying statistical method. To support multiple targets, we adopt a dual depth maps representation and exploit geometric deep learning to extract features from the point cloud. The proposed method takes advantage of both statistical methods and learning-based methods in terms of accuracy and quantifying uncertainty. The experimental results on synthetic and real data demonstrate the competitive performance when compared to existing methods, while also providing uncertainty information.

ZClip: Adaptive Spike Mitigation for LLM Pre-Training

Abhay Kumar,Louis Owen,Nilabhra Roy Chowdhury,Fabian Güra

Task: Propose ZClip, an adaptive gradient clipping algorithm that addresses gradient instability and loss spikes in large language model training.

Motivation: Traditional gradient clipping relies on fixed thresholds or heuristics, which fail to handle gradient instability and loss spikes effectively, leading to inefficient learning and frequent manual intervention.

Details

Method: ZClip dynamically adjusts the clipping threshold based on statistical properties of gradient norms, using z-score-based anomaly detection to identify and mitigate large gradient spikes. Result: ZClip prevents malignant loss spikes without interfering with convergence. Conclusion: ZClip is an adaptive gradient clipping method that makes no prior assumptions about gradient norms and improves the stability of large language model training. Abstract: Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, requiring costly checkpoint restoration and data batch skipping. Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively due to their reliance on fixed thresholds or heuristics, leading to inefficient learning and requiring frequent manual intervention. In this work, we propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time. Unlike prior reactive strategies, ZClip proactively adapts to training dynamics without making any prior assumptions on the scale and the temporal evolution of gradient norms. At its core, it leverages z-score-based anomaly detection to identify and mitigate large gradient spikes, preventing malignant loss spikes while not interfering with convergence otherwise. Our code is available at: https://github.com/bluorion-com/ZClip.
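The general idea of z-score-based clipping can be sketched in a few lines. This is an illustrative approximation rather than the ZClip implementation (see the linked repository for that): the class name, the exponential moving averages, the warm-up length, and the choice to clip a spike down to `mean + z_thresh * std` are all assumptions made here.

```python
import math


class ZScoreClipper:
    """Illustrative z-score-based gradient-norm clipper (not the authors' code)."""

    def __init__(self, z_thresh=2.5, alpha=0.97, warmup=5):
        self.z_thresh = z_thresh  # hypothetical z-score cut-off for a "spike"
        self.alpha = alpha        # hypothetical EMA decay for the running stats
        self.warmup = warmup      # norms collected before clipping starts
        self.history = []
        self.mean = None
        self.var = None

    def _update(self, norm):
        # Exponential moving estimates of the gradient-norm mean and variance.
        self.mean = self.alpha * self.mean + (1 - self.alpha) * norm
        self.var = self.alpha * self.var + (1 - self.alpha) * (norm - self.mean) ** 2

    def clip_scale(self, grad_norm):
        """Return the factor in (0, 1] by which the gradients should be scaled."""
        if self.mean is None:
            # Warm-up: gather a few norms to initialize the statistics.
            self.history.append(grad_norm)
            if len(self.history) >= self.warmup:
                n = len(self.history)
                self.mean = sum(self.history) / n
                self.var = sum((g - self.mean) ** 2 for g in self.history) / n
            return 1.0
        std = math.sqrt(self.var) + 1e-12
        z = (grad_norm - self.mean) / std
        if z > self.z_thresh:
            # Spike: shrink the norm to the threshold and update the running
            # statistics with the clipped value so the spike does not skew them.
            clipped = self.mean + self.z_thresh * std
            self._update(clipped)
            return clipped / grad_norm
        self._update(grad_norm)
        return 1.0


clipper = ZScoreClipper()
scales = [clipper.clip_scale(n) for n in [1.0, 1.1, 0.9, 1.0, 1.0, 10.0, 1.0]]
```

In a training loop, the returned factor would multiply every gradient before the optimizer step. In this run only the sixth norm (10.0) is scaled down; ordinary norms pass through unchanged, which is the "not interfering with convergence otherwise" behavior the abstract describes.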

Semiconductor Wafer Map Defect Classification with Tiny Vision Transformers

Faisal Mohammad,Duksan Ryu

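Task: Propose ViT-Tiny, a lightweight Vision Transformer framework for semiconductor wafer defect classification.

Motivation: Traditional CNN models struggle with class imbalance and with recognizing multiple overlapping defect types in wafer defect classification.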

Details

Method: Uses the ViT-Tiny framework trained on the WM-38k dataset; ablation studies identify 16 as the best patch size. Result: ViT-Tiny achieves an F1-score of 98.4%, surpassing MSF-Trans by 2.94% in four-defect classification, and improves recall by 2.86% in two-defect classification and precision by 3.13% in three-defect classification. Conclusion: ViT-Tiny is a computationally efficient and reliable solution for semiconductor defect detection and is especially robust when labeled data is limited. Abstract: Semiconductor wafer defect classification is critical for ensuring high precision and yield in manufacturing. Traditional CNN-based models often struggle with class imbalances and recognition of the multiple overlapping defect types in wafer maps. To address these challenges, we propose ViT-Tiny, a lightweight Vision Transformer (ViT) framework optimized for wafer defect classification. Trained on the WM-38k dataset, ViT-Tiny outperforms its ViT-Base counterpart and state-of-the-art (SOTA) models, such as MSF-Trans and CNN-based architectures. Through extensive ablation studies, we determine that a patch size of 16 provides optimal performance. ViT-Tiny achieves an F1-score of 98.4%, surpassing MSF-Trans by 2.94% in four-defect classification, improving recall by 2.86% in two-defect classification, and increasing precision by 3.13% in three-defect classification. Additionally, it demonstrates enhanced robustness under limited labeled data conditions, making it a computationally efficient and reliable solution for real-world semiconductor defect detection.

Reasoning Inconsistencies and How to Mitigate Them in Deep Learning

Erik Arakelyan

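Task: Propose new methods to detect, quantify, and mitigate reasoning inconsistencies in deep learning models.

Motivation: Although deep learning models have improved markedly in performance, the opacity of their internal reasoning leads to systematic inconsistencies or logical errors that can produce biased or unreliable outputs.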

Details

Method: Develops techniques for detecting reasoning inconsistencies in models that operate on knowledge graphs, natural language, and images, along with a data-efficient sampling method and a synthetic dataset generation approach. Result: Contributes two techniques for detecting and quantifying inconsistencies and two techniques for optimizing models on complex reasoning tasks, improving robustness, fairness, and interpretability. Conclusion: The resulting framework improves the robustness, fairness, and interpretability of deep learning models across diverse tasks and modalities. Abstract: The recent advancements in Deep Learning models and techniques have led to significant strides in performance across diverse tasks and modalities. However, while the overall capabilities of models show promising growth, our understanding of their internal reasoning processes remains limited, particularly concerning systematic inconsistencies or errors: patterns of logical or inferential flaws. These inconsistencies may manifest as contradictory outputs, failure to generalize across similar tasks, or erroneous conclusions in specific contexts. Even detecting and measuring such reasoning discrepancies is challenging, as they may arise from opaque internal procedures, biases and imbalances in training data, or the inherent complexity of the task. Without effective methods to detect, measure, and mitigate these errors, there is a risk of deploying models that are biased, exploitable, or logically unreliable. This thesis aims to address these issues by producing novel methods for deep learning models that reason over knowledge graphs, natural language, and images. The thesis contributes two techniques for detecting and quantifying predictive inconsistencies originating from opaque internal procedures in natural language and image processing models. To mitigate inconsistencies from biases in training data, this thesis presents a data-efficient sampling method to improve fairness and performance and a synthetic dataset generation approach in low-resource scenarios. Finally, the thesis offers two techniques to optimize the models for complex reasoning tasks. These methods enhance model performance while allowing for more faithful and interpretable exploration and exploitation during inference.
Critically, this thesis provides a comprehensive framework to improve the robustness, fairness, and interpretability of deep learning models across diverse tasks and modalities.

Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention

Jiuniu Wang,Wenjia Xu,Qingzhong Wang,Antoni B. Chan

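Task: Propose the Group-based Differential Distinctive Captioning Method to make image captions more distinctive.

Motivation: Existing image captioning models score well on conventional metrics, but their captions are weak at distinguishing the target image from similar images.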

Details

Method: Introduces a Group-based Differential Memory Attention (GDMA) module that visually compares the images within a group of similar images and highlights what is unique to each. Result: The method significantly improves the distinctiveness of baseline models and reaches state-of-the-art distinctiveness without excessively sacrificing accuracy. Conclusion: With the GDMA module and the new DisWordRate evaluation metric, the method effectively improves caption distinctiveness. Abstract: Recent advances in image captioning have focused on enhancing accuracy by substantially increasing the dataset and model size. While conventional captioning models exhibit high performance on established metrics such as BLEU, CIDEr, and SPICE, the capability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneers employed contrastive learning or re-weighted the ground-truth captions. However, these approaches often overlook the relationships among objects in a similar image group (e.g., items or properties within the same album or fine-grained events). In this paper, we introduce a novel approach to enhance the distinctiveness of image captions, namely Group-based Differential Distinctive Captioning Method, which visually compares each image with other images in one similar group and highlights the uniqueness of each image. In particular, we introduce a Group-based Differential Memory Attention (GDMA) module, designed to identify and emphasize object features in an image that are uniquely distinguishable within its image group, i.e., those exhibiting low similarity with objects in other images. This mechanism ensures that such unique object features are prioritized during caption generation for the image, thereby enhancing the distinctiveness of the resulting captions. To further refine this process, we select distinctive words from the ground-truth captions to guide both the language decoder and the GDMA module. Additionally, we propose a new evaluation metric, the Distinctive Word Rate (DisWordRate), to quantitatively assess caption distinctiveness.
Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models, and achieves state-of-the-art performance on distinctiveness while not excessively sacrificing accuracy...

Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

Yan Ma,Steffi Chern,Xuyang Shen,Yiran Zhong,Pengfei Liu

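Task: Propose a transparent, from-scratch reinforcement learning framework for vision-language models (VLMs) and validate its effectiveness.

Motivation: Existing applications of reinforcement learning to VLMs rely on complex frameworks and lack standardized evaluation protocols, which makes results hard to reproduce and compare.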

Details

Method: Designs a minimal yet functional four-step pipeline, validates it across multiple models and datasets, and proposes a standardized evaluation scheme. Result: Experiments show that response length is sensitive to random seeds, reflective behavior correlates with output length, and reinforcement learning generalizes better than supervised fine-tuning. Conclusion: The framework and findings aim to provide a reproducible baseline for RL-based VLM research and to encourage broader participation. Abstract: Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.

APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers

Zhuguanyu Wu,Jiayi Zhang,Jiaxin Chen,Jinyang Guo,Di Huang,Yunhong Wang

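Task: Propose APHQ-ViT, a post-training quantization method based on importance estimation with the Average Perturbation Hessian (APH), to address the accuracy drop of Vision Transformers (ViTs) under ultra-low-bit quantization.

Motivation: ViTs lose significant accuracy when quantized for deployment, especially under ultra-low-bit post-training quantization (PTQ), and existing methods do not solve this effectively.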

Details

Method: Proposes an improved average perturbation Hessian loss and an MLP Reconstruction (MR) method that replaces the GELU function with ReLU and reconstructs the MLP using the APH loss on a small unlabeled calibration set. Result: APHQ-ViT substantially outperforms existing PTQ methods at 3-bit and 4-bit quantization across different vision tasks. Conclusion: APHQ-ViT is an effective PTQ method that markedly improves ViT accuracy under ultra-low-bit quantization. Abstract: Vision Transformers (ViTs) have become one of the most commonly used backbones for vision tasks. Despite their remarkable performance, they often suffer significant accuracy drops when quantized for practical deployment, particularly by post-training quantization (PTQ) under ultra-low bits. Recently, reconstruction-based PTQ methods have shown promising performance in quantizing Convolutional Neural Networks (CNNs). However, they fail when applied to ViTs, primarily due to the inaccurate estimation of output importance and the substantial accuracy degradation in quantizing post-GELU activations. To address these issues, we propose APHQ-ViT, a novel PTQ approach based on importance estimation with Average Perturbation Hessian (APH). Specifically, we first thoroughly analyze the current approximation approaches with Hessian loss, and propose an improved average perturbation Hessian loss. To deal with the quantization of the post-GELU activations, we design an MLP Reconstruction (MR) method by replacing the GELU function in MLP with ReLU and reconstructing it by the APH loss on a small unlabeled calibration set. Extensive experiments demonstrate that APHQ-ViT using linear quantizers outperforms existing PTQ methods by substantial margins in 3-bit and 4-bit across different vision tasks. The source code is available at https://github.com/GoatWu/APHQ-ViT.

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Daoguang Zan,Zhirong Huang,Wei Liu,Hanwu Chen,Linhao Zhang,Shulin Xin,Lu Chen,Qi Liu,Xiaojian Zhong,Aoyan Li,Siyao Liu,Yongsheng Xiao,Liangqiang Chen,Yuyu Zhang,Jing Su,Tianyu Liu,Rui Long,Kai Shen,Liang Xiang

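Task: Build Multi-SWE-bench, a multilingual issue-resolving benchmark for evaluating large language models (LLMs) across different software ecosystems.

Motivation: Existing benchmarks such as SWE-bench focus mainly on Python and cannot fully evaluate LLMs across diverse software ecosystems.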

Details

Method: Expert annotators select 1,632 high-quality instances from 2,456 candidates, covering seven languages including Java and TypeScript, and three representative methods (Agentless, SWE-agent, and OpenHands) are evaluated. Result: Introduces the Multi-SWE-bench benchmark and uses it for a comprehensive analysis of several state-of-the-art models; also open-sources 4,723 structured instances and the full data production pipeline. Conclusion: Multi-SWE-bench and the Multi-SWE-RL community are intended to advance reinforcement learning (RL) and lay groundwork toward artificial general intelligence (AGI). Abstract: The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.

Towards Generalizing Temporal Action Segmentation to Unseen Views

Emad Bahrami,Olga Zatsarynna,Gianpiero Francesca,Juergen Gall

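Task: Propose a temporal action segmentation method that generalizes to unseen views.

Motivation: Existing methods perform poorly on action segmentation from unseen views, so the challenges caused by view changes need to be addressed.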

Details

Method: Uses a shared representation at the sequence and segment levels, together with a sequence loss and an action loss, to reduce the impact of view differences. Result: On the Assembly101, IkeaASM, and EgoExoLearn datasets, F1@50 improves significantly, by 12.8% for unseen exocentric views and by 54% for unseen egocentric views. Conclusion: The method effectively improves action segmentation performance on unseen views. Abstract: While there has been substantial progress in temporal action segmentation, the challenge to generalize to unseen views remains unaddressed. Hence, we define a protocol for unseen view action segmentation where camera views for evaluating the model are unavailable during training. This includes changing from top-frontal views to a side view or even more challenging from exocentric to egocentric views. Furthermore, we present an approach for temporal action segmentation that tackles this challenge. Our approach leverages a shared representation at both the sequence and segment levels to reduce the impact of view differences during training. We achieve this by introducing a sequence loss and an action loss, which together facilitate consistent video and action representations across different views. The evaluation on the Assembly101, IkeaASM, and EgoExoLearn datasets demonstrates significant improvements, with a 12.8% increase in F1@50 for unseen exocentric views and a substantial 54% improvement for unseen egocentric views.

Efficient Model Editing with Task-Localized Sparse Fine-tuning

Leonardo Iurada,Marco Ciccone,Tatiana Tommasi

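Task: Propose TaLoS, a method for building sparse task vectors that makes model editing more efficient and effective.

Motivation: Existing methods rely on network linearization to derive task vectors, which creates computational bottlenecks and does not guarantee weight disentanglement, hindering conflict-free composition of task vectors.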

Details

Method: Identifies the subset of parameters in a pre-trained model that has low gradient sensitivity and sparsely updates only those parameters to promote weight disentanglement. Result: TaLoS improves training and inference efficiency over existing methods and performs better on task addition and negation. Conclusion: By enabling modular parameter editing, TaLoS offers a practical route to deploying adaptable foundation models in real-world applications. Abstract: Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS, which allows building sparse task vectors with minimal interference without requiring explicit linearization or sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters promotes weight disentanglement during fine-tuning. Our experiments show that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
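
The sparse-update idea described for TaLoS can be sketched on a flat list of parameters. This is only an illustration under stated assumptions, not the authors' code: the function names, the use of a precomputed per-parameter sensitivity score, and the `keep_fraction` cutoff are hypothetical. The sketch selects the least sensitive parameters, updates only those, and forms the task vector as the difference between fine-tuned and pre-trained weights.

```python
def low_sensitivity_mask(sensitivities, keep_fraction):
    """Mark the fraction of parameters with the lowest sensitivity scores."""
    k = max(1, int(len(sensitivities) * keep_fraction))
    order = sorted(range(len(sensitivities)), key=lambda i: sensitivities[i])
    chosen = set(order[:k])
    return [i in chosen for i in range(len(sensitivities))]


def sparse_update(params, grads, mask, lr):
    """Apply a gradient step only where the mask is True."""
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]


def task_vector(pretrained, finetuned):
    """Task vector = fine-tuned weights minus pre-trained weights."""
    return [f - p for p, f in zip(pretrained, finetuned)]
```

Because masked-out parameters never move, the resulting task vector is zero at those positions, which is what makes it sparse.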

Exploration-Driven Generative Interactive Environments

Nedko Savov,Naser Kazemi,Mohammad Mahdi,Danda Pani Paudel,Xi Wang,Luc Van Gool

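Task: Propose a training framework for multi-environment world models that uses only a random agent in virtual environments.

Motivation: Reduce the dependence on expensive human demonstration data and simplify training.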

Details

Method: Proposes AutoExplore Agent, which explores based on the uncertainty of the world model, and groups data from environments with similar behavior. Result: The model adapts quickly to new environments, with improved video fidelity and controllability. Conclusion: Automatic exploration and grouped data enable efficient training of multi-environment world models. Abstract: Modern world models require costly and time-consuming collection of large video datasets with action demonstrations by people or by environment-specific agents. To simplify training, we focus on using many virtual environments for inexpensive, automatically collected interaction data. Genie, a recent multi-environment world model, demonstrates simulation abilities of many environments with shared behavior. Unfortunately, training their model requires expensive demonstrations. Therefore, we propose a training framework merely using a random agent in virtual environments. While the model trained in this manner exhibits good controls, it is limited by the random exploration possibilities. To address this limitation, we propose AutoExplore Agent - an exploration agent that entirely relies on the uncertainty of the world model, delivering diverse data from which it can learn the best. Our agent is fully independent of environment-specific rewards and thus adapts easily to new environments. With this approach, the pretrained multi-environment model can quickly adapt to new environments, achieving improved video fidelity and controllability. To automatically obtain large-scale interaction datasets for pretraining, we group environments with similar behavior and controls. To this end, we annotate the behavior and controls of 974 virtual environments - a dataset that we name RetroAct. For building our model, we first create an open implementation of Genie - GenieRedux and apply enhancements and adaptations in our version GenieRedux-G. Our code and data are available at https://github.com/insait-institute/GenieRedux.

Affordable AI Assistants with Knowledge Graph of Thoughts

Maciej Besta,Lorenzo Paleari,Jia Hao Andrea Jiang,Robert Gerstenberger,You Wu,Patrick Iff,Ales Kubicek,Piotr Nyczyk,Diana Khimey,Jón Gunnar Hannesson,Grzegorz Kwaśniewski,Marcin Copik,Hubert Niewiadomski,Torsten Hoefler

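Task: Propose Knowledge Graph of Thoughts (KGoT), a new AI assistant architecture that addresses the high cost and low success rates of current LLM-driven agents.

Motivation: Current state-of-the-art LLM-driven agents face high costs and low success rates on complex benchmarks such as GAIA.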

Details

Method: KGoT combines LLM reasoning with dynamically constructed knowledge graphs (KGs) and uses external tools (such as math solvers, web crawlers, and Python scripts) to iteratively enhance a structured representation of task-relevant knowledge. Result: KGoT raises the task success rate on the GAIA benchmark by 29% and cuts costs by more than 36x, with similar gains for other reasoning models such as Qwen2.5-32B and Deepseek-R1-70B. Conclusion: KGoT offers a scalable, affordable, and high-performing solution for AI assistants. Abstract: Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose the Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini, while reducing costs by over 36x compared to GPT-4o. Improvements for recent reasoning models are similar, e.g., 36% and 37.5% for Qwen2.5-32B and Deepseek-R1-70B, respectively. KGoT offers a scalable, affordable, and high-performing solution for AI assistants.

MultiNeRF: Multiple Watermark Embedding for Neural Radiance Fields

Yash Kulthe,Andrew Gilbert,John Collomosse

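Task: Propose MultiNeRF, a 3D watermarking method that embeds multiple uniquely keyed watermarks within images rendered by a single NeRF model while maintaining high visual quality.

Motivation: Existing NeRF watermarking methods handle only a single watermark; extending the TensoRF NeRF model with a dedicated watermark grid raises watermark capacity without interfering with scene content.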

Details

Method: Uses a FiLM-based conditional modulation mechanism that dynamically activates watermarks, so that multiple watermarks can be embedded and extracted without retraining the model. Result: Validated on the NeRF-Synthetic and LLFF datasets, with a significant gain in robust capacity and no loss in rendering quality. Conclusion: MultiNeRF provides a flexible, scalable multi-watermarking solution for 3D content. Abstract: We present MultiNeRF, a 3D watermarking method that embeds multiple uniquely keyed watermarks within images rendered by a single Neural Radiance Field (NeRF) model, whilst maintaining high visual quality. Our approach extends the TensoRF NeRF model by incorporating a dedicated watermark grid alongside the existing geometry and appearance grids. This extension ensures higher watermark capacity without entangling watermark signals with scene content. We propose a FiLM-based conditional modulation mechanism that dynamically activates watermarks based on input identifiers, allowing multiple independent watermarks to be embedded and extracted without requiring model retraining. MultiNeRF is validated on the NeRF-Synthetic and LLFF datasets, with statistically significant improvements in robust capacity without compromising rendering quality. By generalizing single-watermark NeRF methods into a flexible multi-watermarking framework, MultiNeRF provides a scalable solution for 3D content attribution.
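
FiLM (feature-wise linear modulation) scales and shifts each feature channel with parameters derived from a conditioning input. The sketch below shows that operation with a watermark identifier as the condition. It is an illustration only: the lookup table `FILM_TABLE` and its values are hypothetical stand-ins for parameters that a learned network would produce, and nothing here reflects how MultiNeRF wires FiLM into its watermark grid.

```python
# Hypothetical per-identifier (gamma, beta) pairs; in a trained model these
# would come from a learned conditioning network, not a fixed table.
FILM_TABLE = {
    0: ([1.0, 1.0], [0.0, 0.0]),   # identity modulation
    1: ([2.0, 0.5], [1.0, -1.0]),
}


def film(features, gamma, beta):
    """Feature-wise linear modulation: gamma * x + beta per channel."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def modulate(features, watermark_id):
    """Modulate the features with the parameters selected by the identifier."""
    gamma, beta = FILM_TABLE[watermark_id]
    return film(features, gamma, beta)
```

Switching the identifier changes only the scale and shift applied to the same features, which is why this kind of conditioning does not require retraining the shared backbone for each condition.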

A Framework for Situating Innovations, Opportunities, and Challenges in Advancing Vertical Systems with Large AI Models

Gaurav Verma,Jiawei Zhou,Mohit Chandra,Srijan Kumar,Munmun De Choudhury

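Task: Propose a framework that addresses the limitations of large AI models in real-world applications.

Motivation: Large AI models excel on standardized benchmarks, but in high-stakes real-world domains such as healthcare, education, and law they exhibit brittleness and contextually uninformed decisions, so cross-disciplinary innovation is needed to meet real-world needs.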

Details

Method: Uses a framework built on a layer-wise abstraction of innovations, illustrated with case studies, to show how large models can be turned into useful vertical systems. Result: The framework helps researchers and practitioners situate their innovations, uncover overlooked opportunities, and communicate across disciplines. Conclusion: The framework offers a way to address the problems of large AI models in real-world applications and to foster cross-disciplinary collaboration. Abstract: Large artificial intelligence (AI) models have garnered significant attention for their remarkable, often "superhuman", performance on standardized benchmarks. However, when these models are deployed in high-stakes verticals such as healthcare, education, and law, they often reveal notable limitations. For instance, they exhibit brittleness to minor variations in input data, present contextually uninformed decisions in critical settings, and undermine user trust by confidently producing or reproducing inaccuracies. These challenges in applying large models necessitate cross-disciplinary innovations to align the models' capabilities with the needs of real-world applications. We introduce a framework that addresses this gap through a layer-wise abstraction of innovations aimed at meeting users' requirements with large models. Through multiple case studies, we illustrate how researchers and practitioners across various fields can operationalize this framework. Beyond modularizing the pipeline of transforming large models into useful "vertical systems", we also highlight the dynamism that exists within different layers of the framework. Finally, we discuss how our framework can guide researchers and practitioners to (i) optimally situate their innovations (e.g., when vertical-specific insights can empower broadly impactful vertical-agnostic innovations), (ii) uncover overlooked opportunities (e.g., spotting recurring problems across verticals to develop practically useful foundation models instead of chasing benchmarks), and (iii) facilitate cross-disciplinary communication of critical challenges (e.g., enabling a shared vocabulary for AI developers, domain experts, and human-computer interaction scholars).

Data-Driven Object Tracking: Integrating Modular Neural Networks into a Kalman Framework

Christian Alexander Holz,Christian Bader,Markus Enzweiler,Matthias Drüppel

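Task: Propose three neural network models (SPENT, SANT, and MANTa) for multi-object tracking (MOT) that meet the complexity and precision demands of Advanced Driver Assistance Systems (ADAS).

Motivation: Multi-object tracking faces growing complexity and precision demands, particularly in ADAS applications.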

Details

Method: Integrates the three neural network models (SPENT, SANT, and MANTa) into a traditional Kalman Filter (KF) framework while keeping the system modular. Result: On the KITTI dataset, SPENT reduces RMSE by 50% compared to a standard KF, and SANT and MANTa reach up to 95% accuracy in object-to-track association. Conclusion: Task-specific neural networks can substantially improve the performance and robustness of traditional tracking systems while preserving modularity and maintainability. Abstract: This paper presents novel Machine Learning (ML) methodologies for Multi-Object Tracking (MOT), specifically designed to meet the increasing complexity and precision demands of Advanced Driver Assistance Systems (ADAS). We introduce three Neural Network (NN) models that address key challenges in MOT: (i) the Single-Prediction Network (SPENT) for trajectory prediction, (ii) the Single-Association Network (SANT) for mapping individual Sensor Object (SO) to existing tracks, and (iii) the Multi-Association Network (MANTa) for associating multiple SOs to multiple tracks. These models are seamlessly integrated into a traditional Kalman Filter (KF) framework, maintaining the system's modularity by replacing relevant components without disrupting the overall architecture. Importantly, all three networks are designed to be run in a real-time, embedded environment. Each network contains fewer than 50k trainable parameters. Our evaluation, conducted on the public KITTI tracking dataset, demonstrates significant improvements in tracking performance. SPENT reduces the Root Mean Square Error (RMSE) by 50% compared to a standard KF, while SANT and MANTa achieve up to 95% accuracy in sensor object-to-track assignments. These results underscore the effectiveness of incorporating task-specific NNs into traditional tracking systems, boosting performance and robustness while preserving modularity, maintainability, and interpretability.
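
The modularity claim, replacing one component of a Kalman filter without touching the rest, can be illustrated with a scalar filter whose prediction step is a pluggable function. This is a simplified sketch under assumptions, not the paper's architecture: `kalman_step` and `predict_fn` are hypothetical names, the state is one-dimensional, and the covariance prediction `p + q` assumes a unit state-transition gain, which a learned predictor would not guarantee.

```python
def kalman_step(x, p, z, predict_fn, q, r):
    """One predict/update cycle of a scalar Kalman filter.

    x, p       : previous state estimate and its variance
    z          : new measurement
    predict_fn : pluggable state predictor (stand-in for a learned model)
    q, r       : process and measurement noise variances
    """
    # Predict: the state prediction is delegated to the pluggable component.
    x_pred = predict_fn(x)
    p_pred = p + q  # assumes unit transition gain (illustrative simplification)
    # Update: the standard Kalman gain and correction are left unchanged.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new
```

Passing `lambda s: s` gives a plain random-walk filter; swapping in any other callable changes only the prediction, while the update equations stay as they are.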

Concept Lancet: Image Editing with Compositional Representation Transplant

Jinqi Luo,Tianjiao Ding,Kwan Ho Ryan Chan,Hancheng Min,Chris Callison-Burch,René Vidal

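Task: Propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for image editing with diffusion models.

Motivation: Existing editing methods that design edit directions in the text embedding or score space struggle to balance edit strength, which either harms visual consistency or causes the edit to fail.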

Details

Method: Decomposes the source input in the latent space as a sparse linear combination of visual concept representations, accurately estimates the presence of each concept, and performs a customized concept transplant according to the editing task (replace/add/remove). Result: Experiments show that methods equipped with CoLan achieve state-of-the-art editing effectiveness and consistency preservation. Conclusion: CoLan provides a principled approach to representation manipulation that significantly improves diffusion-based image editing. Abstract: Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace/add/remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation.

Charm: The Missing Piece in ViT fine-tuning for Image Aesthetic Assessment

Fatemeh Behrad,Tinne Tuytelaars,Johan Wagemans

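Task: Propose Charm, a new tokenization approach that improves the performance of Vision Transformers (ViTs) on variable-sized inputs.

Motivation: Because of computational complexity and batch processing constraints, ViTs usually handle only small, fixed-size images, which causes information loss and hurts tasks such as image aesthetic assessment.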

Details

Method: Charm preserves composition, high-resolution, aspect ratio, and multi-scale information; it keeps high-resolution detail in specific regions while downscaling others to produce a fixed-length input sequence. Result: Experiments show significant performance gains (up to 8.1%) on several image aesthetic and quality assessment datasets. Conclusion: Charm improves ViT performance and generalizability without changing the aspect ratio or cropping the image. Abstract: The capacity of Vision Transformers (ViTs) to handle variable-sized inputs is often constrained by computational complexity and batch processing limitations. Consequently, ViTs are typically trained on small, fixed-size images obtained through downscaling or cropping. While reducing computational burden, these methods result in significant information loss, negatively affecting tasks like image aesthetic assessment. We introduce Charm, a novel tokenization approach that preserves Composition, High-resolution, Aspect Ratio, and Multi-scale information simultaneously. Charm prioritizes high-resolution details in specific regions while downscaling others, enabling shorter fixed-size input sequences for ViTs while incorporating essential information. Charm is designed to be compatible with pre-trained ViTs and their learned positional embeddings. By providing multiscale input and introducing variety to input tokens, Charm improves ViT performance and generalizability for image aesthetic assessment. We avoid cropping or changing the aspect ratio to further preserve information. Extensive experiments demonstrate significant performance improvements on various image aesthetic and quality assessment datasets (up to 8.1%) using a lightweight ViT backbone. Code and pre-trained models are available at https://github.com/FBehrad/Charm.

SelfMedHPM: Self Pre-training With Hard Patches Mining Masked Autoencoders For Medical Image Segmentation

Yunhao Lv,Lingyu Chen,Jian Wang,Yangxi Li,Fang Chen

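Task: Propose selfMedHPM, a CT multi-organ segmentation method built on a masked image modeling (MIM) self-training framework that improves performance by mining hard-to-reconstruct regions.

Motivation: Existing MIM-based methods for CT multi-organ segmentation do not effectively identify the regions that are hardest to reconstruct, which limits performance.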

Details

Method: Performs ViT self-pretraining and introduces an auxiliary loss predictor that dynamically determines mask locations. Result: Outperforms existing methods on abdominal and whole-body CT multi-organ segmentation tasks. Conclusion: By mining hard-to-reconstruct regions, selfMedHPM significantly improves CT multi-organ segmentation performance. Abstract: In recent years, deep learning methods such as convolutional neural networks (CNNs) and transformers have made significant progress in CT multi-organ segmentation. However, CT multi-organ segmentation methods based on masked image modeling (MIM) are very limited. Although some methods already use MAE for CT multi-organ segmentation, we believe that they do not identify the most difficult areas to reconstruct. To this end, we propose a MIM self-training framework with hard patches mining masked autoencoders for CT multi-organ segmentation tasks (selfMedHPM). The method performs ViT self-pretraining on the training set of the target data and introduces an auxiliary loss predictor, which first predicts the patch loss and then determines the location of the next mask. SelfMedHPM outperforms various competitive methods in abdominal CT multi-organ segmentation and body CT multi-organ segmentation. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for abdominal multi-organ segmentation and the SinoMed Whole Body (SMWB) dataset for body multi-organ segmentation tasks.

Delineate Anything: Resolution-Agnostic Field Boundary Delineation on Satellite Imagery

Mykola Lavreniuk,Nataliia Kussul,Andrii Shelestov,Bohdan Yailymov,Yevhenii Salii,Volodymyr Kuzin,Zoltan Szantoi

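Task: Accurately delineate agricultural field boundaries in satellite imagery using instance segmentation.

Motivation: Existing methods are challenged by small datasets, resolution discrepancies, and diverse environmental conditions.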

Details

Method: Introduces the FBIS-22M dataset and the Delineate Anything model. Result: The model improves mAP@0.5 by 88.5% and mAP@0.5:0.95 by 103% over existing methods, with faster inference and zero-shot generalization. Conclusion: The FBIS-22M dataset and the Delineate Anything model significantly improve the accuracy and efficiency of field boundary delineation. Abstract: The accurate delineation of agricultural field boundaries from satellite imagery is vital for land management and crop monitoring. However, current methods face challenges due to limited dataset sizes, resolution discrepancies, and diverse environmental conditions. We address this by reformulating the task as instance segmentation and introducing the Field Boundary Instance Segmentation - 22M dataset (FBIS-22M), a large-scale, multi-resolution dataset comprising 672,909 high-resolution satellite image patches (ranging from 0.25 m to 10 m) and 22,926,427 instance masks of individual fields, significantly narrowing the gap between agricultural datasets and those in other computer vision domains. We further propose Delineate Anything, an instance segmentation model trained on our new FBIS-22M dataset. Our proposed model sets a new state-of-the-art, achieving a substantial improvement of 88.5% in mAP@0.5 and 103% in mAP@0.5:0.95 over existing methods, while also demonstrating significantly faster inference and strong zero-shot generalization across diverse image resolutions and unseen geographic regions. Code, pre-trained models, and the FBIS-22M dataset are available at https://lavreniuk.github.io/Delineate-Anything.

A Sensorimotor Vision Transformer

Konrad Gadzicki,Kerstin Schill,Christoph Zetzsche

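Task: Propose the Sensorimotor Transformer (SMT), a vision model inspired by human eye movements that prioritizes high-saliency regions to improve computational efficiency and reduce memory consumption.

Motivation: Traditional models process all image patches uniformly, whereas the human visual system optimizes information gathering through selective focus; SMT aims to emulate this mechanism.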

Details

Method: SMT selects salient regions based on intrinsic two-dimensional features (such as corners and occlusions) and processes only the most informative patches with a vision transformer architecture. Result: On ImageNet-1k, SMT achieves competitive top-1 accuracy while significantly reducing memory consumption and computational complexity. Conclusion: SMT offers an efficient approach to image analysis for resource-constrained applications and new insights into biologically motivated architectures. Abstract: This paper presents the Sensorimotor Transformer (SMT), a vision model inspired by human saccadic eye movements that prioritize high-saliency regions in visual input to enhance computational efficiency and reduce memory consumption. Unlike traditional models that process all image patches uniformly, SMT identifies and selects the most salient patches based on intrinsic two-dimensional (i2D) features, such as corners and occlusions, which are known to convey high-information content and align with human fixation patterns. The SMT architecture uses this biological principle to leverage vision transformers to process only the most informative patches, allowing for a substantial reduction in memory usage that scales with the sequence length of selected patches. This approach aligns with visual neuroscience findings, suggesting that the human visual system optimizes information gathering through selective, spatially dynamic focus. Experimental evaluations on ImageNet-1k demonstrate that SMT achieves competitive top-1 accuracy while significantly reducing memory consumption and computational complexity, particularly when a limited number of patches is used. This work introduces a saccade-like selection mechanism into transformer-based vision models, offering an efficient alternative for image analysis and providing new insights into biologically motivated architectures for resource-constrained applications.
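
The selection step described for SMT, keeping only the most salient patches before they reach the transformer, can be sketched as a top-k filter over patch scores. This is an assumption-laden illustration: pixel variance is used as a stand-in saliency score in place of the paper's i2D features, and the function names and the list-of-lists image format are hypothetical.

```python
def patch_variance(pixels):
    """Stand-in saliency score: variance of the pixel values in a patch."""
    mean = sum(pixels) / len(pixels)
    return sum((v - mean) ** 2 for v in pixels) / len(pixels)


def select_salient_patches(image, patch, k):
    """Return grid positions (row, col) of the k highest-scoring patches.

    image : 2D list of grayscale values
    patch : side length of each square patch
    k     : number of patches to keep
    """
    rows, cols = len(image), len(image[0])
    scored = []
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            pixels = [image[r + i][c + j]
                      for i in range(patch) for j in range(patch)]
            scored.append((patch_variance(pixels), (r // patch, c // patch)))
    # Keep the k best-scoring patches, then restore raster order.
    top = sorted(scored, key=lambda item: -item[0])[:k]
    return sorted(pos for _, pos in top)
```

Only the selected positions would be tokenized, so the sequence length, and with it the attention cost, is set by `k` rather than by the image size.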

Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation

Fa-Ting Hong,Zunnan Xu,Zixiang Zhou,Jun Zhou,Xiu Li,Qin Lin,Qinglin Lu,Dan Xu

Task: Proposes ACTalker, an end-to-end video diffusion framework for talking head video generation that supports both multi-signal and single-signal control.

Motivation: Existing methods usually accept control from only a single modality, which limits their practical use.

Details

Method: Designs a parallel mamba structure in which each branch uses a separate driving signal to control a specific facial region, and introduces a gate mechanism and a mask-drop strategy.

Result: Experiments show that the method generates natural-looking facial videos and that the mamba layer seamlessly integrates multiple driving modalities.

Conclusion: ACTalker performs well under multi-signal control and addresses the limitations of existing methods.

Abstract: Talking head synthesis is vital for virtual avatars and human-computer interaction. However, most existing methods are typically limited to accepting control from a single primary modality, restricting their practical utility. To this end, we introduce ACTalker, an end-to-end video diffusion framework that supports both multi-signal control and single-signal control for talking head video generation. For multi-signal control, we design a parallel mamba structure with multiple branches, each utilizing a separate driving signal to control specific facial regions. A gate mechanism is applied across all branches, providing flexible control over video generation. To ensure natural coordination of the controlled video both temporally and spatially, we employ the mamba structure, which enables driving signals to manipulate feature tokens across both dimensions in each branch. Additionally, we introduce a mask-drop strategy that allows each driving signal to independently control its corresponding facial region within the mamba structure, preventing control conflicts. Experimental results demonstrate that our method produces natural-looking facial videos driven by diverse signals and that the mamba layer seamlessly integrates multiple driving modalities without conflict.

MAD: Makeup All-in-One with Cross-Domain Diffusion Model

Bo-Kai Ruan,Hong-Han Shuai

Task: Handles multiple makeup tasks with a single model, including beauty filter, makeup transfer, and makeup removal.

Motivation: Existing methods need multiple models for different tasks and lack text-guided makeup try-on, which adds complexity and inconvenience for users.

Details

Method: Formulates the different makeup tasks as cross-domain translations and uses a cross-domain diffusion model, with different domain embeddings providing domain control.

Result: A single-model approach that reduces reliance on additional modules, together with the MT-Text dataset to support text-to-makeup applications.

Conclusion: The method simplifies the implementation of makeup tasks and improves practicality and user-friendliness.

Abstract: Existing makeup techniques often require designing multiple models to handle different inputs and align features across domains for different makeup tasks, e.g., beauty filter, makeup transfer, and makeup removal, leading to increased complexity. Another limitation is the absence of text-guided makeup try-on, which is more user-friendly without needing reference images. In this study, we make the first attempt to use a single model for various makeup tasks. Specifically, we formulate different makeup tasks as cross-domain translations and leverage a cross-domain diffusion model to accomplish all tasks. Unlike existing methods that rely on separate encoder-decoder configurations or cycle-based mechanisms, we propose using different domain embeddings to facilitate domain control. This allows for seamless domain switching by merely changing embeddings with a single model, thereby reducing the reliance on additional modules for different tasks. Moreover, to support precise text-to-makeup applications, we introduce the MT-Text dataset by extending the MT dataset with textual annotations, advancing the practicality of makeup technologies.

Noise Calibration and Spatial-Frequency Interactive Network for STEM Image Enhancement

Hesong Li,Ziqi Wu,Ruiwen Shao,Tao Zhang,Ying Fu

Task: Develops noise calibration, data synthesis, and enhancement methods to improve STEM image quality.

Motivation: Existing STEM image enhancement methods overlook frequency-domain features, and existing datasets lack realism and generality.

Details

Method: Proposes a noise calibration method to synthesize more realistic STEM images, builds a general dataset, and designs a spatial-frequency interactive network for image enhancement.

Result: Experiments show that the synthesized data is closer to real STEM images and that the network achieves better enhancement.

Conclusion: The proposed methods perform well on STEM image enhancement; the code is to be released publicly.

Abstract: Scanning Transmission Electron Microscopy (STEM) enables the observation of atomic arrangements at sub-angstrom resolution, allowing for atomically resolved analysis of the physical and chemical properties of materials. However, due to the effects of noise, electron beam damage, sample thickness, etc., obtaining satisfactory atomic-level images is often challenging. Enhancing STEM images can reveal clearer structural details of materials. Nonetheless, existing STEM image enhancement methods usually overlook unique features in the frequency domain, and existing datasets lack realism and generality. To resolve these issues, in this paper, we develop noise calibration, data synthesis, and enhancement methods for STEM images. We first present a STEM noise calibration method, which is used to synthesize more realistic STEM images. The parameters of background noise, scan noise, and pointwise noise are obtained by statistical analysis and fitting of real STEM images containing atoms. Then we use these parameters to develop a more general dataset that considers both regular and random atomic arrangements and includes both HAADF and BF mode images. Finally, we design a spatial-frequency interactive network for STEM image enhancement, which can explore the information in the frequency domain formed by the periodicity of atomic arrangement. Experimental results show that our data is closer to real STEM images and achieves better enhancement performance together with our network. Code will be available at https://github.com/HeasonLee/SFIN.

Rip Current Segmentation: A Novel Benchmark and YOLOv8 Baseline Results

Andrei Dumitriu,Florin Tatui,Florin Miron,Radu Tudor Ionescu,Radu Timofte

Task: Introduces a new task, rip current instance segmentation, and builds the associated datasets.

Motivation: Rip currents are the leading cause of fatal accidents and injuries on many beaches worldwide, so automatically detecting these hazardous currents is important.

Details

Method: Trains different versions of YOLOv8 for instance segmentation and evaluates them on static image and video datasets.

Result: The YOLOv8-nano model performs best, with an mAP50 of 88.94% on the validation set and a macro average of 81.21% on the test set.

Conclusion: The work provides a baseline for rip current segmentation research and contributes a detailed annotated dataset and a deep learning model; the code and dataset are publicly available.

Abstract: Rip currents are the leading cause of fatal accidents and injuries on many beaches worldwide, emphasizing the importance of automatically detecting these hazardous surface water currents. In this paper, we address a novel task: rip current instance segmentation. We introduce a comprehensive dataset containing 2,466 images with newly created polygonal annotations for instance segmentation, used for training and validation. Additionally, we present a novel dataset comprising 17 drone videos (about 24K frames) captured at 30 FPS, annotated with both polygons for instance segmentation and bounding boxes for object detection, employed for testing purposes. We train various versions of YOLOv8 for instance segmentation on static images and assess their performance on the test dataset (videos). The best results were achieved by the YOLOv8-nano model (runnable on a portable device), with an mAP50 of 88.94% on the validation dataset and 81.21% macro average on the test dataset. The results provide a baseline for future research in rip current segmentation. Our work contributes to the existing literature by introducing a detailed, annotated dataset, and training a deep learning model for instance segmentation of rip currents. The code, training details and the annotated dataset are made publicly available at https://github.com/Irikos/rip_currents.

L-LBVC: Long-Term Motion Estimation and Prediction for Learned Bi-Directional Video Compression

Yongqi Zhai,Luyang Tang,Wei Jiang,Jiayu Yang,Ronggang Wang

Task: Proposes a new LBVC framework (L-LBVC) to address the weak long-term motion estimation and prediction of learned bi-directional video compression (LBVC).

Motivation: LBVC handles long-term motion estimation and prediction poorly, especially in large-motion scenes, so its performance lags behind traditional bi-directional coding.

Details

Method: Proposes an adaptive motion estimation module that handles both short-term and long-term motion, estimating long-term flows by recursively accumulating local flows, and an adaptive motion prediction module that reduces the bit cost of motion coding, with reference frames adaptively downsampled at test time.

Result: Under random access configuration, L-LBVC significantly outperforms previous LVC methods and even surpasses VVC (VTM) on some test datasets.

Conclusion: By improving long-term motion estimation and prediction, L-LBVC significantly improves LBVC performance and narrows the gap with traditional methods.

Abstract: Recently, learned video compression (LVC) has shown superior performance under low-delay configuration. However, the performance of learned bi-directional video compression (LBVC) still lags behind traditional bi-directional coding. The performance gap mainly arises from inaccurate long-term motion estimation and prediction of distant frames, especially in large motion scenes. To solve these two critical problems, this paper proposes a novel LBVC framework, namely L-LBVC. Firstly, we propose an adaptive motion estimation module that can handle both short-term and long-term motions. Specifically, we directly estimate the optical flows for adjacent frames and non-adjacent frames with small motions. For non-adjacent frames with large motions, we recursively accumulate local flows between adjacent frames to estimate long-term flows. Secondly, we propose an adaptive motion prediction module that can largely reduce the bit cost for motion coding. To improve the accuracy of long-term motion prediction, we adaptively downsample reference frames during testing to match the motion ranges observed during training. Experiments show that our L-LBVC significantly outperforms previous state-of-the-art LVC methods and even surpasses VVC (VTM) on some test datasets under random access configuration.
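The phrase "recursively accumulate local flows between adjacent frames" can be made concrete with the usual flow-composition rule: the flow from frame a to frame c at pixel x is the flow a→b at x plus the flow b→c sampled at the displaced position. The sketch below is an illustration only. The abstract does not state the exact accumulation rule, so the composition formula, the nearest-neighbour sampling with border clamping, and all function names are assumptions.

```python
def sample_flow(flow, x, y):
    """Nearest-neighbour lookup of a flow vector, clamped to the image border."""
    h, w = len(flow), len(flow[0])
    xi = min(max(int(round(x)), 0), w - 1)
    yi = min(max(int(round(y)), 0), h - 1)
    return flow[yi][xi]


def compose(flow_ab, flow_bc):
    """Compose two flows: F_ac(x) = F_ab(x) + F_bc(x + F_ab(x))."""
    h, w = len(flow_ab), len(flow_ab[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dx1, dy1 = flow_ab[y][x]
            dx2, dy2 = sample_flow(flow_bc, x + dx1, y + dy1)
            row.append((dx1 + dx2, dy1 + dy2))
        out.append(row)
    return out


def accumulate_flows(local_flows):
    """Chain adjacent-frame flows [F_01, F_12, ...] into one long-term flow."""
    total = local_flows[0]
    for nxt in local_flows[1:]:
        total = compose(total, nxt)
    return total


# Toy example: every pixel moves one pixel to the right between adjacent frames.
step = [[(1, 0)] * 4 for _ in range(2)]
long_term = accumulate_flows([step, step, step])
print(long_term[0][0])
```

Each adjacent-frame flow stays within a small motion range, which is where a flow estimator tends to be reliable; the large displacement only appears after composition.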

Leveraging Sparse Annotations for Leukemia Diagnosis on the Large Leukemia Dataset

Abdul Rehman,Talha Meraj,Aiman Mahmood Minhas,Ayisha Imran,Mohsen Ali,Waqas Sultani,Mubarak Shah

Task: Presents a large-scale leukemia dataset (LLD) and new methods for detecting white blood cells and their attributes.

Motivation: Existing leukemia analysis datasets are small and lack diversity, which limits real-world applicability.

Details

Method: Collects data from multiple sources, annotates 7 morphological attributes, and proposes a multi-task model and a sparse-annotation method.

Result: Provides a large-scale, diverse dataset and an efficient method for white blood cell detection with attribute analysis.

Conclusion: The LLD dataset and methods can improve the explainability and accuracy of leukemia diagnosis and apply to a range of microscopic image analysis challenges.

Abstract: Leukemia is the 10th most frequently diagnosed cancer and one of the leading causes of cancer-related deaths worldwide. Realistic analysis of leukemia requires White Blood Cell (WBC) localization, classification, and morphological assessment. Despite deep learning advances in medical imaging, leukemia analysis lacks a large, diverse multi-task dataset, while existing small datasets lack domain diversity, limiting real-world applicability. To overcome dataset challenges, we present a large-scale WBC dataset named Large Leukemia Dataset (LLD) and novel methods for detecting WBCs with their attributes. Our contribution here is threefold. First, we present a large-scale leukemia dataset collected through Peripheral Blood Films (PBF) from several patients, through multiple microscopes, multiple cameras, and multiple magnifications. To enhance diagnosis explainability and medical expert acceptance, each leukemia cell is annotated at 100x with 7 morphological attributes, ranging from Cell Size to Nuclear Shape. Second, we propose a multi-task model that not only detects WBCs but also predicts their attributes, providing an interpretable and clinically meaningful solution. Third, we propose a method for WBC detection with attribute analysis using sparse annotations. This approach reduces the annotation burden on hematologists, requiring them to mark only a small area within the field of view. Our method enables the model to leverage the entire field of view rather than just the annotated regions, enhancing learning efficiency and diagnostic accuracy. From diagnosis explainability to overcoming domain shift challenges, the presented datasets could be used for many challenging aspects of microscopic image analysis. The datasets, code, and demo are available at: https://im.itu.edu.pk/sparse-leukemiaattri/

Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation

Jiwoo Chung,Sangeek Hyun,Hyunjun Kim,Eunseo Koh,MinKyu Lee,Jae-Pil Heo

Task: Proposes a subject-driven generation method based on visual autoregressive (VAR) models, addressing the computational overhead of diffusion models as well as the language drift and reduced diversity that arise when fine-tuning VAR.

Motivation: Diffusion models produce high-quality images, but their computational overhead limits practical use; VAR models offer fast inference, but naive fine-tuning leads to computational overhead, language drift, and reduced diversity.

Details

Method: Introduces selective layer tuning to reduce complexity, prior distillation to mitigate language drift, and scale-wise weighted tuning to prioritize coarser resolutions.

Result: Experiments show that the method significantly outperforms diffusion-based baselines across various metrics and demonstrate its practical use.

Conclusion: The proposed VAR-based approach is efficient and practical for subject-driven generation and addresses limitations of diffusion models.

Abstract: Recent advances in text-to-image generative models have enabled numerous practical applications, including subject-driven generation, which fine-tunes pretrained models to capture subject semantics from only a few examples. While diffusion-based models produce high-quality images, their extensive denoising steps result in significant computational overhead, limiting real-world applicability. Visual autoregressive (VAR) models, which predict next-scale tokens rather than spatially adjacent ones, offer significantly faster inference suitable for practical deployment. In this paper, we propose the first VAR-based approach for subject-driven generation. However, naïve fine-tuning of VAR leads to computational overhead, language drift, and reduced diversity. To address these challenges, we introduce selective layer tuning to reduce complexity and prior distillation to mitigate language drift. Additionally, we found that the early stages have a greater influence on the generation of the subject than the later stages, which merely synthesize local details. Based on this finding, we propose scale-wise weighted tuning, which prioritizes coarser resolutions to encourage the model to focus on subject-relevant information instead of local details. Extensive experiments validate that our method significantly outperforms diffusion-based baselines across various metrics and demonstrates its practical usage.

PicoPose: Progressive Pixel-to-Pixel Correspondence Learning for Novel Object Pose Estimation

Lihua Liu,Jiehong Lin,Zhenxin Liu,Kui Jia

Task: Estimates the 6D pose of novel objects from RGB images with zero-shot generalization.

Motivation: Addresses estimation of the 6D transformation between an RGB observation and the CAD model of an object that was not seen during training.

Details

Method: Proposes the PicoPose framework, a three-stage pixel-to-pixel correspondence learning process: feature matching, global 2D affine transformation regression, and local correspondence offset learning.

Result: Achieves state-of-the-art performance on the seven core datasets of the BOP benchmark and generalizes well to novel objects represented by CAD models or reference images.

Conclusion: By progressively refining correspondences, PicoPose significantly improves pose estimation accuracy and suits zero-shot settings.

Abstract: Novel object pose estimation from RGB images presents a significant challenge for zero-shot generalization, as it involves estimating the relative 6D transformation between an RGB observation and a CAD model of an object that was not seen during training. In this paper, we introduce PicoPose, a novel framework designed to tackle this task using a three-stage pixel-to-pixel correspondence learning process. Firstly, PicoPose matches features from the RGB observation with those from rendered object templates, identifying the best-matched template and establishing coarse correspondences. Secondly, PicoPose smooths the correspondences by globally regressing a 2D affine transformation, including in-plane rotation, scale, and 2D translation, from the coarse correspondence map. Thirdly, PicoPose applies the affine transformation to the feature map of the best-matched template and learns correspondence offsets within local regions to achieve fine-grained correspondences. By progressively refining the correspondences, PicoPose significantly improves the accuracy of object poses computed via PnP/RANSAC. PicoPose achieves state-of-the-art performance on the seven core datasets of the BOP benchmark, demonstrating exceptional generalization to novel objects represented by CAD models or object reference images. Code and models are available at https://github.com/foollh/PicoPose.
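The second PicoPose stage regresses a 2D affine transformation made up of in-plane rotation, scale, and 2D translation. The sketch below shows only what such a four-parameter transform does to pixel coordinates once its parameters are known; it does not reproduce how the network regresses them. The function names and the parameterization (angle in radians, isotropic scale) are assumptions made for illustration.

```python
import math


def similarity_matrix(theta, scale, tx, ty):
    """Build a 2x3 matrix for rotation by theta (radians), isotropic scale,
    and translation (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[scale * c, -scale * s, tx],
            [scale * s,  scale * c, ty]]


def apply_affine(matrix, points):
    """Map each (x, y) point through the 2x3 affine matrix."""
    (a, b, tx), (c, d, ty) = matrix
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


# Rotate by 90 degrees, double the scale, then shift by (1, 1).
m = similarity_matrix(math.pi / 2, 2.0, 1.0, 1.0)
template_pixels = [(1.0, 0.0), (0.0, 1.0)]
print(apply_affine(m, template_pixels))
```

Fitting one global transform of this kind to the coarse correspondence map constrains all correspondences jointly, which is why the abstract describes this stage as smoothing them before local offsets are learned.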

Learning Phase Distortion with Selective State Space Models for Video Turbulence Mitigation

Xingguang Zhang,Nicholas Chimitt,Xijun Wang,Yu Yuan,Stanley H. Chan

Task: Proposes a turbulence mitigation method based on a selective state space model (MambaTM) and learned latent phase distortion (LPD).

Motivation: Existing deep learning methods for turbulence mitigation are computationally expensive and generalize poorly, and existing approaches have limitations in both the spatial and temporal domains.

Details

Method: Combines MambaTM, which provides a global receptive field at linear computational complexity, with LPD, an improved phase distortion representation that reduces the ill-posedness of the problem.

Result: Outperforms existing methods on synthetic and real-world turbulence mitigation benchmarks, with faster inference.

Conclusion: Combining MambaTM and LPD significantly improves the performance and efficiency of turbulence mitigation.

Abstract: Atmospheric turbulence is a major source of image degradation in long-range imaging systems. Although numerous deep learning-based turbulence mitigation (TM) methods have been proposed, many are slow, memory-hungry, and do not generalize well. In the spatial domain, methods based on convolutional operators have a limited receptive field, so they cannot handle a large spatial dependency required by turbulence. In the temporal domain, methods relying on self-attention can, in theory, leverage the lucky effects of turbulence, but their quadratic complexity makes it difficult to scale to many frames. Traditional recurrent aggregation methods face parallelization challenges. In this paper, we present a new TM method based on two concepts: (1) A turbulence mitigation network based on the Selective State Space Model (MambaTM). MambaTM provides a global receptive field in each layer across spatial and temporal dimensions while maintaining linear computational complexity. (2) Learned Latent Phase Distortion (LPD). LPD guides the state space model. Unlike classical Zernike-based representations of phase distortion, the new LPD map uniquely captures the actual effects of turbulence, significantly improving the model's capability to estimate degradation by reducing the ill-posedness. Our proposed method exceeds current state-of-the-art networks on various synthetic and real-world TM benchmarks with significantly faster inference speed. The code is available at http://github.com/xg416/MambaTM.

HQViT: Hybrid Quantum Vision Transformer for Image Classification

Hui Zhang,Qinglin Zhao,Mengchu Zhou,Li Feng

Task: Proposes a Hybrid Quantum Vision Transformer (HQViT) that uses quantum computing to accelerate model training and improve performance.

Motivation: Addresses the high training cost of conventional Transformers on vision tasks, which stems from the high computational complexity of self-attention.

Details

Method: Uses quantum computation for the most critical steps and classical methods for the rest, reducing quantum resource requirements, while amplitude encoding preserves global image information.

Result: HQViT performs well on several computer vision datasets, with an improvement of up to 10.9% (on the MNIST task), and significantly reduces the classical computational load.

Conclusion: Demonstrates the potential of combining quantum and classical computing for complex image classification tasks.

Abstract: Transformer-based architectures have revolutionized the landscape of deep learning. In the computer vision domain, Vision Transformer demonstrates remarkable performance on par with or even surpassing that of convolutional neural networks. However, the quadratic computational complexity of its self-attention mechanism poses challenges for classical computing, making model training with high-dimensional input data, e.g., images, particularly expensive. To address such limitations, we propose a Hybrid Quantum Vision Transformer (HQViT) that leverages the principles of quantum computing to accelerate model training while enhancing model performance. HQViT introduces whole-image processing with amplitude encoding to better preserve global image information without additional positional encoding. By leveraging quantum computation on the most critical steps and selectively handling other components in a classical way, we lower the cost of quantum resources for HQViT. The qubit requirement is minimized to $O(\log_2 N)$ and the number of parameterized quantum gates is only $O(\log_2 d)$, making it well-suited for Noisy Intermediate-Scale Quantum devices. By offloading the computationally intensive attention coefficient matrix calculation to the quantum framework, HQViT reduces the classical computational load by $O(T^2 d)$. Extensive experiments across various computer vision datasets demonstrate that HQViT outperforms existing models, achieving a maximum improvement of up to 10.9% (on the MNIST 10-classification task) over the state of the art. This work highlights the great potential to combine quantum and classical computing to cope with complex image classification tasks.
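The logarithmic qubit count in the abstract follows from how amplitude encoding works in general: a vector of N values is stored in the amplitudes of a quantum state, and n qubits carry 2^n amplitudes, so about log2(N) qubits suffice. The classical sketch below only prepares the normalized amplitude vector and counts qubits; it is not HQViT's circuit. The function name and the zero-padding to the next power of two are assumptions made for illustration.

```python
import math


def amplitude_encode(values):
    """Return (amplitudes, n_qubits) for a real-valued vector.

    The vector is zero-padded to the next power of two and scaled to unit
    L2 norm, as required for a valid quantum state.
    """
    n = len(values)
    n_qubits = max(1, (n - 1).bit_length())  # smallest q with 2**q >= n
    dim = 2 ** n_qubits
    padded = [float(v) for v in values] + [0.0] * (dim - n)
    norm = math.sqrt(sum(v * v for v in padded))
    if norm == 0.0:
        raise ValueError("cannot encode an all-zero vector")
    return [v / norm for v in padded], n_qubits


# A flattened 28x28 image has 784 pixels and fits in 10 qubits (2**10 = 1024).
pixels = [1.0] * (28 * 28)
amplitudes, n_qubits = amplitude_encode(pixels)
print(n_qubits, len(amplitudes))
```

Since the whole image lives in a single state vector, no separate per-patch position information is stored here, which is consistent with the abstract's remark that no additional positional encoding is needed.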

MD-ProjTex: Texturing 3D Shapes with Multi-Diffusion Projection

Ahmet Burak Yildirim,Mustafa Utku Aydogdu,Duygu Ceylan,Aysegul Dundar

Task: Proposes MD-ProjTex, a fast and consistent method for text-guided texture generation on 3D shapes.

Motivation: Existing methods rely on optimization or sequential view synthesis, which makes them computationally inefficient and prone to inconsistency.

Details

Method: Uses a pretrained text-to-image diffusion model with a multi-view consistency mechanism in UV space that fuses noise predictions from multiple views and jointly updates the per-view denoising directions.

Result: MD-ProjTex is computationally more efficient than existing methods and achieves better quantitative and qualitative results.

Conclusion: MD-ProjTex is an efficient and consistent method for text-guided 3D texture generation.

Abstract: We introduce MD-ProjTex, a method for fast and consistent text-guided texture generation for 3D shapes using pretrained text-to-image diffusion models. At the core of our approach is a multi-view consistency mechanism in UV space, which ensures coherent textures across different viewpoints. Specifically, MD-ProjTex fuses noise predictions from multiple views at each diffusion step and jointly updates the per-view denoising directions to maintain 3D consistency. In contrast to existing state-of-the-art methods that rely on optimization or sequential view synthesis, MD-ProjTex is computationally more efficient and achieves better quantitative and qualitative results.
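One simple way to picture "fusing noise predictions from multiple views" in UV space is a per-texel average over the views that actually see that texel. The sketch below illustrates that idea only. The abstract does not specify the fusion rule, so the uniform averaging, the visibility masks, the one-scalar-per-texel layout, and the function name are all assumptions rather than the paper's mechanism.

```python
def fuse_in_uv(view_predictions, view_masks):
    """Average per-view predictions texel by texel.

    view_predictions: one list of per-texel values per view, already
        projected into a shared UV layout (assumption).
    view_masks: matching lists of booleans; True where the view sees the texel.
    Texels that no view sees are left at 0.0.
    """
    n_texels = len(view_predictions[0])
    fused = []
    for t in range(n_texels):
        visible = [pred[t]
                   for pred, mask in zip(view_predictions, view_masks)
                   if mask[t]]
        fused.append(sum(visible) / len(visible) if visible else 0.0)
    return fused


# Two views over three texels; the last texel is hidden from both views.
preds = [[1.0, 2.0, 5.0], [3.0, 0.0, 7.0]]
masks = [[True, True, False], [True, False, False]]
print(fuse_in_uv(preds, masks))
```

After fusion, each view would read its texels back from the shared UV buffer, so overlapping views are pushed toward the same value at every diffusion step.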

CanonNet: Canonical Ordering and Curvature Learning for Point Cloud Analysis

Benjy Friedmann,Michael Werman

Task: Proposes CanonNet, a lightweight neural network that addresses consistent point ordering and fine-grained geometric feature learning in point cloud processing.

Motivation: Current architectures rely on complex operations that limit expressivity and struggle to capture detailed surface geometry.

Details

Method: CanonNet has two components: (1) a preprocessing pipeline that creates a canonical point ordering and orientation, and (2) a geometric learning framework that learns from synthetic surfaces with precise curvature values.

Result: Achieves state-of-the-art performance in curvature estimation and competitive results on geometric descriptor tasks, with significantly fewer parameters (100X).

Conclusion: CanonNet's efficiency suits real-world applications with limited computational resources and shows that mathematical preprocessing can effectively complement neural architectures for point cloud analysis.

Abstract: Point cloud processing poses two fundamental challenges: establishing consistent point ordering and effectively learning fine-grained geometric features. Current architectures rely on complex operations that limit expressivity while struggling to capture detailed surface geometry. We present CanonNet, a lightweight neural network composed of two complementary components: (1) a preprocessing pipeline that creates a canonical point ordering and orientation, and (2) a geometric learning framework where networks learn from synthetic surfaces with precise curvature values. This modular approach eliminates the need for complex transformation-invariant architectures while effectively capturing local geometric properties. Our experiments demonstrate state-of-the-art performance in curvature estimation and competitive results in geometric descriptor tasks with significantly fewer parameters (100X) than comparable methods. CanonNet's efficiency makes it particularly suitable for real-world applications where computational resources are limited, demonstrating that mathematical preprocessing can effectively complement neural architectures for point cloud analysis. The code for the project is publicly available at https://benjyfri.github.io/CanonNet/.

Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model

Shengjun Zhang,Jinzhao Li,Xin Fei,Hao Liu,Yueqi Duan

Task: Proposes Scene Splatter, a momentum-based video diffusion method for generating generic scenes from a single image.

Motivation: Existing methods suffer from limited video length and scene inconsistency when generating novel views, which leads to artifacts and distortions during reconstruction.

Details

Method: Constructs noisy samples as momentum to enhance video details and maintain scene consistency, and introduces a pixel-level momentum to recover unseen regions.

Result: Experiments show superior performance in high-fidelity and consistent scene generation.

Conclusion: With its cascaded momentum mechanism, Scene Splatter addresses both the video length limitation and scene inconsistency.

Abstract: In this paper, we propose Scene Splatter, a momentum-based paradigm for video diffusion to generate generic scenes from a single image. Existing methods, which employ video generation models to synthesize novel views, suffer from limited video length and scene inconsistency, leading to artifacts and distortions during further reconstruction. To address this issue, we construct noisy samples from original features as momentum to enhance video details and maintain scene consistency. However, for latent features with the perception field that spans both known and unknown regions, such latent-level momentum restricts the generative ability of video diffusion in unknown regions. Therefore, we further introduce the aforementioned consistent video as a pixel-level momentum to a directly generated video without momentum for better recovery of unseen regions. Our cascaded momentum enables video diffusion models to generate both high-fidelity and consistent novel views. We further finetune the global Gaussian representations with enhanced frames and render new frames for momentum update in the next step. In this manner, we can iteratively recover a 3D scene, avoiding the limitation of video length. Extensive experiments demonstrate the generalization capability and superior performance of our method in high-fidelity and consistent scene generation.

TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection

Yoon Gyo Jung,Jaewoo Park,Jaeho Yoon,Kuan-Chuan Peng,Wonchul Kim,Andrew Beng Jin Teoh,Octavia Camps

Task: Addresses unsupervised anomaly detection in a long-tailed setting where the normal dataset is contaminated with defective regions and the class distribution is unknown.

Motivation: Existing models face a trade-off between noise and tail classes and cannot achieve noise robustness and tail-class performance at the same time.

Details

Method: Proposes TailSampler to predict class cardinality, handles tail-class and noisy samples independently, and builds the memory-based anomaly detection model TailedCore.

Result: TailedCore outperforms existing methods in most settings of unsupervised long-tail noisy anomaly detection.

Conclusion: By handling tail-class and noisy samples independently, TailedCore achieves better anomaly detection performance.

Abstract: We aim to solve unsupervised anomaly detection in a practical challenging environment where the normal dataset is both contaminated with defective regions and its product class distribution is tailed but unknown. We observe that existing models suffer from a tail-versus-noise trade-off where if a model is robust against pixel noise, then its performance deteriorates on tail class samples, and vice versa. To mitigate the issue, we handle the tail class and noise samples independently. To this end, we propose TailSampler, a novel class size predictor that estimates the class cardinality of samples based on a symmetric assumption on the class-wise distribution of embedding similarities. TailSampler can be utilized to sample the tail class samples exclusively, allowing them to be handled separately. Based on these facets, we build a memory-based anomaly detection model TailedCore, whose memory both captures tail class information well and is noise-robust. We extensively validate the effectiveness of TailedCore on the unsupervised long-tail noisy anomaly detection setting, and show that TailedCore outperforms the state-of-the-art in most settings.

Multi-Head Adaptive Graph Convolution Network for Sparse Point Cloud-Based Human Activity Recognition

Vincent Gbouna Zakka,Luis J. Manso,Zhuangzhuang Dai

Task: Proposes an adaptive graph convolution method for human activity recognition from millimetre-wave radar point cloud data.

Motivation: Addresses the privacy and low-light limitations of conventional image-based methods, as well as the shortcomings of fixed kernels in existing graph convolution methods.

Details

Method: Proposes the Multi-Head Adaptive Kernel (MAK) module, which dynamically generates multiple kernels that adapt to the local geometry of point cloud data.

Result: Achieves state-of-the-art performance on benchmark datasets.

Conclusion: The MAK-GCN method improves the accuracy and adaptability of human activity recognition.

Abstract: Human activity recognition is increasingly vital for supporting independent living, particularly for the elderly and those in need of assistance. Domestic service robots with monitoring capabilities can enhance safety and provide essential support. Although image-based methods have advanced considerably in the past decade, their adoption remains limited by concerns over privacy and sensitivity to low-light or dark conditions. As an alternative, millimetre-wave (mmWave) radar can produce point cloud data which is privacy-preserving. However, processing the sparse and noisy point clouds remains a long-standing challenge. While graph-based methods and attention mechanisms show promise, they predominantly rely on "fixed" kernels, that is, kernels that are applied uniformly across all neighbourhoods. This highlights the need for adaptive approaches that can dynamically adjust their kernels to the specific geometry of each local neighbourhood in point cloud data. To overcome this limitation, we introduce an adaptive approach within the graph convolutional framework. Instead of a single shared weight function, our Multi-Head Adaptive Kernel (MAK) module generates multiple dynamic kernels, each capturing different aspects of the local feature space. By progressively refining local features while maintaining global spatial context, our method enables convolution kernels to adapt to varying local features. Experimental results on benchmark datasets confirm the effectiveness of our approach, achieving state-of-the-art performance in human activity recognition. Our source code is made publicly available at: https://github.com/Gbouna/MAK-GCN

GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

Zhiyuan Yan,Junyan Ye,Weijia Li,Zilong Huang,Shenghai Yuan,Xiangyang He,Kaiqing Lin,Jun He,Conghui He,Li Yuan

Task: Evaluates GPT-4o's performance in image generation and editing and offers a conjecture about its architecture.

Motivation: OpenAI's GPT-4o model performs impressively in image generation and editing, but it lacks systematic evaluation and a deeper understanding of its architecture.

Details

Method: Proposes the GPT-ImgEval benchmark, which quantitatively and qualitatively evaluates GPT-4o along three dimensions (generation quality, editing proficiency, and semantic synthesis), and uses a classification model to infer its architecture.

Result: GPT-4o performs strongly on image generation and editing tasks; the evidence suggests an architecture that combines an auto-regressive model with a diffusion-based head, and the study identifies the model's limitations.

Conclusion: The study provides an in-depth analysis of GPT-4o's performance and architecture and offers a benchmark and guidance for future research.

Abstract: The recent breakthroughs in OpenAI's GPT-4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report presents the first-look evaluation benchmark (named GPT-ImgEval), quantitatively and qualitatively diagnosing GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, based on GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o, where our empirical results suggest the model consists of an auto-regressive (AR) model combined with a diffusion-based head for image decoding, rather than the VAR-like architectures. We also provide a complete speculation on GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond. The code and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.

Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence

Anita Rau,Mark Endo,Josiah Aklilu,Jaewoo Heo,Khaled Saab,Alberto Paderno,Jeffrey Jopling,F. Christopher Holsinger,Serena Yeung-Levy

Task: Analyzes 11 state-of-the-art vision-language models (VLMs) on 17 visual understanding tasks in surgical AI.

Motivation: Explores the practical potential of vision-language models in medicine, particularly in surgical interventions, where expert-annotated data is scarce.

Details

Method: Uses 13 datasets spanning laparoscopic, robotic, and open surgery to analyze 11 VLMs comprehensively, including in-context learning.

Result: VLMs generalize well and in some cases outperform supervised models; in-context learning boosts performance up to three-fold, but tasks requiring spatial or temporal reasoning remain difficult.

Conclusion: VLMs show potential for complex, dynamic scenarios in clinical and real-world applications, but their spatial and temporal reasoning needs further improvement.

Abstract: Large Vision-Language Models offer a new paradigm for AI-driven image understanding, enabling models to perform tasks without task-specific training. This flexibility holds particular promise across medicine, where expert-annotated data is scarce. Yet, VLMs' practical utility in intervention-focused domains--especially surgery, where decision-making is subjective and clinical scenarios are variable--remains uncertain. Here, we present a comprehensive analysis of 11 state-of-the-art VLMs across 17 key visual understanding tasks in surgical AI--from anatomy recognition to skill assessment--using 13 datasets spanning laparoscopic, robotic, and open procedures. In our experiments, VLMs demonstrate promising generalizability, at times outperforming supervised models when deployed outside their training setting. In-context learning, incorporating examples during testing, boosted performance up to three-fold, suggesting adaptability as a key strength. Still, tasks requiring spatial or temporal reasoning remained difficult. Beyond surgery, our findings offer insights into VLMs' potential for tackling complex and dynamic scenarios in clinical and broader real-world applications.

F-ViTA: Foundation Model Guided Visible to Thermal Translation

Jay N. Paranjape,Celso de Melo,Vishal M. Patel

Task: Proposes F-ViTA, a new method that uses the general knowledge in foundation models to guide the diffusion process for visible-to-thermal image translation.

Motivation: Collecting large thermal datasets is costly and time-consuming, and existing methods (such as GANs or DMs) struggle to learn both the modality distribution shift and the underlying physical principles at once.

Details

Method: Builds on the InstructPix2Pix diffusion model and conditions it on zero-shot masks and labels from foundation models (such as SAM and Grounded DINO) to learn correlations between scene objects and their thermal signatures.

Result: On five public datasets, F-ViTA outperforms existing methods, generalizes to out-of-distribution scenarios, and supports translation to multiple infrared bands.

Conclusion: By exploiting foundation model knowledge, F-ViTA significantly improves visible-to-thermal translation and can generate multiple infrared bands.

Abstract: Thermal imaging is crucial for scene understanding, particularly in low-light and nighttime conditions. However, collecting large thermal datasets is costly and labor-intensive due to the specialized equipment required for infrared image capture. To address this challenge, researchers have explored visible-to-thermal image translation. Most existing methods rely on Generative Adversarial Networks (GANs) or Diffusion Models (DMs), treating the task as a style transfer problem. As a result, these approaches attempt to learn both the modality distribution shift and underlying physical principles from limited training data. In this paper, we propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation. Specifically, we condition an InstructPix2Pix Diffusion Model with zero-shot masks and labels from foundation models such as SAM and Grounded DINO. This allows the model to learn meaningful correlations between scene objects and their thermal signatures in infrared imagery. Extensive experiments on five public datasets demonstrate that F-ViTA outperforms state-of-the-art (SOTA) methods. Furthermore, our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image. Code: https://github.com/JayParanjape/F-ViTA/tree/master.

BOP Challenge 2024 on Model-Based and Model-Free 6D Object Pose Estimation

Van Nguyen Nguyen,Stephen Tyree,Andrew Guo,Mederic Fourmy,Anas Gouda,Taeyeop Lee,Sungphill Moon,Hyeontae Son,Lukas Ranftl,Jonathan Tremblay,Eric Brachmann,Bertram Drost,Vincent Lepetit,Carsten Rother,Stan Birchfield,Jiri Matas,Yann Labbe,Martin Sundermeyer,Tomas Hodan

Task: Evaluate the methodology, datasets, and results of the BOP Challenge 2024 to advance 6D object pose estimation and related tasks.

Motivation: Transition BOP from lab-like setups to real-world scenarios by introducing more practical tasks and datasets.

Details

Method: Introduces model-free tasks, a new 6D object detection task, and the BOP-H3 datasets, which support both model-based and model-free tasks. Result: The best 2024 method improves 6D localization of unseen objects by 22% over the best 2023 method; the more practical method Co-op is 25X faster and 13% more accurate than GenFlow. Conclusion: BOP Challenge 2024 made significant progress toward real-world scenarios, but challenges in speed and accuracy remain. Abstract: We present the evaluation methodology, datasets and results of the BOP Challenge 2024, the sixth in a series of public competitions organized to capture the state of the art in 6D object pose estimation and related tasks. In 2024, our goal was to transition BOP from lab-like setups to real-world scenarios. First, we introduced new model-free tasks, where no 3D object models are available and methods need to onboard objects just from provided reference videos. Second, we defined a new, more practical 6D object detection task where identities of objects visible in a test image are not provided as input. Third, we introduced new BOP-H3 datasets recorded with high-resolution sensors and AR/VR headsets, closely resembling real-world scenarios. The BOP-H3 datasets include 3D models and onboarding videos to support both model-based and model-free tasks. Participants competed on seven challenge tracks, each defined by a task, object onboarding setup, and dataset group. Notably, the best 2024 method for model-based 6D localization of unseen objects (FreeZeV2.1) achieves 22% higher accuracy on BOP-Classic-Core than the best 2023 method (GenFlow), and is only 4% behind the best 2023 method for seen objects (GPose2023) although being significantly slower (24.9 vs 2.7s per image). A more practical 2024 method for this task is Co-op, which takes only 0.8s per image and is 25X faster and 13% more accurate than GenFlow. Methods have a similar ranking on 6D detection as on 6D localization but higher run time. On model-based 2D detection of unseen objects, the best 2024 method (MUSE) achieves 21% relative improvement compared to the best 2023 method (CNOS). However, the 2D detection accuracy for unseen objects is still noticeably (-53%) behind the accuracy for seen objects (GDet2023). The online evaluation system stays open and is available at http://bop.felk.cvut.cz/

Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization

Kangle Deng,Hsueh-Ti Derek Liu,Yiheng Zhu,Xiaoxia Sun,Chong Shang,Kiran Bhat,Deva Ramanan,Jun-Yan Zhu,Maneesh Agrawala,Tinghui Zhou

Task: Propose an octree-based adaptive tokenization framework that adjusts the dimension of the latent representation according to the complexity of the 3D shape.

Motivation: Existing methods encode all shapes into fixed-size tokens, ignoring the inherent variation in scale and complexity across 3D data, which yields inefficient latent representations and hurts downstream generation.

Details

Method: Builds an adaptive octree guided by a quadric-error-based subdivision criterion, assigns a shape latent vector to each octree cell with a query-based transformer, and develops an octree-based autoregressive generative model. Result: Experiments show a 50% reduction in token count at comparable visual quality; at similar token lengths the generated shapes are of significantly higher quality; and the downstream generative model produces more detailed and diverse 3D content. Conclusion: The proposed adaptive tokenization framework significantly improves the efficiency and quality of 3D shape generation. Abstract: Many 3D generative models rely on variational autoencoders (VAEs) to learn compact shape representations. However, existing methods encode all shapes into a fixed-size token, disregarding the inherent variations in scale and complexity across 3D data. This leads to inefficient latent representations that can compromise downstream generation. We address this challenge by introducing Octree-based Adaptive Tokenization, a novel framework that adjusts the dimension of latent representations according to shape complexity. Our approach constructs an adaptive octree structure guided by a quadric-error-based subdivision criterion and allocates a shape latent vector to each octree cell using a query-based transformer. Building upon this tokenization, we develop an octree-based autoregressive generative model that effectively leverages these variable-sized representations in shape generation. Extensive experiments demonstrate that our approach reduces token counts by 50% compared to fixed-size methods while maintaining comparable visual quality. When using a similar token length, our method produces significantly higher-quality shapes. When incorporated with our downstream generative model, our method creates more detailed and diverse 3D content than existing approaches.
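
To make the idea of complexity-dependent token counts concrete, here is a minimal sketch of adaptive octree subdivision. It is not the authors' implementation: the quadric-error criterion is replaced by a simple stand-in (the mean squared distance of a cell's points from their centroid), and the names `cell_error` and `build_octree` are hypothetical.

```python
def cell_error(pts):
    """Stand-in error (NOT the paper's quadric error): mean squared
    distance of the points in a cell from their centroid."""
    if len(pts) < 2:
        return 0.0
    n = len(pts)
    centroid = [sum(p[i] for p in pts) / n for i in range(3)]
    return sum(sum((p[i] - centroid[i]) ** 2 for i in range(3)) for p in pts) / n


def build_octree(points, center, half, tol, max_depth, depth=0):
    """Return the leaf cells as (center, half_size, point_count) tuples.

    A cell is split into eight children only while its error exceeds
    `tol`, so detailed regions end up with more leaves (tokens)."""
    inside = [p for p in points
              if all(center[i] - half <= p[i] < center[i] + half for i in range(3))]
    if depth == max_depth or cell_error(inside) <= tol:
        return [(center, half, len(inside))]
    leaves = []
    q = half / 2
    for dx in (-q, q):
        for dy in (-q, q):
            for dz in (-q, q):
                child = (center[0] + dx, center[1] + dy, center[2] + dz)
                leaves += build_octree(inside, child, q, tol, max_depth, depth + 1)
    return leaves


# A "simple" shape (one point) and a "complex" one (eight spread-out points).
simple_shape = [(0.1, 0.1, 0.1)]
complex_shape = [(sx * 0.5, sy * 0.5, sz * 0.5)
                 for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]

simple_leaves = build_octree(simple_shape, (0.0, 0.0, 0.0), 1.0, tol=0.01, max_depth=4)
complex_leaves = build_octree(complex_shape, (0.0, 0.0, 0.0), 1.0, tol=0.01, max_depth=4)
```

In this toy setting the simple shape stays a single cell while the spread-out shape is split once, so the number of leaves tracks how much detail a region contains. A real system would presumably also discard empty cells and attach a learned latent vector to each remaining leaf.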

GMR-Conv: An Efficient Rotation and Reflection Equivariant Convolution Kernel Using Gaussian Mixture Rings

Yuexi Du,Jiazhen Zhang,Nicha C. Dvornek,John A. Onofrey

Task: Design an efficient convolution kernel (GMR-Conv) that supports rotation and reflection equivariance while avoiding information loss and computational overhead.

Motivation: Conventional convolutional neural networks (CNNs) support only translational equivariance, and extending them to rotation and reflection equivariance raises efficiency and information-loss challenges.

Details

Method: Proposes Gaussian Mixture Ring Convolution (GMR-Conv), which smooths radial symmetry with Gaussian-weighted rings to reduce discretization error, and optimizes the parameterization and computation strategy for efficiency. Result: Experiments on eight classification datasets and one segmentation dataset show that GMR-Conv matches conventional CNNs, performs better on orientation-less data, and is more robust and efficient than existing equivariant learning methods. Conclusion: GMR-Conv demonstrates an effective use of radial symmetry, offers a new approach to the information-loss problem, and advances equivariant network architectures. Abstract: Symmetry, where certain features remain invariant under geometric transformations, can often serve as a powerful prior in designing convolutional neural networks (CNNs). While conventional CNNs inherently support translational equivariance, extending this property to rotation and reflection has proven challenging, often forcing a compromise between equivariance, efficiency, and information loss. In this work, we introduce Gaussian Mixture Ring Convolution (GMR-Conv), an efficient convolution kernel that smooths radial symmetry using a mixture of Gaussian-weighted rings. This design mitigates discretization errors of circular kernels, thereby preserving robust rotation and reflection equivariance without incurring computational overhead. We further optimize both the space and speed efficiency of GMR-Conv via a novel parameterization and computation strategy, allowing larger kernels at an acceptable cost. Extensive experiments on eight classification datasets and one segmentation dataset demonstrate that GMR-Conv not only matches conventional CNNs' performance but can also surpass it in applications with orientation-less data. GMR-Conv is also shown to be more robust and efficient than the state-of-the-art equivariant learning methods. Our work provides inspiring empirical evidence that carefully applied radial symmetry can alleviate the challenges of information loss, marking a promising advance in equivariant network architectures. The code is available at https://github.com/XYPB/GMR-Conv.
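
The following sketch illustrates what a kernel built from Gaussian-weighted rings could look like. It is an illustration under assumptions, not the released GMR-Conv code: the function name `gmr_kernel`, the choice of one ring per integer radius, and the shared `sigma` are all hypothetical.

```python
import math


def gmr_kernel(size, ring_weights, sigma=0.6):
    """Build a size x size kernel as a weighted sum of Gaussian rings.

    ring_weights[k] scales a ring centred at radius k (in pixels); in a
    trainable layer these weights would be the learnable parameters."""
    c = (size - 1) / 2.0
    kernel = []
    for y in range(size):
        row = []
        for x in range(size):
            r = math.hypot(x - c, y - c)  # distance from the kernel centre
            row.append(sum(w * math.exp(-((r - k) ** 2) / (2 * sigma ** 2))
                           for k, w in enumerate(ring_weights)))
        kernel.append(row)
    return kernel


def rot90(k):
    """Rotate a square kernel by 90 degrees."""
    return [list(row) for row in zip(*k[::-1])]


def flip(k):
    """Mirror a kernel left to right."""
    return [row[::-1] for row in k]


def same(a, b):
    """Element-wise comparison with a small floating-point tolerance."""
    return all(math.isclose(p, q, abs_tol=1e-12)
               for ra, rb in zip(a, b) for p, q in zip(ra, rb))


kernel = gmr_kernel(5, [1.0, -0.5, 0.25])
```

Because every entry depends only on its distance from the centre, the kernel is unchanged by 90-degree rotations and mirror flips, which is what makes a convolution with it equivariant to those transforms. For arbitrary angles the symmetry holds only approximately on a pixel grid; the Gaussian smoothing of each ring is what the abstract credits with reducing that discretization error.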

Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

Mateusz Pach,Shyamgopal Karthik,Quentin Bouniot,Serge Belongie,Zeynep Akata

Task: Apply sparse autoencoders (SAEs) to vision-language models (VLMs) to improve their interpretability and controllability.

Motivation: SAEs have shown potential for improving interpretability and steerability in large language models, but their application to VLMs remains underexplored.

Details

Method: Trains SAEs on vision-language models (such as CLIP) and introduces a framework for evaluating the monosemanticity of vision representations. Result: SAEs significantly improve the monosemanticity of vision representations and exhibit hierarchical representations consistent with expert-defined structures (such as the iNaturalist taxonomy). SAEs can also intervene directly on the CLIP vision encoder, steering the output of multimodal LLMs without modifying the underlying model. Conclusion: SAEs are an unsupervised approach that effectively improves the interpretability and controllability of VLMs. Abstract: Sparse Autoencoders (SAEs) have recently been shown to enhance interpretability and steerability in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity in vision representations. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons while also exhibiting hierarchical representations that align well with expert-defined structures (e.g., iNaturalist taxonomy). Most notably, we demonstrate that applying SAEs to intervene on a CLIP vision encoder directly steers the output of multimodal LLMs (e.g., LLaVA) without any modifications to the underlying model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised approach for enhancing both the interpretability and control of VLMs.
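
For readers unfamiliar with SAEs, the sketch below shows the generic mechanism the abstract relies on: encode an activation into sparse latents, decode it back, and steer by overwriting one latent before decoding. It assumes the common ReLU encoder with an L1 sparsity penalty; the paper may use a different SAE variant, and the tiny hand-set weights and function names are purely illustrative.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def sae_encode(x, W_enc, b_enc):
    """Sparse latents: ReLU(W_enc x + b_enc)."""
    return [max(0.0, a + b) for a, b in zip(matvec(W_enc, x), b_enc)]


def sae_decode(z, W_dec, b_dec):
    """Reconstruction of the activation: W_dec z + b_dec."""
    return [a + b for a, b in zip(matvec(W_dec, z), b_dec)]


def sae_loss(x, x_hat, z, l1_weight):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + l1_weight * sum(abs(a) for a in z)


# Toy weights (hypothetical): a 2-d activation and 3 latent features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

x = [2.0, -1.0]                      # stands in for a vision-encoder activation
z = sae_encode(x, W_enc, b_enc)      # only one latent is active
x_hat = sae_decode(z, W_dec, b_dec)
loss = sae_loss(x, x_hat, z, l1_weight=0.1)

# Steering: overwrite one latent, then decode the edited activation.
z_steered = list(z)
z_steered[1] = 3.0
x_steered = sae_decode(z_steered, W_dec, b_dec)
```

In a steering setup of the kind the abstract describes, the decoded, edited activation would be passed on in place of the original one, so the downstream model itself is left unmodified.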

STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection

Divya Velayudhan,Abdelfatah Ahmed,Mohamad Alansari,Neha Gour,Abderaouf Behouch,Taimur Hassan,Syed Talal Wasim,Nabil Maalej,Muzammal Naseer,Juergen Gall,Mohammed Bennamoun,Ernesto Damiani,Naoufel Werghi

Task: Develop STCray, a multimodal X-ray baggage security dataset, and train STING-BEE, a domain-aware visual AI assistant that supports a range of vision-language tasks.

Motivation: Current datasets are limited in representing complex real-world threats and concealment tactics, and existing methods are constrained by a closed-set paradigm with predefined labels.

Details

Method: Generates 46,642 image-caption paired scans covering 21 threat categories with an X-ray scanner, and develops domain-aware multimodal instruction-following data. Result: STING-BEE performs strongly on a range of vision-language tasks and shows state-of-the-art generalization in cross-domain settings. Conclusion: STCray and STING-BEE set a new benchmark for multimodal learning in X-ray baggage security. Abstract: Advancements in Computer-Aided Screening (CAS) systems are essential for improving the detection of security threats in X-ray baggage scans. However, current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels. To address these challenges, we introduce STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption paired scans across 21 threat categories, generated using an X-ray scanner for airport security. STCray is meticulously developed with our specialized protocol that ensures domain-aware, coherent captions, which in turn yield multi-modal instruction-following data for X-ray baggage security. This allows us to train a domain-aware visual AI assistant named STING-BEE that supports a range of vision-language tasks, including scene comprehension, referring threat localization, visual grounding, and visual question answering (VQA), establishing novel baselines for multi-modal learning in X-ray baggage security. Further, STING-BEE shows state-of-the-art generalization in cross-domain settings. Code, data, and models are available at https://divs1159.github.io/STING-BEE/.

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

Xiangyu Zhao,Peiyuan Zhang,Kexian Tang,Hao Li,Zicheng Zhang,Guangtao Zhai,Junchi Yan,Hua Yang,Xue Yang,Haodong Duan

Task: Propose and evaluate RISEBench, the first benchmark for Reasoning-Informed viSual Editing (RISE).

Motivation: Address the challenges that large multi-modality models (LMMs) face in general visual editing: following complex instructions, preserving appearance consistency, and supporting flexible input formats.

Details

Method: Curates high-quality test cases covering four key reasoning types (temporal, causal, spatial, and logical) and proposes an evaluation framework that combines human judges with an LMM-as-a-judge approach. Result: Experiments show that although GPT-4o-Native significantly outperforms other open-source and proprietary models, it still struggles with logical reasoning tasks. Conclusion: RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research; the benchmark will continue to be expanded and refined. Abstract: Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach. Our experiments reveal that while GPT-4o-Native significantly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, highlighting an area that remains underexplored. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though still in its early stages, we are committed to continuously expanding and refining the benchmark to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems. Our code and data will be released at https://github.com/PhoenixZ810/RISEBench.

Concept Lancet: Image Editing with Compositional Representation Transplant

Jinqi Luo,Tianjiao Ding,Kwan Ho Ryan Chan,Hancheng Min,Chris Callison-Burch,René Vidal

Task: Propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing.

Motivation: Existing editing methods struggle to balance edit strength during representation manipulation: edits that are too strong or too weak both degrade the result, and the strength must be tuned by hand for each image, which is inefficient.

Details

Method: Decomposes the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of collected visual concept representations, accurately estimates the presence of concepts in the image, and performs a customized concept transplant according to the editing task (replace/add/remove). Result: On multiple diffusion-based image editing baselines, methods equipped with CoLan achieve state-of-the-art editing effectiveness and consistency preservation. Conclusion: Through sparse linear decomposition and concept transplant, CoLan effectively addresses the edit-strength balancing problem and improves the efficiency and quality of image editing. Abstract: Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace/add/remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation.
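
The decomposition step can be illustrated with a standard sparse-coding solver. The sketch below uses ISTA (iterative soft-thresholding) on an L1-regularized least-squares problem, which is one common way to obtain a sparse linear combination; the abstract does not say which solver CoLan uses, and the three-concept dictionary, the vector values, and the function names are made up for illustration.

```python
import math


def soft_threshold(v, t):
    """Shrink v toward zero by t (the proximal step for the L1 penalty)."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def sparse_decompose(x, atoms, lam=0.05, iters=500):
    """Approximately solve min_w 0.5*||x - sum_k w_k*atoms[k]||^2 + lam*||w||_1 with ISTA."""
    # Step size 1/L, with L bounded by the squared Frobenius norm of the dictionary.
    step = 1.0 / sum(a * a for atom in atoms for a in atom)
    w = [0.0] * len(atoms)
    for _ in range(iters):
        recon = [sum(w[k] * atoms[k][i] for k in range(len(atoms)))
                 for i in range(len(x))]
        resid = [xi - ri for xi, ri in zip(x, recon)]
        grad = [sum(atom[i] * resid[i] for i in range(len(x))) for atom in atoms]
        w = [soft_threshold(wk + step * gk, step * lam) for wk, gk in zip(w, grad)]
    return w


# Hypothetical concept dictionary (one unit vector per concept).
concepts = ["cat", "dog", "grass"]
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

source = [0.9, 0.0, 0.4]             # stands in for a source latent
weights = sparse_decompose(source, atoms)
presence = dict(zip(concepts, weights))
```

The recovered coefficients indicate how strongly each concept is present in the source, and a concept that is absent gets a coefficient of exactly zero. A per-image estimate of this kind is what would let the edit strength be set from the data rather than found by trial and error.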

CaLiV: LiDAR-to-Vehicle Calibration of Arbitrary Sensor Setups via Object Reconstruction

Ilir Tahiraj,Markus Edinger,Dominik Kulmer,Markus Lienkamp

Task: Propose CaLiV, a target-based technique for extrinsic calibration of multi-LiDAR systems (Sensor-to-Sensor and Sensor-to-Vehicle).

Motivation: Existing LiDAR calibration methods typically require overlapping fields of view, external sensing devices, or feature-rich environments, and most do not support Sensor-to-Vehicle calibration.

Details

Method: Uses motion to produce field-of-view overlap, obtains vehicle poses with an unscented Kalman filter, aligns the point clouds with the GMMCalib framework, and finally casts calibration as a minimization problem. Result: The method resolves Sensor-to-Sensor translational and rotational errors, as well as Sensor-to-Vehicle rotation angles, with high accuracy. Conclusion: CaLiV achieves efficient, high-accuracy calibration of multi-LiDAR systems without external devices or overlapping fields of view, and is validated experimentally. Abstract: In autonomous systems, sensor calibration is essential for safe and efficient navigation in dynamic environments. Accurate calibration is a prerequisite for reliable perception and planning tasks such as object detection and obstacle avoidance. Many existing LiDAR calibration methods require overlapping fields of view, while others use external sensing devices or postulate a feature-rich environment. In addition, Sensor-to-Vehicle calibration is not supported by the vast majority of calibration algorithms. In this work, we propose a novel target-based technique for extrinsic Sensor-to-Sensor and Sensor-to-Vehicle calibration of multi-LiDAR systems called CaLiV. This algorithm works for non-overlapping FoVs, as well as arbitrary calibration targets, and does not require any external sensing devices. First, we apply motion to produce FoV overlaps and utilize a simple unscented Kalman filter to obtain vehicle poses. Then, we use the Gaussian mixture model-based registration framework GMMCalib to align the point clouds in a common calibration frame. Finally, we reduce the task of recovering the sensor extrinsics to a minimization problem. We show that both translational and rotational Sensor-to-Sensor errors can be solved accurately by our method. In addition, all Sensor-to-Vehicle rotation angles can also be calibrated with high accuracy. We validate the simulation results in real-world experiments. The code is open source and available at https://github.com/TUMFTM/CaLiV.

Distance Estimation to Support Assistive Drones for the Visually Impaired using Robust Calibration

Suman Raj,Bhavani A Madhabhavi,Madhav Kumar,Prabhav Gupta,Yogesh Simmhan

Task: Estimate obstacle distances for visually impaired people (VIPs) navigating outdoor environments, using depth maps and a dynamic-update method.

Motivation: Use drones and deep learning to help visually impaired people navigate outdoor environments autonomously and avoid obstacles.

Details

Method: Proposes NOVA, which combines depth maps with a dynamic-update method that adapts to adversarial scenarios, and compares it with existing depth map approaches and baseline models. Result: NOVA predicts distances to the VIP with an error below 30 cm and to other obstacles (such as cars and bicycles) with a maximum error of 60 cm, outperforming the baselines and existing depth map methods (by 5.3-14.6x). Conclusion: NOVA is a robust, generalizable method that can effectively help visually impaired people navigate dynamic environments. Abstract: Autonomous navigation by drones using onboard sensors, combined with deep learning and computer vision algorithms, is impacting a number of domains. We examine the use of drones to autonomously assist Visually Impaired People (VIPs) in navigating outdoor environments while avoiding obstacles. Here, we present NOVA, a robust calibration technique using depth maps to estimate absolute distances to obstacles in a campus environment. NOVA uses a dynamic-update method that can adapt to adversarial scenarios. We compare NOVA with SOTA depth map approaches, and with geometric and regression-based baseline models, for distance estimation to VIPs and other obstacles in diverse and dynamic conditions. We also provide exhaustive evaluations to validate the robustness and generalizability of our methods. NOVA predicts distances to VIPs with an error <30cm and to different obstacles like cars and bicycles with a maximum of 60cm error, which are better than the baselines. NOVA also clearly outperforms SOTA depth map methods, by up to 5.3-14.6x.

A Concise Survey on Lane Topology Reasoning for HD Mapping

Yi Yao,Miao Fan,Shengtong Xu,Haoyi Xiong,Xiangzeng Liu,Wenbo Hu,Wenbing Huang

Task: Systematically review and categorize the evolution and current state of lane topology reasoning methods.

Motivation: Lane topology reasoning is crucial to HD mapping and autonomous driving, yet a comprehensive survey of this work is lacking.

Details

Method: Groups methods into three categories (procedural modeling-based, aerial imagery-based, and onboard sensors-based) and analyzes the progression from early rule-based approaches to modern learning-based ones. Result: Summarizes standardized evaluation metrics and performance comparisons on benchmark datasets, and identifies key challenges such as dataset availability and model efficiency. Conclusion: Gives researchers and practitioners a comprehensive view of the theoretical frameworks, practical implementations, and emerging trends in lane topology reasoning. Abstract: Lane topology reasoning techniques play a crucial role in high-definition (HD) mapping and autonomous driving applications. While recent years have witnessed significant advances in this field, there has been limited effort to consolidate these works into a comprehensive overview. This survey systematically reviews the evolution and current state of lane topology reasoning methods, categorizing them into three major paradigms: procedural modeling-based methods, aerial imagery-based methods, and onboard sensors-based methods. We analyze the progression from early rule-based approaches to modern learning-based solutions utilizing transformers, graph neural networks (GNNs), and other deep learning architectures. The paper examines standardized evaluation metrics, including road-level measures (APLS and TLTS score), and lane-level metrics (DET and TOP score), along with performance comparisons on benchmark datasets such as OpenLane-V2. We identify key technical challenges, including dataset availability and model efficiency, and outline promising directions for future research. This comprehensive review provides researchers and practitioners with insights into the theoretical frameworks, practical implementations, and emerging trends in lane topology reasoning for HD mapping applications.

Khizar Anjum,Parul Pandey,Vidyasagar Sadhu,Roberto Tron,Dario Pompili

Task: Propose a novel Markov Decision Process (MDP) framework to reduce the computational burden of computer vision (CV) algorithms in autonomous navigation.

Motivation: Existing autonomous navigation methods rely on building and processing geometric 3D point clouds, which is computationally expensive. Navigation based on semantic information (such as traffic signs) is simpler, but CV algorithms such as object detection are demanding for resource-limited devices such as drones.

Details

Method: Introduces a novel MDP framework, applies it to feature-based and neural-network-based object detection tasks, and tests it with open-loop and closed-loop simulations and hardware-in-the-loop emulations. Result: Experiments show significant gains in energy consumption and speed over methods based on static features and neural networks, with only a limited loss in accuracy. Conclusion: The proposed MDP framework effectively reduces the computational burden of CV algorithms and suits resource-constrained autonomous navigation devices. Abstract: Most applications in autonomous navigation using mounted cameras rely on the construction and processing of geometric 3D point clouds, which is an expensive process. There is, however, a simpler way to make a space navigable quickly: using semantic information (e.g., traffic signs) to guide the agent. Yet detecting and acting on semantic information involves Computer Vision (CV) algorithms such as object detection, which themselves are demanding for agents such as aerial drones with limited onboard resources. To solve this problem, we introduce a novel Markov Decision Process (MDP) framework to reduce the workload of these CV approaches. We apply our proposed framework to both feature-based and neural-network-based object-detection tasks, using open-loop and closed-loop simulations as well as hardware-in-the-loop emulations. These holistic tests show significant benefits in energy consumption and speed with only a limited loss in accuracy compared to models based on static features and neural networks.

WorldPrompter: Traversable Text-to-Scene Generation

Zhaoyang Zhang,Yannick Hold-Geoffroy,Miloš Hašan,Chen Ziwen,Fujun Luan,Julie Dorsey,Yiwei Hu

Task: Generate traversable 3D scenes from text prompts.

Motivation: Most existing methods generate only partial scenes with limited navigational freedom; a new method is needed that generates complete, traversable 3D scenes.

Details

Method: Uses panoramic video as an intermediate representation, combining a conditional 360° panoramic video generator with a fast feedforward 3D reconstructor to produce 3D scenes represented as Gaussian splats. Result: Experiments show that the panoramic video generation model maintains consistent views across frames, supports high-quality Gaussian splat reconstruction, and enables traversal within an area of the scene. Conclusion: WorldPrompter outperforms the state of the art in 360° video generation and 3D scene generation. Abstract: Scene-level 3D generation is a challenging research topic, with most existing methods generating only partial scenes and offering limited navigational freedom. We introduce WorldPrompter, a novel generative pipeline for synthesizing traversable 3D scenes from text prompts. We leverage panoramic videos as an intermediate representation to model the 360° details of a scene. WorldPrompter incorporates a conditional 360° panoramic video generator, capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed as Gaussian splats by a fast feedforward 3D reconstructor, enabling a true walkable experience within the 3D scene. Experiments demonstrate that our panoramic video generation model achieves convincing view consistency across frames, enabling high-quality panoramic Gaussian splat reconstruction and facilitating traversal over an area of the scene. Qualitative and quantitative results also show it outperforms the state-of-the-art 360° video generators and 3D scene generation models.

Evaluation of Flight Parameters in UAV-based 3D Reconstruction for Rooftop Infrastructure Assessment

Nick Chodura,Melissa Greeff,Joshua Woods

Task: Systematically evaluate UAV flight parameters (ground sampling distance and image overlap) to optimize 3D reconstruction of complex rooftop infrastructure.

Motivation: Existing methods need high image overlap and long flight times to ensure model accuracy; this study aims to optimize these parameters for efficiency.

Details

Method: Collects data from controlled UAV flights at varied ground sampling distances and image overlaps, processes it with Reality Capture software, and evaluates it against ground truth models from LiDAR and TLS. Result: A ground sampling distance of 0.75-1.26 cm with 85% image overlap preserves model accuracy while reducing the number of images and the flight time. Conclusion: The findings guide the planning of autonomous UAV flight paths for efficient rooftop assessment. Abstract: Rooftop 3D reconstruction using UAV-based photogrammetry offers a promising solution for infrastructure assessment, but existing methods often require high percentages of image overlap and extended flight times to ensure model accuracy when using autonomous flight paths. This study systematically evaluates two key flight parameters, ground sampling distance (GSD) and image overlap, to optimize the 3D reconstruction of complex rooftop infrastructure. Controlled UAV flights were conducted over a multi-segment rooftop at Queen's University using a DJI Phantom 4 Pro V2, with varied GSD and overlap settings. The collected data were processed using Reality Capture software and evaluated against ground truth models generated from UAV-based LiDAR and terrestrial laser scanning (TLS). Experimental results indicate that a GSD range of 0.75-1.26 cm combined with 85% image overlap achieves a high degree of model accuracy, while minimizing images collected and flight time. These findings provide guidance for planning autonomous UAV flight paths for efficient rooftop assessments.
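
To relate the reported GSD range to flight planning, the sketch below applies the standard photogrammetric relation GSD = (sensor width x flight height) / (focal length x image width) for a nadir-pointing camera. The camera constants are nominal values assumed here for the DJI Phantom 4 Pro V2 and should be checked against the manufacturer's datasheet; the function names are hypothetical and the study's actual flight heights are not given in the abstract.

```python
# Nominal camera values ASSUMED for a DJI Phantom 4 Pro V2 (verify before use).
SENSOR_WIDTH_MM = 13.2
FOCAL_LENGTH_MM = 8.8
IMAGE_WIDTH_PX = 5472
IMAGE_HEIGHT_PX = 3648


def altitude_for_gsd(gsd_cm):
    """Height above the roof (m) that yields the given GSD (cm per pixel)."""
    gsd_m = gsd_cm / 100.0
    return gsd_m * FOCAL_LENGTH_MM * IMAGE_WIDTH_PX / SENSOR_WIDTH_MM


def photo_spacing(gsd_cm, overlap):
    """Distance (m) between exposures for a forward overlap in [0, 1).

    Assumes the short image side points along the flight direction."""
    footprint_m = gsd_cm / 100.0 * IMAGE_HEIGHT_PX
    return footprint_m * (1.0 - overlap)


low_height = altitude_for_gsd(0.75)    # height for the finest GSD in the reported range
high_height = altitude_for_gsd(1.26)   # height for the coarsest GSD in the reported range
spacing = photo_spacing(0.75, 0.85)    # exposure spacing at 85% overlap
```

Under these assumed camera values, the 0.75-1.26 cm range corresponds to flying roughly 27 to 46 m above the roof surface, and 85% forward overlap at the finest GSD means taking an image about every 4 m of travel.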

One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image

Ezzeldin Shereen,Dan Ristea,Burak Hasircioglu,Shae McFadden,Vasilios Mavroudis,Chris Hicks

Task: Study poisoning attacks against multimodal retrieval augmented generation (M-RAG) systems, particularly those targeting visual document retrieval applications.

Motivation: M-RAG systems use a knowledge base to curb hallucinations in large multimodal models, but this introduces a new attack surface: adversaries may disrupt the system by injecting malicious entries.

Details

Method: Proposes a poisoning attack on M-RAG systems whose goal is to craft a malicious image that is retrieved for many different queries and thereby influences the generative model's output. Result: The attack is effective against a range of widely used retrievers and generators, but ineffective against some robust embedding models. Conclusion: M-RAG systems are vulnerable to poisoning attacks, and the attack exposes a weakness that may hinder their performance even in benign settings. Abstract: Multimodal retrieval augmented generation (M-RAG) has recently emerged as a method to inhibit hallucinations of large multimodal models (LMMs) through a factual knowledge base (KB). However, M-RAG also introduces new attack vectors for adversaries that aim to disrupt the system by injecting malicious entries into the KB. In this work, we present a poisoning attack against M-RAG targeting visual document retrieval applications, where the KB contains images of document pages. Our objective is to craft a single image that is retrieved for a variety of different user queries, and consistently influences the output produced by the generative model, thus creating a universal denial-of-service (DoS) attack against the M-RAG system. We demonstrate that while our attack is effective against a diverse range of widely-used, state-of-the-art retrievers (embedding models) and generators (LMMs), it can also be ineffective against robust embedding models. Our attack not only highlights the vulnerability of M-RAG pipelines to poisoning attacks, but also sheds light on a fundamental weakness that potentially hinders their performance even in benign settings.

Multivariate Temporal Regression at Scale: A Three-Pillar Framework Combining ML, XAI, and NLP

Jiztom Kavalakkatt Francis,Matthew J Darr

Task: Explore the challenges of high-dimensional data analysis and propose methods for simplifying models.

Motivation: Traditional data analysis methods can miss complex relationships in high-dimensional data and are computationally expensive; more efficient and more understandable methods are needed.

Details

Method: Proposes a global feature extraction method that simplifies models, using techniques such as variable removal, statistical analysis, and synthetic data. Result: The method is validated on a real-world and a synthetic dataset, revealing new input-output relationships and simplifying models. Conclusion: The proposed method simplifies high-dimensional data analysis, improves model interpretability, and reveals latent relationships in the data. Abstract: The rapid adoption of artificial intelligence (AI) in processes such as coding, image processing, and data prediction makes it crucial to fully understand and validate the data we are working with. This paper examines the hurdles of analyzing high-dimensional data, especially when it becomes too complex. Traditional methods in data analysis often look at direct connections between input variables, which can miss the more complicated relationships within the data. To address these issues, we explore several tested techniques, such as removing specific variables to see their impact and using statistical analysis to find connections between multiple variables. We also consider the role of synthetic data and how information can sometimes be redundant across different sensors. These analyses are typically very computationally demanding and often require much human effort to make sense of the results. A common approach is to treat the entire dataset as one unit and apply advanced models to handle it. However, this can become problematic with larger, noisier datasets and more complex models. We therefore suggest methods to identify overall patterns that can help with tasks like classification or regression, based on the idea that more straightforward approaches might be more understandable. Our research looks at two datasets: a real-world dataset and a synthetic one. The goal is to create a methodology that highlights key features on a global scale that lead to predictions, making it easier to validate or quantify the dataset. By reducing the dimensionality with this method, we can simplify the models used and thus clarify the insights we gain. Furthermore, our method can reveal unexplored relationships between specific inputs and outcomes, providing a way to validate these new connections further.

Preference-Driven Active 3D Scene Representation for Robotic Inspection in Nuclear Decommissioning

Zhen Meng,Kan Chen,Xiangmin Xu,Erwin Jose Lopez Pulgarin,Emma Li,Philip G. Zhao,David Flynn

Task: Propose a new framework that integrates expert operator preferences into active 3D scene representation.

Motivation: Traditional methods mainly optimize geometric fidelity or rendering accuracy but overlook operator-specific objectives (such as safety-critical coverage or task-driven viewpoints), leading to suboptimal viewpoint selection in constrained environments.

Details

Method: Uses Reinforcement Learning from Human Feedback (RLHF) to guide robotic path planning, and captures operator-specific priorities through interactive choice experiments. Result: Validated in a nuclear decommissioning scenario, the RLHF policy outperforms baseline methods, optimizing trajectory efficiency and improving scene representation. Conclusion: The work lays a foundation for adaptive, safety-critical robotic perception systems and advances automation in high-risk environments such as nuclear decommissioning. Abstract: Active 3D scene representation is pivotal in modern robotics applications, including remote inspection, manipulation, and telepresence. Traditional methods primarily optimize geometric fidelity or rendering accuracy, but often overlook operator-specific objectives, such as safety-critical coverage or task-driven viewpoints. This limitation leads to suboptimal viewpoint selection, particularly in constrained environments such as nuclear decommissioning. To bridge this gap, we introduce a novel framework that integrates expert operator preferences into the active 3D scene representation pipeline. Specifically, we employ Reinforcement Learning from Human Feedback (RLHF) to guide robotic path planning, reshaping the reward function based on expert input. To capture operator-specific priorities, we conduct interactive choice experiments that evaluate user preferences in 3D scene representation. We validate our framework using a UR3e robotic arm for reactor tile inspection in a nuclear decommissioning scenario. Compared to baseline methods, our approach enhances scene representation while optimizing trajectory efficiency. The RLHF-based policy consistently outperforms random selection, prioritizing task-critical details. By unifying explicit 3D geometric modeling with implicit human-in-the-loop optimization, this work establishes a foundation for adaptive, safety-critical robotic perception systems, paving the way for enhanced automation in nuclear decommissioning, remote maintenance, and other high-risk environments.

Neural Style Transfer for Synthesising a Dataset of Ancient Egyptian Hieroglyphs

Lewis Matheson Creed

Task: Propose a new method that uses Neural Style Transfer (NST) to generate a dataset of ancient Egyptian hieroglyphs.

Motivation: Training data for low-resource languages (such as ancient Egyptian) is limited, which restricts the application of machine learning.

Details

Method: Generates a dataset of ancient Egyptian hieroglyphs by applying NST to a digital typeface. Result: Models trained on NST-generated data and models trained on real photographs perform comparably in classification and generalize to unseen real images of hieroglyphs. Conclusion: NST is an effective data augmentation method for addressing data scarcity in low-resource languages. Abstract: The limited availability of training data for low-resource languages makes applying machine learning techniques challenging. Ancient Egyptian is one such language with few resources. However, innovative applications of data augmentation methods, such as Neural Style Transfer (NST), could overcome these barriers. This paper presents a novel method for generating datasets of ancient Egyptian hieroglyphs by applying NST to a digital typeface. Experimental results found that image classification models trained on NST-generated examples and photographs demonstrate equal performance and transferability to real unseen images of hieroglyphs.

Image Coding for Machines via Feature-Preserving Rate-Distortion Optimization

Samuel Fernández-Menduiña,Eduardo Pavez,Antonio Ortega

Task: Optimize image and video compression to account for both visual quality and downstream computer vision task performance.

Motivation: Many images and videos are processed mainly by computer vision algorithms, so compression must consider both visual quality and task performance.

Details

Method: Simplifies rate-distortion optimization (RDO) through a Taylor expansion and a block-wise approximation, proposing the input-dependent squared error (IDSE) and an approximation of it based on Jacobian sketches. Result: In AVC simulations, the method saves up to 10% bit-rate relative to SSE-based RDO, with no decoder complexity overhead and only a 7% encoder complexity increase. Conclusion: The proposed method effectively balances compression efficiency and computer vision task performance. Abstract: Many images and videos are primarily processed by computer vision algorithms, involving only occasional human inspection. When this content requires compression before processing, e.g., in distributed applications, coding methods must optimize for both visual quality and downstream task performance. We first show that, given the features obtained from the original and the decoded images, an approach to reduce the effect of compression on a task loss is to perform rate-distortion optimization (RDO) using the distance between features as a distortion metric. However, directly optimizing such a rate-distortion trade-off requires an iterative workflow of encoding, decoding, and feature evaluation for each coding parameter, which is computationally impractical. We address this problem by simplifying the RDO formulation to make the distortion term computable using block-based encoders. We first apply Taylor's expansion to the feature extractor, recasting the feature distance as a quadratic metric with the Jacobian matrix of the neural network. Then, we replace the linearized metric with a block-wise approximation, which we call input-dependent squared error (IDSE). To reduce computational complexity, we approximate IDSE using Jacobian sketches. The resulting loss can be evaluated block-wise in the transform domain and combined with the sum of squared errors (SSE) to address both visual quality and computer vision performance. Simulations with AVC across multiple feature extractors and downstream neural networks show up to 10% bit-rate savings for the same computer vision accuracy compared to RDO based on SSE, with no decoder complexity overhead and just a 7% encoder complexity increase.
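
The linearization described in the abstract can be written out as follows. This is a reconstruction from the abstract's wording, not the paper's exact notation: the symbols $f$ (feature extractor), $x$ and $\hat{x}$ (original and decoded image), $J_f(x)$ (Jacobian of $f$ at $x$), and the block index $b$ are assumed here, and the block-wise form is one plausible reading in which cross-block terms are dropped.

```latex
% First-order Taylor expansion of the feature extractor around the original image
f(\hat{x}) \approx f(x) + J_f(x)\,(\hat{x} - x)

% The feature distance then becomes a quadratic form in the pixel-domain error
\lVert f(x) - f(\hat{x}) \rVert_2^2
  \approx (x - \hat{x})^{\top} J_f(x)^{\top} J_f(x)\,(x - \hat{x})

% Block-wise approximation (assumed form): keep only the per-block terms,
% where J_b collects the columns of J_f(x) that belong to block b
\sum_{b=1}^{B} (x_b - \hat{x}_b)^{\top} J_b^{\top} J_b\,(x_b - \hat{x}_b)
```

The practical benefit is that $J_f(x)$ depends only on the original image, so it can be computed once before encoding, and each block's distortion can then be evaluated independently inside a block-based encoder without decoding the image and re-running the network for every candidate coding parameter.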

APSeg: Auto-Prompt Model with Acquired and Injected Knowledge for Nuclear Instance Segmentation and Classification

Liying Xu,Hongliang He,Wei Han,Hanbin Huang,Siwei Feng,Guohong Fu

Task: Propose APSeg, an auto-prompt model for nuclear instance segmentation and classification.

Motivation: SAM relies on precise prompts for nuclear segmentation and cannot classify on its own, so prompt generation must be improved to raise localization and classification accuracy.

Details

Method: APSeg comprises two knowledge-aware modules: DG-POM (Distribution-Guided Proposal Offset Module) and CK-SIM (Category Knowledge Semantic Injection Module). Result: Experiments on the PanNuke and CoNSeP datasets validate the effectiveness of the method. Conclusion: APSeg improves the accuracy of nuclear instance segmentation and classification through knowledge injection. Abstract: Nuclear instance segmentation and classification provide critical quantitative foundations for digital pathology diagnosis. With the advent of the foundational Segment Anything Model (SAM), the accuracy and efficiency of nuclear segmentation have improved significantly. However, SAM imposes a strong reliance on precise prompts, and its class-agnostic design renders its classification results entirely dependent on the provided prompts. Therefore, we focus on generating prompts with more accurate localization and classification and propose APSeg, an Auto-Prompt model with acquired and injected knowledge for nuclear instance Segmentation and classification. APSeg incorporates two knowledge-aware modules: (1) Distribution-Guided Proposal Offset Module (DG-POM), which learns distribution knowledge through density map guidance, and (2) Category Knowledge Semantic Injection Module (CK-SIM), which injects morphological knowledge derived from category descriptions. We conducted extensive experiments on the PanNuke and CoNSeP datasets, demonstrating the effectiveness of our approach. The code will be released upon acceptance.

LLM-Guided Evolution: An Autonomous Model Optimization for Object Detection

YiMing Yu,Jason Zutty

Task: Improve the LLM-GE framework to optimize the architecture of YOLO models and enhance object detection performance on the KITTI dataset.

Motivation: Traditional neural architecture search (NAS) requires extensive trial-and-error and domain knowledge, while evolutionary algorithms rely on fixed rules and pre-defined building blocks. LLM-GE offers a more intelligent form of optimization by using large language models (LLMs) to modify model code directly.

Details

Method: Applies the LLM-GE framework with the "Evolution of Thought" (EoT) technique, iteratively refining the design and settings of YOLO models through feedback loops. Result: Model variants produced by LLM-GE significantly improve performance on the KITTI dataset, e.g., mean average precision (mAP) rises from 92.5% to 94.5%. Conclusion: LLM-GE combines LLM-driven reasoning with evolutionary strategies, offering a flexible and efficient new paradigm for automated machine learning. Abstract: In machine learning, Neural Architecture Search (NAS) requires domain knowledge of model design and a large amount of trial-and-error to achieve promising performance. Meanwhile, evolutionary algorithms have traditionally relied on fixed rules and pre-defined building blocks. The Large Language Model (LLM)-Guided Evolution (GE) framework transformed this approach by incorporating LLMs to directly modify model source code for image classification algorithms on CIFAR data and intelligently guide mutations and crossovers. A key element of LLM-GE is the "Evolution of Thought" (EoT) technique, which establishes feedback loops, allowing LLMs to refine their decisions iteratively based on how previous operations performed. In this study, we perform NAS for object detection by improving LLM-GE to modify the architecture of You Only Look Once (YOLO) models to enhance performance on the KITTI dataset. Our approach intelligently adjusts the design and settings of YOLO to find the optimal algorithms against objectives such as detection accuracy and speed. We show that LLM-GE produced variants with significant performance improvements, such as an increase in Mean Average Precision from 92.5% to 94.5%. This result highlights the flexibility and effectiveness of LLM-GE on real-world challenges, offering a novel paradigm for automated machine learning that combines LLM-driven reasoning with evolutionary strategies.

Towards Assessing Deep Learning Test Input Generators

Seif Mzoughi,Ahmed Hajyahmed,Mohamed Elshafei,Foutse Khomh,Diego Elias Costa

Task: Comprehensively assess four state-of-the-art test input generators (TIGs) across multiple critical dimensions.

Motivation: Deep learning systems are increasingly deployed in safety-critical applications, but their robustness issues can lead to serious failures, and existing evaluations of TIGs are not comprehensive.

Details

Method: Evaluates four TIGs (DeepHunter, DeepFault, AdvGAN, SinVAD) using three pre-trained models (LeNet-5, VGG16, EfficientNetB3) on datasets of varying complexity (MNIST, CIFAR-10, ImageNet-1K). Result: TIGs differ markedly in robustness-revealing capability, test case diversity, and computational efficiency, and their performance varies with dataset complexity. Conclusion: Provides practical guidance for selecting TIGs suited to specific objectives and dataset characteristics; further work is needed to improve TIGs for real-world safety-critical systems. Abstract: Deep Learning (DL) systems are increasingly deployed in safety-critical applications, yet they remain vulnerable to robustness issues that can lead to significant failures. While numerous Test Input Generators (TIGs) have been developed to evaluate DL robustness, a comprehensive assessment of their effectiveness across different dimensions is still lacking. This paper presents a comprehensive assessment of four state-of-the-art TIGs--DeepHunter, DeepFault, AdvGAN, and SinVAD--across multiple critical aspects: fault-revealing capability, naturalness, diversity, and efficiency. Our empirical study leverages three pre-trained models (LeNet-5, VGG16, and EfficientNetB3) on datasets of varying complexity (MNIST, CIFAR-10, and ImageNet-1K) to evaluate TIG performance. Our findings reveal important trade-offs in robustness revealing capability, variation in test case generation, and computational efficiency across TIGs. The results also show that TIG performance varies significantly with dataset complexity, as tools that perform well on simpler datasets may struggle with more complex ones. In contrast, others maintain steadier performance or better scalability. This paper offers practical guidance for selecting appropriate TIGs aligned with specific objectives and dataset characteristics. Nonetheless, more work is needed to address TIG limitations and advance TIGs for real-world, safety-critical systems.

Determining Sphere Radius through Pairwise Distances

Boris Sukhovilov

Task: Propose a new method for determining the radius of a sphere from distances measured between points on its surface.

Motivation: Address radius determination when the distance measurements contain errors and the sphere deviates randomly from its ideal shape.

Details

Method: Uses the minimum of four points, as well as an arbitrary number N of points, to give a closed-form solution via the pairwise distance matrix, and determines the standard deviation of the radius estimate. Result: Presents a closed-form solution for the sphere radius and finds optimal point configurations that minimize the standard deviation of the radius estimate. Conclusion: The method is effective and an open-source implementation is available. Abstract: We propose a novel method for determining the radius of a spherical surface based on the distances measured between points on this surface. We consider the most general case of determining the radius when the distances are measured with errors and the sphere has random deviations from its ideal shape. For the solution, we used the minimally necessary four points and an arbitrary number N of points. We provide a new closed form solution for the radius of the sphere through the matrix of pairwise distances. We also determine the standard deviation of the radius estimate caused by measurement errors and deviations of the sphere from its ideal shape. We found optimal configurations of points on the sphere that provide the minimum standard deviation of the radius estimate. This paper describes our solution and provides all the mathematical derivations. We share the implementation of our method as open source code at https://github.com/boris-sukhovilov/Sphere_Radius.
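
For the noise-free four-point case, a radius can be recovered from pairwise distances with a classical identity that follows from the Cayley-Menger determinant: with D the matrix of squared distances, R^2 = 1 / (2 * 1^T D^{-1} 1). The sketch below illustrates that identity only; it is an assumption that this resembles the paper's formulation, and it does not cover measurement errors, shape deviations, or the N-point estimator. The function name is hypothetical.

```python
import numpy as np

def sphere_radius_from_distances(dist):
    """Radius of the sphere through four non-coplanar points.

    dist is the 4x4 matrix of pairwise Euclidean distances (noise-free).
    """
    D = np.asarray(dist, dtype=float) ** 2  # squared distances
    ones = np.ones(D.shape[0])
    # Classical identity: R^2 = 1 / (2 * 1^T D^{-1} 1).
    return float(np.sqrt(1.0 / (2.0 * (ones @ np.linalg.solve(D, ones)))))

# Four non-coplanar points on a sphere of radius 2.5 centered at the origin.
pts = 2.5 * np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0],
                      [-1.0, 0.0, 0.0]])
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
radius = sphere_radius_from_distances(dist)
print(radius)
```

Only distances enter the computation, so the sphere's center and orientation never need to be estimated.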

MG-Gen: Single Image to Motion Graphics Generation with Layer Decomposition

Takahiro Shirakawa,Tomoyuki Suzuki,Daichi Haraguchi

Task: Propose MG-Gen, a new framework that reconstructs vector data from a single raster image to support motion graphics generation.

Motivation: Existing image-to-video generation methods suffer from insufficient text motion and object distortion in motion graphics generation, and code-based methods require vector data.

Details

Method: MG-Gen decomposes the input image into layers, reconstructs them as HTML data, and generates executable JavaScript code. Result: Experiments show that MG-Gen generates motion graphics while preserving text readability and input consistency. Conclusion: Combining layer decomposition with animation code generation is an effective strategy for motion graphics generation. Abstract: General image-to-video generation methods often produce suboptimal animations that do not meet the requirements of animated graphics, as they lack active text motion and exhibit object distortion. Also, code-based animation generation methods typically require layer-structured vector data which are often not readily available for motion graphic generation. To address these challenges, we propose a novel framework named MG-Gen that reconstructs data in vector format from a single raster image to extend the capabilities of code-based methods to enable motion graphics generation from a raster image in the framework of general image-to-video generation. MG-Gen first decomposes the input image into layer-wise elements, reconstructs them as HTML format data and then generates executable JavaScript code for the reconstructed HTML data. We experimentally confirm that MG-Gen generates motion graphics while preserving text readability and input consistency. These successful results indicate that combining layer decomposition and animation code generation is an effective strategy for motion graphics generation.

HPGN: Hybrid Priors-Guided Network for Compressed Low-Light Image Enhancement

Hantang Li,Jinhua Hao,Lei Xiong,Shuyuan Zhu

Task: Propose a hybrid priors-guided network (HPGN) for enhancing compressed low-light images.

Motivation: Existing methods either neglect the removal of compression artifacts during enhancement or fail to establish a unified joint-task enhancement framework for images of varying compression quality.

Details

Method: Integrates compression and illumination priors, using the JPEG quality factor (QF) and the DCT quantization matrix (QM) to design efficient joint-task plug-and-play modules, and adopts a random QF generation strategy to guide model training. Result: Experimental results show the superiority of the proposed method in enhancing images across different compression levels. Conclusion: HPGN effectively enhances compressed low-light images and applies to images of varying compression quality. Abstract: In practical applications, conventional methods generate large volumes of low-light images that require compression for efficient storage and transmission. However, most existing methods either disregard the removal of potential compression artifacts during the enhancement process or fail to establish a unified framework for joint task enhancement of images with varying compression qualities. To solve this problem, we propose the hybrid priors-guided network (HPGN), which enhances compressed low-light images by integrating both compression and illumination priors. Our approach fully utilizes the JPEG quality factor (QF) and DCT quantization matrix (QM) to guide the design of efficient joint task plug-and-play modules. Additionally, we employ a random QF generation strategy to guide model training, enabling a single model to enhance images across different compression levels. Experimental results confirm the superiority of our proposed method.

Benchmark of Segmentation Techniques for Pelvic Fracture in CT and X-ray: Summary of the PENGWIN 2024 Challenge

Yudi Sang,Yanzhen Liu,Sutuke Yibulayimu,Yunning Wang,Benjamin D. Killeen,Mingxu Liu,Ping-Cheng Ku,Ole Johannsen,Karol Gotkowski,Maximilian Zenk,Klaus Maier-Hein,Fabian Isensee,Peiyan Yue,Yi Wang,Haidong Yu,Zhaohong Pan,Yutong He,Xiaokun Liang,Daiqi Liu,Fuxin Fan,Artur Jurgas,Andrzej Skalski,Yuxi Ma,Jing Yang,Szymon Płotka,Rafał Litka,Gang Zhu,Yingchun Song,Mathias Unberath,Mehran Armand,Dan Ruan,S. Kevin Zhou,Qiyong Cao,Chunpeng Zhao,Xinbao Wu,Yu Wang

Task: The task is to segment pelvic fracture fragments in CT and X-ray images for trauma diagnosis, surgical planning, and intraoperative guidance.

Motivation: The motivation is to address the challenge of accurately and efficiently delineating bone fragments due to complex anatomy and imaging limitations.

Details

Method: The method involves benchmarking state-of-the-art algorithms on a diverse dataset of 150 CT scans and simulated X-ray images generated using the DeepDRR method, with submissions from 16 teams evaluated under a multi-metric testing scheme. Result: The top-performing CT algorithm achieved an average fragment-wise IoU of 0.930, while the best X-ray algorithm attained an IoU of 0.774, highlighting greater challenges in X-ray segmentation. Conclusion: The conclusion suggests that interactive segmentation approaches, integrating human decision-making, may be essential for improving model reliability and clinical applicability. Abstract: The segmentation of pelvic fracture fragments in CT and X-ray images is crucial for trauma diagnosis, surgical planning, and intraoperative guidance. However, accurately and efficiently delineating the bone fragments remains a significant challenge due to complex anatomy and imaging limitations. The PENGWIN challenge, organized as a MICCAI 2024 satellite event, aimed to advance automated fracture segmentation by benchmarking state-of-the-art algorithms on these complex tasks. A diverse dataset of 150 CT scans was collected from multiple clinical centers, and a large set of simulated X-ray images was generated using the DeepDRR method. Final submissions from 16 teams worldwide were evaluated under a rigorous multi-metric testing scheme. The top-performing CT algorithm achieved an average fragment-wise intersection over union (IoU) of 0.930, demonstrating satisfactory accuracy. However, in the X-ray task, the best algorithm attained an IoU of 0.774, highlighting the greater challenges posed by overlapping anatomical structures. Beyond the quantitative evaluation, the challenge revealed methodological diversity in algorithm design. Variations in instance representation, such as primary-secondary classification versus boundary-core separation, led to differing segmentation strategies. 
Despite promising results, the challenge also exposed inherent uncertainties in fragment definition, particularly in cases of incomplete fractures. These findings suggest that interactive segmentation approaches, integrating human decision-making with task-relevant information, may be essential for improving model reliability and clinical applicability.
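
The IoU figures quoted above are per-fragment overlap scores. As a minimal sketch of how such a score can be computed, the code below averages mask IoU over fragment labels. It assumes predicted fragments already carry the same label IDs as the ground truth and that label 0 is background; the challenge's actual matching and aggregation rules are not described in this summary, and the function names are hypothetical.

```python
import numpy as np

def iou(pred_mask, true_mask):
    # Intersection over union of two boolean masks.
    pred_mask = np.asarray(pred_mask, dtype=bool)
    true_mask = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred_mask, true_mask).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred_mask, true_mask).sum() / union)

def mean_fragment_iou(pred_labels, true_labels):
    # Average IoU over ground-truth fragment IDs (0 is assumed to be background).
    fragment_ids = [i for i in np.unique(true_labels) if i != 0]
    return float(np.mean([iou(pred_labels == i, true_labels == i)
                          for i in fragment_ids]))

true_labels = np.array([[1, 1, 0],
                        [2, 2, 0]])
pred_labels = np.array([[1, 0, 0],
                        [2, 2, 2]])
score = mean_fragment_iou(pred_labels, true_labels)
print(score)
```

In this toy case fragment 1 scores 1/2 and fragment 2 scores 2/3, so a single missed or spilled pixel already moves the average noticeably on small fragments.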

Translation of Fetal Brain Ultrasound Images into Pseudo-MRI Images using Artificial Intelligence

Naomi Silverstein,Efrat Leibowitz,Ron Beloosesky,Haim Azhari

Task: Use artificial intelligence to translate ultrasound images into MRI-like images to improve visual discrimination of fetal brain tissue.

Motivation: Ultrasound is widely used for fetal brain evaluation but has limited image quality, whereas MRI offers high image quality at greater cost and acquisition time, so a middle-ground approach is needed.

Details

Method: Applies the diffusion-based "Dual Diffusion Imposed Correlation" (DDIC) method, which assumes a shared latent space between the ultrasound and MRI domains, trained on the HC18, CRL fetal brain atlas, and FeTA datasets. Result: The generated pseudo-MRI images markedly improve visual discrimination of brain tissue (especially the lateral ventricles and the Sylvian fissure); several metrics (mutual information, peak signal-to-noise ratio, and others) show DDIC outperforming other methods, and a medical opinion test found improvement in 81% of the images. Conclusion: Pseudo-MRI images may streamline diagnosis and enhance clinical outcomes through improved image representation. Abstract: Ultrasound is a widely accessible and cost-effective medical imaging tool commonly used for prenatal evaluation of the fetal brain. However, it has limitations, particularly in the third trimester, where the complexity of the fetal brain requires high image quality for extracting quantitative data. In contrast, magnetic resonance imaging (MRI) offers superior image quality and tissue differentiation but is less available, expensive, and requires time-consuming acquisition. Thus, transforming ultrasonic images into an MRI-mimicking display may be advantageous and allow better tissue anatomy presentation. To address this goal, we have examined the use of artificial intelligence, implementing a diffusion model renowned for generating high-quality images. The proposed method, termed "Dual Diffusion Imposed Correlation" (DDIC), leverages a diffusion-based translation methodology, assuming a shared latent space between ultrasound and MRI domains. The model was trained using the "HC18" dataset for ultrasound and the "CRL fetal brain atlas" along with the "FeTA" datasets for MRI. The generated pseudo-MRI images provide notable improvements in visual discrimination of brain tissue, especially in the lateral ventricles and the Sylvian fissure, characterized by enhanced contrast clarity. Improvement was demonstrated in Mutual information, Peak signal-to-noise ratio, Fréchet Inception Distance, and Contrast-to-noise ratio. Findings from these evaluations indicate statistically significant superior performance of the DDIC compared to other translation methodologies. In addition, a Medical Opinion Test was obtained from 5 gynecologists. The results demonstrated display improvement in 81% of the tested images.
In conclusion, the presented pseudo-MRI images hold the potential for streamlining diagnosis and enhancing clinical outcomes through improved representation.

Estimating Scene Flow in Robot Surroundings with Distributed Miniaturized Time-of-Flight Sensors

Jack Sander,Giammarco Caroleo,Alessandro Albini,Perla Maiolino

Task: Propose a method for estimating scene flow from low-density, noisy point clouds to improve robot safety and reactivity.

Motivation: Tracking the motion of people or objects around a robot is essential for making robot motion safer.

Details

Method: Clusters points from consecutive frames and applies Iterative Closest Point (ICP) to estimate a dense motion flow, combined with a fitness-based classification and an inlier removal strategy to mitigate the effects of noise and low-density data. Result: Experimental validation shows that the method accurately estimates motion direction and speed, with an error in line with sensor noise. Conclusion: The proposed method is effective on low-density, noisy point clouds and is suitable for scene flow estimation in robot surroundings. Abstract: Tracking motions of humans or objects in the surroundings of the robot is essential to improve safe robot motions and reactions. In this work, we present an approach for scene flow estimation from low-density and noisy point clouds acquired from miniaturized Time of Flight (ToF) sensors distributed on the robot body. The proposed method clusters points from consecutive frames and applies Iterative Closest Point (ICP) to estimate a dense motion flow, with additional steps introduced to mitigate the impact of sensor noise and low-density data points. Specifically, we employ a fitness-based classification to distinguish between stationary and moving points and an inlier removal strategy to refine geometric correspondences. The proposed approach is validated in an experimental setup where 24 ToF sensors are used to estimate the velocity of an object moving at different controlled speeds. Experimental results show that the method consistently approximates the direction of the motion and its magnitude with an error which is in line with sensor noise.
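
The core ICP step can be sketched in a few lines. The code below is a generic point-to-point ICP with brute-force nearest neighbors and a Kabsch alignment, applied to one cluster seen in two consecutive frames; it is an illustration under simplifying assumptions (rigid motion, no noise, hypothetical frame interval `dt`) and omits the paper's clustering, fitness-based classification, and inlier removal.

```python
import numpy as np

def best_rigid_transform(src, dst):
    # Kabsch algorithm: least-squares R, t such that dst ~ src @ R.T + t.
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0  # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, dst_c - R @ src_c

def icp(src, dst, iters=10):
    # Point-to-point ICP; returns the accumulated transform and the RMS residual.
    R_tot, t_tot = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        matched = dst[d2.argmin(axis=1)]  # nearest neighbor in dst for each point
        R, t = best_rigid_transform(cur, matched)
        cur = cur @ R.T + t
        R_tot, t_tot = R @ R_tot, R @ t_tot + t
    d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
    rms = float(np.sqrt(d2.min(axis=1).mean()))
    return R_tot, t_tot, rms

# One cluster observed in two consecutive frames, shifted by a small translation.
axis = 0.5 * np.arange(3)
frame0 = np.stack(np.meshgrid(axis, axis, axis), axis=-1).reshape(-1, 3)
frame1 = frame0 + np.array([0.05, 0.0, -0.02])
dt = 0.1  # assumed time between frames, in seconds

R, t, rms = icp(frame0, frame1)
velocity = t / dt  # per-cluster flow estimate
print(velocity, rms)
```

The RMS residual returned here is one possible fitness score; a threshold on it could separate well-aligned clusters from poorly matched ones, though the thresholds used in the paper are not given in this summary.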

RASP: Revisiting 3D Anamorphic Art for Shadow-Guided Packing of Irregular Objects

Soumyaratna Debnath,Ashish Tiwari,Kaustubh Sadekar,Shanmuganathan Raman

Task: Arrange arbitrarily shaped 3D objects within a bounded volume via shadow-guided optimization.

Motivation: Draw on the principles of 3D anamorphic art to explore how computational models can arrange 3D objects efficiently for artistic expression.

Details

Method: Proposes RASP, a framework based on differentiable rendering with an SDF-based formulation to handle inter-object intersection and container extrusion. Result: Presents artistic illustrations of multi-view anamorphic art that achieve meaningful expressions from multiple viewpoints. Conclusion: RASP performs well in 3D object arrangement and part assembly, and extends the applications of multi-view anamorphic art. Abstract: Recent advancements in learning-based methods have opened new avenues for exploring and interpreting art forms, such as shadow art, origami, and sketch art, through computational models. One notable visual art form is 3D Anamorphic Art in which an ensemble of arbitrarily shaped 3D objects creates a realistic and meaningful expression when observed from a particular viewpoint and loses its coherence over the other viewpoints. In this work, we build on insights from 3D Anamorphic Art to perform 3D object arrangement. We introduce RASP, a differentiable-rendering-based framework to arrange arbitrarily shaped 3D objects within a bounded volume via shadow (or silhouette)-guided optimization with an aim of minimal inter-object spacing and near-maximal occupancy. Furthermore, we propose a novel SDF-based formulation to handle inter-object intersection and container extrusion. We demonstrate that RASP can be extended to part assembly alongside object packing considering 3D objects to be "parts" of another 3D object. Finally, we present artistic illustrations of multi-view anamorphic art, achieving meaningful expressions from multiple viewpoints within a single ensemble.

Adaptive path planning for efficient object search by UAVs in agricultural fields

Rick van Essen,Eldert van Henten,Lammert Kooistra,Gert Kootstra

Task: Develop an adaptive path planner for object search by UAVs in agricultural fields.

Motivation: Improve the efficiency of UAV object search in agricultural fields, especially when objects are non-uniformly distributed.

Details

Method: Uses a high-altitude coverage flight path and plans additional low-altitude inspections when the detection network is uncertain; validated with a YOLOv8 detection network in a simulation environment. Result: With non-uniformly distributed objects, the adaptive planner yields a shorter path with detection accuracy comparable to a coverage path planner. Conclusion: The adaptive path planner finds non-uniformly distributed objects faster and is robust to localization errors. Abstract: This paper presents an adaptive path planner for object search in agricultural fields using UAVs. The path planner uses a high-altitude coverage flight path and plans additional low-altitude inspections when the detection network is uncertain. The path planner was evaluated in an offline simulation environment containing real-world images. We trained a YOLOv8 detection network to detect artificial plants placed in grass fields to showcase the potential of our path planner. We evaluated the effect of different detection certainty measures, optimized the path planning parameters, and investigated the effects of localization errors and different numbers of objects in the field. The YOLOv8 detection confidence worked best to differentiate between true and false positive detections and was therefore used in the adaptive planner. The optimal parameters of the path planner depended on the distribution of objects in the field. When the objects were uniformly distributed, more low-altitude inspections were needed compared to a non-uniform distribution of objects, resulting in a longer path length. The adaptive planner proved to be robust against localization uncertainty. When increasing the number of objects, the flight path length increased, especially when the objects were uniformly distributed. When the objects were non-uniformly distributed, the adaptive path planner yielded a shorter path than a low-altitude coverage path, even with a high number of objects. Overall, the presented adaptive path planner made it possible to find non-uniformly distributed objects in a field faster than a coverage path planner and resulted in a comparable detection accuracy. The path planner is made available at https://github.com/wur-abe/uav_adaptive_planner.

Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision

Xiaofeng Han,Shunpeng Chen,Zenghuang Fu,Zhe Feng,Lue Fan,Dong An,Changwei Wang,Li Guo,Weiliang Meng,Xiaopeng Zhang,Rongtao Xu,Shibiao Xu

Task: Systematically survey the applications of multimodal fusion techniques and vision-language models in robot vision tasks.

Motivation: Examine the advantages, limitations, and synergies of multimodal fusion and vision-language models in robot vision to provide a reference for related research.

Details

Method: Compares traditional multimodal fusion methods with LLM-based vision-language models and analyzes the applicability and challenges of commonly used datasets. Result: Summarizes the key research challenges of multimodal fusion in robot vision and proposes future research directions. Conclusion: Through a comprehensive review and forward-looking discussion, provides a valuable reference for multimodal perception and interaction in robot vision. Abstract: Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We systematically review the applications of multimodal fusion in key robotic vision tasks, including semantic scene understanding, simultaneous localization and mapping (SLAM), 3D object detection, navigation and localization, and robot manipulation. We compare VLMs based on large language models (LLMs) with traditional multimodal fusion methods, analyzing their advantages, limitations, and synergies. Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Furthermore, we identify critical research challenges such as cross-modal alignment, efficient fusion strategies, real-time deployment, and domain adaptation, and propose future research directions, including self-supervised learning for robust multimodal representations, transformer-based fusion architectures, and scalable multimodal frameworks. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at https://github.com/Xiaofeng-Han-Res/MF-RV.

Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

Yan Ma,Steffi Chern,Xuyang Shen,Yiran Zhong,Pengfei Liu

Task: Propose a transparent, from-scratch reinforcement learning framework for vision-language models (VLMs) and validate its effectiveness.

Motivation: Existing applications of reinforcement learning to VLMs rely on complex frameworks and lack reproducibility and standardized evaluation protocols.

Details

Method: Designs a minimal yet functional four-step pipeline and proposes a standardized evaluation scheme. Result: Experiments find that response length is sensitive to random seeds, reflection correlates with output length, and reinforcement learning generalizes better than supervised fine-tuning. Conclusion: The framework and findings aim to establish a reproducible baseline and support broader engagement in RL-based VLM research. Abstract: Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.

Efficient Model Editing with Task-Localized Sparse Fine-tuning

Leonardo Iurada,Marco Ciccone,Tatiana Tommasi

Task: Propose TaLoS, a method for building sparse task vectors that enable modular parameter editing.

Motivation: Existing methods rely on network linearization to derive task vectors, which creates computational bottlenecks and does not ensure weight disentanglement, limiting conflict-free composition of task vectors.

Details

Method: Identifies a subset of parameters in the pre-trained model with low gradient sensitivity and sparsely updates only these parameters to promote weight disentanglement. Result: TaLoS improves training and inference efficiency over existing methods and performs better in task addition and negation. Conclusion: By enabling modular parameter editing, TaLoS offers a practical route to deploying adaptable foundation models in real-world applications. Abstract: Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS which allows building sparse task vectors with minimal interference without requiring explicit linearization and sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters allows for promoting weight disentanglement during fine-tuning. Our experiments prove that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
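
The idea of sparse task vectors restricted to low-sensitivity parameters can be illustrated on flat parameter arrays. The sketch below is a simplification, not TaLoS itself: the sensitivity score (maximum absolute gradient across tasks), the `keep_ratio` parameter, and all function names are assumptions, and the mask is applied to an already fine-tuned difference, whereas the abstract describes restricting the updates during fine-tuning.

```python
import numpy as np

def low_sensitivity_mask(grads_per_task, keep_ratio):
    # Score each parameter by its largest absolute gradient across tasks
    # (an assumed proxy for "gradient sensitivity") and keep the lowest scores.
    sensitivity = np.max(np.abs(np.stack(grads_per_task)), axis=0)
    k = max(1, int(round(keep_ratio * sensitivity.size)))
    threshold = np.sort(sensitivity.ravel())[k - 1]
    return sensitivity <= threshold

def sparse_task_vector(theta_pre, theta_ft, mask):
    # Task vector = fine-tuned minus pre-trained weights, kept only where mask is True.
    return (theta_ft - theta_pre) * mask

def apply_task_vectors(theta_pre, vectors, coeffs):
    # Task arithmetic: positive coefficients add a task, negative ones negate it.
    return theta_pre + sum(c * v for c, v in zip(coeffs, vectors))

theta_pre = np.zeros(6)
theta_ft = np.ones(6)
grads = [np.array([0.1, 5.0, 0.2, 4.0, 0.05, 3.0]),
         np.array([0.2, 6.0, 0.1, 5.0, 0.1, 2.0])]

mask = low_sensitivity_mask(grads, keep_ratio=0.5)
tau = sparse_task_vector(theta_pre, theta_ft, mask)
theta_added = apply_task_vectors(theta_pre, [tau], [1.0])
theta_negated = apply_task_vectors(theta_pre, [tau], [-1.0])
print(mask, theta_added, theta_negated)
```

Because each task vector touches only the masked entries, adding or negating it leaves the remaining parameters at their pre-trained values.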

Towards Computation- and Communication-efficient Computational Pathology

Chu Han,Bingchao Zhao,Jiatai Lin,Shanshan Lyu,Longfei Wang,Tianpeng Deng,Cheng Lu,Changhong Liang,Hannah Y. Wen,Xiaojing Guo,Zhenwei Shi,Zaiyi Liu

Task: Propose MAGA-GLTrans, a computation- and communication-efficient framework that addresses the efficiency problems of current computational pathology models in high-magnification whole-slide image analysis.

Motivation: Current computational pathology models rely on high-magnification image analysis, which lowers diagnostic efficiency and limits clinical use in time-sensitive scenarios.

Details

Method: The proposed magnification alignment (MAGA) mechanism uses self-supervised learning to align the feature representations of low- and high-magnification images, reducing computation time and storage requirements. Result: MAGA-GLTrans performs well across a range of tasks, with up to a 10.7-fold reduction in computation time and more than a 20-fold reduction in file transfer and storage requirements. Conclusion: MAGA-GLTrans shows strong potential for time-sensitive applications, particularly intraoperative frozen section diagnosis, while maintaining high accuracy. Abstract: Despite the impressive performance across a wide range of applications, current computational pathology models face significant diagnostic efficiency challenges due to their reliance on high-magnification whole-slide image analysis. This limitation severely compromises their clinical utility, especially in time-sensitive diagnostic scenarios and situations requiring efficient data transfer. To address these issues, we present a novel computation- and communication-efficient framework called Magnification-Aligned Global-Local Transformer (MAGA-GLTrans). Our approach significantly reduces computational time, file transfer requirements, and storage overhead by enabling effective analysis using low-magnification inputs rather than high-magnification ones. The key innovation lies in our proposed magnification alignment (MAGA) mechanism, which employs self-supervised learning to bridge the information gap between low and high magnification levels by effectively aligning their feature representations. Through extensive evaluation across various fundamental CPath tasks, MAGA-GLTrans demonstrates state-of-the-art classification performance while achieving remarkable efficiency gains: up to 10.7 times reduction in computational time and over 20 times reduction in file transfer and storage requirements. Furthermore, we highlight the versatility of our MAGA framework through two significant extensions: (1) its applicability as a feature extractor to enhance the efficiency of any CPath architecture, and (2) its compatibility with existing foundation models and histopathology-specific encoders, enabling them to process low-magnification inputs with minimal information loss.
These advancements position MAGA-GLTrans as a particularly promising solution for time-sensitive applications, especially in the context of intraoperative frozen section diagnosis where both accuracy and efficiency are paramount.

Adaptive Frequency Enhancement Network for Remote Sensing Image Semantic Segmentation

Feng Gao,Miao Fu,Jingchao Cao,Junyu Dong,Qian Du

Task: Propose an Adaptive Frequency Enhancement Network (AFENet) for semantic segmentation of high-resolution remote sensing images.

Motivation: Existing methods struggle to adapt to varying land cover distributions and to enhance the interaction between spatial and frequency domain features.

Details

Method: AFENet comprises the Adaptive Frequency and Spatial feature Interaction Module (AFSIM), which dynamically modulates high- and low-frequency features, and the Selective feature Fusion Module (SFM), which selectively fuses global and local features. Result: On three public datasets, AFENet outperforms existing methods, and the effectiveness of AFSIM and SFM is validated. Conclusion: By enhancing the interaction between frequency and spatial features, AFENet significantly improves semantic segmentation performance. Abstract: Semantic segmentation of high-resolution remote sensing images plays a crucial role in land-use monitoring and urban planning. Recent remarkable progress in deep learning-based methods makes it possible to generate satisfactory segmentation results. However, existing methods still face challenges in adapting network parameters to various land cover distributions and enhancing the interaction between spatial and frequency domain features. To address these challenges, we propose the Adaptive Frequency Enhancement Network (AFENet), which integrates two key components: the Adaptive Frequency and Spatial feature Interaction Module (AFSIM) and the Selective feature Fusion Module (SFM). AFSIM dynamically separates and modulates high- and low-frequency features according to the content of the input image. It adaptively generates two masks to separate high- and low-frequency components, therefore providing optimal details and contextual supplementary information for ground object feature representation. SFM selectively fuses global context and local detailed features to enhance the network's representation capability. Hence, the interactions between frequency and spatial features are further enhanced. Extensive experiments on three publicly available datasets demonstrate that the proposed AFENet outperforms state-of-the-art methods. In addition, we also validate the effectiveness of AFSIM and SFM in managing diverse land cover types and complex scenarios. Our code is available at https://github.com/oucailab/AFENet.

BECAME: BayEsian Continual Learning with Adaptive Model MErging

Mei Li,Yuxiang Lu,Qinyan Dai,Suizhi Huang,Yue Ding,Hongtao Lu

Task: Explore how model merging can improve the stability-plasticity trade-off in continual learning.

Motivation: Gradient projection methods in continual learning ensure stability but limit plasticity, while existing model merging methods rely on empirical assumptions and hyperparameter selection.

Details

Method: Reformulates the merging mechanism using Bayesian continual learning principles and proposes BECAME, a two-stage framework that combines gradient projection and adaptive merging. Result: Experiments show that BECAME outperforms existing continual learning methods and merging strategies. Conclusion: With a theoretically grounded adaptive mechanism, model merging significantly improves the stability-plasticity trade-off. Abstract: Continual Learning (CL) strives to learn incrementally across tasks while mitigating catastrophic forgetting. A key challenge in CL is balancing stability (retaining prior knowledge) and plasticity (learning new tasks). While representative gradient projection methods ensure stability, they often limit plasticity. Model merging techniques offer promising solutions, but prior methods typically rely on empirical assumptions and carefully selected hyperparameters. In this paper, we explore the potential of model merging to enhance the stability-plasticity trade-off, providing theoretical insights that underscore its benefits. Specifically, we reformulate the merging mechanism using Bayesian continual learning principles and derive a closed-form solution for the optimal merging coefficient that adapts to the diverse characteristics of tasks. To validate our approach, we introduce a two-stage framework named BECAME, which synergizes the expertise of gradient projection and adaptive merging. Extensive experiments show that our approach outperforms state-of-the-art CL methods and existing merging strategies.

Spline-based Transformers

Prashanth Chandran,Agon Serifi,Markus Gross,Moritz Bächer

Task: Propose Spline-based Transformers, a new class of Transformer models that need no positional encoding.

Motivation: Overcome limitations of positional encoding in conventional Transformers, such as sequence length extrapolation, and give users a new way to manipulate the latent space directly.

Details

Method: Inspired by spline workflows in computer animation, embeds the input sequence as a smooth trajectory in latent space. Result: Shows better performance than conventional positional encoding on a variety of datasets, including synthetic 2D data, large-scale real-world images, 3D shapes, and animation data. Conclusion: Spline-based Transformers not only address the problems of positional encoding but also offer a more flexible way to interact with the latent space. Abstract: We introduce Spline-based Transformers, a novel class of Transformer models that eliminate the need for positional encoding. Inspired by workflows using splines in computer animation, our Spline-based Transformers embed an input sequence of elements as a smooth trajectory in latent space. Overcoming drawbacks of positional encoding such as sequence length extrapolation, Spline-based Transformers also provide a novel way for users to interact with transformer latent spaces by directly manipulating the latent control points to create new latent trajectories and sequences. We demonstrate the superior performance of our approach in comparison to conventional positional encoding on a variety of datasets, ranging from synthetic 2D to large-scale real-world datasets of images, 3D shapes, and animations.
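
To make "a smooth trajectory in latent space" concrete, the sketch below evaluates a curve from a handful of latent control points and samples one latent per sequence position. The abstract does not state which spline family the authors use; a single cubic Bézier segment is assumed here purely for illustration, and the function name and dimensions are hypothetical.

```python
import numpy as np

def cubic_bezier(control_points, num_steps):
    """Sample a cubic Bezier curve defined by four control points of shape (4, d)."""
    P = np.asarray(control_points, dtype=float)
    t = np.linspace(0.0, 1.0, num_steps)[:, None]
    # Bernstein basis for degree 3, one row per sample time.
    basis = np.hstack([(1 - t) ** 3,
                       3 * t * (1 - t) ** 2,
                       3 * t ** 2 * (1 - t),
                       t ** 3])
    return basis @ P  # shape (num_steps, d)

# Four latent control points in a 2-dimensional latent space (toy values).
control = np.array([[0.0, 0.0],
                    [1.0, 2.0],
                    [3.0, 2.0],
                    [4.0, 0.0]])
trajectory = cubic_bezier(control, num_steps=5)  # one latent per sequence position
print(trajectory)
```

Sampling the same control points with a different `num_steps` yields a longer or shorter sequence along the same curve, and moving a control point reshapes the whole trajectory, which is the kind of direct manipulation the abstract refers to.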