2025 03 27

Untangling the Influence of Typology, Data and Model Architecture on Ranking Transfer Languages for Cross-Lingual POS Tagging

Enora Rice,Ali Marashian,Hannah Haynie,Katharina von der Wense,Alexis Palmer

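Task: Study how to select a suitable transfer language in cross-lingual transfer learning, with a focus on part-of-speech tagging.

Motivation: Cross-lingual transfer addresses data scarcity, but choosing a transfer language remains difficult, and the roles of linguistic typology, training data, and model architecture are still unclear.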

Details

Method: Takes a holistic approach, combining dataset-specific features with fine-grained typological features to analyze their influence on transfer language selection, and extends the analysis to zero-shot prediction with pretrained multilingual models.

Result: Word overlap, type-token ratio, and genealogical distance are top features across all architectures; combining typological and dataset features yields the best rankings.

Conclusion: A combination of typological and dataset-dependent features gives the best transfer language rankings, and either feature group on its own also performs well.

Abstract: Cross-lingual transfer learning is an invaluable tool for overcoming data scarcity, yet selecting a suitable transfer language remains a challenge. The precise roles of linguistic typology, training data, and model architecture in transfer language choice are not fully understood. We take a holistic approach, examining how both dataset-specific and fine-grained typological features influence transfer language selection for part-of-speech tagging, considering two different sources for morphosyntactic features. While previous work examines these dynamics in the context of bilingual biLSTMs, we extend our analysis to a more modern transfer learning pipeline: zero-shot prediction with pretrained multilingual models. We train a series of transfer language ranking systems and examine how different feature inputs influence ranker performance across architectures. Word overlap, type-token ratio, and genealogical distance emerge as top features across all architectures. Our findings reveal that a combination of typological and dataset-dependent features leads to the best rankings, and that good performance can be obtained with either feature group on its own.
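
Two of the dataset-dependent features named in this entry, type-token ratio and word overlap, are simple corpus statistics. The sketch below shows one common way to compute them; the function names and the exact normalisation of the overlap score are assumptions for illustration, not necessarily the definitions used in the paper.

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def word_overlap(source_tokens, target_tokens):
    """Shared word types, normalised by the sum of both vocabulary sizes.

    Assumption: this normalisation is one common choice; the paper may
    define overlap differently.
    """
    source_types, target_types = set(source_tokens), set(target_tokens)
    total = len(source_types) + len(target_types)
    if total == 0:
        return 0.0
    return len(source_types & target_types) / total


# Toy corpora standing in for a transfer-language and a target-language dataset.
source = "the cat sat on the mat".split()
target = "the dog sat on a log".split()

ttr_source = type_token_ratio(source)   # 5 types over 6 tokens
overlap = word_overlap(source, target)  # 3 shared types over 5 + 6 types
```

A ranker would take such scores, together with typological distances, as input features for each candidate transfer language.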

Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair

Maksim Borisov,Zhanibek Kozhirbayev,Valentin Malykh

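Task: Build a machine translation model for the code-switched Kazakh-Russian language pair without labeled data.

Motivation: Machine translation for low-resource language pairs is challenging, especially when speakers code-switch.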

Details

Method: An approach based on synthetic data generation.

Result: The model reaches 16.48 BLEU, close to an existing commercial system, and surpasses that system in human evaluation.

Conclusion: The proposed method is effective, and the paper presents the first Kazakh-Russian code-switching parallel corpus.

Abstract: Machine translation for low-resource language pairs is a challenging task. The task can become extremely difficult once a speaker uses code-switching. We propose a method to build a machine translation model for the code-switched Kazakh-Russian language pair with no labeled data. Our method is based on the generation of synthetic data. Additionally, we present the first code-switching Kazakh-Russian parallel corpus and the evaluation results, which include a model achieving 16.48 BLEU, almost reaching an existing commercial system and beating it in human evaluation.

Poor Alignment and Steerability of Large Language Models: Evidence from College Admission Essays

Jinsook Lee,AJ Alvero,Thorsten Joachims,René Kizilcec

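Task: Study how text generated by large language models (LLMs) behaves in formal communication, focusing on model alignment and steerability.

Motivation: Examine whether LLMs can emulate human writing styles in formal communication (such as college admission essays) and whether prompting can change their writing style.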

Details

Method: Compares lexical and sentence variation between essays by 30,000 human applicants and two kinds of LLM-generated essays (prompted with only the essay question, or with additional demographic information).

Result: LLM-generated essays are linguistically distinct from human essays, and prompting with specific demographic information does little to align the model.

Conclusion: Current LLMs have alignment and steerability problems and should be used with caution in high-stakes contexts.

Abstract: People are increasingly using technologies equipped with large language models (LLMs) to write texts for formal communication, which raises two important questions at the intersection of technology and society: Who do LLMs write like (model alignment); and can LLMs be prompted to change who they write like (model steerability). We investigate these questions in the high-stakes context of undergraduate admissions at a selective university by comparing lexical and sentence variation between essays written by 30,000 applicants to two types of LLM-generated essays: one prompted with only the essay question used by the human applicants; and another with additional demographic information about each applicant. We consistently find that both types of LLM-generated essays are linguistically distinct from human-authored essays, regardless of the specific model and analytical approach. Further, prompting a specific sociodemographic identity is remarkably ineffective in aligning the model with the linguistic patterns observed in human writing from this identity group. This holds along the key dimensions of sex, race, first-generation status, and geographic location. The demographically prompted and unprompted synthetic texts were also more similar to each other than to the human text, meaning that prompting did not alleviate homogenization. These issues of model alignment and steerability in current LLMs raise concerns about the use of LLMs in high-stakes contexts.

Cross-Tokenizer Distillation via Approximate Likelihood Matching

Benjamin Minixhofer,Edoardo Maria Ponti,Ivan Vulić

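Task: Develop a cross-tokenizer distillation method that removes the requirement that teacher and student models share a tokenizer.

Motivation: Current distillation methods mostly require the teacher and student to use the same tokenizer, which restricts their applicability.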

Details

Method: Proposes a pure distillation method that maximizes the similarity between student and teacher predictions without relying on a next-token prediction loss, while remaining robust to large mismatches in tokenizer and vocabulary.

Result: The method is validated on two use cases: 1) unprecedentedly effective transfer across tokenizers; 2) distilling a large maths-specialized LLM into a smaller model with competitive maths problem-solving performance.

Conclusion: The method substantially improves adaptability and interaction between different LLMs.

Abstract: Distillation has shown remarkable success in transferring knowledge from a Large Language Model (LLM) teacher to a student LLM. However, current distillation methods predominantly require the same tokenizer between the teacher and the student, restricting their applicability to only a small subset of teacher-student pairs. In this work, we develop a cross-tokenizer distillation method to solve this crucial deficiency. Our method is the first to enable cross-tokenizer distillation without a next-token prediction loss as the main objective, instead purely maximizing the student predictions' similarity to the teacher's predictions (known as pure distillation), while also being robust to large mismatches between the teacher and the student tokenizer function and vocabulary. Empirically, our method enables substantially improved performance as tested on two use cases. First, we show that viewing tokenizer transfer as self-distillation enables unprecedentedly effective transfer across tokenizers. We transfer (subword-level) Llama and Gemma models to byte-level tokenization more effectively than prior methods transfer to a similar subword tokenizer under a comparable training budget. Transferring different base models to the same tokenizer also enables ensembling them (e.g., via averaging their predicted probabilities), which boosts performance. Second, we use our cross-tokenizer distillation method to distil a large maths-specialized LLM into a smaller model, achieving competitive maths problem-solving performance. Overall, our results make substantial strides toward better adaptability and enhanced interaction between different LLMs.
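
The abstract mentions that models transferred to the same tokenizer can be ensembled by averaging their predicted probabilities. A minimal sketch of that averaging step follows, assuming each model returns a next-token distribution over the same shared vocabulary; the function name and toy numbers are hypothetical, and the distillation method itself is not shown.

```python
def average_probabilities(distributions):
    """Average several next-token distributions defined over one shared vocabulary."""
    if not distributions:
        raise ValueError("need at least one distribution")
    size = len(distributions[0])
    if any(len(dist) != size for dist in distributions):
        raise ValueError("all distributions must cover the same vocabulary")
    # zip(*distributions) groups the probabilities that each model assigns
    # to the same vocabulary entry.
    return [sum(column) / len(distributions) for column in zip(*distributions)]


# Two hypothetical models that now share a three-entry vocabulary.
model_a = [0.5, 0.25, 0.25]
model_b = [0.25, 0.5, 0.25]
ensemble = average_probabilities([model_a, model_b])
```

Averaging is only meaningful once the vocabularies match, which is why the shared tokenizer is the prerequisite here.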

A Study on the Matching Rate of Dance Movements Using 2D Skeleton Detection and 3D Pose Estimation: Why Is SEVENTEEN's Performance So Bita-Zoroi (Perfectly Synchronized)?

Atsushi Simojo,Harumi Haraguchi

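Task: Analyze the synchronization rate of SEVENTEEN's dance performances and the factors behind it.

Motivation: Despite SEVENTEEN's large number of members and notable height differences, its dance performances show exceptional unity within the K-pop industry, yet there is little concrete data to support the claimed synchronization rate.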

Details

Method: Applies 2D skeleton detection and 3D pose estimation to YouTube videos to evaluate joint angles, body part movements, and jumping and crouching motions.

Result: The analysis shows high consistency in the movement direction of body parts and in ankle and head positions during jumping and crouching.

Conclusion: SEVENTEEN's high synchronization rate is attributed to consistent movement direction and to synchronized ankle and head heights during jumping and crouching.

Abstract: SEVENTEEN is a K-pop group with a large number of members (13 in total) and, among K-pop groups, a significant physical disparity between its tallest and shortest members. However, despite their large numbers and physical differences, their dance performances exhibit unparalleled unity in the K-pop industry. According to one theory, their dance synchronization rate is said to be 90% or even 97%. However, there is little concrete data to substantiate this synchronization rate. In this study, we analyzed SEVENTEEN's dance performances using videos available on YouTube. We applied 2D skeleton detection and 3D pose estimation to evaluate joint angles, body part movements, and jumping and crouching motions to investigate the factors contributing to their performance unity. The analysis revealed exceptionally high consistency in the movement direction of body parts, as well as in the ankle and head positions during jumping movements and the head position during crouching movements. These findings suggest that SEVENTEEN's high synchronization rate can be attributed to the consistency of movement direction and the synchronization of ankle and head heights during jumping and crouching movements.
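
Joint angles of the kind evaluated in this entry can be derived from three detected keypoints. The sketch below computes the angle at a middle joint from 2D coordinates; the keypoint values and the function name are hypothetical, and this is a generic geometric calculation rather than the paper's matching-rate metric.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        raise ValueError("keypoints must not coincide with the joint")
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


# Hypothetical 2D keypoints for one dancer in one frame.
shoulder, elbow, wrist = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
elbow_angle = joint_angle(shoulder, elbow, wrist)  # a right angle at the elbow
```

Comparing such angles across dancers frame by frame is one plausible way to quantify how closely their poses match.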

Sophie Hao

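Task: Discuss strategies for generative linguistics to survive and develop in its current crisis.

Motivation: Responds to Chesi's and Piantadosi's views on the crisis in generative linguistics, arguing that its root lies in limited social ambitions rather than in academic rigor.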

Details

Method: Analyzes the dual social and intellectual nature of generative linguistics and proposes strategies for expanding its social ambitions.

Result: The success of generative linguistics depends not only on academic rigor but also on widening its social impact by engaging outside stakeholders.

Conclusion: Generativists should both raise their academic rigor and expand their social ambitions to weather the current crisis and thrive in the long term.

Abstract: Chesi's (forthcoming) target paper depicts a generative linguistics in crisis, foreboded by Piantadosi's (2023) declaration that "modern language models refute Chomsky's approach to language." In order to survive, Chesi warns, generativists must hold themselves to higher standards of formal and empirical rigor. This response argues that the crisis described by Chesi and Piantadosi actually has little to do with rigor, but is rather a reflection of generativists' limited social ambitions. Chesi ties the fate of generative linguistics to its intellectual merits, but the current success of language model research is social in nature as much as it is intellectual. In order to thrive, then, generativists must do more than heed Chesi's call for rigor; they must also expand their ambitions by giving outsiders a stake in their future success.

Robust Object Detection of Underwater Robot based on Domain Generalization

Pinhao Song

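Task: Design a high-performance and robust underwater object detector.

Motivation: The diversity and complexity of underwater environments pose new challenges for object detection, including occlusion, biological camouflage, poor image quality, and domain shift.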

Details

Method: No specific method is stated; the goal is to address the challenges posed by underwater environments.

Result: No specific results are stated; the goal is high performance and robustness.

Conclusion: The paper aims to address the challenges of underwater object detection and design a detector suited to complex underwater environments.

Abstract: Object detection aims to obtain the location and the category of specific objects in a given image, which includes two tasks: classification and localization. In recent years, researchers have tended to apply object detection to underwater robots equipped with vision systems to complete tasks including seafood fishing, fish farming, biodiversity monitoring and so on. However, the diversity and complexity of underwater environments bring new challenges to object detection. First, aquatic organisms tend to live together, which leads to severe occlusion. Second, aquatic organisms are good at hiding themselves, as they have a similar color to the background. Third, varying water quality and changeable, extreme lighting conditions lead to distorted, low-contrast, blue or green images from the underwater camera, resulting in domain shift, and deep models are generally vulnerable to domain shift. Fourth, the movement of the underwater robot blurs the captured image and makes the water muddy, which results in low visibility. This paper investigates the problems brought by the underwater environment mentioned above, and aims to design a high-performance and robust underwater object detector.

Bigger But Not Better: Small Neural Language Models Outperform Large Language Models in Detection of Thought Disorder

Changye Li,Weizhe Xu,Serguei Pakhomov,Ellen Bradley,Dror Ben-Zeev,Trevor Cohen

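Task: Investigate whether small neural language models can serve as effective alternatives for detecting formal thought disorder.

Motivation: Large language models (LLMs) raise privacy, cost, and transparency issues in clinical applications, which limits their practical utility.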

Details

Method: Uses sliding-window perplexity measurements to compare how well models of different sizes detect formal thought disorder.

Result: Smaller models are more sensitive to formal thought disorder than larger ones, and detection capability declines as model size and context length grow.

Conclusion: Small models offer a promising direction for efficient, cost-effective, and privacy-preserving screening tools.

Abstract: Disorganized thinking is a key diagnostic indicator of schizophrenia-spectrum disorders. Recently, clinical estimates of the severity of disorganized thinking have been shown to correlate with measures of how difficult speech transcripts would be for large language models (LLMs) to predict. However, LLMs' deployment challenges (including privacy concerns, computational and financial costs, and lack of transparency of training data) limit their clinical utility. We investigate whether smaller neural language models can serve as effective alternatives for detecting positive formal thought disorder, using the same sliding-window-based perplexity measurements that proved effective with larger models. Surprisingly, our results show that smaller models are more sensitive to linguistic differences associated with formal thought disorder than their larger counterparts. Detection capability declines beyond a certain model size and context length, challenging the common assumption of "bigger is better" for LLM-based applications. Our findings generalize across audio diaries and clinical interview speech samples from individuals with psychotic symptoms, suggesting a promising direction for developing efficient, cost-effective, and privacy-preserving screening tools that can be deployed in both clinical and naturalistic settings.
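
A sliding-window perplexity measurement can be illustrated with precomputed per-token log-probabilities. The sketch below assumes natural-log probabilities have already been obtained from some language model; the window size, stride, and function name are illustrative assumptions, and the way the paper conditions tokens within each window may differ.

```python
import math


def sliding_window_perplexity(token_logprobs, window, stride=1):
    """Perplexity of each fixed-size window over a sequence of token log-probs.

    Perplexity of a window is exp(-mean log-probability) of its tokens.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    scores = []
    last_start = max(len(token_logprobs) - window, 0)
    for start in range(0, last_start + 1, stride):
        chunk = token_logprobs[start:start + window]
        if not chunk:
            break
        scores.append(math.exp(-sum(chunk) / len(chunk)))
    return scores


# Hypothetical transcript in which every token has probability 0.5.
logprobs = [math.log(0.5)] * 4
perplexities = sliding_window_perplexity(logprobs, window=2)
```

Higher window perplexities indicate text the model finds harder to predict, which is the signal the entry relates to disorganized speech.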

VisualQuest: A Diverse Image Dataset for Evaluating Visual Recognition in LLMs

Kelaiti Xiao,Liang Yang,Paerhati Tulajiang,Hongfei Lin

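Task: Evaluate the ability of large language models (LLMs) to interpret non-traditional, stylized imagery.

Motivation: Conventional photographic benchmarks do not adequately test how models handle abstract, symbolic, and metaphorical elements, so a more comprehensive dataset is needed to advance multimodal reasoning research.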

Details

Method: Builds the VisualQuest dataset through multiple stages of filtering, annotation, and standardization, and evaluates it with state-of-the-art multimodal LLMs.

Result: The evaluation shows significant performance variation across models, underscoring the importance of factual background knowledge and reasoning ability in visual recognition tasks.

Conclusion: VisualQuest provides a comprehensive and robust benchmark for multimodal reasoning and model architecture design.

Abstract: This paper introduces VisualQuest, a novel image dataset designed to assess the ability of large language models (LLMs) to interpret non-traditional, stylized imagery. Unlike conventional photographic benchmarks, VisualQuest challenges models with images that incorporate abstract, symbolic, and metaphorical elements, requiring the integration of domain-specific knowledge and advanced reasoning. The dataset was meticulously curated through multiple stages of filtering, annotation, and standardization to ensure high quality and diversity. Our evaluations using several state-of-the-art multimodal LLMs reveal significant performance variations that underscore the importance of both factual background knowledge and inferential capabilities in visual recognition tasks. VisualQuest thus provides a robust and comprehensive benchmark for advancing research in multimodal reasoning and model architecture design.

Changye Li,Zhecheng Sheng,Trevor Cohen,Serguei Pakhomov

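Task: Quantify the influence of test administrators on linguistic features in dementia assessment.

Motivation: Study how test administrators affect linguistic features, in order to expose the potential bias this introduces into downstream analysis and clinical assessment.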

Details

Method: Analyzes two "Cookie Theft" picture description datasets collected at different locations with different levels of test administrator involvement.

Result: The level of test administrator involvement significantly affects the linguistic features observed in patient speech.

Conclusion: A more standardized test administration protocol is needed to reduce bias and ensure the reliability of clinical speech analytics frameworks.

Abstract: Alzheimer's Disease (AD) dementia is a progressive neurodegenerative disease that negatively impacts patients' cognitive ability. Previous studies have demonstrated that changes in naturalistic language samples can be useful for early screening of AD dementia. However, the nature of language deficits often requires test administrators to use various speech elicitation techniques during spontaneous language assessments to obtain enough propositional utterances from dementia patients. This could lead to an "observer's effect" on the downstream analysis that has not been fully investigated. Our study seeks to quantify the influence of test administrators on linguistic features in dementia assessment using two English corpora of "Cookie Theft" picture descriptions, collected at different locations and showing different levels of test administrator involvement. Our results show that the level of test administrator involvement significantly impacts observed linguistic features in patient speech. These results suggest that many of the significant linguistic features in the downstream classification task may be partially attributable to differences in test administration practices rather than solely to participants' cognitive status. The variations in test administrator behavior can lead to systematic biases in linguistic data, potentially confounding research outcomes and clinical assessments. Our study suggests that there is a need for a more standardized test administration protocol in the development of responsible clinical speech analytics frameworks.

Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation

Zhiyao Ren,Yibing Zhan,Baosheng Yu,Dacheng Tao

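Task: Explore how to decode textual prompts from reference images, that is, image reverse prompt engineering.

Motivation: Reverse prompt engineering makes it possible to gain insights from reference images, understand artists' creative processes, and generate novel images.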

Details

Method: Proposes automatic reverse prompt optimization (ARPO), which refines an initial prompt into a high-quality prompt through an iterative, imitative gradient prompt optimization process.

Result: ARPO converges quickly to high-quality reverse prompts, and editing these prompts makes it easy to create new images with diverse styles and content.

Conclusion: ARPO performs well at reverse prompt generation and offers a new approach to image generation.

Abstract: Text-to-image generation has become increasingly popular, but achieving the desired images often requires extensive prompt engineering. In this paper, we explore how to decode textual prompts from reference images, a process we refer to as image reverse prompt engineering. This technique enables us to gain insights from reference images, understand the creative processes of great artists, and generate impressive new images. To address this challenge, we propose a method known as automatic reverse prompt optimization (ARPO). Specifically, our method refines an initial prompt into a high-quality prompt through an iteratively imitative gradient prompt optimization process: 1) generating a recreated image from the current prompt to instantiate its guidance capability; 2) producing textual gradients, which are candidate prompts intended to reduce the difference between the recreated image and the reference image; 3) updating the current prompt with textual gradients using a greedy search method to maximize the CLIP similarity between prompt and reference image. We compare ARPO with several baseline methods, including handcrafted techniques, gradient-based prompt tuning methods, image captioning, and a data-driven selection method. Both quantitative and qualitative results demonstrate that our ARPO converges quickly to generate high-quality reverse prompts. More importantly, we can easily create novel images with diverse styles and content by directly editing these reverse prompts. Code will be made publicly available.

Efficient Model Development through Fine-tuning Transfer

Pin-Jie Lin,Rishab Balasubramanian,Fengyuan Liu,Nikhil Kandpal,Tu Vu

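Task: Explore how to transfer fine-tuning updates between different versions of a large language model, reducing the cost of repeated alignment and fine-tuning.

Motivation: Modern LLMs are inefficient to update: every new release requires repeating expensive alignment and fine-tuning, especially for domain- or language-specific models.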

Details

Method: Extracts a diff vector representing the fine-tuning weight changes from a source model version and applies it to the base model of a target version, thereby transferring the fine-tuning update.

Result: Experiments show that transferring diff vectors significantly improves the target base model, often reaching performance comparable to its fine-tuned counterpart, for example a 10.7% absolute accuracy gain on GPQA.

Conclusion: Fine-tuning transfer is a viable strategy that reduces training costs while maintaining model performance, and it works best between models that are linearly connected in parameter space.

Abstract: Modern LLMs struggle with efficient updates, as each new pretrained model version requires repeating expensive alignment processes. This challenge also applies to domain- or language-specific models, where fine-tuning on specialized data must be redone for every new base model release. In this paper, we explore the transfer of fine-tuning updates between model versions. Specifically, we derive the diff vector from one source model version, which represents the weight changes from fine-tuning, and apply it to the base model of a different target version. Through empirical evaluations on various open-weight model versions, we show that transferring diff vectors can significantly improve the target base model, often achieving performance comparable to its fine-tuned counterpart. For example, reusing the fine-tuning updates from Llama 3.0 8B leads to an absolute accuracy improvement of 10.7% on GPQA over the base Llama 3.1 8B without additional training, surpassing Llama 3.1 8B Instruct. In a multilingual model development setting, we show that this approach can significantly increase performance on target-language tasks without retraining, achieving an absolute improvement of 4.7% and 15.5% on Global MMLU for Malagasy and Turkish, respectively, compared to Llama 3.1 8B Instruct. Our controlled experiments reveal that fine-tuning transfer is most effective when the source and target models are linearly connected in the parameter space. Additionally, we demonstrate that fine-tuning transfer offers a stronger and more computationally efficient starting point for further fine-tuning. Finally, we propose an iterative recycling-then-finetuning approach for continuous model development, which improves both efficiency and effectiveness. Our findings suggest that fine-tuning transfer is a viable strategy to reduce training costs while maintaining model performance.
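
The diff-vector idea in this entry amounts to simple parameter arithmetic: subtract the source base weights from the source fine-tuned weights, then add the result to the target base weights. The sketch below shows this on toy parameter dictionaries; the function name, the dictionary layout, and the numbers are assumptions for illustration, and real checkpoints would hold tensors rather than Python lists.

```python
def transfer_diff(source_base, source_finetuned, target_base):
    """Apply the fine-tuning diff of a source model to a target base model.

    Each argument maps a parameter name to a flat list of weights.
    The result is target_base + (source_finetuned - source_base).
    """
    if not (source_base.keys() == source_finetuned.keys() == target_base.keys()):
        raise ValueError("all models must share the same parameter names")
    merged = {}
    for name, target_weights in target_base.items():
        base_weights = source_base[name]
        tuned_weights = source_finetuned[name]
        if not (len(base_weights) == len(tuned_weights) == len(target_weights)):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            target + (tuned - base)
            for target, tuned, base in zip(target_weights, tuned_weights, base_weights)
        ]
    return merged


# Toy weights standing in for two model versions with identical architecture.
old_base = {"w": [1.0, 2.0]}
old_finetuned = {"w": [1.5, 1.0]}
new_base = {"w": [2.0, 2.0]}
new_merged = transfer_diff(old_base, old_finetuned, new_base)
```

The arithmetic requires both versions to share an architecture and parameter names, which is why the sketch rejects mismatched keys or shapes.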

Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders

Paul Koch,Jörg Krüger,Ankit Chowdhury,Oliver Heimann

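Task: Propose Vanishing Depth, a self-supervised training approach that extends pretrained RGB encoders to support metric depth understanding.

Motivation: Current state-of-the-art vision encoders do not support metric depth understanding, which is critical for precise vision-guided robotics.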

Details

Method: Builds on a novel positional depth encoding to enable feature extraction that is stable across depth densities and invariant to depth distribution.

Result: Achieves performance improvements and SOTA results on several RGBD downstream tasks without fine-tuning the encoder.

Conclusion: Vanishing Depth performs well across a range of tasks and sets a new baseline for non-finetuned encoders.

Abstract: Generalized metric depth understanding is critical for precise vision-guided robotics, which current state-of-the-art (SOTA) vision encoders do not support. To address this, we propose Vanishing Depth, a self-supervised training approach that extends pretrained RGB encoders to incorporate and align metric depth into their feature embeddings. Based on our novel positional depth encoding, we enable stable depth density and depth distribution invariant feature extraction. We achieve performance improvements and SOTA results across a spectrum of relevant RGBD downstream tasks, without the need to finetune the encoder. Most notably, we achieve 56.05 mIoU on SUN-RGBD segmentation, 88.3 RMSE on Void's depth completion, and 83.8 Top 1 accuracy on NYUv2 scene classification. In 6D-object pose estimation, we outperform our predecessors of DinoV2, EVA-02, and Omnivore and achieve SOTA results for non-finetuned encoders in several related RGBD downstream tasks.

ProtoBERT-LoRA: Parameter-Efficient Prototypical Finetuning for Immunotherapy Study Identification

Shijia Zhang,Xiyu Ding,Kai Ding,Jacob Zhang,Kevin Galinsky,Mengrui Wang,Ryan P. Mayers,Zheyu Wang,Hadi Kharrazi

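Task: Propose ProtoBERT-LoRA, a hybrid framework for identifying immune checkpoint inhibitor (ICI) studies in genomic repositories.

Motivation: Identifying ICI studies is challenging because of semantic ambiguity, extreme class imbalance, and limited labeled data in low-resource settings.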

Details

Method: Combines PubMedBERT, prototypical networks, and Low-Rank Adaptation (LoRA) for efficient fine-tuning, using prototype training to enforce class-separable embeddings.

Result: On the test dataset, ProtoBERT-LoRA reaches an F1-score of 0.624 (precision 0.481, recall 0.887), outperforming the rule-based system, machine learning baselines, and fine-tuned PubMedBERT. Applied to 44,287 unlabeled studies, it reduces manual review effort by 82%.

Conclusion: Combining prototypes with LoRA substantially improves performance, by 29% over stand-alone LoRA.

Abstract: Identifying immune checkpoint inhibitor (ICI) studies in genomic repositories like Gene Expression Omnibus (GEO) is vital for cancer research yet remains challenging due to semantic ambiguity, extreme class imbalance, and limited labeled data in low-resource settings. We present ProtoBERT-LoRA, a hybrid framework that combines PubMedBERT with prototypical networks and Low-Rank Adaptation (LoRA) for efficient fine-tuning. The model enforces class-separable embeddings via episodic prototype training while preserving biomedical domain knowledge. Our dataset was divided as follows: Training (20 positive, 20 negative), Prototype Set (10 positive, 10 negative), Validation (20 positive, 200 negative), and Test (71 positive, 765 negative). Evaluated on the test dataset, ProtoBERT-LoRA achieved an F1-score of 0.624 (precision: 0.481, recall: 0.887), outperforming the rule-based system, machine learning baselines, and finetuned PubMedBERT. Application to 44,287 unlabeled studies reduced manual review efforts by 82%. Ablation studies confirmed that combining prototypes with LoRA improved performance by 29% over stand-alone LoRA.
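
The reported F1-score can be checked against the reported precision and recall, since F1 is their harmonic mean. The sketch below performs that arithmetic with the three figures from the abstract; the function name is an illustrative choice.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for ProtoBERT-LoRA on the test set.
reported_f1 = round(f1_score(0.481, 0.887), 3)  # 0.624, matching the abstract
```

The low precision alongside high recall fits the heavily imbalanced test split (71 positive, 765 negative), where even a small false-positive rate yields many false alarms.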

Test-Time Reasoning Through Visual Human Preferences with VLMs and Soft Rewards

Alexander Gambashidze,Konstantin Sobolev,Andrey Kuznetsov,Ivan Oseledets

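Task: Investigate whether Visual Language Models (VLMs) can effectively capture human visual preferences.

Motivation: Explore the ability of VLMs, trained with reinforcement learning, to think about preferences at test time, improving model transparency and generalization.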

Details

Method: Trains VLMs on the ImageReward and HPSv2 datasets with reinforcement learning methods inspired by DeepSeek R1 and OpenAI O1.

Result: The models reach 64.9% accuracy on the ImageReward test set and 65.4% on HPSv2, matching traditional encoder-based models while providing transparent reasoning and better generalization.

Conclusion: VLMs can reason about human visual preferences; soft-reward strategies improve image ranking, reduce annotation needs, and strengthen reward generalization and explainability.

Abstract: Can Visual Language Models (VLMs) effectively capture human visual preferences? This work addresses this question by training VLMs to think about preferences at test time, employing reinforcement learning methods inspired by DeepSeek R1 and OpenAI O1. Using datasets such as ImageReward and Human Preference Score v2 (HPSv2), our models achieve accuracies of 64.9% on the ImageReward test set (trained on the official ImageReward split) and 65.4% on HPSv2 (trained on approximately 25% of its data). These results match traditional encoder-based models while providing transparent reasoning and enhanced generalization. This approach makes it possible to use not only the rich world knowledge of VLMs but also their potential to think, yielding interpretable outcomes that help decision-making processes. By demonstrating that human visual preferences can be reasoned about by current VLMs, we introduce efficient soft-reward strategies for image ranking, outperforming simplistic selection or scoring methods. This reasoning capability enables VLMs to rank arbitrary images, regardless of aspect ratio or complexity, thereby potentially amplifying the effectiveness of visual Preference Optimization. By reducing the need for extensive markup while improving reward generalization and explainability, our findings can be a strong milestone that will enhance text-to-vision models even further.

Leveraging Implicit Sentiments: Enhancing Reliability and Validity in Psychological Trait Evaluation of LLMs

Huanhuan Ma,Haisong Gong,Xiaoyuan Yi,Xing Xie,Dongkuan Xu

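Task: Develop the Core Sentiment Inventory (CSI), a new instrument designed specifically for evaluating the psychological traits of large language models (LLMs).

Motivation: As LLMs shift from tools to human-like assistants, understanding their psychological traits (such as emotional tendencies and personality) is essential for ensuring trustworthiness, but existing methods based on human psychological assessments have significant limitations.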

Details

Method: Designs CSI, a bilingual (English and Chinese) instrument that implicitly evaluates a model's sentiment tendencies along three dimensions: optimism, pessimism, and neutrality.

Result: Experiments show that CSI effectively captures emotional patterns and significantly improves reliability, and that its scores correlate above 0.85 with the sentiment of LLMs' actual outputs.

Conclusion: CSI is a reliable and valid instrument that provides an in-depth psychological portrait of LLMs and predicts their behavior.

Abstract: Recent advancements in Large Language Models (LLMs) have led to their increasing integration into human life. With the transition from mere tools to human-like assistants, understanding their psychological aspects, such as emotional tendencies and personalities, becomes essential for ensuring their trustworthiness. However, current psychological evaluations of LLMs, often based on human psychological assessments like the BFI, face significant limitations. The results from these approaches often lack reliability and have limited validity when predicting LLM behavior in real-world scenarios. In this work, we introduce a novel evaluation instrument specifically designed for LLMs, called Core Sentiment Inventory (CSI). CSI is a bilingual tool, covering both English and Chinese, that implicitly evaluates models' sentiment tendencies, providing an insightful psychological portrait of LLMs across three dimensions: optimism, pessimism, and neutrality. Through extensive experiments, we demonstrate that: 1) CSI effectively captures nuanced emotional patterns, revealing significant variation in LLMs across languages and contexts; 2) compared to current approaches, CSI significantly improves reliability, yielding more consistent results; and 3) the correlation between CSI scores and the sentiment of LLMs' real-world outputs exceeds 0.85, demonstrating its strong validity in predicting LLM behavior. We make CSI publicly available via: https://github.com/dependentsign/CSI.

ACVUBench: Audio-Centric Video Understanding Benchmark

Yudong Yang,Jimin Zhuang,Guangzhi Sun,Changli Tang,Yixuan Li,Peihan Li,Yifan Jiang,Wei Li,Zejun Ma,Chao Zhang

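Task: Propose an audio-centric video understanding benchmark (ACVUBench) for evaluating how well multimodal large language models attend to auditory information in video understanding.

Motivation: Thorough video understanding depends on auditory information, yet existing models usually treat audio as an auxiliary modality and overlook the critical context, emotional cues, and semantic information it provides.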

Details

Method: Builds a benchmark of 2,662 videos and over 13k annotated question-answer pairs spanning 18 domains, and designs a suite of audio-centric tasks that test understanding of audio content and audio-visual interactions.

Result: Provides a thorough evaluation of a range of open-source and proprietary multimodal LLMs and analyzes their deficiencies in audio-visual understanding.

Conclusion: ACVUBench offers a standardized tool for evaluating the audio understanding of multimodal LLMs and exposes the shortcomings of existing models in processing audio information.

Abstract: Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs), merely assisting in the comprehension of visual information. However, a thorough understanding of videos significantly depends on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone often lacks. This paper proposes an audio-centric video understanding benchmark (ACVUBench) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. Specifically, ACVUBench incorporates 2,662 videos spanning 18 different domains with rich auditory information, together with over 13k high-quality human-annotated or validated question-answer pairs. Moreover, ACVUBench introduces a suite of carefully designed audio-centric tasks, holistically testing the understanding of both audio content and audio-visual interactions in videos. A thorough evaluation across a diverse range of open-source and proprietary multimodal LLMs is performed, followed by an analysis of deficiencies in audio-visual LLMs. Demos are available at https://github.com/lark-png/ACVUBench.

GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization

Zhouhong Gu,Xingzhou Chen,Xiaoran Shi,Tao Wang,Suhang Zheng,Tianyu Li,Hongwei Feng,Yanghua Xiao

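Task: Propose Generative Adversarial Policy Optimization (GAPO), a new framework for more precise control over the outputs of large language models.

Motivation: Existing methods struggle to understand and adapt to constraints, and with fine-grained constraints in particular they tend to hallucinate or perform unreliably.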

Details

Method: Combines GAN-based training dynamics with an encoder-only reward model to progressively learn and adapt to complex constraints.

Result: GAPO performs strongly on multiple benchmarks and, in fine-grained constraint handling scenarios in particular, significantly outperforms methods such as PPO, DPO, and KTO.

Conclusion: Through adversarial training and an encoder-only architecture, GAPO offers a more robust and effective solution for controlling LLM outputs.

Abstract: Recent advances in large language models have highlighted the critical need for precise control over model outputs through predefined constraints. While existing methods attempt to achieve this through either direct instruction-response synthesis or preferential response optimization, they often struggle with constraint understanding and adaptation. This limitation becomes particularly evident when handling fine-grained constraints, leading to either hallucination or brittle performance. We introduce Generative Adversarial Policy Optimization (GAPO), a novel framework that combines GAN-based training dynamics with an encoder-only reward model to progressively learn and adapt to increasingly complex constraints. GAPO leverages adversarial training to automatically generate training samples of varying difficulty while utilizing the encoder-only architecture to better capture prompt-response relationships. Extensive experiments demonstrate GAPO's superior performance across multiple benchmarks, particularly in scenarios requiring fine-grained constraint handling, where it significantly outperforms existing methods like PPO, DPO, and KTO. Our results suggest that GAPO's unique approach to preferential prompt learning offers a more robust and effective solution for controlling LLM outputs. Code is available at https://github.com/MikeGu721/GAPO.

Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals

Stefan Stojanov,David Wendt,Seungwoo Kim,Rahul Venkatesh,Kevin Feigelis,Jiajun Wu,Daniel LK Yamins

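Task: Develop Opt-CWM, a self-supervised technique for estimating optical flow and occlusion from a pretrained next-frame prediction model.

Motivation: Current motion estimation methods rely mainly on synthetic data or on tuning situation-specific heuristics, which limits their capabilities in real-world scenes.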

Details

Method: Extracts motion information from a base video model by optimizing counterfactual probes, avoiding fixed heuristics while training on unrestricted video inputs.

Result: Achieves state-of-the-art motion estimation performance on real-world videos without labeled data.

Conclusion: Opt-CWM is an effective self-supervised method that significantly improves motion estimation in real-world scenes.

Abstract: Estimating motion in videos is an essential computer vision problem with many downstream applications, including controllable video generation and robotics. Current solutions are primarily trained using synthetic data or require tuning of situation-specific heuristics, which inherently limits these models' capabilities in real-world contexts. Despite recent developments in large-scale self-supervised learning from videos, leveraging such representations for motion estimation remains relatively underexplored. In this work, we develop Opt-CWM, a self-supervised technique for flow and occlusion estimation from a pre-trained next-frame prediction model. Opt-CWM works by learning to optimize counterfactual probes that extract motion information from a base video model, avoiding the need for fixed heuristics while training on unrestricted video inputs. We achieve state-of-the-art performance for motion estimation on real-world videos while requiring no labeled data.

SARGes: Semantically Aligned Reliable Gesture Generation via Intent Chain

Nan Gao,Yihua Bao,Dongdong Weng,Jiayi Zhao,Jia Li,Yan Zhou,Pengfei Wan,Di Zhang

Task: Generate semantically meaningful gestures through speech-synchronized gesture synthesis to make human-computer interaction more realistic.

Motivation: Generating semantically meaningful gestures remains a challenging problem.

Details

Method: Proposes the SARGes framework, which uses large language models (LLMs) to parse speech content and generate reliable semantic gesture labels that then guide meaningful gesture synthesis. Result: Experiments show that SARGes achieves highly semantically aligned gesture labeling (50.2% accuracy) with efficient single-pass inference (0.4 seconds). Conclusion: The method provides an interpretable intent-reasoning pathway for semantic gesture synthesis. Abstract: Co-speech gesture generation enhances human-computer interaction realism through speech-synchronized gesture synthesis. However, generating semantically meaningful gestures remains a challenging problem. We propose SARGes, a novel framework that leverages large language models (LLMs) to parse speech content and generate reliable semantic gesture labels, which subsequently guide the synthesis of meaningful co-speech gestures. First, we constructed a comprehensive co-speech gesture ethogram and developed an LLM-based intent chain reasoning mechanism that systematically parses and decomposes gesture semantics into structured inference steps following ethogram criteria, effectively guiding LLMs to generate context-aware gesture labels. Subsequently, we constructed an intent chain-annotated text-to-gesture label dataset and trained a lightweight gesture label generation model, which then guides the generation of credible and semantically coherent co-speech gestures. Experimental results demonstrate that SARGes achieves highly semantically-aligned gesture labeling (50.2% accuracy) with efficient single-pass inference (0.4 seconds). The proposed method provides an interpretable intent reasoning pathway for semantic gesture synthesis.

SLIP: Spoof-Aware One-Class Face Anti-Spoofing with Language Image Pretraining

Pei-Kai Huang,Jun-Xiong Chong,Cheng-Hsuan Chiang,Tzu-Hsien Chen,Tyng-Luh Liu,Chiou-Ting Hsu

Task: Propose a new framework, SLIP, to address the performance degradation in one-class face anti-spoofing (FAS) caused by the lack of spoof training data.

Motivation: One-class FAS methods learn liveness features only from live images, but without spoof data the model may learn domain information irrelevant to the live/spoof distinction, which hurts performance.

Details

Method: (1) Proposes language-guided spoof cue map estimation; (2) introduces prompt-driven liveness feature disentanglement; (3) designs a feature-fusion augmentation strategy. Result: SLIP outperforms existing one-class FAS methods in experiments. Conclusion: SLIP effectively improves one-class FAS performance through language guidance and feature disentanglement. Abstract: Face anti-spoofing (FAS) plays a pivotal role in ensuring the security and reliability of face recognition systems. With advancements in vision-language pretrained (VLP) models, recent two-class FAS techniques have leveraged the advantages of using VLP guidance, while this potential remains unexplored in one-class FAS methods. The one-class FAS focuses on learning intrinsic liveness features solely from live training images to differentiate between live and spoof faces. However, the lack of spoof training data can lead one-class FAS models to inadvertently incorporate domain information irrelevant to the live/spoof distinction (e.g., facial content), causing performance degradation when tested with a new application domain. To address this issue, we propose a novel framework called Spoof-aware one-class face anti-spoofing with Language Image Pretraining (SLIP). Given that live faces should ideally not be obscured by any spoof-attack-related objects (e.g., paper or masks) and are assumed to yield zero spoof cue maps, we first propose an effective language-guided spoof cue map estimation to enhance one-class FAS models by simulating whether the underlying faces are covered by attack-related objects and generating corresponding nonzero spoof cue maps. Next, we introduce a novel prompt-driven liveness feature disentanglement to alleviate live/spoof-irrelevant domain variations by disentangling live/spoof-relevant and domain-dependent information. Finally, we design an effective augmentation strategy by fusing latent features from live images and spoof prompts to generate spoof-like image features and thus diversify latent spoof features to facilitate the learning of one-class FAS. Our extensive experiments and ablation studies support that SLIP consistently outperforms previous one-class FAS methods.

Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages

Yangyang Meng,Jinpeng Li,Guodong Lin,Yu Pu,Guanbo Wang,Hu Du,Zhiming Shao,Yukai Huang,Ke Li,Wei-Qiang Zhang

Task: Develop Dolphin, an automatic speech recognition (ASR) model that extends the Whisper architecture to support more languages.

Motivation: Provide higher recognition accuracy for 40 Eastern languages across East Asia, South Asia, Southeast Asia, and the Middle East, as well as 22 Chinese dialects.

Details

Method: Integrates in-house proprietary and open-source datasets to optimize model performance. Result: Dolphin significantly outperforms current state-of-the-art open-source models across multiple languages. Conclusion: Releasing the trained models and inference code publicly promotes reproducibility and community-driven innovation. Abstract: This report introduces Dolphin, a large-scale multilingual automatic speech recognition (ASR) model that extends the Whisper architecture to support a wider range of languages. Our approach integrates in-house proprietary and open-source datasets to refine and optimize Dolphin's performance. The model is specifically designed to achieve notable recognition accuracy for 40 Eastern languages across East Asia, South Asia, Southeast Asia, and the Middle East, while also supporting 22 Chinese dialects. Experimental evaluations show that Dolphin significantly outperforms current state-of-the-art open-source models across various languages. To promote reproducibility and community-driven innovation, we are making our trained models and inference source code publicly available.

The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs

Jonathan Sauder,Viktor Domazetoski,Guilhem Banc-Prandi,Gabriela Perna,Anders Meibom,Devis Tuia

Task: Automate the identification and abundance estimation of live corals from images.

Motivation: Conventional coral reef surveying relies on expert labor time, which limits scalability, so computer vision tools are needed to make monitoring more efficient.

Details

Method: Releases the Coralscapes dataset, with 2075 images, 39 benthic classes, and 174k expert-annotated segmentation masks, for benchmarking semantic segmentation models. Result: Transfer learning from Coralscapes to existing smaller datasets consistently achieves state-of-the-art performance. Conclusion: Coralscapes will advance research on efficient, scalable, and standardized computer-vision-based coral reef surveying and may facilitate the development of underwater ecological robotics. Abstract: Coral reefs are declining worldwide due to climate change and local stressors. To inform effective conservation or restoration, monitoring at the highest possible spatial and temporal resolution is necessary. Conventional coral reef surveying methods are limited in scalability due to their reliance on expert labor time, motivating the use of computer vision tools to automate the identification and abundance estimation of live corals from images. However, the design and evaluation of such tools has been impeded by the lack of large, high-quality datasets. We release the Coralscapes dataset, the first general-purpose dense semantic segmentation dataset for coral reefs, covering 2075 images, 39 benthic classes, and 174k segmentation masks annotated by experts. Coralscapes has a similar scope and the same structure as the widely used Cityscapes dataset for urban scene segmentation, allowing benchmarking of semantic segmentation models in a new challenging domain which requires expert knowledge to annotate. We benchmark a wide range of semantic segmentation models, and find that transfer learning from Coralscapes to existing smaller datasets consistently leads to state-of-the-art performance. Coralscapes will catalyze research on efficient, scalable, and standardized coral reef surveying methods based on computer vision, and holds the potential to streamline the development of underwater ecological robotics.

Qwen2.5-Omni Technical Report

Jin Xu,Zhifang Guo,Jinzheng He,Hangrui Hu,Ting He,Shuai Bai,Keqin Chen,Jialin Wang,Yang Fan,Kai Dang,Bin Zhang,Xiong Wang,Yunfei Chu,Junyang Lin

Task: Develop Qwen2.5-Omni, an end-to-end multimodal model that perceives text, images, audio, and video and generates text and natural speech responses in a streaming manner.

Motivation: Enable streaming processing of multimodal information and synchronize audio and video timestamps while avoiding interference between text and speech generation.

Details

Method: Uses multimodal encoders with block-wise processing, proposes TMRoPE, a time-aligned multimodal position embedding, and introduces the Thinker-Talker architecture, in which Thinker handles text generation and Talker uses Thinker's hidden representations to generate audio tokens. Result: Qwen2.5-Omni is comparable to Qwen2.5-VL, outperforms Qwen2-Audio, and reaches state-of-the-art results on multimodal benchmarks, with particularly strong speech generation. Conclusion: Qwen2.5-Omni performs well on multimodal perception and generation tasks, with notable strengths in streaming processing and speech generation. Abstract: In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.

Hyperdimensional Uncertainty Quantification for Multimodal Uncertainty Fusion in Autonomous Vehicles Perception

Luke Chen,Junyao Wang,Trier Mortlock,Pramod Khargonekar,Mohammad Abdullah Al Faruque

Task: Propose HyperDUM, a novel deterministic uncertainty method (DUM) for efficiently quantifying feature-level epistemic uncertainty.

Motivation: Existing methods typically quantify only task-level output prediction uncertainty and ignore epistemic uncertainty at the multimodal feature fusion level; popular uncertainty quantification methods such as Bayesian approximations are hard to deploy in practice because of their high computational cost.

Details

Method: Uses hyperdimensional computing, capturing channel and spatial uncertainty through channel-wise and patch-wise projection and bundling respectively, and adaptively weights multimodal sensor features to reduce uncertainty propagation. Result: HyperDUM on average outperforms state-of-the-art algorithms by up to 2.01%/1.27% in 3D object detection and by up to 1.29% in semantic segmentation, with substantially fewer operations and parameters. Conclusion: HyperDUM provides an efficient uncertainty quantification solution for real-world autonomous systems. Abstract: Uncertainty Quantification (UQ) is crucial for ensuring the reliability of machine learning models deployed in real-world autonomous systems. However, existing approaches typically quantify task-level output prediction uncertainty without considering epistemic uncertainty at the multimodal feature fusion level, leading to sub-optimal outcomes. Additionally, popular uncertainty quantification methods, e.g., Bayesian approximations, remain challenging to deploy in practice due to high computational costs in training and inference. In this paper, we propose HyperDUM, a novel deterministic uncertainty method (DUM) that efficiently quantifies feature-level epistemic uncertainty by leveraging hyperdimensional computing. Our method captures the channel and spatial uncertainties through channel-wise and patch-wise projection and bundling techniques respectively. Multimodal sensor features are then adaptively weighted to mitigate uncertainty propagation and improve feature fusion. Our evaluations show that HyperDUM on average outperforms the state-of-the-art (SOTA) algorithms by up to 2.01%/1.27% in 3D Object Detection and up to 1.29% improvement over baselines in semantic segmentation tasks under various types of uncertainties. Notably, HyperDUM requires 2.36x fewer floating-point operations and up to 38.30x fewer parameters than SOTA methods, providing an efficient solution for real-world autonomous systems.
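
The projection-and-bundling idea in the HyperDUM abstract can be illustrated with a minimal hyperdimensional computing sketch in plain Python. This is not HyperDUM's implementation: the random bipolar projection, the hypervector dimension, the toy feature vectors, and the function names `make_projection`, `project`, `bundle`, and `similarity` are all illustrative assumptions. Features are projected into a high-dimensional bipolar space, the hypervectors of previously seen features are bundled (summed) into a prototype, and low similarity to that prototype is read as a proxy for high uncertainty.

```python
import math
import random

def make_projection(in_dim, hv_dim, seed=0):
    """Random bipolar (+1/-1) projection matrix; an illustrative choice."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(in_dim)] for _ in range(hv_dim)]

def project(features, proj):
    """Map a feature vector to a bipolar hypervector via sign(P @ x)."""
    return [1.0 if sum(w * x for w, x in zip(row, features)) >= 0 else -1.0
            for row in proj]

def bundle(hypervectors):
    """Bundle hypervectors by element-wise summation into one prototype."""
    return [sum(vals) for vals in zip(*hypervectors)]

def similarity(a, b):
    """Cosine similarity between two hypervectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

proj = make_projection(in_dim=4, hv_dim=2048)

# Hypothetical features seen during training, bundled into one prototype.
seen = [[1.0, 0.9, 0.1, 0.0], [0.9, 1.0, 0.0, 0.1], [1.1, 1.0, 0.1, 0.1]]
prototype = bundle([project(f, proj) for f in seen])

familiar = project([1.0, 1.0, 0.05, 0.05], proj)
unfamiliar = project([-1.0, 0.1, 1.0, -0.9], proj)

# Uncertainty proxy: 1 - similarity to the bundled prototype.
u_familiar = 1.0 - similarity(familiar, prototype)
u_unfamiliar = 1.0 - similarity(unfamiliar, prototype)
```

In this toy setup the feature that resembles the training data receives a lower uncertainty score than the dissimilar one. Applying the same steps per channel or per image patch would be one way to obtain channel-wise or spatial scores, though how the paper does this is not specified in the abstract.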

Advancements in Natural Language Processing: Exploring Transformer-Based Architectures for Text Understanding

Tianhao Wu,Yu Wang,Ngoc Quach

Task: Explore the performance gains of transformer-based models such as BERT and GPT on text understanding tasks.

Motivation: Transformer architectures have significantly improved machines' ability to understand and generate human-like text, surpassing traditional methods such as RNNs.

Details

Method: Analyzes statistical properties (such as probability density functions of text length distributions and feature space classifications) through a pipeline of data preparation, model selection, pretraining, fine-tuning, and evaluation. Result: Reaches state-of-the-art performance on benchmarks such as GLUE and SQuAD (F1 scores above 90%), though at high computational cost. Conclusion: Transformers play a pivotal role in modern NLP; future directions include efficiency optimization and multimodal integration. Abstract: Natural Language Processing (NLP) has witnessed a transformative leap with the advent of transformer-based architectures, which have significantly enhanced the ability of machines to understand and generate human-like text. This paper explores the advancements in transformer models, such as BERT and GPT, focusing on their superior performance in text understanding tasks compared to traditional methods like recurrent neural networks (RNNs). By analyzing statistical properties through visual representations (including probability density functions of text length distributions and feature space classifications), the study highlights the models' proficiency in handling long-range dependencies, adapting to conditional shifts, and extracting features for classification, even with overlapping classes. Drawing on recent 2024 research, including enhancements in multi-hop knowledge graph reasoning and context-aware chat interactions, the paper outlines a methodology involving data preparation, model selection, pretraining, fine-tuning, and evaluation. The results demonstrate state-of-the-art performance on benchmarks like GLUE and SQuAD, with F1 scores exceeding 90%, though challenges such as high computational costs persist. This work underscores the pivotal role of transformers in modern NLP and suggests future directions, including efficiency optimization and multimodal integration, to further advance language-based AI systems.

iNatAg: Multi-Class Classification Models Enabled by a Large-Scale Benchmark Dataset with 4.7M Images of 2,959 Crop and Weed Species

Naitik Jain,Amogh Joshi,Mason Earles

Task: Accurately identify crop and weed species to support precision agriculture and sustainable farming.

Motivation: Crop and weed species identification remains challenging because of high visual similarity among species, environmental variability, and the lack of large-scale, agriculture-specific image data.

Details

Method: Introduces the iNatAg dataset, with 4.7 million images and precise annotations for 2,959 species, and trains benchmark models on the Swin Transformer architecture, incorporating geospatial data and LoRA fine-tuning. Result: The best model achieves 92.38% accuracy on crop and weed classification, the best result reported. Conclusion: iNatAg provides a new foundation for building robust, geolocation-aware agricultural classification systems, and the dataset is released publicly to support agricultural machine learning applications. Abstract: Accurate identification of crop and weed species is critical for precision agriculture and sustainable farming. However, it remains a challenging task due to a variety of factors: a high degree of visual similarity among species, environmental variability, and a continued lack of large, agriculture-specific image data. We introduce iNatAg, a large-scale image dataset which contains over 4.7 million images of 2,959 distinct crop and weed species, with precise annotations along the taxonomic hierarchy from binary crop/weed labels to specific species labels. Curated from the broader iNaturalist database, iNatAg contains data from every continent and accurately reflects the variability of natural image captures and environments. Enabled by this data, we train benchmark models built upon the Swin Transformer architecture and evaluate the impact of various modifications such as the incorporation of geospatial data and LoRA finetuning. Our best models achieve state-of-the-art performance across all taxonomic classification tasks, achieving 92.38% on crop and weed classification. Furthermore, the scale of our dataset enables us to explore misclassifications and unlock new analytic possibilities for plant species. By combining large-scale species coverage, multi-task labels, and geographic diversity, iNatAg provides a new foundation for building robust, geolocation-aware agricultural classification systems. We release the iNatAg dataset publicly through AgML (https://github.com/Project-AgML/AgML), enabling direct access and integration into agricultural machine learning workflows.

sudo rm -rf agentic_security

Sejin Lee,Jian Kim,Haon Park,Ashkan Yousefpour,Sangyoon Yu,Min Song

Task: Propose SUDO, an attack framework that bypasses the safeguards of commercial computer-use agents.

Motivation: Expose the security vulnerabilities of large language models (LLMs) deployed as computer-use agents and highlight the shortcomings of existing safeguards.

Details

Method: The Detox2Tox mechanism transforms harmful requests into seemingly benign ones, obtains detailed instructions from vision language models (VLMs), and reintroduces malicious content just before execution. Result: In tests on 50 real-world tasks, SUDO achieves an attack success rate of 24% (without refinement) and up to 41% (with iterative refinement). Conclusion: The work reveals the fragility of existing safeguards and calls for more robust, context-aware safety mechanisms. Abstract: Large Language Models (LLMs) are increasingly deployed as computer-use agents, autonomously performing tasks within real desktop or web environments. While this evolution greatly expands practical use cases for humans, it also creates serious security exposures. We present SUDO (Screen-based Universal Detox2Tox Offense), a novel attack framework that systematically bypasses refusal-trained safeguards in commercial computer-use agents, such as Claude Computer Use. The core mechanism, Detox2Tox, transforms harmful requests (that agents initially reject) into seemingly benign requests via detoxification, secures detailed instructions from advanced vision language models (VLMs), and then reintroduces malicious content via toxification just before execution. Unlike conventional jailbreaks, SUDO iteratively refines its attacks based on built-in refusal feedback, making it increasingly effective against robust policy filters. In extensive tests spanning 50 real-world tasks and multiple state-of-the-art VLMs, SUDO achieves a stark attack success rate of 24% (with no refinement), and up to 41% (by its iterative refinement) in Claude Computer Use. By revealing these vulnerabilities and demonstrating the ease with which they can be exploited in real-world computing environments, this paper highlights an immediate need for robust, context-aware safeguards. WARNING: This paper includes harmful or offensive model outputs.

Simiao Ren,Yao Yao,Kidus Zewde,Zisheng Liang,Tsang,Ng,Ning-Yau Cheng,Xiaoou Zhan,Qinzhe Liu,Yifei Chen,Hengwei Xu

Task: Explore the potential of multimodal large language models (LLMs) for deepfake image detection.

Motivation: As generative models advance, deepfake detection becomes more challenging and calls for more advanced methods.

Details

Method: Benchmarks 12 recent multimodal LLMs, applies prompt tuning, and analyzes their reasoning pathways. Result: Some multimodal LLMs perform well zero-shot and even surpass traditional methods, while others perform poorly. Conclusion: Multimodal reasoning shows promise for deepfake detection; model size affects performance, but newer versions and reasoning capabilities do not necessarily help. Abstract: Deepfake detection remains a critical challenge in the era of advanced generative models, particularly as synthetic media becomes more sophisticated. In this study, we explore the potential of state-of-the-art multi-modal (reasoning) large language models (LLMs) for deepfake image detection, such as OpenAI O1/4o, Gemini thinking Flash 2, Deepseek Janus, Grok 3, llama 3.2, Qwen 2/2.5 VL, Mistral Pixtral, and Claude 3.5/3.7 sonnet. We benchmark 12 latest multi-modal LLMs against traditional deepfake detection methods across multiple datasets, including recently published real-world deepfake imagery. To enhance performance, we employ prompt tuning and conduct an in-depth analysis of the models' reasoning pathways to identify key contributing factors in their decision-making process. Our findings indicate that the best multi-modal LLMs achieve competitive performance with promising zero-shot generalization ability, even surpassing traditional deepfake detection pipelines on out-of-distribution datasets, while the rest of the LLM families perform extremely disappointingly, with some worse than random guessing. Furthermore, we found that newer model versions and reasoning capabilities do not contribute to performance in such niche tasks as deepfake detection, while model size does help in some cases. This study highlights the potential of integrating multi-modal reasoning in future deepfake detection frameworks and provides insights into model interpretability for robustness in real-world scenarios.

A Multilingual, Culture-First Approach to Addressing Misgendering in LLM Applications

Sunayana Sitaram,Adrian de Wynter,Isobel McCrum,Qilong Gu,Si-Qing Chen

Task: Develop methods to assess and mitigate misgendering across 42 languages and dialects.

Motivation: Misgendering undermines a person's sense of self; English has clear-cut ways to avoid it, but other languages pose challenges because of grammatical and cultural differences.

Details

Method: Uses a participatory-design approach to design cross-lingual guardrails and tests them in a large language model application, with human-in-the-loop data generation and annotation steps. Result: The proposed guardrails significantly reduce misgendering rates in the generated summaries across all languages without loss of quality. Conclusion: The human-in-the-loop approach demonstrates the feasibility of scaling inclusive and responsible AI solutions across languages and cultures. Abstract: Misgendering is the act of referring to someone by a gender that does not match their chosen identity. It marginalizes and undermines a person's sense of self, causing significant harm. English-based approaches have clear-cut ways of avoiding misgendering, such as the use of the pronoun "they". However, other languages pose unique challenges due to both grammatical and cultural constructs. In this work we develop methodologies to assess and mitigate misgendering across 42 languages and dialects using a participatory-design approach to design effective and appropriate guardrails across all languages. We test these guardrails in a standard large language model-based application (meeting transcript summarization), where both the data generation and the annotation steps followed a human-in-the-loop approach. We find that the proposed guardrails are very effective in reducing misgendering rates across all languages in the summaries generated, and without incurring loss of quality. Our human-in-the-loop approach demonstrates a method to feasibly scale inclusive and responsible AI-based solutions across multiple languages and cultures.

EBS-EKF: Accurate and High Frequency Event-based Star Tracking

Albert W Reed,Connor Hashemi,Dennis Melamed,Nitesh Menon,Keigo Hirakawa,Scott McCloskey

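Task: Propose a novel star tracking algorithm based on event-based sensors (EBS).

Motivation: Event-based sensors are a promising technology for star tracking because of their low latency and power efficiency, but prior work has been limited to simulation and simplified signal models.

Details

Method: Designs the algorithm by combining an analysis of the EBS circuit with an extended Kalman filter (EKF), and evaluates it quantitatively on real night sky data. Result: The new method is an order of magnitude more accurate than existing methods, with more frequent updates and greater motion tolerance than conventional APS trackers. Conclusion: Through improved signal modeling and state estimation, the method substantially improves star tracking performance, and the authors provide the first dataset of events synchronized with APS solutions. Abstract: Event-based sensors (EBS) are a promising new technology for star tracking due to their low latency and power efficiency, but prior work has thus far been evaluated exclusively in simulation with simplified signal models. We propose a novel algorithm for event-based star tracking, grounded in an analysis of the EBS circuit and an extended Kalman filter (EKF). We quantitatively evaluate our method using real night sky data, comparing its results with those from a space-ready active-pixel sensor (APS) star tracker. We demonstrate that our method is an order of magnitude more accurate than existing methods due to improved signal modeling and state estimation, while providing more frequent updates and greater motion tolerance than conventional APS trackers. We provide all code and the first dataset of events synchronized with APS solutions.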
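
For readers unfamiliar with the extended Kalman filter mentioned in the EBS-EKF abstract, the following is a generic scalar EKF predict/update cycle in plain Python. It is only a textbook sketch, not the paper's star tracker: the single-angle state, the constant-rate motion model, the measurement model `h(x) = sin(x)`, and all noise values are assumptions chosen to keep the example small.

```python
import math

def ekf_step(x, P, z, u, dt, Q, R):
    """One predict/update cycle of a scalar extended Kalman filter.

    x: state estimate (a hypothetical rotation angle, rad)
    P: state variance, z: measurement, u: angular rate (rad/s)
    Q, R: process and measurement noise variances
    """
    # Predict: propagate the state and its variance through the motion model.
    x_pred = x + u * dt
    P_pred = P + Q                      # motion Jacobian is 1 for this model

    # Update: linearize the measurement model h(x) = sin(x) at the prediction.
    H = math.cos(x_pred)                # dh/dx evaluated at x_pred
    y = z - math.sin(x_pred)            # innovation
    S = H * P_pred * H + R              # innovation variance
    K = P_pred * H / S                  # Kalman gain
    x_new = x_pred + K * y
    P_new = (1.0 - K * H) * P_pred
    return x_new, P_new

# Static true state observed through noise-free measurements, for clarity.
true_x = 0.30
x_est, P = 0.0, 1.0
for _ in range(50):
    z = math.sin(true_x)
    x_est, P = ekf_step(x_est, P, z, u=0.0, dt=0.01, Q=1e-6, R=1e-2)
```

After the loop, `x_est` has moved from the initial guess of 0.0 to close to `true_x`, and `P` has shrunk well below its initial value. A real tracker would use a multi-dimensional attitude state and a sensor-specific measurement model, which is where the paper's EBS circuit analysis would enter.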

Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models

Shih-Wen Ke,Guan-Yu Lai,Guo-Lin Fang,Hsi-Yuan Kao

Task: Raise the success rate of jailbreaking attacks on large language models (LLMs) through iterative prompting.

Motivation: Study how systematically modified and refined prompts can progressively break through LLMs' ethical and security constraints.

Details

Method: Applies iterative prompting, analyzes LLM response patterns, and refines prompts with persuasion strategies. Result: The attack success rate (ASR) rises as prompts are refined, peaking at 90% (GPT4 and ChatGLM) with a low of 68% (LLaMa2), outperforming the baselines PAIR and PAP. Conclusion: Iterative prompting effectively raises jailbreak success rates, with performance comparable to GCG and ArtPrompt. Abstract: Large language models (LLMs) are designed to align with human values in their responses. This study exploits LLMs with an iterative prompting technique where each prompt is systematically modified and refined across multiple iterations to enhance its effectiveness in jailbreaking attacks progressively. This technique involves analyzing the response patterns of LLMs, including GPT-3.5, GPT-4, LLaMa2, Vicuna, and ChatGLM, allowing us to adjust and optimize prompts to evade the LLMs' ethical and security constraints. Persuasion strategies enhance prompt effectiveness while maintaining consistency with malicious intent. Our results show that the attack success rates (ASR) increase as the attacking prompts become more refined with the highest ASR of 90% for GPT4 and ChatGLM and the lowest ASR of 68% for LLaMa2. Our technique outperforms baseline techniques (PAIR and PAP) in ASR and shows comparable performance with GCG and ArtPrompt.

Peepers & Pixels: Human Recognition Accuracy on Low Resolution Faces

Xavier Merino,Gabriella Pangelinan,Samuel Langborgh,Michael C. King,Kevin W. Bowyer

Task: Explore the limits of human accuracy in one-to-many (1:N) face recognition across different inter-pupillary distances (IPD).

Motivation: Automated 1:N face recognition approaches near-perfect accuracy under ideal conditions, but the reliability of human recognition on low-resolution (low-IPD) surveillance imagery is not established and may limit overall system accuracy.

Details

Method: Systematically tests human recognition accuracy across a range of IPD values. Result: At low IPDs (10px, 5px), human accuracy is at or below chance (50.7%, 35.9%), while confidence remains relatively high (77%, 70.7%). Conclusion: For low-IPD images, human recognition ability may be a limiting factor for overall system accuracy. Abstract: Automated one-to-many (1:N) face recognition is a powerful investigative tool commonly used by law enforcement agencies. In this context, potential matches resulting from automated 1:N recognition are reviewed by human examiners prior to possible use as investigative leads. While automated 1:N recognition can achieve near-perfect accuracy under ideal imaging conditions, operational scenarios may necessitate the use of surveillance imagery, which is often degraded in various quality dimensions. One important quality dimension is image resolution, typically quantified by the number of pixels on the face. The common metric for this is inter-pupillary distance (IPD), which measures the number of pixels between the pupils. Low IPD is known to degrade the accuracy of automated face recognition. However, the threshold IPD for reliability in human face recognition remains undefined. This study aims to explore the boundaries of human recognition accuracy by systematically testing accuracy across a range of IPD values. We find that at low IPDs (10px, 5px), human accuracy is at or below chance levels (50.7%, 35.9%), even as confidence in decision-making remains relatively high (77%, 70.7%). Our findings indicate that, for low IPD images, human recognition ability could be a limiting factor to overall system accuracy.
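
The IPD metric described above is simple to compute once pupil coordinates are known. The sketch below, in plain Python, measures IPD as the pixel distance between two pupil centers and derives the uniform resize factor that would bring a face crop to a target IPD such as 10px. The pupil coordinates, the 400-pixel crop size, and the function names are hypothetical; how the study actually located pupils or downsampled its stimuli is not stated in the abstract.

```python
import math

def ipd_pixels(left_pupil, right_pupil):
    """Euclidean distance in pixels between two pupil centers (x, y)."""
    return math.dist(left_pupil, right_pupil)

def scale_for_target_ipd(current_ipd, target_ipd):
    """Uniform resize factor that brings a face to the target IPD."""
    if current_ipd <= 0:
        raise ValueError("current_ipd must be positive")
    return target_ipd / current_ipd

# Hypothetical pupil coordinates in a 400x400 face crop.
ipd = ipd_pixels((150.0, 200.0), (250.0, 200.0))
factor = scale_for_target_ipd(ipd, target_ipd=10.0)
new_size = round(400 * factor)   # side length of the resized crop, in pixels
```

With these assumed coordinates the crop has an IPD of 100 pixels, so reaching an IPD of 10px means shrinking the crop to one tenth of its side length.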

CFunModel: A "Funny" Language Model Capable of Chinese Humor Generation and Processing

Zhenghan Yu,Xinyu Hu,Xiaojun Wan

Task: Build a dataset (CFunSet) and a model (CFunModel) dedicated to Chinese humor tasks.

Motivation: Large language models perform poorly at generating and processing Chinese humor, so a dedicated dataset and model are needed.

Details

Method: Aggregates existing Chinese humor datasets and collects jokes from Tieba-JokeBar to build CFunSet, then develops CFunModel on this dataset. Result: CFunModel outperforms other large language models on several Chinese humor-related tasks. Conclusion: CFunSet and CFunModel provide effective tools for Chinese humor tasks and fill a gap in existing research. Abstract: Humor plays a significant role in daily language communication. With the rapid development of large language models (LLMs), natural language processing has made significant strides in understanding and generating various genres of texts. However, most LLMs exhibit poor performance in generating and processing Chinese humor. In this study, we introduce a comprehensive Chinese humor-related dataset, the Chinese Fun Set (CFunSet). This dataset aggregates existing Chinese humor datasets and includes over 20,000 jokes collected from Tieba-JokeBar, a Chinese online platform known for joke sharing. The resulting corpus comprises more than 160,000 entries. Leveraging CFunSet, we developed the Chinese Fun Model (CFunModel), the first large language model designed to handle various Chinese humor-related tasks including Crosstalk Response Selection, Humor Recognition, Joke Generation, etc. Experimental results demonstrate that CFunModel outperforms popular large language models in these tasks. Our CFunSet is available at https://huggingface.co/datasets/ZhenghanYU/CFunSet and CFunModel is available at https://huggingface.co/ZhenghanYU/CFunModel. A demonstration video of our work is available at https://youtu.be/MOsISOJ66Ms.

EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis

Sheng Miao,Jiaxin Huang,Dongfeng Bai,Xu Yan,Hongyu Zhou,Yue Wang,Bingbing Liu,Andreas Geiger,Yiyi Liao

Task: Propose EVolSplat, an efficient 3D Gaussian Splatting model for novel view synthesis of urban scenes.

Motivation: Existing NeRF- and 3DGS-based methods require slow per-scene optimization, while feed-forward methods suffer from multi-view inconsistencies and duplicated content.

Details

Method: Uses a 3D convolutional network to predict 3D Gaussians across multiple frames, initializing them from noisy depth predictions and refining their geometric properties, predicting color from 2D textures, and handling distant views and the sky with a hemisphere background model. Result: Achieves state-of-the-art rendering quality and real-time performance on the KITTI-360 and Waymo datasets. Conclusion: EVolSplat is an efficient, high-quality feed-forward 3D Gaussian Splatting model suited to novel view synthesis of urban scenes. Abstract: Novel view synthesis of urban scenes is essential for autonomous driving-related applications. Existing NeRF and 3DGS-based methods show promising results in achieving photorealistic renderings but require slow, per-scene optimization. We introduce EVolSplat, an efficient 3D Gaussian Splatting model for urban scenes that works in a feed-forward manner. Unlike existing feed-forward, pixel-aligned 3DGS methods, which often suffer from issues like multi-view inconsistencies and duplicated content, our approach predicts 3D Gaussians across multiple frames within a unified volume using a 3D convolutional network. This is achieved by initializing 3D Gaussians with noisy depth predictions, and then refining their geometric properties in 3D space and predicting color based on 2D textures. Our model also handles distant views and the sky with a flexible hemisphere background model. This enables us to perform fast, feed-forward reconstruction while achieving real-time rendering. Experimental evaluations on the KITTI-360 and Waymo datasets show that our method achieves state-of-the-art quality compared to existing feed-forward 3DGS- and NeRF-based methods.

TempTest: Local Normalization Distortion and the Detection of Machine-generated Text

Tom Kempton,Stuart Burrell,Connor Cheverall

Task: Propose a method for detecting machine-generated text that does not depend on the generating language model.

Motivation: Existing methods rely on statistics such as log-likelihood, log-rank, and entropy; as language models approximate the distribution of human text ever more closely, the detection power of these methods will be limited.

Details

Method: Designs a theoretically justified detector that targets a defect in how decoding strategies such as temperature or top-k sampling normalize conditional probabilities. Result: In white-box and black-box settings, the method performs well across language models, datasets, and passage lengths, and in some cases substantially outperforms existing methods. Conclusion: The method is theoretically grounded, easy to explain, and performs on par with or better than existing detectors. Abstract: Existing methods for the zero-shot detection of machine-generated text are dominated by three statistical quantities: log-likelihood, log-rank, and entropy. As language models mimic the distribution of human text ever closer, this will limit our ability to build effective detection algorithms. To combat this, we introduce a method for detecting machine-generated text that is entirely agnostic of the generating language model. This is achieved by targeting a defect in the way that decoding strategies, such as temperature or top-k sampling, normalize conditional probability measures. This method can be rigorously theoretically justified, is easily explainable, and is conceptually distinct from existing methods for detecting machine-generated text. We evaluate our detector in the white and black box settings across various language models, datasets, and passage lengths. We also study the effect of paraphrasing attacks on our detector and the extent to which it is biased against non-native speakers. In each of these settings, the performance of our test is at least comparable to that of other state-of-the-art text detectors, and in some cases, we strongly outperform these baselines.
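
To show what "normalizing conditional probabilities" means for temperature sampling, here is a small plain-Python sketch. Temperature sampling with temperature T raises each next-token probability to the power 1/T and divides by a normalizer computed from that token position's own distribution, so the normalizer differs from context to context. The two toy distributions and the function name `apply_temperature` are assumptions for illustration; this is not the TempTest statistic itself, which the abstract does not spell out.

```python
def apply_temperature(probs, temperature):
    """Rescale a next-token distribution: p_i^(1/T) / sum_j p_j^(1/T).

    Returns the rescaled distribution and the local normalizer.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    powered = [p ** (1.0 / temperature) for p in probs]
    z = sum(powered)   # local normalizer: depends on this context's distribution
    return [v / z for v in powered], z

# Two hypothetical next-token distributions from different contexts.
peaked = [0.90, 0.05, 0.05]
flat = [0.40, 0.30, 0.30]

peaked_t, z_peaked = apply_temperature(peaked, temperature=0.7)
flat_t, z_flat = apply_temperature(flat, temperature=0.7)
```

With T below 1 the most likely token gains probability mass in both contexts, but the normalizer is larger for the peaked distribution than for the flat one. That context dependence is the kind of local effect a detector targeting the normalization step could look for.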

Guiding Human-Object Interactions with Rich Geometry and Relations

Mengqing Xue,Yifei Liu,Ling Guo,Shaoli Huang,Changxing Ding

Task: Propose ROG, a diffusion-based framework for modeling the spatiotemporal relationships of human-object interactions (HOI) with rich geometric detail.

Motivation: Existing methods rely on simplified object representations that can overlook geometric complexity, resulting in insufficient interaction fidelity.

Details

Method: Selects boundary-focused and fine-detail key points from the object mesh to construct an interactive distance field (IDF), and develops a diffusion-based relation model with spatial and temporal attention. Result: ROG significantly outperforms existing methods in the realism and semantic accuracy of synthesized HOIs. Conclusion: ROG improves HOI synthesis quality through geometric detail and relation modeling. Abstract: Human-object interaction (HOI) synthesis is crucial for creating immersive and realistic experiences for applications such as virtual reality. Existing methods often rely on simplified object representations, such as the object's centroid or the nearest point to a human, to achieve physically plausible motions. However, these approaches may overlook geometric complexity, resulting in suboptimal interaction fidelity. To address this limitation, we introduce ROG, a novel diffusion-based framework that models the spatiotemporal relationships inherent in HOIs with rich geometric detail. For efficient object representation, we select boundary-focused and fine-detail key points from the object mesh, ensuring a comprehensive depiction of the object's geometry. This representation is used to construct an interactive distance field (IDF), capturing the robust HOI dynamics. Furthermore, we develop a diffusion-based relation model that integrates spatial and temporal attention mechanisms, enabling a better understanding of intricate HOI relationships. This relation model refines the generated motion's IDF, guiding the motion generation process to produce relation-aware and semantically aligned movements. Experimental evaluations demonstrate that ROG significantly outperforms state-of-the-art methods in the realism and semantic accuracy of synthesized HOIs.
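
One plausible reading of a distance field between a human and object key points is a per-frame matrix of joint-to-key-point distances. The plain-Python sketch below builds such a matrix for a single frame. The abstract does not define the IDF precisely, so the pairwise Euclidean formulation, the joint and key-point coordinates, and the function name `interaction_distance_field` are all assumptions.

```python
import math

def interaction_distance_field(joints, keypoints):
    """Pairwise distances for one frame.

    Rows correspond to human joints, columns to object key points.
    """
    return [[math.dist(j, k) for k in keypoints] for j in joints]

# Hypothetical 3D positions (metres) for a single frame.
joints = [(0.0, 1.0, 0.0), (0.3, 1.2, 0.1)]                      # e.g. wrist, elbow
keypoints = [(0.0, 1.0, 0.5), (0.3, 1.2, 0.1), (1.0, 0.0, 0.0)]  # sampled from the mesh

idf = interaction_distance_field(joints, keypoints)

# Closest joint/key-point pair as (distance, joint index, key-point index).
closest = min(
    (d, ji, ki) for ji, row in enumerate(idf) for ki, d in enumerate(row)
)
```

Stacking one such matrix per frame would give a spatiotemporal tensor that a relation model could attend over. Using several key points rather than only the centroid or nearest point is what lets the matrix reflect the object's shape.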

Enhancing Depression Detection via Question-wise Modality Fusion

Aishik Mandal,Dana Atzil-Slonim,Thamar Solorio,Iryna Gurevych

Task: Automate depression diagnosis using multimodal data and question-wise modality fusion.

Motivation: Current depression diagnosis relies on self-reported questionnaires or clinical interviews, which delays treatment and consumes resources; existing automated methods fall short in multimodal fusion and training.

Details

Method: Proposes the QuestMF framework and the ImbOLL loss function to address uneven modality contributions and ordinal classification. Result: Performance on the E-DAIC dataset is comparable to current state-of-the-art models, with improved interpretability. Conclusion: QuestMF improves diagnostic efficiency and supports personalized interventions; the code is publicly available. Abstract: Depression is a highly prevalent and disabling condition that incurs substantial personal and societal costs. Current depression diagnosis involves determining the depression severity of a person through self-reported questionnaires or interviews conducted by clinicians. This often leads to delayed treatment and involves substantial human resources. Thus, several works try to automate the process using multimodal data. However, they usually overlook the following: i) the variable contribution of each modality for each question in the questionnaire and ii) the use of ordinal classification for the task. This results in sub-optimal fusion and training methods. In this work, we propose a novel Question-wise Modality Fusion (QuestMF) framework trained with a novel Imbalanced Ordinal Log-Loss (ImbOLL) function to tackle these issues. The performance of our framework is comparable to the current state-of-the-art models on the E-DAIC dataset and enhances interpretability by predicting scores for each question. This will help clinicians identify an individual's symptoms, allowing them to customise their interventions accordingly. We also make the code for the QuestMF framework publicly available.
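
The abstract does not give the ImbOLL formula, so the sketch below shows only a generic distance-weighted ordinal log loss in plain Python, to illustrate why an ordinal loss differs from plain cross-entropy. Probability mass placed on a class is penalized in proportion to how far that class is from the true one. The optional `class_weights` argument is a hypothetical stand-in for imbalance handling and should not be read as the paper's definition; the four-class example (a questionnaire item scored 0 to 3) is likewise assumed.

```python
import math

def ordinal_log_loss(probs, target, alpha=1.0, class_weights=None, eps=1e-12):
    """Generic ordinal log loss: w[target] * sum_i |i - target|^alpha * -log(1 - p_i).

    class_weights is a hypothetical per-class weight for imbalanced data.
    """
    weight = 1.0 if class_weights is None else class_weights[target]
    loss = 0.0
    for i, p in enumerate(probs):
        distance = abs(i - target) ** alpha   # zero for the true class
        loss += -math.log(max(1.0 - p, eps)) * distance
    return weight * loss

# Four ordinal classes; the true score is 0 in both cases.
near_miss = ordinal_log_loss([0.1, 0.7, 0.1, 0.1], target=0)  # mass on class 1
far_miss = ordinal_log_loss([0.1, 0.1, 0.1, 0.7], target=0)   # mass on class 3
perfect = ordinal_log_loss([1.0, 0.0, 0.0, 0.0], target=0)
```

Both wrong predictions put 0.7 on an incorrect class, yet the one that is three steps away is penalized more than the one that is one step away, while a perfect prediction incurs zero loss. Plain cross-entropy would score the two wrong predictions identically.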

Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration

Shihao Zhou,Dayu Li,Jinshan Pan,Juncheng Zhou,Jinglei Shi,Jufeng Yang

Task: Improve multi-head attention (MHA) to address its redundancy problem, proposing HINT, a hierarchical multi-head attention driven Transformer model for image restoration.

Motivation: In MHA, each head computes attention independently from uniformly split subspaces, which causes redundancy and degrades output quality.

Details

Method: HINT comprises Hierarchical Multi-Head Attention (HMHA) and Query-Key Cache Updating (QKCU) modules, which reduce redundancy by learning from diverse subspaces and strengthening interactions between heads. Result: HINT's superiority is validated on 12 benchmarks across 5 image restoration tasks. Conclusion: By improving the multi-head attention mechanism, HINT significantly improves performance on image restoration tasks. Abstract: Transformer-based approaches have gained significant attention in image restoration, where the core component, i.e., Multi-Head Attention (MHA), plays a crucial role in capturing diverse features and recovering high-quality results. In MHA, heads perform attention calculation independently from uniformly split subspaces, and a redundancy issue is triggered that hinders the model from achieving satisfactory outputs. In this paper, we propose to improve MHA by exploring diverse learners and introducing various interactions between heads, which results in a Hierarchical multI-head atteNtion driven Transformer model, termed HINT, for image restoration. HINT contains two modules, i.e., the Hierarchical Multi-Head Attention (HMHA) and the Query-Key Cache Updating (QKCU) module, to address the redundancy problem that is rooted in vanilla MHA. Specifically, HMHA extracts diverse contextual features by employing heads to learn from subspaces of varying sizes and containing different information. Moreover, QKCU, comprising intra- and inter-layer schemes, further reduces the redundancy problem by facilitating enhanced interactions between attention heads within and across layers. Extensive experiments are conducted on 12 benchmarks across 5 image restoration tasks, including low-light enhancement, dehazing, desnowing, denoising, and deraining, to demonstrate the superiority of HINT. The source code is available in the supplementary materials.

Explainable ICD Coding via Entity Linking

Leonor Barreiros,Isabel Coutinho,Gonçalo M. Correia,Bruno Martins

Task: Reframe clinical coding as an entity linking problem so that each code is backed by explicit textual evidence.

Motivation: Traditional automated clinical coding methods lack explicit textual evidence, which medical coders need in order to verify that a code is correct.

Details

Method: Three approaches to the entity linking problem are proposed, using parameter-efficient fine-tuning of large language models (LLMs) together with constrained decoding. Result: The approaches perform well at disambiguating clinical mentions and in few-shot scenarios. Conclusion: Reframing the task and leveraging LLMs enables better human-machine collaboration and coding accuracy. Abstract: Clinical coding is a critical task in healthcare, although traditional methods for automating clinical coding may not provide sufficient explicit evidence for coders in production environments. This evidence is crucial, as medical coders have to make sure there exists at least one explicit passage in the input health record that justifies the attribution of a code. We therefore propose to reframe the task as an entity linking problem, in which each document is annotated with its set of codes and respective textual evidence, enabling better human-machine collaboration. By leveraging parameter-efficient fine-tuning of Large Language Models (LLMs), together with constrained decoding, we introduce three approaches to solve this problem that prove effective at disambiguating clinical mentions and that perform well in few-shot scenarios.
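
The abstract does not say how decoding is constrained, so the following is only a minimal sketch of one common approach: a prefix tree (trie) over the allowed label strings, with generation restricted at each step to tokens that continue a valid label. The label inventory, the whitespace tokenization, and `toy_score` (a stand-in for a model's next-token score) are all hypothetical.

```python
def build_trie(sequences):
    """Build a nested-dict prefix tree over token sequences."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node["<eos>"] = {}  # marks the end of a complete label
    return root


def constrained_greedy_decode(trie, score):
    """Greedy decoding that only considers tokens the trie allows.

    `score(prefix, token)` is a hypothetical stand-in for a language
    model's next-token score.
    """
    node, output = trie, []
    while True:
        allowed = list(node.keys())
        best = max(allowed, key=lambda tok: score(output, tok))
        if best == "<eos>":
            return output
        output.append(best)
        node = node[best]


# Toy label inventory (illustrative strings, not real ICD entries).
LABELS = [
    ["acute", "kidney", "failure"],
    ["acute", "heart", "failure"],
    ["chronic", "kidney", "disease"],
]


def toy_score(prefix, token):
    # Hypothetical scorer: prefers tokens that appear in the "note".
    note = {"acute", "kidney", "failure"}
    return 1.0 if token in note else 0.0


TRIE = build_trie(LABELS)
DECODED = constrained_greedy_decode(TRIE, toy_score)
```

Because every step is limited to the children of the current trie node, the output is guaranteed to be one of the listed labels, whatever the scorer prefers.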

Spectrum from Defocus: Fast Spectral Imaging with Chromatic Focal Stack

M. Kerem Aydin,Yi-Chun Hung,Jaclyn Pytlarz,Qi Guo,Emma Alexander

Task: Propose Spectrum from Defocus (SfD), a method for efficiently recovering hyperspectral images.

Motivation: Hyperspectral cameras face harsh trade-offs between spatial, spectral, and temporal resolution in a low-photon regime, and existing computational imaging systems require complex optics and extensive compute.

Details

Method: A chromatic focal sweep is performed with two off-the-shelf lenses and a grayscale sensor, and a physics-based iterative algorithm demixes, deconvolves, and denoises the blurry grayscale focal stack. Result: Hyperspectral imaging that is photon-efficient, optically simple, and computationally fast (< 1 second). Conclusion: SfD is a promising solution for fast, compact, and interpretable hyperspectral imaging. Abstract: Hyperspectral cameras face harsh trade-offs between spatial, spectral, and temporal resolution in an inherently low-photon regime. Computational imaging systems break through these trade-offs with compressive sensing, but require complex optics and/or extensive compute. We present Spectrum from Defocus (SfD), a chromatic focal sweep method that recovers state-of-the-art hyperspectral images with a small system of off-the-shelf optics and < 1 second of compute. Our camera uses two lenses and a grayscale sensor to preserve nearly all incident light in a chromatically-aberrated focal stack. Our physics-based iterative algorithm efficiently demixes, deconvolves, and denoises the blurry grayscale focal stack into a sharp spectral image. The combination of photon efficiency, optical simplicity, and physical modeling makes SfD a promising solution for fast, compact, interpretable hyperspectral imaging.

StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7,000+ Real-World APIs

Zhicheng Guo,Sijie Cheng,Yuchen Niu,Hao Wang,Sicheng Zhou,Wenbing Huang,Yang Liu

Task: Propose MirrorAPI, a framework that trains specialized LLMs to simulate real API responses, addressing stability, scalability, and realness in tool learning.

Motivation: Existing tool environments struggle to balance stability, scalability, and realness, particularly for benchmarking.

Details

Method: Specialized LLMs are trained with supervised fine-tuning and chain-of-thought reasoning on a dataset of request-response pairs from 7,000+ APIs. Result: MirrorAPI outperforms existing methods on MirrorAPI-Bench and has been integrated into StableToolBench. Conclusion: By simulating API responses with high fidelity, MirrorAPI significantly improves tool-learning performance. Abstract: The rapid advancement of large language models (LLMs) has spurred significant interest in tool learning, where LLMs are augmented with external tools to tackle complex tasks. However, existing tool environments face challenges in balancing stability, scalability, and realness, particularly for benchmarking purposes. To address this problem, we propose MirrorAPI, a novel framework that trains specialized LLMs to accurately simulate real API responses, effectively acting as "mirrors" to tool environments. Using a comprehensive dataset of request-response pairs from 7,000+ APIs, we employ supervised fine-tuning and chain-of-thought reasoning to enhance simulation fidelity. MirrorAPI achieves superior accuracy and stability compared to state-of-the-art methods, as demonstrated by its performance on the newly constructed MirrorAPI-Bench and its integration into StableToolBench.

Xiao Guo,Xiufeng Song,Yue Zhang,Xiaohong Liu,Xiaoming Liu

Task: Propose a deepfake detection method that produces binary classification results and textual explanations simultaneously.

Motivation: Existing methods provide either classification results or textual explanations, but not both; the goal is to improve the generalization and explainability of deepfake detection.

Details

Method: Combining the multi-modal learning capability of pre-trained CLIP with the interpretability of large language models (LLMs), a multi-modal face forgery detector (M2F2-Det) is designed, featuring tailored face forgery prompt learning and detailed LLM-generated textual explanations. Result: State-of-the-art performance on both detection and explanation generation tasks, effectively identifying and explaining diverse forgeries. Conclusion: By combining multi-modal learning with an LLM, M2F2-Det significantly improves the generalization and explainability of deepfake detection. Abstract: Deepfake detection is a long-established research topic vital for mitigating the spread of malicious misinformation. Unlike prior methods that provide either binary classification results or textual explanations separately, we introduce a novel method capable of generating both simultaneously. Our method harnesses the multi-modal learning capability of the pre-trained CLIP and the unprecedented interpretability of large language models (LLMs) to enhance both the generalization and explainability of deepfake detection. Specifically, we introduce a multi-modal face forgery detector (M2F2-Det) that employs tailored face forgery prompt learning, incorporating the pre-trained CLIP to improve generalization to unseen forgeries. Also, M2F2-Det incorporates an LLM to provide detailed textual explanations of its detection decisions, enhancing interpretability by bridging the gap between natural language and subtle cues of facial forgeries. Empirically, we evaluate M2F2-Det on both detection and explanation generation tasks, where it achieves state-of-the-art performance, demonstrating its effectiveness in identifying and explaining diverse forgeries.

Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence

Yijiong Yu

Task: Accelerate the inference of reasoning models through parallelization.

Motivation: Existing reasoning models are accurate, but generating detailed reasoning sequences is computationally expensive and time-consuming.

Details

Method: Task parallelizability is exploited by decoding multiple tokens per step within a single sequence using a specialized attention mask. Result: Experiments show over 100% speedup in decoding time while largely maintaining accuracy. Conclusion: The method substantially improves reasoning efficiency and offers a practical way to accelerate complex tasks. Abstract: Recent advances in reasoning models have demonstrated significant improvements in accuracy, particularly for complex tasks such as mathematical reasoning, by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and time-consuming. To address this inefficiency, we leverage the inherent parallelizability of certain tasks to accelerate the reasoning process. Specifically, when multiple parallel reasoning branches exist, we decode multiple tokens per step using a specialized attention mask, processing them within a single sequence. Experimental results show that our method achieves over 100% speedup in decoding time while largely maintaining accuracy.
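
The abstract does not give the exact mask, so the sketch below shows only one plausible construction, under stated assumptions: each position carries a branch id (`-1` for shared tokens), a branch token may attend to shared tokens and to earlier tokens of its own branch, and a shared token may attend to everything before it. The function name and the id convention are hypothetical.

```python
def branch_attention_mask(branch_ids):
    """Return mask[i][j] = True if position i may attend to position j.

    branch_ids[k] is -1 for shared tokens (prompt or conclusion),
    otherwise the index of the reasoning branch token k belongs to.
    """
    n = len(branch_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        sees_all = branch_ids[i] == -1  # assumption: shared tokens see every branch
        for j in range(i + 1):  # causal: only the current and earlier positions
            shared = branch_ids[j] == -1
            same_branch = branch_ids[j] == branch_ids[i]
            mask[i][j] = sees_all or shared or same_branch
    return mask


# Three shared prompt tokens, then two branches decoded in lockstep,
# so their tokens interleave in the sequence: b0, b1, b0, b1.
BRANCH_IDS = [-1, -1, -1, 0, 1, 0, 1]
MASK = branch_attention_mask(BRANCH_IDS)
```

With this layout, each decoding step appends one token per branch, and the mask keeps the branches from reading each other's partial reasoning even though they share one sequence.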

Yuxuan Chen,Jiawen Li,Jiali Hu,Xitong Ling,Tian Guan,Anjia Han,Yonghong He

Task: Propose ProAlign, a cross-modal unsupervised representation learning framework for whole slide images (WSIs).

Motivation: Existing methods are either tied to specific downstream tasks or focus solely on the visual modality, neglecting the rich semantic information in textual data.

Details

Method: A large language model generates descriptive text for the prototype types present in a WSI, patch-text contrast is used to construct initial prototype embeddings, and a parameter-free attention aggregation strategy is proposed. Result: On four public datasets, ProAlign outperforms existing unsupervised frameworks and performs comparably to some weakly supervised models. Conclusion: Through cross-modal learning, ProAlign improves the generality and performance of WSI representation learning. Abstract: With the rapid advancement of pathology foundation models (FMs), the representation learning of whole slide images (WSIs) attracts increasing attention. Existing studies develop high-quality patch feature extractors and employ carefully designed aggregation schemes to derive slide-level representations. However, mainstream weakly supervised slide representation learning methods, primarily based on multiple instance learning (MIL), are tailored to specific downstream tasks, which limits their generalizability. To address this issue, some studies explore unsupervised slide representation learning. However, these approaches focus solely on the visual modality of patches, neglecting the rich semantic information embedded in textual data. In this work, we propose ProAlign, a cross-modal unsupervised slide representation learning framework. Specifically, we leverage a large language model (LLM) to generate descriptive text for the prototype types present in a WSI, introducing patch-text contrast to construct initial prototype embeddings. Furthermore, we propose a parameter-free attention aggregation strategy that utilizes the similarity between patches and these prototypes to form unsupervised slide embeddings applicable to a wide range of downstream tasks. Extensive experiments on four public datasets show that ProAlign outperforms existing unsupervised frameworks and achieves performance comparable to some weakly supervised models.
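
The abstract states only that aggregation is parameter-free and driven by patch-prototype similarity. The sketch below is one way this could look, not the authors' formulation: the softmax weighting, the `temperature` value, and the concatenation of per-prototype pooled vectors are all assumptions, and the toy 2-d vectors are made up.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def aggregate_slide(patches, prototypes, temperature=0.1):
    """Build a slide embedding with no learned parameters.

    For each prototype, patches are weighted by a softmax over their
    cosine similarity to it; the per-prototype weighted means are
    concatenated into one vector.
    """
    dim = len(patches[0])
    slide = []
    for proto in prototypes:
        sims = [cosine(p, proto) / temperature for p in patches]
        top = max(sims)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        pooled = [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]
        slide.extend(pooled)
    return slide


# Toy data: three 2-d patch features and two prototype embeddings.
PATCHES = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
PROTOTYPES = [[1.0, 0.0], [0.0, 1.0]]
SLIDE = aggregate_slide(PATCHES, PROTOTYPES)
```

The resulting vector has one pooled block per prototype, so slides with different patch counts map to embeddings of the same length.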

A Retrieval-Based Approach to Medical Procedure Matching in Romanian

Andrei Niculae,Adrian Cosma,Emilian Radoi

Task: Map medical procedure names from healthcare providers to the standardized terminology used by insurance companies.

Motivation: Inconsistent naming leads to misclassified procedures, administrative inefficiency, and insurance claim problems; mapping is still largely manual, leaving a clear opportunity for automation.

Details

Method: A retrieval-based architecture that uses sentence embeddings for name matching in the Romanian healthcare system, with an evaluation of multiple embedding models. Result: The most effective solution for low-resource languages such as Romanian is identified. Conclusion: The work contributes to medical NLP for low-resource languages. Abstract: Accurately mapping medical procedure names from healthcare providers to standardized terminology used by insurance companies is a crucial yet complex task. Inconsistencies in naming conventions lead to misclassified procedures, causing administrative inefficiencies and insurance claim problems in private healthcare settings. Many companies still use human resources for manual mapping, while there is a clear opportunity for automation. This paper proposes a retrieval-based architecture leveraging sentence embeddings for medical name matching in the Romanian healthcare system. This challenge is significantly more difficult in underrepresented languages such as Romanian, where existing pretrained language models lack domain-specific adaptation to medical text. We evaluate multiple embedding models, including Romanian, multilingual, and medical-domain-specific representations, to identify the most effective solution for this task. Our findings contribute to the broader field of medical NLP for low-resource languages such as Romanian.
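
As a rough illustration of embedding-based retrieval, the sketch below ranks standardized names by cosine similarity to a query vector. The 3-d vectors and the catalog entries are hypothetical; a real system would obtain embeddings from one of the sentence-embedding models the paper evaluates, which are not named in the abstract.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def retrieve(query_vec, catalog, top_k=1):
    """Rank standardized procedure names by cosine similarity to the query."""
    ranked = sorted(
        catalog.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical embeddings of standardized (insurer-side) procedure names.
CATALOG = {
    "abdominal ultrasound": [0.9, 0.1, 0.0],
    "chest X-ray": [0.1, 0.9, 0.1],
    "complete blood count": [0.0, 0.1, 0.9],
}
# Stands in for the embedding of a provider's free-text procedure name.
QUERY = [0.8, 0.2, 0.1]
BEST = retrieve(QUERY, CATALOG)
```

Raising `top_k` returns a short candidate list instead of a single match, which suits a workflow where a human confirms the mapping.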

Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

Alex Jinpeng Wang,Linjie Li,Zhengyuan Yang,Lijuan Wang,Min Li

Task: Generate high-quality long-text images, overcoming the limitation that existing text-to-image systems handle only short text.

Motivation: Current generative models struggle to render coherent long-form text in images, such as paragraphs in slides or documents, and existing systems typically support only short text.

Details

Method: A novel text-focused binary tokenizer is introduced, and the multimodal autoregressive model \ModelName is built on it. Result: \ModelName significantly outperforms SD3.5 Large and GPT4o with DALL-E 3 at generating long-text images, with higher accuracy and flexibility. Conclusion: \ModelName opens a new frontier in long-text image generation and supports innovative applications such as document and slide generation. Abstract: Recent advancements in autoregressive and diffusion models have led to strong performance in image generation with short scene text words. However, generating coherent, long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We present the first work specifically focused on long text image generation, addressing a critical gap in existing text-to-image systems that typically handle only brief phrases or single sentences. Through comprehensive analysis of state-of-the-art autoregressive generation models, we identify the image tokenizer as a critical bottleneck in text generation quality. To address this, we introduce a novel text-focused, binary tokenizer optimized for capturing detailed scene text features. Leveraging our tokenizer, we develop \ModelName, a multimodal autoregressive model that excels in generating high-quality long-text images with unprecedented fidelity. Our model offers robust controllability, enabling customization of text properties such as font style, size, color, and alignment. Extensive experiments demonstrate that \ModelName significantly outperforms SD3.5 Large and GPT4o with DALL-E 3 in generating long text accurately, consistently, and flexibly. Beyond its technical achievements, \ModelName opens up exciting opportunities for innovative applications like interleaved document and PowerPoint generation, establishing a new frontier in long-text image generation.

Low-resource Information Extraction with the European Clinical Case Corpus

Soumitra Ghosh,Begona Altuna,Saeed Farzi,Pietro Ferrazzi,Alberto Lavelli,Giulia Mezzanotte,Manuela Speranza,Bernardo Magnini

Task: Build and release E3C-3.0, a multilingual medical-domain dataset annotated with diseases and test-result relations.

Motivation: Address the scarcity of multilingual data in the medical domain and examine the effectiveness of cross-lingual transfer learning.

Details

Method: A semi-automatic approach that combines automatic annotation projection based on large language models (LLMs) with human revision. Result: Experiments show that current state-of-the-art LLMs perform better after fine-tuning on the dataset and that cross-lingual transfer learning is very effective. Conclusion: E3C-3.0 is of practical value for the multilingual medical domain and effectively supports cross-lingual transfer learning. Abstract: We present E3C-3.0, a multilingual dataset in the medical domain, comprising clinical cases annotated with diseases and test-result relations. The dataset includes both native texts in five languages (English, French, Italian, Spanish and Basque) and texts translated and projected from the English source into five target languages (Greek, Italian, Polish, Slovak, and Slovenian). A semi-automatic approach has been implemented, including automatic annotation projection based on Large Language Models (LLMs) and human revision. We present several experiments showing that current state-of-the-art LLMs can benefit from being fine-tuned on the E3C-3.0 dataset. We also show that transfer learning in different languages is very effective, mitigating the scarcity of data. Finally, we compare performance both on native data and on projected data. We release the data at https://huggingface.co/collections/NLP-FBK/e3c-projected-676a7d6221608d60e4e9fd89 .

Assessing SAM for Tree Crown Instance Segmentation from Drone Imagery

Mélisande Teng,Arthur Ouaknine,Etienne Laliberté,Yoshua Bengio,David Rolnick,Hugo Larochelle

Task: Compare the Segment Anything Model (SAM) with Mask R-CNN on automatic tree crown instance segmentation in drone imagery.

Motivation: Current methods for monitoring tree planting projects are costly and time-consuming; drone remote sensing and computer vision offer a potential solution, but their effectiveness needs to be verified.

Details

Method: SAM and Mask R-CNN are applied to tree crown instance segmentation, and the effect of adding a Digital Surface Model (DSM) is explored. Result: Out of the box, SAM does not outperform a custom Mask R-CNN, but predictions improve when DSM information is added. Conclusion: SAM shows potential for tree crown segmentation but needs further tuning, and adding DSM information improves performance. Abstract: The potential of tree planting as a natural climate solution is often undermined by inadequate monitoring of tree planting projects. Current monitoring methods involve measuring trees by hand for each species, requiring extensive cost, time, and labour. Advances in drone remote sensing and computer vision offer great potential for mapping and characterizing trees from aerial imagery, and large pre-trained vision models, such as the Segment Anything Model (SAM), may be a particularly compelling choice given limited labeled data. In this work, we compare SAM methods for the task of automatic tree crown instance segmentation in high resolution drone imagery of young tree plantations. We explore the potential of SAM for this task, and find that methods using SAM out-of-the-box do not outperform a custom Mask R-CNN, even with well-designed prompts, but that there is potential for methods which tune SAM further. We also show that predictions can be improved by adding Digital Surface Model (DSM) information as an input.

Synthetic Data Augmentation for Cross-domain Implicit Discourse Relation Recognition

Frances Yung,Varsha Suresh,Zaynab Reza,Mansoor Ahmad,Vera Demberg

Task: Recognize the implicit coherence relation between two text spans (IDRR).

Motivation: Zero- and few-shot approaches perform poorly on IDRR, but large language models (LLMs) might help model training by generating synthetic data.

Details

Method: In a cross-domain setting, unlabelled target-domain data and an LLM are used to generate coherence-relation samples that adapt a model trained on labelled source-domain data. Result: Experiments show that the different variants of the approach bring no significant improvement. Conclusion: LLM-generated samples are of limited help for IDRR, and model evaluation should attend to statistical significance and comparability. Abstract: Implicit discourse relation recognition (IDRR) -- the task of identifying the implicit coherence relation between two text spans -- requires deep semantic understanding. Recent studies have shown that zero- or few-shot approaches significantly lag behind supervised models, but LLMs may be useful for synthetic data augmentation, where LLMs generate a second argument following a specified coherence relation. We applied this approach in a cross-domain setting, generating discourse continuations using unlabelled target-domain data to adapt a base model which was trained on source-domain labelled data. Evaluations conducted on a large-scale test set revealed that different variations of the approach did not result in any significant improvements. We conclude that LLMs often fail to generate useful samples for IDRR, and emphasize the importance of considering both statistical significance and comparability when evaluating IDRR models.

Reasoning and Learning a Perceptual Metric for Self-Training of Reflective Objects in Bin-Picking with a Low-cost Camera

Peiyuan Ni,Chee Meng Chew,Marcelo H. Ang Jr.,Gregory S. Chirikjian

Task: Propose a two-stage framework (a metric learning stage and a self-training stage) to handle the sparse depth information and reflective surface textures that low-cost RGB-D cameras face in bin-picking of metal objects.

Motivation: In bin-picking of metal objects, sparse depth information and reflective surface textures from low-cost RGB-D cameras cause errors and require manual labeling; human intervention needs to be reduced.

Details

Method: A Multi-object Pose Reasoning (MoPR) algorithm optimizes pose hypotheses, a Symmetry-aware Lie-group based Bayesian Gaussian Mixture Model (SaL-BGMM) performs symmetry-aware filtering, and a WR-InfoNCE loss supports self-training. Result: The method outperforms several state-of-the-art methods on the ROBI dataset and the newly introduced Self-ROBI dataset. Conclusion: The two-stage framework effectively addresses the problems of low-cost cameras in bin-picking of metal objects and reduces human intervention. Abstract: Bin-picking of metal objects using low-cost RGB-D cameras often suffers from sparse depth information and reflective surface textures, leading to errors and the need for manual labeling. To reduce human intervention, we propose a two-stage framework consisting of a metric learning stage and a self-training stage. Specifically, to automatically process data captured by a low-cost camera (LC), we introduce a Multi-object Pose Reasoning (MoPR) algorithm that optimizes pose hypotheses under depth, collision, and boundary constraints. To further refine pose candidates, we adopt a Symmetry-aware Lie-group based Bayesian Gaussian Mixture Model (SaL-BGMM), integrated with the Expectation-Maximization (EM) algorithm, for symmetry-aware filtering. Additionally, we propose a Weighted Ranking Information Noise Contrastive Estimation (WR-InfoNCE) loss to enable the LC to learn a perceptual metric from reconstructed data, supporting self-training on untrained or even unseen objects. Experimental results show that our approach outperforms several state-of-the-art methods on both the ROBI dataset and our newly introduced Self-ROBI dataset.

Collaborative Storytelling and LLM: A Linguistic Analysis of Automatically-Generated Role-Playing Game Sessions

Alessandro Maisto

Task: Investigate whether the language of large language models (LLMs) exhibits oral or written features when they generate role-playing game (RPG) sessions without human intervention.

Motivation: Explore the linguistic characteristics of LLMs in RPG sessions to understand how their style of expression differs from traditional spoken and written texts.

Details

Method: A linguistic analysis of the lexical and syntactic features of LLM-generated texts, compared against human RPG sessions, oral conversations, and books. Result: The language pattern produced by LLMs differs from all other text categories, including oral conversations, human RPG sessions, and books. Conclusion: Training significantly influences how LLMs express themselves, and the study offers important indications of their narrative capabilities. Abstract: Role-playing games (RPG) are games in which players interact with one another to create narratives. The role of players in the RPG is largely based on the interaction between players and their characters. This emerging form of shared narrative, primarily oral, is receiving increasing attention. In particular, many authors investigated the use of an LLM as an actor in the game. In this paper, we aim to discover to what extent the language of Large Language Models (LLMs) exhibits oral or written features when asked to generate an RPG session without human interference. We will conduct a linguistic analysis of the lexical and syntactic features of the generated texts and compare the results with analyses of conversations, transcripts of human RPG sessions, and books. We found that LLMs exhibit a pattern that is distinct from all other text categories, including oral conversations, human RPG sessions and books. Our analysis has shown how training influences the way LLMs express themselves and provides important indications of the narrative capabilities of these tools.

BEAR: A Video Dataset For Fine-grained Behaviors Recognition Oriented with Action and Environment Factors

Chengyang Hu,Yuduo Chen,Lizhuang Ma

Task: Develop BEAR, a new video dataset of fine-grained behaviors, and study the impact of input modality on behavior recognition.

Motivation: Existing fine-grained behavior recognition research controls only part of the information to be similar, which makes evaluation unfair and incomplete.

Details

Method: The BEAR dataset is built with fine-grained behavior protocols based on environment and action, and multi-modal experiments are conducted. Result: The experiments provide insight into how input modality affects behavior recognition. Conclusion: The BEAR dataset and the experimental results are valuable for further research on behavior recognition. Abstract: Behavior recognition is an important task in video representation learning. An essential aspect pertains to effective feature learning conducive to behavior recognition. Recently, researchers have started to study fine-grained behavior recognition, which provides similar behaviors and encourages models to attend to finer details of behaviors and learn features that distinguish them. However, previous fine-grained behavior datasets controlled only part of the information to be similar, leading to an unfair and incomplete evaluation of existing works. In this work, we develop a new video fine-grained behavior dataset, named BEAR, which provides fine-grained (i.e., similar) behaviors that uniquely focus on two primary factors defining behavior: Environment and Action. It includes two fine-grained behavior protocols, Fine-grained Behavior with Similar Environments and Fine-grained Behavior with Similar Actions, as well as multiple sub-protocols as different scenarios. Furthermore, with this new dataset, we conduct multiple experiments with different behavior recognition models. Our research primarily explores the impact of input modality, a critical element in studying the environmental and action-based aspects of behavior recognition. Our experimental results yield intriguing insights that have substantial implications for further research endeavors.

PVLens: Enhancing Pharmacovigilance Through Automated Label Extraction

Jeffery L Painter,Gregory E Powell,Andrew Bate

Task: Develop PVLens, an automated system that extracts labeled safety information from FDA Structured Product Labels (SPLs) and maps it to MedDRA terms.

Motivation: Existing drug safety reference databases such as SIDER are outdated and static, and cannot meet the needs of real-time pharmacovigilance.

Details

Method: PVLens combines automation with expert oversight, using a web-based review tool for extraction and mapping. Result: In validation against 97 drug labels, PVLens achieved an F1 score of 0.882, with recall of 0.983 and precision of 0.799. Conclusion: PVLens offers a scalable, more accurate, and continuously updated alternative that improves the accuracy and timeliness of real-time pharmacovigilance. Abstract: Reliable drug safety reference databases are essential for pharmacovigilance, yet existing resources like SIDER are outdated and static. We introduce PVLens, an automated system that extracts labeled safety information from FDA Structured Product Labels (SPLs) and maps terms to MedDRA. PVLens integrates automation with expert oversight through a web-based review tool. In validation against 97 drug labels, PVLens achieved an F1 score of 0.882, with high recall (0.983) and moderate precision (0.799). By offering a scalable, more accurate and continuously updated alternative to SIDER, PVLens enhances real-time pharmacovigilance with improved accuracy and contemporaneous insights.
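
The three reported figures can be checked against each other, since F1 is the harmonic mean of precision and recall. The short check below uses only the numbers quoted in the abstract; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract.
F1 = f1_score(precision=0.799, recall=0.983)
```

The computed value agrees with the reported 0.882 to three decimal places, so the reported scores are internally consistent.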

Synthetic-to-Real Self-supervised Robust Depth Estimation via Learning with Motion and Structure Priors

Weilong Yan,Ming Li,Haipeng Li,Shuwei Shao,Robby T. Tan

Task: Propose a synthetic-to-real robust depth estimation framework for self-supervised monocular depth estimation in diverse outdoor conditions.

Motivation: Self-supervised depth estimation in diverse outdoor conditions (such as daytime, rain, and nighttime) is challenging because universal representations are hard to learn and labeled real-world adverse-condition data is lacking.

Details

Method: Motion and structure priors are incorporated across two stages, synthetic adaptation and real adaptation, with a consistency-reweighting strategy and a depth distribution regularization. Result: On the nuScenes and Robotcar datasets, AbsRel and RMSE improve by 7.5% and 4.3% on average, and the method outperforms previous approaches in zero-shot evaluation on DrivingStereo. Conclusion: The framework achieves robust depth estimation in diverse conditions and significantly outperforms existing methods. Abstract: Self-supervised depth estimation from monocular cameras in diverse outdoor conditions, such as daytime, rain, and nighttime, is challenging due to the difficulty of learning universal representations and the severe lack of labeled real-world adverse data. Previous methods either rely on synthetic inputs and pseudo-depth labels or directly apply daytime strategies to adverse conditions, resulting in suboptimal results. In this paper, we present the first synthetic-to-real robust depth estimation framework, incorporating motion and structure priors to capture real-world knowledge effectively. In the synthetic adaptation, we transfer motion-structure knowledge inside cost volumes for better robust representation, using a frozen daytime model to train a depth estimator in synthetic adverse conditions. In the innovative real adaptation, which aims to close synthetic-real gaps, models trained earlier identify the weather-insensitive regions with a designed consistency-reweighting strategy to emphasize valid pseudo-labels. We introduce a new regularization by gathering explicit depth distributions to constrain the model when facing real-world data. Experiments show that our method outperforms the state-of-the-art across diverse conditions in multi-frame and single-frame evaluations. We achieve improvements of 7.5% and 4.3% in AbsRel and RMSE on average for nuScenes and Robotcar datasets (daytime, nighttime, rain). In zero-shot evaluation of DrivingStereo (rain, fog), our method generalizes better than the previous ones.
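
For readers unfamiliar with the two reported metrics, the sketch below gives their standard definitions as commonly used in depth estimation: AbsRel is the mean of |prediction - ground truth| / ground truth, and RMSE is the root of the mean squared error. The depth values are made up, and the paper's evaluation protocol (depth caps, masking, scaling) is not described in the abstract and is not modeled here.

```python
import math


def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def rmse(pred, gt):
    """Root mean squared error: sqrt(mean((pred - gt)^2))."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


# Made-up depths in metres for three pixels.
GT = [2.0, 4.0, 10.0]
PRED = [2.2, 3.6, 10.0]
ABS_REL = abs_rel(PRED, GT)
RMSE = rmse(PRED, GT)
```

Lower is better for both metrics; AbsRel normalizes by the true depth, so it weights nearby errors more heavily than RMSE does.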

Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging

Han Wu,Yuxuan Yao,Shuqi Liu,Zehua Liu,Xiaojin Fu,Xiongwei Han,Xing Li,Hui-Ling Zhen,Tao Zhong,Mingxuan Yuan

Task: Study how model merging can achieve Long-to-Short (L2S) reasoning, balancing reasoning depth with efficiency.

Motivation: The transition of large language models (LLMs) from System 1 to System 2 reasoning improves complex task handling, but reduces efficiency and tends to produce redundant reasoning steps.

Details

Method: Model merging methods (task-vector-based, SVD-based, and activation-informed merging) are used to integrate the quick thinking of System 1 models with the methodical reasoning of System 2 models. Result: Model merging can reduce average response length by up to 55% while preserving or improving baseline performance, and merging efficacy correlates strongly with model scale. Conclusion: Model merging is an efficient and effective paradigm for L2S reasoning that addresses overthinking while maintaining the robustness of System 2 reasoning. Abstract: The transition from System 1 to System 2 reasoning in large language models (LLMs) has marked significant advancements in handling complex tasks through deliberate, iterative thinking. However, this progress often comes at the cost of efficiency, as models tend to overthink, generating redundant reasoning steps without proportional improvements in output quality. Long-to-Short (L2S) reasoning has emerged as a promising solution to this challenge, aiming to balance reasoning depth with practical efficiency. While existing approaches, such as supervised fine-tuning (SFT), reinforcement learning (RL), and prompt engineering, have shown potential, they are either computationally expensive or unstable. Model merging, on the other hand, offers a cost-effective and robust alternative by integrating the quick-thinking capabilities of System 1 models with the methodical reasoning of System 2 models. In this work, we present a comprehensive empirical study on model merging for L2S reasoning, exploring diverse methodologies, including task-vector-based, SVD-based, and activation-informed merging. Our experiments reveal that model merging can reduce average response length by up to 55% while preserving or even improving baseline performance. We also identify a strong correlation between model scale and merging efficacy with extensive evaluations on 1.5B/7B/14B/32B models. Furthermore, we investigate the merged model's ability to self-critique and self-correct, as well as its adaptive response length based on task complexity. Our findings highlight model merging as a highly efficient and effective paradigm for L2S reasoning, offering a practical solution to the overthinking problem while maintaining the robustness of System 2 reasoning. This work can be found on GitHub at https://github.com/hahahawu/Long-to-Short-via-Model-Merging.
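
To make the task-vector family concrete, the sketch below shows plain task arithmetic: each fine-tuned model contributes its difference from a shared base, scaled and added back to the base. This is a generic illustration rather than the specific variants the paper evaluates; the parameter name, the toy weights, and the 0.5 scales are hypothetical.

```python
def merge_task_vectors(base, models, scales):
    """Return base + sum_k scales[k] * (models[k] - base), per parameter.

    Every model is a dict mapping a parameter name to a flat list of
    floats, and all models share the same names and shapes as `base`.
    """
    merged = {}
    for name, base_w in base.items():
        merged[name] = list(base_w)
        for model, scale in zip(models, scales):
            for i, (w, b) in enumerate(zip(model[name], base_w)):
                merged[name][i] += scale * (w - b)  # add the scaled task vector
    return merged


# Toy weights: a shared base plus hypothetical quick-thinking (FAST)
# and long-reasoning (SLOW) checkpoints fine-tuned from it.
BASE = {"layer.weight": [1.0, 2.0]}
FAST = {"layer.weight": [1.5, 2.0]}
SLOW = {"layer.weight": [1.0, 4.0]}
MERGED = merge_task_vectors(BASE, [FAST, SLOW], [0.5, 0.5])
```

Merging of this kind needs no gradient updates, which is why it is cheap compared with fine-tuning or reinforcement learning.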

Video Motion Graphs

Haiyang Liu,Zhan Xu,Fa-Ting Hong,Hsin-Ping Huang,Yi Zhou,Yang Zhou

Task: Design Video Motion Graphs, a system for generating realistic human motion videos.

Motivation: Generate new videos from a reference video and conditional signals (such as music or motion tags), addressing the shortcomings of existing methods for multi-modal conditioned human motion video generation.

Details

Method: The HMInterp model uses a dual-branch interpolation approach (combining a Motion Diffusion Model with a diffusion-based video frame interpolation model) and condition progressive training that leverages strong and weak identity conditions. Result: Video Motion Graphs outperforms existing generative and retrieval-based methods for multi-modal conditioned human motion video generation. Conclusion: Through its interpolation model and training strategy, the system achieves high video texture quality and accurate motion trajectories. Abstract: We present Video Motion Graphs, a system designed to generate realistic human motion videos. Using a reference video and conditional signals such as music or motion tags, the system synthesizes new videos by first retrieving video clips with gestures matching the conditions and then generating interpolation frames to seamlessly connect clip boundaries. The core of our approach is HMInterp, a robust Video Frame Interpolation (VFI) model that enables seamless interpolation of discontinuous frames, even for complex motion scenarios like dancing. HMInterp i) employs a dual-branch interpolation approach, combining a Motion Diffusion Model for human skeleton motion interpolation with a diffusion-based video frame interpolation model for final frame generation, and ii) adopts condition progressive training to effectively leverage identity strong and weak conditions, such as images and pose. These designs ensure both high video texture quality and accurate motion trajectory. Results show that our Video Motion Graphs outperforms existing generative- and retrieval-based methods for multi-modal conditioned human motion video generation. Project page can be found at https://h-liu1997.github.io/Video-Motion-Graphs/

TN-Eval: Rubric and Evaluation Protocols for Measuring the Quality of Behavioral Therapy Notes

Raj Sanjay Shah,Lei Xu,Qianchu Liu,Jon Burnsky,Drew Bertagnolli,Chaitanya Shivade

Task: Design a comprehensive rubric for evaluating the quality of behavioral therapy notes.

Motivation: Quality standards for behavioral therapy notes remain underdeveloped, even though such notes are important for legal compliance and patient care.

Details

Method: A rubric is designed in collaboration with licensed therapists and applied to a public dataset to compare the quality of therapist-written and LLM-generated notes. Result: Rubric-based evaluation is more reliable than traditional Likert-scale annotation; LLMs come close to human evaluators on completeness and conciseness but struggle with faithfulness; therapist-written notes often lack completeness and conciseness, while LLM-generated notes contain hallucinations; in a blind test, therapists prefer the LLM-generated notes. Conclusion: The proposed rubric is effective, and LLMs show promise for note generation, but faithfulness problems must be addressed. Abstract: Behavioral therapy notes are important for both legal compliance and patient care. Unlike progress notes in physical health, quality standards for behavioral therapy notes remain underdeveloped. To address this gap, we collaborated with licensed therapists to design a comprehensive rubric for evaluating therapy notes across key dimensions: completeness, conciseness, and faithfulness. Further, we extend a public dataset of behavioral health conversations with therapist-written notes and LLM-generated notes, and apply our evaluation framework to measure their quality. We find that: (1) A rubric-based manual evaluation protocol offers more reliable and interpretable results than traditional Likert-scale annotations. (2) LLMs can mimic human evaluators in assessing completeness and conciseness but struggle with faithfulness. (3) Therapist-written notes often lack completeness and conciseness, while LLM-generated notes contain hallucination. Surprisingly, in a blind test, therapists prefer and judge LLM-generated notes to be superior to therapist-written notes.

DINeMo: Learning Neural Mesh Models with no 3D Annotations

Weijie Guo,Guofeng Zhang,Wufei Ma,Alan Yuille

Task: Propose DINeMo, a neural mesh model trained without 3D annotations, for category-level 3D/6D pose estimation.

Motivation: Existing methods depend on 3D annotations and are hard to scale; pseudo-correspondence obtained from visual foundation models is used instead.

Details

Method: A bidirectional pseudo-correspondence generation method that combines local appearance features with global context information. Result: On car datasets, DINeMo outperforms zero- and few-shot methods by a wide margin and narrows the gap with fully-supervised methods by 67.3%. Conclusion: DINeMo scales more effectively and efficiently than supervised learning methods that rely on 3D annotations. Abstract: Category-level 3D/6D pose estimation is a crucial step towards comprehensive 3D scene understanding, which would enable a broad range of applications in robotics and embodied AI. Recent works explored neural mesh models that approach a range of 2D and 3D tasks from an analysis-by-synthesis perspective. Despite the largely enhanced robustness to partial occlusion and domain shifts, these methods depended heavily on 3D annotations for part-contrastive learning, which confines them to a narrow set of categories and hinders efficient scaling. In this work, we present DINeMo, a novel neural mesh model that is trained with no 3D annotations by leveraging pseudo-correspondence obtained from large visual foundation models. We adopt a bidirectional pseudo-correspondence generation method, which produces pseudo-correspondence using both local appearance features and global context information. Experimental results on car datasets demonstrate that our DINeMo outperforms previous zero- and few-shot 3D pose estimation by a wide margin, narrowing the gap with fully-supervised methods by 67.3%. Our DINeMo also scales effectively and efficiently when incorporating more unlabeled images during training, which demonstrates its advantages over supervised learning methods that rely on 3D annotations. Our project page is available at https://analysis-by-synthesis.github.io/DINeMo/.

UniEDU: A Unified Language and Vision Assistant for Education Applications

Zhendong Chu,Jian Xie,Shen Wang,Zichao Wang,Qingsong Wen

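Task: Propose UniEDU, a unified language and vision assistant for multiple education applications, including knowledge recommendation, knowledge tracing, time cost prediction, and user answer prediction.

Motivation: Education materials often combine several modalities (such as text and images), and existing models struggle to fully understand the nuanced information in them, so a unified solution is needed.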

Details

Method: A single unified model, UniEDU, handles multiple educational tasks at once and is optimized for computational efficiency. Result: UniEDU performs well across multiple tasks, with an approximately 300% increase in efficiency and performance close to fully fine-tuned models. Conclusion: UniEDU is a significant step toward versatile AI systems tailored to educational needs and is suited to real-world deployment. Abstract: Education materials for K-12 students often consist of multiple modalities, such as text and images, posing challenges for models to fully understand nuanced information in these materials. In this paper, we propose a unified language and vision assistant UniEDU designed for various educational applications, including knowledge recommendation, knowledge tracing, time cost prediction, and user answer prediction, all within a single model. Unlike conventional task-specific models, UniEDU offers a unified solution that excels across multiple educational tasks while maintaining strong generalization capabilities. Its adaptability makes it well-suited for real-world deployment in diverse learning environments. Furthermore, UniEDU is optimized for industry-scale deployment by significantly reducing computational overhead, achieving approximately a 300% increase in efficiency, while maintaining competitive performance with minimal degradation compared to fully fine-tuned models. This work represents a significant step toward creating versatile AI systems tailored to the evolving demands of education.

TC-GS: Tri-plane based compression for 3D Gaussian Splatting

Taorui Wang,Zitong Yu,Yong Xu

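Task: Propose a compression method for 3D Gaussian Splatting based on tri-plane encoding and KNN decoding.

Motivation: The large data volume and unorganized structure of 3D Gaussian Splatting make compression difficult.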

Details

Method: Gaussian attributes are encoded with a tri-plane, combined with KNN decoding and a position-sensitive decoder, and an adaptive wavelet loss is introduced. Result: Results on multiple datasets are comparable to or better than those of the best existing compression methods. Conclusion: The proposed method effectively addresses the compression of 3D Gaussian Splatting and improves performance. Abstract: Recently, 3D Gaussian Splatting (3DGS) has emerged as a prominent framework for novel view synthesis, providing high fidelity and rapid rendering speed. However, the substantial data volume of 3DGS and its attributes impede its practical utility, requiring compression techniques for reducing memory cost. Nevertheless, the unorganized shape of 3DGS leads to difficulties in compression. To formulate unstructured attributes into normative distribution, we propose a well-structured tri-plane to encode Gaussian attributes, leveraging the distribution of attributes for compression. To exploit the correlations among adjacent Gaussians, K-Nearest Neighbors (KNN) is used when decoding Gaussian distribution from the tri-plane. We also introduce Gaussian position information as a prior of the position-sensitive decoder. Additionally, we incorporate an adaptive wavelet loss, aiming to focus on the high-frequency details as iterations increase. Our approach has achieved results that are comparable to or surpass those of SOTA 3D Gaussian Splatting compression work in extensive experiments across multiple datasets. The code is released at https://github.com/timwang2001/TC-GS.

From Annotation to Adaptation: Metrics, Synthetic Data, and Aspect Extraction for Aspect-Based Sentiment Analysis with Large Language Models

Nikita Neveditsin,Pawan Lingras,Vijay Mago

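Task: Evaluate the performance of Large Language Models (LLMs) in Aspect-Based Sentiment Analysis (ABSA), particularly implicit aspect extraction.

Motivation: Study the ability of LLMs to extract aspect-polarity pairs in a novel domain (a sports feedback dataset) and propose a metric for evaluating aspect extraction with generative models.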

Details

Method: Open-weight LLMs are evaluated on a synthetic sports feedback dataset, and a new evaluation metric is proposed. Result: LLMs show both potential and limitations in the ABSA task. Conclusion: LLMs are promising for ABSA but need further improvement to overcome their limitations. Abstract: This study examines the performance of Large Language Models (LLMs) in Aspect-Based Sentiment Analysis (ABSA), with a focus on implicit aspect extraction in a novel domain. Using a synthetic sports feedback dataset, we evaluate open-weight LLMs' ability to extract aspect-polarity pairs and propose a metric to facilitate the evaluation of aspect extraction with generative models. Our findings highlight both the potential and limitations of LLMs in the ABSA task.

TraNCE: Transformative Non-linear Concept Explainer for CNNs

Ugochukwu Ejike Akpudo,Yongsheng Gao,Jun Zhou,Andrew Lewis

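Task: Propose a novel non-linear concept explanation method (TraNCE) to address the explainability of convolutional neural networks (CNNs).

Motivation: Existing concept-based explainability methods assume that image activations are linearly reconstructable and so fail to capture complex relationships within the activations; evaluating global explanations by fidelity alone is also limited.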

Details

Method: Concepts are discovered automatically with variational autoencoders (VAEs), the Bessel function is used to visualize prototypical image pixels, and a new Faith score metric is proposed. Result: TraNCE captures complex relationships within the activations, provides more comprehensive explanations, and uses the Faith score to jointly assess explainer faithfulness and consistency. Conclusion: TraNCE offers a more effective non-linear approach to CNN explainability and addresses the limitations of existing methods. Abstract: Convolutional neural networks (CNNs) have succeeded remarkably in various computer vision tasks. However, they are not intrinsically explainable. While the feature-level understanding of CNNs reveals where the models looked, concept-based explainability methods provide insights into what the models saw. However, their assumption of linear reconstructability of image activations fails to capture the intricate relationships within these activations. Their Fidelity-only approach to evaluating global explanations also presents a new concern. For the first time, we address these limitations with the novel Transformative Nonlinear Concept Explainer (TraNCE) for CNNs. Unlike linear reconstruction assumptions made by existing methods, TraNCE captures the intricate relationships within the activations. This study presents three original contributions to the CNN explainability literature: (i) An automatic concept discovery mechanism based on variational autoencoders (VAEs). This transformative concept discovery process enhances the identification of meaningful concepts from image activations. (ii) A visualization module that leverages the Bessel function to create a smooth transition between prototypical image pixels, revealing not only what the CNN saw but also what the CNN avoided, thereby mitigating the challenges of concept duplication as documented in previous works. (iii) A new metric, the Faith score, that integrates both Coherence and Fidelity for a comprehensive evaluation of explainer faithfulness and consistency.

Ontology-based Semantic Similarity Measures for Clustering Medical Concepts in Drug Safety

Jeffery L Painter,François Haguinet,Gregory E Powell,Andrew Bate

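Task: Evaluate six ontology-based semantic similarity measures for clustering MedDRA Preferred Terms (PTs) in drug safety data.

Motivation: Semantic similarity measures are widely used in biomedical research but remain underutilized in pharmacovigilance.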

Details

Method: Using the Unified Medical Language System (UMLS), each method is assessed on its ability to cluster PTs around medically meaningful centroids; a high-throughput framework supporting large-scale similarity computation was also developed. Result: Path-based methods perform moderately (F1 scores of 0.36 for WUPALMER and 0.28 for LCH), while intrinsic information content (IC)-based measures such as INTRINSIC-LIN and SOKAL cluster more accurately (F1 score of 0.403). Conclusion: IC-based semantic similarity measures show promise for enhancing pharmacovigilance workflows by improving early signal detection and reducing manual review. Abstract: Semantic similarity measures (SSMs) are widely used in biomedical research but remain underutilized in pharmacovigilance. This study evaluates six ontology-based SSMs for clustering MedDRA Preferred Terms (PTs) in drug safety data. Using the Unified Medical Language System (UMLS), we assess each method's ability to group PTs around medically meaningful centroids. A high-throughput framework was developed with a Java API and Python and R interfaces to support large-scale similarity computations. Results show that while path-based methods perform moderately with F1 scores of 0.36 for WUPALMER and 0.28 for LCH, intrinsic information content (IC)-based measures, especially INTRINSIC-LIN and SOKAL, consistently yield better clustering accuracy (F1 score of 0.403). Validated against expert review and standard MedDRA queries (SMQs), our findings highlight the promise of IC-based SSMs in enhancing pharmacovigilance workflows by improving early signal detection and reducing manual review.
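To make the difference between a path-based and an IC-based measure concrete, here is a minimal sketch on a toy taxonomy. The taxonomy is hypothetical and is not MedDRA or UMLS content. The Wu-Palmer and Lin formulas are the standard textbook definitions; the intrinsic IC used here is one common descendant-count formulation (Seco-style) and is an assumption, since the abstract does not say which intrinsic IC the study's INTRINSIC-LIN uses.

```python
import math

# Hypothetical toy taxonomy (child -> parent); NOT real MedDRA/UMLS content.
PARENT = {
    "disorder": None,
    "cardiac disorder": "disorder",
    "arrhythmia": "cardiac disorder",
    "tachycardia": "arrhythmia",
    "bradycardia": "arrhythmia",
    "myocardial infarction": "cardiac disorder",
    "skin disorder": "disorder",
    "rash": "skin disorder",
}


def path_to_root(concept):
    """Return the concept followed by all of its ancestors, up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def lowest_common_subsumer(a, b):
    """Deepest concept that is an ancestor of (or equal to) both a and b."""
    ancestors_b = set(path_to_root(b))
    for node in path_to_root(a):
        if node in ancestors_b:
            return node
    raise ValueError("concepts do not share a root")


def wu_palmer(a, b):
    """Path-based: 2 * depth(LCS) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def intrinsic_ic(concept):
    """Assumed intrinsic IC: 1 - log(descendants + 1) / log(total concepts)."""
    descendants = sum(
        1 for c in PARENT if c != concept and concept in path_to_root(c)
    )
    return 1 - math.log(descendants + 1) / math.log(len(PARENT))


def lin(a, b):
    """IC-based: 2 * IC(LCS) / (IC(a) + IC(b))."""
    if a == b:
        return 1.0
    lcs = lowest_common_subsumer(a, b)
    return 2 * intrinsic_ic(lcs) / (intrinsic_ic(a) + intrinsic_ic(b))


print(wu_palmer("tachycardia", "bradycardia"))  # 0.75
print(lin("tachycardia", "rash"))               # 0.0 (only the root is shared)
```

In this toy example, Wu-Palmer still gives the unrelated pair ("tachycardia", "rash") a non-zero score because both terms sit deep in the tree, whereas the IC-based score drops to zero when the only shared ancestor is the uninformative root.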

Leveraging 3D Geometric Priors in 2D Rotation Symmetry Detection

Ahyun Seo,Minsu Cho

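Task: Detect the rotation centers and supporting vertices of rotationally symmetric objects.

Motivation: Traditional methods rely on hand-crafted feature matching, while recent segmentation models based on convolutional neural networks struggle with 3D geometric consistency.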

Details

Method: A model that predicts rotation centers and vertices directly in 3D space and projects them back to 2D while preserving structural integrity, with a vertex reconstruction stage that enforces 3D geometric priors. Result: Superior rotation axis detection on the DENDI dataset, with ablation studies validating the effect of the 3D priors. Conclusion: By incorporating 3D geometric priors, the proposed model improves the robustness and accuracy of rotation symmetry detection. Abstract: Symmetry plays a vital role in understanding structural patterns, aiding object recognition and scene interpretation. This paper focuses on rotation symmetry, where objects remain unchanged when rotated around a central axis, requiring detection of rotation centers and supporting vertices. Traditional methods relied on hand-crafted feature matching, while recent segmentation models based on convolutional neural networks detect rotation centers but struggle with 3D geometric consistency due to viewpoint distortions. To overcome this, we propose a model that directly predicts rotation centers and vertices in 3D space and projects the results back to 2D while preserving structural integrity. By incorporating a vertex reconstruction stage enforcing 3D geometric priors -- such as equal side lengths and interior angles -- our model enhances robustness and accuracy. Experiments on the DENDI dataset show superior performance in rotation axis detection and validate the impact of 3D priors through ablation studies.
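The point about viewpoint distortion can be illustrated with a small geometric sketch: a regular polygon has equal side lengths in 3D, but its pinhole projection generally does not. The code below is only an illustration of that geometry, not the paper's model; the function names, the tilt-about-the-x-axis parameterization, and the unit focal length are all assumptions made for this example.

```python
import math


def regular_polygon_3d(center, radius, n, tilt):
    """Vertices of a regular n-gon whose plane is tilted about the x-axis."""
    cx, cy, cz = center
    vertices = []
    for k in range(n):
        angle = 2 * math.pi * k / n
        x, y = radius * math.cos(angle), radius * math.sin(angle)
        # Rotating (x, y, 0) about the x-axis by `tilt` preserves distances.
        vertices.append((cx + x, cy + y * math.cos(tilt), cz + y * math.sin(tilt)))
    return vertices


def project(point, focal=1.0):
    """Pinhole projection of a 3D point (camera at origin, looking along +z)."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def side_lengths(points):
    """Lengths of consecutive edges, closing the polygon at the end."""
    return [math.dist(points[i], points[(i + 1) % len(points)])
            for i in range(len(points))]


square_3d = regular_polygon_3d(center=(0.0, 0.0, 5.0), radius=1.0, n=4, tilt=1.0)
square_2d = [project(p) for p in square_3d]

sides_3d = side_lengths(square_3d)  # all equal: the prior holds in 3D
sides_2d = side_lengths(square_2d)  # unequal: perspective breaks the symmetry
print(max(sides_3d) - min(sides_3d), max(sides_2d) - min(sides_2d))
```

This is why a constraint such as "equal side lengths" is natural to state on 3D vertices and awkward to state directly on 2D image coordinates.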

Beyond Believability: Accurate Human Behavior Simulation with Fine-Tuned LLMs

Yuxuan Lu,Jing Huang,Yan Han,Bennet Bei,Yaochen Xie,Dakuo Wang,Jessie Wang,Qi He

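Task: Evaluate and improve the objective accuracy of LLMs in the web action generation task.

Motivation: Existing research focuses on the believability of LLM-simulated human behavior but lacks an evaluation of its objective accuracy.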

Details

Method: LLMs are fine-tuned on a large-scale real-world dataset, with synthesized reasoning traces added to training. Result: Fine-tuning and reasoning traces substantially improve the action generation ability of LLMs. Conclusion: The study provides a new benchmark for LLM behavior simulation and demonstrates the value of real-world data and reasoning augmentation. Abstract: Recent research shows that LLMs can simulate "believable" human behaviors to power LLM agents via prompt-only methods. In this work, we focus on evaluating and improving LLMs' objective "accuracy" rather than the subjective "believability" in the web action generation task, leveraging a large-scale, real-world dataset collected from online shopping human actions. We present the first comprehensive quantitative evaluation of state-of-the-art LLMs (e.g., DeepSeek-R1, Llama, and Claude) on the task of web action generation. Our results show that fine-tuning LLMs on real-world behavioral data substantially improves their ability to generate actions compared to prompt-only methods. Furthermore, incorporating synthesized reasoning traces into model training leads to additional performance gains, demonstrating the value of explicit rationale in behavior modeling. This work establishes a new benchmark for evaluating LLMs in behavior simulation and offers actionable insights into how real-world action data and reasoning augmentation can enhance the fidelity of LLM agents.

Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models

Prin Phunyaphibarn,Phillip Y. Lee,Jaihoon Kim,Minhyuk Sung

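Task: Improve conditional generation in diffusion models trained with Classifier-Free Guidance (CFG).

Motivation: Jointly learning the unconditional noise prediction with limited training bandwidth yields poor unconditional priors, which in turn degrade the quality of conditional generation.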

Details

Method: The unconditional noise in CFG is replaced with the unconditional noise predicted by the base model, and other diffusion models are shown to work for this replacement as well. Result: Experiments show that the method significantly improves conditional generation quality across a range of CFG-based models. Conclusion: Better unconditional noise predictions effectively improve the generation performance of conditional diffusion models. Abstract: Classifier-Free Guidance (CFG) is a fundamental technique in training conditional diffusion models. The common practice for CFG-based training is to use a single network to learn both conditional and unconditional noise prediction, with a small dropout rate for conditioning. However, we observe that the joint learning of unconditional noise with limited bandwidth in training results in poor priors for the unconditional case. More importantly, these poor unconditional noise predictions become a serious reason for degrading the quality of conditional generation. Inspired by the fact that most CFG-based conditional models are trained by fine-tuning a base model with better unconditional generation, we first show that simply replacing the unconditional noise in CFG with that predicted by the base model can significantly improve conditional generation. Furthermore, we show that a diffusion model other than the one the fine-tuned model was trained on can be used for unconditional noise replacement. We experimentally verify our claim with a range of CFG-based conditional models for both image and video generation, including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix.
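The replacement described in the abstract can be sketched with the usual CFG combination rule, eps = eps_uncond + w * (eps_cond - eps_uncond). The sketch below uses plain Python lists in place of tensors. The two "models" are hypothetical stand-in functions invented for this example, and the guidance-scale convention is one of several in use, so treat this as an illustration of the idea rather than the authors' implementation.

```python
def cfg_combine(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Hypothetical stand-ins for noise predictors (real ones are neural networks).
def finetuned_model(x, cond=None):
    # The fine-tuned network serves both branches; cond=None is the
    # unconditional branch learned via conditioning dropout.
    offset = 0.5 if cond is not None else 0.3
    return [v * 0.9 + offset for v in x]


def base_model(x):
    # The base model the fine-tuned model started from (unconditional only).
    return [v * 0.9 for v in x]


def guided_noise(x, cond, scale, use_base_uncond):
    eps_cond = finetuned_model(x, cond)
    # Standard CFG takes the unconditional branch from the fine-tuned model;
    # the variant in the abstract swaps in the base model's prediction.
    eps_uncond = base_model(x) if use_base_uncond else finetuned_model(x)
    return cfg_combine(eps_cond, eps_uncond, scale)


x_t = [0.0, 1.0]
standard = guided_noise(x_t, cond="a chair", scale=2.0, use_base_uncond=False)
replaced = guided_noise(x_t, cond="a chair", scale=2.0, use_base_uncond=True)
```

Only the unconditional term changes between the two calls, which is why the swap needs no retraining of the fine-tuned model.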

ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems

Chenxi Wang,Jizhan Fang,Xiang Chen,Bozhong Tian,Ziwen Xu,Huajun Chen,Ningyu Zhang

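Task: Propose the use of knowledge editing, together with the ADS-Edit dataset, to address the problems of applying Large Multimodal Models directly to Autonomous Driving Systems.

Motivation: Large Multimodal Models in Autonomous Driving Systems face challenges such as misunderstanding of traffic knowledge, complex road conditions, and diverse vehicle states.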

Details

Method: Knowledge editing is used to make targeted modifications to model behavior without full retraining, and ADS-Edit, a multimodal knowledge editing dataset designed for autonomous driving, is constructed. Result: Comprehensive experiments yield several interesting conclusions and validate the approach. Conclusion: The work is expected to advance the application of knowledge editing in autonomous driving. Abstract: Recent advancements in Large Multimodal Models (LMMs) have shown promise in Autonomous Driving Systems (ADS). However, their direct application to ADS is hindered by challenges such as misunderstanding of traffic knowledge, complex road conditions, and diverse vehicle states. To address these challenges, we propose the use of Knowledge Editing, which enables targeted modifications to a model's behavior without the need for full retraining. Meanwhile, we introduce ADS-Edit, a multimodal knowledge editing dataset specifically designed for ADS, which includes various real-world scenarios, multiple data types, and comprehensive evaluation metrics. We conduct comprehensive experiments and derive several interesting conclusions. We hope that our work will contribute to the further advancement of knowledge editing applications in the field of autonomous driving. Code and data are available at https://github.com/zjunlp/EasyEdit.

Incremental Object Keypoint Learning

Mingfu Liang,Jiahuan Zhou,Xu Zou,Ying Wu

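Task: Explore a new keypoint learning paradigm, Incremental object Keypoint Learning (IKL), to address the inability of existing models to detect new, previously undefined keypoints.

Motivation: The existing supervised learning paradigm depends on pre-defined keypoint annotations and cannot accommodate new keypoints, which limits the generalizability and transferability of the models.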

Details

Method: A two-stage learning scheme: (1) a Knowledge Association stage, in which KA-Net automatically associates old keypoints with new ones; (2) a Mutual Promotion stage, which jointly optimizes the estimation of old and new keypoints with a spatial distillation loss. Result: Experiments show that the method effectively mitigates forgetting of old keypoints, can even improve their estimation, and performs well in the low-shot data regime. Conclusion: By associating new and old keypoints, the IKL paradigm transfers knowledge in a way that not only avoids forgetting but can also improve performance, with the benefits of labeling efficiency and generality. Abstract: Existing progress in object keypoint estimation primarily benefits from the conventional supervised learning paradigm based on numerous data labeled with pre-defined keypoints. However, these well-trained models can hardly detect the undefined new keypoints at test time, which largely hinders their feasibility for diverse downstream tasks. To handle this, various solutions are explored but still suffer from either limited generalizability or transferability. Therefore, in this paper, we explore a novel keypoint learning paradigm in which we only annotate new keypoints in the new data and incrementally train the model, without retaining any old data, called Incremental object Keypoint Learning (IKL). A two-stage learning scheme as a novel baseline tailored to IKL is developed. In the first Knowledge Association stage, given the data labeled with only new keypoints, an auxiliary KA-Net is trained to automatically associate the old keypoints to these new ones based on their spatial and intrinsic anatomical relations. In the second Mutual Promotion stage, based on a keypoint-oriented spatial distillation loss, we jointly leverage the auxiliary KA-Net and the old model for knowledge consolidation to mutually promote the estimation of all old and new keypoints. Owing to the investigation of the correlations between new and old keypoints, our proposed method can not only effectively mitigate the catastrophic forgetting of old keypoints, but may even further improve the estimation of the old ones and achieve a positive transfer beyond anti-forgetting. Such an observation has been solidly verified by extensive experiments on different keypoint datasets, where our method exhibits superiority in alleviating the forgetting issue and boosting performance while enjoying labeling efficiency even under the low-shot data regime.

MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search

Yunhai Hu,Yilun Zhao,Chen Zhao,Arman Cohan

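Task: Improve the reasoning ability of small language models on knowledge-intensive tasks by combining retrieval-augmented generation (RAG) with Monte Carlo Tree Search (MCTS).

Motivation: Standard RAG methods separate retrieval from reasoning and so integrate knowledge suboptimally, while conventional MCTS reasoning relies solely on internal model knowledge without external facts.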

Details

Method: An iterative decision-making process that dynamically integrates retrieval and reasoning, combining structured reasoning with adaptive retrieval. Result: Strong results on multiple datasets, with small models reaching performance comparable to GPT-4o, fewer hallucinations, and better factual accuracy and consistency. Conclusion: MCTS-RAG sets a new standard for reasoning in small-scale models. Abstract: We introduce MCTS-RAG, a novel approach that enhances the reasoning capabilities of small language models on knowledge-intensive tasks by leveraging retrieval-augmented generation (RAG) to provide relevant context and Monte Carlo Tree Search (MCTS) to refine reasoning paths. MCTS-RAG dynamically integrates retrieval and reasoning through an iterative decision-making process. Unlike standard RAG methods, which typically retrieve information independently from reasoning and thus integrate knowledge suboptimally, or conventional MCTS reasoning, which depends solely on internal model knowledge without external facts, MCTS-RAG combines structured reasoning with adaptive retrieval. This integrated approach enhances decision-making, reduces hallucinations, and ensures improved factual accuracy and response consistency. The experimental results on multiple reasoning and knowledge-intensive datasets (i.e., ComplexWebQA, GPQA, and FoolMeTwice) show that our method enables small-scale LMs to achieve performance comparable to frontier LLMs like GPT-4o by effectively scaling inference-time compute, setting a new standard for reasoning in small-scale models.
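For readers unfamiliar with MCTS, the selection step is commonly driven by the UCT rule, which trades off a child's average reward against how rarely it has been visited. The sketch below shows that generic rule only. The abstract does not state which selection rule MCTS-RAG uses, and the action names and statistics here are hypothetical, chosen to suggest how a retrieval action could sit alongside reasoning actions in the same tree.

```python
import math


def uct_score(total_reward, visits, parent_visits, exploration=1.41):
    """Generic UCT: mean reward plus an exploration bonus for rare children."""
    if visits == 0:
        return math.inf  # always try an unvisited action first
    mean = total_reward / visits
    bonus = exploration * math.sqrt(math.log(parent_visits) / visits)
    return mean + bonus


def select_action(children):
    """children maps action name -> (total_reward, visits)."""
    parent_visits = sum(visits for _, visits in children.values())
    return max(
        children,
        key=lambda name: uct_score(*children[name], parent_visits),
    )


# Hypothetical statistics at one node of the search tree.
node = {
    "direct_answer": (3.0, 5),
    "retrieve_then_reason": (4.0, 5),
    "decompose_question": (0.0, 0),
}
print(select_action(node))  # decompose_question (unvisited, so explored first)
```

Once every action has been visited, the rule favors the branch with the higher average reward unless another branch has been tried much less often.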

LogicQA: Logical Anomaly Detection with Vision Language Model Generated Questions

Yejin Kwon,Daeun Moon,Youngje Oh,Hyunsoo Yoon

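Task: Detect and explain logical anomalies with the LogicQA framework.

Motivation: Logical anomalies may look visually normal while violating predefined constraints, so a training-free and annotation-free method is needed to detect them and explain them.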

Details

Method: LogicQA compiles automatically generated questions into a checklist and collects responses to identify violations of logical constraints. Result: An AUROC of 87.6% and an F1-max of 87.0% on the MVTec LOCO AD benchmark, plus strong performance on semiconductor SEM data. Conclusion: LogicQA is an efficient, training-free, and annotation-free method for logical anomaly detection that is suited to industrial applications. Abstract: Anomaly Detection (AD) focuses on detecting samples that differ from the standard pattern, making it a vital tool in process control. Logical anomalies may appear visually normal yet violate predefined constraints on object presence, arrangement, or quantity, so detecting them depends on reasoning and explainability. We introduce LogicQA, a framework that enhances AD by providing industrial operators with explanations for logical anomalies. LogicQA compiles automatically generated questions into a checklist and collects responses to identify violations of logical constraints. LogicQA is training-free, annotation-free, and operates in a few-shot setting. We achieve state-of-the-art (SOTA) Logical AD performance on the public benchmark MVTec LOCO AD, with an AUROC of 87.6 percent and an F1-max of 87.0 percent, along with explanations of the anomalies. Our approach has also shown outstanding performance on semiconductor SEM corporate data, further validating its effectiveness in industrial applications.

Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark

Sondos Mahmoud Bsharat,Mukul Ranjan,Aidar Myrzakhan,Jiacheng Liu,Bowei Guo,Shengkun Tang,Zhuang Liu,Yuanzhi Li,Zhiqiang Shen

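Task: Develop Mobile-MMLU, a large-scale benchmark dataset tailored to mobile devices, for evaluating large language models in mobile settings.

Motivation: Current benchmark datasets mainly target server and desktop environments and there is a lack of extensive datasets for mobile scenarios; mobile devices also face strict storage and compute limits, which call for optimized efficiency and prioritized knowledge.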

Details

Method: The Mobile-MMLU dataset is introduced, with 16,186 questions across 80 mobile-related fields, along with a more challenging subset, Mobile-MMLU-Pro. Result: A standardized framework for developing and comparing mobile-optimized LLMs, emphasizing mobile-specific metrics such as inference latency, energy consumption, memory usage, and response quality. Conclusion: The Mobile-MMLU family offers a standardized tool for productivity and decision-making in mobile computing environments and advances the development of mobile-optimized LLMs. Abstract: Rapid advancements in large language models (LLMs) have increased interest in deploying them on mobile devices for on-device AI applications. Mobile users interact differently with LLMs compared to desktop users, creating unique expectations and data biases. Current benchmark datasets primarily target server and desktop environments, and there is a notable lack of extensive datasets specifically designed for mobile contexts. Additionally, mobile devices face strict limitations in storage and computing resources, constraining model size and capabilities, thus requiring optimized efficiency and prioritized knowledge. To address these challenges, we introduce Mobile-MMLU, a large-scale benchmark dataset tailored for mobile intelligence. It consists of 16,186 questions across 80 mobile-related fields, designed to evaluate LLM performance in realistic mobile scenarios. A challenging subset, Mobile-MMLU-Pro, provides advanced evaluation similar in size to MMLU-Pro but significantly more difficult than our standard full set. Both benchmarks use multiple-choice, order-invariant questions focused on practical mobile interactions, such as recipe suggestions, travel planning, and essential daily tasks. The dataset emphasizes critical mobile-specific metrics like inference latency, energy consumption, memory usage, and response quality, offering comprehensive insights into model performance under mobile constraints. Moreover, it prioritizes privacy and adaptability, assessing models' ability to perform on-device processing, maintain user privacy, and adapt to personalized usage patterns. The Mobile-MMLU family offers a standardized framework for developing and comparing mobile-optimized LLMs, enabling advancements in productivity and decision-making within mobile computing environments. Our code and data are available at: https://github.com/VILA-Lab/Mobile-MMLU.

Mamba-3D as Masked Autoencoders for Accurate and Data-Efficient Analysis of Medical Ultrasound Videos

Jiaheng Zhou,Yanfeng Zhou,Wei Fang,Yuxing Tang,Le Lu,Ge Yang

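Task: Develop a data-efficient Vision Mamba network (E-ViM$^3$) for automated analysis of ultrasound videos.

Motivation: Labeled ultrasound video data is scarce and video analysis is inherently challenging, which has impeded progress on related methods.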

Details

Method: The E-ViM$^3$ network preserves the 3D structure of video data and enhances long-range dependencies and inductive biases; Enclosure Global Tokens (EGT) capture global features; masked video modeling with the Spatial-Temporal Chained (STC) masking strategy is used for self-supervised pre-training. Result: State-of-the-art performance on four datasets of varying sizes (EchoNet-Dynamic, CAMUS, MICCAI-BUV, WHBUS), and competitive performance with limited labels. Conclusion: E-ViM$^3$ has potential for clinical applications, especially when data is scarce. Abstract: Ultrasound videos are an important form of clinical imaging data, and deep learning-based automated analysis can improve diagnostic accuracy and clinical efficiency. However, the scarcity of labeled data and the inherent challenges of video analysis have impeded the advancement of related methods. In this work, we introduce E-ViM$^3$, a data-efficient Vision Mamba network that preserves the 3D structure of video data, enhancing long-range dependencies and inductive biases to better model space-time correlations. With our design of Enclosure Global Tokens (EGT), the model captures and aggregates global features more effectively than competing methods. To further improve data efficiency, we employ masked video modeling for self-supervised pre-training, with the proposed Spatial-Temporal Chained (STC) masking strategy designed to adapt to various video scenarios. Experiments demonstrate that E-ViM$^3$ achieves state-of-the-art performance in two high-level semantic analysis tasks across four datasets of varying sizes: EchoNet-Dynamic, CAMUS, MICCAI-BUV, and WHBUS. Furthermore, our model achieves competitive performance with limited labels, highlighting its potential impact on real-world clinical applications.

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

Han Chen,Zicong Jiang,Zining Zhang,Bingsheng He,Pingyi Luo,Mian Lu,Yuqiang Chen

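Task: Propose LogQuant, a 2-bit quantization technique for the KV cache in large language model (LLM) inference, to save memory while preserving performance.

Motivation: Existing methods either assume that later tokens are more important or predict important tokens from earlier attention patterns, which can lead to performance bottlenecks or frequent mispredictions.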

Details

Method: A log-based filtering mechanism selectively compresses the KV cache across the entire context, achieving better performance at the same or a lower memory footprint. Result: In benchmark tests, throughput improves by 25% and batch size by 60% without increasing memory consumption; on tasks such as Math and Code Completion, accuracy improves by 40% to 200% at the same compression ratio. Conclusion: LogQuant outperforms existing methods in performance and memory efficiency and integrates easily with popular inference frameworks. Abstract: We introduce LogQuant, a groundbreaking 2-bit quantization technique for KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks like Python's transformers library. The implementation is available at https://github.com/Concyclics/LogQuantKV.
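To show what "2-bit" means in practice, here is a minimal sketch of uniform min-max quantization to four levels. This is a generic baseline technique and not LogQuant itself: the abstract does not specify the log-based filtering rule that decides which tokens are compressed, so that part is deliberately left out. The function names are hypothetical.

```python
def quantize_2bit(values):
    """Uniform min-max quantization of a list of floats to codes in {0, 1, 2, 3}.

    Returns (codes, lo, step) so the values can be approximately restored.
    """
    lo, hi = min(values), max(values)
    step = (hi - lo) / 3 if hi > lo else 1.0  # 4 levels -> 3 intervals
    codes = [min(3, max(0, round((v - lo) / step))) for v in values]
    return codes, lo, step


def dequantize_2bit(codes, lo, step):
    """Map 2-bit codes back to approximate float values."""
    return [lo + code * step for code in codes]


# One hypothetical row of a key or value tensor.
row = [-0.82, 0.10, 0.44, 1.30, -0.15, 0.97]
codes, lo, step = quantize_2bit(row)
restored = dequantize_2bit(codes, lo, step)

# Each restored value is within half a quantization step of the original.
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

Storing a 2-bit code instead of a 16-bit float shrinks the quantized entries by a factor of 8, before accounting for the per-group `lo` and `step` that must be kept alongside the codes.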

EGVD: Event-Guided Video Diffusion Model for Physically Realistic Large-Motion Frame Interpolation

Ziran Zhang,Xiaohui Li,Yihao Liu,Yujin Wang,Yueting Chen,Tianfan Xue,Shi Guo

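Task: Propose EGVD, a new framework built on event cameras and pre-trained stable video diffusion models, for video frame interpolation in large-motion scenarios.

Motivation: Existing event-based video frame interpolation methods perform poorly because of limited training data and complex motion patterns.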

Details

Method: A Multi-modal Motion Condition Generator (MMCG) that integrates RGB frames and event signals, a selective fine-tuning strategy, and input-output normalization techniques. Result: Significantly outperforms existing methods on real and simulated datasets, with LPIPS improved by 27.4% on Prophesee and by 24.1% on BSRGB. Conclusion: The EGVD framework effectively handles video frame interpolation under large motion and challenging lighting conditions. Abstract: Video frame interpolation (VFI) in scenarios with large motion remains challenging due to motion ambiguity between frames. While event cameras can capture high temporal resolution motion information, existing event-based VFI methods struggle with limited training data and complex motion patterns. In this paper, we introduce Event-Guided Video Diffusion Model (EGVD), a novel framework that leverages the powerful priors of pre-trained stable video diffusion models alongside the precise temporal information from event cameras. Our approach features a Multi-modal Motion Condition Generator (MMCG) that effectively integrates RGB frames and event signals to guide the diffusion process, producing physically realistic intermediate frames. We employ a selective fine-tuning strategy that preserves spatial modeling capabilities while efficiently incorporating event-guided temporal information. We incorporate input-output normalization techniques inspired by recent advances in diffusion modeling to enhance training stability across varying noise levels. To improve generalization, we construct a comprehensive dataset combining both real and simulated event data across diverse scenarios. Extensive experiments on both real and simulated datasets demonstrate that EGVD significantly outperforms existing methods in handling large motion and challenging lighting conditions, achieving substantial improvements in perceptual quality metrics (27.4% better LPIPS on Prophesee and 24.1% on BSRGB) while maintaining competitive fidelity measures. Code and datasets are available at: https://github.com/OpenImagingLab/EGVD.

Open Deep Search: Democratizing Search with Open-source Reasoning Agents

Salaheddin Alzubi,Creston Brooks,Purva Chiniya,Edoardo Contente,Chiara von Gerlach,Lucas Irwin,Yihan Jiang,Arda Kaz,Windsor Nguyen,Sewoong Oh,Himanshu Tyagi,Pramod Viswanath

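Task: Develop Open Deep Search (ODS) to close the gap between proprietary search AI solutions and their open-source counterparts.

Motivation: There is a significant performance gap between proprietary search AI (such as Perplexity's Sonar Reasoning Pro and OpenAI's GPT-4o Search Preview) and open-source solutions; ODS aims to close it by augmenting the reasoning capabilities of open-source LLMs.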

Details

Method: ODS consists of two components: Open Search Tool (a web search tool that outperforms proprietary counterparts) and Open Reasoning Agent (which completes tasks by orchestrating tool calls). Result: ODS nearly matches or surpasses the existing state-of-the-art baselines on the SimpleQA and FRAMES benchmarks; for example, on FRAMES it improves on GPT-4o Search Preview by 9.7% in accuracy. Conclusion: ODS is a general framework that can seamlessly augment any LLM with search and reasoning capabilities to achieve state-of-the-art performance. Abstract: We introduce Open Deep Search (ODS) to close the increasing gap between the proprietary search AI solutions, such as Perplexity's Sonar Reasoning Pro and OpenAI's GPT-4o Search Preview, and their open-source counterparts. The main innovation introduced in ODS is to augment the reasoning capabilities of the latest open-source LLMs with reasoning agents that can judiciously use web search tools to answer queries. Concretely, ODS consists of two components that work with a base LLM chosen by the user: Open Search Tool and Open Reasoning Agent. Open Reasoning Agent interprets the given task and completes it by orchestrating a sequence of actions that includes calling tools, one of which is the Open Search Tool. Open Search Tool is a novel web search tool that outperforms proprietary counterparts. Together with powerful open-source reasoning LLMs, such as DeepSeek-R1, ODS nearly matches and sometimes surpasses the existing state-of-the-art baselines on two benchmarks: SimpleQA and FRAMES. For example, on the FRAMES evaluation benchmark, ODS improves the best existing baseline of the recently released GPT-4o Search Preview by 9.7% in accuracy. ODS is a general framework for seamlessly augmenting any LLM -- for example, DeepSeek-R1 that achieves 82.4% on SimpleQA and 30.1% on FRAMES -- with search and reasoning capabilities to achieve state-of-the-art performance: 88.3% on SimpleQA and 75.3% on FRAMES.

ViLBench: A Suite for Vision-Language Process Reward Modeling

Haoqin Tu,Weitao Feng,Hardy Chen,Hui Liu,Xianfeng Tang,Cihang Xie

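Task: Evaluate and improve vision large language models (VLLMs) as process-supervised reward models (PRMs) and output reward models (ORMs).

Motivation: PRMs have received little evaluation, especially in the multimodal domain, and this gap needs to be filled.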

Details

Method: VLLMs are benchmarked as ORMs and PRMs on multiple vision-language benchmarks, and a new benchmark, ViLBench, is introduced. Process reward data is also collected with an enhanced tree-search algorithm and used to train a model. Result: Neither ORMs nor PRMs perform consistently across tasks, and stronger VLLMs do not necessarily give better reward performance. GPT-4o reaches only 27.3% accuracy on ViLBench. A 3B model trained on the 73.6K collected samples improves by 3.3% on average on ViLBench. Conclusion: The study exposes the limitations of VLLMs as reward models and proposes data collection and training as an effective way to improve them. Abstract: Process-supervised reward models serve as a fine-grained function that provides detailed step-wise feedback to model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite their advantages, evaluation of PRMs remains less explored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models: output reward models (ORMs) and process reward models (PRMs) on multiple vision-language benchmarks, which reveal that neither ORM nor PRM consistently outperforms across all tasks, and superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating the benchmark's challenge for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models -- by collecting 73.6K vision-language process reward data using an enhanced tree-search algorithm, our 3B model is able to achieve an average improvement of 3.3% over standard CoT and up to 2.5% compared to its untrained counterpart on ViLBench by selecting OpenAI o1's generations. We release the implementations at https://ucsc-vlaa.github.io/ViLBench with our code, model, and data.

TeleLoRA: Teleporting Model-Specific Alignment Across LLMs

Xiao Lin,Manoj Acharya,Anirban Roy,Susmit Jha

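Task: Propose TeleLoRA, a new framework for zero-shot Trojan mitigation on unseen LLMs.

Motivation: Different LLMs have different Trojan triggers and behaviors and therefore need model-specific alignment data; TeleLoRA addresses this by learning jointly across models.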

Details

Method: TeleLoRA learns a unified generator of LoRA adapter weights from local activation information across multiple LLMs; the generator is designed to be permutation symmetric so that it generalizes across models of different architectures and sizes. Result: Experiments show that TeleLoRA effectively reduces attack success rates while preserving the benign performance of the models. Conclusion: TeleLoRA offers an efficient and general solution for Trojan mitigation in LLMs. Abstract: Mitigating Trojans in Large Language Models (LLMs) is one of many tasks where alignment data is LLM-specific, as different LLMs have different Trojan triggers and trigger behaviors to be removed. In this paper, we introduce TeleLoRA (Teleporting Low-Rank Adaptation), a novel framework that synergizes model-specific alignment data across multiple LLMs to enable zero-shot Trojan mitigation on unseen LLMs without alignment data. TeleLoRA learns a unified generator of LoRA adapter weights by leveraging local activation information across multiple LLMs. This generator is designed to be permutation symmetric to generalize across models with different architectures and sizes. We optimize the model design for memory efficiency, making it feasible to learn with large-scale LLMs with minimal computational resources. Experiments on LLM Trojan mitigation benchmarks demonstrate that TeleLoRA effectively reduces attack success rates while preserving the benign performance of the models.

Faster Parameter-Efficient Tuning with Token Redundancy Reduction

Kwonyoung Kim,Jungin Park,Jin Kim,Hyeongjun Kwon,Kwanghoon Sohn

Task: Propose FPET, a parameter-efficient tuning method that improves inference speed and training efficiency.

Motivation: Conventional parameter-efficient tuning methods reduce storage and transfer costs but still inherit the inference latency of large pre-trained models, and their extra modules add computational overhead.

Details

Method: Introduces a plug-and-play token redundancy reduction module that uses an adapter to refine token similarity from the self-attention layer and applies a fully differentiable token merging strategy. Result: FPET achieves faster inference and higher memory efficiency while remaining competitive with state-of-the-art methods. Conclusion: FPET is an efficient and practical parameter-efficient tuning method suited to compute-intensive applications. Abstract: Parameter-efficient tuning (PET) aims to transfer pre-trained foundation models to downstream tasks by learning a small number of parameters. Compared to traditional fine-tuning, which updates the entire model, PET significantly reduces storage and transfer costs for each task regardless of exponentially increasing pre-trained model capacity. However, most PET methods inherit the inference latency of their large backbone models and often introduce extra computational overhead due to additional modules (e.g., adapters), limiting their practicality for compute-intensive applications. In this paper, we propose Faster Parameter-Efficient Tuning (FPET), a novel approach that enhances inference speed and training efficiency while maintaining high storage efficiency. Specifically, we introduce a plug-and-play token redundancy reduction module delicately designed for PET. This module refines tokens from the self-attention layer using an adapter to learn the accurate similarity between tokens and cuts off the tokens through a fully-differentiable token merging strategy, which uses a straight-through estimator for optimal token reduction. Experimental results show that our FPET achieves faster inference and higher memory efficiency than the pre-trained backbone while keeping competitive performance on par with state-of-the-art PET methods.
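
The idea of reducing token redundancy by merging similar tokens can be illustrated with a small sketch. This is not FPET's module: it uses raw cosine similarity instead of adapter-refined similarity, a greedy pairwise merge instead of the fully differentiable strategy, and no straight-through estimator. The function names and the toy tokens are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two token vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_most_similar(tokens, r):
    """Greedily merge the most similar token pair r times by averaging.

    Assumption: plain averaging and a greedy search stand in for the
    learned, differentiable merging described in the abstract.
    """
    tokens = [list(t) for t in tokens]
    for _ in range(r):
        if len(tokens) < 2:
            break
        best = None
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                s = cosine(tokens[i], tokens[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        tokens[i] = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
        del tokens[j]
    return tokens

# Four toy tokens forming two near-duplicate pairs.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
reduced = merge_most_similar(tokens, r=2)
```

After two merges the sequence is half as long, so every later layer processes fewer tokens, which is where the inference speedup would come from.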

ViLBench: A Suite for Vision-Language Process Reward Modeling

Haoqin Tu,Weitao Feng,Hardy Chen,Hui Liu,Xianfeng Tang,Cihang Xie

Task: Evaluate and improve vision large language models (VLLMs) as output reward models (ORMs) and process reward models (PRMs) in the multimodal domain.

Motivation: Process-supervised reward models (PRMs) provide fine-grained feedback on complex tasks, but their evaluation in the multimodal domain remains limited and needs further study.

Details

Method: Compares ORMs and PRMs on multiple vision-language benchmarks and introduces a new benchmark, ViLBench; collects process reward data with an enhanced tree-search algorithm and trains a model on it. Result: ORMs and PRMs perform inconsistently across tasks, and stronger VLLMs do not necessarily give better reward performance; the 3B model improves by 3.3% on average on ViLBench. Conclusion: By introducing ViLBench and collecting process reward data, the work shows the potential for improving VLLMs as reward models and points to new directions for future research. Abstract: Process-supervised reward models serve as a fine-grained function that provides detailed step-wise feedback to model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite their advantages, evaluation of PRMs remains less explored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models: output reward models (ORMs) and process reward models (PRMs) on multiple vision-language benchmarks, which reveal that neither ORM nor PRM consistently outperforms across all tasks, and superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating the benchmark's challenge for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models -- by collecting 73.6K vision-language process reward data using an enhanced tree-search algorithm, our 3B model is able to achieve an average improvement of 3.3% over standard CoT and up to 2.5% compared to its untrained counterpart on ViLBench by selecting OpenAI o1's generations. We release the implementations at https://ucsc-vlaa.github.io/ViLBench with our code, model, and data.
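
The difference between the two reward model types can be made concrete with a best-of-N selection sketch. The candidate structure, the precomputed scores, and the choice of `min` to aggregate step scores are all assumptions for illustration; the abstract does not specify how step rewards are combined.

```python
def orm_select(candidates):
    """Output reward model: rank candidates by one score for the final answer."""
    return max(candidates, key=lambda c: c["outcome_score"])

def prm_select(candidates, aggregate=min):
    """Process reward model: rank candidates by aggregated per-step scores.

    Assumption: aggregate=min penalizes a trajectory for its weakest step.
    """
    return max(candidates, key=lambda c: aggregate(s for s in c["step_scores"]))

# Hypothetical scores, as if produced by a VLLM acting as the reward model.
candidates = [
    {"id": "a", "step_scores": [0.9, 0.2, 0.9], "outcome_score": 0.8},
    {"id": "b", "step_scores": [0.7, 0.7, 0.7], "outcome_score": 0.6},
]

orm_choice = orm_select(candidates)["id"]
prm_choice = prm_select(candidates)["id"]
```

Here the two selectors disagree: the ORM prefers the candidate with the best-looking final answer, while the PRM rejects it because of one weak intermediate step.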

InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction

Yuhui Wu,Liyi Chen,Ruibin Li,Shihao Wang,Chenxi Xie,Lei Zhang

Task: Build InsViE-1M, a high-quality instruction-based video editing dataset, and train the InsViE model.

Motivation: Existing datasets are of low quality (low resolution, short duration, poor editing quality), which limits the performance of video editing models.

Details

Method: Curates high-quality source videos and images, designs an editing-filtering pipeline to produce high-quality training triplets, and trains the model with a multi-stage learning strategy. Result: Builds the InsViE-1M dataset of 1M triplets; experiments show advantages over existing work. Conclusion: The InsViE-1M dataset and model perform well in instruction following and editing ability. Abstract: Instruction-based video editing allows effective and interactive editing of videos using only instructions without extra inputs such as masks or attributes. However, collecting high-quality training triplets (source video, edited video, instruction) is a challenging task. Existing datasets mostly consist of a limited number of low-resolution, short-duration source videos with unsatisfactory editing quality, limiting the performance of trained editing models. In this work, we present a high-quality Instruction-based Video Editing dataset with 1M triplets, namely InsViE-1M. We first curate high-resolution and high-quality source videos and images, then design an effective editing-filtering pipeline to construct high-quality editing triplets for model training. For a source video, we generate multiple edited samples of its first frame with different intensities of classifier-free guidance, which are automatically filtered by GPT-4o with carefully crafted guidelines. The edited first frame is propagated to subsequent frames to produce the edited video, followed by another round of filtering for frame quality and motion evaluation. We also generate and filter a variety of video editing triplets from high-quality images. With the InsViE-1M dataset, we propose a multi-stage learning strategy to train our InsViE model, progressively enhancing its instruction following and editing ability. Extensive experiments demonstrate the advantages of our InsViE-1M dataset and the trained model over state-of-the-art works. Codes are available at InsViE.

QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions

Siyin Wang,Wenyi Yu,Xianzhao Chen,Xiaohai Tian,Jun Zhang,Yu Tsao,Junichi Yamagishi,Yuxuan Wang,Chao Zhang

Task: Explore a new approach to speech quality assessment based on natural language descriptions.

Motivation: Traditional numerical scoring lacks richness and detail, whereas natural language feedback gives more detailed evaluations and recommendations; existing datasets lack the annotations this requires.

Details

Method: Introduces the QualiSpeech dataset, covering 11 key aspects with detailed natural language comments, and proposes the QualiSpeech Benchmark to evaluate the low-level speech understanding of auditory large language models (LLMs). Result: Experiments show that finetuned auditory LLMs can reliably generate detailed descriptions of noise and distortion and identify their types and temporal characteristics. Conclusion: The results show the potential of incorporating reasoning to improve the accuracy and reliability of quality assessment; the dataset is to be released publicly. Abstract: This paper explores a novel perspective on speech quality assessment by leveraging natural language descriptions, offering richer, more nuanced insights than traditional numerical scoring methods. Natural language feedback provides instructive recommendations and detailed evaluations, yet existing datasets lack the comprehensive annotations needed for this approach. To bridge this gap, we introduce QualiSpeech, a comprehensive low-level speech quality assessment dataset encompassing 11 key aspects and detailed natural language comments that include reasoning and contextual insights. Additionally, we propose the QualiSpeech Benchmark to evaluate the low-level speech understanding capabilities of auditory large language models (LLMs). Experimental results demonstrate that finetuned auditory LLMs can reliably generate detailed descriptions of noise and distortion, effectively identifying their types and temporal characteristics. The results further highlight the potential for incorporating reasoning to enhance the accuracy and reliability of quality assessments. The dataset will be released at https://huggingface.co/datasets/tsinghua-ee/QualiSpeech.

RelTriple: Learning Plausible Indoor Layouts by Integrating Relationship Triples into the Diffusion Process

Kaifan Sun,Bingchen Yang,Peter Wonka,Jun Xiao,Haiyong Jiang

Task: Automatically extract spacing relationships from furniture layouts and generate more realistic indoor layouts.

Motivation: Manually defined furniture relationships are usually incomplete and can lead to unrealistic layouts; extracting spacing relationships automatically addresses this.

Details

Method: Extracts triple relationships using hierarchical analysis and Delaunay triangulation, and learns spacing relationships between objects and regions with the RelTriple approach. Result: Outperforms existing methods on unconditional layout generation, floorplan-conditioned layout generation, and scene rearrangement, with a gain of at least 12% on the spatial relationship metric. Conclusion: By modeling triple relationships, RelTriple markedly improves the spatial coherence and practical usability of furniture layouts. Abstract: The generation of indoor furniture layouts has significant applications in augmented reality, smart homes, and architectural design. Successful furniture arrangement requires proper physical relationships (e.g., collision avoidance) and spacing relationships between furniture and their functional zones to be respected. However, manually defined relationships are almost always incomplete and can produce unrealistic layouts. This work instead extracts spacing relationships automatically based on a hierarchical analysis and adopts the Delaunay Triangulation to produce important triple relationships. Compared to pairwise relationship modeling, triple relationships account for interactions and space utilization among multiple objects. To this end, we introduce RelTriple, a novel approach that enhances furniture distribution by learning spacing relationships between objects and regions. We formulate triple relationships as object-to-object (O2O) losses and object-to-region (O2R) losses and integrate them directly into the training process of generative diffusion. Our approach consistently improves over existing state-of-the-art methods in visual results and evaluation metrics on unconditional layout generation, floorplan-conditioned layout generation, and scene rearrangement, achieving an improvement of at least 12% on the introduced spatial relationship metric and superior spatial coherence and practical usability.
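
How a Delaunay triangulation yields object triples can be sketched in a few lines. The version below is a brute-force illustration over hypothetical furniture centers: a triple is kept when no other point lies strictly inside its circumcircle. It ignores the hierarchical analysis and the O2O/O2R losses, and how RelTriple turns triangles into relationships is not specified in the abstract.

```python
from itertools import combinations

def orient(a, b, c):
    """Positive if a, b, c are in counter-clockwise order, zero if collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_circle(a, b, c, p):
    """Positive if p is strictly inside the circumcircle of ccw triangle a, b, c."""
    rows = []
    for q in (a, b, c):
        dx, dy = q[0] - p[0], q[1] - p[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = rows
    return (a1 * (b2 * c3 - b3 * c2)
            - a2 * (b1 * c3 - b3 * c1)
            + a3 * (b1 * c2 - b2 * c1))

def delaunay_triples(points):
    """Return index triples whose circumcircle contains no other point (O(n^4))."""
    triples = []
    for i, j, k in combinations(range(len(points)), 3):
        a, b, c = points[i], points[j], points[k]
        o = orient(a, b, c)
        if o == 0:
            continue  # skip degenerate (collinear) triples
        if o < 0:
            b, c = c, b  # enforce counter-clockwise order
        others = (points[m] for m in range(len(points)) if m not in (i, j, k))
        if not any(in_circle(a, b, c, p) > 0 for p in others):
            triples.append((i, j, k))
    return triples

# Hypothetical 2D centers of four furniture items (integer coordinates).
centers = [(0, 0), (4, 0), (0, 3), (5, 4)]
triples = delaunay_triples(centers)
```

For real layouts a library routine such as `scipy.spatial.Delaunay` would be the practical choice; the brute-force test is used here only to keep the example dependency-free.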

VideoGEM: Training-free Action Grounding in Videos

Felix Vogel,Walid Bousselham,Anna Kukleva,Nina Shvetsova,Hilde Kuehne

Task: Propose VideoGEM, a training-free spatial action grounding method built on pretrained image- and video-language models.

Motivation: Vision-language foundation models perform well on zero-shot tasks, but extending them to localizing actions and events in videos remains difficult because actions lack a physical outline and are usually described by high-level concepts.

Details

Method: Adapts GEM's self-self attention, proposes layer weighting to prioritize higher semantic layers, and introduces dynamic weighting and prompt decomposition. Result: Evaluated on three backbones (CLIP, OpenCLIP, ViCLIP) and four video grounding datasets, the method outperforms current trained state-of-the-art approaches. Conclusion: VideoGEM shows the potential of training-free methods for spatial video action grounding. Abstract: Vision-language foundation models have shown impressive capabilities across various zero-shot tasks, including training-free localization and grounding, primarily focusing on localizing objects in images. However, leveraging those capabilities to localize actions and events in videos is challenging, as actions have less physical outline and are usually described by higher-level concepts. In this work, we propose VideoGEM, the first training-free spatial action grounding method based on pretrained image- and video-language backbones. Namely, we adapt the self-self attention formulation of GEM to spatial activity grounding. We observe that high-level semantic concepts, such as actions, usually emerge in the higher layers of the image- and video-language models. We, therefore, propose a layer weighting in the self-attention path to prioritize higher layers. Additionally, we introduce a dynamic weighting method to automatically tune layer weights to capture each layer's relevance to a specific prompt. Finally, we introduce a prompt decomposition, processing action, verb, and object prompts separately, resulting in a better spatial localization of actions. We evaluate the proposed approach on three image- and video-language backbones, CLIP, OpenCLIP, and ViCLIP, and on four video grounding datasets, V-HICO, DALY, YouCook-Interactions, and GroundingYouTube, showing that the proposed training-free approach is able to outperform current trained state-of-the-art approaches for spatial video grounding.
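
The effect of prioritizing higher layers can be shown with a simplified sketch. It weights per-layer heatmaps with a softmax over layer depth and takes the peak of the weighted sum. VideoGEM applies its weighting inside the self-attention path and tunes weights per prompt; the static depth-based weights, the `sharpness` parameter, and the toy heatmaps here are assumptions.

```python
import math

def layer_weights(num_layers, sharpness=1.0):
    """Softmax over layer index, so deeper layers receive larger weights."""
    logits = [sharpness * layer for layer in range(num_layers)]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine_heatmaps(heatmaps, weights):
    """Weighted sum of equally sized 2D heatmaps, one per layer."""
    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    return [[sum(w * hm[y][x] for w, hm in zip(weights, heatmaps))
             for x in range(width)] for y in range(height)]

def peak(heatmap):
    """Return the (row, col) position with the highest response."""
    cells = ((y, x) for y in range(len(heatmap)) for x in range(len(heatmap[0])))
    return max(cells, key=lambda p: heatmap[p[0]][p[1]])

# Toy maps: the lowest layer fires at (0, 0), the two higher layers at (1, 1).
heatmaps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
weights = layer_weights(len(heatmaps))
location = peak(combine_heatmaps(heatmaps, weights))
```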

CryoSAMU: Enhancing 3D Cryo-EM Density Maps of Protein Structures at Intermediate Resolution with Structure-Aware Multimodal U-Nets

Chenwei Zhang,Anne Condon,Khanh Dao Duc

Task: Propose CryoSAMU, a new method for enhancing cryo-EM 3D density maps at intermediate resolution (4-8 Å).

Motivation: Existing deep learning methods are not optimized for intermediate-resolution density maps and rely on density map features alone.

Details

Method: Uses structure-aware multimodal U-Nets trained on intermediate-resolution density maps. Result: CryoSAMU performs well across multiple metrics with significantly faster processing. Conclusion: CryoSAMU is competitive for enhancing cryo-EM density maps and shows promise for practical applications. Abstract: Enhancing cryogenic electron microscopy (cryo-EM) 3D density maps at intermediate resolution (4-8 Å) is crucial in protein structure determination. Recent advances in deep learning have led to the development of automated approaches for enhancing experimental cryo-EM density maps. Yet, these methods are not optimized for intermediate-resolution maps and rely on map density features alone. To address this, we propose CryoSAMU, a novel method designed to enhance 3D cryo-EM density maps of protein structures using structure-aware multimodal U-Nets and trained on curated intermediate-resolution density maps. We comprehensively evaluate CryoSAMU across various metrics and demonstrate its competitive performance compared to state-of-the-art methods. Notably, CryoSAMU achieves significantly faster processing speed, showing promise for future practical applications. Our code is available at https://github.com/chenwei-zhang/CryoSAMU.

VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

Jiale Cheng,Ruiliang Lyu,Xiaotao Gu,Xiao Liu,Jiazheng Xu,Yida Lu,Jiayan Teng,Zhuoyi Yang,Yuxiao Dong,Jie Tang,Hongning Wang,Minlie Huang

Task: Propose VPO, a prompt optimization framework based on harmlessness, accuracy, and helpfulness, to improve the quality and safety of text-to-video generation.

Motivation: Real-world user inputs are often concise, vague, or poorly structured, unlike the detailed descriptions in training data, which makes prompt optimization crucial. Existing methods rely on large language models to refine prompts but may distort user intent, omit key details, or introduce safety risks, and they ignore the effect on video quality.

Details

Method: VPO uses two-stage optimization: 1) construct and refine a supervised fine-tuning dataset based on safety and alignment principles; 2) apply text-level and video-level feedback to further optimize the model through preference learning. Result: Experiments show that VPO significantly improves safety, alignment, and video quality and generalizes well. VPO can also outperform RLHF methods or be combined with them. Conclusion: Through its core principles and two-stage optimization, VPO effectively aligns video generation models and improves generation quality and safety. Abstract: Video generation models have achieved remarkable progress in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods. Moreover, VPO shows strong generalization across video generation models. Furthermore, we demonstrate that VPO could outperform and be combined with RLHF methods on video generation models, underscoring the effectiveness of VPO in aligning video generation models. Our code and data are publicly available at https://github.com/thu-coai/VPO.

Context-Aware Weakly Supervised Image Manipulation Localization with SAM Refinement

Xinghao Wang,Changtao Miao,Dianmo Sheng,Tao Gong,Qi Chu,Bin Liu,Nenghai Yu

Task: Propose a weakly supervised image manipulation localization method that needs only image-level binary labels for training.

Motivation: Existing weakly supervised image manipulation localization methods overlook edge information, leading to poor localization performance.

Details

Method: Proposes a Context-Aware Boundary Localization (CABL) module and a CAM-Guided SAM Refinement (CGSR) module, combined in a dual-branch Transformer-CNN architecture. Result: Achieves strong localization performance on multiple datasets. Conclusion: Integrating edge information and context inconsistency significantly improves the accuracy of weakly supervised image manipulation localization. Abstract: Malicious image manipulation poses societal risks, increasing the importance of effective image manipulation detection methods. Recent work in image manipulation detection has largely been driven by fully supervised approaches, which require labor-intensive pixel-level annotations. Thus, it is essential to explore weakly supervised image manipulation localization methods that only require image-level binary labels for training. However, existing weakly supervised image manipulation methods overlook the importance of edge information for accurate localization, leading to suboptimal localization performance. To address this, we propose a Context-Aware Boundary Localization (CABL) module to aggregate boundary features and learn context-inconsistency for localizing manipulated areas. Furthermore, by leveraging Class Activation Mapping (CAM) and Segment Anything Model (SAM), we introduce the CAM-Guided SAM Refinement (CGSR) module to generate more accurate manipulation localization maps. By integrating the two modules, we present a novel weakly supervised framework based on a dual-branch Transformer-CNN architecture. Our method achieves outstanding localization performance across multiple datasets.

Exploring the Effect of Robotic Embodiment and Empathetic Tone of LLMs on Empathy Elicitation

Liza Darwesh,Jaspreet Singh,Marin Marian,Eduard Alexa,Koen Hindriks,Kim Baraka

Task: Study how interaction with social agents elicits empathy toward a third party.

Motivation: Investigate whether a physical robot and a voice chatbot can elicit empathetic behavior from users through an empathetic tone.

Details

Method: Participants interacted with either a physical robot or a voice chatbot, driven by a large language model and showing either an empathetic or a neutral tone; the interaction centered on the plight of a fictional character, Katie Banks. Result: Neither robotic embodiment nor empathetic tone significantly affected participants' willingness to volunteer; the LLM could simulate empathy but struggled to elicit genuine empathetic responses. Conclusion: Simulated empathy from social agents has limited effect, and further research is needed on how to elicit genuine empathy. Abstract: This study investigates the elicitation of empathy toward a third party through interaction with social agents. Participants engaged with either a physical robot or a voice-enabled chatbot, both driven by a large language model (LLM) programmed to exhibit either an empathetic tone or remain neutral. The interaction is focused on a fictional character, Katie Banks, who is in a challenging situation and in need of financial donations. The willingness to help Katie, measured by the number of hours participants were willing to volunteer, along with their perceptions of the agent, were assessed for 60 participants. Results indicate that neither robotic embodiment nor empathetic tone significantly influenced participants' willingness to volunteer. While the LLM effectively simulated human empathy, fostering genuine empathetic responses in participants proved challenging.

Traversing Distortion-Perception Tradeoff using a Single Score-Based Generative Model

Yuhan Wang,Suzhi Bi,Ying-Jun Angela Zhang,Xiaojun Yuan

Task: Explore how a single pre-trained score-based generative model can flexibly and optimally traverse the distortion-perception (DP) tradeoff.

Motivation: Existing algorithms either sacrifice distortion to prioritize perceptual quality or focus on minimizing MSE for faithful restoration, and cannot flexibly adapt to different DP points.

Details

Method: Proposes a variance-scaled reverse diffusion process, theoretically characterizes its marginal distribution, and proves that the process is an optimal solution to the DP tradeoff for conditional Gaussian distributions. Result: Experiments show that a single score network can effectively and flexibly traverse the DP tradeoff for general denoising problems. Conclusion: Score-based generative models offer a way to traverse the DP tradeoff flexibly and optimally. Abstract: The distortion-perception (DP) tradeoff reveals a fundamental conflict between distortion metrics (e.g., MSE and PSNR) and perceptual quality. Recent research has increasingly concentrated on evaluating denoising algorithms within the DP framework. However, existing algorithms either prioritize perceptual quality by sacrificing acceptable distortion, or focus on minimizing MSE for faithful restoration. When the goal shifts or noisy measurements vary, adapting to different points on the DP plane needs retraining or even re-designing the model. Inspired by recent advances in solving inverse problems using score-based generative models, we explore the potential of flexibly and optimally traversing DP tradeoffs using a single pre-trained score-based model. Specifically, we introduce a variance-scaled reverse diffusion process and theoretically characterize the marginal distribution. We then prove that the proposed sampling process is an optimal solution to the DP tradeoff for conditional Gaussian distribution. Experimental results on two-dimensional and image datasets illustrate that a single score network can effectively and flexibly traverse the DP tradeoff for general denoising problems.
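
A closed-form scalar Gaussian example gives some intuition for why scaling the sampling variance moves along the DP plane. It is not the paper's reverse diffusion process and makes no optimality claim. Assume X ~ N(0, signal_var), Y = X + N with N ~ N(0, noise_var), and the estimator `posterior_mean + scale * sqrt(posterior_var) * z` with standard normal z; the function name and the use of the Wasserstein-2 distance as the perception measure are choices made for this sketch.

```python
import math

def dp_point(signal_var, noise_var, scale):
    """Return (MSE, W2 distance to the prior) for a noise-scaled estimator.

    scale = 0 gives the posterior mean (MMSE estimate);
    scale = 1 gives a posterior sample.
    """
    post_var = signal_var * noise_var / (signal_var + noise_var)
    mean_var = signal_var ** 2 / (signal_var + noise_var)  # variance of posterior mean
    mse = post_var * (1.0 + scale ** 2)
    est_var = mean_var + scale ** 2 * post_var
    # W2 distance between two zero-mean 1D Gaussians is the gap in std devs.
    w2 = abs(math.sqrt(signal_var) - math.sqrt(est_var))
    return mse, w2

mse_mmse, w2_mmse = dp_point(1.0, 1.0, scale=0.0)
mse_mid, w2_mid = dp_point(1.0, 1.0, scale=0.5)
mse_post, w2_post = dp_point(1.0, 1.0, scale=1.0)
```

At `scale = 0` the distortion is minimal but the output distribution is too narrow; at `scale = 1` the output matches the prior exactly at the cost of twice the MMSE distortion. Intermediate scales trade one for the other.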

Optimizing Case-Based Reasoning System for Functional Test Script Generation with Large Language Models

Siyuan Guo,Huiwu Liu,Xiaolong Chen,Yuming Xie,Liang Zhang,Tao Han,Hechang Chen,Yi Chang,Jun Wang

Task: Explore the potential of large language models (LLMs) for generating functional test scripts.

Motivation: Generating test scripts requires understanding the dynamically evolving code structure of the target software, and generation efficiency and accuracy need to improve.

Details

Method: Proposes a case-based reasoning (CBR) system built on the 4R cycle (retrieve, reuse, revise, retain) and introduces the Re4 optimization method, comprising reranking-based retrieval finetuning and reinforced reuse finetuning. Result: Experiments on two product development units from Huawei Datacom show the superiority of CBR+Re4, which also alleviates repetitive generation in LLMs. Conclusion: CBR+Re4 effectively improves LLM performance on test script generation and improves the user experience. Abstract: In this work, we explore the potential of large language models (LLMs) for generating functional test scripts, which necessitates understanding the dynamically evolving code structure of the target software. To achieve this, we propose a case-based reasoning (CBR) system utilizing a 4R cycle (i.e., retrieve, reuse, revise, and retain), which maintains and leverages a case bank of test intent descriptions and corresponding test scripts to facilitate LLMs for test script generation. To improve user experience further, we introduce Re4, an optimization method for the CBR system, comprising reranking-based retrieval finetuning and reinforced reuse finetuning. Specifically, we first identify positive examples with high semantic and script similarity, providing reliable pseudo-labels for finetuning the retriever model without costly labeling. Then, we apply supervised finetuning, followed by a reinforcement learning finetuning stage, to align LLMs with our production scenarios, ensuring the faithful reuse of retrieved cases. Extensive experimental results on two product development units from Huawei Datacom demonstrate the superiority of the proposed CBR+Re4. Notably, we also show that the proposed Re4 method can help alleviate the repetitive generation issues with LLMs.
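
The 4R cycle over a case bank of (test intent, test script) pairs can be sketched as below. Everything concrete is hypothetical: word-overlap (Jaccard) retrieval stands in for the finetuned retriever, and the `toy_reuse` and `toy_revise` callables stand in for the LLM and the revision step. The script strings are made-up placeholders.

```python
def jaccard(a, b):
    """Word-overlap similarity between two intent descriptions."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

class CaseBank:
    """Stores (intent, script) cases and supports retrieve and retain."""

    def __init__(self):
        self.cases = []

    def retrieve(self, intent, k=1):
        ranked = sorted(self.cases, key=lambda c: jaccard(intent, c[0]), reverse=True)
        return ranked[:k]

    def retain(self, intent, script):
        self.cases.append((intent, script))

def four_r_cycle(bank, intent, reuse, revise):
    """Run retrieve -> reuse -> revise -> retain for one new test intent."""
    retrieved = bank.retrieve(intent)
    draft = reuse(intent, retrieved)
    final = revise(draft)
    bank.retain(intent, final)
    return final

def toy_reuse(intent, retrieved):
    # Stand-in for an LLM adapting the retrieved script to the new intent.
    return retrieved[0][1] if retrieved else ""

def toy_revise(script):
    # Stand-in for review or automated checking of the draft script.
    return script.strip()

bank = CaseBank()
bank.retain("check ping between two routers", "ping_test(r1, r2)")
bank.retain("verify vlan configuration on switch", "vlan_test(sw1)")
script = four_r_cycle(bank, "check ping between two core routers", toy_reuse, toy_revise)
```

Because the final script is retained, the case bank grows with every generated test, which is what lets later retrievals benefit from earlier work.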

Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability

Jianyang Zhang,Qianli Luo,Guowu Yang,Wenjing Yang,Weide Liu,Guosheng Lin,Fengmao Lv

Task: Propose the Attribute-formed Language Bottleneck Model (ALBM) to address spurious cue inference and poor generalization in Language Bottleneck Models (LBMs) for image recognition.

Motivation: Current LBMs simply stack all concepts into the bottleneck layer, which causes the spurious cue inference problem and prevents generalization to unseen classes.

Details

Method: ALBM organizes concepts in an attribute-formed class-specific space to avoid spurious cue inference; proposes Visual Attribute Prompt Learning (VAPL) to improve interpretability; and uses the DSS strategy to generate high-quality concept sets automatically. Result: Interpretability, transferability, and performance are validated on 9 widely used few-shot benchmarks. Conclusion: With a structured concept space and an automatic generation strategy, ALBM significantly improves the interpretability and generalization of image recognition. Abstract: Language Bottleneck Models (LBMs) are proposed to achieve interpretable image recognition by classifying images based on textual concept bottlenecks. However, current LBMs simply list all concepts together as the bottleneck layer, which leads to the spurious cue inference problem and prevents generalization to unseen classes. To address these limitations, we propose the Attribute-formed Language Bottleneck Model (ALBM). ALBM organizes concepts in the attribute-formed class-specific space, where concepts are descriptions of specific attributes for specific classes. In this way, ALBM can avoid the spurious cue inference problem by classifying solely based on the essential concepts of each class. In addition, the cross-class unified attribute set ensures that the concept spaces of different classes are strongly correlated; as a result, the learned concept classifier can be easily generalized to unseen classes. Moreover, to further improve interpretability, we propose Visual Attribute Prompt Learning (VAPL) to extract visual features on fine-grained attributes. Furthermore, to avoid labor-intensive concept annotation, we propose the Description, Summary, and Supplement (DSS) strategy to automatically generate high-quality concept sets with a complete and precise attribute. Extensive experiments on 9 widely used few-shot benchmarks demonstrate the interpretability, transferability, and performance of our approach. The code and collected concept sets are available at https://github.com/tiggers23/ALBM.

TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews

Huimin Xu,Seungjun Yi,Terence Lim,Jiawei Xu,Andrew Well,Carlos Mery,Aidong Zhang,Yuji Zhang,Heng Ji,Keshav Pingali,Yan Leng,Ying Ding

Task: Propose TAMA, a human-AI collaborative thematic analysis framework using multi-agent LLMs for clinical interviews.

Motivation: Thematic analysis is valuable in healthcare but resource-intensive, and LLM applications there remain unexplored.

Details

Method: Leverages the scalability and coherence of multi-agent systems through structured conversations between agents, and incorporates the expertise of cardiac experts in the thematic analysis. Result: On interview transcripts from parents of children with AAOCA, TAMA outperforms existing LLM-assisted approaches, with higher thematic hit rate, coverage, and distinctiveness. Conclusion: Through human-AI collaboration, TAMA significantly improves thematic analysis quality while reducing manual workload, showing potential in clinical settings. Abstract: Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data. TA provides valuable insights in healthcare but is resource-intensive. Large Language Models (LLMs) have been introduced to perform TA, yet their applications in healthcare remain unexplored. Here, we propose TAMA: A Human-AI Collaborative Thematic Analysis framework using Multi-Agent LLMs for clinical interviews. We leverage the scalability and coherence of multi-agent systems through structured conversations between agents and coordinate the expertise of cardiac experts in TA. Using interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we demonstrate that TAMA outperforms existing LLM-assisted TA approaches, achieving higher thematic hit rate, coverage, and distinctiveness. TAMA demonstrates strong potential for automated TA in clinical settings by leveraging multi-agent LLM systems with human-in-the-loop integration, enhancing quality while significantly reducing manual workload.

Zitian Wang,Yue Liao,Kang Rong,Fengyun Rao,Yibo Yang,Si Liu

Task: Improve the general comprehension capability of multimodal large language models (MLLMs) through Instruction-oriented Preference Alignment (IPA).

Motivation: Existing preference alignment methods focus mainly on hallucination factors and overlook factors essential to multimodal comprehension, so their gains are confined to hallucination mitigation.

Details

Method: Proposes the IPA framework, comprising automated preference construction, a verification process that identifies instruction-oriented factors, and a progressive preference collection pipeline. Result: Experiments on Qwen2VL-7B show that IPA is effective on hallucination evaluation, visual question answering, and text understanding tasks. Conclusion: IPA improves the model's general comprehension and fills a gap left by existing methods. Abstract: Preference alignment has emerged as an effective strategy to enhance the performance of Multimodal Large Language Models (MLLMs) following supervised fine-tuning. While existing preference alignment methods predominantly target hallucination factors, they overlook the factors essential for multi-modal comprehension capabilities, often narrowing their improvements to hallucination mitigation. To bridge this gap, we propose Instruction-oriented Preference Alignment (IPA), a scalable framework designed to automatically construct alignment preferences grounded in instruction fulfillment efficacy. Our method involves an automated preference construction coupled with a dedicated verification process that identifies instruction-oriented factors, avoiding significant variability in response representations. Additionally, IPA incorporates a progressive preference collection pipeline, further recalling challenging samples through model self-evolution and reference-guided refinement. Experiments conducted on Qwen2VL-7B demonstrate IPA's effectiveness across multiple benchmarks, including hallucination evaluation, visual question answering, and text understanding tasks, highlighting its capability to enhance general comprehension.

Vision as LoRA

Han Wang,Yongjie Ye,Bingru Li,Yuxiang Nie,Jinghui Lu,Jingqun Tang,Yanjie Wang,Can Huang

Task: Introduce Vision as LoRA (VoRA), a new paradigm for turning a large language model (LLM) into a multimodal large language model (MLLM).

Motivation: Existing MLLM architectures rely on external vision modules for visual encoding; VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM, reducing structural complexity and computational overhead.

Details

Method: Integrates vision-specific LoRA layers, uses block-wise distillation to transfer visual priors from a pre-trained ViT, and applies bi-directional attention masks to better capture image context. Result: With additional pre-training data, VoRA performs comparably to conventional encoding-based MLLMs. Conclusion: By internalizing visual capabilities and training efficiently, VoRA matches existing MLLMs while reducing complexity. Abstract: We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability to handle flexible context, VoRA can process inputs at arbitrary resolutions. To further strengthen VoRA's visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the context information of an image. We successfully demonstrate that with additional pre-training data, VoRA can perform comparably with conventional encode-based MLLMs. All training data, codes, and model weights will be released at https://github.com/Hon-Wong/VoRA.
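
The claim that LoRA parameters can be merged into the base weights at inference follows from the standard LoRA update W' = W + (alpha / r) * B A. The sketch below checks this on tiny matrices with pure Python lists; the shapes, values, and helper names are illustrative and are not taken from VoRA.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_scaled(a, b, scale):
    """Return a + scale * b elementwise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold the low-rank update into the base weight: W + (alpha / rank) * B A."""
    return add_scaled(weight, matmul(lora_b, lora_a), alpha / rank)

# Base weight W is 2x3, B is 2x1, A is 1x3, so the rank is 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[1.0], [2.0]]
alpha, rank = 2.0, 1
x = [[1.0], [1.0], [1.0]]  # input as a column vector

merged = merge_lora(W, A, B, alpha, rank)
y_merged = matmul(merged, x)
# Unmerged path: base output plus the scaled adapter output.
y_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), alpha / rank)
```

Both paths give the same output, so after merging the model has the shape and cost of the original LLM with no separate adapter branch.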

Enabling Heterogeneous Adversarial Transferability via Feature Permutation Attacks

Tao Wu,Tie Luo

Task: Propose the Feature Permutation Attack (FPA) to improve the transferability of black-box adversarial attacks across heterogeneous architectures.

Motivation: Existing transfer-based black-box attacks degrade significantly across heterogeneous architectures (such as CNNs, MLPs, and ViTs) because of architectural differences.

Details

Method: FPA applies a feature permutation operation that rearranges pixel values in feature maps to simulate long-range dependencies, making CNNs behave more like ViTs and MLPs. Result: Experiments on 14 state-of-the-art architectures show maximum absolute gains in attack success rate of 7.68% on CNNs, 14.57% on ViTs, and 14.48% on MLPs. Conclusion: FPA is an efficient, lightweight, and general method that significantly improves adversarial transferability across heterogeneous architectures. Abstract: Adversarial attacks in black-box settings are highly practical, with transfer-based attacks being the most effective at generating adversarial examples (AEs) that transfer from surrogate models to unseen target models. However, their performance significantly degrades when transferring across heterogeneous architectures -- such as CNNs, MLPs, and Vision Transformers (ViTs) -- due to fundamental architectural differences. To address this, we propose Feature Permutation Attack (FPA), a zero-FLOP, parameter-free method that enhances adversarial transferability across diverse architectures. FPA introduces a novel feature permutation (FP) operation, which rearranges pixel values in selected feature maps to simulate long-range dependencies, effectively making CNNs behave more like ViTs and MLPs. This enhances feature diversity and improves transferability both across heterogeneous architectures and within homogeneous CNNs. Extensive evaluations on 14 state-of-the-art architectures show that FPA achieves maximum absolute gains in attack success rates of 7.68% on CNNs, 14.57% on ViTs, and 14.48% on MLPs, outperforming existing black-box attacks. Additionally, FPA is highly generalizable and can seamlessly integrate with other transfer-based attacks to further boost their performance. Our findings establish FPA as a robust, efficient, and computationally lightweight strategy for enhancing adversarial transferability across heterogeneous architectures.
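
A feature permutation can be sketched as a pure index shuffle, which is consistent with the zero-FLOP, parameter-free description. The details below are assumptions rather than the paper's procedure: channels are chosen uniformly at random with a hypothetical `channel_ratio`, and all spatial positions inside a chosen channel are shuffled.

```python
import random

def feature_permutation(feature_map, channel_ratio=0.5, seed=0):
    """Shuffle pixel positions inside a random subset of channels.

    feature_map is a list of channels, each a list of rows.
    Returns the permuted copy and the set of chosen channel indices.
    """
    rng = random.Random(seed)
    num_channels = len(feature_map)
    k = max(1, int(num_channels * channel_ratio))
    chosen = set(rng.sample(range(num_channels), k))
    out = []
    for c, channel in enumerate(feature_map):
        if c not in chosen:
            out.append([row[:] for row in channel])  # untouched channel, copied
            continue
        width = len(channel[0])
        flat = [v for row in channel for v in row]
        rng.shuffle(flat)  # values survive, spatial positions change
        out.append([flat[i:i + width] for i in range(0, len(flat), width)])
    return out, chosen

# Toy feature map: 4 channels of size 2x3 with distinct values.
fmap = [[[c * 10 + r * 3 + col for col in range(3)] for r in range(2)]
        for c in range(4)]
permuted, chosen = feature_permutation(fmap)
```

Moving values to distant positions mixes information across the spatial grid, which is the sense in which the operation imitates long-range dependencies.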

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu,Changyu Chen,Wenjun Li,Penghui Qi,Tianyu Pang,Chao Du,Wee Sun Lee,Min Lin

Task: Analyze the core components of DeepSeek-R1-Zero-like training (base models and reinforcement learning), propose the unbiased optimization method Dr. GRPO, and design a minimalist R1-Zero recipe.

Motivation: Examine how the pretraining characteristics of base models affect reinforcement learning performance, and address the optimization bias in GRPO.

Details

Method: Analyzes a range of base models (such as DeepSeek-V3-Base and Qwen2.5), proposes the Dr. GRPO optimization method, and designs a minimalist R1-Zero training recipe. Result: Reaches 43.3% accuracy on AIME 2024 with a 7B base model, a new state of the art. Conclusion: The pretraining characteristics of base models strongly affect reinforcement learning performance, Dr. GRPO effectively removes the optimization bias, and the minimalist recipe performs well. Abstract: DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.
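
The abstract states only that GRPO has an optimization bias that inflates response length, especially for incorrect outputs. One mechanism that can produce such a bias, assumed here rather than taken from the text above, is dividing each response's token-level loss by that response's own length: a wrong answer with a negative advantage is then penalized less per token the longer it is. The sketch uses group-normalized advantages with a population standard deviation, which is also an assumption, and contrasts them with a plainly centered variant with no length division. Whether that variant matches Dr. GRPO exactly is not confirmed by the abstract.

```python
import statistics

def group_normalized_advantages(rewards, eps=1e-8):
    """Advantage = (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def per_token_weight(advantage, length):
    """Weight each token receives when the loss is averaged over the response."""
    return advantage / length

def centered_advantages(rewards):
    """Assumed debiased variant: subtract the group mean only."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]

# One group of four sampled responses: 1 = correct, 0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [100, 100, 400, 100]  # the second wrong answer is four times longer

advantages = group_normalized_advantages(rewards)
weight_short_wrong = per_token_weight(advantages[1], lengths[1])
weight_long_wrong = per_token_weight(advantages[2], lengths[2])
centered = centered_advantages(rewards)
```

Both wrong answers have the same advantage, yet the longer one gets a smaller per-token penalty under length averaging, which rewards rambling when the answer is wrong.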

Wan: Open and Advanced Large-Scale Video Generative Models

WanTeam: Ang Wang,Baole Ai,Bin Wen,Chaojie Mao,Chen-Wei Xie,Di Chen,Feiwu Yu,Haiming Zhao,Jianxiao Yang,Jianyuan Zeng,Jiayu Wang,Jingfeng Zhang,Jingren Zhou,Jinkai Wang,Jixuan Chen,Kai Zhu,Kang Zhao,Keyu Yan,Lianghua Huang,Mengyang Feng,Ningyi Zhang,Pandeng Li,Pingyu Wu,Ruihang Chu,Ruili Feng,Shiwei Zhang,Siyang Sun,Tao Fang,Tianxing Wang,Tianyi Gui,Tingyu Weng,Tong Shen,Wei Lin,Wei Wang,Wei Wang,Wenmeng Zhou,Wente Wang,Wenting Shen,Wenyuan Yu,Xianzhong Shi,Xiaoming Huang,Xin Xu,Yan Kou,Yangyu Lv,Yifei Li,Yijing Liu,Yiming Wang,Yingya Zhang,Yitong Huang,Yong Li,You Wu,Yu Liu,Yulin Pan,Yun Zheng,Yuntao Hong,Yupeng Shi,Yutong Feng,Zeyinzi Jiang,Zhen Han,Zhi-Fan Wu,Ziyu Liu

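Task: Develop and open-source Wan, a comprehensive suite of video foundation models, to advance video generation technology.

Motivation: Improve the performance and versatility of video generation models through a novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics.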

Details

Method: Builds on the diffusion transformer paradigm and combines a novel VAE, large-scale dataset training, and automated evaluation to develop models with 1.3B and 14B parameters. Result: Wan significantly outperforms existing open-source and commercial models on multiple benchmarks; the 14B model demonstrates scaling laws with respect to data and model size, and the 1.3B model is notably resource-efficient. Conclusion: Wan's openness and versatility give the video generation community and industry high-quality foundation models and foster innovation in video creation. Abstract: This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at https://github.com/Wan-Video/Wan2.1.

SpikeDerain: Unveiling Clear Videos from Rainy Sequences Using Color Spike Streams

Hanwen Liang,Xian Zhong,Wenxuan Liu,Yajing Zheng,Wenxin Huang,Zhaofei Yu,Tiejun Huang

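Task: Restore clear frames from rainy videos by removing rain streaks.

Motivation: Traditional frame-based visual sensors struggle to capture fast-moving rain streaks accurately, and existing multimodal methods suffer from hardware synchronization errors and computational redundancy.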

Details

Method: Proposes a deraining network based on color spike streams (SpikeDerain), together with a physically interpretable rain streak synthesis model that generates synthetic training data. Result: Experiments show that the network remains highly robust even under extreme rainfall. Conclusion: The method is effective and robust across rainfall levels and datasets, setting a new standard for video deraining. Abstract: Restoring clear frames from rainy videos presents a significant challenge due to the rapid motion of rain streaks. Traditional frame-based visual sensors, which capture scene content synchronously, struggle to capture the fast-moving details of rain accurately. In recent years, neuromorphic sensors have introduced a new paradigm for dynamic scene perception, offering microsecond temporal resolution and high dynamic range. However, existing multimodal methods that fuse event streams with RGB images face difficulties in handling the complex spatiotemporal interference of raindrops in real scenes, primarily due to hardware synchronization errors and computational redundancy. In this paper, we propose a Color Spike Stream Deraining Network (SpikeDerain), capable of reconstructing spike streams of dynamic scenes and accurately removing rain streaks. To address the challenges of data scarcity in real continuous rainfall scenes, we design a physically interpretable rain streak synthesis model that generates parameterized continuous rain patterns based on arbitrary background images. Experimental results demonstrate that the network, trained with this synthetic data, remains highly robust even under extreme rainfall conditions. These findings highlight the effectiveness and robustness of our method across varying rainfall levels and datasets, setting new standards for video deraining tasks. The code will be released soon.

EditCLIP: Representation Learning for Image Editing

Qian Wang,Aleksandar Cvejic,Abdelrahman Eldesokey,Peter Wonka

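Task: Propose EditCLIP, a new representation-learning approach for image editing.

Motivation: Learn a unified representation of edits by jointly encoding an input image and its edited version, so that the transformation between them is captured more faithfully.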

Details

Method: Replaces text instructions with EditCLIP embeddings for exemplar-based image editing, and performs automated edit evaluation by measuring embedding similarity. Result: Experiments show that EditCLIP outperforms existing methods in both performance and efficiency, and that its automated evaluation aligns more closely with human judgments. Conclusion: EditCLIP is an efficient, versatile representation-learning approach for image editing that can reliably assess edit quality. Abstract: We introduce EditCLIP, a novel representation-learning approach for image editing. Our method learns a unified representation of edits by jointly encoding an input image and its edited counterpart, effectively capturing their transformation. To evaluate its effectiveness, we employ EditCLIP to solve two tasks: exemplar-based image editing and automated edit evaluation. In exemplar-based image editing, we replace text-based instructions in InstructPix2Pix with EditCLIP embeddings computed from a reference exemplar image pair. Experiments demonstrate that our approach outperforms state-of-the-art methods while being more efficient and versatile. For automated evaluation, EditCLIP assesses image edits by measuring the similarity between the EditCLIP embedding of a given image pair and either a textual editing instruction or the EditCLIP embedding of another reference image pair. Experiments show that EditCLIP aligns more closely with human judgments than existing CLIP-based metrics, providing a reliable measure of edit quality and structural preservation.
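
The automated evaluation described above compares two embeddings. The abstract does not name the similarity measure, so the sketch below assumes cosine similarity, which is the usual choice for CLIP-style embeddings. The vectors are made-up stand-ins for EditCLIP embeddings and are not produced by any real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: one for the (input, edited) pair under evaluation,
# one for a reference pair or a text instruction describing the intended edit.
edit_embedding = [0.2, 0.7, 0.1]
reference_embedding = [0.25, 0.65, 0.05]

score = cosine_similarity(edit_embedding, reference_embedding)
print(f"edit similarity: {score:.3f}")
```

A higher score would indicate that the evaluated edit points in the same direction as the reference edit, under the cosine assumption stated above.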

Recovering Dynamic 3D Sketches from Videos

Jaeah Lee,Changwoon Choi,Young Min Kim,Jaesik Park

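Task: Propose Liv3Stroke, a new method that abstracts the motion of objects in videos with deformable 3D strokes.

Motivation: Understanding 3D motion in videos is challenging because motion types are diverse, spanning rigid, deformable, and articulated structures.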

Details

Method: Uses a set of parametric 3D curves to capture spatially smooth motion elements, extracts 3D point cloud motion guidance from video frames via semantic features, and deforms the curves to abstract the essential motion features. Result: The method enables direct analysis of 3D object motion from video and reduces the uncertainty introduced when real-world motion is recorded as video. Conclusion: By abstracting motion with 3D strokes, Liv3Stroke improves understanding of the prominent components of motion while remaining robust to environmental factors. Abstract: Understanding 3D motion from videos presents inherent challenges due to the diverse types of movement, ranging from rigid and deformable objects to articulated structures. To overcome this, we propose Liv3Stroke, a novel approach for abstracting objects in motion with deformable 3D strokes. The detailed movements of an object may be represented by unstructured motion vectors or a set of motion primitives using a pre-defined articulation from a template model. Just as a free-hand sketch can intuitively visualize scenes or intentions with a sparse set of lines, we utilize a set of parametric 3D curves to capture a set of spatially smooth motion elements for general objects with unknown structures. We first extract noisy, 3D point cloud motion guidance from video frames using semantic features, and our approach deforms a set of curves to abstract essential motion features as a set of explicit 3D representations. Such abstraction enables an understanding of prominent components of motions while maintaining robustness to environmental factors. Our approach allows direct analysis of 3D object movements from video, tackling the uncertainty that typically occurs when translating real-world motion into recorded footage. The project page is accessible via: https://jaeah.me/liv3stroke_web

Dynamic Pyramid Network for Efficient Multimodal Large Language Model

Hao Ai,Kunyi Wang,Zezhou Wang,Hao Lu,Jin Tian,Yaxin Luo,Peng Xing,Jen-Yuan Huang,Huaxia Li,Gen Luo

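Task: Propose a dynamic pyramid network (DPN) that efficiently compresses visual features in multimodal large language models (MLLMs) while preserving semantic information.

Motivation: Existing visual feature compression methods destroy the visual semantics in MLLMs, especially on difficult samples.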

Details

Method: Uses a hierarchical structure that compresses visual features gradually, combined with Dynamic Pooling Experts (DPE) that dynamically select the optimal compression rate. Result: DPN saves up to 56% average FLOPs on LLaVA while improving performance by 0.74%. Conclusion: Through dynamic hierarchical compression and adaptive allocation of computation, DPN markedly improves the computational efficiency and performance of MLLMs. Abstract: Multimodal large language models (MLLMs) have demonstrated impressive performance in various vision-language (VL) tasks, but their expensive computations still limit their real-world application. To address this issue, recent efforts aim to compress the visual features to save the computational costs of MLLMs. However, direct visual compression methods, e.g. efficient projectors, inevitably destroy the visual semantics in MLLM, especially in difficult samples. To overcome this shortcoming, we propose a novel dynamic pyramid network (DPN) for efficient MLLMs. Specifically, DPN formulates MLLM as a hierarchical structure where visual features are gradually compressed with increasing depth. In this case, even with a high compression ratio, fine-grained visual information can still be perceived in shallow layers. To maximize the benefit of DPN, we further propose an innovative Dynamic Pooling Experts (DPE) that can dynamically choose the optimal visual compression rate according to input features. With this design, harder samples will be assigned larger computations, thus preserving the model performance. To validate our approach, we conduct extensive experiments on two popular MLLMs and ten benchmarks. Experimental results show that DPN can save up to 56% average FLOPs on LLaVA while further achieving +0.74% performance gains. Besides, the generalization ability of DPN is also validated on the existing high-resolution MLLM called LLaVA-HR. Our source codes are anonymously released at https://github.com/aihao2000/DPN-LLaVA.

Progressive Focused Transformer for Single Image Super-Resolution

Wei Long,Xingyu Zhou,Leheng Zhang,Shuhang Gu

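Task: Propose the Progressive Focused Transformer (PFT) to address the computational redundancy of Transformer models in image super-resolution.

Motivation: In image super-resolution, Transformer models compute similarities between large numbers of irrelevant features, which adds computational overhead and degrades performance.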

Details

Method: Links the isolated attention maps in the network through Progressive Focused Attention (PFA), focusing on the most important features and reducing computation on irrelevant ones. Result: Achieves state-of-the-art performance on multiple single image super-resolution benchmarks. Conclusion: Through PFA, PFT significantly reduces computational cost and improves performance, offering an effective answer to the computational redundancy of Transformer models. Abstract: Transformer-based methods have achieved remarkable results in image super-resolution tasks because they can capture non-local dependencies in low-quality input images. However, this feature-intensive modeling approach is computationally expensive because it calculates the similarities between numerous features that are irrelevant to the query features when obtaining attention weights. These unnecessary similarity calculations not only degrade the reconstruction performance but also introduce significant computational overhead. How to accurately identify the features that are important to the current query features and avoid similarity calculations between irrelevant features remains an urgent problem. To address this issue, we propose a novel and effective Progressive Focused Transformer (PFT) that links all isolated attention maps in the network through Progressive Focused Attention (PFA) to focus attention on the most important tokens. PFA not only enables the network to capture more critical similar features, but also significantly reduces the computational cost of the overall network by filtering out irrelevant features before calculating similarities. Extensive experiments demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance on various single image super-resolution benchmarks.

VideoGEM: Training-free Action Grounding in Videos

Felix Vogel,Walid Bousselham,Anna Kukleva,Nina Shvetsova,Hilde Kuehne

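Task: Propose VideoGEM, a training-free spatial action grounding method built on pretrained vision-language models.

Motivation: Vision-language foundation models perform well on zero-shot tasks but are hard to apply directly to grounding actions and events in videos, because actions lack a clear physical outline and are usually described by higher-level concepts.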

Details

Method: Uses self-attention with a layer weighting strategy that prioritizes high-level semantic layers and dynamically adjusts layer weights to each prompt; also introduces prompt decomposition, processing action, verb, and object prompts separately. Result: Across three backbones (CLIP, OpenCLIP, and ViCLIP) and four video grounding datasets, VideoGEM outperforms current trained state-of-the-art methods. Conclusion: VideoGEM is an efficient training-free method that markedly improves spatial grounding of actions in videos. Abstract: Vision-language foundation models have shown impressive capabilities across various zero-shot tasks, including training-free localization and grounding, primarily focusing on localizing objects in images. However, leveraging those capabilities to localize actions and events in videos is challenging, as actions have less physical outline and are usually described by higher-level concepts. In this work, we propose VideoGEM, the first training-free spatial action grounding method based on pretrained image- and video-language backbones. Namely, we adapt the self-self attention formulation of GEM to spatial activity grounding. We observe that high-level semantic concepts, such as actions, usually emerge in the higher layers of the image- and video-language models. We, therefore, propose a layer weighting in the self-attention path to prioritize higher layers. Additionally, we introduce a dynamic weighting method to automatically tune layer weights to capture each layer's relevance to a specific prompt. Finally, we introduce a prompt decomposition, processing action, verb, and object prompts separately, resulting in a better spatial localization of actions. We evaluate the proposed approach on three image- and video-language backbones, CLIP, OpenCLIP, and ViCLIP, and on four video grounding datasets, V-HICO, DALY, YouCook-Interactions, and GroundingYouTube, showing that the proposed training-free approach is able to outperform current trained state-of-the-art approaches for spatial video grounding.

Consistency Trajectory Matching for One-Step Generative Super-Resolution

Weiyi You,Mingyang Zhang,Leheng Zhang,Kexuan Shi,Xingyu Zhou,Shuhang Gu

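Task: Propose a distillation-free strategy (CTMSR) for generating super-resolution (SR) images in one step.

Motivation: Existing diffusion-based super-resolution methods perform well but have high inference overhead; distillation can speed them up, but it raises training cost and caps the student model's performance at that of the teacher.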

Details

Method: Establishes a deterministic mapping via a Probability Flow Ordinary Differential Equation (PF-ODE) trajectory, learns the one-step mapping directly with Consistency Training (CT), and designs a Distribution Trajectory Matching (DTM) loss to improve image realism. Result: Experiments show performance comparable or superior to existing methods on both synthetic and real datasets, with minimal inference latency. Conclusion: CTMSR is an efficient, high-performing super-resolution method that needs neither a pre-trained diffusion model nor distillation. Abstract: Current diffusion-based super-resolution (SR) approaches achieve commendable performance at the cost of high inference overhead. Therefore, distillation techniques are utilized to accelerate the multi-step teacher model into a one-step student model. Nevertheless, these methods significantly raise training costs and constrain the performance of the student model by the teacher model. To overcome these tough challenges, we propose Consistency Trajectory Matching for Super-Resolution (CTMSR), a distillation-free strategy that is able to generate photo-realistic SR results in one step. Concretely, we first formulate a Probability Flow Ordinary Differential Equation (PF-ODE) trajectory to establish a deterministic mapping from low-resolution (LR) images with noise to high-resolution (HR) images. Then we apply the Consistency Training (CT) strategy to directly learn the mapping in one step, eliminating the necessity of a pre-trained diffusion model. To further enhance the performance and better leverage the ground-truth during the training process, we aim to align the distribution of SR results more closely with that of the natural images. To this end, we propose to minimize the discrepancy between their respective PF-ODE trajectories from the LR image distribution by our meticulously designed Distribution Trajectory Matching (DTM) loss, resulting in improved realism of our recovered HR images. Comprehensive experimental results demonstrate that the proposed methods can attain comparable or even superior capabilities on both synthetic and real datasets while maintaining minimal inference latency.

SURGEON: Memory-Adaptive Fully Test-Time Adaptation via Dynamic Activation Sparsity

Ke Ma,Jiaqi Tang,Bin Guo,Fan Dang,Sicong Liu,Zhui Zhu,Lei Wu,Cheng Fang,Ying-Cong Chen,Zhiwen Yu,Yunhao Liu

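Task: Propose SURGEON, a method that reduces memory cost while preserving the accuracy of test-time adaptation.

Motivation: Because mobile terminals have limited resources, the memory cost of existing backpropagation-based test-time adaptation methods is too high for effective deployment.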

Details

Method: Adopts a dynamic activation sparsity strategy that prunes activations at layer-specific dynamic ratios, with the pruning ratios determined by two metrics, Gradient Importance and Layer Activation Memory. Result: Experiments show that SURGEON reduces memory usage while achieving higher accuracy, reaching SOTA performance across multiple datasets, architectures, and tasks. Conclusion: SURGEON is an efficient and general test-time adaptation method suited to resource-constrained mobile terminals. Abstract: Despite the growing integration of deep models into mobile terminals, the accuracy of these models declines significantly due to various deployment interferences. Test-time adaptation (TTA) has emerged to improve the performance of deep models by adapting them to unlabeled target data online. Yet, the significant memory cost, particularly in resource-constrained terminals, impedes the effective deployment of most backward-propagation-based TTA methods. To tackle memory constraints, we introduce SURGEON, a method that substantially reduces memory cost while preserving comparable accuracy improvements during fully test-time adaptation (FTTA) without relying on specific network architectures or modifications to the original training procedure. Specifically, we propose a novel dynamic activation sparsity strategy that directly prunes activations at layer-specific dynamic ratios during adaptation, allowing for flexible control of learning ability and memory cost in a data-sensitive manner. Here, two metrics, Gradient Importance and Layer Activation Memory, are considered to determine the layer-wise pruning ratios, reflecting accuracy contribution and memory efficiency, respectively. Experimentally, our method surpasses the baselines by not only reducing memory usage but also achieving superior accuracy, delivering SOTA performance across diverse datasets, architectures, and tasks.

Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding

Joao Pereira,Vasco Lopes,David Semedo,Joao Neves

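Task: Propose a non-linear spatiotemporal self-reflective sampling method (SelfReS) to improve large vision-language models (LVLMs) on long video understanding.

Motivation: Conventional linear frame sampling cannot handle the non-linear distribution of key events in video data, which leads to redundant information or missed key events in long videos.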

Details

Method: SelfReS uses the sparse attention maps of LVLMs to define reflection tokens and dynamically selects key video fragments, with no additional training or external modules. Result: Experiments show that SelfReS integrates seamlessly into existing LVLMs, improves accuracy on long-video tasks, and achieves up to 46% faster inference within the same GPU memory budget. Conclusion: SelfReS is an efficient method that needs no extra resources and markedly improves LVLM performance on long video understanding. Abstract: Large Vision-Language Models (LVLMs) demonstrate remarkable performance in short-video tasks such as video question answering, but struggle in long-video understanding. The linear frame sampling strategy, conventionally used by LVLMs, fails to account for the non-linear distribution of key events in video data, often introducing redundant or irrelevant information in longer contexts while risking the omission of critical events in shorter ones. To address this, we propose SelfReS, a non-linear spatiotemporal self-reflective sampling method that dynamically selects key video fragments based on user prompts. Unlike prior approaches, SelfReS leverages the inherently sparse attention maps of LVLMs to define reflection tokens, enabling relevance-aware token selection without requiring additional training or external modules. Experiments demonstrate that SelfReS can be seamlessly integrated into strong base LVLMs, improving long-video task accuracy and achieving up to 46% faster inference speed within the same GPU memory budget.
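
To make the contrast between linear and relevance-aware sampling concrete, here is a minimal sketch. It is not the SelfReS mechanism itself: the per-frame relevance scores are hypothetical stand-ins for whatever signal is derived from the model's attention maps, and both function names are invented for illustration.

```python
def uniform_sample(num_frames, k):
    """Linear sampling: pick k evenly spaced frame indices."""
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)
    step = num_frames / k
    return [int(i * step) for i in range(k)]


def top_k_by_relevance(scores, k):
    """Relevance-aware sampling: keep the k highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical relevance scores for a 10-frame clip whose key event sits in frames 6 to 8.
relevance = [0.01, 0.02, 0.01, 0.03, 0.02, 0.05, 0.90, 0.85, 0.70, 0.04]

print(uniform_sample(len(relevance), 3))   # evenly spaced, ignores where the event is
print(top_k_by_relevance(relevance, 3))    # concentrates on the high-relevance frames
```

With this toy input, uniform sampling spreads its budget over the whole clip, while the relevance-based pick spends all three frames on the event.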

Pluggable Style Representation Learning for Multi-Style Transfer

Hongda Liu,Longguang Wang,Weijun Guan,Ye Zhang,Yulan Guo

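Task: Develop an efficient style transfer framework that accommodates diverse image styles by decoupling style modeling from style transferring.

Motivation: Existing methods are hard to deploy on resource-limited devices because of large model sizes or high computational cost, so a more efficient approach is needed.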

Details

Method: Proposes a style representation learning scheme and a style-aware multi-style transfer network (SaMST) that adapts to diverse styles through pluggable style representations. Result: Experiments show that the style representation extracts accurate style information and that the method achieves state-of-the-art accuracy and efficiency. Conclusion: The framework accommodates diverse image styles while remaining efficient. Abstract: Due to the high diversity of image styles, the scalability to various styles plays a critical role in real-world applications. To accommodate a large number of styles, previous multi-style transfer approaches rely on enlarging the model size while arbitrary-style transfer methods utilize heavy backbones. However, the additional computational cost introduced by more model parameters hinders these methods from being deployed on resource-limited devices. To address this challenge, in this paper, we develop a style transfer framework by decoupling the style modeling and transferring. Specifically, for style modeling, we propose a style representation learning scheme to encode the style information into a compact representation. Then, for style transferring, we develop a style-aware multi-style transfer network (SaMST) to adapt to diverse styles using pluggable style representations. In this way, our framework is able to accommodate diverse image styles in the learned style representations without introducing additional overhead during inference, thereby maintaining efficiency. Experiments show that our style representation can extract accurate style information. Moreover, qualitative and quantitative results demonstrate that our method achieves state-of-the-art performance in terms of both accuracy and efficiency. The codes are available at https://github.com/The-Learning-And-Vision-Atelier-LAVA/SaMST.

RSRWKV: A Linear-Complexity 2D Attention Mechanism for Efficient Remote Sensing Vision Task

Chunshan Li,Rong Wang,Xiaofei Yang,Dianhui Chu

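Task: Propose RSRWKV, a new model for global context modeling in high-resolution remote sensing analysis.

Motivation: In high-resolution remote sensing analysis, existing methods such as CNNs and ViTs are limited either by fixed receptive fields or by high computational complexity, and the RWKV model underperforms on vision tasks.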

Details

Method: Proposes the RSRWKV model, which uses a 2D-WKV scanning mechanism together with an MVC-Shift module and an ECA module to achieve global context modeling at linear complexity. Result: On the NWPU RESISC45, VHR-10.v2, and GLH-Water datasets, RSRWKV outperforms CNN and Transformer baselines in classification, detection, and segmentation. Conclusion: RSRWKV offers a scalable solution for high-resolution remote sensing analysis and addresses the limitations of existing methods. Abstract: High-resolution remote sensing analysis faces challenges in global context modeling due to scene complexity and scale diversity. While CNNs excel at local feature extraction via parameter sharing, their fixed receptive fields fundamentally restrict long-range dependency modeling. Vision Transformers (ViTs) effectively capture global semantic relationships through self-attention mechanisms but suffer from quadratic computational complexity relative to image resolution, creating critical efficiency bottlenecks for high-resolution imagery. The RWKV model's linear-complexity sequence modeling achieves breakthroughs in NLP but exhibits anisotropic limitations in vision tasks due to its 1D scanning mechanism. To address these challenges, we propose RSRWKV, featuring a novel 2D-WKV scanning mechanism that bridges sequential processing and 2D spatial reasoning while maintaining linear complexity. This enables isotropic context aggregation across multiple directions. The MVC-Shift module enhances multi-scale receptive field coverage, while the ECA module strengthens cross-channel feature interaction and semantic saliency modeling. Experimental results demonstrate RSRWKV's superior performance over CNN and Transformer baselines in classification, detection, and segmentation tasks on NWPU RESISC45, VHR-10.v2, and GLH-Water datasets, offering a scalable solution for high-resolution remote sensing analysis.

ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On

Ji Woo Hong,Tri Ton,Trung X. Pham,Gwanhyeong Koo,Sunjae Yoon,Chang D. Yoo

Task: This paper introduces ITA-MDT, a framework for Image-Based Virtual Try-On (IVTON) that improves garment context handling and fine-grained details.

Motivation: To overcome limitations of previous approaches by leveraging a Masked Diffusion Transformer (MDT) for better performance and reduced computational overhead.

Details

Method: ITA-MDT uses a lightweight transformer-based denoising diffusion model with a mask latent modeling scheme, along with the Image-Timestep Adaptive Feature Aggregator (ITAFA) and Salient Region Extractor (SRE) module. Result: ITA-MDT achieves competitive results with reduced computational overhead, reaching state-of-the-art performance in several metrics. Conclusion: The proposed ITA-MDT framework effectively balances global and local garment details while optimizing computational resources, making it a strong solution for IVTON tasks. Abstract: This paper introduces ITA-MDT, the Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On (IVTON), designed to overcome the limitations of previous approaches by leveraging the Masked Diffusion Transformer (MDT) for improved handling of both global garment context and fine-grained details. The IVTON task involves seamlessly superimposing a garment from one image onto a person in another, creating a realistic depiction of the person wearing the specified garment. Unlike conventional diffusion-based virtual try-on models that depend on large pre-trained U-Net architectures, ITA-MDT leverages a lightweight, scalable transformer-based denoising diffusion model with a mask latent modeling scheme, achieving competitive results while reducing computational overhead. A key component of ITA-MDT is the Image-Timestep Adaptive Feature Aggregator (ITAFA), a dynamic feature aggregator that combines all of the features from the image encoder into a unified feature of the same size, guided by diffusion timestep and garment image complexity. This enables adaptive weighting of features, allowing the model to emphasize either global information or fine-grained details based on the requirements of the denoising stage. Additionally, the Salient Region Extractor (SRE) module is presented to identify complex regions of the garment and provide high-resolution local information to the denoising model as an additional condition alongside the global information of the full garment image. This targeted conditioning strategy enhances the preservation of fine details in highly salient garment regions, optimizing computational resources by avoiding unnecessary processing of the entire garment image. Comparative evaluations confirm that ITA-MDT improves efficiency while maintaining strong performance, reaching state-of-the-art results in several metrics.

Cherry Yield Forecast: Harvest Prediction for Individual Sweet Cherry Trees

Andreas Gilson,Peter Pietrzyk,Chiara Paglia,Annika Killer,Fabian Keil,Lukas Meyer,Dominikus Kittemann,Patrick Noack,Oliver Scholz

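Task: Investigate how well object counts taken at different stages of the fruit development cycle of sweet cherry trees predict yield.

Motivation: Address the unreliability of early-season yield predictions in fruit farming.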

Details

Method: Follows three sweet cherry trees through 2023 to collect accurate ground truth data from dormancy to harvest, then evaluates and visualizes that data. Result: All investigated fruit states are suitable for yield prediction based on linear regression, but automated counting from image data proved hard to achieve. Conclusion: Manual counting enables accurate yield prediction for sweet cherry trees, while automated feature extraction from image data remains an open problem. Abstract: This paper is part of a publication series from the For5G project that has the goal of creating digital twins of sweet cherry trees. At the beginning a brief overview of the previous work in this project is provided. Afterwards the focus shifts to a crucial problem in the fruit farming domain: the difficulty of making reliable yield predictions early in the season. Following three Satin sweet cherry trees throughout the year 2023 enabled the collection of accurate ground truth data about the development of cherries from dormancy until harvest. The methodology used to collect this data is presented, along with its evaluation and visualization. The predictive power of counting objects at all relevant vegetative stages of the fruit development cycle in cherry trees with regard to yield predictions is investigated. It is found that all investigated fruit states are suitable for yield predictions based on linear regression. Conceptually, there is a trade-off between earliness and external events with the potential to invalidate the prediction. Considering this, two optimal timepoints are suggested: the opening cluster stage before the start of flowering and the early fruit stage right after the second fruit drop. However, both timepoints are challenging to solve with automated procedures based on image data. Counting developing cherries based on images is exceptionally difficult due to the small fruit size and their tendency to be occluded by leaves. It was not possible to obtain satisfying results relying on a state-of-the-art fruit-counting method. Counting the elements within a bursting bud is also challenging, even when using high resolution cameras. It is concluded that accurate yield prediction for sweet cherry trees is possible when objects are manually counted and that automated feature extraction with similar accuracy remains an open problem yet to be solved.
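
The abstract reports that counts at each fruit stage support yield prediction via linear regression. A minimal sketch of that idea follows, using ordinary least squares with a single predictor. All numbers are invented for illustration and are not data from the paper; the variable names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example: objects counted per branch at an early stage vs. harvested cherries.
early_counts = [120, 200, 310, 150]
harvest_counts = [60, 95, 160, 70]

slope, intercept = fit_line(early_counts, harvest_counts)
new_count = 250
predicted = slope * new_count + intercept
print(f"predicted harvest for {new_count} early objects: {predicted:.1f}")
```

The fitted slope can be read as the fraction of early-stage objects that end up as harvested fruit, which is where later events such as fruit drop would show up as prediction error.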

Evaluating Facial Expression Recognition Datasets for Deep Learning: A Benchmark Study with Novel Similarity Metrics

F. Xavier Gaya-Morey,Cristina Manresa-Yee,Célia Martinie,Jose M. Buades-Rubio

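Task: Investigate the key characteristics of widely used facial expression recognition (FER) datasets and their suitability for training deep learning models.

Motivation: In affective computing, FER is essential for interpreting human emotions, but its performance depends heavily on dataset diversity and quality.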

Details

Method: Compiles and analyzes 24 FER datasets, including ones targeting specific age groups, processes them through a comprehensive normalization pipeline, and automatically annotates age and gender to assess demographic properties. Introduces three new metrics (Local, Global, and Paired Similarity) that quantify dataset difficulty, generalization capability, and cross-dataset transferability. Result: Experiments show that large-scale, automatically collected datasets (e.g., AffectNet, FER2013) generalize better despite labeling noise and demographic biases, whereas controlled datasets offer higher annotation quality but limited diversity. Conclusion: The findings give practical guidance for dataset selection and design, advancing more robust, fair, and effective FER systems. Abstract: This study investigates the key characteristics and suitability of widely used Facial Expression Recognition (FER) datasets for training deep learning models. In the field of affective computing, FER is essential for interpreting human emotions, yet the performance of FER systems is highly contingent on the quality and diversity of the underlying datasets. To address this issue, we compiled and analyzed 24 FER datasets, including those targeting specific age groups such as children, adults, and the elderly, and processed them through a comprehensive normalization pipeline. In addition, we enriched the datasets with automatic annotations for age and gender, enabling a more nuanced evaluation of their demographic properties. To further assess dataset efficacy, we introduce three novel metrics, Local, Global, and Paired Similarity, which quantitatively measure dataset difficulty, generalization capability, and cross-dataset transferability. Benchmark experiments using state-of-the-art neural networks reveal that large-scale, automatically collected datasets (e.g., AffectNet, FER2013) tend to generalize better, despite issues with labeling noise and demographic biases, whereas controlled datasets offer higher annotation quality but limited variability. Our findings provide actionable recommendations for dataset selection and design, advancing the development of more robust, fair, and effective FER systems.

Latent Beam Diffusion Models for Decoding Image Sequences

Guilherme Fernandes,Vasco Ramos,Regev Cohen,Idan Szpektor,João Magalhães

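Task: Propose a beam-search-based latent space exploration method for generating visually consistent image sequences.

Motivation: Existing methods generate each image independently, so image sequences lack coherence, especially in non-linear storytelling.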

Details

Method: Introduces a dynamic beam search strategy and a cross-attention mechanism to optimize the generation of latent representation sequences. Result: Human evaluations show that the method outperforms baselines in coherence, visual continuity, and textual alignment. Conclusion: By combining search optimization with latent space refinement, the work sets a new standard for structured image sequence generation. Abstract: While diffusion models excel at generating high-quality images from text prompts, they struggle with visual consistency in image sequences. Existing methods generate each image independently, leading to disjointed narratives - a challenge further exacerbated in non-linear storytelling, where scenes must connect beyond adjacent frames. We introduce a novel beam search strategy for latent space exploration, enabling conditional generation of full image sequences with beam search decoding. Unlike prior approaches that use fixed latent priors, our method dynamically searches for an optimal sequence of latent representations, ensuring coherent visual transitions. To address beam search's quadratic complexity, we integrate a cross-attention mechanism that efficiently scores search paths and enables pruning, prioritizing alignment with both textual prompts and visual context. Human evaluations confirm that our approach outperforms baseline methods, producing full sequences with superior coherence, visual continuity, and textual alignment. By bridging advances in search optimization and latent space refinement, this work sets a new standard for structured image sequence generation.
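
The search-and-prune loop described above can be sketched with a generic beam search. This is only an illustration of the decoding pattern: candidates are plain integers rather than diffusion latents, and `toy_score` is a hypothetical stand-in for the paper's cross-attention scorer.

```python
def beam_search(candidates_per_step, score_fn, beam_width):
    """Keep the beam_width best partial sequences after every step."""
    if beam_width < 1:
        raise ValueError("beam_width must be at least 1")
    beams = [([], 0.0)]  # (sequence so far, cumulative score)
    for candidates in candidates_per_step:
        if not candidates:
            raise ValueError("each step needs at least one candidate")
        expanded = []
        for seq, total in beams:
            for cand in candidates:
                expanded.append((seq + [cand], total + score_fn(seq, cand)))
        # Pruning: drop everything outside the top beam_width paths.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0]


def toy_score(seq, cand):
    """Hypothetical scorer: reward large first items, then penalize jumps between neighbors."""
    if not seq:
        return float(cand)
    return -float(abs(cand - seq[-1]))


best_seq, best_score = beam_search([[1, 2], [1, 2], [1, 2]], toy_score, beam_width=2)
print(best_seq, best_score)
```

Because the toy scorer penalizes changes between neighboring items, the search settles on a sequence that stays consistent from step to step, which mirrors the coherence objective in spirit only.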

Siformer: Feature-isolated Transformer for Efficient Skeleton-based Sign Language Recognition

Muxin Pu,Mei Kuan Lim,Chun Yong Chong

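Task: Propose an improved skeleton-based sign language recognition method that addresses the shortcomings of existing methods in realistic hand poses, adaptability to missing data, and handling sign glosses of varying complexity.

Motivation: Existing skeleton-based sign language recognition methods have three main problems: they neglect the importance of realistic hand poses, assume complete data, and ignore differences in complexity between sign glosses.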

Details

Method: 提出了一种运动学手部姿势矫正方法、特征隔离机制和输入自适应推理方法，分别用于增强手部姿势的真实性、处理缺失数据以及优化计算效率和准确性。 Result: 在WLASL100和LSA64数据集上取得了新的SOTA性能，分别达到86.50%和99.84%的top-1准确率。 Conclusion: 所提方法显著提升了手语识别的性能，尤其在处理真实手部姿势和适应不同复杂度手语词汇方面表现突出。 Abstract: Sign language recognition (SLR) refers to interpreting sign language glosses from given videos automatically. This research area presents a complex challenge in computer vision because of the rapid and intricate movements inherent in sign languages, which encompass hand gestures, body postures, and even facial expressions. Recently, skeleton-based action recognition has attracted increasing attention due to its ability to handle variations in subjects and backgrounds independently. However, current skeleton-based SLR methods exhibit three limitations: 1) they often neglect the importance of realistic hand poses, where most studies train SLR models on non-realistic skeletal representations; 2) they tend to assume complete data availability in both training or inference phases, and capture intricate relationships among different body parts collectively; 3) these methods treat all sign glosses uniformly, failing to account for differences in complexity levels regarding skeletal representations. To enhance the realism of hand skeletal representations, we present a kinematic hand pose rectification method for enforcing constraints. Mitigating the impact of missing data, we propose a feature-isolated mechanism to focus on capturing local spatial-temporal context. This method captures the context concurrently and independently from individual features, thus enhancing the robustness of the SLR model. Additionally, to adapt to varying complexity levels of sign glosses, we develop an input-adaptive inference approach to optimise computational efficiency and accuracy. Experimental results demonstrate the effectiveness of our approach, as evidenced by achieving a new state-of-the-art (SOTA) performance on WLASL100 and LSA64. 
For WLASL100, we achieve a top-1 accuracy of 86.50%, marking a relative improvement of 2.39% over the previous SOTA. For LSA64, we achieve a top-1 accuracy of 99.84%.
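
As a quick check of the WLASL100 figures, and assuming "relative improvement" means (new - old) / old, the previous SOTA accuracy can be back-computed. The 84.48% value below is an inference from the abstract, not a number the abstract states:

```latex
\[
\mathrm{acc}_{\mathrm{prev}} \approx \frac{86.50\%}{1 + 0.0239} \approx 84.48\%,
\qquad
\frac{86.50 - 84.48}{84.48} \approx 2.39\%.
\]
```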

From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment

Yucheng Suo,Fan Ma,Linchao Zhu,Tianyi Wang,Fengyun Rao,Yi Yang

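Task: Address the omission of key-frame information in long video understanding by multi-modal large language models, using visual context sampling and a scoring mechanism.

Motivation: Multi-modal large language models can process only a limited number of frames in long video understanding and may miss crucial visual information, so improved methods are needed to raise performance.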

Details

Method: Proposes a bin-wise sampling strategy to generate diverse answers, and selects the final prediction by linearly combining a frequency score, a marginal confidence score, and a reasoning score. Result: Experiments show that the method significantly improves the performance of three multi-modal large language models on seven datasets. Conclusion: Visual context sampling and the scoring mechanism effectively improve the accuracy and robustness of long video understanding. Abstract: Multi-modal large language models (MLLMs) show remarkable ability in video understanding. Nevertheless, understanding long videos remains challenging as the models can only process a finite number of frames in a single inference, potentially omitting crucial visual information. To address the challenge, we propose generating multiple predictions through visual context sampling, followed by a scoring mechanism to select the final prediction. Specifically, we devise a bin-wise sampling strategy that enables MLLMs to generate diverse answers based on various combinations of keyframes, thereby enriching the visual context. To determine the final prediction from the sampled answers, we employ a self-reward by linearly combining three scores: (1) a frequency score indicating the prevalence of each option, (2) a marginal confidence score reflecting the inter-intra sample certainty of MLLM predictions, and (3) a reasoning score for different question types, including clue-guided answering for global questions and temporal self-refocusing for local questions. The frequency score ensures robustness through majority correctness, the confidence-aligned score reflects prediction certainty, and the typed-reasoning score addresses cases with sparse key visual information using tailored strategies. Experiments show that this approach covers the correct answer for a high percentage of long video questions, and results on seven datasets show that our method improves the performance of three MLLMs.
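
The sampling and selection steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary keys, the equal default weights, and the use of a per-option mean as the confidence and reasoning terms are all assumptions, since the abstract does not give the exact score definitions.

```python
import random
from collections import Counter


def binwise_sample(num_frames, num_bins, rng):
    """Draw one frame index from each of num_bins contiguous bins.

    Assumes num_frames >= num_bins so that every bin is non-empty.
    """
    edges = [i * num_frames // num_bins for i in range(num_bins + 1)]
    return [rng.randrange(edges[i], edges[i + 1]) for i in range(num_bins)]


def select_answer(samples, w_freq=1.0, w_conf=1.0, w_reason=1.0):
    """Pick the option with the highest linearly combined score.

    samples: list of dicts with hypothetical keys
      'option'     - answer option predicted from one sampled frame set
      'confidence' - confidence in [0, 1] for that prediction
      'reasoning'  - reasoning score in [0, 1] for that prediction
    """
    counts = Counter(s["option"] for s in samples)
    scores = {}
    for option, n in counts.items():
        group = [s for s in samples if s["option"] == option]
        freq = n / len(samples)                         # prevalence of the option
        conf = sum(s["confidence"] for s in group) / n  # stand-in confidence term
        reason = sum(s["reasoning"] for s in group) / n # stand-in reasoning term
        scores[option] = w_freq * freq + w_conf * conf + w_reason * reason
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    rng = random.Random(0)
    print(binwise_sample(num_frames=100, num_bins=4, rng=rng))
    demo = [
        {"option": "A", "confidence": 0.6, "reasoning": 0.5},
        {"option": "A", "confidence": 0.8, "reasoning": 0.5},
        {"option": "B", "confidence": 0.9, "reasoning": 0.4},
    ]
    print(select_answer(demo))
```

In this toy input, option B has the single most confident sample, but option A wins once frequency is added to the sum, which is the kind of trade-off a linear combination is meant to express.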

Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability

Yingdong Shi,Changming Li,Yifan Wang,Yongxiang Zhao,Anqi Pang,Sibei Yang,Jingyi Yu,Kan Ren

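Task: Investigate the decision-making features within diffusion models' internal mechanisms that lead to bias, and propose a method that directly manipulates these features to reduce bias in generated content.

Motivation: Diffusion models excel at generating diverse content but often perpetuate social biases, which may exacerbate real-world inequalities and stereotypes. Existing research mostly focuses on guiding content generation and overlooks the internal mechanisms that cause bias.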

Details

Method: Identifies bias features inside diffusion models and directly manipulates them to precisely adjust the bias level of generated content. Result: Experiments show that the method effectively manages the generation distribution while preserving image quality, and reveals intrinsic features that control fine-grained aspects of generation. Conclusion: Directly manipulating internal bias features enables precise control over the bias level of diffusion model outputs and offers a new direction for mechanistic interpretability research on diffusion models. Abstract: Diffusion models have demonstrated impressive capabilities in synthesizing diverse content. However, despite their high-quality outputs, these models often perpetuate social biases, including those related to gender and race. These biases can potentially contribute to harmful real-world consequences, reinforcing stereotypes and exacerbating inequalities in various social contexts. While existing research on diffusion bias mitigation has predominantly focused on guiding content generation, it often neglects the intrinsic mechanisms within diffusion models that causally drive biased outputs. In this paper, we investigate the internal processes of diffusion models, identifying specific decision-making mechanisms, termed bias features, embedded within the model architecture. By directly manipulating these features, our method precisely isolates and adjusts the elements responsible for bias generation, permitting granular control over the bias levels in the generated content. Through experiments on both unconditional and conditional diffusion models across various social bias attributes, we demonstrate our method's efficacy in managing generation distribution while preserving image quality. We also dissect the discovered model mechanism, revealing different intrinsic features controlling fine-grained aspects of generation, which encourages further research on mechanistic interpretability of diffusion models.

Contrastive Learning Guided Latent Diffusion Model for Image-to-Image Translation

Qi Si,Bo Wang,Zhao Zhang

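Task: Propose a zero-shot diffusion-based method (pix2pix-zeroCon) to improve text-guided image translation.

Motivation: Existing diffusion models fall short in formulating text prompts and preserving reference image content, which harms the quality and consistency of generated images.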

Details

Method: Automatically determines the editing direction in the text embedding space and combines a patch-wise contrastive loss with a cross-attention guiding loss, without additional training. Result: Experiments show that the method outperforms existing models on image-to-image translation, with improved fidelity and controllability. Conclusion: pix2pix-zeroCon is an efficient, training-free approach that significantly improves text-guided image translation. Abstract: The diffusion model has demonstrated superior performance in synthesizing diverse and high-quality images for text-guided image translation. However, there remains room for improvement in both the formulation of text prompts and the preservation of reference image content. First, variations in target text prompts can significantly influence the quality of the generated images, and it is often challenging for users to craft an optimal prompt that fully captures the content of the input image. Second, while existing models can introduce desired modifications to specific regions of the reference image, they frequently induce unintended alterations in areas that should remain unchanged. To address these challenges, we propose pix2pix-zeroCon, a zero-shot diffusion-based method that eliminates the need for additional training by leveraging patch-wise contrastive loss. Specifically, we automatically determine the editing direction in the text embedding space based on the reference image and target prompts. Furthermore, to ensure precise content and structural preservation in the edited image, we introduce cross-attention guiding loss and patch-wise contrastive loss between the generated and original image embeddings within a pre-trained diffusion model. Notably, our approach requires no additional training and operates directly on a pre-trained text-to-image diffusion model. Extensive experiments demonstrate that our method surpasses existing models in image-to-image translation, achieving enhanced fidelity and controllability.
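
The abstract names a patch-wise contrastive loss but does not give its formula. A common form is the InfoNCE loss over patch embeddings, where patch i of the generated image is pulled toward patch i of the reference and pushed away from the other reference patches. The sketch below assumes that form; the function names, the plain-list embeddings, and the temperature of 0.07 are illustrative choices, not details from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [a / norm for a in v]


def patch_nce_loss(gen_patches, ref_patches, temperature=0.07):
    """InfoNCE over patches (assumed form of the patch-wise contrastive loss).

    gen_patches[i] and ref_patches[i] are embeddings of the same spatial
    location; the other reference patches act as negatives.
    """
    gen = [normalize(p) for p in gen_patches]
    ref = [normalize(p) for p in ref_patches]
    losses = []
    for i, g in enumerate(gen):
        logits = [dot(g, r) / temperature for r in ref]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum - logits[i])  # cross-entropy with target i
    return sum(losses) / len(losses)


if __name__ == "__main__":
    reference = [[1.0, 0.0], [0.0, 1.0]]
    print(patch_nce_loss(reference, reference))                 # structure preserved
    print(patch_nce_loss(reference, [[0.0, 1.0], [1.0, 0.0]]))  # patches swapped
```

The loss is near zero when each generated patch matches its own location in the reference and large when content has moved, which is why such a term can discourage unintended changes outside the edited region.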

VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

Jiale Cheng,Ruiliang Lyu,Xiaotao Gu,Xiao Liu,Jiazheng Xu,Yida Lu,Jiayan Teng,Zhuoyi Yang,Yuxiao Dong,Jie Tang,Hongning Wang,Minlie Huang

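Task: Optimize prompts for text-to-video generation to improve video quality and safety.

Motivation: Real-world user inputs are often concise, vague, or poorly structured, unlike the detailed descriptions in the training data, which leads to low-quality generated videos or safety risks.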

Details

Method: Proposes the VPO framework, built on three principles (harmlessness, accuracy, and helpfulness), with a two-stage optimization approach: constructing and refining a supervised fine-tuning (SFT) dataset, then performing preference learning with text-level and video-level feedback. Result: VPO significantly improves the safety, alignment, and quality of generated videos and generalizes well across video generation models. Conclusion: VPO is an effective prompt optimization framework that improves video generation models and can be combined with RLHF methods. Abstract: Video generation models have achieved remarkable progress in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods. Moreover, VPO shows strong generalization across video generation models. Furthermore, we demonstrate that VPO can outperform and be combined with RLHF methods on video generation models, underscoring the effectiveness of VPO in aligning video generation models. Our code and data are publicly available at https://github.com/thu-coai/VPO.

Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models

Fanhu Zeng,Zhen Cheng,Fei Zhu,Xu-Yao Zhang

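Task: Build an efficient, general-purpose misclassification detection framework using vision-language models (VLMs).

Motivation: Modern neural networks are often overconfident on misclassified predictions, so confidence estimation is needed to detect errors. Existing methods require training from scratch, and efficient misclassification detection methods for large-scale data are lacking.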

Details

Method: Proposes the FSMisD framework, which uses few-shot prompt learning to avoid training from scratch and combines adaptive pseudo sample generation with a negative loss to reduce overconfidence. Result: Experiments validate the method's effectiveness, efficiency, and generalizability, with significant and consistent improvements across various datasets. Conclusion: FSMisD offers an efficient, general-purpose misclassification detection solution for large-scale and ever-changing datasets. Abstract: Reliable prediction by classifiers is crucial for their deployment in high-security and dynamically changing situations. However, modern neural networks often exhibit overconfidence for misclassified predictions, highlighting the need for confidence estimation to detect errors. Despite the achievements obtained by existing methods on small-scale datasets, they all require training from scratch and there are no efficient and effective misclassification detection (MisD) methods, hindering practical application to large-scale and ever-changing datasets. In this paper, we pave the way to exploit vision-language models (VLMs), leveraging text information to establish an efficient and general-purpose misclassification detection framework. By harnessing the power of VLMs, we construct FSMisD, a Few-Shot prompt learning framework for MisD that refrains from training from scratch and therefore improves tuning efficiency. To enhance misclassification detection ability, we use adaptive pseudo sample generation and a novel negative loss to mitigate the issue of overconfidence by pushing category prompts away from pseudo features. We conduct comprehensive experiments with prompt learning methods and validate the generalization ability across various datasets with domain shift. Significant and consistent improvement demonstrates the effectiveness, efficiency and generalizability of our approach.

MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning

Yiwei Ma,Guohai Xu,Xiaoshuai Sun,Jiayi Ji,Jie Lou,Debing Zhang,Rongrong Ji

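Task: Develop an automated approach (MLLM-Selector) for selecting high-quality visual instruction tuning data.

Motivation: The attributes of high-quality visual instruction tuning data are poorly understood, and frameworks for selecting such data automatically are lacking.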

Details

Method: Fine-tunes a pretrained model on a randomly sampled subset to create a seed model, computes necessity scores for the samples in the data pool, and combines them with diversity for data selection. Result: Under identical experimental conditions, MLLM-Selector surpasses LLaVA-1.5 on some benchmarks with less than 1% of the data, and outperforms it on all benchmarks with less than 50% of the data. Conclusion: Data selection that combines necessity and diversity (MLLM-Selector) significantly improves visual instruction tuning. Abstract: Visual instruction tuning (VIT) has emerged as a crucial technique for enabling multi-modal large language models (MLLMs) to follow user instructions adeptly. Yet, a significant gap persists in understanding the attributes of high-quality instruction tuning data and frameworks for its automated selection. To address this, we introduce MLLM-Selector, an automated approach that identifies valuable data for VIT by weighing necessity and diversity. Our process starts by randomly sampling a subset from the VIT data pool to fine-tune a pretrained model, thus creating a seed model with an initial ability to follow instructions. Then, leveraging the seed model, we calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance. Our findings underscore the importance of mixing necessity and diversity in data choice, leading to the creation of MLLM-Selector, our methodology that fuses necessity scoring with strategic sampling for superior data refinement. Empirical results indicate that within identical experimental conditions, MLLM-Selector surpasses LLaVA-1.5 in some benchmarks with less than 1% of the data and consistently exceeds its performance across all validated benchmarks when using less than 50%.

Vision-Amplified Semantic Entropy for Hallucination Detection in Medical Visual Question Answering

Zehui Liao,Shishuai Hu,Ke Zou,Huazhu Fu,Liangli Zhen,Yong Xia

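Task: Propose a method called VASE to improve hallucination detection for multimodal large language models (MLLMs) in medical visual question answering (VQA).

Motivation: Medical MLLMs used in clinical decision-making are prone to hallucinations (incorrect responses that contradict the input image), and existing methods such as semantic entropy face a dilemma in choosing the visual perturbation strength when adapted to medical MLLMs.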

Details

Method: Proposes VASE, which combines weak image transformations with amplification of the visual input's influence, and estimates entropy by contrasting semantic predictive distributions. Result: Experiments on two medical open-ended VQA datasets show that VASE outperforms existing methods at hallucination detection. Conclusion: By balancing visual perturbation strength against clinical validity, VASE significantly improves hallucination detection for medical MLLMs. Abstract: Multimodal large language models (MLLMs) have demonstrated significant potential in medical Visual Question Answering (VQA). Yet, they remain prone to hallucinations, that is, incorrect responses that contradict input images, posing substantial risks in clinical decision-making. Detecting these hallucinations is essential for establishing trust in MLLMs among clinicians and patients, thereby enabling their real-world adoption. Current hallucination detection methods, especially semantic entropy (SE), have demonstrated promising hallucination detection capacity for LLMs. However, adapting SE to medical MLLMs by incorporating visual perturbations presents a dilemma. Weak perturbations preserve image content and ensure clinical validity, but may be overlooked by medical MLLMs, which tend to over-rely on language priors. In contrast, strong perturbations can distort essential diagnostic features, compromising clinical interpretation. To address this issue, we propose Vision Amplified Semantic Entropy (VASE), which incorporates weak image transformations and amplifies the impact of visual input, to improve hallucination detection in medical VQA. We first estimate the semantic predictive distribution under weak visual transformations to preserve clinical validity, and then amplify visual influence by contrasting this distribution with that derived from a distorted image. The entropy of the resulting distribution is estimated as VASE. Experiments on two medical open-ended VQA datasets demonstrate that VASE consistently outperforms existing hallucination detection methods.
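
The two-step procedure described above (estimate a distribution over semantic clusters under weak transformations, contrast it with the distribution from a distorted image, then take the entropy) can be sketched as follows. The abstract does not specify the contrast formula, so the log-linear contrast used here, borrowed from contrastive decoding, is an assumption, as are the function names and the alpha and eps parameters.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def amplify(p_weak, p_distorted, alpha=1.0, eps=1e-6):
    """Contrast two distributions over the same semantic clusters.

    Assumed form: log p_amp = (1 + alpha) * log p_weak - alpha * log p_distorted,
    followed by renormalization. alpha = 0 recovers p_weak (up to eps).
    """
    logits = [
        (1 + alpha) * math.log(w + eps) - alpha * math.log(d + eps)
        for w, d in zip(p_weak, p_distorted)
    ]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def vase_like_score(p_weak, p_distorted, alpha=1.0):
    """Entropy of the amplified distribution; higher suggests more uncertainty."""
    return entropy(amplify(p_weak, p_distorted, alpha))


if __name__ == "__main__":
    # Hypothetical cluster probabilities from sampled answers.
    p_weak = [0.7, 0.2, 0.1]       # under weak image transformations
    p_distorted = [0.4, 0.3, 0.3]  # under a strongly distorted image
    print(entropy(p_weak), vase_like_score(p_weak, p_distorted))
```

Under this assumed form, clusters whose probability depends on the image gain mass and clusters driven mostly by language priors lose it, so an answer that is well grounded in the image yields a lower score than plain semantic entropy would.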

Small Object Detection: A Comprehensive Survey on Challenges, Techniques and Real-World Applications

Mahya Nikouei,Bita Baroutian,Shahabedin Nabavi,Fateme Taraghi,Atefe Aghaei,Ayoob Sajedi,Mohsen Ebrahimi Moghaddam

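Task: Survey recent advances in small object detection (SOD) published in Q1 journals during 2024-2025.

Motivation: Small object detection is critical in computer vision, but limited spatial and contextual information, low resolution, occlusion, and related issues make it difficult.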

Details

Method: Analyzes deep learning approaches including multi-scale feature extraction, super-resolution techniques, attention mechanisms, and transformer-based architectures, as well as data augmentation, synthetic data generation, and transfer learning. Result: Summarizes SOD challenges, state-of-the-art techniques, datasets, evaluation metrics, and real-world applications, and identifies emerging trends such as lightweight networks and knowledge distillation. Conclusion: Identifies open research challenges and future directions, such as robust domain adaptation techniques, better feature fusion strategies, and real-time performance optimization. Abstract: Small object detection (SOD) is a critical yet challenging task in computer vision, with applications spanning surveillance, autonomous systems, medical imaging, and remote sensing. Unlike larger objects, small objects contain limited spatial and contextual information, making accurate detection difficult. Challenges such as low resolution, occlusion, background interference, and class imbalance further complicate the problem. This survey provides a comprehensive review of recent advancements in SOD using deep learning, focusing on articles published in Q1 journals during 2024-2025. We analyzed challenges, state-of-the-art techniques, datasets, evaluation metrics, and real-world applications. Recent advancements in deep learning have introduced innovative solutions, including multi-scale feature extraction, Super-Resolution (SR) techniques, attention mechanisms, and transformer-based architectures. Additionally, improvements in data augmentation, synthetic data generation, and transfer learning have addressed data scarcity and domain adaptation issues. Furthermore, emerging trends such as lightweight neural networks, knowledge distillation (KD), and self-supervised learning offer promising directions for improving detection efficiency, particularly in resource-constrained environments like Unmanned Aerial Vehicle (UAV)-based surveillance and edge computing. We also review widely used datasets, along with standard evaluation metrics such as mean Average Precision (mAP) and size-specific AP scores. The survey highlights real-world applications, including traffic monitoring, maritime surveillance, industrial defect detection, and precision agriculture.
Finally, we discuss open research challenges and future directions, emphasizing the need for robust domain adaptation techniques, better feature fusion strategies, and real-time performance optimization.
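
To make the size-specific AP scores mentioned above concrete, the sketch below buckets objects with the area thresholds used by the COCO benchmark (small below 32x32 pixels, medium below 96x96, large otherwise) and computes intersection over union (IoU), the overlap measure that AP is built on. The abstract does not say which thresholds the surveyed papers use, so choosing COCO's is an assumption, and bounding-box area is used here as a simplification.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def size_bucket(area):
    """COCO-style size bucket for an object area in pixels."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # The same 2-pixel localization error costs far more IoU on a small box.
    print(iou((0, 0, 10, 10), (2, 0, 12, 10)))      # about 0.667
    print(iou((0, 0, 100, 100), (2, 0, 102, 100)))  # about 0.961
```

The example shows one reason small objects are hard to score well on: a fixed pixel error that is negligible for a 100x100 box already drops a 10x10 box below a 0.75 IoU threshold.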

MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation

Jinnan Chen,Lingting Zhu,Zeyu Hu,Shengju Qian,Yugang Chen,Xin Wang,Gim Hee Lee

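Task: Address three key challenges of auto-regressive transformers in 3D generation: unordered 3D data, compression loss, and efficient scaling strategies.

Motivation: Auto-regressive transformers excel at language and visual generation, but in 3D generation they face problems of unordered data, compression loss, and scaling efficiency.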

Details

Method: Proposes MAR-3D, which combines a pyramid variational autoencoder with a cascaded masked auto-regressive transformer to support progressive latent upscaling. Result: MAR-3D outperforms existing methods in performance and generalization and shows stronger scaling capabilities. Conclusion: MAR-3D effectively addresses key challenges in 3D generation and offers a new direction for future research. Abstract: Recent advances in auto-regressive transformers have revolutionized generative modeling across different domains, from language processing to visual generation, demonstrating remarkable capabilities. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with the sequential next-token prediction paradigm, conventional vector quantization approaches incur substantial compression loss when applied to 3D meshes, and efficient scaling strategies for higher-resolution latent prediction are lacking. To address these challenges, we introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent upscaling in the continuous space. Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens. Additionally, we propose a cascaded training strategy with condition augmentation that enables efficient up-scaling of the latent token resolution with fast convergence. Extensive experiments demonstrate that MAR-3D not only achieves superior performance and generalization capabilities compared to existing methods but also exhibits enhanced scaling capabilities compared to joint distribution modeling approaches (e.g., diffusion transformers).

GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

Lloyd Russell,Anthony Hu,Lorenzo Bertoni,George Fedoseev,Jamie Shotton,Elahe Arani,Gianluca Corrado

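Task: Develop GAIA-2, a unified generative world model for simulating multi-agent interactions, fine-grained control, and multi-camera consistency in autonomous driving environments.

Motivation: Current generative models fall short of the domain-specific requirements of autonomous driving, such as multi-agent interactions and fine-grained control.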

Details

Method: Proposes GAIA-2, a latent diffusion world model that generates controllable, high-resolution multi-camera videos conditioned on structured inputs such as vehicle dynamics and environmental factors. Result: GAIA-2 generates spatiotemporally consistent multi-camera videos across geographically diverse driving environments (UK, US, Germany) and supports flexible scene synthesis. Conclusion: GAIA-2 provides a scalable generative world model for developing autonomous driving systems, able to simulate both common and rare driving scenarios. Abstract: Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving, such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA-2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at https://wayve.ai/thinking/gaia-2.

Ziying Zhang,Xiang Gao,Zhixin Wang,Qiang hu,Xiaoyun Zhang

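Task: Propose an efficient blind face restoration method based on a truncated diffusion model (TD-BFR) to address the slow training and inference and the insufficient detail recovery of existing diffusion models.

Motivation: Existing diffusion models for blind face restoration are slow and recover details inadequately, so a more efficient method that restores fine details is needed.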

Details

Method: Adopts a three-stage progressive restoration strategy that combines a truncated sampling method with an adaptive degradation removal module and leverages the priors of pre-trained diffusion models. Result: TD-BFR is on average 4.75 times faster than existing diffusion-based methods while maintaining competitive restoration quality. Conclusion: TD-BFR is an efficient blind face restoration method that significantly improves speed and detail recovery. Abstract: Diffusion-based methodologies have shown significant potential in blind face restoration (BFR), leveraging their robust generative capabilities. However, they are often criticized for two significant problems: 1) slow training and inference speed, and 2) inadequate recovery of fine-grained facial details. To address these problems, we propose a novel Truncated Diffusion model for efficient Blind Face Restoration (TD-BFR), a three-stage paradigm tailored for the progressive resolution of degraded images. Specifically, TD-BFR utilizes an innovative truncated sampling method, starting from low-quality (LQ) images at low resolution to enhance sampling speed, and then introduces an adaptive degradation removal module to handle unknown degradations and connect the generation processes across different resolutions. Additionally, we further adapt the priors of pre-trained diffusion models to recover rich facial details. Our method efficiently restores high-quality images in a coarse-to-fine manner, and experimental results demonstrate that TD-BFR is, on average, 4.75× faster than current state-of-the-art diffusion-based BFR methods while maintaining competitive quality.

Beyond Intermediate States: Explaining Visual Redundancy through Language

Dingchen Yang,Bowen Cao,Anran Zhang,Weibo Gu,Winston Hu,Guang Chen

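Task: Propose a more reliable method for identifying and pruning redundant visual tokens in multi-modal large language models (MLLMs).

Motivation: Existing visual token pruning methods cannot precisely capture the influence of visual tokens on MLLMs' visual understanding, which leads to inaccurate definitions of redundancy.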

Details

Method: Manipulates the visual input and analyzes changes in the textual output from token-centric and context-centric perspectives, examines ViT-[cls] association and text-to-image attention scores, and introduces a context-independent condition to identify redundant prototypes. Result: Experiments show the method is effective on single-image, multi-image, and video comprehension tasks, retaining 90% to 110% of performance after pruning 80% to 90% of visual tokens. Conclusion: The method efficiently identifies and prunes redundant visual tokens, substantially reducing the computational burden while maintaining model performance. Abstract: Multi-modal Large Language Models (MLLMs) often process thousands of visual tokens, which consume a significant portion of the context window and impose a substantial computational burden. Prior work has empirically explored visual token pruning methods based on MLLMs' intermediate states (e.g., attention scores). However, they have limitations in precisely defining visual redundancy due to their inability to capture the influence of visual tokens on MLLMs' visual understanding (i.e., the predicted probabilities for textual token candidates). To address this issue, we manipulate the visual input and investigate variations in the textual output from both token-centric and context-centric perspectives, achieving intuitive and comprehensive analysis. Experimental results reveal that visual tokens with low ViT-[cls] association and low text-to-image attention scores can contain recognizable information and significantly contribute to images' overall information. To develop a more reliable method for identifying and pruning redundant visual tokens, we integrate these two perspectives and introduce a context-independent condition to identify redundant prototypes from training images, which probes the redundancy of each visual token during inference. Extensive experiments on single-image, multi-image and video comprehension tasks demonstrate the effectiveness of our method, notably achieving 90% to 110% of the performance while pruning 80% to 90% of visual tokens.
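
For context, the intermediate-state pruning that the abstract attributes to prior work can be sketched as a top-k selection over per-token scores, such as text-to-image attention. This illustrates the baseline being critiqued, not the authors' prototype-based method; the function name and the keep_ratio parameter are hypothetical.

```python
def prune_tokens(scores, keep_ratio=0.2):
    """Return the sorted indices of visual tokens kept by top-k score pruning.

    scores: one importance score per visual token (e.g. an attention score).
    keep_ratio: fraction of tokens to keep; 0.1 to 0.2 matches pruning
    80% to 90% of tokens.
    """
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore the original token order


if __name__ == "__main__":
    attention = [0.1, 0.9, 0.3, 0.8, 0.05, 0.4, 0.2, 0.7, 0.6, 0.5]
    print(prune_tokens(attention, keep_ratio=0.2))  # [1, 3]
```

The abstract's finding is that tokens discarded by this kind of rule can still carry recognizable information, which is why the authors probe redundancy through the effect on the textual output instead.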

TerraTorch: The Geospatial Foundation Models Toolkit

Carlos Gomes,Benedikt Blumenstiel,Joao Lucas de Sousa Almeida,Pedro Henrique de Oliveira,Paolo Fraccaro,Francesc Marti Escofet,Daniela Szwarcman,Naomi Simumba,Romeo Kienzler,Bianca Zadrozny

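Task: Develop a fine-tuning and benchmarking toolkit for geospatial foundation models, built on PyTorch Lightning.

Motivation: Give researchers and practitioners working with satellite, weather, and climate data a tool for fine-tuning models without writing code, lowering the barrier to model development and benchmarking.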

Details

Method: Integrates domain-specific data modules, pre-defined tasks, and a modular model factory, and incorporates the automated hyperparameter optimization extension Iterate. Result: TerraTorch simplifies fine-tuning and benchmarking of geospatial foundation models, improving efficiency and reproducibility. Conclusion: TerraTorch is an open-source tool that, by consolidating best practices and automation, significantly reduces the complexity and time cost of geospatial model development. Abstract: TerraTorch is a fine-tuning and benchmarking toolkit for Geospatial Foundation Models built on PyTorch Lightning and tailored for satellite, weather, and climate data. It integrates domain-specific data modules, pre-defined tasks, and a modular model factory that pairs any backbone with diverse decoder heads. These components allow researchers and practitioners to fine-tune supported models in a no-code fashion by simply editing a training configuration. By consolidating best practices for model development and incorporating the automated hyperparameter optimization extension Iterate, TerraTorch reduces the expertise and time required to fine-tune or benchmark models on new Earth Observation use cases. Furthermore, TerraTorch directly integrates with GEO-Bench, allowing for systematic and reproducible benchmarking of Geospatial Foundation Models. TerraTorch is open sourced under Apache 2.0, available at https://github.com/IBM/terratorch, and can be installed via pip install terratorch.

IAP: Improving Continual Learning of Vision-Language Models via Instance-Aware Prompting

Hao Fu,Hanbin Zhao,Jiahua Dong,Chao Zhang,Hui Qian

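Task: Optimize prompt design in Multi-Domain Class-Incremental Learning (MCIL) to address forward and backward forgetting in pre-trained vision-language models (PT-VLMs).

Motivation: Existing methods consider only the effect of parameter-efficient fine-tuning (PEFT) strategy selection and neglect the influence of PEFT parameter settings (e.g., prompting) on task adaptation.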

Details

Method: Proposes the Instance-Aware Prompting (IAP) framework, comprising an Instance-Aware Gated Prompting (IA-GP) module and an Instance-Aware Class-Distribution-Driven Prompting (IA-CDDP) module. Result: The method's effectiveness is validated on 11 datasets using three performance metrics. Conclusion: By dynamically assigning prompts and refining task-label-related confidence scores, the IAP framework significantly improves adaptation to new tasks and mitigates forgetting. Abstract: Recent pre-trained vision-language models (PT-VLMs) often face a Multi-Domain Class-Incremental Learning (MCIL) scenario in practice, where several classes and domains of multi-modal tasks arrive incrementally. Without access to previously learned tasks and unseen tasks, memory-constrained MCIL suffers from forward and backward forgetting. To alleviate the above challenges, parameter-efficient fine-tuning (PEFT) techniques, such as prompt tuning, are employed to adapt the PT-VLM to the diverse incrementally learned tasks. To achieve effective new task adaptation, existing methods only consider the effect of PEFT strategy selection, but neglect the influence of PEFT parameter setting (e.g., prompting). In this paper, we tackle the challenge of optimizing prompt designs for diverse tasks in MCIL and propose an Instance-Aware Prompting (IAP) framework. Specifically, our Instance-Aware Gated Prompting (IA-GP) module enhances adaptation to new tasks while mitigating forgetting by dynamically assigning prompts across transformer layers at the instance level. Our Instance-Aware Class-Distribution-Driven Prompting (IA-CDDP) improves the task adaptation process by determining an accurate task-label-related confidence score for each instance. Experimental evaluations across 11 datasets, using three performance metrics, demonstrate the effectiveness of our proposed method. Code can be found at https://github.com/FerdinandZJU/IAP.

Jiepeng Wang,Zhaoqing Wang,Hao Pan,Yuan Liu,Dongdong Yu,Changhu Wang,Wenping Wang

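Task: Propose MMGen, a unified diffusion framework for multi-modal generation and understanding tasks.

Motivation: Achieve seamless and controllable image diffusion and other cross-modal tasks, improving the efficiency and performance of multi-modal tasks.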

Details

Method: Develops a novel diffusion transformer that supports multi-modal output and adopts a modality-decoupling strategy to unify various tasks. Result: MMGen demonstrates effectiveness and superiority across diverse tasks and conditions. Conclusion: MMGen has strong potential for applications that require simultaneous generation and understanding. Abstract: A unified diffusion framework for multi-modal generation and understanding has the transformative potential to achieve seamless and controllable image diffusion and other cross-modal tasks. In this paper, we introduce MMGen, a unified framework that integrates multiple generative tasks into a single diffusion model. This includes: (1) multi-modal category-conditioned generation, where multi-modal outputs are generated simultaneously through a single inference process, given category information; (2) multi-modal visual understanding, which accurately predicts depth, surface normals, and segmentation maps from RGB images; and (3) multi-modal conditioned generation, which produces corresponding RGB images based on specific modality conditions and other aligned modalities. Our approach develops a novel diffusion transformer that flexibly supports multi-modal output, along with a simple modality-decoupling strategy to unify various tasks. Extensive experiments and applications demonstrate the effectiveness and superiority of MMGen across diverse tasks and conditions, highlighting its potential for applications that require simultaneous generation and understanding.

Imitating Radiological Scrolling: A Global-Local Attention Model for 3D Chest CT Volumes Multi-Label Anomaly Classification

Theo Di Piazza,Carole Lazarus,Olivier Nempont,Loic Boussel

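Task: Propose CT-Scroll, a global-local attention model that emulates radiologists' scrolling behavior when analyzing 3D CT scans.

Motivation: The rapid growth in CT examinations calls for automated tools to help radiologists handle their growing workload, while existing deep learning methods fall short in capturing long-range dependencies and in modeling radiologists' navigational behavior.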

Details

Method: Designs a global-local attention model (CT-Scroll) specifically to emulate radiologists' scrolling behavior during the analysis of 3D CT scans. Result: Evaluated on two public datasets; comprehensive experiments and an ablation study demonstrate the model's efficacy and the contribution of each component. Conclusion: CT-Scroll effectively emulates radiologists' navigational behavior and offers a new approach to multi-label classification of 3D CT scans. Abstract: The rapid increase in the number of Computed Tomography (CT) scan examinations has created an urgent need for automated tools, such as organ segmentation, anomaly classification, and report generation, to assist radiologists with their growing workload. Multi-label classification of Three-Dimensional (3D) CT scans is a challenging task due to the volumetric nature of the data and the variety of anomalies to be detected. Existing deep learning methods based on Convolutional Neural Networks (CNNs) struggle to capture long-range dependencies effectively, while Vision Transformers require extensive pre-training, posing challenges for practical use. Additionally, these existing methods do not explicitly model the radiologist's navigational behavior while scrolling through CT scan slices, which requires both global context understanding and local detail awareness. In this study, we present CT-Scroll, a novel global-local attention model specifically designed to emulate the scrolling behavior of radiologists during the analysis of 3D CT scans. Our approach is evaluated on two public datasets, demonstrating its efficacy through comprehensive experiments and an ablation study that highlights the contribution of each model component.

AccidentSim: Generating Physically Realistic Vehicle Collision Videos from Real-World Accident Reports

Xiangwen Zhang,Qian Zhang,Longfei Han,Qiang Qu,Xiaoming Chen

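Task: Propose the AccidentSim framework for generating physically realistic vehicle collision videos.

Motivation: Real-world vehicle accident videos are rare and complex, and existing methods struggle to generate physically realistic collision trajectories.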

Details

Method: Combines a physical simulator with information from accident reports to build a collision trajectory dataset, uses a fine-tuned language model to predict trajectories, and renders backgrounds with NeRF. Result: The generated videos excel in both visual and physical realism. Conclusion: AccidentSim effectively generates physically realistic vehicle collision videos, supporting autonomous driving research. Abstract: Collecting real-world vehicle accident videos for autonomous driving research is challenging due to their rarity and complexity. While existing driving video generation methods may produce visually realistic videos, they often fail to deliver physically realistic simulations because they lack the capability to generate accurate post-collision trajectories. In this paper, we introduce AccidentSim, a novel framework that generates physically realistic vehicle collision videos by extracting and utilizing the physical clues and contextual information available in real-world vehicle accident reports. Specifically, AccidentSim leverages a reliable physical simulator to replicate post-collision vehicle trajectories from the physical and contextual information in the accident reports and to build a vehicle collision trajectory dataset. This dataset is then used to fine-tune a language model, enabling it to respond to user prompts and predict physically consistent post-collision trajectories across various driving scenarios based on user descriptions. Finally, we employ Neural Radiance Fields (NeRF) to render high-quality backgrounds, merging them with the foreground vehicles that exhibit physically realistic trajectories to generate vehicle collision videos. Experimental results demonstrate that the videos produced by AccidentSim excel in both visual and physical authenticity.

AutoRad-Lung: A Radiomic-Guided Prompting Autoregressive Vision-Language Model for Lung Nodule Malignancy Prediction

Sadaf Khademi,Mehran Shabanpour,Reza Taleei,Anastasia Oikonomou,Arash Mohammadi

Task: Propose AutoRad-Lung, which improves lung cancer diagnosis by combining an autoregressively pre-trained vision-language model with radiomics-based prompts.

Motivation: Address the limitations of existing CLIP-Lung models in lung cancer diagnosis: reliance on subjective annotations, use of textual information only during training, and a convolutional vision encoder that disregards prior knowledge.

Details

Method: Combines an autoregressively pre-trained vision-language model (AIMv2) with radiomics-based dynamic prompt generation (conditional context optimization). Result: AutoRad-Lung is reported to outperform CLIP-based models in capturing pixel-level differences and in cross-modal alignment. Conclusion: Through dynamic prompts and autoregressive pre-training, AutoRad-Lung improves the accuracy and applicability of lung cancer diagnosis. Abstract: Lung cancer remains one of the leading causes of cancer-related mortality worldwide. A crucial challenge for early diagnosis is differentiating uncertain cases with similar visual characteristics and close annotation scores. In clinical practice, radiologists rely on quantitative, hand-crafted Radiomic features extracted from Computed Tomography (CT) images, while recent research has primarily focused on deep learning solutions. More recently, Vision-Language Models (VLMs), particularly Contrastive Language-Image Pre-Training (CLIP)-based models, have gained attention for their ability to integrate textual knowledge into lung cancer diagnosis. While CLIP-Lung models have shown promising results, we identified the following potential limitations: (a) dependence on radiologists' annotated attributes, which are inherently subjective and error-prone, (b) use of textual information only during training, limiting direct applicability at inference, and (c) a convolution-based vision encoder with randomly initialized weights, which disregards prior knowledge. To address these limitations, we introduce AutoRad-Lung, which couples an autoregressively pre-trained VLM with prompts generated from hand-crafted Radiomics. AutoRad-Lung uses the vision encoder of the Large-Scale Autoregressive Image Model (AIMv2), pre-trained using a multi-modal autoregressive objective. Given that lung tumors are typically small, irregularly shaped, and visually similar to healthy tissue, AutoRad-Lung offers significant advantages over its CLIP-based counterparts by capturing pixel-level differences. Additionally, we introduce conditional context optimization, which dynamically generates context-specific prompts based on input Radiomics, improving cross-modal alignment.

ARMO: Autoregressive Rigging for Multi-Category Objects

Mingze Sun,Shiwei Mao,Keyi Chen,Yurun Chen,Shunlin Lu,Jingbo Wang,Junting Dong,Ruqi Huang

Task: Propose a rigging method for dynamic 3D models, built on a large-scale dataset.

Motivation: Existing methods focus mainly on generating static 3D models and overlook the needs of dynamic shapes such as humanoids, animals, and insects.

Details

Method: Proposes the ARMO framework, which uses an autoregressive model to predict joint positions and connectivity, combined with a mesh-conditioned latent diffusion model to generate skeletons. Result: Achieves state-of-the-art skeleton prediction on the OmniRig dataset and shows generalization across categories. Conclusion: ARMO addresses the limitations of regression-based methods and offers a new approach to rigging dynamic 3D models. Abstract: Recent advancements in large-scale generative models have significantly improved the quality and diversity of 3D shape generation. However, most existing methods focus primarily on generating static 3D models, overlooking the potentially dynamic nature of certain shapes, such as humanoids, animals, and insects. To address this gap, we focus on rigging, a fundamental task in animation that establishes skeletal structures and skinning for 3D models. In this paper, we introduce OmniRig, the first large-scale rigging dataset, comprising 79,499 meshes with detailed skeleton and skinning information. Unlike traditional benchmarks that rely on predefined standard poses (e.g., A-pose, T-pose), our dataset embraces diverse shape categories, styles, and poses. Leveraging this rich dataset, we propose ARMO, a novel rigging framework that utilizes an autoregressive model to predict both joint positions and connectivity relationships in a unified manner. By treating the skeletal structure as a complete graph and discretizing it into tokens, we encode the joints using an auto-encoder to obtain a latent embedding and an autoregressive model to predict the tokens. A mesh-conditioned latent diffusion model is used to predict the latent embedding for conditional skeleton generation. Our method addresses the limitations of regression-based approaches, which often suffer from error accumulation and suboptimal connectivity estimation. Through extensive experiments on the OmniRig dataset, our approach achieves state-of-the-art performance in skeleton prediction, demonstrating improved generalization across diverse object categories. The code and dataset will be made public for academic use upon acceptance.

BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation

Yuyang Peng,Shishi Xiao,Keming Wu,Qisheng Liao,Bohan Chen,Kevin Lin,Danqing Huang,Ji Li,Yuhui Yuan

Task: Generate high-quality business content (such as infographics and slides) from user-provided article-level descriptive prompts and ultra-dense layouts.

Motivation: Address two challenges of article-level visual text rendering: much longer context lengths and the scarcity of high-quality business content data.

Details

Method: Builds the high-quality business content dataset Infographics-650K and proposes a layout-guided cross-attention scheme. Result: Outperforms existing systems such as Flux and SD3 on the BizEval prompt set. Conclusion: Infographics-650K and BizEval are expected to encourage further progress in business content generation. Abstract: Recently, state-of-the-art text-to-image generation models, such as Flux and Ideogram 2.0, have made significant progress in sentence-level visual text rendering. In this paper, we focus on the more challenging scenarios of article-level visual text rendering and address a novel task of generating high-quality business content, including infographics and slides, based on user-provided article-level descriptive prompts and ultra-dense layouts. The fundamental challenges are twofold: significantly longer context lengths and the scarcity of high-quality business content data. In contrast to most previous works that focus on a limited number of sub-regions and sentence-level prompts, ensuring precise adherence to ultra-dense layouts with tens or even hundreds of sub-regions in business content is far more challenging. We make two key technical contributions: (i) the construction of a scalable, high-quality business content dataset, i.e., Infographics-650K, equipped with ultra-dense layouts and prompts by implementing a layer-wise retrieval-augmented infographic generation scheme; and (ii) a layout-guided cross attention scheme, which injects tens of region-wise prompts into a set of cropped region latent spaces according to the ultra-dense layouts, and refines each sub-region flexibly during inference using a layout conditional CFG. We demonstrate the strong results of our system compared to previous SOTA systems such as Flux and SD3 on our BizEval prompt set. Additionally, we conduct thorough ablation experiments to verify the effectiveness of each component. We hope our constructed Infographics-650K and BizEval can encourage the broader community to advance the progress of business content generation.

Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy

Yinan Sun,Xiongkuo Min,Zicheng Zhang,Yixuan Gao,Yuqin Cao,Guangtao Zhai

Task: Study hallucinations in low-level visual perception and understanding (HLPU) and propose solutions that improve model self-awareness and reliability.

Motivation: Multimodal large language models have made notable progress in visual perception and understanding but suffer from hallucinations, which remain under-studied in low-level vision tasks.

Details

Method: Introduces the HLPU instruction database and proposes the SAFEQA model and the ESA-PO framework, combining image features, salient region features, and quality features to strengthen the model's self-awareness. Result: Experiments show that the method significantly improves self-awareness, reduces hallucinations, and outperforms closed-source models on multiple evaluation metrics. Conclusion: Strengthening self-awareness reduces hallucinations in low-level vision tasks and improves accuracy and reliability. Abstract: The rapid development of multimodal large language models has resulted in remarkable advancements in visual perception and understanding, consolidating several tasks into a single visual question-answering framework. However, these models are prone to hallucinations, which limit their reliability as artificial intelligence systems. While this issue is extensively researched in natural language processing and image captioning, there remains a lack of investigation of hallucinations in Low-level Visual Perception and Understanding (HLPU), especially in the context of image quality assessment tasks. We consider that these hallucinations arise from an absence of clear self-awareness within the models. To address this issue, we first introduce the HLPU instruction database, the first instruction database specifically focused on hallucinations in low-level vision tasks. This database contains approximately 200K question-answer pairs and comprises four subsets, each covering different types of instructions. Subsequently, we propose the Self-Awareness Failure Elimination (SAFEQA) model, which utilizes image features, salient region features and quality features to improve the perception and comprehension abilities of the model in low-level vision tasks. Furthermore, we propose the Enhancing Self-Awareness Preference Optimization (ESA-PO) framework to increase the model's awareness of knowledge boundaries, thereby mitigating the incidence of hallucination. Finally, we conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations. Notably, our proposed method improves both accuracy and self-awareness of the proposed model and outperforms closed-source models in terms of various evaluation metrics.

Vision as LoRA

Han Wang,Yongjie Ye,Bingru Li,Yuxiang Nie,Jinghui Lu,Jingqun Tang,Yanjie Wang,Can Huang

Task: Present Vision as LoRA (VoRA), a new paradigm for turning a large language model (LLM) into a multimodal large language model (MLLM).

Motivation: Existing MLLM architectures rely on external vision modules for vision encoding; VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM, reducing structural complexity and computational overhead.

Details

Method: Integrates vision-specific LoRA layers, uses block-wise distillation to transfer visual priors from a pre-trained ViT, and applies bi-directional attention masks to better capture image context. Result: With additional pre-training data, VoRA performs comparably to conventional encoder-based MLLMs. Conclusion: By internalizing visual capabilities and training efficiently, VoRA offers a simpler and more efficient way to build MLLMs. Abstract: We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability to handle flexible context, VoRA can process inputs at arbitrary resolutions. To further strengthen VoRA's visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the context information of an image. We successfully demonstrate that with additional pre-training data, VoRA can perform comparably with conventional encoder-based MLLMs. All training data, code, and model weights will be released at https://github.com/Hon-Wong/VoRA.
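
The abstract's claim that the added LoRA parameters can be merged into the LLM at inference rests on the standard LoRA identity W' = W + scale * B A. The snippet below is a minimal pure-Python sketch of that identity on a toy 2x2 layer; the matrices, the `scale` parameter, and the helper names are illustrative assumptions, not VoRA's actual code.

```python
# Toy check that a merged LoRA layer gives the same output as the
# unmerged "frozen weight + low-rank branch" computation.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, scale=1.0):
    """Return W + scale * (B @ A); names and scale are assumptions."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

def apply(weight, x):
    """Matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight (hypothetical)
A = [[0.5, -1.0]]             # rank-1 down-projection, shape 1x2
B = [[2.0], [1.0]]            # up-projection, shape 2x1
x = [3.0, 4.0]

merged = merge_lora(W, A, B)
merged_out = apply(merged, x)

# Unmerged path: W x + B (A x)
unmerged_out = [wx + bx for wx, bx in zip(apply(W, x), apply(B, apply(A, x)))]
```

Because the two outputs agree, the low-rank branch adds no extra module at inference once it has been folded into the weight.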

GLRD: Global-Local Collaborative Reason and Debate with PSL for 3D Open-Vocabulary Detection

Xingyu Peng,Si Liu,Chen Gao,Yan Bai,Beipeng Mu,Xiaofei Wang,Huaxia Xia

Task: LiDAR-based 3D Open-Vocabulary Detection (3D OVD), which aims to detect novel objects from point clouds without off-the-shelf training labels.

Motivation: Existing methods focus on object-level representation learning and ignore scene-level information, which makes it hard to distinguish objects of similar classes.

Details

Method: Proposes the Global-Local Collaborative Reason and Debate with PSL (GLRD) framework, which combines object-level and scene-level information, uses an LLM for common-sense reasoning, and refines detections with a probabilistic soft logic solver (OV-PSL) and a debate scheme. It also designs a static balance scheme (SBC), a dynamic balance scheme (DBC), Reflected Pseudo Labels Generation (RPLG), and Background-Aware Object Localization (BAOL) to handle uneven class distribution and noisy data. Result: In the partial open-vocabulary setting, mean average precision improves by +2.82% on SUN RGB-D and +3.72% on ScanNet; in the full open-vocabulary setting, it improves by +4.03% on ScanNet and +14.11% on SUN RGB-D. Conclusion: By combining global and local information, GLRD markedly improves 3D OVD performance. Abstract: The task of LiDAR-based 3D Open-Vocabulary Detection (3D OVD) requires the detector to learn to detect novel objects from point clouds without off-the-shelf training labels. Previous methods focus on the learning of object-level representations and ignore the scene-level information, thus it is hard to distinguish objects with similar classes. In this work, we propose a Global-Local Collaborative Reason and Debate with PSL (GLRD) framework for the 3D OVD task, considering both local object-level information and global scene-level information. Specifically, an LLM is utilized to perform common sense reasoning based on object-level and scene-level information, where the detection result is refined accordingly. To further boost the LLM's ability to make precise decisions, we also design a probabilistic soft logic solver (OV-PSL) to search for the optimal solution, and a debate scheme to confirm the class of confusable objects. In addition, to alleviate the uneven distribution of classes, a static balance scheme (SBC) and a dynamic balance scheme (DBC) are designed. In addition, to reduce the influence of noise in data and training, we further propose Reflected Pseudo Labels Generation (RPLG) and Background-Aware Object Localization (BAOL). Extensive experiments conducted on ScanNet and SUN RGB-D demonstrate the superiority of GLRD, where absolute improvements in mean average precision are +2.82% on SUN RGB-D and +3.72% on ScanNet in the partial open-vocabulary setting. In the full open-vocabulary setting, the absolute improvements in mean average precision are +4.03% on ScanNet and +14.11% on SUN RGB-D.

Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound

Yuhao Huang,Ao Chang,Haoran Dou,Xing Tao,Xinrui Zhou,Yan Cao,Ruobing Huang,Alejandro F Frangi,Lingyun Bao,Xin Yang,Dong Ni

Task: Develop Flip Learning, a weakly supervised segmentation framework based on multi-agent reinforcement learning, for accurate nodule segmentation in 2D/3D breast ultrasound images.

Motivation: Automated nodule segmentation reduces the burden of manual annotation and speeds up clinical analysis, but existing weakly supervised methods depend on inaccurate activation maps or inefficient pseudo-mask generation and struggle to segment precisely.

Details

Method: Proposes the Flip Learning framework, in which multiple agents erase the target region until the classification label flips, and the erased region becomes the segmentation mask. The environment is encoded with superpixels/supervoxels, three rewards are used (a classification score reward and two intensity distribution rewards), and training follows a progressive curriculum learning strategy. Result: On large in-house BUS and ABUS datasets, Flip Learning outperforms existing weakly supervised segmentation methods and foundation models, and approaches fully supervised algorithms. Conclusion: With its multi-agent reinforcement learning framework and reward design, Flip Learning improves the accuracy of weakly supervised segmentation and provides an efficient tool for clinical diagnosis. Abstract: Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents' erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves performance comparable to that of fully-supervised learning algorithms.
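
The "erase until the classification tag flips" idea can be illustrated with a deliberately simplified toy. The sketch below swaps the paper's multi-agent reinforcement learning for a greedy loop, and the per-superpixel evidence scores, the mean-based classifier, and the threshold are all hypothetical stand-ins chosen only to make the mechanism visible.

```python
# Toy illustration: erase superpixels inside a box until a stand-in
# classifier no longer says "nodule"; the erased set is the mask.
# Greedy selection replaces the RL agents described in the abstract.

def nodule_score(evidence, erased):
    """Mean evidence over the box; erased superpixels count as background (0)."""
    kept = [0.0 if i in erased else e for i, e in enumerate(evidence)]
    return sum(kept) / len(evidence)

def erase_until_flip(evidence, threshold=0.15):
    """Erase the highest-evidence superpixel until the label flips."""
    erased = set()
    while nodule_score(evidence, erased) > threshold and len(erased) < len(evidence):
        remaining = [i for i in range(len(evidence)) if i not in erased]
        erased.add(max(remaining, key=lambda i: evidence[i]))
    return sorted(erased)

# Hypothetical per-superpixel "nodule evidence" inside one bounding box.
evidence = [0.1, 0.9, 0.8, 0.2, 0.05, 0.7]
mask = erase_until_flip(evidence)  # indices of erased superpixels
```

A loop like this stops as soon as the label flips, so it has no safeguard against erasing too little or too much; that gap is the kind of problem the abstract's intensity distribution rewards are said to address.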

MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion

Saron Samuel,Dan DeGenaro,Jimena Guallar-Blasco,Kate Sanders,Oluwaseun Eisape,Arun Reddy,Alexander Martin,Andrew Yates,Eugene Yang,Cameron Carpenter,David Etter,Efsun Kayi,Matthew Wiesner,Kenton Murray,Reno Kriz

Task: Develop MMMORRF, a multimodal video retrieval system that integrates text and features from the visual and audio modalities.

Motivation: Existing video retrieval systems rely too heavily on visual signals and neglect the other modalities.

Details

Method: Extracts text and features from the visual and audio modalities and fuses them with a novel modality-aware weighted reciprocal rank fusion. Result: On the MultiVENT 2.0 and TVR benchmarks, nDCG@20 improves by 81% over leading multimodal encoders and by 37% over single-modality retrieval. Conclusion: MMMORRF shows the value of integrating multiple modalities for video retrieval. Abstract: Videos inherently contain multiple modalities, including visual events, text overlays, sounds, and speech, all of which are important for retrieval. However, state-of-the-art multimodal language models like VAST and LanguageBind are built on vision-language models (VLMs), and thus overly prioritize visual signals. Retrieval benchmarks further reinforce this bias by focusing on visual queries and neglecting other modalities. We create a search system MMMORRF that extracts text and features from both visual and audio modalities and integrates them with a novel modality-aware weighted reciprocal rank fusion. MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs instead of visual descriptive queries. We evaluate MMMORRF on MultiVENT 2.0 and TVR, two multimodal benchmarks designed for more targeted information needs, and find that it improves nDCG@20 by 81% over leading multimodal encoders and 37% over single-modality retrieval, demonstrating the value of integrating diverse modalities.
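
Weighted reciprocal rank fusion can be sketched in a few lines. The example below uses the standard RRF score, the sum over rankers of w / (k + rank), with the conventional k = 60; the per-modality weights, the modality names, and the video IDs are hypothetical, and the abstract does not say how MMMORRF actually sets its modality-aware weights.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists: score(d) = sum_m weights[m] / (k + rank_m(d)).

    rankings: modality name -> list of doc ids, best first.
    weights:  modality name -> weight (hypothetical values here).
    """
    scores = defaultdict(float)
    for modality, ranked_ids in rankings.items():
        w = weights.get(modality, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Sort by descending score; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

rankings = {
    "ocr_text": ["v2", "v1", "v3"],
    "speech":   ["v1", "v3"],
    "visual":   ["v3", "v2", "v1"],
}
weights = {"ocr_text": 1.0, "speech": 1.0, "visual": 0.5}

fused = weighted_rrf(rankings, weights)
```

Down-weighting the visual ranker in this toy lets a video that ranks well on text and speech rise above one that is only visually prominent, which is the bias the abstract sets out to correct.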

A weakly-supervised deep learning model for fast localisation and delineation of the skeleton, internal organs, and spinal canal on Whole-Body Diffusion-Weighted MRI (WB-DWI)

A. Candito,A. Dragan,R. Holbrey,A. Ribeiro,R. Donners,C. Messiou,N. Tunariu,D.-M. Koh,M. D. Blackledge (The Institute of Cancer Research, London, United Kingdom; The Royal Marsden NHS Foundation Trust, London, United Kingdom; University Hospital Basel, Basel, Switzerland)

Task: Propose an automated deep-learning algorithm that generates probability maps of the skeleton, adjacent internal organs, and spinal canal on whole-body diffusion-weighted MRI (WB-DWI).

Motivation: Manual disease delineation for measuring ADC and TDV is unfeasible in clinical practice, so automation is needed.

Details

Method: A deep-learning pipeline based on a 3D patch-based Residual U-Net architecture, trained with "soft labels" and validated on a multi-center WB-DWI dataset. Result: The model delineates the skeleton, internal organs, and spinal canal well, runs 12x faster than the atlas-based approach, and differs little from expert manual delineations. Conclusion: The model generates probability maps quickly and reproducibly, supports ADC and TDV quantification, and may help with clinical disease staging and treatment response assessment. Abstract: Background: Apparent Diffusion Coefficient (ADC) values and Total Diffusion Volume (TDV) from Whole-body diffusion-weighted MRI (WB-DWI) are recognized cancer imaging biomarkers. However, manual disease delineation for ADC and TDV measurements is unfeasible in clinical practice, demanding automation. As a first step, we propose an algorithm to generate fast and reproducible probability maps of the skeleton, adjacent internal organs (liver, spleen, urinary bladder, and kidneys), and spinal canal. Methods: We developed an automated deep-learning pipeline based on a 3D patch-based Residual U-Net architecture that localizes and delineates these anatomical structures on WB-DWI. The algorithm was trained using "soft-labels" (non-binary segmentations) derived from a computationally intensive atlas-based approach. For training and validation, we employed a multi-center WB-DWI dataset comprising 532 scans from patients with Advanced Prostate Cancer (APC) or Multiple Myeloma (MM), with testing on 45 patients. Results: Our weakly-supervised deep learning model achieved an average dice score/precision/recall of 0.66/0.6/0.73 for skeletal delineations, 0.8/0.79/0.81 for internal organs, and 0.85/0.79/0.94 for spinal canal, with surface distances consistently below 3 mm. Relative median ADC and log-transformed volume differences between automated and manual expert-defined full-body delineations were below 10% and 4%, respectively. The computational time for generating probability maps was 12x faster than the atlas-based registration algorithm (25 s vs. 5 min). An experienced radiologist rated the model's accuracy "good" or "excellent" on test datasets. Conclusion: Our model offers fast and reproducible probability maps for localizing and delineating body regions on WB-DWI, enabling ADC and TDV quantification, potentially supporting clinicians in disease staging and treatment response assessment.
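
For readers unfamiliar with the reported metrics, the sketch below computes Dice, precision, and recall from two binary masks and checks the quoted 12x speed-up from the stated timings. The flattened toy masks are made up for illustration, and it assumes the probability maps have already been thresholded to binary, which the abstract does not detail.

```python
def overlap_metrics(pred, truth):
    """Return (dice, precision, recall) for two flat binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return dice, precision, recall

# Hypothetical flattened voxel masks (1 = structure, 0 = background).
pred  = [1, 1, 1, 0, 0, 1]
truth = [1, 1, 0, 0, 1, 1]
dice, precision, recall = overlap_metrics(pred, truth)

# Speed-up implied by the abstract's timings: 5 min vs. 25 s.
speedup = (5 * 60) / 25
```

Dice is the harmonic mean of precision and recall, which is consistent with the skeletal figures quoted above (0.6 and 0.73 give roughly 0.66).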

Dynamic Motion Blending for Versatile Motion Editing

Nan Jiang,Hongjie Li,Ziye Yuan,Zimo He,Yixin Chen,Tengyu Liu,Yixin Zhu,Siyuan Huang

Task: Propose a text-guided motion editing method that improves editing quality by generating training data dynamically and improving the model architecture.

Motivation: Existing methods rely on limited pre-collected training data, which restricts their applicability across diverse editing scenarios.

Details

Method: Combines MotionCutMix (dynamic generation of training data) with MotionReFit (an auto-regressive diffusion model with a motion coordinator). Result: MotionReFit achieves state-of-the-art performance in text-guided motion editing. Conclusion: The method handles high-level human instructions directly, without additional specifications or large language models, and is broadly applicable. Abstract: Text-guided motion editing enables high-level semantic control and iterative modifications beyond traditional keyframe animation. Existing methods rely on limited pre-collected training triplets, which severely hinders their versatility in diverse editing scenarios. We introduce MotionCutMix, an online data augmentation technique that dynamically generates training triplets by blending body part motions based on input text. While MotionCutMix effectively expands the training distribution, the compositional nature introduces increased randomness and potential body part incoordination. To model such a rich distribution, we present MotionReFit, an auto-regressive diffusion model with a motion coordinator. The auto-regressive architecture facilitates learning by decomposing long sequences, while the motion coordinator mitigates the artifacts of motion composition. Our method handles both spatial and temporal motion edits directly from high-level human instructions, without relying on additional specifications or Large Language Models. Through extensive experiments, we show that MotionReFit achieves state-of-the-art performance in text-guided motion editing.
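
Blending body part motions can be pictured as a per-joint composite of two clips. The sketch below is only a toy under stated assumptions: motions are lists of frames mapping a joint name to a single scalar, the edited body part is given as an explicit joint list rather than derived from text, and the linear `alpha` blend is a hypothetical choice, not the published MotionCutMix procedure.

```python
def blend_body_parts(base, donor, part_joints, alpha=1.0):
    """Copy (alpha=1) or mix (0<alpha<1) the donor's joints into the base clip.

    base, donor: equal-length lists of frames, each a dict joint -> value.
    part_joints: joints belonging to the edited body part (assumed given).
    """
    blended = []
    for base_frame, donor_frame in zip(base, donor):
        frame = dict(base_frame)  # keep all untouched joints from the base
        for joint in part_joints:
            frame[joint] = (1 - alpha) * base_frame[joint] + alpha * donor_frame[joint]
        blended.append(frame)
    return blended

# Two hypothetical two-frame clips with one value per joint.
walk = [{"left_arm": 0.0, "legs": 1.0}, {"left_arm": 0.0, "legs": 2.0}]
wave = [{"left_arm": 10.0, "legs": 5.0}, {"left_arm": 20.0, "legs": 5.0}]

# "Wave the left arm while walking": arm from the donor, legs from the base.
edited = blend_body_parts(walk, wave, ["left_arm"])
```

In this toy, the original clip, the edit description, and the composited clip form a training triplet of the kind the abstract mentions, and the hard seam between the two sources hints at the incoordination the motion coordinator is meant to smooth out.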

SChanger: Change Detection from a Semantic Change and Spatial Consistency Perspective

Ziyu Zhou,Keyan Hu,Yutian Fang,Xiaoping Rui

Task: Develop a fine-tuning strategy, the Semantic Change Network (SCN), to address data scarcity in change detection.

Motivation: Change detection suffers from data scarcity because accurately aligning remote sensing images is labor-intensive, which limits the performance of deep learning algorithms.

Details

Method: Acquires prior knowledge of instance feature extraction through pre-training, fine-tunes with a shared-weight Siamese architecture and an extended Temporal Fusion Module (TFM), and introduces an attention mechanism that encodes spatial consistency. Result: Validated on six datasets with F1 scores of 92.87%, 86.43%, 68.95%, 97.62%, 84.58%, and 93.20%, surpassing all benchmark methods. Conclusion: The proposed SCN approach addresses data scarcity and improves change detection performance. Abstract: Change detection is a key task in Earth observation applications. Recently, deep learning methods have demonstrated strong performance and widespread application. However, change detection faces data scarcity due to the labor-intensive process of accurately aligning remote sensing images of the same area, which limits the performance of deep learning algorithms. To address the data scarcity issue, we develop a fine-tuning strategy called the Semantic Change Network (SCN). We initially pre-train the model on single-temporal supervised tasks to acquire prior knowledge of instance feature extraction. The model then employs a shared-weight Siamese architecture and extended Temporal Fusion Module (TFM) to preserve this prior knowledge and is fine-tuned on change detection tasks. The learned semantics for identifying all instances is changed to focus on identifying only the changes. Meanwhile, we observe that the locations of changes between the two images are spatially identical, a concept we refer to as spatial consistency. We introduce this inductive bias through an attention map that is generated by large-kernel convolutions and applied to the features from both time points. This enhances the modeling of multi-scale changes and helps capture underlying relationships in change detection semantics. We develop a binary change detection model utilizing these two strategies. The model is validated against state-of-the-art methods on six datasets, surpassing all benchmark methods and achieving F1 scores of 92.87%, 86.43%, 68.95%, 97.62%, 84.58%, and 93.20% on the LEVIR-CD, LEVIR-CD+, S2Looking, CDD, SYSU-CD, and WHU-CD datasets, respectively.

Emotion Detection and Music Recommendation System

Swetha Kambham,Hubert Jhonson,Sai Prathap Reddy Kambham

Task: Develop a deep-learning music recommendation system that detects emotion through real-time facial expression analysis and plays matching music.

Motivation: Use AI to improve the emotional fit of music recommendations and support users' emotional well-being through music therapy.

Details

Method: Combines the DeepFace framework with facial recognition to analyze the user's expression in real time and select a playlist matching the detected emotion from local storage. Result: The system detects emotion in real time, plays matching music automatically, and lets users adjust the song selection manually. Conclusion: The system aims to improve users' emotional experience and well-being through automated, responsive music recommendation. Abstract: As artificial intelligence becomes more and more ingrained in daily life, we present a novel system that uses deep learning for music recommendation and emotion-based detection. Through the use of facial recognition and the DeepFace framework, our method analyses human emotions in real-time and then plays music that reflects the mood it has discovered. The system uses a webcam to take pictures, analyses the most common facial expression, and then pulls a playlist from local storage that corresponds to the mood it has detected. An engaging and customised experience is ensured by allowing users to manually change the song selection via a dropdown menu or navigation buttons. By continuously looping over the playlist, the technology guarantees continuity. The objective of our system is to improve emotional well-being through music therapy by offering a responsive and automated music-selection experience.
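
The step "analyse the most common facial expression, then pull the matching playlist" can be sketched without any vision library. In the example below the per-frame emotion labels stand in for the output of an emotion detector, and the playlist mapping, file names, and `neutral` fallback are hypothetical.

```python
from collections import Counter

# Hypothetical mood -> local playlist mapping.
PLAYLISTS = {
    "happy":   ["upbeat_01.mp3", "upbeat_02.mp3"],
    "sad":     ["calm_01.mp3"],
    "neutral": ["ambient_01.mp3"],
}

def dominant_emotion(frame_labels):
    """Return the most frequent label across the captured frames."""
    return Counter(frame_labels).most_common(1)[0][0]

def pick_playlist(frame_labels, playlists, fallback="neutral"):
    """Map the dominant emotion to a playlist, falling back if unknown."""
    mood = dominant_emotion(frame_labels) if frame_labels else fallback
    return mood, playlists.get(mood, playlists[fallback])

# Labels a detector might emit for five webcam frames (made-up data).
labels = ["sad", "happy", "happy", "neutral", "happy"]
mood, playlist = pick_playlist(labels, PLAYLISTS)
```

Voting over several frames rather than trusting a single capture makes the selection less sensitive to one misread expression.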

High Quality Diffusion Distillation on a Single GPU with Relative and Absolute Position Matching

Guoqiang Zhang,Kenta Niwa,J. P. Lewis,Cedric Mesnage,W. Bastiaan Kleijn

Task: Propose relative and absolute position matching (RAPM), a diffusion distillation method for high-quality generation that trains efficiently on a single GPU.

Motivation: Existing diffusion distillation methods such as PCM and DMD2 need many GPUs and large batch sizes, which puts them beyond the resources of some researchers. RAPM aims to remove this barrier.

Details

Method: Mimics the teacher model's sampling trajectories by matching relative and absolute positions, with two discriminators, one for each kind of position. Result: Experiments on StableDiffusion V1.5 and SDXL show that, under limited compute, RAPM with 4 timesteps is comparable to the best 1-timestep method. Conclusion: RAPM is an efficient, resource-friendly diffusion distillation method suited to single-GPU training. Abstract: We introduce relative and absolute position matching (RAPM), a diffusion distillation method resulting in high quality generation that can be trained efficiently on a single GPU. Recent diffusion distillation research has achieved excellent results for high-resolution text-to-image generation with methods such as phased consistency models (PCM) and improved distribution matching distillation (DMD2). However, these methods generally require many GPUs (e.g., 8-64) and significant batch sizes (e.g., 128-2048) during training, resulting in memory and compute requirements that are beyond the resources of some researchers. RAPM provides effective single-GPU diffusion distillation training with a batch size of 1. The new method attempts to mimic the sampling trajectories of the teacher model by matching the relative and absolute positions. The design of relative positions is inspired by PCM. Two discriminators are introduced accordingly in RAPM, one for matching relative positions and the other for absolute positions. Experimental results on StableDiffusion (SD) V1.5 and SDXL indicate that RAPM with 4 timesteps produces FID scores comparable to those of the best method with 1 timestep under very limited computational resources.

MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams

Yanpeng Sun,Shan Zhang,Wei Tang,Aotian Chen,Piotr Koniusz,Kai Zou,Yuan Xue,Anton van den Hengel

Task: Evaluate and improve how well multimodal large language models (MLLMs) perceive mathematical diagrams.

Motivation: Current benchmarks conflate perception and reasoning, which makes it hard to tell whether MLLMs truly understand mathematical diagrams or only perform superficial pattern recognition.

Details

Method: Introduces the MATHGLANCE benchmark for evaluating, and the GeoPeP dataset for training, MLLMs' perception of mathematical diagrams. Result: MLLMs show limited ability on fine-grained grounding tasks, but training on GeoPeP leads to significant gains in perceptual accuracy and mathematical reasoning. Conclusion: MATHGLANCE and GeoPeP provide key standards and resources for evaluating and advancing multimodal mathematical understanding. Abstract: Diagrams serve as a fundamental form of visual language, representing complex concepts and their inter-relationships through structured symbols, shapes, and spatial arrangements. Unlike natural images, their inherently symbolic and abstract nature poses significant challenges for Multimodal Large Language Models (MLLMs). However, current benchmarks conflate perceptual and reasoning tasks, making it difficult to assess whether MLLMs genuinely understand mathematical diagrams beyond superficial pattern recognition. To address this gap, we introduce MATHGLANCE, a benchmark specifically designed to isolate and evaluate mathematical perception in MLLMs. MATHGLANCE comprises 1.2K images and 1.6K carefully curated questions spanning four perception tasks: shape classification, object counting, relationship identification, and object grounding, covering diverse domains including plane geometry, solid geometry, and graphical representations. Our evaluation of MLLMs reveals that their ability to understand diagrams is notably limited, particularly in fine-grained grounding tasks. In response, we construct GeoPeP, a perception-oriented dataset of 200K structured geometry image-text pairs explicitly annotated with geometric primitives and precise spatial relationships. Training MLLMs on GeoPeP leads to significant gains in perceptual accuracy, which in turn substantially improves mathematical reasoning. Our benchmark and dataset establish critical standards for evaluating and advancing multimodal mathematical understanding, providing valuable resources and insights to foster future MLLM research.

PhysGen3D: Crafting a Miniature Interactive World from a Single Image

Boyuan Chen,Hanxiao Jiang,Shaowei Liu,Saurabh Gupta,Yunzhu Li,Hao Zhao,Shenlong Wang

Task: Generate a physically plausible, interactive 3D scene from a single image.

Motivation: Combine image-based geometric and semantic understanding with physics-based simulation to predict and simulate future scenarios from a static image.

Details

Method: Proposes the PhysGen3D framework, which estimates 3D shapes, poses, and physical and lighting properties, and combines them with user input to produce an interactive 3D scene. Result: PhysGen3D outperforms existing systems at generating physically plausible videos while offering more flexibility and control. Conclusion: PhysGen3D balances photorealism, physical plausibility, and user interactivity, opening new possibilities for dynamic video generation. Abstract: Envisioning physically plausible outcomes from a single image requires a deep understanding of the world's dynamics. To address this, we introduce PhysGen3D, a novel framework that transforms a single image into an amodal, camera-centric, interactive 3D scene. By combining advanced image-based geometric and semantic understanding with physics-based simulation, PhysGen3D creates an interactive 3D world from a static image, enabling us to "imagine" and simulate future scenarios based on user input. At its core, PhysGen3D estimates 3D shapes, poses, physical and lighting properties of objects, thereby capturing essential physical attributes that drive realistic object interactions. This framework allows users to specify precise initial conditions, such as object speed or material properties, for enhanced control over generated video outcomes. We evaluate PhysGen3D's performance against closed-source state-of-the-art (SOTA) image-to-video models, including Pika, Kling, and Gen-3, showing PhysGen3D's capacity to generate videos with realistic physics while offering greater flexibility and fine-grained control. Our results show that PhysGen3D achieves a unique balance of photorealism, physical plausibility, and user-driven interactivity, opening new possibilities for generating dynamic, physics-grounded video from an image.

UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines

Chen Tang,Xinzhu Ma,Encheng Su,Xiufeng Song,Xiaohong Liu,Wei-Hong Li,Lei Bai,Wanli Ouyang,Xiangyu Yue

Task: Propose UniSTD, a unified Transformer-based framework for spatiotemporal modeling that supports multi-task learning.

Motivation: Traditional spatiotemporal models rely on task-specific architectures, which limits their generalizability and scalability.

Details

Method: Follows a two-stage pretraining-then-adaptation paradigm that combines task-agnostic pretraining with specialized joint training, and introduces a rank-adaptive mixture-of-experts adaptation and a temporal module. Result: Validated on a large-scale dataset; a single unified model trains on 10 tasks simultaneously while reducing training costs. Conclusion: UniSTD shows the potential of unified spatiotemporal models for multi-task learning and cross-domain applications. Abstract: Traditional spatiotemporal models generally rely on task-specific architectures, which limit their generalizability and scalability across diverse tasks due to domain-specific design requirements. In this paper, we introduce UniSTD, a unified Transformer-based framework for spatiotemporal modeling, which is inspired by advances in recent foundation models with the two-stage pretraining-then-adaptation paradigm. Specifically, our work demonstrates that task-agnostic pretraining on 2D vision and vision-text datasets can build a generalizable model foundation for spatiotemporal learning, followed by specialized joint training on spatiotemporal datasets to enhance task-specific adaptability. To improve the learning capabilities across domains, our framework employs a rank-adaptive mixture-of-experts adaptation by using fractional interpolation to relax the discrete variables so that they can be optimized in the continuous space. Additionally, we introduce a temporal module to incorporate temporal dynamics explicitly. We evaluate our approach on a large-scale dataset covering 10 tasks across 4 disciplines, demonstrating that a unified spatiotemporal model can achieve scalable, cross-task learning and support up to 10 tasks simultaneously within one model while reducing training costs in multi-domain applications. Code will be available at https://github.com/1hunters/UniSTD.

Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning

Huajie Tan,Yuheng Ji,Xiaoshuai Hao,Minglan Lin,Pengwei Wang,Zhongyuan Wang,Shanghang Zhang

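Task: Propose Reason-RFT, a reinforcement fine-tuning framework that improves the generalization of vision-language models on visual reasoning tasks.

Motivation: Existing methods based on Chain-of-Thought supervised fine-tuning can lead to overfitting and cognitive rigidity, limiting the practical applicability of models to cross-domain visual reasoning tasks.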

Details

Method: A two-phase training framework: (1) supervised fine-tuning on carefully annotated Chain-of-Thought data; (2) reinforcement learning based on Group Relative Policy Optimization that generates multiple reasoning-response pairs. Result: Achieves state-of-the-art performance across multiple tasks, generalizes well, and excels in few-shot learning scenarios. Conclusion: Reason-RFT substantially improves performance, generalization, and data efficiency on visual reasoning tasks, offering a new direction for practical applications of vision-language models. Abstract: Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods improve VLM reasoning via Chain-of-Thought (CoT) supervised fine-tuning, using meticulously annotated training data to enhance visual reasoning capabilities. However, this training paradigm may lead to overfitting and cognitive rigidity, restricting the model's ability to transfer visual reasoning skills across domains and limiting its real-world applicability. To address these limitations, we propose Reason-RFT, a novel reinforcement fine-tuning framework that significantly enhances generalization capabilities in visual reasoning tasks. Reason-RFT introduces a two-phase training framework for visual reasoning: (1) Supervised Fine-Tuning (SFT) with curated Chain-of-Thought (CoT) data activates the reasoning potential of Vision-Language Models (VLMs), followed by (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning that generates multiple reasoning-response pairs, significantly enhancing generalization in visual reasoning tasks. To evaluate Reason-RFT's visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation. Experimental results demonstrate Reason-RFT's three key advantages: (1) Performance Enhancement: achieving state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models; (2) Generalization Superiority: consistently maintaining robust performance across diverse tasks and domains, outperforming alternative training paradigms; (3) Data Efficiency: excelling in few-shot learning scenarios while surpassing full-dataset SFT baselines.
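The GRPO phase named in the abstract scores several sampled responses per prompt and normalizes each reward against its own group. A minimal sketch of that group-relative advantage step follows, assuming scalar rewards; the function name, the example rewards, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar per sampled response for a single prompt.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a sample std is also plausible
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive positive advantages and the rest receive negative ones, so no separate value network is needed to supply a baseline.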

Disentangled Source-Free Personalization for Facial Expression Recognition with Neutral Target Data

Masoumeh Sharafi,Emma Ollivier,Muhammad Osama Zeeshan,Soufiane Belharbi,Marco Pedersoli,Alessandro Lameiras Koerich,Simon Bacon,Eric Granger

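Task: Propose a disentangled source-free domain adaptation method (DSFDA) to address the challenge of missing non-neutral expression data in the target domain.

Motivation: In applications such as healthcare, collecting target-domain data that covers all expression classes is difficult, whereas neutral control videos are easy to obtain, so a method is needed that adapts the model using neutral videos.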

Details

Method: DSFDA generates the missing non-neutral target data from neutral target control videos, disentangles expression and identity features, and uses a self-supervision strategy to reconstruct target images. Result: The method improves model accuracy in handling expression variability in the target domain. Conclusion: DSFDA effectively addresses missing target-domain data and improves facial expression recognition performance. Abstract: Facial Expression Recognition (FER) from videos is a crucial task in various application areas, such as human-computer interaction and health monitoring (e.g., pain, depression, fatigue, and stress). Beyond the challenges of recognizing subtle emotional or health states, the effectiveness of deep FER models is often hindered by the considerable variability of expressions among subjects. Source-free domain adaptation (SFDA) methods are employed to adapt a pre-trained source model using only unlabeled target domain data, thereby avoiding data privacy and storage issues. Typically, SFDA methods adapt to a target domain dataset corresponding to an entire population and assume it includes data from all recognition classes. However, collecting such comprehensive target data can be difficult or even impossible for FER in healthcare applications. In many real-world scenarios, it may be feasible to collect a short neutral control video (displaying only neutral expressions) for target subjects before deployment. These videos can be used to adapt a model to better handle the variability of expressions among subjects. This paper introduces the Disentangled Source-Free Domain Adaptation (DSFDA) method to address the SFDA challenge posed by missing target expression data. DSFDA leverages data from a neutral target control video for end-to-end generation and adaptation of target data with missing non-neutral data. Our method learns to disentangle features related to expressions and identity while generating the missing non-neutral target data, thereby enhancing model accuracy. Additionally, our self-supervision strategy improves model adaptation by reconstructing target images that maintain the same identity and source expression.

Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields

Shijie Zhou,Hui Ren,Yijia Weng,Shuwang Zhang,Zhen Wang,Dejia Xu,Zhiwen Fan,Suya You,Zhangyang Wang,Leonidas Guibas,Achuta Kadambi

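Task: Extend the functionality of 2D vision foundation models into the 4D realm using only monocular video input.

Motivation: Because large-scale annotated 3D/4D or multi-view datasets are scarce, enabling free-form interaction and high-level semantic operations on complex 3D/4D scenes is challenging.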

Details

Method: Proposes the Feature4X framework, which uses a dynamic optimization strategy and model-conditioned 4D feature field distillation to lift features from video foundation models into an explicit 4D feature field. Result: Enables novel-view segmentation, geometric and appearance scene editing, and free-form visual question answering (VQA) across time steps. Conclusion: Feature4X provides a foundation for scalable, contextually and spatiotemporally aware agentic AI applications, supporting immersive interaction with dynamic 4D scenes. Abstract: Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality from 2D vision foundation models into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g. SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.

BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation

Yulu Pan,Ce Zhang,Gedas Bertasius

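Task: Propose and build BASKET, a large-scale basketball video dataset for fine-grained skill estimation.

Motivation: Existing skill estimation datasets lack diversity and scale; BASKET fills this gap with diverse player data from around the world.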

Details

Method: BASKET contains 4,477 hours of video covering 20 basketball skills for 32,232 players, and requires models to predict skill levels through video analysis. Result: Current state-of-the-art video models perform poorly on this task, lagging significantly behind the human baseline. Conclusion: BASKET can serve as a resource for developing new video models with advanced long-range, fine-grained recognition capabilities, and supports domain-specific basketball applications. Abstract: We present BASKET, a large-scale basketball video dataset for fine-grained skill estimation. BASKET contains 4,477 hours of video capturing 32,232 basketball players from all over the world. Compared to prior skill estimation datasets, our dataset includes a massive number of skilled participants with unprecedented diversity in terms of gender, age, skill level, geographical location, etc. BASKET includes 20 fine-grained basketball skills, challenging modern video recognition models to capture the intricate nuances of player skill through in-depth video analysis. Given a long highlight video (8-10 minutes) of a particular player, the model needs to predict the skill level (e.g., excellent, good, average, fair, poor) for each of the 20 basketball skills. Our empirical analysis reveals that the current state-of-the-art video models struggle with this task, significantly lagging behind the human baseline. We believe that BASKET could be a useful resource for developing new video models with advanced long-range, fine-grained recognition capabilities. In addition, we hope that our dataset will be useful for domain-specific applications such as fair basketball scouting, personalized player development, and many others. Dataset and code are available at https://github.com/yulupan00/BASKET.

Yan-Bo Lin,Kevin Lin,Zhengyuan Yang,Linjie Li,Jianfeng Wang,Chung-Ching Lin,Xiaofei Wang,Gedas Bertasius,Lijuan Wang

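Task: Introduce the task of zero-shot audio-video editing, which modifies original audio-visual content according to a text prompt without additional model training.

Motivation: Existing zero-shot audio and video editing methods fall short in synchronization and coherence between modalities, leading to inconsistent results.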

Details

Method: Proposes the AvED framework, which leverages audio-video interactions to achieve synchronized and coherent edits. Result: Performs strongly on AvED-Bench and the OAVE dataset, validating its generalization ability. Conclusion: AvED addresses synchronization and coherence problems in audio-video editing and demonstrates strong zero-shot editing capability. Abstract: In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset, validating its generalization capabilities. Results are available at https://genjib.github.io/project_page/AVED/index.html

FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks

Jinwei Li,Huan-ang Gao,Wenyi Li,Haohan Chi,Chenyu Liu,Chenxi Du,Yiqian Liu,Mingju Gao,Guiyu Zhang,Zongzheng Zhang,Li Yi,Yao Yao,Jingwei Zhao,Hongyang Li,Yikai Wang,Hao Zhao

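Task: Propose FB-4D, a new framework that enhances spatial-temporal consistency in dynamic 3D content generation.

Motivation: Achieving high-fidelity 4D generation with strong spatial-temporal consistency remains challenging in dynamic 3D content generation.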

Details

Method: Introduces a Feature Bank mechanism that stores and fuses features from previous frames, updates the bank with a dynamic merging mechanism, and generates reference sequences through multiple autoregressive iterations. Result: FB-4D significantly outperforms existing methods in rendering quality, spatial-temporal consistency, and robustness, and is even on par with training-based methods. Conclusion: The Feature Bank mechanism effectively improves the consistency and performance of dynamic 3D generation. Abstract: With the rapid advancements in diffusion models and 3D generation techniques, dynamic 3D content generation has become a crucial research area. However, achieving high-fidelity 4D (dynamic 3D) generation with strong spatial-temporal consistency remains a challenging task. Inspired by recent findings that pretrained diffusion features capture rich correspondences, we propose FB-4D, a novel 4D generation framework that integrates a Feature Bank mechanism to enhance both spatial and temporal consistency in generated frames. In FB-4D, we store features extracted from previous frames and fuse them into the process of generating subsequent frames, ensuring consistent characteristics across both time and multiple views. To ensure a compact representation, the Feature Bank is updated by a proposed dynamic merging mechanism. Leveraging this Feature Bank, we demonstrate for the first time that generating additional reference sequences through multiple autoregressive iterations can continuously improve generation performance. Experimental results show that FB-4D significantly outperforms existing methods in terms of rendering quality, spatial-temporal consistency, and robustness. It surpasses all multi-view generation tuning-free approaches by a large margin and achieves performance on par with training-based methods.

Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

Tianqi Liu,Zihao Huang,Zhaoxi Chen,Guangcong Wang,Shoukang Hu,Liao Shen,Huiqiang Sun,Zhiguo Cao,Wei Li,Ziwei Liu

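Task: Propose Free4D, a tuning-free framework for generating 4D scenes from a single image.

Motivation: Existing methods are either limited to object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization due to the scarcity of 4D scene data.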

Details

Method: (1) Animate the input image with an image-to-video diffusion model and initialize the 4D geometric structure; (2) design an adaptive guidance mechanism comprising a point-guided denoising strategy (spatial consistency) and a latent replacement strategy (temporal consistency); (3) propose a modulation-based refinement to improve the consistency of the generated 4D representation. Result: The resulting 4D representation supports real-time, controllable rendering, significantly advancing single-image 4D scene generation. Conclusion: By leveraging pre-trained foundation models, Free4D achieves efficient and generalizable 4D scene generation. Abstract: We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models for consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatial-temporal consistent multiview videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advancement in single-image-based 4D scene generation.

Mapping fMRI Signal and Image Stimuli in an Artificial Neural Network Latent Space: Bringing Artificial and Natural Minds Together

Cesare Maria Dalbagno,Manuel de Castro Ribeiro Jardim,Mihnea Angheluţă

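Task: Investigate whether latent space representations of visual stimuli and fMRI data share common information.

Motivation: Decoding and reconstructing stimuli from fMRI data is a challenge in AI and neuroscience, with significant implications for understanding neural representations and improving the interpretability of artificial neural networks (ANNs).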

Details

Method: Compares the latent spaces of an autoencoder (AE) trained on fMRI data and a vision transformer (ViT) trained on image data using representational similarity analysis (RSA). Result: The latent spaces of the two domains appear different, but these preliminary results are inconclusive. Conclusion: Further research is needed to explore this relationship in more depth. Abstract: The goal of this study is to investigate whether latent space representations of visual stimuli and fMRI data share common information. Decoding and reconstructing stimuli from fMRI data remains a challenge in AI and neuroscience, with significant implications for understanding neural representations and improving the interpretability of Artificial Neural Networks (ANNs). In this preliminary study, we investigate the feasibility of such reconstruction by examining the similarity between the latent spaces of one autoencoder (AE) and one vision transformer (ViT) trained on fMRI and image data, respectively. Using representational similarity analysis (RSA), we found that the latent spaces of the two domains appear different. However, these initial findings are inconclusive, and further research is needed to explore this relationship more thoroughly.
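Representational similarity analysis compares two latent spaces indirectly: each space is reduced to a matrix of pairwise dissimilarities over the same stimuli, and the two matrices are then correlated. The sketch below shows one common variant, assuming Euclidean distance for the dissimilarities and a Spearman rank correlation between them. The abstract does not state which distance or correlation the authors used, and all function names and data here are hypothetical.

```python
import math


def rdm_upper(latents):
    """Upper triangle of the pairwise Euclidean distance matrix (one row per stimulus)."""
    n = len(latents)
    return [math.dist(latents[i], latents[j]) for i in range(n) for j in range(i + 1, n)]


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rsa_score(latents_a, latents_b):
    """Spearman correlation between the two dissimilarity structures."""
    return pearson(ranks(rdm_upper(latents_a)), ranks(rdm_upper(latents_b)))


# Hypothetical latents for the same four stimuli in two spaces of different width.
space_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
space_b = [[2.0 * x, 2.0 * y, 0.0] for x, y in space_a]
score = rsa_score(space_a, space_b)
```

Because only the dissimilarity structure is compared, the two spaces may have different dimensionality, which is what makes the approach applicable to an fMRI autoencoder and an image ViT.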

Optimizing Breast Cancer Detection in Mammograms: A Comprehensive Study of Transfer Learning, Resolution Reduction, and Multi-View Classification

Daniel G. P. Petrini,Hae Yong Kim

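Task: Explore five key questions in applying machine learning to breast cancer detection in mammograms.

Motivation: Current approaches typically use a two-stage transfer learning process, but its effectiveness and necessity have not been studied systematically.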

Details

Method: Systematically investigates five key questions: the necessity of the intermediate patch classifier, backbone model performance, resolution-reduction techniques, the effect of two-view classifiers, and the impact of image quality. Result: Develops single-view and two-view classifier models that outperform previous results. Conclusion: The findings provide insights into model architecture and transfer learning strategies, contributing to more accurate and efficient mammogram analysis. Abstract: This study explores open questions in the application of machine learning for breast cancer detection in mammograms. Current approaches often employ a two-stage transfer learning process: first, adapting a backbone model trained on natural images to develop a patch classifier, which is then used to create a single-view whole-image classifier. Additionally, many studies leverage both mammographic views to enhance model performance. In this work, we systematically investigate five key questions: (1) Is the intermediate patch classifier essential for optimal performance? (2) Do backbone models that excel in natural image classification consistently outperform others on mammograms? (3) When reducing mammogram resolution for GPU processing, does the learn-to-resize technique outperform conventional methods? (4) Does incorporating both mammographic views in a two-view classifier significantly improve detection accuracy? (5) How do these findings vary when analyzing low-quality versus high-quality mammograms? By addressing these questions, we developed models that outperform previous results for both single-view and two-view classifiers. Our findings provide insights into model architecture and transfer learning strategies contributing to more accurate and efficient mammogram analysis.

Thin-Shell-SfT: Fine-Grained Monocular Non-rigid 3D Surface Tracking with Neural Deformation Fields

Navami Kairanda,Marc Habermann,Shanthika Naik,Christian Theobalt,Vladislav Golyanik

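Task: Reconstruct the 3D shape of highly deformable surfaces (e.g., cloth) from monocular RGB video with high accuracy.

Motivation: Existing methods cannot recover fine-grained surface details consistently and accurately, and suffer from discrete surface representations, error propagation caused by frame-by-frame optimization, and poor gradients from mesh-based differentiable renderers.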

Details

Method: Proposes ThinShell-SfT, which represents the surface as an implicit, continuous spatiotemporal neural field, incorporates a continuous thin-shell physics prior based on the Kirchhoff-Love model, and uses 3D Gaussian splatting for differentiable rendering. Result: ThinShell-SfT outperforms existing methods qualitatively and quantitatively and recovers fine-grained surface details more accurately. Conclusion: Through a continuous surface representation, a tailored physics prior, and 3D Gaussian splatting, ThinShell-SfT significantly improves 3D reconstruction accuracy for highly deformable surfaces. Abstract: 3D reconstruction of highly deformable surfaces (e.g. cloths) from monocular RGB videos is a challenging problem, and no solution provides a consistent and accurate recovery of fine-grained surface details. To account for the ill-posed nature of the setting, existing methods use deformation models with statistical, neural, or physical priors. They also predominantly rely on nonadaptive discrete surface representations (e.g. polygonal meshes), perform frame-by-frame optimisation leading to error propagation, and suffer from poor gradients of the mesh-based differentiable renderers. Consequently, fine surface details such as cloth wrinkles are often not recovered with the desired accuracy. In response to these limitations, we propose ThinShell-SfT, a new method for non-rigid 3D tracking that represents a surface as an implicit and continuous spatiotemporal neural field. We incorporate a continuous thin shell physics prior based on the Kirchhoff-Love model for spatial regularisation, which starkly contrasts the discretised alternatives of earlier works. Lastly, we leverage 3D Gaussian splatting to differentiably render the surface into image space and optimise the deformations based on analysis-by-synthesis principles. Our Thin-Shell-SfT outperforms prior works qualitatively and quantitatively thanks to our continuous surface formulation in conjunction with a specially tailored simulation prior and surface-induced 3D Gaussians. See our project page at https://4dqv.mpiinf.mpg.de/ThinShellSfT.

Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis

Yu Xin,Gorkem Can Ates,Kuang Gong,Wei Shao

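Task: Extend vision-language models (VLMs) to 3D medical image analysis, addressing the challenges of computational cost and feature alignment.

Motivation: In 3D medical image analysis, VLMs face high computational demands and difficulty aligning 3D spatial features with clinical text.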

Details

Method: Proposes Med3DVLM, which comprises the efficient DCFormer encoder, the SigLIP contrastive learning strategy, and a dual-stream MLP-Mixer projector. Result: Performs strongly on the M3D dataset, reaching 61.00% R@1 for image-text retrieval and a METEOR score of 36.42% for report generation, with notable gains on VQA tasks. Conclusion: Med3DVLM bridges the gap between 3D imaging and language and supports multi-task clinical reasoning. Abstract: Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging due to the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations: (1) DCFormer, an efficient encoder that uses decomposed 3D convolutions to capture fine-grained spatial features at scale; (2) SigLIP, a contrastive learning strategy with pairwise sigmoid loss that improves image-text alignment without relying on large negative batches; and (3) a dual-stream MLP-Mixer projector that fuses low- and high-level image features with text embeddings for richer multi-modal representations. We evaluate our model on the M3D dataset, which includes radiology reports and VQA data for 120,084 3D medical images. Results show that Med3DVLM achieves superior performance across multiple benchmarks. For image-text retrieval, it reaches 61.00% R@1 on 2,000 samples, significantly outperforming the current state-of-the-art M3D model (19.10%). For report generation, it achieves a METEOR score of 36.42% (vs. 14.38%). In open-ended visual question answering (VQA), it scores 36.76% METEOR (vs. 33.58%), and in closed-ended VQA, it achieves 79.95% accuracy (vs. 75.78%). These results highlight Med3DVLM's ability to bridge the gap between 3D imaging and language, enabling scalable, multi-task reasoning across clinical applications. Our code is publicly available at https://github.com/mirthAI/Med3DVLM.
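The pairwise sigmoid loss mentioned in the abstract treats every image-text pair in a batch as an independent binary decision, which is why it does not need a batch-wide softmax over negatives. The following is a minimal pure-Python sketch of that loss, assuming L2-normalized embeddings; the temperature `t`, bias `b`, and the toy embeddings are illustrative values chosen here, not settings reported for Med3DVLM.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def l2_normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def pairwise_sigmoid_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Mean over the batch of the summed per-pair binary log-losses.

    Pair (i, j) is labeled +1 when i == j (matching image and text) and -1 otherwise.
    `t` and `b` are assumed scalars; in practice they would be learned.
    """
    imgs = [l2_normalize(v) for v in img_emb]
    txts = [l2_normalize(v) for v in txt_emb]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = t * sum(a * c for a, c in zip(imgs[i], txts[j])) + b
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / n


# Toy batch: correctly paired embeddings versus the same texts in swapped order.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = pairwise_sigmoid_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = pairwise_sigmoid_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Each term depends only on its own pair, so the loss can be accumulated over chunks of the batch, which is one reason it is attractive when large batches of 3D volumes do not fit in memory.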

Learning Scene-Level Signed Directional Distance Function with Ellipsoidal Priors and Neural Residuals

Zhirui Dai,Hojoon Shin,Yulun Tian,Ki Myung Brian Lee,Nikolay Atanasov

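Task: Explore a directional formulation of signed distance, called the signed directional distance function (SDDF), for dense geometric environment representation.

Motivation: Explicit discrete representations (such as meshes, point clouds, and voxels) fall short of implicit continuous neural representations (such as SDF and NeRF) in reconstruction fidelity, efficiency, and differentiability. SDDF combines advantages of SDF and NeRF: it directly provides the distance to the surface along a direction, enabling efficient view synthesis.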

Details

Method: Proposes a differentiable hybrid representation that combines explicit ellipsoid priors with implicit neural residuals to learn and predict scene-level SDDF efficiently. Result: SDDF is competitive with state-of-the-art neural implicit scene models in reconstruction accuracy and rendering efficiency, while supporting differentiable view prediction for robot trajectory optimization. Conclusion: SDDF is an effective dense geometric environment representation that offers both high-fidelity reconstruction and efficient view synthesis. Abstract: Dense geometric environment representations are critical for autonomous mobile robot navigation and exploration. Recent work shows that implicit continuous representations of occupancy, signed distance, or radiance learned using neural networks offer advantages in reconstruction fidelity, efficiency, and differentiability over explicit discrete representations based on meshes, point clouds, and voxels. In this work, we explore a directional formulation of signed distance, called signed directional distance function (SDDF). Unlike signed distance function (SDF) and similar to neural radiance fields (NeRF), SDDF has a position and viewing direction as input. Like SDF and unlike NeRF, SDDF directly provides distance to the observed surface along the direction, rather than integrating along the view ray, allowing efficient view synthesis. To learn and predict scene-level SDDF efficiently, we develop a differentiable hybrid representation that combines explicit ellipsoid priors and implicit neural residuals. This approach allows the model to effectively handle large distance discontinuities around obstacle boundaries while preserving the ability for dense high-fidelity prediction. We show that SDDF is competitive with the state-of-the-art neural implicit scene models in terms of reconstruction accuracy and rendering efficiency, while allowing differentiable view prediction for robot trajectory optimization.
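To make the "distance along a viewing direction" idea concrete, the sketch below computes the closed-form distance from a viewpoint to a single axis-aligned ellipsoid along a unit direction. It is only an illustration of why an ellipsoid is a convenient explicit prior. It covers viewpoints outside the ellipsoid, leaves the sign convention for interior points unspecified (the abstract does not give it), and omits the neural residual entirely. The function name and the axis-aligned assumption are introduced here.

```python
import math


def directional_distance_to_ellipsoid(p, d, center, radii):
    """Distance from point p along unit direction d to the first ellipsoid hit, or None.

    The ellipsoid is axis-aligned with the given center and per-axis radii.
    """
    # Rescale coordinates so the ellipsoid becomes the unit sphere.
    q = [(pi - ci) / ri for pi, ci, ri in zip(p, center, radii)]
    u = [di / ri for di, ri in zip(d, radii)]
    # Solve |q + s*u|^2 = 1 for s; s is still a distance in the original space
    # because p + s*d maps linearly to q + s*u.
    a = sum(x * x for x in u)
    b = sum(x * y for x, y in zip(q, u))
    c = sum(x * x for x in q) - 1.0
    disc = b * b - a * c
    if disc < 0:
        return None  # the ray misses the ellipsoid
    root = math.sqrt(disc)
    for s in ((-b - root) / a, (-b + root) / a):
        if s >= 0:
            return s
    return None  # the ellipsoid lies behind the viewpoint


# Hypothetical obstacle: radii (2, 1, 1) centered at the origin.
hit = directional_distance_to_ellipsoid((-5.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
miss = directional_distance_to_ellipsoid((-5.0, 3.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
```

The jump between a finite distance and a miss at the silhouette of the ellipsoid is the kind of discontinuity around obstacle boundaries that the abstract says the explicit prior helps handle.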

Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors

Yuke Lou,Yiming Wang,Zhen Wu,Rui Zhao,Wenjia Wang,Mingyi Shi,Taku Komura

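Task: Propose a zero-shot HOI synthesis framework that does not rely on training with limited 3D HOI datasets.

Motivation: Because 3D HOI data is complex and costly to acquire, existing methods are limited by the narrow diversity of object types and interaction patterns in training datasets.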

Details

Method: Leverages HOI knowledge from pre-trained multimodal models: generates 2D HOI image sequences from a text description, lifts them to 3D HOI milestones, and refines the result with physics-based tracking. Result: Experiments show that the method generates open-vocabulary HOIs with physical realism and semantic diversity. Conclusion: The framework offers an efficient and diverse solution for HOI synthesis. Abstract: Human-object interaction (HOI) synthesis is important for various applications, ranging from virtual reality to robotics. However, acquiring 3D HOI data is challenging due to its complexity and high cost, limiting existing methods to the narrow diversity of object types and interaction patterns in training datasets. This paper proposes a novel zero-shot HOI synthesis framework without relying on end-to-end training on currently limited 3D HOI datasets. The core idea of our method lies in leveraging extensive HOI knowledge from pre-trained Multimodal Models. Given a text description, our system first obtains temporally consistent 2D HOI image sequences using image or video generation models, which are then uplifted to 3D HOI milestones of human and object poses. We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain the object poses from 2D HOI images. Our estimation method is adaptive to various object templates obtained from text-to-3D models or online retrieval. A physics-based tracking of the 3D HOI kinematic milestone is further applied to refine both body motions and object poses, yielding more physically plausible HOI generation results. The experimental results demonstrate that our method is capable of generating open-vocabulary HOIs with physical realism and semantic diversity.

Network Inversion for Generating Confidently Classified Counterfeits

Pirzada Suhail,Amit Sethi

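Task: Extend network inversion techniques to generate synthetic samples that the model classifies confidently but that differ significantly from the training data.

Motivation: Understanding a model's decision boundaries and behavior requires generating confidently classified inputs, but traditional methods struggle to ensure that samples are both confidently classified and distinct from the training data distribution.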

Details

Method: Changes the generator's conditioning mechanism from soft vector conditioning to one-hot vector conditioning and applies Kullback-Leibler divergence (KLD) to encourage samples that are both plausible and confidently classified. Result: Successfully generates Confidently Classified Counterfeits, challenging the assumption that high-confidence predictions always correspond to the training data distribution. Conclusion: Generating such samples yields deeper insight into model limitations and decision-making, improving the safety and reliability of machine learning systems. Abstract: In machine learning, especially with vision classifiers, generating inputs that are confidently classified by the model is essential for understanding its decision boundaries and behavior. However, creating such samples that are confidently classified yet distinct from the training data distribution is a challenge. Traditional methods often modify existing inputs, but they don't always ensure confident classification. In this work, we extend network inversion techniques to generate Confidently Classified Counterfeits, i.e., synthetic samples that are confidently classified by the model despite being significantly different from the training data. We achieve this by modifying the generator's conditioning mechanism from soft vector conditioning to one-hot vector conditioning and applying Kullback-Leibler divergence (KLD) between the one-hot vectors and the classifier's output distribution. This encourages the generator to produce samples that are both plausible and confidently classified. Generating Confidently Classified Counterfeits is crucial for ensuring the safety and reliability of machine learning systems, particularly in safety-critical applications where models must exhibit confidence only on data within the training distribution. By generating such counterfeits, we challenge the assumption that high-confidence predictions are always indicative of in-distribution data, providing deeper insights into the model's limitations and decision-making process.
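A short sketch may help show why a KL term against a one-hot target pushes the classifier toward confident predictions. With a one-hot target, every term with zero target mass vanishes, so the divergence reduces to the negative log-probability of the conditioned class. The direction KL(one-hot || output) is an assumption made here, since the abstract does not state it, and the logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_onehot_to_output(target_class, logits):
    """KL(one_hot(target_class) || softmax(logits)).

    Only the target class carries mass, so the sum collapses to -log p[target_class].
    """
    probs = softmax(logits)
    return -math.log(probs[target_class])


# Hypothetical classifier outputs for a generated sample conditioned on class 0.
uncertain = kl_onehot_to_output(0, [0.0, 0.0, 0.0])
confident = kl_onehot_to_output(0, [8.0, 0.0, 0.0])
```

The divergence approaches zero only as the probability of the conditioned class approaches one, so minimizing it through the generator favors samples the classifier labels with high confidence.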

Qwen2.5-Omni Technical Report

Jin Xu,Zhifang Guo,Jinzheng He,Hangrui Hu,Ting He,Shuai Bai,Keqin Chen,Jialin Wang,Yang Fan,Kai Dang,Bin Zhang,Xiong Wang,Yunfei Chu,Junyang Lin

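Task: Design and implement Qwen2.5-Omni, an end-to-end multimodal model that perceives text, images, audio, and video and generates text and natural speech responses in a streaming manner.

Motivation: Enable streaming processing of multimodal inputs, synchronize video and audio timestamps, and avoid interference between text and speech generation.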

Details

Method: Uses block-wise audio and visual encoders, proposes the TMRoPE position embedding, and introduces the Thinker-Talker architecture, in which Thinker handles text generation and Talker uses Thinker's hidden representations to generate audio tokens. Result: Qwen2.5-Omni is comparable to Qwen2.5-VL, outperforms Qwen2-Audio, achieves state-of-the-art results on multimodal benchmarks, and performs strongly in speech generation. Conclusion: Qwen2.5-Omni performs well in multimodal perception and generation, with notable strengths in streaming processing and speech generation. Abstract: In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.

3D Convolutional Neural Networks for Improved Detection of Intracranial bleeding in CT Imaging

Bargava Subramanian,Naveen Kumarasami,Praveen Shastry,Kalyan Sivasailam,Anandakumar D,Elakkiya R,Harsha KG,Rithanya V,Harini T,Afshin Hussain,Kishore Prasath Venkatesh

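Task: Explore the use of artificial intelligence for intracranial bleeding detection in emergency settings.

Motivation: Traditional imaging-based detection of intracranial bleeding is slow and prone to variability; AI can analyze medical images quickly, improving diagnostic speed and accuracy.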

Details

Method: Uses a U-shaped 3D convolutional neural network (CNN) to automatically detect and classify intracranial bleeding in volumetric CT scans, with preprocessing such as CLAHE and intensity normalization. Result: The model performs well across bleed types, with precision, recall, and accuracy exceeding 90% in most cases, including 96% precision for epidural hemorrhages and 94% accuracy for subarachnoid hemorrhages. Conclusion: The U-shaped 3D CNN offers a scalable solution for automated intracranial bleeding detection; future work will optimize real-time processing and integrate multimodal data to improve clinical applicability. Abstract: Background: Intracranial bleeding (IB) is a life-threatening condition caused by traumatic brain injuries, including epidural, subdural, subarachnoid, and intraparenchymal hemorrhages. Rapid and accurate detection is crucial to prevent severe complications. Traditional imaging can be slow and prone to variability, especially in high-pressure scenarios. Artificial Intelligence (AI) provides a solution by quickly analyzing medical images, identifying subtle hemorrhages, and flagging urgent cases. By enhancing diagnostic speed and accuracy, AI improves workflows and patient care. This article explores AI's role in transforming IB detection in emergency settings. Methods: A U-shaped 3D Convolutional Neural Network (CNN) automates IB detection and classification in volumetric CT scans. Advanced preprocessing, including CLAHE and intensity normalization, enhances image quality. The architecture preserves spatial and contextual details for precise segmentation. A dataset of 2,912 annotated CT scans was used for training and evaluation. Results: The model achieved high performance across major bleed types, with precision, recall, and accuracy exceeding 90 percent in most cases, including 96 percent precision for epidural hemorrhages and 94 percent accuracy for subarachnoid hemorrhages. Its ability to classify and localize hemorrhages highlights its clinical reliability. Conclusion: This U-shaped 3D CNN offers a scalable solution for automating IB detection, reducing diagnostic delays, and improving emergency care outcomes. Future work will expand dataset diversity, optimize real-time processing, and integrate multimodal data for enhanced clinical applicability.
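Since the abstract reports precision, recall, and accuracy separately per bleed type, it may help to recall how the three differ when computed from the same confusion counts. The sketch below uses made-up counts for a single hypothetical bleed type; they are not taken from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from binary confusion counts."""
    precision = tp / (tp + fp)  # share of predicted bleeds that are real
    recall = tp / (tp + fn)     # share of real bleeds that were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Hypothetical counts for one bleed type on 100 scans (illustrative only).
precision, recall, accuracy = classification_metrics(tp=45, fp=5, fn=3, tn=47)
```

In an emergency triage setting, recall is often the figure to watch first, because a missed hemorrhage (a false negative) is typically costlier than a false alarm.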

Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics

Lee Chae-Yeon,Oh Hyun-Bin,Han EunGi,Kim Sung-Bin,Suekyeong Nam,Tae-Hyun Oh

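Task: Propose a speech-driven 3D talking head generation method that focuses on perceptual alignment of lip movements.

Motivation: Existing models fall short in capturing the perceptual alignment between speech characteristics and lip movements, which must satisfy three key criteria: temporal synchronization, lip readability, and expressiveness.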

Details

Method: Introduces a speech-mesh synchronized representation that captures intricate correspondences between speech signals and 3D face meshes, and applies it to existing models as a perceptual loss. Result: Experiments show that models trained with the perceptual loss improve significantly in temporal synchronization, lip readability, and expressiveness. Conclusion: The proposed speech-mesh synchronized representation and perceptual loss effectively improve perceptual alignment in 3D talking head generation models. Abstract: Recent advancements in speech-driven 3D talking head generation have made significant progress in lip synchronization. However, existing models still struggle to capture the perceptual alignment between varying speech characteristics and corresponding lip movements. In this work, we claim that three criteria -- Temporal Synchronization, Lip Readability, and Expressiveness -- are crucial for achieving perceptually accurate lip movements. Motivated by our hypothesis that a desirable representation space exists to meet these three criteria, we introduce a speech-mesh synchronized representation that captures intricate correspondences between speech signals and 3D face meshes. We found that our learned representation exhibits desirable characteristics, and we plug it into existing models as a perceptual loss to better align lip movements to the given speech. In addition, we utilize this representation as a perceptual metric and introduce two other physically grounded lip synchronization metrics to assess how well the generated 3D talking heads align with these three criteria. Experiments show that training 3D talking head generation models with our perceptual loss significantly improves all three aspects of perceptually accurate lip synchronization. Codes and datasets are available at https://perceptual-3d-talking-head.github.io/.

AI-Driven MRI Spine Pathology Detection: A Comprehensive Deep Learning Approach for Automated Diagnosis in Diverse Clinical Settings

Bargava Subramanian,Naveen Kumarasami,Praveen Shastry,Raghotham Sripadraj,Kalyan Sivasailam,Anandakumar D,Abinaya Ramachandran,Sudhir MP,Gunakutti G,Kishore Prasath Venkatesh

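Task: Develop an autonomous AI system for MRI spine pathology detection.

Motivation: Improve the efficiency and accuracy of spinal pathology diagnosis across India's diverse healthcare settings.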

Details

Method: Integrates advanced architectures including Vision Transformers, U-Net with cross-attention, MedSAM, and Cascade R-CNN, trained on 2 million MRI spine scans. Result: The AI system reaches up to 97.9% on multi-pathology detection and 98.0% accuracy on normal vs. abnormal classification, and was deployed across 13 healthcare enterprises. Conclusion: The system's high precision and scalability address diagnostic gaps, optimize radiology workflows, and improve patient care. Abstract: Study Design: This study presents the development of an autonomous AI system for MRI spine pathology detection, trained on a dataset of 2 million MRI spine scans sourced from diverse healthcare facilities across India. The AI system integrates advanced architectures, including Vision Transformers, U-Net with cross-attention, MedSAM, and Cascade R-CNN, enabling comprehensive classification, segmentation, and detection of 43 distinct spinal pathologies. The dataset is balanced across age groups, genders, and scanner manufacturers to ensure robustness and adaptability. Subgroup analyses were conducted to validate the model's performance across different patient demographics, imaging conditions, and equipment types. Performance: The AI system achieved up to 97.9 percent multi-pathology detection, demonstrating consistent performance across age, gender, and manufacturer subgroups. The normal vs. abnormal classification achieved 98.0 percent accuracy, and the system was deployed across 13 major healthcare enterprises in India, encompassing diagnostic centers, large hospitals, and government facilities. During deployment, it processed more than 100,000 MRI spine scans, leading to reduced reporting times and increased diagnostic efficiency by automating the identification of common spinal conditions. Conclusion: The AI system's high precision and recall validate its capability as a reliable tool for autonomous normal/abnormal classification, pathology segmentation, and detection. Its scalability and adaptability address critical diagnostic gaps, optimize radiology workflows, and improve patient care across varied healthcare environments in India.

Euclidean Distance to Convex Polyhedra and Application to Class Representation in Spectral Images

Antoine Bottenmuller,Florent Magaud,Arnaud Demortière,Etienne Decencière,Petr Dokladal

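Task: Estimate abundance maps from observed data.

Motivation: Linear unmixing methods perform poorly when the number of bands is too small or the spectral data are too correlated, so a more general approach is needed.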

Details

Method: Propose a spatial density function approach based on an arbitrary linear classifier, together with a mathematical formulation and an efficient algorithm for computing the Euclidean distance to polyhedral sets. Result: The method outperforms existing approaches on the Samson hyperspectral dataset, and its application to spectral images of a lithium-ion battery validates its generality and effectiveness. Conclusion: The method is superior and broadly applicable for abundance map reconstruction. Abstract: With the aim of estimating the abundance map from observations only, linear unmixing approaches are not always suitable for spectral images, especially when the number of bands is too small or when the spectra of the observed data are too correlated. To address this issue in the general case, we present a novel approach which provides an adapted spatial density function based on any arbitrary linear classifier. A robust mathematical formulation for computing the Euclidean distance to polyhedral sets is presented, along with an efficient algorithm that provides the exact minimum-norm point in a polyhedron. An empirical evaluation on the widely-used Samson hyperspectral dataset demonstrates that the proposed method surpasses state-of-the-art approaches in reconstructing abundance maps. Furthermore, its application to spectral images of a Lithium-ion battery, incompatible with linear unmixing models, validates the method's generality and effectiveness.
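
To make the notion of Euclidean distance to a polyhedron concrete, the sketch below approximates the nearest point in an intersection of half-spaces. It uses Dykstra's alternating-projection algorithm as a stand-in, not the exact minimum-norm-point algorithm from the paper, which the abstract does not describe. The function names, the half-space encoding `(a, b)` for `a.x <= b`, and the iteration count are all assumptions made for illustration.

```python
import math

def project_halfspace(y, a, b):
    """Project y onto the half-space {x : a.x <= b} (a must be nonzero)."""
    violation = sum(ai * yi for ai, yi in zip(a, y)) - b
    if violation <= 0:
        return list(y)  # already inside the half-space
    scale = violation / sum(ai * ai for ai in a)
    return [yi - scale * ai for ai, yi in zip(a, y)]

def distance_to_polyhedron(point, halfspaces, iterations=200):
    """Approximate the nearest point of a nonempty polyhedron and its distance.

    Dykstra's algorithm: cycle through the half-spaces, adding back the
    correction stored for each one before projecting onto it again.
    """
    x = list(point)
    corrections = [[0.0] * len(point) for _ in halfspaces]
    for _ in range(iterations):
        for i, (a, b) in enumerate(halfspaces):
            y = [xi + ci for xi, ci in zip(x, corrections[i])]
            x = project_halfspace(y, a, b)
            corrections[i] = [yi - xi for yi, xi in zip(y, x)]
    return x, math.dist(point, x)

# Hypothetical example: the region {x <= 1, y <= 1} and an outside point.
halfspaces = [((1.0, 0.0), 1.0), ((0.0, 1.0), 1.0)]
nearest, dist = distance_to_polyhedron((3.0, 2.0), halfspaces)
```

The result is iterative and approximate, which is the main difference from an exact method of the kind the abstract claims.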

Attention Xception UNet (AXUNet): A Novel Combination of CNN and Self-Attention for Brain Tumor Segmentation

Farzan Moodi,Fereshteh Khodadadi Shoushtari,Gelareh Valizadeh,Dornaz Mazinani,Hanieh Mobarak Salari,Hamidreza Saligheh Rad

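Task: Propose a deep learning architecture named Attention Xception UNet (AXUNet) for accurate segmentation of glioma brain tumors.

Motivation: Accurate segmentation of glioma brain tumors is crucial for diagnosis and treatment planning; deep learning offers promising solutions, but the optimal model architecture is still under investigation.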

Details

Method: Using the BraTS 2021 dataset, build the AXUNet architecture by integrating an Xception backbone with dot-product self-attention modules, and compare it with existing SOTA models. Result: AXUNet outperforms the baseline models on the test set with a mean Dice score of 93.73 and achieves the highest scores in the WT and TC regions; the highest ET score (84.92) belongs to AResUNet, with AXUNet close behind at 84.89. Conclusion: By integrating an Xception backbone with self-attention, AXUNet shows improved performance in capturing spatial and contextual information and is promising for precise tumor segmentation. Abstract: Accurate segmentation of glioma brain tumors is crucial for diagnosis and treatment planning. Deep learning techniques offer promising solutions, but optimal model architectures remain under investigation. We used the BraTS 2021 dataset, selecting T1 with contrast enhancement (T1CE), T2, and Fluid-Attenuated Inversion Recovery (FLAIR) sequences for model development. The proposed Attention Xception UNet (AXUNet) architecture integrates an Xception backbone with dot-product self-attention modules, inspired by state-of-the-art (SOTA) large language models such as Google Bard and OpenAI ChatGPT, within a UNet-shaped model. We compared AXUNet with SOTA models. Comparative evaluation on the test set demonstrated improved results over baseline models. Inception-UNet and Xception-UNet achieved mean Dice scores of 90.88 and 93.24, respectively. Attention ResUNet (AResUNet) attained a mean Dice score of 92.80, with the highest score of 84.92 for enhancing tumor (ET) among all models. Attention Gate UNet (AGUNet) yielded a mean Dice score of 90.38. AXUNet outperformed all models with a mean Dice score of 93.73. It demonstrated superior Dice scores across whole tumor (WT) and tumor core (TC) regions, achieving 92.59 for WT, 86.81 for TC, and 84.89 for ET. The integration of the Xception backbone and dot-product self-attention mechanisms in AXUNet showcases enhanced performance in capturing spatial and contextual information. The findings underscore the potential utility of AXUNet in facilitating precise tumor delineation.
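
For readers unfamiliar with the metric, the following is a minimal sketch of the Dice coefficient on flattened binary masks. The function name and the convention of returning 1.0 for two empty masks are assumptions; the abstract reports scores on a 0 to 100 scale, whereas this sketch returns a value in [0, 1].

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return 2.0 * intersection / total

# Hypothetical 4-voxel example: one overlapping voxel, three foreground voxels in total.
example = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

Multiplying the result by 100 gives a number comparable in scale to those quoted in the abstract.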

Lipschitz Constant Meets Condition Number: Learning Robust and Compact Deep Neural Networks

Yangqi Feng,Shing-Ho J. Lin,Baoyuan Gao,Xian Wei

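Task: Develop a novel joint constraint method (TSCNC) to improve the robustness of highly pruned deep neural networks.

Motivation: Highly pruned weight matrices tend to have larger condition numbers, which reduces the model's robustness to input noise.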

Details

Method: Propose TSCNC, which combines a Transformed Sparse Constraint with a Condition Number Constraint to adjust the weight distribution and reduce the condition number. Result: Experiments show that TSCNC significantly improves the robustness of DNNs with high pruning rates. Conclusion: By reducing the condition number, TSCNC effectively addresses the non-robustness of highly pruned models. Abstract: Recent research has revealed that high compression of Deep Neural Networks (DNNs), e.g., massive pruning of the weight matrix of a DNN, leads to a severe drop in accuracy and susceptibility to adversarial attacks. Integration of network pruning into an adversarial training framework has been proposed to promote adversarial robustness. It has been observed that a highly pruned weight matrix tends to be ill-conditioned, i.e., increasing the condition number of the weight matrix. This phenomenon aggravates the vulnerability of a DNN to input noise. Although a highly pruned weight matrix is considered to be able to lower the upper bound of the local Lipschitz constant to tolerate large distortion, the ill-conditionedness of such a weight matrix results in a non-robust DNN model. To overcome this challenge, this work develops novel joint constraints to adjust the weight distribution of networks, namely, the Transformed Sparse Constraint joint with Condition Number Constraint (TSCNC), which copes with smoothing distribution and differentiable constraint functions to reduce condition number and thus avoid the ill-conditionedness of weight matrices. Furthermore, our theoretical analyses unveil the relevance between the condition number and the local Lipschitz constant of the weight matrix, namely, the sharply increasing condition number becomes the dominant factor that restricts the robustness of over-sparsified models. Extensive experiments are conducted on several public datasets, and the results show that the proposed constraints significantly improve the robustness of a DNN with high pruning rates.
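
The link between pruning and conditioning can be illustrated with a toy 2x2 case. The sketch below computes the 2-norm condition number in closed form and compares a well-conditioned matrix with a copy that has one entry zeroed. The matrices are hypothetical, the helper name is an assumption, and zeroing an entry does not raise the condition number for every matrix; this example is simply one where it does.

```python
import math

def condition_number_2x2(m):
    """2-norm condition number (sigma_max / sigma_min) of a 2x2 matrix."""
    (a, b), (c, d) = m
    frob_sq = a * a + b * b + c * c + d * d   # sum of squared singular values
    det = a * d - b * c                       # product of singular values, up to sign
    disc = math.sqrt(max(0.0, frob_sq * frob_sq - 4.0 * det * det))
    sigma_max_sq = (frob_sq + disc) / 2.0
    sigma_min_sq = (frob_sq - disc) / 2.0
    if sigma_min_sq <= 0.0:
        return math.inf                       # singular matrix
    return math.sqrt(sigma_max_sq / sigma_min_sq)

dense = [[1.0, 1.0], [1.0, -1.0]]    # scaled orthogonal matrix
pruned = [[1.0, 1.0], [1.0, 0.0]]    # same matrix with one weight set to zero

cond_dense = condition_number_2x2(dense)
cond_pruned = condition_number_2x2(pruned)
```

Here the dense matrix is perfectly conditioned, while the pruned copy amplifies some input directions noticeably more than others, which is the effect the abstract associates with sensitivity to input noise.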

Exploring Robustness of Cortical Morphometry in the presence of white matter lesions, using Diffusion Models for Lesion Filling

Vinzenz Uhr,Ivan Diaz,Christian Rummel,Richard McKinley

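Task: Investigate the potential of deep learning to improve the accuracy and efficiency of cortical thickness measurement in the presence of white matter lesions.

Motivation: White matter lesions affect the output of brain segmentation methods and therefore bias cortical thickness measurements, yet their effect on deep-learning-based segmentation tools has not been studied extensively.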

Details

Method: Use a high-quality lesion filling algorithm based on denoising diffusion networks, training a pseudo-3D U-Net architecture to generate synthetic healthy tissue, and apply morphometry methods before and after lesion filling to analyze the robustness of cortical thickness measurements. Result: Methods based on deep-learning brain segmentation (Fastsurfer, DL+DiReCT, ANTsPyNet) are more robust than those using classical segmentation (Freesurfer, ANTs). Conclusion: Deep learning shows potential for improving the robustness of cortical thickness measurement, particularly in the presence of white matter lesions. Abstract: Cortical thickness measurements from magnetic resonance imaging, an important biomarker in many neurodegenerative and neurological disorders, are derived by many tools from an initial voxel-wise tissue segmentation. White matter (WM) hypointensities in T1-weighted imaging, such as those arising from multiple sclerosis or small vessel disease, are known to affect the output of brain segmentation methods and therefore bias cortical thickness measurements. These effects are well-documented among traditional brain segmentation tools but have not been studied extensively in tools based on deep-learning segmentations, which promise to be more robust. In this paper, we explore the potential of deep learning to enhance the accuracy and efficiency of cortical thickness measurement in the presence of WM lesions, using a high-quality lesion filling algorithm leveraging denoising diffusion networks. A pseudo-3D U-Net architecture trained on the OASIS dataset to generate synthetic healthy tissue, conditioned on binary lesion masks derived from the MSSEG dataset, allows realistic removal of white matter lesions in multiple sclerosis patients. By applying morphometry methods to patient images before and after lesion filling, we analysed robustness of global and regional cortical thickness measurements in the presence of white matter lesions. Methods based on a deep learning-based segmentation of the brain (Fastsurfer, DL+DiReCT, ANTsPyNet) exhibited greater robustness than those using classical segmentation methods (Freesurfer, ANTs).

Diffusion Counterfactuals for Image Regressors

Trung Duc Ha,Sidney Bender

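Task: Propose two methods for generating counterfactual explanations for image regression tasks.

Motivation: Counterfactual explanations are widely used for classification models but remain underexplored for regression tasks, especially in the image domain.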

Details

Method: Two methods based on diffusion generative models: 1) a Denoising Diffusion Probabilistic Model operating in pixel space; 2) a Diffusion Autoencoder operating in latent space. Result: Both generate realistic, semantic, and smooth counterfactuals on CelebA-HQ and a synthetic dataset, exposing the regression model's decision-making process and its spurious correlations. Conclusion: For regression counterfactuals, feature changes depend on the region of the predicted value; latent-space counterfactuals are of higher quality, while pixel-space counterfactuals are sparser. Abstract: Counterfactual explanations have been successfully applied to create human interpretable explanations for various black-box models. They are handy for tasks in the image domain, where the quality of the explanations benefits from recent advances in generative models. Although counterfactual explanations have been widely applied to classification models, their application to regression tasks remains underexplored. We present two methods to create counterfactual explanations for image regression tasks using diffusion-based generative models to address challenges in sparsity and quality: 1) one based on a Denoising Diffusion Probabilistic Model that operates directly in pixel-space and 2) another based on a Diffusion Autoencoder operating in latent space. Both produce realistic, semantic, and smooth counterfactuals on CelebA-HQ and a synthetic data set, providing easily interpretable insights into the decision-making process of the regression model and revealing spurious correlations. We find that for regression counterfactuals, changes in features depend on the region of the predicted value. Large semantic changes are needed for significant changes in predicted values, making it harder to find sparse counterfactuals than with classifiers. Moreover, pixel space counterfactuals are more sparse while latent space counterfactuals are of higher quality and allow bigger semantic changes.

Robust Flower Cluster Matching Using The Unscented Transform

Andy Chu,Rashik Shrestha,Yu Gu,Jason N. Gross

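Task: Propose a method for matching flower clusters using descriptors generated from RGB-D data, accounting for spatial uncertainty within each cluster.

Motivation: Image registration is a serious challenge in precision robotic pollination because the plant's visual appearance changes with the pollination process and with occlusions from growth and camera angles.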

Details

Method: Use the Unscented Transform to estimate uncertainty tolerances for plant descriptors, and validate the results with a Monte Carlo simulation. Result: The proposed method achieves robust flower cluster matching in dynamic environments, supporting robotic pollination. Conclusion: The method copes with plant growth and visual changes and can improve robotic pollination in dynamic environments. Abstract: Monitoring flowers over time is essential for precision robotic pollination in agriculture. To accomplish this, a continuous spatial-temporal observation of plant growth can be done using stationary RGB-D cameras. However, image registration becomes a serious challenge due to changes in the visual appearance of the plant caused by the pollination process and occlusions from growth and camera angles. Plants flower in a manner that produces distinct clusters on branches. This paper presents a method for matching flower clusters using descriptors generated from RGB-D data and considers allowing for spatial uncertainty within the cluster. The proposed approach leverages the Unscented Transform to efficiently estimate plant descriptor uncertainty tolerances, enabling a robust image-registration process despite temporal changes. The Unscented Transform is used to handle the nonlinear transformations by propagating the uncertainty of flower positions to determine the variations in the descriptor domain. A Monte Carlo simulation is used to validate the Unscented Transform results, confirming our method's effectiveness for flower cluster matching. Therefore, it can facilitate improved robotic pollination in dynamic environments.
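
As a rough illustration of how the Unscented Transform propagates uncertainty through a nonlinear map, here is a one-dimensional sketch using the basic sigma-point scheme with a single scaling parameter `kappa`. The function name, the default `kappa = 2` (so that `n + kappa = 3`, a common choice for Gaussian inputs), and the example functions are assumptions; the paper's descriptors and parameterization are not given in the abstract.

```python
import math

def unscented_transform_1d(mean, var, f, kappa=2.0):
    """Propagate a scalar mean and variance through f using three sigma points."""
    n = 1  # state dimension
    spread = math.sqrt((n + kappa) * var)
    sigma_points = [mean, mean + spread, mean - spread]
    weights = [kappa / (n + kappa), 0.5 / (n + kappa), 0.5 / (n + kappa)]
    ys = [f(x) for x in sigma_points]
    y_mean = sum(w * y for w, y in zip(weights, ys))
    y_var = sum(w * (y - y_mean) ** 2 for w, y in zip(weights, ys))
    return y_mean, y_var

# A linear map is reproduced exactly: mean 2*1+1 and variance 2^2 * 4.
lin_mean, lin_var = unscented_transform_1d(1.0, 4.0, lambda x: 2.0 * x + 1.0)

# A nonlinear map: squaring a zero-mean, unit-variance input.
sq_mean, sq_var = unscented_transform_1d(0.0, 1.0, lambda x: x * x)
```

For the squaring example the three sigma points recover the mean and variance that a standard Gaussian input would give, which is the kind of agreement a Monte Carlo check like the one mentioned in the abstract is meant to confirm.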

UWarp: A Whole Slide Image Registration Pipeline to Characterize Scanner-Induced Local Domain Shift

Antoine Schieb,Bilal Hadjadji,Daniel Tshokola Mweze,Natalia Fernanda Valderrama,Valentin Derangère,Laurent Arnould,Sylvain Ladoire,Alain Lalande,Louis-Oscar Morel,Nathan Vinçon

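Task: Propose a domain shift analysis framework based on UWarp to address scanner-induced local domain shift in the digitization of histopathology slides.

Motivation: Current domain shift analyses mostly operate at a broad scale (slide level or dataset level) and overlook how local tissue characteristics affect the accuracy of deep learning models.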

Details

Method: Use a hierarchical registration approach that combines global affine transformations with fine-grained local corrections to align tissue slides precisely. Result: UWarp outperforms existing open-source registration methods on the CypathLung and BosomShieldBreast datasets, with a median target registration error (TRE) below 4 pixels (below 1 micrometer at 40x magnification) and significantly reduced computation time. UWarp also reveals a strong correlation between prediction variability and tissue density. Conclusion: The importance of local domain shift analysis is confirmed, and UWarp can serve as a valuable tool for improving model robustness and domain adaptation strategies in computational pathology. Abstract: Histopathology slide digitization introduces scanner-induced domain shift that can significantly impact computational pathology models based on deep learning methods. In the state of the art, this shift is often characterized at a broad scale (slide-level or dataset-level) but not patch-level, which limits our comprehension of the impact of localized tissue characteristics on the accuracy of the deep learning models. To address this challenge, we present a domain shift analysis framework based on UWarp, a novel registration tool designed to accurately align histological slides scanned under varying conditions. UWarp employs a hierarchical registration approach, combining global affine transformations with fine-grained local corrections to achieve robust tissue patch alignment. We evaluate UWarp using two private datasets, CypathLung and BosomShieldBreast, containing whole slide images scanned by multiple devices. Our experiments demonstrate that UWarp outperforms existing open-source registration methods, achieving a median target registration error (TRE) of less than 4 pixels (<1 micrometer at 40x magnification) while significantly reducing computational time. Additionally, we apply UWarp to characterize scanner-induced local domain shift in the predictions of Breast-NEOprAIdict, a deep learning model for breast cancer pathological response prediction. We find that prediction variability is strongly correlated with tissue density on a given patch. Our findings highlight the importance of localized domain shift analysis and suggest that UWarp can serve as a valuable tool for improving model robustness and domain adaptation strategies in computational pathology.
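
The median TRE figure can be read as follows: for each landmark pair, measure the Euclidean distance between the reference position and the registered position, take the median, and convert pixels to micrometers. The sketch below assumes a resolution of 0.25 micrometers per pixel at 40x, which is consistent with the abstract's "4 pixels, under 1 micrometer" but is not stated there; the function names and landmark coordinates are hypothetical.

```python
import math
from statistics import median

def target_registration_errors(reference_pts, registered_pts):
    """Per-landmark Euclidean distance, in pixels, between paired landmarks."""
    return [math.dist(p, q) for p, q in zip(reference_pts, registered_pts)]

def median_tre(reference_pts, registered_pts, um_per_pixel=0.25):
    """Median TRE in pixels and in micrometers (um_per_pixel is an assumed value)."""
    tre_px = median(target_registration_errors(reference_pts, registered_pts))
    return tre_px, tre_px * um_per_pixel

# Hypothetical landmark pairs with errors of 5, 0, and 1 pixels.
reference = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
registered = [(3.0, 4.0), (10.0, 10.0), (5.0, 6.0)]
tre_px, tre_um = median_tre(reference, registered)
```

The median is less sensitive than the mean to a few badly registered landmarks, which may be why it is the statistic reported.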

Benchmarking Machine Learning Methods for Distributed Acoustic Sensing

Shuaikai Shi,Qijun Zong

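Task: Compare the performance of classical machine learning methods and deep learning models for recognizing and interpreting distributed acoustic sensing (DAS) data.

Motivation: Integrating machine learning (ML) can make the analysis of DAS data more automated and intelligent, meeting monitoring needs in critical infrastructure sectors.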

Details

Method: Apply classical machine learning methods and deep learning models to the recognition and interpretation of DAS data. Result: Demonstrates the potential of ML-enhanced DAS technology in data acquisition precision and the reliability of intelligent decision-making. Conclusion: Combining ML with DAS opens new possibilities for intelligent sensing, with broad application prospects in transportation, energy, and natural disaster monitoring. Abstract: Distributed acoustic sensing (DAS) technology represents an innovative fiber-optic-based sensing methodology that enables real-time acoustic signal monitoring through the detection of minute perturbations along optical fibers. This sensing approach offers compelling advantages, including extensive measurement ranges, exceptional spatial resolution, and an expansive dynamic measurement spectrum. The integration of machine learning (ML) paradigms presents transformative potential for DAS technology, encompassing critical domains such as data augmentation, sophisticated preprocessing techniques, and advanced acoustic event classification and recognition. By leveraging ML algorithms, DAS systems can transition from traditional data processing methodologies to more automated and intelligent analytical frameworks. The computational intelligence afforded by ML-enhanced DAS technologies facilitates unprecedented monitoring capabilities across diverse critical infrastructure sectors. Particularly noteworthy are the technology's applications in transportation infrastructure, energy management systems, and natural disaster monitoring frameworks, where the precision of data acquisition and the reliability of intelligent decision-making mechanisms are paramount. This research critically examines the comparative performance characteristics of classical machine learning methodologies and state-of-the-art deep learning models in the context of DAS data recognition and interpretation, offering comprehensive insights into the evolving landscape of intelligent sensing technologies.

Demand Estimation with Text and Image Data

Giovanni Compiani,Ilya Morozov,Stephan Seiler

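Task: Propose a demand estimation method that uses unstructured text and image data to infer substitution patterns.

Motivation: Enable demand estimation when product attribute data are missing or when consumers value attributes that are hard to quantify, such as visual design or functional benefits.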

Details

Method: Extract embeddings from product images and textual descriptions with pre-trained deep learning models and incorporate them into a random coefficients logit model. Result: The method outperforms standard attribute-based models in counterfactual predictions of consumers' second choices, and across 40 product categories on Amazon.com text and image data help identify close substitutes. Conclusion: The method broadens the data available for demand estimation and performs especially well when traditional attribute data are lacking. Abstract: We propose a demand estimation method that leverages unstructured text and image data to infer substitution patterns. Using pre-trained deep learning models, we extract embeddings from product images and textual descriptions and incorporate them into a random coefficients logit model. This approach enables researchers to estimate demand even when they lack data on product attributes or when consumers value hard-to-quantify attributes, such as visual design or functional benefits. Using data from a choice experiment, we show that our approach outperforms standard attribute-based models in counterfactual predictions of consumers' second choices. We also apply it across 40 product categories on Amazon.com and consistently find that text and image data help identify close substitutes within each category.
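
To show why random coefficients on embeddings can capture substitution patterns, the sketch below averages plain logit choice probabilities over a few draws of taste coefficients and computes second-choice probabilities conditional on a first choice. It is a simplified illustration, not the authors' specification: the linear utility, the two-dimensional embeddings, the taste draws, and all function names are assumptions.

```python
import math

def logit_probs(utilities):
    """Plain logit choice probabilities (softmax with a stability shift)."""
    m = max(utilities)
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def utilities_for(beta, embeddings):
    """Assumed linear utility: dot product of taste draw and product embedding."""
    return [sum(b * x for b, x in zip(beta, e)) for e in embeddings]

def second_choice_probs(embeddings, beta_draws, first):
    """P(second choice = k | first choice = first), mixing over taste draws.

    Each draw is weighted by how likely it is to have chosen `first`, then
    the first choice is removed and logit probabilities are recomputed.
    """
    n = len(embeddings)
    numer = [0.0] * n
    denom = 0.0
    rest = [k for k in range(n) if k != first]
    for beta in beta_draws:
        utils = utilities_for(beta, embeddings)
        p_first = logit_probs(utils)[first]
        p_rest = logit_probs([utils[k] for k in rest])
        for k, pk in zip(rest, p_rest):
            numer[k] += p_first * pk
        denom += p_first
    return [x / denom for x in numer]

# Hypothetical products: A and B have similar embeddings, C is different.
embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
beta_draws = [(2.0, 0.0), (0.0, 2.0)]  # two consumer "types"
second = second_choice_probs(embeddings, beta_draws, first=0)
```

Consumers who picked product A are disproportionately those whose tastes favor A's embedding direction, so their second choice leans toward the nearby product B; a logit with fixed coefficients cannot produce this pattern.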

ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems

Chenxi Wang,Jizhan Fang,Xiang Chen,Bozhong Tian,Ziwen Xu,Huajun Chen,Ningyu Zhang

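Task: Propose a knowledge-editing approach, with the ADS-Edit dataset, to address the problems of applying large multimodal models directly to autonomous driving systems.

Motivation: Applying large multimodal models directly to autonomous driving systems is hindered by misunderstanding of traffic knowledge, complex road conditions, and diverse vehicle states.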

Details

Method: Use knowledge editing to make targeted modifications to model behavior without full retraining, and introduce ADS-Edit, a dataset designed specifically for autonomous driving. Result: Comprehensive experiments validate the approach and yield several interesting conclusions. Conclusion: The work is expected to advance the application of knowledge editing in autonomous driving. Abstract: Recent advancements in Large Multimodal Models (LMMs) have shown promise in Autonomous Driving Systems (ADS). However, their direct application to ADS is hindered by challenges such as misunderstanding of traffic knowledge, complex road conditions, and diverse states of vehicle. To address these challenges, we propose the use of Knowledge Editing, which enables targeted modifications to a model's behavior without the need for full retraining. Meanwhile, we introduce ADS-Edit, a multimodal knowledge editing dataset specifically designed for ADS, which includes various real-world scenarios, multiple data types, and comprehensive evaluation metrics. We conduct comprehensive experiments and derive several interesting conclusions. We hope that our work will contribute to the further advancement of knowledge editing applications in the field of autonomous driving. Code and data are available at https://github.com/zjunlp/EasyEdit.

MindfulLIME: A Stable Solution for Explanations of Machine Learning Models with Enhanced Localization Precision -- A Medical Image Case Study

Shakiba Rahimiaghdam,Hande Alemdar

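Task: Propose a new algorithm named MindfulLIME to address the instability of LIME when generating explanations.

Motivation: Transparency of machine learning decisions is critical in sensitive domains such as healthcare, finance, and justice, but existing explanation algorithms such as LIME produce unstable explanations because of randomly perturbed samples, which undermines trust in and adoption of the models.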

Details

Method: MindfulLIME intelligently generates purposive samples using a graph-based pruning algorithm and uncertainty sampling, substantially improving the stability of explanations. Result: Experiments show that MindfulLIME delivers reliable explanations with a 100% success rate under identical conditions and improves the localization precision of visual explanations. Conclusion: By addressing LIME's stability problems on image data, MindfulLIME enhances the trustworthiness and interpretability of machine learning models in critical domains such as medical imaging. Abstract: Ensuring transparency in machine learning decisions is critically important, especially in sensitive sectors such as healthcare, finance, and justice. Despite this, some popular explainable algorithms, such as Local Interpretable Model-agnostic Explanations (LIME), often produce unstable explanations due to the random generation of perturbed samples. Random perturbation introduces small changes or noise to modified instances of the original data, leading to inconsistent explanations. Even slight variations in the generated samples significantly affect the explanations provided by such models, undermining trust and hindering the adoption of interpretable models. To address this challenge, we propose MindfulLIME, a novel algorithm that intelligently generates purposive samples using a graph-based pruning algorithm and uncertainty sampling. MindfulLIME substantially improves the consistency of visual explanations compared to random sampling approaches. Our experimental evaluation, conducted on a widely recognized chest X-ray dataset, confirms MindfulLIME's stability with a 100% success rate in delivering reliable explanations under identical conditions. Additionally, MindfulLIME improves the localization precision of visual explanations by reducing the distance between the generated explanations and the actual local annotations compared to LIME. We also performed comprehensive experiments considering various segmentation algorithms and sample numbers, focusing on stability, quality, and efficiency. The results demonstrate the outstanding performance of MindfulLIME across different segmentation settings, generating fewer high-quality samples within a reasonable processing time. By addressing the stability limitations of LIME in image data, MindfulLIME enhances the trustworthiness and interpretability of machine learning models in specific medical imaging applications, a critical domain.
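
One simple way to quantify explanation stability is to rerun an explainer on the same input and measure how much the selected superpixels overlap across runs. The sketch below uses mean pairwise Jaccard similarity as such a proxy. This is an assumed metric for illustration; the abstract does not say how the 100% success rate is computed, and the run data here are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two collections of selected superpixel ids."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty explanations agree
    return len(a & b) / len(a | b)

def explanation_stability(runs):
    """Mean pairwise Jaccard similarity over repeated explanation runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical outputs: a deterministic explainer versus a randomly perturbed one.
stable_runs = [{3, 7, 9}, {3, 7, 9}, {3, 7, 9}]
unstable_runs = [{1, 2}, {2, 3}]

stable_score = explanation_stability(stable_runs)
unstable_score = explanation_stability(unstable_runs)
```

A score of 1.0 means every run highlighted exactly the same regions, which is the behavior a deterministic sampling scheme would be expected to produce under identical conditions.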