2025 03 28

ECLAIR: Enhanced Clarification for Interactive Responses in an Enterprise AI Assistant

John Murzaku,Zifan Liu,Vaishnavi Muppala,Md Mehrab Tanjim,Xiang Chen,Yunyao Li

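Task: Propose ECLAIR, a multi-agent framework for resolving the ambiguity that large language models encounter in real-world, enterprise-level interactions.

Motivation: Large language models excel at understanding and generating natural language, but they handle ambiguity poorly in real-world enterprise interactions, especially when context and domain knowledge are involved.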

Details

Method: ECLAIR performs interactive disambiguation by defining custom agents, having the agents reason about ambiguity, generating clarification questions, and using user feedback to refine the final response.

Result: In tests on real-world customer data, ECLAIR significantly outperforms standard few-shot methods at clarification question generation.

Conclusion: The ECLAIR framework improves how large language models handle ambiguity in complex interactions.

Abstract: Large language models (LLMs) have shown remarkable progress in understanding and generating natural language across various applications. However, they often struggle with resolving ambiguities in real-world, enterprise-level interactions, where context and domain-specific knowledge play a crucial role. In this demonstration, we introduce ECLAIR (Enhanced CLArification for Interactive Responses), a multi-agent framework for interactive disambiguation. ECLAIR enhances ambiguous user query clarification through an interactive process where custom agents are defined, ambiguity reasoning is conducted by the agents, clarification questions are generated, and user feedback is leveraged to refine the final response. When tested on real-world customer data, ECLAIR demonstrates significant improvements in clarification question generation compared to standard few-shot methods.

Can Zero-Shot Commercial APIs Deliver Regulatory-Grade Clinical Text DeIdentification?

Veysel Kocaman,Muhammed Santas,Yigit Gul,Mehmet Butgul,David Talby

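Task: Evaluate three API-based de-identification systems (Azure Health Data Services, AWS Comprehend Medical, and OpenAI GPT-4o) against the authors' own system, Healthcare NLP, on clinical document de-identification.

Motivation: Determine whether commercial APIs deliver the accuracy, adaptability, and cost-efficiency that regulatory requirements demand for clinical de-identification, and show the advantages of the authors' system.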

Details

Method: Compare the four systems at both entity level and token level on 48 expert-annotated clinical documents.

Result: Healthcare NLP reaches a 96% F1-score, significantly outperforming Azure (91%), AWS (83%), and GPT-4o (79%), and cuts processing costs by over 80%.

Conclusion: Commercial APIs fail to meet the regulatory requirements for clinical de-identification, while Healthcare NLP is the more viable option thanks to its accuracy, customizability, and cost-efficiency.

Abstract: We systematically assess the performance of three leading API-based de-identification systems - Azure Health Data Services, AWS Comprehend Medical, and OpenAI GPT-4o - against our de-identification systems on a ground truth dataset of 48 clinical documents annotated by medical experts. Our analysis, conducted at both entity-level and token-level, demonstrates that our solution, Healthcare NLP, achieves the highest accuracy, with a 96% F1-score in protected health information (PHI) detection, significantly outperforming Azure (91%), AWS (83%), and GPT-4o (79%). Beyond accuracy, Healthcare NLP is also the most cost-effective solution, reducing processing costs by over 80% compared to Azure and GPT-4o. Its fixed-cost local deployment model avoids the escalating per-request fees of cloud-based services, making it a scalable and economical choice. Our results underscore a critical limitation: zero-shot commercial APIs fail to meet the accuracy, adaptability, and cost-efficiency required for regulatory-grade clinical de-identification. Healthcare NLP's superior performance, customization capabilities, and economic advantages position it as the more viable solution for healthcare organizations seeking compliance and scalability in clinical NLP workflows.
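The token-level comparison mentioned in this entry can be illustrated with a minimal sketch. The function name `token_level_prf`, the `O` tag for non-PHI tokens, and the example tags are assumptions made for illustration; the paper's own scoring procedure is not given in this digest and may differ, for instance in how label mismatches are counted.

```python
def token_level_prf(gold_labels, pred_labels, negative_label="O"):
    """Return (precision, recall, F1) over tokens tagged as PHI.

    A token counts as a true positive only when the predicted PHI label
    matches the gold label exactly. A wrong PHI label counts as both a
    false positive and a false negative (an assumed convention).
    """
    if len(gold_labels) != len(pred_labels):
        raise ValueError("gold and predicted sequences must have the same length")

    tp = fp = fn = 0
    for gold, pred in zip(gold_labels, pred_labels):
        if pred != negative_label and pred == gold:
            tp += 1
        else:
            if pred != negative_label:
                fp += 1  # predicted PHI that is absent or mislabeled
            if gold != negative_label:
                fn += 1  # gold PHI that was missed or mislabeled

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical five-token example: one missed NAME token, one spurious DATE token.
gold = ["O", "NAME", "NAME", "O", "DATE"]
pred = ["O", "NAME", "O", "DATE", "DATE"]
print(token_level_prf(gold, pred))
```

Entity-level scoring would instead compare whole spans, so a partially detected name could count as a miss even when most of its tokens are tagged correctly.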

"Whose Side Are You On?" Estimating Ideology of Political and News Content Using Large Language Models and Few-shot Demonstration Selection

Muhammad Haroon,Magdalena Wojcieszak,Anshuman Chhabra

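Task: Use large language models (LLMs) with in-context learning (ICL) to classify the political ideology of online content along the two-party US political spectrum.

Motivation: Existing ideology classification methods require extensive human labeling and cannot adapt to evolving ideological contexts, so the paper explores whether LLMs can address these problems.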

Details

Method: Apply in-context learning (ICL) with label-balanced demonstration selection in experiments on three datasets of news articles and YouTube videos.

Result: The approach significantly outperforms zero-shot and traditional supervised methods on ideology classification; the paper also evaluates how metadata (e.g., content source and descriptions) affects classification.

Conclusion: LLMs perform strongly on political ideology classification, and the content source significantly influences the classification result.

Abstract: The rapid growth of social media platforms has led to concerns about radicalization, filter bubbles, and content bias. Existing approaches to classifying ideology are limited in that they require extensive human effort, the labeling of large datasets, and are not able to adapt to evolving ideological contexts. This paper explores the potential of Large Language Models (LLMs) for classifying the political ideology of online content in the context of the two-party US political spectrum through in-context learning (ICL). Our extensive experiments involving demonstration selection in label-balanced fashion, conducted on three datasets comprising news articles and YouTube videos, reveal that our approach significantly outperforms zero-shot and traditional supervised methods. Additionally, we evaluate the influence of metadata (e.g., content source and descriptions) on ideological classification and discuss its implications. Finally, we show how providing the source for political and non-political content influences the LLM's classification.
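Label-balanced demonstration selection can be sketched as drawing the same number of examples per label before building the few-shot prompt. This is only an illustration under stated assumptions: the function names, the prompt template, and the `liberal`/`conservative` labels are hypothetical, and the paper's actual selection strategy may go beyond uniform random sampling within each label.

```python
import random
from collections import defaultdict


def select_balanced_demonstrations(pool, k_per_label, seed=0):
    """Pick k_per_label (text, label) pairs for every label in the pool."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in pool:
        by_label[label].append((text, label))

    selected = []
    for label in sorted(by_label):
        examples = by_label[label]
        if len(examples) < k_per_label:
            raise ValueError(f"not enough examples for label {label!r}")
        selected.extend(rng.sample(examples, k_per_label))

    # Shuffle so the prompt does not list all examples of one label first.
    rng.shuffle(selected)
    return selected


def build_prompt(demonstrations, query):
    """Format demonstrations and the query as a few-shot prompt (hypothetical template)."""
    parts = [f"Text: {text}\nIdeology: {label}\n" for text, label in demonstrations]
    parts.append(f"Text: {query}\nIdeology:")
    return "\n".join(parts)


# Hypothetical labeled pool; real data would come from the annotated datasets.
pool = [
    ("example article A", "liberal"),
    ("example article B", "liberal"),
    ("example article C", "liberal"),
    ("example article D", "conservative"),
    ("example article E", "conservative"),
    ("example article F", "conservative"),
]
demos = select_balanced_demonstrations(pool, k_per_label=2)
print(build_prompt(demos, "unlabeled article text"))
```

Balancing the demonstrations keeps the prompt from implicitly signaling a skewed label prior to the model.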

SE-GNN: Seed Expanded-Aware Graph Neural Network with Iterative Optimization for Semi-supervised Entity Alignment

Tao Meng,Shuo Shan,Hongen Shao,Yuntao Shou,Wei Ai,Keqin Li

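Task: Propose SE-GNN, a seed expanded-aware graph neural network for semi-supervised entity alignment.

Motivation: Address the difficulty of manually annotating seed pairs as knowledge graphs (KGs) grow, and the poor alignment of existing methods caused by structural heterogeneity and by the embedding distortion that noisy seed pairs produce.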

Details

Method: Combine semantic attributes and structural features to obtain high-quality initial seed pairs, design a local and global awareness mechanism to improve the embedding representation, and use a threshold nearest neighbor embedding correction strategy to select iterative seed pairs.

Result: SE-GNN alleviates the impact of KG structural heterogeneity, reduces embedding distortion, and improves entity alignment.

Conclusion: By jointly optimizing the initial seed pairs and the embedding representation, SE-GNN improves semi-supervised entity alignment performance.

Abstract: Entity alignment aims to use pre-aligned seed pairs to find other equivalent entities from different knowledge graphs (KGs) and is widely used in graph fusion-related fields. However, as the scale of KGs increases, manually annotating pre-aligned seed pairs becomes difficult. Existing research utilizes entity embeddings obtained by aggregating single structural information to identify potential seed pairs, thus reducing the reliance on pre-aligned seed pairs. However, due to the structural heterogeneity of KGs, the quality of potential seed pairs obtained using only a single structural information is not ideal. In addition, although existing research improves the quality of potential seed pairs through semi-supervised iteration, they underestimate the impact of embedding distortion produced by noisy seed pairs on the alignment effect. In order to solve the above problems, we propose a seed expanded-aware graph neural network with iterative optimization for semi-supervised entity alignment, named SE-GNN. First, we utilize the semantic attributes and structural features of entities, combined with a conditional filtering mechanism, to obtain high-quality initial potential seed pairs. Next, we designed a local and global awareness mechanism. It introduces initial potential seed pairs and combines local and global information to obtain a more comprehensive entity embedding representation, which alleviates the impact of KG structural heterogeneity and lays the foundation for the optimization of initial potential seed pairs. Then, we designed the threshold nearest neighbor embedding correction strategy. It combines the similarity threshold and the bidirectional nearest neighbor method as a filtering mechanism to select iterative potential seed pairs and also uses an embedding correction strategy to eliminate the embedding distortion.
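The filtering step that combines a similarity threshold with bidirectional nearest neighbors can be sketched as follows. This is a minimal illustration, not the SE-GNN implementation: cosine similarity, the function names, and the toy embeddings are assumptions, and the embedding correction strategy is not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mutual_nearest_pairs(src, tgt, threshold):
    """Return (i, j, sim) where src[i] and tgt[j] are each other's nearest
    neighbor and their similarity is at least `threshold`."""
    if not src or not tgt:
        return []
    sim = [[cosine(u, v) for v in tgt] for u in src]

    # Nearest target for every source entity, and nearest source for every target.
    best_tgt = [max(range(len(tgt)), key=lambda j: sim[i][j]) for i in range(len(src))]
    best_src = [max(range(len(src)), key=lambda i: sim[i][j]) for j in range(len(tgt))]

    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sim[i][j] >= threshold:
            pairs.append((i, j, sim[i][j]))
    return pairs


# Toy embeddings: source entity 2 is close to both targets but nearest to neither.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9]]
for i, j, s in mutual_nearest_pairs(src, tgt, threshold=0.9):
    print(i, j, round(s, 3))
```

Requiring the match in both directions discards one-sided candidates such as source entity 2 here, while the threshold removes mutual matches that are still too dissimilar to trust as seed pairs.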

Multimodal Image Matching based on Frequency-domain Information of Local Energy Response

Meng Yang,Jun Chen,Wenping Gong,Longsheng Wei,Xin Tian

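Task: Propose FILER, a local energy response model based on frequency-domain information, to handle nonlinear intensity differences, local geometric distortions, noise, and rotation in multimodal image matching.

Motivation: Multimodal image matching faces nonlinear intensity differences, local geometric distortions, noise, and rotation transformations, and needs a robust, general method.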

Details

Method: Design FILER, a local energy response model based on frequency-domain information, including an edge structure enhanced feature detector and a convolutional feature weighted descriptor.

Result: Experiments show that FILER outperforms other state-of-the-art algorithms in multimodal image matching, with good robustness and universality.

Conclusion: FILER is an effective multimodal image matching method that overcomes these challenges and achieves rotation invariance.

Abstract: Complicated nonlinear intensity differences, nonlinear local geometric distortions, noise, and rotation transformations are the main challenges in multimodal image matching. In order to solve these problems, we propose a method based on Frequency-domain Information of Local Energy Response called FILER. The core of FILER is the local energy response model based on frequency-domain information, which can overcome the effect of nonlinear intensity differences. To improve the robustness to local nonlinear geometric distortions and noise, we design a new edge structure enhanced feature detector and convolutional feature weighted descriptor, respectively. In addition, FILER overcomes the sensitivity of the frequency-domain information to the rotation angle and achieves rotation invariance. Extensive experiments on multimodal image pairs show that FILER outperforms other state-of-the-art algorithms and has good robustness and universality.

Comprehensive Manuscript Assessment with Text Summarization Using 69707 articles

Qichen Sun,Yuxing Lu,Kun Xia,Li Chen,He Sun,Jinzhuo Wang

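Task: Use deep learning to classify and predict the future impact of scientific papers.

Motivation: Rapid, efficient assessment of a paper's future impact matters to both authors and reviewers, but most existing methods are confined to a specific field or rely on early citation counts, making them impractical for early-stage evaluation.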

Details

Method: Build a large-scale multidisciplinary dataset from Scopus, extract semantic features with a Transformer model, and design a text fusion layer to capture the information shared between titles and abstracts.

Result: Experiments show that the proposed model performs well on impact prediction tasks and has potential for generating manuscript feedback and improvement suggestions.

Conclusion: The approach offers an effective tool for early-stage assessment of paper impact and shows potential across disciplines.

Abstract: Rapid and efficient assessment of the future impact of research articles is a significant concern for both authors and reviewers. The most common standard for measuring the impact of academic papers is the number of citations. In recent years, numerous efforts have been undertaken to predict citation counts within various citation windows. However, most of these studies focus solely on a specific academic field or require early citation counts for prediction, rendering them impractical for the early-stage evaluation of papers. In this work, we harness Scopus to curate a comprehensive, large-scale dataset of information from 69707 scientific articles sourced from 99 journals spanning multiple disciplines. We propose a deep learning methodology for the impact-based classification tasks, which leverages semantic features extracted from the manuscripts and paper metadata. To summarize the semantic features, such as titles and abstracts, we employ a Transformer-based language model to encode semantic features and design a text fusion layer to capture shared information between titles and abstracts. We specifically focus on the following impact-based prediction tasks using information from scientific manuscripts in the pre-publication stage: (1) The impact of journals in which the manuscripts will be published. (2) The future impact of manuscripts themselves. Extensive experiments on our datasets demonstrate the superiority of our proposed model for impact-based prediction tasks. We also demonstrate its potential for generating manuscript feedback and improvement suggestions.

MedSegNet10: A Publicly Accessible Network Repository for Split Federated Medical Image Segmentation

Chamani Shiranthika,Zahra Hafezi Kafshgari,Hadi Hadizadeh,Parvaneh Saeedi

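Task: Introduce MedSegNet10, a publicly accessible repository for medical image segmentation based on split federated learning.

Motivation: Address data privacy concerns, limited annotated data, and inadequate training data in medical image segmentation.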

Details

Method: Use split federated learning (SplitFed/SFL) and provide pre-trained neural network architectures covering multiple medical image types.

Result: MedSegNet10 enables collaborative training on privately stored data while ensuring privacy and integrity.

Conclusion: MedSegNet10 gives researchers and practitioners a tool for advancing medical image segmentation while protecting patient data privacy.

Abstract: Machine Learning (ML) and Deep Learning (DL) have shown significant promise in healthcare, particularly in medical image segmentation, which is crucial for accurate disease diagnosis and treatment planning. Despite their potential, challenges such as data privacy concerns, limited annotated data, and inadequate training data persist. Decentralized learning approaches such as federated learning (FL), split learning (SL), and split federated learning (SplitFed/SFL) address these issues effectively. This paper introduces "MedSegNet10," a publicly accessible repository designed for medical image segmentation using split-federated learning. MedSegNet10 provides a collection of pre-trained neural network architectures optimized for various medical image types, including microscopic images of human blastocysts, dermatoscopic images of skin lesions, and endoscopic images of lesions, polyps, and ulcers, with applications extending beyond these examples. By leveraging SplitFed's benefits, MedSegNet10 allows collaborative training on privately stored, horizontally split data, ensuring privacy and integrity. This repository supports researchers, practitioners, trainees, and data scientists, aiming to advance medical image segmentation while maintaining patient data privacy. The repository is available at: https://vault.sfu.ca/index.php/s/ryhf6t12O0sobuX (password upon request to the authors).

Named Entity Recognition in Context

Colin Brisson,Ayoub Kahfy,Marc Bui,Frédéric Constant

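Task: Develop a named entity recognition system for the EvaHan2025 competition.

Motivation: Improve named entity recognition on Classical Chinese texts by combining a modern pretrained model, a retrieval module, and a generative reasoning step.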

Details

Method: Integrate Pindola, a modern transformer-based bidirectional encoder, with a retrieval module and a generative reasoning step.

Result: An average F1 score of 85.58, nearly 5 points above the competition baseline.

Conclusion: The approach performs well on Classical Chinese named entity recognition and clearly outperforms the baseline.

Abstract: We present the Named Entity Recognition system developed by the Edit Dunhuang team for the EvaHan2025 competition. Our approach integrates three core components: (1) Pindola, a modern transformer-based bidirectional encoder pretrained on a large corpus of Classical Chinese texts; (2) a retrieval module that fetches relevant external context for each target sequence; and (3) a generative reasoning step that summarizes retrieved context in Classical Chinese for more robust entity disambiguation. Using this approach, we achieve an average F1 score of 85.58, improving upon the competition baseline by nearly 5 points.

Unified Multimodal Discrete Diffusion

Alexander Swerdlow,Mihir Prabhudesai,Siddharth Gandhi,Deepak Pathak,Katerina Fragkiadaki

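Task: Explore discrete diffusion models as a unified generative formulation for the joint text and image domain.

Motivation: Multimodal generative models rely mainly on autoregressive approaches, but discrete diffusion models offer advantages in controlling quality versus diversity, in joint multimodal inpainting, and in controllability of generation.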

Details

Method: Propose the Unified Multimodal Discrete Diffusion (UniDisc) model, which jointly understands and generates text and images.

Result: UniDisc outperforms autoregressive models across tasks, with better performance, inference efficiency, controllability, and editability.

Conclusion: Discrete diffusion models offer clear advantages for multimodal generation, and UniDisc opens a new direction for future research.

Abstract: Multimodal generative models that can understand and generate across multiple modalities are dominated by autoregressive (AR) approaches, which process tokens sequentially from left to right, or top to bottom. These models jointly handle images, text, video, and audio for various tasks such as image captioning, question answering, and image generation. In this work, we explore discrete diffusion models as a unified generative formulation in the joint text and image domain, building upon their recent success in text generation. Discrete diffusion models offer several advantages over AR models, including improved control over quality versus diversity of generated samples, the ability to perform joint multimodal inpainting (across both text and image domains), and greater controllability in generation through guidance. Leveraging these benefits, we present the first Unified Multimodal Discrete Diffusion (UniDisc) model which is capable of jointly understanding and generating text and images for a variety of downstream tasks. We compare UniDisc to multimodal AR models, performing a scaling analysis and demonstrating that UniDisc outperforms them in terms of both performance and inference-time compute, enhanced controllability, editability, inpainting, and flexible trade-off between inference time and generation quality. Code and additional visualizations are available at https://unidisc.github.io.

Both Direct and Indirect Evidence Contribute to Dative Alternation Preferences in Language Models

Qing Yao,Kanishka Misra,Leonie Weissweiler,Kyle Mahowald

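Task: Investigate whether language models' (LMs') preferences in the English dative alternation (DO vs. PO) stem from direct exposure to the phenomenon or from more general properties of language.

Motivation: Understand whether LMs' preferences on syntactic phenomena come from direct input or from more general properties of language.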

Details

Method: Train small language models on controlled input, systematically manipulating properties such as length and animacy, and analyze the effect on the choice of dative alternant.

Result: Direct evidence (length and animacy) affects preferences, but easy-first preferences persist even without it; indirect evidence (global length effects) can also give rise to the preferences.

Conclusion: LMs' syntactic preferences result from a mix of direct and indirect evidence.

Abstract: Language models (LMs) tend to show human-like preferences on a number of syntactic phenomena, but the extent to which these are attributable to direct exposure to the phenomena or more general properties of language is unclear. We explore this with the English dative alternation (DO: "gave Y the X" vs. PO: "gave the X to Y"), using a controlled rearing paradigm wherein we iteratively train small LMs on systematically manipulated input. We focus on properties that affect the choice of alternant: length and animacy. Both properties are directly present in datives but also reflect more global tendencies for shorter elements to precede longer ones and animates to precede inanimates. First, by manipulating and ablating datives for these biases in the input, we show that direct evidence of length and animacy matters, but easy-first preferences persist even without such evidence. Then, using LMs trained on systematically perturbed datasets to manipulate global length effects (re-linearizing sentences globally while preserving dependency structure), we find that dative preferences can emerge from indirect evidence. We conclude that LMs' emergent syntactic preferences come from a mix of direct and indirect sources.

VinaBench: Benchmark for Faithful and Consistent Visual Narratives

Silin Gao,Sheryl Mathew,Li Mi,Sepideh Mamooler,Mengjie Zhao,Hiromi Wakaki,Yuki Mitsufuji,Syrielle Montariol,Antoine Bosselut

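Task: Propose VinaBench, a new benchmark addressing the challenges of faithfulness and consistency in visual narrative generation.

Motivation: Visual narrative generation lacks knowledge constraints, so generated image sequences struggle to stay faithful to the input text and self-consistent.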

Details

Method: Annotate the commonsense and discourse constraints in visual narrative samples to provide systematic scaffolds, and propose new evaluation metrics based on them.

Result: Experiments show that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.

Conclusion: VinaBench provides effective knowledge constraints and evaluation methods for visual narrative generation.

Abstract: Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.

GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations

Yupei Li,Qiyang Sun,Sunil Munthumoduku Krishna Murthy,Emran Alturki,Björn W. Schuller

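Task: Propose GatedxLSTM, a novel speech-text multimodal model for emotion recognition in conversation, to address existing methods' weaknesses in aligning multimodal features and explaining how emotions evolve.

Motivation: Human emotions are dynamic, shaped by an individual's expressions and by interactions with others, and traditional single-modality or multimodal methods struggle to capture these dynamics fully.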

Details

Method: Combine speech and text signals, use Contrastive Language-Audio Pretraining (CLAP) to improve cross-modal alignment, apply a gating mechanism to emphasise emotionally impactful utterances, and add the Dialogical Emotion Decoder (DED) to model contextual dependencies.

Result: On the IEMOCAP dataset, GatedxLSTM achieves state-of-the-art performance among open-source methods in four-class emotion classification.

Conclusion: GatedxLSTM improves emotion recognition performance and provides an interpretability analysis from a psychological perspective, validating its effectiveness for emotion recognition in conversation.

Abstract: Affective Computing (AC) is essential for advancing Artificial General Intelligence (AGI), with emotion recognition serving as a key component. However, human emotions are inherently dynamic, influenced not only by an individual's expressions but also by interactions with others, and single-modality approaches often fail to capture their full dynamics. Multimodal Emotion Recognition (MER) leverages multiple signals but traditionally relies on utterance-level analysis, overlooking the dynamic nature of emotions in conversations. Emotion Recognition in Conversation (ERC) addresses this limitation, yet existing methods struggle to align multimodal features and explain why emotions evolve within dialogues. To bridge this gap, we propose GatedxLSTM, a novel speech-text multimodal ERC model that explicitly considers voice and transcripts of both the speaker and their conversational partner(s) to identify the most influential sentences driving emotional shifts. By integrating Contrastive Language-Audio Pretraining (CLAP) for improved cross-modal alignment and employing a gating mechanism to emphasise emotionally impactful utterances, GatedxLSTM enhances both interpretability and performance. Additionally, the Dialogical Emotion Decoder (DED) refines emotion predictions by modelling contextual dependencies. Experiments on the IEMOCAP dataset demonstrate that GatedxLSTM achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification. These results validate its effectiveness for ERC applications and provide an interpretability analysis from a psychological perspective.

BioX-CPath: Biologically-driven Explainable Diagnostics for Multistain IHC Computational Pathology

Amaya Gallagher-Syed,Henry Senior,Omnia Alwazzan,Elena Pontarini,Michele Bombardieri,Costantino Pitzalis,Myles J. Lewis,Michael R. Barnes,Luca Rossi,Gregory Slabaugh

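Task: Develop BioX-CPath, an explainable graph neural network architecture for whole slide image (WSI) classification in multistain immunohistochemistry (IHC) analysis.

Motivation: Address the key challenge of developing biologically interpretable and explainable models in computational pathology, particularly for multistain IHC analysis.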

Details

Method: Introduce a novel Stain-Aware Attention Pooling (SAAP) module that uses spatial and semantic features to generate biologically meaningful, stain-aware patient embeddings.

Result: State-of-the-art performance on Rheumatoid Arthritis and Sjogren's Disease multistain datasets, together with interpretable stain attention scores, entropy measures, and stain interaction scores.

Conclusion: BioX-CPath combines biological interpretability with strong classification performance, making it particularly suitable for clinical applications where interpretability is key.

Abstract: The development of biologically interpretable and explainable models remains a key challenge in computational pathology, particularly for multistain immunohistochemistry (IHC) analysis. We present BioX-CPath, an explainable graph neural network architecture for whole slide image (WSI) classification that leverages both spatial and semantic features across multiple stains. At its core, BioX-CPath introduces a novel Stain-Aware Attention Pooling (SAAP) module that generates biologically meaningful, stain-aware patient embeddings. Our approach achieves state-of-the-art performance on both Rheumatoid Arthritis and Sjogren's Disease multistain datasets. Beyond performance metrics, BioX-CPath provides interpretable insights through stain attention scores, entropy measures, and stain interaction scores that permit measuring model alignment with known pathological mechanisms. This biological grounding, combined with strong classification performance, makes BioX-CPath particularly suitable for clinical applications where interpretability is key. Source code and documentation can be found at: https://github.com/AmayaGS/BioX-CPath.

Hacia la interpretabilidad de la detección anticipada de riesgos de depresión utilizando grandes modelos de lenguaje

Horacio Thompson,Maximiliano Sapino,Edgardo Ferretti,Marcelo Errecalde

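Task: Use large language models (LLMs) to solve depression-related early detection of risks (EDR) on Spanish texts.

Motivation: Assess LLMs' reasoning ability in a specific domain and explore their potential for early risk detection.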

Details

Method: Define a reasoning criterion, apply in-context learning to the Gemini model, and evaluate the results both quantitatively and qualitatively.

Result: Accurate predictions supported by explanatory reasoning, giving a deeper understanding of the solution.

Conclusion: The approach offers new perspectives for addressing EDR problems with LLMs.

Abstract: Early Detection of Risks (EDR) on the Web involves identifying at-risk users as early as possible. Although Large Language Models (LLMs) have proven to solve various linguistic tasks efficiently, assessing their reasoning ability in specific domains is crucial. In this work, we propose a method for solving depression-related EDR using LLMs on Spanish texts, with responses that can be interpreted by humans. We define a reasoning criterion to analyze users through a specialist, apply in-context learning to the Gemini model, and evaluate its performance both quantitatively and qualitatively. The results show that accurate predictions can be obtained, supported by explanatory reasoning, providing a deeper understanding of the solution. Our approach offers new perspectives for addressing EDR problems by leveraging the power of LLMs.

Feature Modulation for Semi-Supervised Domain Generalization without Domain Labels

Venuri Amarasinghe,Asini Jayakody,Isun Randila,Kalinga Bandara,Chamuditha Jayanga Galappaththige,Ranga Rodrigo

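Task: Tackle semi-supervised domain generalization without domain labels, improving model performance through feature modulation and dynamic loss scaling.

Motivation: Existing methods rely on pseudo-labels and domain labels, but domain shift makes pseudo-labels inconsistent, which degrades model performance.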

Details

Method: Propose a feature modulation strategy and a dynamic loss-scaling function that enhance class-discriminative features and make better use of pseudo-labels.

Result: Significant improvements on four major domain generalization benchmarks, without requiring domain labels.

Conclusion: The method improves semi-supervised domain generalization when domain labels are unavailable; the code will be released.

Abstract: Semi-supervised domain generalization (SSDG) leverages a small fraction of labeled data alongside unlabeled data to enhance model generalization. Most of the existing SSDG methods rely on pseudo-labeling (PL) for unlabeled data, often assuming access to domain labels, a privilege not always available. However, domain shifts introduce domain noise, leading to inconsistent PLs that degrade model performance. Methods derived from FixMatch suffer particularly from lower PL accuracy, reducing the effectiveness of unlabeled data. To address this, we tackle the more challenging domain-label agnostic SSDG, where domain labels for unlabeled data are not available during training. First, we propose a feature modulation strategy that enhances class-discriminative features while suppressing domain-specific information. This modulation shifts features toward Similar Average Representations (a modified version of class prototypes) that are robust across domains, encouraging the classifier to distinguish between closely related classes and the feature extractor to form tightly clustered, domain-invariant representations. Second, to mitigate domain noise and improve pseudo-label accuracy, we introduce a loss-scaling function that dynamically lowers the fixed confidence threshold for pseudo-labels, optimizing the use of unlabeled data. With these key innovations, our approach achieves significant improvements on four major domain generalization benchmarks, even without domain labels. We will make the code available.

Clean & Clear: Feasibility of Safe LLM Clinical Guidance

Julia Ive,Felix Jozsa,Nick Jackson,Paulina Bondaronek,Ciaran Scott Hill,Richard Dobson

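Task: Develop and preliminarily evaluate LLM-based chatbot software that can reliably answer clinical guideline questions.

Motivation: Clinical guidelines are central to modern healthcare, and LLM-empowered chatbots show great promise in healthcare Q&A, with the potential to answer medical questions quickly and accurately.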

Details

Method: Use the open-weight Llama-3.1-8B LLM to extract relevant information from the UCLH guidelines to answer questions, emphasising the safety and reliability of referencing information. Seven doctors assessed the chatbot's performance.

Result: The chatbot performs well on relevance, recall, and completeness: ~73% of responses were rated very relevant, recall was 0.98, and ~78% of responses were rated satisfactory for completeness. The average completion time was 10 seconds, and 72% of responses had no clinical reasoning flaws.

Conclusion: The chatbot shows significant potential to speed up and improve how healthcare professionals access locally relevant clinical information.

Abstract: Background: Clinical guidelines are central to safe evidence-based medicine in modern healthcare, providing diagnostic criteria, treatment options and monitoring advice for a wide range of illnesses. LLM-empowered chatbots have shown great promise in Healthcare Q&A tasks, offering the potential to provide quick and accurate responses to medical inquiries. Our main objective was the development and preliminary assessment of an LLM-empowered chatbot software capable of reliably answering clinical guideline questions using University College London Hospital (UCLH) clinical guidelines. Methods: We used the open-weight Llama-3.1-8B LLM to extract relevant information from the UCLH guidelines to answer questions. Our approach highlights the safety and reliability of referencing information over its interpretation and response generation. Seven doctors from the ward assessed the chatbot's performance by comparing its answers to the gold standard. Results: Our chatbot demonstrates promising performance in terms of relevance, with ~73% of its responses rated as very relevant, showcasing a strong understanding of the clinical context. Importantly, our chatbot achieves a recall of 0.98 for extracted guideline lines, substantially minimising the risk of missing critical information. Approximately 78% of responses were rated satisfactory in terms of completeness. A small portion (~14.5%) contained minor unnecessary information, indicating occasional lapses in precision. The chatbot showed high efficiency, with an average completion time of 10 seconds, compared to 30 seconds for human respondents. Evaluation of clinical reasoning showed that 72% of the chatbot's responses were without flaws. Our chatbot demonstrates significant potential to speed up and improve the process of accessing locally relevant clinical information for healthcare professionals.

Prototype Guided Backdoor Defense

Venkat Adithya Amula,Sunayana Samavedam,Saurabh Saini,Avani Gupta,Narayanan P J

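Task: Propose Prototype Guided Backdoor Defense (PGBD), a defense against backdoor attacks on deep learning models.

Motivation: Deep learning models are vulnerable to backdoor attacks, especially those using semantic triggers, and existing defenses struggle to handle multiple trigger types.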

Details

Method: Exploit geometric displacements in activation space, penalizing trigger-induced displacement with a novel sanitization loss in a post-hoc fine-tuning step.

Result: PGBD performs well across attack settings and presents the first defense against a new semantic attack on celebrity face images.

Conclusion: PGBD is a scalable, efficient defense that handles multiple backdoor attack types, including semantic triggers.

Abstract: Deep learning models are susceptible to backdoor attacks, in which malicious attackers perturb a small subset of training data with a trigger to cause misclassifications. Various triggers have been used, including semantic triggers that are easily realizable without requiring the attacker to manipulate the image. The emergence of generative AI has eased the generation of varied poisoned samples. Robustness across types of triggers is crucial to effective defense. We propose Prototype Guided Backdoor Defense (PGBD), a robust post-hoc defense that scales across different trigger types, including previously unsolved semantic triggers. PGBD exploits displacements in the geometric spaces of activations to penalize movements toward the trigger. This is done using a novel sanitization loss of a post-hoc fine-tuning step. The geometric approach scales easily to all types of attacks. PGBD achieves better performance across all settings. We also present the first defense against a new semantic attack on celebrity face images. Project page: https://venkatadithya9.github.io/pgbd.github.io/.

Sociotechnical Effects of Machine Translation

Joss Moorkens,Andy Way,Séamus Lankford

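Task: Discuss the side-effects and risks of machine translation (MT) and how to mitigate them.

Motivation: With the adoption of neural MT and large language models (LLMs), their impact on climate change and their potential detrimental effects on translators and users have drawn attention.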

Details

Method: Analyze the carbon footprint of large models and propose smaller high-quality models and tuning of pre-trained models to reduce emissions; discuss MT's detrimental effects on translators and users, along with data copyright and ethical issues; and present a method for using MT properly in crisis scenarios.

Result: Smaller models and tuned pre-trained models can substantially lower the carbon footprint; MT can save lives in crisis scenarios.

Conclusion: With sound methods, MT's side-effects and risks can be mitigated while preserving its benefits.

Abstract: While the previous chapters have shown how machine translation (MT) can be useful, in this chapter we discuss some of the side-effects and risks that are associated, and how they might be mitigated. With the move to neural MT and approaches using Large Language Models (LLMs), there is an associated impact on climate change, as the models built by multinational corporations are massive. They are hugely expensive to train, consume large amounts of electricity, and output huge volumes of kgCO2 to boot. However, smaller models which still perform to a high level of quality can be built with much lower carbon footprints, and tuning pre-trained models saves on the requirement to train from scratch. We also discuss the possible detrimental effects of MT on translators and other users. The topics of copyright and ownership of data are discussed, as well as ethical considerations on data and MT use. Finally, we show how if done properly, using MT in crisis scenarios can save lives, and we provide a method of how this might be done.

LATTE-MV: Learning to Anticipate Table Tennis Hits from Monocular Videos

Daniel Etaat,Dvij Kalaria,Nima Rahmanian,Shankar Sastry

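Task: Design an agent that anticipates the opponent's intent to improve reaction capability in table tennis.

Motivation: In fast-paced table tennis, champions buy reaction time by anticipating their opponent's intent; existing systems either lack effective anticipation or are limited by their datasets.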

Details

Method: Propose a scalable system for reconstructing table tennis matches in 3D from monocular video, and develop an uncertainty-aware controller that anticipates opponent actions.

Result: In simulation, the policy raises the return rate against high-speed hits from 49.9% to 59.0% compared with a non-anticipatory baseline policy.

Conclusion: By anticipating opponent actions, the system improves the return rate and offers an effective approach to designing more capable table tennis agents.

Abstract: Physical agility is a necessary skill in competitive table tennis, but by no means sufficient. Champions excel in this fast-paced and highly dynamic environment by anticipating their opponent's intent - buying themselves the necessary time to react. In this work, we take one step towards designing such an anticipatory agent. Previous works have developed systems capable of real-time table tennis gameplay, though they often do not leverage anticipation. Among the works that forecast opponent actions, their approaches are limited by dataset size and variety. Our paper contributes (1) a scalable system for reconstructing monocular video of table tennis matches in 3D and (2) an uncertainty-aware controller that anticipates opponent actions. We demonstrate in simulation that our policy improves the ball return rate against high-speed hits from 49.9% to 59.0% as compared to a baseline non-anticipatory policy.

Arnav Arora,Srishti Yadav,Maria Antoniak,Serge Belongie,Isabelle Augenstein

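Task: Develop a multi-modal, multi-label framing analysis method for analyzing text and images in news at scale.

Motivation: Existing studies are limited to pre-defined frames and text-only analysis, ignoring visual context, and so cannot give a full picture of media bias.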

Details

Method: Uses large (vision-)language models to extract latent meaning from text and images and compares the frames used in each. Result: Presents a scalable, integrative framing analysis method that gives a more complete understanding of media bias. Conclusion: Multi-modal framing analysis provides a more complete perspective for understanding media bias. Abstract: Automated frame analysis of political communication is a popular task in computational social science that is used to study how authors select aspects of a topic to frame its reception. So far, such studies have been narrow, in that they use a fixed set of pre-defined frames and focus only on the text, ignoring the visual contexts in which those texts appear. Especially for framing in the news, this leaves out valuable information about editorial choices, which include not just the written article but also accompanying photographs. To overcome such limitations, we present a method for conducting multi-modal, multi-label framing analysis at scale using large (vision-)language models. Grounding our work in framing theory, we extract latent meaning embedded in images used to convey a certain point and contrast that to the text by comparing the respective frames used. We also identify highly partisan framing of topics with issue-specific frame analysis found in prior qualitative work. We demonstrate a method for doing scalable integrative framing analysis of both text and image in news, providing a more complete picture for understanding media bias.

Eyes Tell the Truth: GazeVal Highlights Shortcomings of Generative AI in Medical Imaging

David Wong,Bin Wang,Gorkem Durak,Marouane Tliba,Akshay Chaudhari,Aladine Chetouani,Ahmet Enis Cetin,Cagdas Topel,Nicolo Gennaro,Camila Lopes Vendrami,Tugce Agirlar Trabzonlu,Amir Ali Rahsepar,Laetitia Perronne,Matthew Antalek,Onural Ozturk,Gokcan Okur,Andrew C. Gordon,Ayis Pyrros,Frank H. Miller,Amir Borhani,Hatice Savas,Eric Hart,Drew Torigian,Jayaram K. Udupa,Elizabeth Krupinski,Ulas Bagci

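Task: Propose the GazeVal framework, which combines expert eye-tracking data with radiological evaluation to assess the quality of synthetic medical images.

Motivation: Current evaluation of synthetic medical images relies mainly on computational metrics that fail to align with human expert recognition, yielding images that lack clinical authenticity.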

Details

Method: Combines expert eye-tracking data with direct radiological evaluation, using radiologists' gaze patterns to assess synthetic image quality. Result: Experiments show that 96.6% of the synthetic images were identified as fake, revealing the limitations of generative AI in clinical accuracy. Conclusion: GazeVal fills a gap in the evaluation of synthetic medical images and underscores the importance of clinical authenticity. Abstract: The demand for high-quality synthetic data for model training and augmentation has never been greater in medical imaging. However, current evaluations predominantly rely on computational metrics that fail to align with human expert recognition. This leads to synthetic images that may appear realistic numerically but lack clinical authenticity, posing significant challenges in ensuring the reliability and effectiveness of AI-driven medical tools. To address this gap, we introduce GazeVal, a practical framework that synergizes expert eye-tracking data with direct radiological evaluations to assess the quality of synthetic medical images. GazeVal leverages gaze patterns of radiologists as they provide a deeper understanding of how experts perceive and interact with synthetic data in different tasks (i.e., diagnostic or Turing tests). Experiments with sixteen radiologists revealed that 96.6% of the generated images (by the most recent state-of-the-art AI algorithm) were identified as fake, demonstrating the limitations of generative AI in producing clinically accurate images.

ScreenLLM: Stateful Screen Schema for Efficient Action Understanding and Prediction

Yiqiao Jin,Stefano Petrangeli,Yu Shen,Gang Wu

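Task: Propose ScreenLLM, a set of multimodal large language models for advanced UI understanding and action prediction.

Motivation: Training GUI agents faces challenges such as sparse supervision signals, scalability to large datasets, and the need for nuanced understanding of user intent.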

Details

Method: Introduces stateful screen schema as an efficient representation of GUI interactions and builds ScreenLLM on top of it. Result: Experiments show that ScreenLLM accurately models user behavior and predicts actions. Conclusion: The work lays the foundation for scalable, robust, and intelligent GUI agents that improve user interaction across diverse software environments. Abstract: Graphical User Interface (GUI) agents are autonomous systems that interpret and generate actions, enabling intelligent user assistance and automation. Effective training of these agents presents unique challenges, such as sparsity in supervision signals, scalability for large datasets, and the need for nuanced user understanding. We propose stateful screen schema, an efficient representation of GUI interactions that captures key user actions and intentions over time. Building on this foundation, we introduce ScreenLLM, a set of multimodal large language models (MLLMs) tailored for advanced UI understanding and action prediction. Extensive experiments on both open-source and proprietary models show that ScreenLLM accurately models user behavior and predicts actions. Our work lays the foundation for scalable, robust, and intelligent GUI agents that enhance user interaction in diverse software environments.

MVFNet: Multipurpose Video Forensics Network using Multiple Forms of Forensic Evidence

Tai D. Nguyen,Matthew C. Stamm

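Task: Develop a multipurpose video forensics network (MVFNet) for detecting multiple types of video manipulation (such as inpainting, deepfakes, splicing, and editing).

Motivation: Existing forensic networks typically detect only a single manipulation type, whereas in practice the manipulation used is unknown, so a general-purpose solution is needed.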

Details

Method: Extracts and jointly analyzes multiple forensic feature modalities (spatial and temporal anomalies) and uses a Multi-Scale Hierarchical Transformer module to detect inconsistencies across spatial scales. Result: Experiments show that MVFNet achieves state-of-the-art performance in general scenarios where multiple manipulations are possible and rivals specialized detectors in targeted scenarios. Conclusion: MVFNet offers an effective approach to general-purpose video manipulation detection. Abstract: While videos can be falsified in many different ways, most existing forensic networks are specialized to detect only a single manipulation type (e.g. deepfake, inpainting). This poses a significant issue as the manipulation used to falsify a video is not known a priori. To address this problem, we propose MVFNet - a multipurpose video forensics network capable of detecting multiple types of manipulations including inpainting, deepfakes, splicing, and editing. Our network does this by extracting and jointly analyzing a broad set of forensic feature modalities that capture both spatial and temporal anomalies in falsified videos. To reliably detect and localize fake content of all shapes and sizes, our network employs a novel Multi-Scale Hierarchical Transformer module to identify forensic inconsistencies across multiple spatial scales. Experimental results show that our network obtains state-of-the-art performance in general scenarios where multiple different manipulations are possible, and rivals specialized detectors in targeted scenarios.

Patients Speak, AI Listens: LLM-based Analysis of Online Reviews Uncovers Key Drivers for Urgent Care Satisfaction

Xiaoran Xu,Zhaoqian Xue,Chi Zhang,Jhonatan Medri,Junjie Xiong,Jiayan Zhou,Jin Jin,Yongfeng Zhang,Siyuan Ma,Lingyao Li

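Task: Analyze the public's experience of urgent care facilities to promote community healthcare development.

Motivation: Traditional survey methods are limited in scope, time, and spatial coverage, whereas crowdsourcing through online reviews or social media can capture public opinion more comprehensively.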

Details

Method: Uses Google Maps review data and prompt engineering with a GPT model to analyze aspect-based sentiment toward urgent care, and studies how demographic and socioeconomic factors affect public perception. Result: Interpersonal factors and operational efficiency are the main determinants of patient satisfaction, while technical quality, finances, and facilities show no significant independent effects in multivariate models; population density has a modest association with ratings. Conclusion: Crowdsourcing can effectively reveal the key factors behind public satisfaction and give stakeholders insights for improving urgent care services. Abstract: Investigating the public experience of urgent care facilities is essential for promoting community healthcare development. Traditional survey methods often fall short due to limited scope, time, and spatial coverage. Crowdsourcing through online reviews or social media offers a valuable approach to gaining such insights. With recent advancements in large language models (LLMs), extracting nuanced perceptions from reviews has become feasible. This study collects Google Maps reviews across the DMV and Florida areas and conducts prompt engineering with the GPT model to analyze the aspect-based sentiment of urgent care. We first analyze the geospatial patterns of various aspects, including interpersonal factors, operational efficiency, technical quality, finances, and facilities. Next, we determine Census Block Group (CBG)-level characteristics underpinning differences in public perception, including population density, median income, GINI Index, rent-to-income ratio, household below poverty rate, no insurance rate, and unemployment rate. Our results show that interpersonal factors and operational efficiency emerge as the strongest determinants of patient satisfaction in urgent care, while technical quality, finances, and facilities show no significant independent effects when adjusted for in multivariate models. Among socioeconomic and demographic factors, only population density demonstrates a significant but modest association with patient ratings, while the remaining factors exhibit no significant correlations. Overall, this study highlights the potential of crowdsourcing to uncover the key factors that matter to residents and provide valuable insights for stakeholders to improve public satisfaction with urgent care.

Forensic Self-Descriptions Are All You Need for Zero-Shot Detection, Open-Set Source Attribution, and Clustering of AI-generated Images

Tai D. Nguyen,Aref Azizpour,Matthew C. Stamm

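Task: Propose a self-supervised learning method for detecting synthetic images and attributing their source.

Motivation: With AI image generation advancing rapidly, traditional detection methods struggle to generalize to unseen generators, so a new approach is urgently needed.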

Details

Method: Models image microstructures through self-supervised learning and extracts multi-scale residuals to produce a unique forensic self-description for each image. Result: Experiments show that the method outperforms existing techniques on zero-shot detection, open-set source attribution, and clustering. Conclusion: The method advances synthetic media forensics with higher accuracy and adaptability. Abstract: The emergence of advanced AI-based tools to generate realistic images poses significant challenges for forensic detection and source attribution, especially as new generative techniques appear rapidly. Traditional methods often fail to generalize to unseen generators due to reliance on features specific to known sources during training. To address this problem, we propose a novel approach that explicitly models forensic microstructures - subtle, pixel-level patterns unique to the image creation process. Using only real images in a self-supervised manner, we learn a set of diverse predictive filters to extract residuals that capture different aspects of these microstructures. By jointly modeling these residuals across multiple scales, we obtain a compact model whose parameters constitute a unique forensic self-description for each image. This self-description enables us to perform zero-shot detection of synthetic images, open-set source attribution of images, and clustering based on source without prior knowledge. Extensive experiments demonstrate that our method achieves superior accuracy and adaptability compared to competing techniques, advancing the state of the art in synthetic media forensics.

Hannah Kim,Sofia Martinez,Jason Lee

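Task: Propose a Cross-Modal State-Space Graph Reasoning (CSS-GR) framework for extracting compact, meaningful summaries from large-scale multimodal data.

Motivation: Existing cross-modal summarization methods suffer from high computational overhead and poor interpretability, so a more efficient and interpretable solution is needed.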

Details

Method: Combines a state-space model with graph-based message passing, constructing a graph that captures inter- and intra-modal relationships for more holistic reasoning. Result: On standard multimodal summarization benchmarks, it significantly improves summarization quality and interpretability while maintaining computational efficiency. Conclusion: CSS-GR performs well on multimodal summarization, and an ablation study confirms the contribution of each component. Abstract: The ability to extract compact, meaningful summaries from large-scale and multimodal data is critical for numerous applications, ranging from video analytics to medical reports. Prior methods in cross-modal summarization have often suffered from high computational overheads and limited interpretability. In this paper, we propose a Cross-Modal State-Space Graph Reasoning (CSS-GR) framework that incorporates a state-space model with graph-based message passing, inspired by prior work on efficient state-space models. Unlike existing approaches relying on purely sequential models, our method constructs a graph that captures inter- and intra-modal relationships, allowing more holistic reasoning over both textual and visual streams. We demonstrate that our approach significantly improves summarization quality and interpretability while maintaining computational efficiency, as validated on standard multimodal summarization benchmarks. We also provide a thorough ablation study to highlight the contributions of each component.

Reconstructing Gridded Data from Higher Autocorrelations

W. Riley Casper,Bobby Orozco

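Task: Reconstruct gridded data sets from their higher-order autocorrelations.

Motivation: Higher-order autocorrelations have wide application in X-ray crystallography, computer vision, correlation tomography, and other fields, so the reconstruction problem is of practical interest.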

Details

Method: Presents an explicit reconstruction algorithm and proves that autocorrelations up to order 3r + 3 suffice to determine the data up to translation, where r is the dimension of the grid. Result: Proves the sufficiency of autocorrelations up to order 3r + 3 and gives examples where order 3r + 2 is not enough. Conclusion: Higher-order autocorrelations play an important role in reconstructing gridded data sets, and order 3r + 3 is a sufficient condition. Abstract: The higher-order autocorrelations of integer-valued or rational-valued gridded data sets appear naturally in X-ray crystallography, and have applications in computer vision systems, correlation tomography, correlation spectroscopy, and pattern recognition. In this paper, we consider the problem of reconstructing a gridded data set from its higher-order autocorrelations. We describe an explicit reconstruction algorithm, and prove that the autocorrelations up to order 3r + 3 are always sufficient to determine the data up to translation, where r is the dimension of the grid. We also provide examples of rational-valued gridded data sets which are not determined by their autocorrelations up to order 3r + 2.
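
To make "determined up to translation" concrete, the sketch below evaluates a higher-order autocorrelation of a finitely supported grid function and checks that translating the data leaves it unchanged. The convention used here (k shift vectors give a sum of products of k + 1 values) and the helper names `autocorrelation` and `translate` are assumptions for illustration; the abstract does not spell out the paper's exact definition or indexing of "order".

```python
def autocorrelation(data, shifts):
    """Evaluate sum_y f(y) * f(y + s_1) * ... * f(y + s_k).

    data:   dict mapping grid points (tuples of ints) to values;
            points missing from the dict are treated as 0.
    shifts: list of k shift vectors (tuples of ints). Under the
            convention assumed here this is one value of the
            order-(k + 1) autocorrelation.
    """
    total = 0
    for y, value in data.items():
        term = value
        for s in shifts:
            shifted = tuple(a + b for a, b in zip(y, s))
            term *= data.get(shifted, 0)
        total += term
    return total


def translate(data, offset):
    """Return a copy of the data moved by a fixed offset vector."""
    return {tuple(a + b for a, b in zip(p, offset)): v for p, v in data.items()}


# A tiny 2-dimensional (r = 2) integer-valued data set.
grid = {(0, 0): 2, (1, 0): 3, (1, 2): 1}
moved = translate(grid, (5, -4))

shifts = [(1, 0), (1, 2)]
# Translation does not change any autocorrelation value, which is why
# reconstruction can only ever be "up to translation".
print(autocorrelation(grid, shifts), autocorrelation(moved, shifts))
```

Because every autocorrelation is blind to translation in this way, the reconstruction result can at best identify the data set modulo a global shift.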

Multi-head Reward Aggregation Guided by Entropy

Xiaomin Li,Xupeng Chen,Jingxuan Fan,Eric Hanchen Jiang,Mingye Gao

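Task: Propose an entropy-guided multi-head reward modeling method (ENCORE) to improve the alignment of large language models (LLMs) with safety guidelines.

Motivation: Traditional reinforcement learning from human feedback (RLHF) relies on overall quality ratings that are hard to assign consistently, whereas detailed evaluation against multiple safety criteria is more reliable.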

Details

Method: Uses rating entropy to assess the reliability of safety rules and weights multi-head rewards accordingly, yielding the training-free ENCORE method. Result: On RewardBench safety tasks, ENCORE significantly outperforms random weighting, uniform weighting, single-head Bradley-Terry models, and LLM-based judging methods. Conclusion: ENCORE is a practical and efficient multi-attribute reward modeling method that is broadly applicable and interpretable. Abstract: Aligning large language models (LLMs) with safety guidelines typically involves reinforcement learning from human feedback (RLHF), relying on human-generated preference annotations. However, assigning consistent overall quality ratings is challenging, prompting recent research to shift towards detailed evaluations based on multiple specific safety criteria. This paper uncovers a consistent observation: safety rules characterized by high rating entropy are generally less reliable in identifying responses preferred by humans. Leveraging this finding, we introduce ENCORE, a straightforward entropy-guided approach that composes multi-head rewards by downweighting rules exhibiting high rating entropy. Theoretically, we demonstrate that rules with elevated entropy naturally receive minimal weighting in the Bradley-Terry optimization framework, justifying our entropy-based penalization. Through extensive experiments on RewardBench safety tasks, our method significantly surpasses several competitive baselines, including random weighting, uniform weighting, single-head Bradley-Terry models, and LLM-based judging methods. Our proposed approach is training-free, broadly applicable to various datasets, and maintains interpretability, offering a practical and effective solution for multi-attribute reward modeling.
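
The idea of downweighting high-entropy rules can be sketched in a few lines. This is a minimal illustration, not the ENCORE formula: the softmax over negative entropies, the `temperature` parameter, and the rule names and ratings are all assumptions, since the abstract only states that rules with high rating entropy receive lower weight.

```python
import math
from collections import Counter


def rating_entropy(ratings):
    """Shannon entropy (in nats) of the empirical distribution of discrete ratings."""
    counts = Counter(ratings)
    total = len(ratings)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def entropy_weights(rule_ratings, temperature=1.0):
    """Assumed weighting scheme: softmax over negative entropies,
    so rules with higher rating entropy get smaller weights."""
    scores = {
        rule: math.exp(-rating_entropy(ratings) / temperature)
        for rule, ratings in rule_ratings.items()
    }
    norm = sum(scores.values())
    return {rule: s / norm for rule, s in scores.items()}


def aggregate_reward(head_rewards, weights):
    """Weighted sum of per-rule reward-head outputs."""
    return sum(weights[rule] * head_rewards[rule] for rule in weights)


# Hypothetical ratings collected per safety rule (illustrative values only).
rule_ratings = {
    "no_harmful_instructions": [5, 5, 5, 5, 4, 5],  # consistent, low entropy
    "polite_tone": [1, 5, 3, 2, 4, 1],              # scattered, high entropy
}
weights = entropy_weights(rule_ratings)
reward = aggregate_reward(
    {"no_harmful_instructions": 0.9, "polite_tone": 0.2}, weights
)
print(weights, reward)
```

No training is involved: the weights depend only on the rating distributions, which matches the training-free character described in the abstract.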

What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning

Chi-Hsi Kung,Frangil Ramirez,Juhyung Ha,Yi-Ting Chen,David Crandall,Yi-Hsuan Tsai

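Task: Study how to learn procedure-aware video representations by incorporating LLM-generated state-change descriptions and their counterfactuals.

Motivation: Existing work does not explicitly learn scene state changes, yet understanding procedural activities requires modeling how action steps change the scene and how scene changes influence the action sequence.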

Details

Method: Uses state-change descriptions generated by large language models as supervision signals and generates state-change counterfactuals that simulate hypothesized failure outcomes. Result: Achieves significant improvements on tasks such as temporal action segmentation and error detection. Conclusion: The proposed state-change descriptions and their counterfactuals effectively improve the model's understanding of procedures. Abstract: Understanding a procedural activity requires modeling both how action steps transform the scene, and how evolving scene transformations can influence the sequence of action steps, even those that are accidental or erroneous. Existing work has studied procedure-aware video representations by proposing novel approaches such as modeling the temporal order of actions and has not explicitly learned the state changes (scene transformations). In this work, we study procedure-aware video representation learning by incorporating state-change descriptions generated by Large Language Models (LLMs) as supervision signals for video encoders. Moreover, we generate state-change counterfactuals that simulate hypothesized failure outcomes, allowing models to learn by imagining the unseen "What if" scenarios. This counterfactual reasoning facilitates the model's ability to understand the cause and effect of each step in an activity. To verify the procedure awareness of our model, we conduct extensive experiments on procedure-aware tasks, including temporal action segmentation and error detection. Our results demonstrate the effectiveness of the proposed state-change descriptions and their counterfactuals and achieve significant improvements on multiple tasks. We will make our source code and data publicly available soon.

Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters

Mahmoud Alwakeel,Emory Buck,Jonathan G. Martin,Imran Aslam,Sudarshan Rajagopal,Jian Pei,Mihai V. Podgoreanu,Christopher J. Lindsell,An-Kwok Ian Wong

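Task: Evaluate the accuracy of large language models (LLMs) in extracting PE-related concepts from CTPE reports.

Motivation: Understanding of optimal pulmonary embolism (PE) management is limited by heterogeneous and inaccessible radiology documentation, and the PERT Consortium registry depends on resource-intensive manual abstraction.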

Details

Method: Retrospectively analyzes CTPE reports from MIMIC-IV and Duke Health, comparing multiple LLaMA models. Result: Larger models (70B) outperform smaller ones (8B), with kappa values of 0.98 (PE detection), 0.65-0.75 (PE location), 0.48-0.51 (right heart strain), and 0.65-0.70 (image artifacts); a dual-model review framework achieves >80-90% precision. Conclusion: LLMs show potential for automating PE registry abstraction, reducing manual workload while preserving accuracy. Abstract: Pulmonary embolism (PE) is a leading cause of cardiovascular mortality, yet our understanding of optimal management remains limited due to heterogeneous and inaccessible radiology documentation. The PERT Consortium registry standardizes PE management data but depends on resource-intensive manual abstraction. Large language models (LLMs) offer a scalable alternative for automating concept extraction from computed tomography PE (CTPE) reports. This study evaluated the accuracy of LLMs in extracting PE-related concepts compared to a human-curated criterion standard. We retrospectively analyzed MIMIC-IV and Duke Health CTPE reports using multiple LLaMA models. Larger models (70B) outperformed smaller ones (8B), achieving kappa values of 0.98 (PE detection), 0.65-0.75 (PE location), 0.48-0.51 (right heart strain), and 0.65-0.70 (image artifacts). Moderate temperature tuning (0.2-0.5) improved accuracy, while excessive in-context examples reduced performance. A dual-model review framework achieved >80-90% precision. LLMs demonstrate strong potential for automating PE registry abstraction, minimizing manual workload while preserving accuracy.
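
For readers unfamiliar with the kappa values quoted above, the sketch below computes chance-corrected agreement between a human criterion standard and model output. It assumes Cohen's kappa for two raters; the abstract only says "kappa", so the exact variant used in the study is not confirmed here, and the labels are made-up illustrative data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both raters pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical class throughout
    return (observed - expected) / (1 - expected)


# Hypothetical "PE present" labels (1 = yes, 0 = no) for eight reports.
human = [1, 1, 0, 0, 1, 0, 1, 0]
model = [1, 1, 0, 0, 1, 0, 0, 0]
kappa = cohen_kappa(human, model)
print(kappa)  # 0.75: 7/8 raw agreement, corrected for 0.5 chance agreement
```

Kappa discounts agreement that would occur by chance, which is why it can be much lower than raw accuracy when one class dominates.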

Online Reasoning Video Segmentation with Just-in-Time Digital Twins

Yiqing Shen,Bohan Liu,Chenjia Li,Lalithkumar Seenivasan,Mathias Unberath

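Task: Propose an agent framework for online video reasoning segmentation (RS) that requires no fine-tuning of multimodal large language models (LLMs).

Motivation: Current RS methods rely on the visual perception capabilities of multimodal LLMs; they struggle with multi-step reasoning, require frequent LLM fine-tuning, and handle online video data poorly.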

Details

Method: Introduces a just-in-time digital twin concept in which an LLM plans the construction of a low-level scene representation from high-level video and requests only the specific subset of information needed for reasoning. Result: Introduces a new comprehensive video reasoning segmentation benchmark comprising 200 videos and 895 implicit text queries. Conclusion: The approach addresses the limitations of existing RS methods, supporting online video reasoning segmentation without LLM fine-tuning. Abstract: Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high-level commands without requiring explicit step-by-step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or those that involve complex spatial/temporal relationships. Second, they necessitate LLM fine-tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase risks of catastrophic forgetting during fine-tuning. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning. Our innovation is the introduction of a just-in-time digital twin concept, where -- given an implicit query -- an LLM plans the construction of a low-level scene representation from high-level video using specialist vision models. We refer to this approach to creating a digital twin as "just-in-time" because the LLM planner will anticipate the need for specific information and only request this limited subset instead of always evaluating every specialist model. The LLM then performs reasoning on this digital twin representation to identify target objects. To evaluate our approach, we introduce a new comprehensive video reasoning segmentation benchmark comprising 200 videos with 895 implicit text queries. The benchmark spans three reasoning categories (semantic, spatial, and temporal) with three levels of reasoning chain complexity.

Can Large Language Models Predict Associations Among Human Attitudes?

Ana Ma,Derek Powell

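Task: Study whether a large language model (GPT-4o) can predict associations among human attitudes across different topics.

Motivation: Explore the strong associations among human attitudes on seemingly unrelated topics and test whether language models capture this deeper structure.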

Details

Method: Uses a novel dataset of human attitudes to test GPT-4o's predictive ability in the absence of surface similarity. Result: GPT-4o recreates the pairwise correlations among attitudes and still generates meaningful social inferences without surface similarity. Conclusion: Large language models capture the deeper latent structure of human belief systems. Abstract: Prior work has shown that large language models (LLMs) can predict human attitudes based on other attitudes, but this work has largely focused on predictions from highly similar and interrelated attitudes. In contrast, human attitudes are often strongly associated even across disparate and dissimilar topics. Using a novel dataset of human responses toward diverse attitude statements, we found that a frontier language model (GPT-4o) was able to recreate the pairwise correlations among individual attitudes and to predict individuals' attitudes from one another. Crucially, in an advance over prior work, we tested GPT-4o's ability to predict in the absence of surface-similarity between attitudes, finding that while surface similarity improves prediction accuracy, the model was still highly capable of generating meaningful social inferences between dissimilar attitudes. Altogether, our findings indicate that LLMs capture crucial aspects of the deeper, latent structure of human belief systems.

Neural Architecture Search by Learning a Hierarchical Search Space

Mehraveh Javan Roshtkhari,Matthew Toews,Marco Pedersoli

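Task: Study the application of Monte-Carlo Tree Search (MCTS) to Neural Architecture Search (NAS), specifically for image classification.

Motivation: MCTS performance depends heavily on the order of node branching; in NAS only the final architecture's performance matters, so the branching order can be optimized to improve search efficiency.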

Details

Method: Learns the branching order through hierarchical clustering, using a similarity measure based on architectures' output vectors. Result: Experiments on CIFAR10 and ImageNet show that, given a good branching hierarchy, MCTS finds promising solutions more efficiently than other NAS approaches. Conclusion: Combining MCTS with hierarchical clustering effectively improves NAS search efficiency, particularly for image classification. Abstract: Monte-Carlo Tree Search (MCTS) is a powerful tool for many non-differentiable search related problems such as adversarial games. However, the performance of such an approach highly depends on the order of the nodes that are considered at each branching of the tree. If the first branches cannot distinguish between promising and deceiving configurations for the final task, the efficiency of the search is exponentially reduced. In Neural Architecture Search (NAS), as only the final architecture matters, the visiting order of the branching can be optimized to improve learning. In this paper, we study the application of MCTS to NAS for image classification. We analyze several sampling methods and branching alternatives for MCTS and propose to learn the branching by hierarchical clustering of architectures based on their similarity. The similarity is measured by the pairwise distance of output vectors of architectures. Extensive experiments on two challenging benchmarks on CIFAR10 and ImageNet show that MCTS, if provided with a good branching hierarchy, can yield promising solutions more efficiently than other approaches for NAS problems.
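
The clustering step can be illustrated with a small agglomerative procedure that groups architectures by the distance between their output vectors; the resulting binary tree could then serve as a branching hierarchy for tree search. The Euclidean distance, average linkage, and the architecture names and vectors below are assumptions for illustration; the abstract does not specify which distance or linkage the authors use.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def build_hierarchy(outputs):
    """Average-linkage agglomerative clustering.

    outputs: dict mapping architecture name -> output vector.
    Returns a nested tuple (binary tree) whose leaves are architecture names.
    """
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in outputs]

    def linkage(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(euclidean(outputs[a], outputs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical output vectors (e.g. class probabilities on a probe input).
outputs = {
    "conv3x3_a": [0.9, 0.1, 0.0],
    "conv3x3_b": [0.8, 0.2, 0.0],
    "pool_a": [0.1, 0.1, 0.8],
    "pool_b": [0.0, 0.2, 0.8],
}
tree = build_hierarchy(outputs)
print(tree)
```

Architectures that behave alike end up under the same branch, so an early branching decision separates groups with genuinely different behavior rather than arbitrary ones.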

Enhancing Korean Dependency Parsing with Morphosyntactic Features

Jungyeul Park,Yige Chen,Kyuwon Kim,KyungTae Lim,Chulwoo Park

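Task: Propose the UniDive framework, which combines Universal Dependencies (UD) and Universal Morphology (UniMorph) to improve the representation and processing of Korean morphosyntax.

Motivation: Korean's rich inflectional morphology and flexible word order challenge existing frameworks, which usually treat morphology and syntax separately, leading to inconsistent linguistic analysis.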

Details

Method: UniDive unifies syntactic and morphological annotation by preserving syntactic dependencies while integrating UniMorph features; an integrated dataset is constructed and applied to dependency parsing. Result: Experiments show that enriched morphosyntactic features improve parsing accuracy, especially in distinguishing grammatical relations influenced by morphology. Conclusion: Explicit morphological information contributes to more accurate syntactic analysis, as confirmed in experiments with both encoder-only and decoder-only models. Abstract: This paper introduces UniDive for Korean, an integrated framework that bridges Universal Dependencies (UD) and Universal Morphology (UniMorph) to enhance the representation and processing of Korean morphosyntax. Korean's rich inflectional morphology and flexible word order pose challenges for existing frameworks, which often treat morphology and syntax separately, leading to inconsistencies in linguistic analysis. UniDive unifies syntactic and morphological annotations by preserving syntactic dependencies while incorporating UniMorph-derived features, improving consistency in annotation. We construct an integrated dataset and apply it to dependency parsing, demonstrating that enriched morphosyntactic features enhance parsing accuracy, particularly in distinguishing grammatical relations influenced by morphology. Our experiments, conducted with both encoder-only and decoder-only models, confirm that explicit morphological information contributes to more accurate syntactic analysis.

Efficient Multi-Instance Generation with Janus-Pro-Dirven Prompt Parsing

Fan Qi,Yu Duan,Changsheng Xu

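Task: Address the imprecise spatial grounding and limited scalability of text-guided diffusion models when generating complex multi-object scenes, using Janus-Pro-driven prompt parsing and the MIGLoRA module.

Motivation: Text-guided diffusion models suffer from imprecise spatial grounding and limited scalability when generating complex scenes and need improvement to raise generation quality.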

Details

Method: Proposes a Janus-Pro-driven prompt parsing module and the MIGLoRA plug-in, which uses LoRA for parameter-efficient fine-tuning. Result: Achieves state-of-the-art performance on the COCO and LVIS benchmarks while remaining parameter-efficient, demonstrating superior layout fidelity and scalability. Conclusion: Through its module design, the method markedly improves complex scene generation and offers an efficient solution for open-world synthesis. Abstract: Recent advances in text-guided diffusion models have revolutionized conditional image generation, yet they struggle to synthesize complex scenes with multiple objects due to imprecise spatial grounding and limited scalability. We address these challenges through two key modules: 1) Janus-Pro-driven Prompt Parsing, a prompt-layout parsing module that bridges text understanding and layout generation via a compact 1B-parameter architecture, and 2) MIGLoRA, a parameter-efficient plug-in integrating Low-Rank Adaptation (LoRA) into UNet (SD1.5) and DiT (SD3) backbones. MIGLoRA is capable of preserving the base model's parameters and ensuring plug-and-play adaptability, minimizing architectural intrusion while enabling efficient fine-tuning. To support a comprehensive evaluation, we create DescripBox and DescripBox-1024, benchmarks that span diverse scenes and resolutions. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency, demonstrating superior layout fidelity and scalability for open-world synthesis.

Shared Global and Local Geometry of Language Model Embeddings

Andrew Lee,Melanie Weber,Fernanda Viégas,Martin Wattenberg

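Task: Investigate the common geometric structure of token embeddings in language models and its transferability across models.

Motivation: Show that the token embeddings of language models share a common geometric structure and explore its potential for interpretability and applications.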

Details

Method: Characterizes the structure of token embeddings through global similarity and local geometric analysis (such as Locally Linear Embeddings and an intrinsic dimension measure). Result: Token embeddings lie on a lower-dimensional manifold, and tokens with lower intrinsic dimension form semantically coherent clusters; in addition, the alignment of token embeddings allows steering vectors to be transferred between models. Conclusion: The shared geometric structure of token embeddings opens new possibilities for interpretability and cross-model applications. Abstract: Researchers have recently suggested that models share common representations. In this work, we find that the token embeddings of language models exhibit common geometric structure. First, we find "global" similarities: token embeddings often share similar relative orientations. Next, we characterize local geometry in two ways: (1) by using Locally Linear Embeddings, and (2) by defining a simple measure for the intrinsic dimension of each token embedding. Our intrinsic dimension measure demonstrates that token embeddings lie on a lower dimensional manifold. We qualitatively show that tokens with lower intrinsic dimensions often have semantically coherent clusters, while those with higher intrinsic dimensions do not. Both characterizations allow us to find similarities in the local geometry of token embeddings. Perhaps most surprisingly, we find that alignment in token embeddings persists through the hidden states of language models, allowing us to develop an application for interpretability. Namely, we empirically demonstrate that steering vectors from one language model can be transferred to another, despite the two models having different dimensions.

HSLiNets: Evaluating Band Ordering Strategies in Hyperspectral and LiDAR Fusion

Judy X Yang,Jing Wang,Zhuanfeng Li,Chenhong Sui,Zekun Long,Jun Zhou

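Task: Study how band order affects classification performance when fusing hyperspectral imaging (HSI) and Light Detection and Ranging (LiDAR) data.

Motivation: Previous studies overlooked the role of band order in HSI-LiDAR fusion, yet experiments show that it significantly affects classification accuracy.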

Details

Method: Proposes a novel fusion architecture that enhances feature representation by adaptively fusing different band order configurations. Result: On the Houston 2013 and Trento datasets, the method outperforms existing fusion models. Conclusion: Band order is a key factor in fusion performance, and the proposed method effectively improves classification accuracy. Abstract: The integration of hyperspectral imaging (HSI) and Light Detection and Ranging (LiDAR) data provides complementary spectral and spatial information for remote sensing applications. While previous studies have explored the role of band selection and grouping in HSI classification, little attention has been given to how the spectral sequence or band order affects classification outcomes when fused with LiDAR. In this work, we systematically investigate the influence of band order on HSI-LiDAR fusion performance. Through extensive experiments, we demonstrate that band order significantly impacts classification accuracy, revealing a previously overlooked factor in fusion-based models. Motivated by this observation, we propose a novel fusion architecture that not only integrates HSI and LiDAR data but also learns from multiple band order configurations. The proposed method enhances feature representation by adaptively fusing different spectral sequences, leading to improved classification accuracy. Experimental results on the Houston 2013 and Trento datasets show that our approach outperforms state-of-the-art fusion models. Data and code are available at https://github.com/Judyxyang/HSLiNets.

EQ-Negotiator: An Emotion-Reasoning LLM Agent in Credit Dialogues

Yuhan Liu,Yunbo Long

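Task: Develop an EQ-negotiator that combines emotion sensing with emotional reasoning to improve the dynamic emotional expression of LLM-based chatbots in credit dialogues.

Motivation: Current LLM-based chatbots lack dynamic emotional expression in credit dialogues, relying mainly on passive empathy and failing to handle clients' negative emotions effectively.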

Details

Method: Combines emotion sensing from pre-trained language models with emotional reasoning based on Game Theory and Hidden Markov Models, fine-tuning on public emotion datasets. Result: The EQ-negotiator captures shifts in client emotions in real time and dynamically adjusts its response tone according to emotion decision policies, improving the effectiveness of credit dialogues. Conclusion: The EQ-negotiator effectively manages clients' negative emotions, raises satisfaction with credit services, and fosters positive client relationships. Abstract: While large language model (LLM)-based chatbots have been applied for effective engagement in credit dialogues, their capacity for dynamic emotional expression remains limited. Current agents primarily rely on passive empathy rather than affective reasoning. For instance, when faced with persistent client negativity, the agent should employ strategic emotional adaptation by expressing measured anger to discourage counterproductive behavior and guide the conversation toward resolution. This context-aware emotional modulation is essential for imitating the nuanced decision-making of human negotiators. This paper introduces an EQ-negotiator that combines emotion sensing from pre-trained language models (PLMs) with emotional reasoning based on Game Theory and Hidden Markov Models. It takes into account both the current and historical emotions of the client to better manage and address negative emotions during interactions. By fine-tuning pre-trained language models (PLMs) on public emotion datasets and validating them on the credit dialogue datasets, our approach enables LLM-based agents to effectively capture shifts in client emotions and dynamically adjust their response tone based on our emotion decision policies in real-world financial negotiations. This EQ-negotiator can also help credit agencies foster positive client relationships, enhancing satisfaction in credit services.

Rerouting Connection: Hybrid Computer Vision Analysis Reveals Visual Similarity Between Indus and Tibetan-Yi Corridor Writing Systems

Ooha Lakkadi Reddy

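Task: Investigate historical connections between the Indus Valley script and the pictographic systems of the Tibetan-Yi Corridor.

Motivation: Explore visual and morphological similarities between the Indus Valley script and Tibetan-Yi Corridor scripts to reveal the complexity of ancient cultural transmission networks.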

Details

Method: Combines a hybrid CNN-Transformer architecture with an anthropological framework, running an ensemble analysis of three target scripts across 15 independently trained models. Result: Tibetan-Yi Corridor scripts show markedly higher visual similarity to the Indus script (61.7%-63.5%) than to Bronze Age Proto-Cuneiform (10.2%-10.9%) or Proto-Elamite (7.6%-8.7%). Conclusion: The results indicate substantial similarity between the Indus script and Tibetan-Yi Corridor scripts, challenging conventional views of isolated script development and pointing to complex cultural transmission networks between ancient South and East Asia. Abstract: This thesis employs a hybrid CNN-Transformer architecture, in conjunction with a detailed anthropological framework, to investigate potential historical connections between the visual morphology of the Indus Valley script and pictographic systems of the Tibetan-Yi Corridor. Through an ensemble methodology of three target scripts across 15 independently trained models, we demonstrate that Tibetan-Yi Corridor scripts exhibit approximately six-fold higher visual similarity to the Indus script (61.7%-63.5%) than to the Bronze Age Proto-Cuneiform (10.2%-10.9%) or Proto-Elamite (7.6%-8.7%) systems. Additionally, and contrary to our current understanding of the networks of the Indus Valley Civilization, the Indus script unexpectedly maps closer to Tibetan-Yi Corridor scripts, with a mean cosine similarity of 0.629, than to the aforementioned contemporaneous West Asian signaries, which recorded mean cosine similarities of 0.104 and 0.080, respectively, despite their close geographic proximity and evident trade relations. Across various dimensionality reduction practices and clustering methodologies, the Indus script consistently clusters closest to Tibetan-Yi Corridor scripts. Our computational results align with qualitative observations of specific pictorial parallels in numeral systems, gender markers, and key iconographic elements; this is further supported by archaeological evidence of sustained contact networks along the ancient Shu-Shendu road in tandem with the Indus Valley Civilization's decline, providing a plausible transmission pathway. While alternative explanations cannot be ruled out, the specificity and consistency of observed similarities challenge conventional narratives of isolated script development and suggest more complex ancient cultural transmission networks between South and East Asia than previously recognized.
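The mean cosine similarities quoted in the abstract (0.629, 0.104, 0.080) compare embeddings of glyphs from different scripts. The abstract does not say how pairs were formed or aggregated, so the sketch below simply assumes an average over all cross-script pairs; the function names and the toy vectors are hypothetical and only illustrate the metric itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_cross_similarity(script_a, script_b):
    """Average cosine similarity over every (glyph_a, glyph_b) pair.

    Assumption: all cross-script pairs are weighted equally; the thesis may
    aggregate differently.
    """
    sims = [cosine_similarity(a, b) for a in script_a for b in script_b]
    return sum(sims) / len(sims)


# Toy 2-D "embeddings" (hypothetical data, not from the thesis).
print(mean_cross_similarity([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

A score near 1 means the two embeddings point in almost the same direction, while a score near 0 means they are close to orthogonal.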

ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging

Haoming Xu,Shuxun Wang,Yanqiu Zhao,Yi Zhong,Ziyan Jiang,Ningyuan Zhao,Shumin Deng,Huajun Chen,Ningyu Zhang

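Task: Selectively erase sensitive knowledge from large language models while avoiding both over-forgetting and under-forgetting.

Motivation: Address the problem of unlearning sensitive content in large language models by proposing a more balanced unlearning method.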

Details

Method: Uses model merging (TIES-Merging) to combine two specialized models into a more balanced unlearned model. Result: Ranked second among 26 teams, with a Task Aggregate score of 0.944 and an overall Aggregate score of 0.487. Conclusion: Calls for more comprehensive evaluation methodologies and a rethinking of unlearning objectives, and points out the shortcomings of current evaluation metrics. Abstract: This paper presents the ZJUKLAB team's submission for SemEval-2025 Task 4: Unlearning Sensitive Content from Large Language Models. This task aims to selectively erase sensitive knowledge from large language models, avoiding both over-forgetting and under-forgetting issues. We propose an unlearning system that leverages Model Merging (specifically TIES-Merging), combining two specialized models into a more balanced unlearned model. Our system achieves competitive results, ranking second among 26 teams, with an online score of 0.944 for Task Aggregate and 0.487 for overall Aggregate. In this paper, we also conduct local experiments and perform a comprehensive analysis of the unlearning process, examining performance trajectories, loss dynamics, and weight perspectives, along with several supplementary experiments, to understand the effectiveness of our method. Furthermore, we analyze the shortcomings of our method and evaluation metrics, emphasizing that MIA scores and ROUGE-based metrics alone are insufficient to fully evaluate successful unlearning. Finally, we emphasize the need for more comprehensive evaluation methodologies and rethinking of unlearning objectives in future research. Code is available at https://github.com/zjunlp/unlearn/tree/main/semeval25.
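For readers unfamiliar with TIES-Merging, the three steps it is usually described with (trim small updates, elect a sign per parameter, average only the agreeing updates) can be sketched on flat weight lists as follows. This is a minimal illustration, not the ZJUKLAB system: the function names, the `keep_fraction` and `scale` defaults, and the toy weights are all assumptions.

```python
def trim(task_vector, keep_fraction):
    """Keep only the largest-magnitude entries; zero out the rest.

    Entries tied with the cut-off magnitude are all kept.
    """
    k = max(1, round(len(task_vector) * keep_fraction))
    threshold = sorted((abs(v) for v in task_vector), reverse=True)[k - 1]
    return [v if abs(v) >= threshold else 0.0 for v in task_vector]


def ties_merge(base, finetuned_models, keep_fraction=0.2, scale=1.0):
    """Merge fine-tuned weight vectors into one, starting from `base`."""
    # Task vector = fine-tuned weights minus the shared base weights.
    task_vectors = [[w - b for w, b in zip(model, base)] for model in finetuned_models]
    trimmed = [trim(tv, keep_fraction) for tv in task_vectors]

    merged = []
    for i, b in enumerate(base):
        values = [tv[i] for tv in trimmed]
        total = sum(values)
        if total == 0.0:
            # No surviving update, or the signs cancel exactly: keep the base weight.
            merged.append(b)
            continue
        sign = 1.0 if total > 0 else -1.0  # elected sign: direction with more mass
        agreeing = [v for v in values if v * sign > 0]
        merged.append(b + scale * sum(agreeing) / len(agreeing))
    return merged


base = [0.0, 0.0, 0.0, 0.0]
model_a = [1.0, 0.1, -2.0, 0.0]  # hypothetical specialized model A
model_b = [3.0, 0.2, 1.0, 0.0]   # hypothetical specialized model B
print(ties_merge(base, [model_a, model_b], keep_fraction=0.5))
```

In the third parameter the two models disagree in sign; the update with more total magnitude wins and the conflicting one is dropped rather than averaged in, which is the property that distinguishes this scheme from plain weight averaging.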

KAC: Kolmogorov-Arnold Classifier for Continual Learning

Yusong Hu,Zichen Liang,Fei Yang,Qibin Hou,Xialei Liu,Ming-Ming Cheng

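Task: Explore the potential of KAC, a new classifier based on Kolmogorov-Arnold Networks (KAN), for continual learning.

Motivation: Existing linear classifiers struggle to maintain a stable classification space in continual learning, whereas KAN has shown stability on simple continual regression tasks.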

Details

Method: Proposes the Kolmogorov-Arnold Classifier (KAC), which combines the KAN structure with Radial Basis Functions (RBF) for better compatibility with continual learning. Result: KAC improves performance across multiple continual learning benchmarks, supporting its effectiveness and robustness. Conclusion: KAC is an effective and robust classifier for continual learning. Abstract: Continual learning requires models to train continuously across consecutive tasks without forgetting. Most existing methods utilize linear classifiers, which struggle to maintain a stable classification space while learning new tasks. Inspired by the success of Kolmogorov-Arnold Networks (KAN) in preserving learning stability during simple continual regression tasks, we set out to explore their potential in more complex continual learning scenarios. In this paper, we introduce the Kolmogorov-Arnold Classifier (KAC), a novel classifier developed for continual learning based on the KAN structure. We delve into the impact of KAN's spline functions and introduce Radial Basis Functions (RBF) for improved compatibility with continual learning. We replace linear classifiers with KAC in several recent approaches and conduct experiments across various continual learning benchmarks, all of which demonstrate performance improvements, highlighting the effectiveness and robustness of KAC in continual learning. The code is available at https://github.com/Ethanhuhuhu/KAC.
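To make the idea of a KAN-style classifier head with RBF bases concrete, the sketch below replaces each scalar weight of a linear classifier with a learnable one-dimensional function, expressed as a weighted sum of Gaussian bumps. It is an illustration under assumptions, not the released KAC code: the shared `centers`, the single `width`, and the `weights[class][feature][basis]` layout are choices made here for brevity.

```python
import math


def rbf_features(x, centers, width):
    """Gaussian radial basis responses of a scalar input at fixed centers."""
    return [math.exp(-((x - c) / width) ** 2) for c in centers]


def kan_rbf_logits(features, weights, centers, width):
    """Class logits where every (class, feature) edge is a learnable 1-D function.

    weights[c][i][k] scales the k-th RBF response of feature i for class c.
    A linear classifier would instead compute sum_i w[c][i] * features[i].
    """
    logits = []
    for class_weights in weights:
        total = 0.0
        for x_i, edge_weights in zip(features, class_weights):
            basis = rbf_features(x_i, centers, width)
            total += sum(w * b for w, b in zip(edge_weights, basis))
        logits.append(total)
    return logits


centers = [0.0, 1.0]                     # hypothetical basis centers
weights = [[[1.0, 0.0]], [[0.0, 1.0]]]   # 2 classes, 1 feature, 2 bases
print(kan_rbf_logits([0.0], weights, centers, width=1.0))
```

Because each Gaussian basis responds only near its center, updating the weights for inputs in one region leaves the function largely unchanged elsewhere, which is one intuition for why such bases might suit continual learning.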

Function Alignment: A New Theory for Mind and Intelligence, Part I: Foundations

Gus G. Xia

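Task: Propose function alignment as a new theoretical framework for mind and intelligence.

Motivation: Provide a unified explanation for fragmented concepts in cognitive science (such as bounded rationality, symbol grounding, and analogy-making) and connect computational architecture, psychological theory, and contemplative traditions.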

Details

Method: Models meaning, interpretation, and analogy through interactions among layered representations, forming a coherent theoretical framework. Result: Presents bounded interpretability as a core theoretical insight and offers a blueprint for building minds. Conclusion: Function alignment provides a structural foundation for understanding the mind and supports its reconstruction across disciplines. Abstract: This paper introduces function alignment, a novel theory of mind and intelligence that is both intuitively compelling and structurally grounded. It explicitly models how meaning, interpretation, and analogy emerge from interactions among layered representations, forming a coherent framework capable not only of modeling minds but also of serving as a blueprint for building them. One of the key theoretical insights derived from function alignment is bounded interpretability, which provides a unified explanation for previously fragmented ideas in cognitive science, such as bounded rationality, symbol grounding, and analogy-making. Beyond modeling, the function alignment framework bridges disciplines often kept apart, linking computational architecture, psychological theory, and even contemplative traditions such as Zen. Rather than building on any philosophical systems, it offers a structural foundation upon which multiple ways of understanding the mind may be reconstructed.

Can Video Diffusion Model Reconstruct 4D Geometry?

Jinjie Mai,Wenxuan Zhu,Haozhe Liu,Bing Li,Cheng Zheng,Jürgen Schmidhuber,Bernard Ghanem

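Task: Reconstruct dynamic 3D scenes (i.e., 4D geometry) from monocular video.

Motivation: Conventional multiview geometry methods struggle with dynamic motion, while learning-based methods require sophisticated optimization or specialized 4D representations.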

Details

Method: Proposes Sora3R, a framework that uses the spatiotemporal priors of large-scale video diffusion models to infer 4D pointmaps directly through a two-stage pipeline: (1) adapt a pointmap VAE from a pretrained video VAE; (2) finetune a diffusion model in the combined video and pointmap latent space. Result: Without external modules or iterative global alignment, Sora3R reliably recovers camera poses and detailed scene geometry, performing on par with state-of-the-art dynamic 4D reconstruction methods. Conclusion: Sora3R is an efficient framework for dynamic 4D reconstruction with performance on par with state-of-the-art methods. Abstract: Reconstructing dynamic 3D scenes (i.e., 4D geometry) from monocular video is an important yet challenging problem. Conventional multiview geometry-based approaches often struggle with dynamic motion, whereas recent learning-based methods either require specialized 4D representation or sophisticated optimization. In this paper, we present Sora3R, a novel framework that taps into the rich spatiotemporal priors of large-scale video diffusion models to directly infer 4D pointmaps from casual videos. Sora3R follows a two-stage pipeline: (1) we adapt a pointmap VAE from a pretrained video VAE, ensuring compatibility between the geometry and video latent spaces; (2) we finetune a diffusion backbone in combined video and pointmap latent space to generate coherent 4D pointmaps for every frame. Sora3R operates in a fully feedforward manner, requiring no external modules (e.g., depth, optical flow, or segmentation) or iterative global alignment. Extensive experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction across diverse scenarios.

Leveraging Large Language Models for Risk Assessment in Hyperconnected Logistic Hub Network Deployment

Yinzhu Quan,Yujia Xu,Guanlin Chen,Frederick Benaben,Benoit Montreuil

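Task: Design a large language model (LLM)-based risk assessment framework for evaluating the deployment of hyperconnected logistic hub networks.

Motivation: Energy efficiency and environmental sustainability are increasingly important in global supply chains, but traditional methods struggle to handle unstructured information effectively, making dynamic risk assessment essential.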

Details

Method: Integrates multiple analytical tools with an LLM that analyzes unstructured data (such as geopolitical conditions, financial trends, and historical storm events), and uses prompt design to guide the LLM in assessing risk types and levels. Result: The framework systematically identifies potential risks and clusters logistic hubs with similar risk profiles, supporting a data-driven decision-making process. Conclusion: The framework is scalable and has long-term memory, enhances decision-making through explanation and interpretation, and provides comprehensive risk assessment for hyperconnected supply chain networks. Abstract: The growing emphasis on energy efficiency and environmental sustainability in global supply chains introduces new challenges in the deployment of hyperconnected logistic hub networks. In current volatile, uncertain, complex, and ambiguous (VUCA) environments, dynamic risk assessment becomes essential to ensure successful hub deployment. However, traditional methods often struggle to effectively capture and analyze unstructured information. In this paper, we design a Large Language Model (LLM)-driven risk assessment pipeline integrated with multiple analytical tools to evaluate logistic hub deployment. This framework enables LLMs to systematically identify potential risks by analyzing unstructured data, such as geopolitical instability, financial trends, historical storm events, traffic conditions, and emerging risks from news sources. These data are processed through a suite of analytical tools, which are automatically called by LLMs to support a structured and data-driven decision-making process for logistic hub selection. In addition, we design prompts that instruct LLMs to leverage these tools for assessing the feasibility of hub selection by evaluating various risk types and levels. Through risk-based similarity analysis, LLMs cluster logistic hubs with comparable risk profiles, enabling a structured approach to risk assessment. In conclusion, the framework incorporates scalability with long-term memory and enhances decision-making through explanation and interpretation, enabling comprehensive risk assessments for logistic hub deployment in hyperconnected supply chain networks.

Learning Class Prototypes for Unified Sparse Supervised 3D Object Detection

Yun Zhu,Le Hui,Hang Yang,Jianjun Qian,Jin Xie,Jian Yang

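Task: Propose a unified sparse supervised 3D object detection method that works for both indoor and outdoor scenes.

Motivation: Existing sparse supervised 3D object detection methods focus only on outdoor scenes and overlook indoor settings.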

Details

Method: Learns class prototypes to exploit unlabeled objects, using a prototype-based matching module and a multi-label cooperative refinement module. Result: Under sparse supervision, the method reaches about 78%, 90%, and 96% of the fully supervised detector's performance on ScanNet V2, SUN RGB-D, and KITTI, respectively. Conclusion: The method performs well under sparse supervision and scales well. Abstract: Both indoor and outdoor scene perceptions are essential for embodied intelligence. However, current sparse supervised 3D object detection methods focus solely on outdoor scenes without considering indoor settings. To this end, we propose a unified sparse supervised 3D object detection method for both indoor and outdoor scenes through learning class prototypes to effectively utilize unlabeled objects. Specifically, we first propose a prototype-based object mining module that converts the unlabeled object mining into a matching problem between class prototypes and unlabeled features. By using optimal transport matching results, we assign prototype labels to high-confidence features, thereby achieving the mining of unlabeled objects. We then present a multi-label cooperative refinement module to effectively recover missed detections through pseudo label quality control and prototype label cooperation. Experiments show that our method achieves state-of-the-art performance under the one object per scene sparse supervised setting across indoor and outdoor datasets. With only one labeled object per scene, our method achieves about 78%, 90%, and 96% performance compared to the fully supervised detector on ScanNet V2, SUN RGB-D, and KITTI, respectively, highlighting the scalability of our method. Code is available at https://github.com/zyrant/CPDet3D.
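The abstract describes matching unlabeled features to class prototypes with optimal transport and labeling only the high-confidence matches. One common way to solve such a matching is entropy-regularized optimal transport via Sinkhorn iterations, sketched below. Whether the paper uses Sinkhorn, uniform marginals, or this confidence rule is not stated in the abstract; `epsilon`, `threshold`, and both function names are assumptions made for illustration.

```python
import math


def sinkhorn(cost, epsilon=0.1, iterations=200):
    """Entropy-regularized transport plan between rows (features) and columns (prototypes).

    Assumes uniform mass on both sides; cost[i][j] is the feature-to-prototype distance.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the target marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def assign_prototype_labels(plan, threshold=0.9):
    """Return (feature_index, prototype_index) for confidently matched features."""
    assignments = []
    for i, row in enumerate(plan):
        best = max(range(len(row)), key=lambda j: row[j])
        confidence = row[best] / sum(row)  # share of this feature's mass on its best prototype
        if confidence >= threshold:
            assignments.append((i, best))
    return assignments


cost = [[0.0, 1.0], [1.0, 0.0]]  # toy distances: feature i is closest to prototype i
print(assign_prototype_labels(sinkhorn(cost)))
```

Features whose mass is spread over several prototypes fall below the threshold and stay unlabeled, which is the behavior one would want when mining pseudo labels.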

Collaborative Evolution: Multi-Round Learning Between Large and Small Language Models for Emergent Fake News Detection

Ziyi Zhou,Xiaoming Zhang,Shenghan Tan,Litian Zhang,Chaozhuo Li

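Task: Propose Multi-Round Collaboration Detection (MRCD), a new framework that addresses the limitations of existing small language models (SLMs) and large language models (LLMs) in fake news detection.

Motivation: The spread of fake news on social media has had a significant impact on society. Existing SLMs require extensive supervised training and adapt poorly to rapidly changing conditions, while LLMs, despite their zero-shot capabilities, lack relevant demonstrations and up-to-date knowledge.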

Details

Method: MRCD uses a two-stage retrieval module to select relevant and up-to-date demonstrations and knowledge, combines the generalization ability of LLMs with the specialized functionality of SLMs, and adopts a multi-round learning framework to make detection more reliable. Result: MRCD achieves SOTA results on two real-world datasets, Pheme and Twitter16, improving accuracy by 7.4% and 12.8%, respectively, over using SLMs alone. Conclusion: MRCD addresses the limitations of current models and improves the detection of emergent fake news. Abstract: The proliferation of fake news on social media platforms has exerted a substantial influence on society, leading to discernible impacts and deleterious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from the necessity for extensive supervised training and the challenge of adapting to rapidly evolving circumstances. Large language models (LLMs), despite their robust zero-shot capabilities, have fallen short in effectively identifying fake news due to a lack of pertinent demonstrations and the dynamic nature of knowledge. In this paper, a novel framework, Multi-Round Collaboration Detection (MRCD), is proposed to address these limitations. The MRCD framework combines the merits of both LLMs and SLMs by integrating their generalization abilities and specialized functionalities, respectively. Our approach features a two-stage retrieval module that selects relevant and up-to-date demonstrations and knowledge, enhancing in-context learning for better detection of emerging news events. We further design a multi-round learning framework to ensure more reliable detection results. Our framework MRCD achieves SOTA results on two real-world datasets, Pheme and Twitter16, with accuracy improvements of 7.4% and 12.8% compared to using only SLMs, which effectively addresses the limitations of current models and improves the detection of emergent fake news.

StyledStreets: Multi-style Street Simulator with Spatial and Temporal Consistency

Yuyin Chen,Yida Wang,Xueyang Zhang,Kun Zhan,Peng Jia,Yifei Zhan,Xianpeng Lang

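Task: Propose a multi-style street simulator that supports instruction-driven scene editing while ensuring spatial and temporal consistency.

Motivation: Urban scene reconstruction needs to model both static infrastructure and dynamic elements while supporting diverse environmental conditions.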

Details

Method: Builds on a Gaussian Splatting framework with pose optimization and multi-view training, and introduces a hybrid embedding scheme, uncertainty-aware rendering, and a unified parametric model. Result: Achieves photorealistic style transfer across seasons, weather conditions, and camera setups while preserving the scene's motion patterns and geometric relationships. Conclusion: The approach provides new capabilities for urban simulation, with applications in autonomous vehicle testing and augmented reality systems. Abstract: Urban scene reconstruction requires modeling both static infrastructure and dynamic elements while supporting diverse environmental conditions. We present StyledStreets, a multi-style street simulator that achieves instruction-driven scene editing with guaranteed spatial and temporal consistency. Building on a state-of-the-art Gaussian Splatting framework for street scenarios enhanced by our proposed pose optimization and multi-view training, our method enables photorealistic style transfers across seasons, weather conditions, and camera setups through three key innovations: First, a hybrid embedding scheme disentangles persistent scene geometry from transient style attributes, allowing realistic environmental edits while preserving structural integrity. Second, uncertainty-aware rendering mitigates supervision noise from diffusion priors, enabling robust training across extreme style variations. Third, a unified parametric model prevents geometric drift through regularized updates, maintaining multi-view consistency across seven vehicle-mounted cameras. Our framework preserves the original scene's motion patterns and geometric relationships. Qualitative results demonstrate plausible transitions between diverse conditions (snow, sandstorm, night), while quantitative evaluations show state-of-the-art geometric accuracy under style transfers. The approach establishes new capabilities for urban simulation, with applications in autonomous vehicle testing and augmented reality systems requiring reliable environmental consistency. Code will be made publicly available upon publication.

UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning

Hongxuan Tang,Hao Liu,Xinyan Xiao

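Task: Propose UGen, a unified autoregressive multimodal model that performs strongly on text processing, image understanding, and image generation at the same time.

Motivation: Address the challenges of unified multimodal learning and improve model performance across multiple tasks.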

Details

Method: Converts text and images into discrete token sequences, generates them autoregressively with a single transformer, and trains with a progressive vocabulary learning mechanism. Result: On comprehensive text and image tasks, UGen improves overall performance by 13.3% over the vanilla unified autoregressive method and is competitive with task-specific models across all tasks. Conclusion: Progressive vocabulary learning effectively improves unified multimodal learning, giving UGen broad application potential. Abstract: We introduce UGen, a unified autoregressive multimodal model that demonstrates strong performance across text processing, image understanding, and image generation tasks simultaneously. UGen converts both texts and images into discrete token sequences and utilizes a single transformer to generate them uniformly in an autoregressive manner. To address the challenges associated with unified multimodal learning, UGen is trained using a novel mechanism, namely progressive vocabulary learning. In this process, visual token IDs are incrementally activated and integrated into the training phase, ultimately enhancing the effectiveness of unified multimodal learning. Experiments on comprehensive text and image tasks show that UGen achieves a significant overall performance improvement of 13.3% compared to the vanilla unified autoregressive method, and it also delivers competitive results across all tasks against several task-specific models.

One Snapshot is All You Need: A Generalized Method for mmWave Signal Generation

Teng Huang,Han Ding,Wenxin Sun,Cui Zhao,Ge Wang,Fei Wang,Kun Zhao,Zhi Wang,Wei Xi

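Task: Propose mmGen, a generalized framework for full-scene mmWave signal generation.

Motivation: Existing mmWave datasets are scarce and constrained by pre-processed signatures and inconsistent annotation formats, which limits the wider adoption of mmWave technology.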

Details

Method: Builds physical signal transmission models to synthesize human-reflected and environment-reflected mmWave signals from 3D meshes, accounting for material properties, antenna gains, and multipath reflections. Result: Across three environments, the average similarity between synthesized and real-captured signals exceeds 0.91 for Range-Angle signatures and 0.89 for micro-Doppler signatures. Conclusion: mmGen is effective at generating realistic mmWave signals and shows practical applicability. Abstract: Wireless sensing systems, particularly those using mmWave technology, offer distinct advantages over traditional vision-based approaches, such as enhanced privacy and effectiveness in poor lighting conditions. These systems, leveraging FMCW signals, have shown success in human-centric applications such as localization and gesture recognition. However, comprehensive mmWave datasets for diverse applications are scarce, often constrained by pre-processed signatures (e.g., point clouds or RA heatmaps) and inconsistent annotation formats. To overcome these limitations, we propose mmGen, a novel and generalized framework tailored for full-scene mmWave signal generation. By constructing physical signal transmission models, mmGen synthesizes human-reflected and environment-reflected mmWave signals from the constructed 3D meshes. Additionally, we incorporate methods to account for material properties, antenna gains, and multipath reflections, enhancing the realism of the synthesized signals. We conduct extensive experiments using a prototype system with commercial mmWave devices and Kinect sensors. The results show that the average similarity of Range-Angle and micro-Doppler signatures between the synthesized and real-captured signals across three different environments exceeds 0.91 and 0.89, respectively, demonstrating the effectiveness and practical applicability of mmGen.

LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models

Hengyuan Zhao,Ziqin Wang,Qixin Sun,Kaiyou Song,Yilin Li,Xiaolin Hu,Qingpei Guo,Si Liu

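Task: Propose LLaVA-CMoE, a framework that addresses two major challenges of continual learning with Mixture of Experts (MoE) in large language models.

Motivation: Avoid the parameter bloat that comes with a growing number of tasks, and avoid the erosion of previously learned knowledge caused by modifying existing router parameters.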

Details

Method: Uses Probe-Guided Knowledge Extension (PGKE) to assess whether additional knowledge is needed, combined with a hierarchical routing algorithm called Probabilistic Task Locator (PTL). Result: Substantially improves model performance on the Coin benchmark while keeping the parameter count reasonable. Conclusion: LLaVA-CMoE addresses parameter bloat and knowledge erosion in continual learning and improves model performance. Abstract: Although applying Mixture of Experts to large language models for learning new tasks is widely regarded as an effective strategy for continuous learning, there still remain two major challenges: (1) As the number of tasks grows, simple parameter expansion strategies can lead to excessively large models. (2) Modifying the parameters of the existing router results in the erosion of previously acquired knowledge. In this paper, we present an innovative framework named LLaVA-CMoE, which is a continuous Mixture of Experts (MoE) architecture without any replay data. Specifically, we have developed a method called Probe-Guided Knowledge Extension (PGKE), which employs probe experts to assess whether additional knowledge is required for a specific layer. This approach enables the model to adaptively expand its network parameters based on task distribution, thereby significantly improving the efficiency of parameter expansion. Additionally, we introduce a hierarchical routing algorithm called Probabilistic Task Locator (PTL), where high-level routing captures inter-task information and low-level routing focuses on intra-task details, ensuring that new task experts do not interfere with existing ones. Our experiments show that our efficient architecture substantially improves model performance on the Coin benchmark while maintaining a reasonable parameter count.

AdaMHF: Adaptive Multimodal Hierarchical Fusion for Survival Prediction

Shuaiyu Zhang,Xun Lin,Rongxiang Zhang,Yu Bai,Yong Xu,Tao Tan,Xunbin Zheng,Zitong Yu

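Task: Propose an Adaptive Multimodal Hierarchical Fusion framework (AdaMHF) that integrates pathology images and genomic data to improve the accuracy of survival analysis.

Motivation: Current methods ignore biological characteristics such as heterogeneity and sparsity, which limits their adaptability to clinical practice.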

Details

Method: AdaMHF extracts heterogeneous and sparse features with an experts expansion and residual structure, refines them through selection and aggregation, and then fuses them hierarchically. Result: Experiments on TCGA datasets show that AdaMHF outperforms existing methods in both complete and incomplete modality settings. Conclusion: AdaMHF is an efficient and comprehensive framework for multimodal survival analysis in clinical practice. Abstract: The integration of pathologic images and genomic data for survival analysis has gained increasing attention with advances in multimodal learning. However, current methods often ignore biological characteristics, such as heterogeneity and sparsity, both within and across modalities, ultimately limiting their adaptability to clinical practice. To address these challenges, we propose AdaMHF: Adaptive Multimodal Hierarchical Fusion, a framework designed for efficient, comprehensive, and tailored feature extraction and fusion. AdaMHF is specifically adapted to the uniqueness of medical data, enabling accurate predictions with minimal resource consumption, even under challenging scenarios with missing modalities. Initially, AdaMHF employs an experts expansion and residual structure to activate specialized experts for extracting heterogeneous and sparse features. Extracted tokens undergo refinement via selection and aggregation, reducing the weight of non-dominant features while preserving comprehensive information. Subsequently, the encoded features are hierarchically fused, allowing multi-grained interactions across modalities to be captured. Furthermore, we introduce a survival prediction benchmark designed to resolve scenarios with missing modalities, mirroring real-world clinical conditions. Extensive experiments on TCGA datasets demonstrate that AdaMHF surpasses current state-of-the-art (SOTA) methods, showcasing exceptional performance in both complete and incomplete modality settings.

ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

Yujie Liu,Zonglin Yang,Tong Xie,Jinjie Ni,Ben Gao,Yuqiang Li,Shixiang Tang,Wanli Ouyang,Erik Cambria,Dongzhan Zhou

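Task: Evaluate the ability of large language models (LLMs) to generate high-quality research hypotheses in scientific discovery.

Motivation: Fill the gap left by the lack of a dedicated benchmark and test the potential of LLMs in scientific discovery.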

Details

Method: Develops an automated framework that extracts critical components from papers across 12 disciplines, with expert validation of its accuracy. Result: LLMs perform well on the inspiration retrieval task, suggesting they can surface novel knowledge associations. Conclusion: LLMs can serve as "research hypothesis mines" that support automated scientific discovery. Abstract: Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery: inspiration retrieval, hypothesis composition, and hypothesis ranking. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on papers published in 2024, ensuring minimal overlap with LLM pretraining data. Our evaluation reveals that LLMs perform well in retrieving inspirations, an out-of-distribution task, suggesting their ability to surface novel knowledge associations. This positions LLMs as "research hypothesis mines", capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.

Omni-AD: Learning to Reconstruct Global and Local Features for Multi-class Anomaly Detection

Jiajie Quan,Ao Tong,Yuxuan Cai,Xinwei He,Yulong Wang,Yang Zhou

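Task: Address the tendency of reconstruction-based methods to fall into the "learning shortcut" problem in multi-class unsupervised anomaly detection (MUAD).

Motivation: Existing methods fail to capture normal patterns comprehensively when reconstructing input images, so anomalous samples are reconstructed as well.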

Details

Method: Proposes a two-branch decoder block, the Omni-block, that captures normal patterns comprehensively through global and local feature learning, and builds the Omni-AD framework to reconstruct normal patterns of different granularity progressively. Result: Omni-AD outperforms state-of-the-art methods on public anomaly detection benchmarks. Conclusion: By combining global and local feature learning, Omni-AD mitigates the "learning shortcut" problem and improves multi-class unsupervised anomaly detection. Abstract: In multi-class unsupervised anomaly detection (MUAD), reconstruction-based methods learn to map input images to normal patterns to identify anomalous pixels. However, this strategy easily falls into the well-known "learning shortcut" issue when decoders fail to capture normal patterns and reconstruct both normal and abnormal samples naively. To address that, we propose to learn the input features in global and local manners, forcing the network to memorize the normal patterns more comprehensively. Specifically, we design a two-branch decoder block, named Omni-block. One branch corresponds to global feature learning, where we serialize two self-attention blocks but replace the query and (key, value) with learnable tokens, respectively, thus capturing global features of normal patterns concisely and thoroughly. The local branch comprises depth-separable convolutions, whose locality enables effective and efficient learning of local features for normal patterns. By stacking Omni-blocks, we build a framework, Omni-AD, to learn normal patterns of different granularity and reconstruct them progressively. Comprehensive experiments on public anomaly detection benchmarks show that our method outperforms state-of-the-art approaches in MUAD. Code is available at https://github.com/easyoo/Omni-AD.git.

Cultivating Game Sense for Yourself: Making VLMs Gaming Experts

Wenxuan Lu,Jiangyang He,Zhanqiu Zhang,Yiwen Guo,Tianning Zang

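Task: Develop agents capable of fluid gameplay in first/third-person games without API access.

Motivation: Current approaches that use vision language models (VLMs) as direct controllers are inefficient and cannot handle tasks that require high reactivity or dynamic adaptability.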

Details

Method: Proposes a new design paradigm for gameplay agents in which the VLM acts as a high-level developer that builds specialized task execution modules (such as shooting and combat modules). These modules encapsulate action-feedback logic and are developed by observing task execution and using vision tools and neural network training pipelines. Result: Experiments show that the framework is the first to achieve fluent gameplay across diverse genres (including ACT, FPS, and Flappy Bird), setting a new benchmark for game-playing agents. Conclusion: By elevating the VLM to a high-level developer that builds task-specific modules, the GameSense framework substantially improves the fluency and adaptability of gameplay agents. Abstract: Developing agents capable of fluid gameplay in first/third-person games without API access remains a critical challenge in Artificial General Intelligence (AGI). Recent efforts leverage Vision Language Models (VLMs) as direct controllers, frequently pausing the game to analyze screens and plan action through language reasoning. However, this inefficient paradigm fundamentally restricts agents to basic and non-fluent interactions: relying on isolated VLM reasoning for each action makes it impossible to handle tasks requiring high reactivity (e.g., FPS shooting) or dynamic adaptability (e.g., ACT combat). To handle this, we propose a paradigm shift in gameplay agent design: instead of directly controlling gameplay, VLM develops specialized execution modules tailored for tasks like shooting and combat. These modules handle real-time game interactions, elevating VLM to a high-level developer. Building upon this paradigm, we introduce GameSense, a gameplay agent framework where VLM develops task-specific game sense modules by observing task execution and leveraging vision tools and neural network training pipelines. These modules encapsulate action-feedback logic, ranging from direct action rules to neural network-based decisions. Experiments demonstrate that our framework is the first to achieve fluent gameplay in diverse genres, including ACT, FPS, and Flappy Bird, setting a new benchmark for game-playing agents.

Recurrent Feature Mining and Keypoint Mixup Padding for Category-Agnostic Pose Estimation

Junjie Chen,Weilong Chen,Yifan Zuo,Yuming Fang

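Task: Propose a novel framework that performs category-agnostic pose estimation by recurrently mining fine-grained and structure-aware (FGSA) features.

Motivation: Existing methods extract support features via heatmap pooling and obtain interacted features via cross-attention, but neglect the importance of mining fine-grained and structure-aware features from both support and query images.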

Details

Method: Designs an FGSA mining module based on a deformable attention mechanism, which mines fine-grained features from multi-scale feature maps and mines structure-aware features by offsetting the reference points of keypoints. Result: Substantially outperforms existing methods on the MP-100 dataset (+3.2% PCK@0.05). Conclusion: By recurrently mining FGSA features, the proposed framework substantially improves category-agnostic pose estimation. Abstract: Category-agnostic pose estimation aims to locate keypoints on query images according to a few annotated support images for arbitrary novel classes. Existing methods generally extract support features via heatmap pooling, and obtain interacted features from support and query via cross-attention. Hence, these works neglect to mine fine-grained and structure-aware (FGSA) features from both support and query images, which are crucial for pixel-level keypoint localization. To this end, we propose a novel yet concise framework, which recurrently mines FGSA features from both support and query images. Specifically, we design an FGSA mining module based on a deformable attention mechanism. On the one hand, we mine fine-grained features by applying a deformable attention head over multi-scale feature maps. On the other hand, we mine structure-aware features by offsetting the reference points of keypoints to their linked keypoints. By means of the above module, we recurrently mine FGSA features from support and query images, and thus obtain better support features and query estimations. In addition, we propose to use mixup keypoints to pad various classes to a unified keypoint number, which could provide richer supervision than the zero padding used in existing works. We conduct extensive experiments and in-depth studies on the large-scale MP-100 dataset, and outperform the SOTA method dramatically (+3.2% PCK@0.05). Code is available at https://github.com/chenbys/FMMP.
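The headline number above is reported in PCK@0.05 (Percentage of Correct Keypoints). A minimal sketch of that metric follows, assuming the usual convention that a predicted keypoint counts as correct when it lies within 0.05 times a reference object size of the ground truth. The abstract does not define the reference size; here the caller passes it as `bbox_size`, and the `visible` mask and the toy coordinates are likewise assumptions.

```python
import math


def pck(predicted, ground_truth, bbox_size, threshold=0.05, visible=None):
    """Fraction of visible keypoints predicted within threshold * bbox_size pixels."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if visible is None:
        visible = [True] * len(ground_truth)
    limit = threshold * bbox_size
    hits = total = 0
    for (px, py), (gx, gy), is_visible in zip(predicted, ground_truth, visible):
        if not is_visible:
            continue  # unannotated or padded keypoints do not count either way
        total += 1
        if math.hypot(px - gx, py - gy) <= limit:
            hits += 1
    if total == 0:
        raise ValueError("no visible keypoints to evaluate")
    return hits / total


ground_truth = [(10, 10), (50, 50), (90, 90), (30, 70)]  # toy annotations
predicted = [(11, 10), (50, 58), (90, 90), (30, 70)]     # second point is 8 px off
print(pck(predicted, ground_truth, bbox_size=100))
```

With a 100-pixel reference size the tolerance is 5 pixels, so a threshold this tight rewards the pixel-level localization that the FGSA features are meant to improve.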

R-PRM: Reasoning-Driven Process Reward Modeling

Shuaijie She,Junxiao Liu,Yifeng Liu,Jiajun Chen,Xin Huang,Shujian Huang

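Task: Propose R-PRM, a reasoning-driven process reward modeling method, to improve large language models on step-by-step mathematical reasoning.

Motivation: Existing process reward models (PRMs) output evaluation scores directly, which limits learning efficiency and evaluation accuracy, and annotated data is scarce.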

Details

Method: Uses stronger LLMs to generate seed data, improves performance through preference optimization, and introduces inference-time scaling. Result: On ProcessBench and PRMBench, R-PRM surpasses baselines by 11.9 and 8.5 F1 points, respectively; when used to guide mathematical reasoning, it improves accuracy by over 8.5 points across six datasets. Conclusion: R-PRM offers more comprehensive evaluation and stronger generalization, showing significant potential. Abstract: Large language models (LLMs) inevitably make mistakes when performing step-by-step mathematical reasoning. Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. However, existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy, which is further exacerbated by the scarcity of annotated data. To address these issues, we propose Reasoning-Driven Process Reward Modeling (R-PRM). First, we leverage stronger LLMs to generate seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities and enabling comprehensive step-by-step evaluation. Second, we further enhance performance through preference optimization, without requiring additional annotated data. Third, we introduce inference-time scaling to fully harness the model's reasoning potential. Extensive experiments demonstrate R-PRM's effectiveness: on ProcessBench and PRMBench, it surpasses strong baselines by 11.9 and 8.5 points in F1 scores, respectively. When applied to guide mathematical reasoning, R-PRM achieves consistent accuracy improvements of over 8.5 points across six challenging datasets. Further analysis reveals that R-PRM exhibits more comprehensive evaluation and stronger generalization capabilities, thereby highlighting its significant potential.

ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model

Jinwei Qi,Chaonan Ji,Sheng Xu,Peng Zhang,Bang Zhang,Liefeng Bo

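Task: Propose a novel framework for stylized real-time portrait video generation that supports interactive video chat extending from the head to the upper body.

Motivation: Existing methods focus mainly on real-time generation of head movements and struggle to produce body motion synchronized with head actions; fine-grained control over speaking style and facial expressions also remains challenging.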

Details

Method: Uses a two-stage approach: the first stage generates diverse facial expressions and synchronized head and body movements with efficient hierarchical motion diffusion models; the second stage generates portrait video with upper-body movements by injecting explicit hand control signals. Result: Experiments show that the method generates portrait videos with rich expressiveness and natural upper-body movements and supports real-time interactive video chat. Conclusion: The framework overcomes the limitations of existing methods and enables a more expressive and flexible real-time video chat experience. Abstract: Real-time interactive video-chat portraits have been increasingly recognized as the future trend, particularly due to the remarkable progress made in text and voice chat technologies. However, existing methods primarily focus on real-time generation of head movements, but struggle to produce synchronized body motions that match these head actions. Additionally, achieving fine-grained control over the speaking style and nuances of facial expressions remains a challenge. To address these limitations, we introduce a novel framework for stylized real-time portrait video generation, enabling expressive and flexible video chat that extends from talking head to upper-body interaction. Our approach consists of the following two stages. The first stage involves efficient hierarchical motion diffusion models that take both explicit and implicit motion representations into account based on audio inputs and can generate a diverse range of facial expressions with stylistic control and synchronization between head and body movements. The second stage aims to generate portrait video featuring upper-body movements, including hand gestures. We inject explicit hand control signals into the generator to produce more detailed hand movements, and further perform face refinement to enhance the overall realism and expressiveness of the portrait video. Additionally, our approach supports efficient and continuous generation of upper-body portrait video at a maximum resolution of 512 × 768 and up to 30 fps on a 4090 GPU, supporting interactive video chat in real time. Experimental results demonstrate the capability of our approach to produce portrait videos with rich expressiveness and natural upper-body movements.

Taewon Yun,Jihwan Oh,Hyangsuk Min,Yuho Lee,Jihwan Bang,Jason Cai,Hwanjun Song

Task: Propose ReFeed, a pipeline that enhances multi-dimensional summarization refinement through reflective reasoning on feedback.

Motivation: Address the challenges of multi-dimensional summarization refinement, in particular the trade-offs between dimensions.

Details

Method: Release the SumFeed-CoT dataset for training a lightweight model, and use reflective reasoning to improve the handling of multi-dimensional feedback. Result: Experiments show that reflective reasoning and addressing multiple feedback simultaneously are crucial to performance, and that ReFeed is robust to noisy feedback and feedback order. Conclusion: Effective reasoning rests on creating data with a proper goal and guideline; ReFeed and the dataset will be released. Abstract: Summarization refinement faces challenges when extending to multiple dimensions. In this paper, we introduce ReFeed, a powerful summarization refinement pipeline that enhances multiple dimensions through reflective reasoning on feedback. To achieve this, we release SumFeed-CoT, a large-scale Long-CoT-based dataset optimized for training a lightweight model with reflective reasoning. Our experiments reveal how the number of dimensions, feedback exposure, and reasoning policy influence refinement performance, highlighting that reflective reasoning and simultaneously addressing multiple feedback are crucial to mitigating the trade-off between dimensions. Furthermore, ReFeed is robust to noisy feedback and feedback order. Lastly, our findings emphasize that creating data with a proper goal and guideline constitutes a fundamental pillar of effective reasoning. The dataset and model will be released.

The Devil is in Low-Level Features for Cross-Domain Few-Shot Segmentation

Yuhan Liu,Yixiong Zou,Yuhua Li,Ruixuan Li

Task: Propose a method that addresses the sharp performance decline after an early peak in cross-domain few-shot segmentation (CDFSS).

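Motivation: Segmentation performance on target domains, especially those distant from the source domain, peaks early and then declines quickly; low-level features are sensitive to domain shift, which leads to sharp loss landscapes.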

Details

Method: Propose two plug-and-play modules: one flattens the loss landscape for low-level features during source-domain training through a sharpness-aware minimization method, and the other supplements target-domain information during target-domain testing through low-level-based calibration. Result: Experiments on four target datasets show that the method surpasses the state-of-the-art method by 3.71% and 5.34% average MIoU in 1-shot and 5-shot scenarios, respectively. Conclusion: Analyzing the phenomenon and proposing a targeted method significantly improves CDFSS performance. Abstract: Cross-Domain Few-Shot Segmentation (CDFSS) is proposed to transfer the pixel-level segmentation capabilities learned from large-scale source-domain datasets to downstream target-domain datasets, with only a few annotated images per class. In this paper, we focus on a well-observed but unresolved phenomenon in CDFSS: for target domains, particularly those distant from the source domain, segmentation performance peaks at the very early epochs, and declines sharply as the source-domain training proceeds. We delve into this phenomenon for an interpretation: low-level features are vulnerable to domain shifts, leading to sharper loss landscapes during the source-domain training, which is the devil of CDFSS. Based on this phenomenon and interpretation, we further propose a method that includes two plug-and-play modules: one to flatten the loss landscapes for low-level features during source-domain training as a novel sharpness-aware minimization method, and the other to directly supplement target-domain information to the model during target-domain testing by low-level-based calibration. Extensive experiments on four target datasets validate our rationale and demonstrate that our method surpasses the state-of-the-art method in CDFSS significantly by 3.71% and 5.34% average MIoU in 1-shot and 5-shot scenarios, respectively.
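
The abstract above builds on sharpness-aware minimization. As a rough illustration of the generic idea only (not the paper's variant, which targets low-level features), the following sketch applies one common formulation to a toy quadratic loss: step to a nearby "worst-case" point along the gradient, then update the original weights with the gradient measured there. The loss, learning rate, and radius `rho` are assumptions chosen for the example.

```python
import math

def grad(loss_fn, w, eps=1e-6):
    """Central-difference gradient of loss_fn at w (a list of floats)."""
    g = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + eps] + w[i + 1:]
        down = w[:i] + [w[i] - eps] + w[i + 1:]
        g.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return g

def sam_step(loss_fn, w, lr=0.1, rho=0.05):
    """One generic sharpness-aware update (illustrative, not the paper's method)."""
    g = grad(loss_fn, w)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0  # avoid division by zero
    # Ascent step: move to the perturbed point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: use the gradient at the perturbed point on the original weights.
    g_adv = grad(loss_fn, w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

def toy_loss(w):
    """Hypothetical stand-in for a training loss."""
    return sum(x * x for x in w)

w = [1.0, -2.0]
for _ in range(50):
    w = sam_step(toy_loss, w)
```

Because the update uses the gradient at the perturbed point, it penalizes regions where the loss rises steeply nearby, which is the sense in which the loss landscape is "flattened".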

Fine-Tuning LLMs on Small Medical Datasets: Text Classification and Normalization Effectiveness on Cardiology reports and Discharge records

Noah Losch,Lucas Plagwitz,Antonius Büscher,Julian Varghese

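Task: Study the effectiveness of fine-tuning large language models (LLMs) on small medical datasets for text classification and named entity recognition tasks.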

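Motivation: Explore how fine-tuning small LLMs on limited training data can reach performance comparable to larger models, in order to automate clinical workflows and efficiently extract structured data from unstructured medical text.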

Details

Method: Fine-tune small LLMs locally on limited training data, using a German cardiology report dataset and the i2b2 Smoking Challenge dataset. Result: Experiments show that fine-tuning improves performance on both tasks, with notable gains from as few as 200-300 training examples. Conclusion: The study highlights the potential of task-specific fine-tuning of LLMs for automating clinical workflows and extracting structured data from medical text. Abstract: We investigate the effectiveness of fine-tuning large language models (LLMs) on small medical datasets for text classification and named entity recognition tasks. Using a German cardiology report dataset and the i2b2 Smoking Challenge dataset, we demonstrate that fine-tuning small LLMs locally on limited training data can improve performance, achieving results comparable to larger models. Our experiments show that fine-tuning improves performance on both tasks, with notable gains observed with as few as 200-300 training examples. Overall, the study highlights the potential of task-specific fine-tuning of LLMs for automating clinical workflows and efficiently extracting structured data from unstructured medical text.

Integrating Travel Behavior Forecasting and Generative Modeling for Predicting Future Urban Mobility and Spatial Transformations

Eugene Denteh,Andrews Danyo,Joshua Kofi Asamoah,Blessing Agyei Kyem,Twitchell Addai,Armstrong Aboah

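Task: Integrate a Temporal Fusion Transformer and a Generative Adversarial Network to predict travel patterns and future urban development.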

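Motivation: Traditional transportation planning methods struggle to accurately predict long-term urban growth and transportation demand, which can lead to wasted infrastructure.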

Details

Method: Use a Temporal Fusion Transformer to predict travel behavior and a Generative Adversarial Network to predict future urban satellite imagery. Result: Travel behavior prediction reaches an R-squared of 0.76, and the generated satellite images reach a Structural Similarity Index of 0.81. Conclusion: Data-driven methods significantly improve decision-making and foster sustainable urban development. Abstract: Transportation planning plays a critical role in shaping urban development, economic mobility, and infrastructure sustainability. However, traditional planning methods often struggle to accurately predict long-term urban growth and transportation demands. This may sometimes result in infrastructure demolition to make room for current transportation planning demands. This study integrates a Temporal Fusion Transformer to predict travel patterns from demographic data with a Generative Adversarial Network to predict future urban settings through satellite imagery. The framework achieved an R-squared score of 0.76 in travel behavior prediction and generated high-fidelity satellite images with a Structural Similarity Index of 0.81. The results demonstrate that integrating predictive analytics and spatial visualization can significantly improve the decision-making process, fostering more sustainable and efficient urban development. This research highlights the importance of data-driven methodologies in modern transportation planning and presents a step toward optimizing infrastructure placement, capacity, and long-term viability.
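
For readers unfamiliar with the R-squared score reported above, the sketch below shows the standard coefficient-of-determination formula, 1 - SS_res / SS_tot. The sample values are made up for illustration and are unrelated to the paper's data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical values: a perfect prediction, and one that always predicts the mean.
perfect = r_squared([1, 2, 3], [1, 2, 3])
mean_only = r_squared([1, 2, 3], [2, 2, 2])
```

A score of 1 means the predictions explain all of the variance in the targets, while a model that always predicts the mean scores 0; a reported 0.76 sits between those two reference points.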

From User Preferences to Optimization Constraints Using Large Language Models

Manuela Sanguinetti,Alessandra Perniciano,Luca Zedda,Andrea Loddo,Cecilia Di Ruberto,Maurizio Atzori

Task: Use large language models (LLMs) to translate user preferences into constraints for home energy optimization.

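Motivation: Convert natural-language user requests into formal constraints for smart appliances, in the context of a renewable energy community (REC) and the Italian scenario.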

Details

Method: Evaluate several LLMs available for Italian in zero-shot, one-shot, and few-shot settings, using a dataset of Italian user requests paired with their constraint representations. Result: Establishes a baseline for the task, publicly releases the dataset and code, and summarizes observed best practices and limitations of LLMs in this domain. Conclusion: LLMs show potential for translating user preferences into energy constraints, but further research and improvement are needed. Abstract: This work explores using Large Language Models (LLMs) to translate user preferences into energy optimization constraints for home appliances. We describe a task where natural language user utterances are converted into formal constraints for smart appliances, within the broader context of a renewable energy community (REC) and in the Italian scenario. We evaluate the effectiveness of various LLMs currently available for Italian in translating these preferences, resorting to classical zero-shot, one-shot, and few-shot learning settings, using a pilot dataset of Italian user requests paired with corresponding formal constraint representation. Our contributions include establishing a baseline performance for this task, publicly releasing the dataset and code for further research, and providing insights on observed best practices and limitations of LLMs in this particular domain.

Adversarial Wear and Tear: Exploiting Natural Damage for Generating Physical-World Adversarial Examples

Samra Irshad,Seungkyu Lee,Nassir Navab,Hong Joo Lee,Seong Tae Kim

Task: Propose AdvWT, a new class of physical-world adversarial examples that mimics natural wear and tear to mislead deep neural networks.

Motivation: Existing physical adversarial example methods rely on temporary modifications (such as shadows or stickers) and lack generality.

Details

Method: A two-step approach: (1) model natural wear-and-tear characteristics with a GAN; (2) introduce adversarial perturbations into the latent 'damage style code'. Result: AdvWT effectively misleads DNNs in both digital and physical domains, with a high attack success rate and a more natural appearance. Conclusion: AdvWT not only improves attack effectiveness but, when integrated into training, also enhances model generalization to real-world damaged signs. Abstract: The presence of adversarial examples in the physical world poses significant challenges to the deployment of Deep Neural Networks in safety-critical applications such as autonomous driving. Most existing methods for crafting physical-world adversarial examples are ad-hoc, relying on temporary modifications like shadows, laser beams, or stickers that are tailored to specific scenarios. In this paper, we introduce a new class of physical-world adversarial examples, AdvWT, which draws inspiration from the naturally occurring phenomenon of 'wear and tear', an inherent property of physical objects. Unlike manually crafted perturbations, 'wear and tear' emerges organically over time due to environmental degradation, as seen in the gradual deterioration of outdoor signboards. To achieve this, AdvWT follows a two-step approach. First, a GAN-based, unsupervised image-to-image translation network is employed to model these naturally occurring damages, particularly in the context of outdoor signboards. The translation network encodes the characteristics of damaged signs into a latent 'damage style code'. In the second step, we introduce adversarial perturbations into the style code, strategically optimizing its transformation process. This manipulation subtly alters the damage style representation, guiding the network to generate adversarial images where the appearance of damages remains perceptually realistic, while simultaneously ensuring their effectiveness in misleading neural networks. Through comprehensive experiments on two traffic sign datasets, we show that AdvWT effectively misleads DNNs in both digital and physical domains. AdvWT achieves an effective attack success rate, greater robustness, and a more natural appearance compared to existing physical-world adversarial examples. Additionally, integrating AdvWT into training enhances a model's generalizability to real-world damaged signs.

Retrieving Time-Series Differences Using Natural Language Queries

Kota Dohi,Tomoya Nishida,Harsh Purohit,Takashi Endo,Yohei Kawaguchi

Task: Propose a natural language query-based method for retrieving pairs of time-series data based on the differences specified in the query.

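Motivation: Traditional methods require domain expertise to define search criteria, and existing natural language search methods struggle to handle differences between time-series data.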

Details

Method: Define six key characteristics of time-series differences, construct a corresponding dataset, and develop a contrastive learning-based model that aligns differences between time-series data with query texts. Result: The model achieves an overall mAP of 0.994 in retrieving time-series pairs. Conclusion: The method addresses the limitations of natural language queries in time-series retrieval and significantly improves retrieval performance. Abstract: Effectively searching time-series data is essential for system analysis; however, traditional methods often require domain expertise to define search criteria. Recent advancements have enabled natural language-based search, but these methods struggle to handle differences between time-series data. To address this limitation, we propose a natural language query-based approach for retrieving pairs of time-series data based on differences specified in the query. Specifically, we define six key characteristics of differences, construct a corresponding dataset, and develop a contrastive learning-based model to align differences between time-series data with query texts. Experimental results demonstrate that our model achieves an overall mAP score of 0.994 in retrieving time-series pairs.
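
The mAP score quoted above is a ranking metric. As a minimal sketch of how mean average precision is commonly computed (the paper's exact evaluation protocol is not given here), each query contributes the average of the precision values at the ranks where a relevant item appears. The relevance lists below are invented, and the sketch assumes every relevant item appears somewhere in the ranked list.

```python
def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance is a list of 0/1 flags in rank order."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def mean_average_precision(queries):
    """Mean of per-query AP values."""
    return sum(average_precision(q) for q in queries) / len(queries)

# Hypothetical rankings for two queries.
score = mean_average_precision([[1, 0], [0, 1]])
```

A value close to 1, such as the reported 0.994, means relevant time-series pairs are almost always ranked ahead of irrelevant ones.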

VADMamba: Exploring State Space Models for Fast Video Anomaly Detection

Jiahao Lyu,Minghua Zhao,Jing Hu,Xuewen Huang,Yifei Chen,Shuangli Du

Task: Apply the Mamba model to video anomaly detection (VAD), proposing VADMamba.

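Motivation: Existing VAD methods (CNN-based or Transformer-based) achieve strong detection accuracy but have slow inference; Mamba shows potential in computational efficiency and long-range modeling.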

Details

Method: Based on multi-task learning (frame prediction and optical flow reconstruction), propose the VQ-Mamba Unet (VQ-MaU) framework, which combines a VQ layer with a Mamba-based NVSS block, and improve accuracy through a clip-level fusion strategy. Result: VADMamba is validated on three benchmark datasets, with inference speed superior to previous work. Conclusion: VADMamba provides an efficient, high-performing solution for video anomaly detection. Abstract: Video anomaly detection (VAD) methods are mostly CNN-based or Transformer-based, achieving impressive results, but the focus on detection accuracy often comes at the expense of inference speed. The emergence of state space models in computer vision, exemplified by the Mamba model, demonstrates improved computational efficiency through selective scans and showcases the great potential for long-range modeling. Our study pioneers the application of Mamba to VAD, dubbed VADMamba, which is based on multi-task learning for frame prediction and optical flow reconstruction. Specifically, we propose the VQ-Mamba Unet (VQ-MaU) framework, which incorporates a Vector Quantization (VQ) layer and Mamba-based Non-negative Visual State Space (NVSS) block. Furthermore, two individual VQ-MaU networks separately predict frames and reconstruct corresponding optical flows, further boosting accuracy through a clip-level fusion evaluation strategy. Experimental results validate the efficacy of the proposed VADMamba across three benchmark datasets, demonstrating superior performance in inference speed compared to previous work. Code is available at https://github.com/jLooo/VADMamba.
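
The framework above includes a Vector Quantization layer. The core lookup in a generic VQ layer can be sketched as a nearest-neighbor search over a codebook, as below. This is only the textbook operation under assumed toy values; the codebook size, the training losses, and how VQ-MaU integrates the layer are not described here.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to vector (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sqdist(vector, codebook[i]))
    return idx, codebook[idx]

# Hypothetical 2-D codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
index, entry = quantize([0.9, 1.2], codebook)
```

Replacing each continuous feature with its nearest codebook entry restricts the representation to a finite set of learned prototypes.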

Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models

Haoxiang Sun,Yingqian Min,Zhipeng Chen,Wayne Xin Zhao,Zheng Liu,Zhongyuan Wang,Lei Fang,Ji-Rong Wen

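Task: Introduce and validate OlymMATH, a novel Olympiad-level math benchmark for rigorously testing the complex reasoning capabilities of large language models.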

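Motivation: Existing benchmarks for mathematical reasoning have been saturated by the rapid development of large reasoning models, so more challenging and rigorous evaluation frameworks are urgently needed.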

Details

Method: Design the OlymMATH benchmark of 200 carefully curated problems, organized into two difficulty tiers (AIME-level and harder), spanning four core mathematical fields, with parallel English and Chinese versions and verifiable numerical solutions. Result: Empirical results show that state-of-the-art models, including DeepSeek-R1 and OpenAI's o3-mini, achieve notably limited accuracy on the hard subset. Conclusion: OlymMATH fills a gap in existing mathematical reasoning evaluation, particularly for bilingual assessment, and provides an important tool for testing model capabilities. Abstract: In recent years, the rapid development of large reasoning models has resulted in the saturation of existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark, designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two distinct difficulty tiers: (1) AIME-level problems (easy) that establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard) designed to push the boundaries of current state-of-the-art models. In our benchmark, these problems span four core mathematical fields, each including a verifiable numerical solution to enable objective, rule-based evaluation. Empirical results underscore the significant challenge presented by OlymMATH, with state-of-the-art models including DeepSeek-R1 and OpenAI's o3-mini demonstrating notably limited accuracy on the hard subset. Furthermore, the benchmark facilitates comprehensive bilingual assessment of mathematical reasoning abilities, a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the OlymMATH benchmark at the STILL project: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.
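
The abstract above stresses verifiable numerical solutions that enable rule-based evaluation. One simple way such a rule could look is an exact comparison of parsed numbers, sketched below with Python's `fractions` module. This is a hypothetical checker written for illustration, not the benchmark's actual grading code, and the function names are assumptions.

```python
from fractions import Fraction

def parse_number(text):
    """Parse a decimal, integer, or a/b string into an exact Fraction, else None."""
    cleaned = text.strip().replace(",", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None

def is_correct(prediction, reference):
    """Rule-based check: both strings must parse and be exactly equal."""
    p, r = parse_number(prediction), parse_number(reference)
    return p is not None and r is not None and p == r
```

Exact rational comparison avoids floating-point tolerance questions, so "0.75" and "3/4" count as the same answer while free-form text is rejected.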

Model as a Game: On Numerical and Spatial Consistency for Generative Games

Jingye Chen,Yuzhong Zhao,Yupan Huang,Lei Cui,Li Dong,Tengchao Lv,Qifeng Chen,Furu Wei

Task: Explore how generative models can maintain numerical and spatial consistency in game generation, and propose a new Model as a Game (MaaG) paradigm.

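Motivation: Existing generative models produce high-quality graphics and accept player input but fail to maintain numerical and spatial consistency, which harms the game experience.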

Details

Method: Based on the DiT architecture, design two modules to enhance consistency: a numerical module (LogicNet) and a spatial module (map maintenance). Result: Experiments show that the integrated modules significantly improve performance across three games with minimal inference-time overhead. Conclusion: The proposed MaaG paradigm addresses consistency problems in generative games through specialized modules. Abstract: Recent advances in generative models have significantly impacted game generation. However, despite producing high-quality graphics and adequately receiving player input, existing models often fail to maintain fundamental game properties such as numerical and spatial consistency. Numerical consistency ensures gameplay mechanics correctly reflect score changes and other quantitative elements, while spatial consistency prevents jarring scene transitions, providing seamless player experiences. In this paper, we revisit the paradigm of generative games to explore what truly constitutes a Model as a Game (MaaG) with a well-developed mechanism. We begin with an empirical study on "Traveler", a 2D game created by an LLM featuring minimalist rules yet challenging generative models in maintaining consistency. Based on the DiT architecture, we design two specialized modules: (1) a numerical module that integrates a LogicNet to determine event triggers, with calculations processed externally as conditions for image generation; and (2) a spatial module that maintains a map of explored areas, retrieving location-specific information during generation and linking new observations to ensure continuity. Experiments across three games demonstrate that our integrated modules significantly enhance performance on consistency metrics compared to baselines, while incurring minimal time overhead during inference.

Controlling Large Language Model with Latent Actions

Chengxing Jia,Ziniu Li,Pengyuan Wang,Yi-Chen Li,Zhenyu Hou,Yuxiao Dong,Yang Yu

Task: Learn a compact latent action space to enhance the controllability and exploration of RL for large language models (LLMs).

Motivation: LLMs do not inherently define the action space structure for RL training, which limits controllability and exploration.

Details

Method: Propose the CoLA framework, which integrates a latent action space into pre-trained LLMs, and apply it to Llama-3.1-8B. Result: CoLA yields greater semantic diversity in text generation; scores 42.4 on the math500 benchmark (baseline 38.2) and reaches 68.2 with a Monte Carlo Tree Search variant; improves performance on agent-based tasks without degrading the LLM's capabilities; and halves computation time in tasks involving enhanced thinking prompts. Conclusion: CoLA shows potential for advancing RL-based adaptation of LLMs to downstream tasks. Abstract: Adapting Large Language Models (LLMs) to downstream tasks using Reinforcement Learning (RL) has proven to be an effective approach. However, LLMs do not inherently define the structure of an agent for RL training, particularly in terms of defining the action space. This paper studies learning a compact latent action space to enhance the controllability and exploration of RL for LLMs. We propose Controlling Large Language Models with Latent Actions (CoLA), a framework that integrates a latent action space into pre-trained LLMs. We apply CoLA to the Llama-3.1-8B model. Our experiments demonstrate that, compared to RL with token-level actions, CoLA's latent action enables greater semantic diversity in text generation. For enhancing downstream tasks, we show that CoLA with RL achieves a score of 42.4 on the math500 benchmark, surpassing the baseline score of 38.2, and reaches 68.2 when augmented with a Monte Carlo Tree Search variant. Furthermore, CoLA with RL consistently improves performance on agent-based tasks without degrading the pre-trained LLM's capabilities, unlike the baseline. Finally, CoLA reduces computation time by half in tasks involving enhanced thinking prompts for LLMs by RL. These results highlight CoLA's potential to advance RL-based adaptation of LLMs for downstream applications.

DGSUnet: An Improved Unet Model with DINO-Guided SAM2 for Multi-Scale Feature Collaboration

Yimin Xu

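Task: Propose a multi-scale feature collaboration framework based on DINOv2 and SAM2 to address the limited performance of general image segmentation models in specialized domains.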

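Motivation: General image segmentation models (such as SAM and DINOv2) are limited in specialized domains, mainly because of high training costs and insufficient ability to represent domain-specific characteristics.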

Details

Method: Use a feature collaboration mechanism, lightweight adapter modules, and a U-shaped network structure to inject cross-domain knowledge and adaptively aggregate multi-granularity features. Result: Surpasses existing methods on downstream tasks such as camouflage target detection and salient object detection, without costly training. Conclusion: The framework provides a technical pathway for efficient deployment of visual image segmentation, with significant application value across a wide range of downstream tasks and specialized domains. Abstract: Despite the significant advancements in general image segmentation achieved by large-scale pre-trained foundation models (such as Meta's Segment Anything Model (SAM) series and DINOv2), their performance in specialized fields remains limited by two critical issues: the excessive training costs due to large model parameters, and the insufficient ability to represent specific domain characteristics. This paper proposes a multi-scale feature collaboration framework guided by DINOv2 for SAM2, with core innovations in three aspects: (1) Establishing a feature collaboration mechanism between DINOv2 and SAM2 backbones, where high-dimensional semantic features extracted by the self-supervised model guide multi-scale feature fusion; (2) Designing lightweight adapter modules and cross-modal, cross-layer feature fusion units to inject cross-domain knowledge while freezing the base model parameters; (3) Constructing a U-shaped network structure based on U-net, which utilizes attention mechanisms to achieve adaptive aggregation decoding of multi-granularity features. This framework surpasses existing state-of-the-art methods in downstream tasks such as camouflage target detection and salient object detection, without requiring costly training processes. It provides a technical pathway for efficient deployment of visual image segmentation, demonstrating significant application value in a wide range of downstream tasks and specialized fields within image segmentation. Project page: https://github.com/CheneyXuYiMin/SAM2DINO-Seg

An evaluation of LLMs and Google Translate for translation of selected Indian languages via sentiment and semantic analyses

Rohitash Chandra,Aryan Chaudhary,Yeshwanth Rayavarapu

Task: Evaluate the translation quality of Gemini, GPT, and Google Translate for Indian languages (Sanskrit, Telugu, and Hindi) through semantic and sentiment analyses.

Motivation: Few studies have assessed the quality of translations generated by LLMs, particularly for low-resource languages.

Details

Method: Select texts that have been translated by experts, generate English translations with LLMs, and compare them with the expert translations using semantic and sentiment analyses. Result: LLMs have made significant progress in translation accuracy, but challenges remain in preserving sentiment and semantic integrity, especially in figurative and philosophical contexts; GPT-4o and GPT-3.5 preserve sentiment better than Google Translate. Conclusion: LLMs are better than Google Translate at capturing sentiment, but still need improvement in complex contexts. Abstract: Large language models (LLMs) have been prominent for language translation, including low-resource languages. There has been limited study of the quality of translations generated by LLMs, including Gemini, GPT and Google Translate. In this study, we address this limitation by using semantic and sentiment analysis of selected LLMs for Indian languages, including Sanskrit, Telugu and Hindi. We select prominent texts that have been well translated by experts and use LLMs to generate their translations to English, and then we provide a comparison with selected expert (human) translations. Our findings suggest that while LLMs have made significant progress in translation accuracy, challenges remain in preserving sentiment and semantic integrity, especially in figurative and philosophical contexts. The sentiment analysis revealed that GPT-4o and GPT-3.5 are better at preserving the sentiments for the Bhagavad Gita (Sanskrit-English) translations when compared to Google Translate. We observed a similar trend for the case of Tamas (Hindi-English) and Maha P (Telugu-English) translations. GPT-4o performs similarly to GPT-3.5 in the translation in terms of sentiments for the three languages. We found that LLMs are generally better at translation for capturing sentiments when compared to Google Translate.

Erika Mori,Yue Qiu,Hirokatsu Kataoka,Yoshimitsu Aoki

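Task: Propose the Looped Video Debating (LVD) framework, which combines large language models (LLMs) with visual information to improve the transparency and reliability of question answering on videos of human interaction.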

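Motivation: As robots and AI systems become more prevalent in caregiving, healthcare, and education, demand grows for AI that can interact naturally with humans, yet integrating multiple modalities remains a challenge.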

Details

Method: LVD integrates large language models (LLMs) with visual information (such as facial expressions and body movements) to improve question-answering performance. Result: On the Social-IQ 2.0 benchmark, LVD achieves state-of-the-art performance without fine-tuning, and supplementary human annotations give insight into the model's accuracy. Conclusion: LVD offers a transparent and reliable approach to AI-driven social intelligence and provides guidance for future improvements. Abstract: Social intelligence, the ability to interpret emotions, intentions, and behaviors, is essential for effective communication and adaptive responses. As robots and AI systems become more prevalent in caregiving, healthcare, and education, the demand for AI that can interact naturally with humans grows. However, creating AI that seamlessly integrates multiple modalities, such as vision and speech, remains a challenge. Current video-based methods for social intelligence rely on general video recognition or emotion recognition techniques and often overlook the unique elements inherent in human interactions. To address this, we propose the Looped Video Debating (LVD) framework, which integrates Large Language Models (LLMs) with visual information, such as facial expressions and body movements, to enhance the transparency and reliability of question-answering tasks involving human interaction videos. Our results on the Social-IQ 2.0 benchmark show that LVD achieves state-of-the-art performance without fine-tuning. Furthermore, supplementary human annotations on existing datasets provide insights into the model's accuracy, guiding future improvements in AI-driven social intelligence.

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

Junyu Luo,Weizhi Zhang,Ye Yuan,Yusheng Zhao,Junwei Yang,Yiyang Gu,Bohan Wu,Binqi Chen,Ziyue Qiao,Qingqing Long,Rongcheng Tu,Xiao Luo,Wei Ju,Zhiping Xiao,Yifan Wang,Meng Xiao,Chenwu Liu,Jingyang Yuan,Shichang Zhang,Yiqiao Jin,Fan Zhang,Xian Wu,Hanqing Zhao,Dacheng Tao,Philip S. Yu,Ming Zhang

Task: Systematically deconstruct large language model (LLM) agent systems through a methodology-centered taxonomy.

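Motivation: LLM agents, with goal-driven behaviors and dynamic adaptation capabilities, may be a critical pathway toward artificial general intelligence, but the related research is fragmented and needs a unified perspective.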

Details

Method: Use a taxonomy to link architectural foundations, collaboration mechanisms, and evolutionary pathways, revealing fundamental connections between agent design principles and emergent behaviors in complex environments. Result: Provides a unified architectural perspective covering how agents are constructed, collaborate, and evolve, along with evaluation methodologies, tool applications, practical challenges, and diverse application domains. Conclusion: Offers researchers a structured taxonomy for understanding LLM agents and identifies promising directions for future research. Abstract: The era of intelligent agents is upon us, driven by revolutionary advancements in large language models. Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy, linking architectural foundations, collaboration mechanisms, and evolutionary pathways. We unify fragmented research threads by revealing fundamental connections between agent design principles and their emergent behaviors in complex environments. Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time, while also addressing evaluation methodologies, tool applications, practical challenges, and diverse application domains. By surveying the latest developments in this rapidly evolving field, we offer researchers a structured taxonomy for understanding LLM agents and identify promising directions for future research. The collection is available at https://github.com/luo-junyu/Awesome-Agent-Papers.

An improved EfficientNetV2 for garbage classification

Wenxuan Qiu,Chengxin Xie,Jingui Huang

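Task: Propose an enhanced waste classification framework based on EfficientNetV2 that addresses challenges in data acquisition cost, generalization, and real-time performance.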

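Motivation: Improve classification accuracy and efficiency by addressing insufficient feature extraction and high model complexity in waste classification.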

Details

Method: Combine a CE-Attention module that reduces feature loss, a lightweight multi-scale spatial feature extraction module (SAFM), and data augmentation strategies. Result: Achieves 95.4% classification accuracy on the Huawei Cloud waste classification dataset, 3.2% above the baseline and better than mainstream models. Conclusion: The method balances accuracy and efficiency in waste classification scenarios. Abstract: This paper presents an enhanced waste classification framework based on EfficientNetV2 to address challenges in data acquisition cost, generalization, and real-time performance. We propose a Channel-Efficient Attention (CE-Attention) module that mitigates feature loss during global pooling without introducing dimensional scaling, effectively enhancing critical feature extraction. Additionally, a lightweight multi-scale spatial feature extraction module (SAFM) is developed by integrating depthwise separable convolutions, significantly reducing model complexity. Comprehensive data augmentation strategies are further employed to improve generalization. Experiments on the Huawei Cloud waste classification dataset demonstrate that our method achieves a classification accuracy of 95.4%, surpassing the baseline by 3.2% and outperforming mainstream models. The results validate the effectiveness of our approach in balancing accuracy and efficiency for practical waste classification scenarios.
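
The abstract above attributes the reduced model complexity to depthwise separable convolutions. The arithmetic behind that claim can be sketched as a parameter count, ignoring biases: a standard k×k convolution needs k·k·C_in·C_out weights, while the depthwise-plus-pointwise factorization needs k·k·C_in + C_in·C_out. The layer sizes below are assumptions chosen for illustration, not values from the paper.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out

def depthwise_separable_params(kernel, c_in, c_out):
    """Depthwise (one kernel x kernel filter per input channel) plus 1x1 pointwise."""
    return kernel * kernel * c_in + c_in * c_out

# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
full = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
```

For this example layer the separable version uses roughly 12% of the weights of the standard convolution, which is the kind of saving a lightweight module relies on.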

Harnessing Chain-of-Thought Metadata for Task Routing and Adversarial Prompt Detection

Ryan Marinelli,Josef Pichlmeier,Tamas Bisztray

Task: Propose a metric called Number of Thoughts (NofT) to assess task difficulty and support large language models (LLMs) in production settings.

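Motivation: The NofT metric makes it possible to identify prompt difficulty more effectively, improve prompt routing, and detect adversarial prompt attacks.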

Details

Method: Set thresholds on the number of thoughts to distinguish prompt difficulty, and test on the MathInstruct dataset with quantized, distilled versions of Deepseek. Result: Achieves a 2% decrease in latency and 95% accuracy in adversarial prompt detection. Conclusion: NofT effectively supports prompt routing and adversarial attack detection for LLMs and has practical value. Abstract: In this work, we propose a metric called Number of Thoughts (NofT) to determine the difficulty of tasks pre-prompting and support Large Language Models (LLMs) in production contexts. By setting thresholds based on the number of thoughts, this metric can discern the difficulty of prompts and support more effective prompt routing. A 2% decrease in latency is achieved when routing prompts from the MathInstruct dataset through quantized, distilled versions of Deepseek with 1.7 billion, 7 billion, and 14 billion parameters. Moreover, this metric can be used to detect adversarial prompts used in prompt injection attacks with high efficacy. The Number of Thoughts can inform a classifier that achieves 95% accuracy in adversarial prompt detection. Our experiments and datasets used are available on our GitHub page: https://github.com/rymarinelli/Number_Of_Thoughts/tree/main.
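
Threshold-based routing of the kind described above can be sketched in a few lines. The threshold values and tier labels here are hypothetical placeholders, and the sketch does not show how the number of thoughts is estimated before prompting, which the abstract does not detail.

```python
def route(num_thoughts, thresholds=(3, 8), tiers=("small", "medium", "large")):
    """Pick a model tier from an estimated number of thoughts.

    thresholds and tiers are illustrative assumptions: prompts at or below
    thresholds[i] go to tiers[i]; anything above the last threshold goes to
    the final tier.
    """
    for limit, tier in zip(thresholds, tiers):
        if num_thoughts <= limit:
            return tier
    return tiers[-1]

easy, moderate, hard = route(2), route(5), route(20)
```

Sending easy prompts to a smaller model is where a latency reduction would come from, since only the harder prompts pay for the larger model.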

FakeReasoning: Towards Generalizable Forgery Detection and Reasoning

Yueying Gao,Dongliang Chang,Bingyao Yu,Haotian Qin,Lei Chen,Kongming Liang,Zhanyu Ma

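Task: Develop an interpretable method for detecting AI-generated images, using vision-language models (VLMs) to perform the Forgery Detection and Reasoning task (FDR-Task).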

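Motivation: Address the poor generalization caused by the large domain gap among generative models, and the unsuitability of traditional saliency-based methods for AI-generated image detection.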

Details

Method: Propose the FakeReasoning framework, comprising Forgery-Aligned Contrastive Learning and a Classification Probability Mapper, trained with the Multi-Modal Forgery Reasoning dataset (MMFR-Dataset). Result: FakeReasoning performs well across multiple generative models and outperforms existing methods on both detection and reasoning tasks. Conclusion: Through structured reasoning and cross-modal learning, FakeReasoning achieves effective and interpretable detection of AI-generated images. Abstract: Accurate and interpretable detection of AI-generated images is essential for mitigating risks associated with AI misuse. However, the substantial domain gap among generative models makes it challenging to develop a generalizable forgery detection model. Moreover, since every pixel in an AI-generated image is synthesized, traditional saliency-based forgery explanation methods are not well suited for this task. To address these challenges, we propose modeling AI-generated image detection and explanation as a Forgery Detection and Reasoning task (FDR-Task), leveraging vision-language models (VLMs) to provide accurate detection through structured and reliable reasoning over forgery attributes. To facilitate this task, we introduce the Multi-Modal Forgery Reasoning dataset (MMFR-Dataset), a large-scale dataset containing 100K images across 10 generative models, with 10 types of forgery reasoning annotations, enabling comprehensive evaluation of FDR-Task. Additionally, we propose FakeReasoning, a forgery detection and reasoning framework with two key components. First, Forgery-Aligned Contrastive Learning enhances VLMs' understanding of forgery-related semantics through both cross-modal and intra-modal contrastive learning between images and forgery attribute reasoning. Second, a Classification Probability Mapper bridges the optimization gap between forgery detection and language modeling by mapping the output logits of VLMs to calibrated binary classification probabilities. Experiments across multiple generative models demonstrate that FakeReasoning not only achieves robust generalization but also outperforms state-of-the-art methods on both detection and reasoning tasks.

OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs

John Murzaku,Owen Rambow

Task: Systematically evaluate four omni-modal large language models (omni-LLMs) on the zero-shot emotion recognition task.

Motivation: The use of omni-LLMs for multimodal cognitive state tasks, particularly those involving speech, is understudied.

Details

Method: Propose acoustic prompting, an audio-specific prompting strategy comprising acoustic feature analysis, conversation context analysis, and step-by-step reasoning, and compare it with minimal prompting and full chain-of-thought prompting. Result: Zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models. Conclusion: Using context helps performance, especially on IEMOCAP; an error analysis of the generated acoustic reasoning outputs is also provided. Abstract: The use of omni-LLMs (large language models that accept any modality as input), particularly for multimodal cognitive state tasks involving speech, is understudied. We present OmniVox, the first systematic evaluation of four omni-LLMs on the zero-shot emotion recognition task. We evaluate on two widely used multimodal emotion benchmarks: IEMOCAP and MELD, and find zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models. Alongside our audio-only evaluation, we also evaluate omni-LLMs on text only and text and audio. We present acoustic prompting, an audio-specific prompting strategy for omni-LLMs which focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning. We compare our acoustic prompting to minimal prompting and full chain-of-thought prompting techniques. We perform a context window analysis on IEMOCAP and MELD, and find that using context helps, especially on IEMOCAP. We conclude with an error analysis on the generated acoustic reasoning outputs from the omni-LLMs.

VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation

Alan Dao,Norapat Buppodom

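Task: Propose a new approach that uses a vision-language model (VLM) to extract high-level semantic information (such as object identity, color, and location) from voxel data.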

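Motivation: Understanding 3D environments is vital for intelligent systems, but high-level semantic information is difficult to extract directly from voxel grids.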

Details

Method: Systematically slices the voxel space along a primary axis (e.g., the Z-axis) and feeds the 2D slices into a pre-trained 2D vision-language model (VLM), using its image encoder and language component to associate spatial patterns with semantic concepts. Result: The approach targets efficient 3D semantic understanding directly from voxel representations. Conclusion: The slice-based strategy aims to leverage pre-trained 2D VLMs as an efficient route to 3D semantic understanding. Abstract: Comprehending 3D environments is vital for intelligent systems in domains like robotics and autonomous navigation. Voxel grids offer a structured representation of 3D space, but extracting high-level semantic meaning remains challenging. This paper proposes a novel approach utilizing a Vision-Language Model (VLM) to extract "voxel semantics" (object identity, color, and location) from voxel data. Critically, instead of employing complex 3D networks, our method processes the voxel space by systematically slicing it along a primary axis (e.g., the Z-axis, analogous to CT scan slices). These 2D slices are then formatted and sequentially fed into the image encoder of a standard VLM. The model learns to aggregate information across slices and correlate spatial patterns with semantic concepts provided by the language component. This slice-based strategy aims to leverage the power of pre-trained 2D VLMs for efficient 3D semantic understanding directly from voxel representations.
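The slicing step described above can be sketched in a few lines. The example below is a minimal illustration, not the authors' code: it assumes the voxel grid is stored as nested lists indexed `voxels[x][y][z]`, and the function name `slice_along_z` is hypothetical. How the slices are then formatted for the VLM's image encoder is not specified in the abstract and is left out.

```python
def slice_along_z(voxels):
    """Cut a voxel grid indexed as voxels[x][y][z] into 2D slices.

    Returns a list with one image per z value; each image is indexed
    as image[y][x], in the way a CT scan is viewed slice by slice.
    """
    size_x = len(voxels)
    size_y = len(voxels[0])
    size_z = len(voxels[0][0])
    return [
        [[voxels[x][y][z] for x in range(size_x)] for y in range(size_y)]
        for z in range(size_z)
    ]


# Tiny 2 x 3 x 4 grid whose values encode their own coordinates.
grid = [[[100 * x + 10 * y + z for z in range(4)]
         for y in range(3)]
        for x in range(2)]
slices = slice_along_z(grid)
```

Each element of `slices` is an ordinary 2D array, so it can be rendered as an image and passed to a 2D model in sequence.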

OpenHuEval: Evaluating Large Language Model on Hungarian Specifics

Haote Yang,Xingjian Wei,Jiang Wu,Noémi Ligeti-Nagy,Jiaxing Sun,Yinfan Wang,Zijian Győző Yang,Junyuan Gao,Jingchao Wang,Bowen Jiang,Shasha Wang,Nanjun Yu,Zihao Zhang,Shixin Hong,Hongwei Liu,Wei Li,Songyang Zhang,Dahua Lin,Lijun Wu,Gábor Prószéky,Conghui He

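Task: Build OpenHuEval, the first LLM benchmark focused on the Hungarian language and its specifics.

Motivation: Provide a comprehensive, in-depth, and scientifically accurate assessment of LLM performance on the Hungarian language and its specifics.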

Details

Method: Builds the benchmark from Hungarian-specific materials drawn from multiple sources, incorporating recent evaluation design principles such as using real user queries, assessing generative capabilities, and employing LLM-as-judge. Result: OpenHuEval covers eight dimensions, five tasks, and 3953 questions; the evaluation shows that mainstream LLMs need optimization tailored to Hungarian. Conclusion: OpenHuEval provides an important tool for evaluating and optimizing LLMs on Hungarian, and reveals intrinsic mechanisms of Large Reasoning Models (LRMs) in non-English languages. Abstract: We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs' generative capabilities, and employing LLM-as-judge to enhance the multidimensionality and accuracy of evaluations. Ultimately, OpenHuEval encompasses eight Hungarian-specific dimensions, featuring five tasks and 3953 questions. Consequently, OpenHuEval provides a comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the context of the Hungarian language and its specifics. We evaluated current mainstream LLMs, including both traditional LLMs and recently developed Large Reasoning Models (LRMs). The results demonstrate the significant necessity for evaluation and model optimization tailored to the Hungarian language and specifics. We also established a framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian serving as a representative example. We will release OpenHuEval at https://github.com/opendatalab/OpenHuEval .

GenFusion: Closing the Loop between Reconstruction and Generation via Videos

Sibo Wu,Congrong Xu,Binbin Huang,Andreas Geiger,Anpei Chen

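Task: Propose a reconstruction-driven video diffusion model to address the conditioning gap between 3D reconstruction and generation.

Motivation: A notable conditioning gap exists between 3D reconstruction and generation: reconstruction requires densely captured views, whereas generation typically relies on a single or no input view, which limits applications. The authors trace this gap to a misalignment between 3D constraints and generative priors.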

Details

Method: Proposes a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings, together with a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set. Result: View synthesis evaluations from sparse-view and masked inputs validate the approach. Conclusion: Through reconstruction-driven generation and cyclical fusion, the method addresses the conditioning gap between 3D reconstruction and generation and improves view synthesis. Abstract: Recently, 3D reconstruction and generation have demonstrated impressive novel view synthesis results, achieving high fidelity and efficiency. However, a notable conditioning gap can be observed between these two fields, e.g., scalable 3D scene reconstruction often requires densely captured views, whereas 3D generation typically relies on a single or no input view, which significantly limits their applications. We found that the source of this phenomenon lies in the misalignment between 3D constraints and generative priors. To address this problem, we propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. Moreover, we propose a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set, enabling progressive expansion and addressing the viewpoint saturation limitations seen in previous reconstruction and generation pipelines. Our evaluation, including view synthesis from sparse-view and masked inputs, validates the effectiveness of our approach.

Keyword-Oriented Multimodal Modeling for Euphemism Identification

Yuxue Hu,Junsong Li,Meixuan Chen,Dongyu Su,Tongguan Wang,Ying Sha

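Task: Identify the true meaning of euphemisms through multimodal analysis that combines text, image, and audio data.

Motivation: Existing methods are primarily text-based, while the rise of social media calls for multimodal analysis; the lack of suitable datasets has limited progress.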

Details

Method: Introduces a keyword-oriented multimodal corpus of euphemisms (KOM-Euph) and an identification method (KOM-EI) based on cross-modal feature alignment and dynamic fusion. Result: KOM-EI outperforms state-of-the-art models and large language models in experiments, and the results show the importance of the multimodal datasets. Conclusion: Multimodal methods offer clear advantages for euphemism identification and provide a new tool for content moderation and combating underground markets. Abstract: Euphemism identification deciphers the true meaning of euphemisms, such as linking "weed" (euphemism) to "marijuana" (target keyword) in illicit texts, aiding content moderation and combating underground markets. While existing methods are primarily text-based, the rise of social media highlights the need for multimodal analysis, incorporating text, images, and audio. However, the lack of multimodal datasets for euphemisms limits further research. To address this, we regard euphemisms and their corresponding target keywords as keywords and first introduce a keyword-oriented multimodal corpus of euphemisms (KOM-Euph), involving three datasets (Drug, Weapon, and Sexuality), including text, images, and speech. We further propose a keyword-oriented multimodal euphemism identification method (KOM-EI), which uses cross-modal feature alignment and dynamic fusion modules to explicitly utilize the visual and audio features of the keywords for efficient euphemism identification. Extensive experiments demonstrate that KOM-EI outperforms state-of-the-art models and large language models, and show the importance of our multimodal datasets.

Frequency-Aware Gaussian Splatting Decomposition

Yishai Lavi,Leo Segre,Shai Avidan

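Task: Propose a frequency-decomposed 3D Gaussian Splatting framework that separates low-frequency structures from fine details.

Motivation: 3D Gaussian Splatting (3D-GS) lacks frequency interpretability, making it hard to separate low-frequency structures from fine details.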

Details

Method: Groups 3D Gaussians according to the subbands of the Laplacian pyramids of the input images, with dedicated regularization and a progressive training scheme. Result: Achieves frequency separation, supporting advanced 3D editing, dynamic level-of-detail control, and interactive rendering. Conclusion: The method provides improved control and flexibility for scene editing and interactive rendering. Abstract: 3D Gaussian Splatting (3D-GS) has revolutionized novel view synthesis with its efficient, explicit representation. However, it lacks frequency interpretability, making it difficult to separate low-frequency structures from fine details. We introduce a frequency-decomposed 3D-GS framework that groups 3D Gaussians that correspond to subbands in the Laplacian Pyramids of the input images. Our approach enforces coherence within each subband (i.e., group of 3D Gaussians) through dedicated regularization, ensuring well-separated frequency components. We extend color values to both positive and negative ranges, allowing higher-frequency layers to add or subtract residual details. To stabilize optimization, we employ a progressive training scheme that refines details in a coarse-to-fine manner. Beyond interpretability, this frequency-aware design unlocks a range of practical benefits. Explicit frequency separation enables advanced 3D editing and stylization, allowing precise manipulation of specific frequency bands. It also supports dynamic level-of-detail control for progressive rendering, streaming, foveated rendering and fast geometry interaction. Through extensive experiments, we demonstrate that our method provides improved control and flexibility for emerging applications in scene editing and interactive rendering. Our code will be made publicly available.
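For readers unfamiliar with Laplacian pyramids, the sketch below decomposes a small grayscale image into residual subbands plus a coarse base and then reconstructs it. It is a simplified illustration under stated assumptions: it works on nested lists with even dimensions, uses a 2x2 box average in place of the Gaussian blur a standard pyramid would use, and all function names are hypothetical. It shows only the image-side decomposition, not how the paper assigns 3D Gaussians to subbands.

```python
def downsample(img):
    """Halve resolution by averaging each 2x2 block (even sizes assumed)."""
    height, width = len(img), len(img[0])
    return [
        [(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
         for x in range(0, width, 2)]
        for y in range(0, height, 2)
    ]


def upsample(img):
    """Double resolution by repeating every pixel in a 2x2 block."""
    out = []
    for row in img:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def laplacian_pyramid(img, levels):
    """Return [residual_0, ..., residual_{levels-1}, coarse_base], finest first."""
    bands = []
    current = img
    for _ in range(levels):
        low = downsample(current)
        up = upsample(low)
        # The residual holds the detail lost by downsampling; it can be negative.
        bands.append([[c - u for c, u in zip(c_row, u_row)]
                      for c_row, u_row in zip(current, up)])
        current = low
    bands.append(current)
    return bands


def reconstruct(bands):
    """Invert laplacian_pyramid by adding residuals back, coarse to fine."""
    current = bands[-1]
    for residual in reversed(bands[:-1]):
        up = upsample(current)
        current = [[u + r for u, r in zip(u_row, r_row)]
                   for u_row, r_row in zip(up, residual)]
    return current
```

Because each residual is a difference between two images, its values are signed, which is consistent with the abstract's remark that higher-frequency layers need to both add and subtract detail.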

Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving

Yue Li,Meng Tian,Zhenyu Lin,Jiangtong Zhu,Dechang Zhu,Haiqiang Liu,Zining Wang,Yueyi Zhang,Zhiwei Xiong,Xinhai Zhao

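Task: Introduce VLADBench, a fine-grained dataset for evaluating vision-language models (VLMs) in complex autonomous driving (AD) scenarios.

Motivation: Existing benchmarks mainly assess VLM interpretability through open-form visual question answering (QA) within coarse-grained tasks, which is insufficient for assessing capabilities in complex driving scenarios.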

Details

Method: Builds VLADBench with closed-form QA tasks spanning 5 key domains (e.g., Traffic Knowledge Understanding and Target Attribute Comprehension), further divided into 11 secondary aspects and 29 tertiary tasks; additionally trains domain-specific models starting from a small-scale VLM. Result: Experiments indicate that VLADBench enables a more comprehensive assessment of VLMs in AD and reveals both their strengths and critical limitations. Conclusion: VLADBench provides a foundation for developing AD systems with stronger cognitive and reasoning capabilities. Abstract: Existing benchmarks for Vision-Language Models (VLMs) on autonomous driving (AD) primarily assess interpretability through open-form visual question answering (QA) within coarse-grained tasks, which remain insufficient to assess capabilities in complex driving scenarios. To this end, we introduce VLADBench, a challenging and fine-grained dataset featuring closed-form QAs that progress from static foundational knowledge and elements to advanced reasoning for dynamic on-road situations. The elaborate VLADBench spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning. These domains are further broken down into 11 secondary aspects and 29 tertiary tasks for a granular evaluation. A thorough assessment of general and domain-specific (DS) VLMs on this benchmark reveals both their strengths and critical limitations in AD contexts. To further exploit the cognitive and reasoning interactions among the 5 domains for AD understanding, we start from a small-scale VLM and train the DS models on individual domain datasets (collected from 1.4M DS QAs across public sources). The experimental results demonstrate that the proposed benchmark provides a crucial step toward a more comprehensive assessment of VLMs in AD, paving the way for the development of more cognitively sophisticated and reasoning-capable AD systems.

Clean Image May be Dangerous: Data Poisoning Attacks Against Deep Hashing

Shuai Li,Jie Zhang,Yuang Qi,Kejiang Chen,Tianwei Zhang,Weiming Zhang,Nenghai Yu

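Task: Study data poisoning attacks against deep hashing (PADHASH).

Motivation: Although deep hashing performs well in large-scale image retrieval, it is vulnerable to malicious attacks; in particular, the possibility that clean query images induce malicious retrieval results has not been adequately studied.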

Details

Method: First trains a surrogate model to simulate the behavior of the target deep hashing model, then proposes a strict gradient matching strategy to generate poisoned images. Result: Experiments across different models, datasets, hash methods, and hash code lengths demonstrate the effectiveness and generality of the attack. Conclusion: The paper claims to be the first to expose the security risk posed by clean query images and proposes an effective attack, offering a new perspective on the security of deep hashing. Abstract: Large-scale image retrieval using deep hashing has become increasingly popular due to the exponential growth of image data and the remarkable feature extraction capabilities of deep neural networks (DNNs). However, deep hashing methods are vulnerable to malicious attacks, including adversarial and backdoor attacks. It is worth noting that these attacks typically involve altering the query images, which is not a practical concern in real-world scenarios. In this paper, we point out that even clean query images can be dangerous, inducing malicious target retrieval results, like undesired or illegal images. To the best of our knowledge, we are the first to study data poisoning attacks against deep hashing (PADHASH). Specifically, we first train a surrogate model to simulate the behavior of the target deep hashing model. Then, a strict gradient matching strategy is proposed to generate the poisoned images. Extensive experiments on different models, datasets, hash methods, and hash code lengths demonstrate the effectiveness and generality of our attack method.

Ana-Maria Bucur,Andreea-Codrina Moldovan,Krutika Parvatikar,Marcos Zampieri,Ashiqur R. KhudaBukhsh,Liviu P. Dinu

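Task: Provide a list of social media datasets published between 2019 and 2024 for analyzing and predicting depression.

Motivation: Support early-career researchers and foster interdisciplinary research that uses social media data to enhance traditional depression screening methods.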

Details

Method: Surveys social media datasets published between 2019 and 2024 and provides the list online as an updatable resource. Result: A comprehensive dataset list, offered as a continuously updated resource. Conclusion: The resource is expected to facilitate further interdisciplinary research into the linguistic expressions of depression on social media. Abstract: Depression is the most common mental health disorder, and its prevalence increased during the COVID-19 pandemic. Because depression is one of the most extensively researched psychological conditions, recent research has increasingly focused on leveraging social media data to enhance traditional methods of depression screening. This paper addresses the growing interest in interdisciplinary research on depression, and aims to support early-career researchers by providing a comprehensive and up-to-date list of datasets for analyzing and predicting depression through social media data. We present an overview of datasets published between 2019 and 2024. We also make the comprehensive list of datasets available online as a continuously updated resource, with the hope that it will facilitate further interdisciplinary research into the linguistic expressions of depression on social media.

DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation

Haoyu Zhao,Zhongang Qi,Cong Wang,Qingping Zheng,Guansong Lu,Fei Chen,Hang Xu,Zuxuan Wu

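Task: Propose DynamiCtrl, a new framework that addresses architectural limitations and the underuse of textual information in human image animation.

Motivation: Existing methods suffer from architectural limitations (e.g., U-Net underperforms) and neglect textual information, which hurts the controllability and quality of the animation.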

Details

Method: Adopts the MM-DiT architecture, introduces a Shared VAE encoder and Pose-adaptive Layer Norm (PadaLN) to integrate pose features, and aligns textual and visual features to enhance controllability. Result: Experiments show that DynamiCtrl excels at identity preservation, heterogeneous character driving, background controllability, and high-quality synthesis. Conclusion: By combining pose and text control, DynamiCtrl significantly improves the performance and controllability of human image animation. Abstract: Human image animation has recently gained significant attention due to advancements in generative models. However, existing methods still face two major challenges: (1) architectural limitations: most models rely on U-Net, which underperforms compared to MM-DiT; and (2) the neglect of textual information, which can enhance controllability. In this work, we introduce DynamiCtrl, a novel framework that not only explores different pose-guided control structures in MM-DiT, but also reemphasizes the crucial role of text in this task. Specifically, we employ a Shared VAE encoder for both reference images and driving pose videos, eliminating the need for an additional pose encoder and simplifying the overall framework. To incorporate pose features into the full attention blocks, we propose Pose-adaptive Layer Norm (PadaLN), which utilizes adaptive layer normalization to encode sparse pose features. The encoded features are directly added to the visual input, preserving the spatiotemporal consistency of the backbone while effectively introducing pose control into MM-DiT. Furthermore, within the full attention mechanism, we align textual and visual features to enhance controllability. By leveraging text, we not only enable fine-grained control over the generated content, but also, for the first time, achieve simultaneous control over both background and motion. Experimental results verify the superiority of DynamiCtrl on benchmark datasets, demonstrating its strong identity preservation, heterogeneous character driving, background controllability, and high-quality synthesis. The project page is available at https://gulucaptain.github.io/DynamiCtrl/.

Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models

Umer Butt,Stalin Veranasi,Günter Neumann

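Task: Propose a transformer-based approach for transliteration between Urdu and Roman Urdu.

Motivation: Address poor domain adaptability and limited evaluation in low-resource transliteration.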

Details

Method: Uses the m2m100 multilingual translation model with masked language modeling (MLM) pretraining and fine-tuning, with experiments on the Roman-Urdu-Parl and Dakshina datasets. Result: The model achieves strong Char-BLEU scores (96.37 for Urdu->Roman-Urdu and 97.44 for Roman-Urdu->Urdu), outperforming RNN baselines and GPT-4o Mini. Conclusion: Multilingual transfer learning is effective for low-resource transliteration. Abstract: As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity, addressing the needs of low-resource languages remains a significant challenge. Transliteration between Urdu and its Romanized form, Roman Urdu, remains underexplored despite the widespread use of both scripts in South Asia. Prior work using RNNs on the Roman-Urdu-Parl dataset showed promising results but suffered from poor domain adaptability and limited evaluation. We propose a transformer-based approach using the m2m100 multilingual translation model, enhanced with masked language modeling (MLM) pretraining and fine-tuning on both Roman-Urdu-Parl and the domain-diverse Dakshina dataset. To address previous evaluation flaws, we introduce rigorous dataset splits and assess performance using BLEU, character-level BLEU, and CHRF. Our model achieves strong transliteration performance, with Char-BLEU scores of 96.37 for Urdu->Roman-Urdu and 97.44 for Roman-Urdu->Urdu. These results outperform both RNN baselines and GPT-4o Mini and demonstrate the effectiveness of multilingual transfer learning for low-resource transliteration tasks.
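Character-level metrics such as CHRF score a transliteration by comparing character n-grams rather than whole words, which suits a task where a single wrong letter should cost little. The sketch below is a simplified character n-gram F-score written for illustration; the function names are hypothetical, and reference implementations differ in details such as whitespace handling and how n-gram orders are averaged, so it will not reproduce the scores reported in the paper.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count all character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-score in the range [0, 1].

    Precision and recall are averaged over the n-gram orders for which
    both strings are long enough, then combined into an F-beta score
    (beta = 2 weights recall more heavily than precision).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
```

An exact match scores 1.0 and strings with no characters in common score 0.0; a near miss such as one substituted letter lands in between.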

Orange Quality Grading with Deep Learning

Mohamed Lamine Mekhalfi,Paul Chippendale,Francisco Fraile,Marcos Rico

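Task: Implement a deep learning-based multi-view method for orange grading.

Motivation: Orange grading is a crucial step in the fruit industry; automated grading improves speed and precision and reduces human labor.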

Details

Method: Captures multi-view images of each orange, composes them into a single collage, and grades the orange with a convolutional neural network (CNN). Result: Experiments show that multi-view grading outperforms single-view grading. Conclusion: Multi-view deep learning is an effective solution for orange grading. Abstract: Orange grading is a crucial step in the fruit industry, as it helps to sort oranges according to different criteria such as size, quality, ripeness, and health condition, ensuring safety for human consumption and better price allocation and client satisfaction. Automated grading enables faster processing, precision, and reduced human labor. In this paper, we implement a deep learning-based solution for orange grading via machine vision. Unlike typical grading systems that analyze fruits from a single view, we capture multi-view images of each orange in order to enable a richer representation. Afterwards, we compose the acquired images into one collage. This enables the analysis of the whole orange skin. We train a convolutional neural network (CNN) on the composed images to grade the oranges into three classes, namely good, bad, and undefined. We also evaluate the performance with two different CNNs (ResNet-18 and SqueezeNet). We show experimentally that multi-view grading is superior to single-view grading.
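The collage step can be pictured as tiling the per-view images into one grid before classification. The sketch below is a minimal illustration that assumes equally sized grayscale views stored as nested lists; the function name `make_collage`, the column count, and the zero padding for empty cells are all hypothetical choices, since the abstract does not state the layout the authors use.

```python
def make_collage(views, cols):
    """Tile equally sized 2D images into a grid with `cols` columns.

    Each view is a list of rows. Cells left over in the last grid row
    are filled with zeros (black).
    """
    if not views:
        raise ValueError("at least one view is required")
    if cols < 1:
        raise ValueError("cols must be at least 1")
    height, width = len(views[0]), len(views[0][0])
    grid_rows = -(-len(views) // cols)  # ceiling division
    blank = [[0] * width for _ in range(height)]
    padded = list(views) + [blank] * (grid_rows * cols - len(views))

    collage = []
    for r in range(grid_rows):
        tiles = padded[r * cols:(r + 1) * cols]
        for y in range(height):
            line = []
            for tile in tiles:
                line.extend(tile[y])  # place tiles side by side
            collage.append(line)
    return collage


# Three 2x2 views, filled with 1, 2, and 3 respectively.
views = [[[i] * 2 for _ in range(2)] for i in range(1, 4)]
collage = make_collage(views, cols=2)
```

The resulting single image lets one network see the whole fruit surface at once instead of scoring each view separately.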

SWI: Speaking with Intent in Large Language Models

Yuwei Yin,EunJeong Hwang,Giuseppe Carenini

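Task: Introduce and validate Speaking with Intent (SWI) in large language models (LLMs) to improve their reasoning ability and generation quality.

Motivation: By emulating deliberate and purposeful human thought, SWI aims to give LLMs explicit intent and high-level planning to guide their analysis and communication.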

Details

Method: Runs experiments on mathematical reasoning, question answering, and text summarization, comparing SWI with a Baseline (generation without explicit intent) and with other prompting methods such as Chain-of-Thought and Plan-and-Solve. Result: SWI outperforms the Baseline on mathematical reasoning, question answering, and text summarization, and remains competitive with the strong method ARR; SWI-generated summaries are more accurate, concise, and factually correct. Conclusion: SWI offers a new avenue for enhancing LLM reasoning, and the intent it generates is coherent, effective, and interpretable. Abstract: Intent, typically clearly formulated and planned, functions as a cognitive framework for reasoning and problem-solving. This paper introduces the concept of Speaking with Intent (SWI) in large language models (LLMs), where the explicitly generated intent encapsulates the model's underlying intention and provides high-level planning to guide subsequent analysis and communication. By emulating deliberate and purposeful thoughts in the human mind, SWI is hypothesized to enhance the reasoning capabilities and generation quality of LLMs. Extensive experiments on mathematical reasoning benchmarks consistently demonstrate the superiority of Speaking with Intent over Baseline (i.e., generation without explicit intent). Moreover, SWI outperforms answer-trigger prompting methods Chain-of-Thought and Plan-and-Solve and maintains competitive performance with the strong method ARR (Analyzing, Retrieving, and Reasoning). Additionally, the effectiveness and generalizability of SWI are solidified on reasoning-intensive question answering (QA) and text summarization benchmarks, where SWI brings consistent improvement to the Baseline generation. In text summarization, SWI-generated summaries exhibit greater accuracy, conciseness, and factual correctness, with fewer hallucinations. Furthermore, human evaluations verify the coherence, effectiveness, and interpretability of the intent produced by SWI. This proof-of-concept study creates a novel avenue for enhancing LLMs' reasoning abilities with cognitive notions.

Vision-to-Music Generation: A Survey

Zhaokai Wang,Chenxi Bao,Le Zhuo,Jingrui Han,Yang Yue,Yihong Tang,Victor Shea-Jay Huang,Yue Liao

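Task: Systematically review research progress in vision-to-music generation, including video-to-music and image-to-music tasks.

Motivation: Vision-to-music generation is a significant branch of multimodal artificial intelligence with broad application prospects, but research is still at a preliminary stage and a comprehensive survey is lacking.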

Details

Method: Analyzes the technical characteristics and core challenges of different input types (general videos, human movement videos, images) and output types (symbolic music, audio music), summarizes existing methods, and reviews common datasets and evaluation metrics in detail. Result: Provides a comprehensive survey of vision-to-music generation that summarizes current methods and challenges. Conclusion: The authors hope the survey will inspire further innovation in vision-to-music generation and multimodal generation more broadly, and they maintain a continuously updated GitHub repository. Abstract: Vision-to-music Generation, including video-to-music and image-to-music tasks, is a significant branch of multimodal artificial intelligence demonstrating vast application prospects in fields such as film scoring, short video creation, and dance music synthesis. However, compared to the rapid development of modalities like text and images, research in vision-to-music is still in its preliminary stage due to its complex internal structure and the difficulty of modeling dynamic relationships with video. Existing surveys focus on general music generation without comprehensive discussion on vision-to-music. In this paper, we systematically review the research progress in the field of vision-to-music generation. We first analyze the technical characteristics and core challenges for three input types: general videos, human movement videos, and images, as well as two output types of symbolic music and audio music. We then summarize the existing methodologies on vision-to-music generation from the architecture perspective. A detailed review of common datasets and evaluation metrics is provided. Finally, we discuss current challenges and promising directions for future research. We hope our survey can inspire further innovation in vision-to-music generation and the broader field of multimodal generation in academic research and industrial applications. To follow the latest work and foster further innovation in this field, we are continuously maintaining a GitHub repository at https://github.com/wzk1015/Awesome-Vision-to-Music-Generation.

Evaluating book summaries from internal knowledge in Large Language Models: a cross-model and semantic consistency approach

Javier Coronado-Blázquez

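Task: Study the ability of large language models (LLMs) to generate comprehensive and accurate book summaries from internal knowledge alone.

Motivation: Examine whether LLMs can synthesize narratives that align with established human interpretations without access to the original text.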

Details

Method: Uses a diverse set of books and multiple LLM architectures, comparing AI-generated summaries with high-quality human-written summaries under an LLM-as-a-judge paradigm. Result: The results show nuanced variations in content representation and stylistic preferences among the models, revealing both strengths and limitations of relying on internal knowledge for summarization. Conclusion: The study deepens the understanding of how LLMs internally encode factual information and has implications for building more robust natural language generation systems. Abstract: We study the ability of large language models (LLMs) to generate comprehensive and accurate book summaries solely from their internal knowledge, without recourse to the original text. Employing a diverse set of books and multiple LLM architectures, we examine whether these models can synthesize meaningful narratives that align with established human interpretations. Evaluation is performed with an LLM-as-a-judge paradigm: each AI-generated summary is compared against a high-quality, human-written summary via a cross-model assessment, where all participating LLMs evaluate not only their own outputs but also those produced by others. This methodology enables the identification of potential biases, such as the proclivity for models to favor their own summarization style over others. In addition, alignment between the human-crafted and LLM-generated summaries is quantified using ROUGE and BERTScore metrics, assessing the depth of grammatical and semantic correspondence. The results reveal nuanced variations in content representation and stylistic preferences among the models, highlighting both strengths and limitations inherent in relying on internal knowledge for summarization tasks. These findings contribute to a deeper understanding of LLM internal encodings of factual information and the dynamics of cross-model evaluation, with implications for the development of more robust natural language generative systems.
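One way to look for the self-preference bias mentioned above is to arrange the cross-model scores in a judge-by-author table and compare each model's score for its own summary with the average score other judges give that same summary. The sketch below illustrates that calculation under stated assumptions: the function name `self_preference`, the model names, and the 0 to 10 score scale are hypothetical, and the paper's actual bias analysis may differ.

```python
def self_preference(scores):
    """Estimate self-preference from a cross-model score table.

    `scores[judge][author]` is the score that `judge` gave to the
    summary written by `author`. For each model the result is its
    self-assigned score minus the mean score from all other judges;
    a positive value suggests the model favors its own output.
    """
    models = list(scores)
    bias = {}
    for model in models:
        from_others = [scores[judge][model] for judge in models if judge != model]
        bias[model] = scores[model][model] - sum(from_others) / len(from_others)
    return bias


# Hypothetical scores on a 0-10 scale for three made-up models.
scores = {
    "model_a": {"model_a": 9.0, "model_b": 6.0, "model_c": 5.0},
    "model_b": {"model_a": 7.0, "model_b": 8.0, "model_c": 5.0},
    "model_c": {"model_a": 7.0, "model_b": 6.0, "model_c": 5.0},
}
bias = self_preference(scores)
```

In this made-up table, `model_a` and `model_b` each rate their own summary two points above what the other judges give it, while `model_c` shows no gap.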

Learn by Reasoning: Analogical Weight Generation for Few-Shot Class-Incremental Learning

Jizhou Han,Chenhao Ding,Yuhang He,Songlin Dong,Qiang Wang,Xinyuan Gao,Yihong Gong

Task: Few-shot class-incremental Learning (FSCIL) enables models to learn new classes from limited data while retaining performance on previously learned classes.

Motivation: Traditional FSCIL methods suffer from a separation between learning new classes and utilizing old knowledge, and require fine-tuning parameters with limited new class data.

Details

Method: Proposed a novel analogical generative method with Brain-Inspired Analogical Generator (BiAG), which includes Weight Self-Attention Module (WSA), Weight & Prototype Analogical Attention Module (WPAA), and Semantic Conversion Module (SCM). Result: Experiments on miniImageNet, CUB-200, and CIFAR-100 datasets show higher final and average accuracy compared to SOTA methods. Conclusion: The proposed method effectively addresses the limitations of traditional FSCIL methods by leveraging analogical learning mechanisms without parameter fine-tuning. Abstract: Few-shot class-incremental Learning (FSCIL) enables models to learn new classes from limited data while retaining performance on previously learned classes. Traditional FSCIL methods often require fine-tuning parameters with limited new class data and suffer from a separation between learning new classes and utilizing old knowledge. Inspired by the analogical learning mechanisms of the human brain, we propose a novel analogical generative method. Our approach includes the Brain-Inspired Analogical Generator (BiAG), which derives new class weights from existing classes without parameter fine-tuning during incremental stages. BiAG consists of three components: Weight Self-Attention Module (WSA), Weight & Prototype Analogical Attention Module (WPAA), and Semantic Conversion Module (SCM). SCM uses Neural Collapse theory for semantic conversion, WSA supplements new class weights, and WPAA computes analogies to generate new class weights. Experiments on miniImageNet, CUB-200, and CIFAR-100 datasets demonstrate that our method achieves higher final and average accuracy compared to SOTA methods.

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond

Xiaoye Qu,Yafu Li,Zhaochen Su,Weigao Sun,Jianhao Yan,Dongrui Liu,Ganqu Cui,Daizong Liu,Shuxian Liang,Junxian He,Peng Li,Wei Wei,Jing Shao,Chaochao Lu,Yue Zhang,Xian-Sheng Hua,Bowen Zhou,Yu Cheng

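Task: Survey recent efforts to improve reasoning efficiency in Large Reasoning Models (LRMs).

Motivation: LRMs produce lengthy reasoning traces that contain redundant content, over-analysis of simple problems, and superficial exploration of reasoning paths for harder tasks, which causes efficiency problems in training, inference, and real-world deployment.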

Details

Method: Identifies patterns of inefficiency across the LRM lifecycle (from pretraining to inference) and summarizes the methods proposed to address them. Result: Provides a comprehensive overview of reasoning efficiency in LRMs and discusses future research directions. Conclusion: The survey aims to serve as a foundation for further research and innovation in this rapidly evolving area. Abstract: Recent Large Reasoning Models (LRMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference. However, a growing concern lies in their tendency to produce excessively long reasoning traces, which are often filled with redundant content (e.g., repeated definitions), over-analysis of simple problems, and superficial exploration of multiple reasoning paths for harder tasks. This inefficiency introduces significant challenges for training, inference, and real-world deployment (e.g., in agent-based systems), where token economy is critical. In this survey, we provide a comprehensive overview of recent efforts aimed at improving reasoning efficiency in LRMs, with a particular focus on the unique challenges that arise in this new paradigm. We identify common patterns of inefficiency, examine methods proposed across the LRM lifecycle, i.e., from pretraining to inference, and discuss promising future directions for research. To support ongoing development, we also maintain a real-time GitHub repository tracking recent progress in the field. We hope this survey serves as a foundation for further exploration and inspires innovation in this rapidly evolving area.

Reducing CT Metal Artifacts by Learning Latent Space Alignment with Gemstone Spectral Imaging Data

Wencheng Han,Dongqian Guo,Xiao Chen,Pang Lyu,Yi Jin,Jianbing Shen

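Task: Reduce metal artifacts in CT slices with the Latent Gemstone Spectral Imaging (GSI) Alignment Framework.

Motivation: Metal artifacts degrade CT image quality and hinder accurate interpretation of tissues adjacent to metal implants.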

Details

Method: Develops an alignment framework that adjusts the representation of ordinary CT images to match GSI CT sequences, thereby suppressing metal artifacts. Result: Experiments show that the method significantly reduces metal artifacts and improves the readability of CT slices. Conclusion: The proposed framework effectively addresses metal artifacts without introducing extraneous noise information. Abstract: Metal artifacts in CT slices have long posed challenges in medical diagnostics. These artifacts degrade image quality, resulting in suboptimal visualization and complicating the accurate interpretation of tissues adjacent to metal implants. To address these issues, we introduce the Latent Gemstone Spectral Imaging (GSI) Alignment Framework, which effectively reduces metal artifacts while avoiding the introduction of noise information. Our work is based on a key finding that even artifact-affected ordinary CT sequences contain sufficient information to discern detailed structures. The challenge lies in the inability to clearly represent this information. To address this issue, we developed an Alignment Framework that adjusts the representation of ordinary CT images to match GSI CT sequences. GSI is an advanced imaging technique using multiple energy levels to mitigate artifacts caused by metal implants. By aligning the representation to GSI data, we can effectively suppress metal artifacts while clearly revealing detailed structure, without introducing extraneous information into CT sequences. To facilitate the application, we propose a new dataset, Artifacts-GSI, captured from real patients with metal implants, and establish a new benchmark based on this dataset. Experimental results show that our method significantly reduces metal artifacts and greatly enhances the readability of CT slices. All our code and data are available at: https://um-lab.github.io/GSI-MAR/

COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing

Rajvee Sheth,Himanshu Beniwal,Mayank Singh

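Task: Build and release the COMI-LINGUA dataset, which supports five fundamental NLP tasks.

Motivation: Existing datasets mostly focus on romanized text or rely on synthetic data and fail to capture real-world language mixing; human annotation is needed to assess naturalness and acceptability.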

Details

Method: Introduces the COMI-LINGUA dataset of 100,970 instances, annotated by three experts and covering both Devanagari and Roman scripts. Result: An evaluation of LLMs on the five NLP tasks reveals limitations in current multilingual modeling strategies. Conclusion: COMI-LINGUA is publicly available, and the results underline the need for better code-mixed text processing. Abstract: The rapid growth of digital communication has driven the widespread use of code-mixing, particularly Hindi-English, in multilingual communities. Existing datasets often focus on romanized text, have limited scope, or rely on synthetic data, which fails to capture real-world language nuances. Human annotations are crucial for assessing the naturalness and acceptability of code-mixed text. To address these challenges, we introduce COMI-LINGUA, the largest manually annotated dataset for code-mixed text, comprising 100,970 instances evaluated by three expert annotators in both Devanagari and Roman scripts. The dataset supports five fundamental NLP tasks: Language Identification, Matrix Language Identification, Part-of-Speech Tagging, Named Entity Recognition, and Translation. We evaluate LLMs on these tasks using COMI-LINGUA, revealing limitations in current multilingual modeling strategies and emphasizing the need for improved code-mixed text processing capabilities. COMI-LINGUA is publicly available at: https://huggingface.co/datasets/LingoIITGN/COMI-LINGUA.

vGamba: Attentive State Space Bottleneck for efficient Long-range Dependencies in Visual Recognition

Yunusa Haruna,Adamu Lawan

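Task: Propose vGamba, a hybrid vision backbone that combines state-space models (SSMs) with attention mechanisms to capture long-range dependencies in vision tasks efficiently.

Motivation: Existing methods such as CNNs and ViTs are limited in capturing long-range dependencies: CNNs have restricted receptive fields and ViTs are computationally expensive, while SSMs remain underexplored in vision.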

Details

Method: vGamba is built around the Gamba bottleneck block (comprising the Gamba Cell, Multi-Head Self-Attention, and a Gated Fusion Module), combining the low computational demands of SSMs with the accuracy of attention to model long-range dependencies efficiently. Result: Experiments on classification, detection, and segmentation show that vGamba achieves a superior trade-off between accuracy and computational efficiency, outperforming several existing models. Conclusion: By combining SSMs with attention, vGamba offers an efficient and expressive vision backbone suitable for a range of vision tasks. Abstract: Capturing long-range dependencies efficiently is essential for visual recognition tasks, yet existing methods face limitations. Convolutional neural networks (CNNs) struggle with restricted receptive fields, while Vision Transformers (ViTs) achieve global context and long-range modeling at a high computational cost. State-space models (SSMs) offer an alternative, but their application in vision remains underexplored. This work introduces vGamba, a hybrid vision backbone that integrates SSMs with attention mechanisms to enhance efficiency and expressiveness. At its core is the Gamba bottleneck block, which includes the Gamba Cell, an adaptation of Mamba for 2D spatial structures, alongside a Multi-Head Self-Attention (MHSA) mechanism and a Gated Fusion Module for effective feature representation. The interplay of these components ensures that vGamba leverages the low computational demands of SSMs while maintaining the accuracy of attention mechanisms for modeling long-range dependencies in vision tasks. Additionally, the Fusion module enables seamless interaction between these components. Extensive experiments on classification, detection, and segmentation tasks demonstrate that vGamba achieves a superior trade-off between accuracy and computational efficiency, outperforming several existing models.

How do language models learn facts? Dynamics, curricula and hallucinations

Nicolas Zucchet,Jörg Bornschein,Stephanie Chan,Andrew Lampinen,Razvan Pascanu,Soham De

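Task: Study the learning dynamics of language models on a synthetic factual recall task.

Motivation: Large language models accumulate vast knowledge during pre-training, but the dynamics of this knowledge acquisition remain unclear.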

Details

Method: Analyze language-model learning dynamics on a synthetic factual recall task, covering learning phases, the formation of attention circuits, the effect of the data distribution, and hallucinations. Result: Language models learn in three phases; the data distribution significantly affects learning dynamics; hallucinations emerge together with knowledge; and new knowledge is hard to integrate through fine-tuning. Conclusion: The data distribution is crucial to knowledge acquisition, and the results suggest new data scheduling strategies to accelerate neural network training. Abstract: Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language models on a synthetic factual recall task, uncovering three key findings: First, language models learn in three phases, exhibiting a performance plateau before acquiring precise factual knowledge. Mechanistically, this plateau coincides with the formation of attention-based circuits that support recall. Second, the training data distribution significantly impacts learning dynamics, as imbalanced distributions lead to shorter plateaus. Finally, hallucinations emerge simultaneously with knowledge, and integrating new knowledge into the model through fine-tuning is challenging, as it quickly corrupts its existing parametric memories. Our results emphasize the importance of data distribution in knowledge acquisition and suggest novel data scheduling strategies to accelerate neural network training.

Ming Yan,Xincheng Lin,Yuhua Luo,Shuqi Fan,Yudi Dai,Qixin Zhong,Lincai Zhong,Yuexin Ma,Lan Xu,Chenglu Wen,Siqi Shen,Cheng Wang

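Task: Address the shortage of climbing motion datasets and propose ClimbingCap, a new motion recovery method.

Motivation: Climbing motion is understudied, mainly because large-scale, challenging 3D-labeled datasets are lacking.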

Details

Method: Collect and annotate AscendMotion, a large-scale climbing motion dataset, and propose ClimbingCap, which combines the RGB and LiDAR modalities to reconstruct motion. Result: Demonstrates the quality of the AscendMotion dataset and reports promising results for ClimbingCap. Conclusion: The AscendMotion dataset and source code are publicly released, providing resources for climbing motion research. Abstract: Human Motion Recovery (HMR) research mainly focuses on ground-based motions such as running. The study on capturing climbing motion, an off-ground motion, is sparse. This is partly due to the limited availability of climbing motion datasets, especially large-scale and challenging 3D labeled datasets. To address the insufficiency of climbing motion datasets, we collect AscendMotion, a large-scale, well-annotated, and challenging climbing motion dataset. It consists of 412k RGB, LiDAR frames, and IMU measurements, including the challenging climbing motions of 22 skilled climbing coaches across 12 different rock walls. Capturing the climbing motions is challenging as it requires precise recovery of not only the complex pose but also the global position of climbers. Although multiple global HMR methods have been proposed, they cannot faithfully capture climbing motions. To address the limitations of HMR methods for climbing, we propose ClimbingCap, a motion recovery method that reconstructs continuous 3D human climbing motion in a global coordinate system. One key insight is to use the RGB and LiDAR modalities to separately reconstruct motions in camera coordinates and global coordinates and to optimize them jointly. We demonstrate the quality of the AscendMotion dataset and present promising results from ClimbingCap. The AscendMotion dataset and source code are publicly released at http://www.lidarhumanmotion.net/climbingcap/

JiraiBench: A Bilingual Benchmark for Evaluating Large Language Models' Detection of Human Self-Destructive Behavior Content in Jirai Community

Yunze Xiao,Tingyu He,Lionel Z. Wang,Yiming Ma,Xingyu Song,Xiaohang Xu,Irene Li,Ka Chung Ng

Task: Introduce JiraiBench, the first bilingual benchmark for evaluating large language models' effectiveness in detecting self-destructive content in Chinese and Japanese social media.

Motivation: Address the transnational 'Jirai' online subculture and its associated self-destructive behaviors, emphasizing the need for culturally-informed content moderation.

Details

Method: Develop a comprehensive evaluation framework with a dataset of 10,419 Chinese and 5,000 Japanese posts, annotated along three behavioral categories, and evaluate four state-of-the-art models. Result: Japanese prompts outperformed Chinese prompts when processing Chinese content, indicating cultural proximity can outweigh linguistic similarity. Cross-lingual transfer experiments showed potential for knowledge transfer without explicit target language training. Conclusion: Cultural context is crucial for effective detection systems, highlighting the need for culturally-informed approaches in multilingual content moderation. Abstract: This paper introduces JiraiBench, the first bilingual benchmark for evaluating large language models' effectiveness in detecting self-destructive content across Chinese and Japanese social media communities. Focusing on the transnational "Jirai" (landmine) online subculture that encompasses multiple forms of self-destructive behaviors including drug overdose, eating disorders, and self-harm, we present a comprehensive evaluation framework incorporating both linguistic and cultural dimensions. Our dataset comprises 10,419 Chinese posts and 5,000 Japanese posts with multidimensional annotation along three behavioral categories, achieving substantial inter-annotator agreement. Experimental evaluations across four state-of-the-art models reveal significant performance variations based on instructional language, with Japanese prompts unexpectedly outperforming Chinese prompts when processing Chinese content. This emergent cross-cultural transfer suggests that cultural proximity can sometimes outweigh linguistic similarity in detection tasks. Cross-lingual transfer experiments with fine-tuned models further demonstrate the potential for knowledge transfer between these language systems without explicit target language training. These findings highlight the need for culturally-informed approaches to multilingual content moderation and provide empirical evidence for the importance of cultural context in developing more effective detection systems for vulnerable online communities.

Delving Deep into Semantic Relation Distillation

Zhaoyi Yan,Kangjun Liu,Qixiang Ye

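Task: Propose Semantics-based Relation Knowledge Distillation (SeRKD) to overcome the limitations of traditional instance-level knowledge distillation.

Motivation: Traditional knowledge distillation methods fail to capture semantic relationships in the data, calling for a more comprehensive, semantics-relation-based approach.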

Details

Method: SeRKD uses semantic components such as superpixels to integrate semantics-based extraction with relation-based knowledge distillation. Result: Experiments on benchmark datasets show that SeRKD outperforms existing methods, improving model performance and generalization. Conclusion: SeRKD reframes knowledge distillation through a semantics-relation lens, offering a more effective approach to model compression and distillation. Abstract: Knowledge distillation has become a cornerstone technique in deep learning, facilitating the transfer of knowledge from complex models to lightweight counterparts. Traditional distillation approaches focus on transferring knowledge at the instance level, but fail to capture nuanced semantic relationships within the data. In response, this paper introduces a novel methodology, Semantics-based Relation Knowledge Distillation (SeRKD), which reimagines knowledge distillation through a semantics-relation lens across samples. By leveraging semantic components, i.e., superpixels, SeRKD enables a more comprehensive and context-aware transfer of knowledge, integrating superpixel-based semantic extraction with relation-based knowledge distillation for sophisticated model compression and distillation. In particular, the proposed method is naturally relevant in the domain of Vision Transformers (ViTs), where visual tokens serve as fundamental units of representation. Experimental evaluations on benchmark datasets demonstrate the superiority of SeRKD over existing methods, underscoring its efficacy in enhancing model performance and generalization capabilities.

Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Wenqi Zhang,Mengna Wang,Gangao Liu,Xu Huixin,Yiwei Jiang,Yongliang Shen,Guiyang Hou,Zhe Zheng,Hang Zhang,Xin Li,Weiming Lu,Peng Li,Yueting Zhuang

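Task: Extend the reasoning capabilities of deep thinking models to embodied search tasks that require continuous interaction with the environment.

Motivation: Deep thinking models excel at mathematical and coding tasks, but remain underexplored in embodied interactive domains that demand spatial understanding, temporal reasoning, and ongoing self-reflection.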

Details

Method: Propose Embodied Reasoner, trained on 9.3k synthesized coherent Observation-Thought-Action trajectories with a three-stage pipeline (imitation learning, self-exploration, and reflection tuning). Result: The model significantly outperforms advanced visual reasoning models (e.g., OpenAI o1, o3-mini, and Claude-3.7), especially on complex long-horizon tasks. Conclusion: Embodied Reasoner shows higher efficiency and logical consistency in embodied interactive tasks, with particular advantages in complex scenarios. Abstract: Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning that relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. The evaluation shows that our model significantly outperforms advanced visual reasoning models, e.g., it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9%, +24%, and +13%. Analysis reveals our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. Real-world experiments also show our model's superiority, with fewer repeated searches and logical inconsistencies.

Zero-Shot Visual Concept Blending Without Text Guidance

Hiroya Makino,Takahiro Yamaguchi,Hiroyuki Sakai

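Task: Propose "Visual Concept Blending", a zero-shot image generation technique that gives fine-grained control over which features from multiple reference images are transferred.

Motivation: With a single reference image it is hard to isolate specific elements; multiple reference images make it possible to distinguish common from unique features.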

Details

Method: Operate in a partially disentangled CLIP embedding space to flexibly transfer texture, shape, motion, style, and more abstract concepts without additional training or text prompts. Result: Performs well on style transfer, form metamorphosis, and conceptual transformation; a user study confirms that participants accurately recognized the intended feature transfers. Conclusion: Its simplicity, flexibility, and high-level control make Visual Concept Blending valuable in creative fields such as art, design, and content creation. Abstract: We propose a novel, zero-shot image generation technique called "Visual Concept Blending" that provides fine-grained control over which features from multiple reference images are transferred to a source image. If only a single reference image is available, it is difficult to isolate which specific elements should be transferred. However, using multiple reference images, the proposed approach distinguishes between common and unique features by selectively incorporating them into a generated output. By operating within a partially disentangled Contrastive Language-Image Pre-training (CLIP) embedding space (from IP-Adapter), our method enables the flexible transfer of texture, shape, motion, style, and more abstract conceptual transformations without requiring additional training or text prompts. We demonstrate its effectiveness across a diverse range of tasks, including style transfer, form metamorphosis, and conceptual transformations, showing how subtle or abstract attributes (e.g., brushstroke style, aerodynamic lines, and dynamism) can be seamlessly combined into a new image. In a user study, participants accurately recognized which features were intended to be transferred. Its simplicity, flexibility, and high-level control make Visual Concept Blending valuable for creative fields such as art, design, and content creation, where combining specific visual qualities from multiple inspirations is crucial.

As easy as PIE: understanding when pruning causes language models to disagree

Pietro Tropeano,Maria Maistro,Tuukka Ruotsalo,Christina Lioma

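Task: Study how language model (LM) pruning affects a particular subset of data points (PIEs).

Motivation: Pruning usually focuses on efficiency gains and overlooks its effect on the accuracy of certain data points (PIEs), which have not been studied in NLP.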

Details

Method: Analyze how PIEs affect inference quality across various NLP datasets, pruning methods, and compression levels. Result: PIEs considerably affect inference quality, BERT is more prone to this than BiLSTM, and PIEs contain many of the data points most important to model generalization. Conclusion: A seemingly moderate accuracy loss from pruning in fact heavily affects the most important data points; PIEs have longer and more semantically complex text. Abstract: Language Model (LM) pruning compresses the model by removing weights, nodes, or other parts of its architecture. Typically, pruning focuses on the resulting efficiency gains at the cost of effectiveness. However, when looking at how individual data points are affected by pruning, it turns out that a particular subset of data points always bears most of the brunt (in terms of reduced accuracy) when pruning, but this effect goes unnoticed when reporting the mean accuracy of all data points. These data points are called PIEs and have been studied in image processing, but not in NLP. In a study of various NLP datasets, pruning methods, and levels of compression, we find that PIEs impact inference quality considerably, regardless of class frequency, and that BERT is more prone to this than BiLSTM. We also find that PIEs contain a high amount of data points that have the largest influence on how well the model generalises to unseen data. This means that when pruning, with seemingly moderate loss to accuracy across all data points, we in fact hurt tremendously those data points that matter the most. We trace what makes PIEs both hard and impactful to inference to their overall longer and more semantically complex text. These findings are novel and contribute to understanding how LMs are affected by pruning. The code is available at: https://github.com/pietrotrope/AsEasyAsPIE
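
The abstract does not define PIEs precisely. The sketch below assumes the definition commonly used in the image-pruning literature: a data point is a PIE when the most frequent (modal) label predicted by a population of pruned models differs from that of the unpruned models. The function names and toy predictions are hypothetical and are not taken from the authors' repository.

```python
from collections import Counter


def modal_label(predictions):
    """Return the most frequent label among a population of model predictions."""
    return Counter(predictions).most_common(1)[0][0]


def find_pies(dense_preds, pruned_preds):
    """Indices of examples whose modal label differs between the two populations.

    dense_preds[i] and pruned_preds[i] hold the labels predicted for example i
    by several independently trained dense and pruned models (assumed setup).
    """
    return [
        i
        for i, (dense, pruned) in enumerate(zip(dense_preds, pruned_preds))
        if modal_label(dense) != modal_label(pruned)
    ]


# Toy predictions from three dense and three pruned models on three examples.
dense = [["pos", "pos", "pos"], ["neg", "neg", "pos"], ["pos", "neg", "neg"]]
pruned = [["pos", "pos", "neg"], ["pos", "pos", "neg"], ["neg", "neg", "neg"]]
print(find_pies(dense, pruned))  # [1]
```

Only example 1 flips its modal label, which illustrates how mean accuracy over all points can hide the effect on this subset.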

Multi-Scale Invertible Neural Network for Wide-Range Variable-Rate Learned Image Compression

Hanyue Tu,Siqi Wu,Li Li,Wengang Zhou,Houqiang Li

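Task: Propose a variable-rate image compression model based on invertible transforms to overcome the performance limits of autoencoders at high bit rates.

Motivation: Autoencoders inherently lose information, which limits their rate-distortion performance at high bit rates and their flexibility in rate adaptation.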

Details

Method: Design a lightweight multi-scale invertible neural network that bijectively maps the input image to multi-scale latent representations, and use a multi-scale spatial-channel context model with extended gain units to estimate the entropy of the latents. Result: The method achieves state-of-the-art performance among variable-rate methods, remains competitive with multi-model approaches, and is the first to outperform VVC with a single model. Conclusion: The method performs especially well at high bit rates and is the first learned image compression solution to outperform VVC across a wide range of bit rates. Abstract: Autoencoder-based structures have dominated recent learned image compression methods. However, the inherent information loss associated with autoencoders limits their rate-distortion performance at high bit rates and restricts their flexibility of rate adaptation. In this paper, we present a variable-rate image compression model based on an invertible transform to overcome these limitations. Specifically, we design a lightweight multi-scale invertible neural network, which bijectively maps the input image into multi-scale latent representations. To improve the compression efficiency, a multi-scale spatial-channel context model with extended gain units is devised to estimate the entropy of the latent representation from high to low levels. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods, and remains competitive with recent multi-model approaches. Notably, our method is the first learned image compression solution that outperforms VVC across a very wide range of bit rates using a single model, especially at high bit rates. The source code is available at https://github.com/hytu99/MSINN-VRLIC.

CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?

Jiefu Ou,William Gantt Walden,Kate Sanders,Zhengping Jiang,Kaiser Sun,Jeffrey Cheng,William Jurayj,Miriam Wanner,Shaobo Liang,Candice Morgan,Seunghoon Han,Weiqi Wang,Chandler May,Hannah Recknor,Daniel Khashabi,Benjamin Van Durme

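Task: Build the CLAIMCHECK dataset for evaluating large language models on scientific paper review.

Motivation: Automatically generated reviews may lack scientific grounding, making it hard to ensure that they are sound and targeted.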

Details

Method: Extract NeurIPS 2023 and 2024 papers and reviews from OpenReview, and have ML experts annotate weakness statements, the claims they dispute, and weakness labels. Result: Cutting-edge LLMs are capable of predicting weakness labels but still underperform human experts on the other tasks. Conclusion: CLAIMCHECK provides an important benchmark for evaluating and improving LLMs in scientific reviewing. Abstract: A core part of scientific peer review involves providing expert critiques that directly assess the scientific claims a paper makes. While it is now possible to automatically generate plausible (if generic) reviews, ensuring that these reviews are sound and grounded in the papers' claims remains challenging. To facilitate LLM benchmarking on these challenges, we introduce CLAIMCHECK, an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. CLAIMCHECK is richly annotated by ML experts for weakness statements in the reviews and the paper claims that they dispute, as well as fine-grained labels of the validity, objectivity, and type of the identified weaknesses. We benchmark several LLMs on three claim-centric tasks supported by CLAIMCHECK, requiring models to (1) associate weaknesses with the claims they dispute, (2) predict fine-grained labels for weaknesses and rewrite the weaknesses to enhance their specificity, and (3) verify a paper's claims with grounded reasoning. Our experiments reveal that cutting-edge LLMs, while capable of predicting weakness labels in (2), continue to underperform relative to human experts on all other tasks.

InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression

Dongchen Lu,Yuyao Sun,Zilu Zhang,Leping Huang,Jianliang Zeng,Mao Shu,Huo Cao

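Task: Improve the performance and efficiency of multimodal large language models (MLLMs) with three visual token compression methods.

Motivation: Existing MLLMs treat visual tokens as a text sequence, which leads to high computational cost and low efficiency.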

Details

Method: Propose InternVL-X, which comprises PVTC (point-to-region cross-attention with local and global queries), LVTC (layer-wise visual token compression), and RVTC (an efficient high-resolution slicing method). Result: Using 20% or fewer visual tokens, it achieves state-of-the-art performance on 7 public MLLM benchmarks and improves the average metric by 2.34% across 12 tasks. Conclusion: InternVL-X's visual token compression methods significantly improve MLLM performance and efficiency. Abstract: Most multimodal large language models (MLLMs) treat visual tokens as "a sequence of text", integrating them with text tokens into a large language model (LLM). However, a great quantity of visual tokens significantly increases the demand for computational resources and time. In this paper, we propose InternVL-X, which outperforms the InternVL model in both performance and efficiency by incorporating three visual token compression methods. First, we propose a novel vision-language projector, PVTC. This component integrates adjacent visual embeddings to form a local query and utilizes the transformed CLS token as a global query, then performs point-to-region cross-attention through these local and global queries to more effectively convert visual features. Second, we present a layer-wise visual token compression module, LVTC, which compresses tokens in the LLM shallow layers and then expands them through upsampling and residual connections in the deeper layers. This significantly enhances the model's computational efficiency. Furthermore, we propose an efficient high-resolution slicing method, RVTC, which dynamically adjusts the number of visual tokens based on image area or length filtering. RVTC greatly enhances training efficiency with only a slight reduction in performance. By utilizing 20% or fewer visual tokens, InternVL-X achieves state-of-the-art performance on 7 public MLLM benchmarks, and improves the average metric by 2.34% across 12 tasks.

Outlier dimensions favor frequent tokens in language models

Iuri Macocco,Nora Graichen,Gemma Boleda,Marco Baroni

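Task: Study last-layer outlier dimensions (dimensions that show extreme activations for most inputs) and their role in language models.

Motivation: Explore the widespread phenomenon of outlier dimensions in modern language models and reveal its connection to the heuristic of predicting frequent words.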

Details

Method: Analyze the function of outlier dimensions, study how a model blocks the heuristic when it is inappropriate by assigning counterbalancing weight mass, and investigate which parameters boost outlier dimensions and when they arise during training. Result: Outlier dimensions are a specialized mechanism discovered by many distinct models to implement a useful token prediction heuristic. Conclusion: Outlier dimensions are an effective mechanism by which language models implement a specific prediction heuristic. Abstract: We study last-layer outlier dimensions, i.e., dimensions that display extreme activations for the majority of inputs. We show that outlier dimensions arise in many different modern language models, and trace their function back to the heuristic of constantly predicting frequent words. We further show how a model can block this heuristic when it is not contextually appropriate, by assigning a counterbalancing weight mass to the remaining dimensions, and we investigate which model parameters boost outlier dimensions and when they arise during training. We conclude that outlier dimensions are a specialized mechanism discovered by many distinct models to implement a useful token prediction heuristic.
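
The definition above ("extreme activations for the majority of inputs") can be made concrete with a small sketch. The rule used here, which flags a dimension when its magnitude exceeds a multiple of that input's mean absolute activation, and the threshold values are illustrative assumptions, not the criterion used in the paper.

```python
def outlier_dimensions(activations, threshold=3.0, min_fraction=0.5):
    """Flag dimensions that are extreme for more than `min_fraction` of inputs.

    activations: list of equal-length last-layer vectors, one per input.
    A value counts as extreme when its magnitude exceeds `threshold` times
    the mean absolute activation of the same vector (assumed rule).
    """
    n_inputs = len(activations)
    n_dims = len(activations[0])
    extreme_counts = [0] * n_dims
    for vector in activations:
        mean_abs = sum(abs(v) for v in vector) / n_dims
        for dim, value in enumerate(vector):
            if abs(value) > threshold * mean_abs:
                extreme_counts[dim] += 1
    return [d for d, c in enumerate(extreme_counts) if c / n_inputs > min_fraction]


# Toy activations: dimension 2 is extreme for three of the four inputs.
acts = [
    [0.1, -0.2, 9.0, 0.3],
    [0.2, 0.1, 8.5, -0.1],
    [0.0, 0.3, -7.9, 0.2],
    [0.1, 0.1, 0.2, 0.1],
]
print(outlier_dimensions(acts))  # [2]
```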

FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval

Zixu Li,Zhiheng Fu,Yupeng Hu,Zhiwei Chen,Haokun Wen,Liqiang Nie

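Task: Develop a fine-grained composed image retrieval (CIR) framework and datasets to address the failure of existing coarse-grained modification text (CoarseMT) to capture fine-grained retrieval intents.

Motivation: Existing CIR datasets mainly use coarse-grained modification text, which leads to imprecise retrieval intents and ambiguity among visually similar images, reducing retrieval accuracy.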

Details

Method: Develop a fine-grained CIR data annotation pipeline and use it to refine the FashionIQ and CIRR datasets into Fine-FashionIQ and Fine-CIRR; also propose FineCIR, a framework designed to parse the modification text and capture fine-grained semantics. Result: FineCIR outperforms state-of-the-art CIR baselines on both fine-grained and traditional CIR benchmark datasets. Conclusion: The FineCIR framework and the fine-grained datasets significantly improve the precision of composed image retrieval and address the limitations of coarse-grained modification text. Abstract: Composed Image Retrieval (CIR) facilitates image retrieval through a multimodal query consisting of a reference image and modification text. The reference image defines the retrieval context, while the modification text specifies desired alterations. However, existing CIR datasets predominantly employ coarse-grained modification text (CoarseMT), which inadequately captures fine-grained retrieval intents. This limitation introduces two key challenges: (1) ignoring detailed differences leads to imprecise positive samples, and (2) greater ambiguity arises when retrieving visually similar images. These issues degrade retrieval accuracy, necessitating manual result filtering or repeated queries. To address these limitations, we develop a robust fine-grained CIR data annotation pipeline that minimizes imprecise positive samples and enhances CIR systems' ability to discern modification intents accurately. Using this pipeline, we refine the FashionIQ and CIRR datasets to create two fine-grained CIR datasets: Fine-FashionIQ and Fine-CIRR. Furthermore, we introduce FineCIR, the first CIR framework explicitly designed to parse the modification text. FineCIR effectively captures fine-grained modification semantics and aligns them with ambiguous visual entities, enhancing retrieval precision. Extensive experiments demonstrate that FineCIR consistently outperforms state-of-the-art CIR baselines on both fine-grained and traditional CIR benchmark datasets. Our FineCIR code and fine-grained CIR datasets are available at https://github.com/SDU-L/FineCIR.git.

Collab: Controlled Decoding using Mixture of Agents for LLM Alignment

Souradip Chakraborty,Sujay Bhatt,Udari Madhushani Sehwag,Soumya Suvra Ghosal,Jiahao Qiu,Mengdi Wang,Dinesh Manocha,Furong Huang,Alec Koppel,Sumitra Ganesh

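Task: Propose a decoding method based on multi-agent collaboration that dynamically selects the best LLM policy at inference time.

Motivation: Single-agent decoding methods struggle to adapt to the complexity of diverse tasks, while existing RLHF methods are computationally expensive.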

Details

Method: Mix multiple off-the-shelf aligned LLM policies and dynamically select the best model for each token, achieving alignment at inference time. Result: Collab outperforms single-agent decoding baselines, with improvements of up to 1.56x in average reward and 71.89% in GPT-4 based win-tie rate. Conclusion: Multi-agent collaborative decoding is an efficient and effective way to align LLMs at inference time. Abstract: Alignment of large language models (LLMs) is crucial for safe and trustworthy deployment in applications. Reinforcement learning from human feedback (RLHF) has emerged as an effective technique to align LLMs to human preferences and broader utilities, but it requires updating billions of model parameters, which is computationally expensive. Controlled Decoding, by contrast, provides a mechanism for aligning a model at inference time without retraining. However, single-agent decoding approaches often struggle to adapt to diverse tasks due to the complexity and variability inherent in these tasks. To strengthen the test-time performance with respect to the target task, we propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies. Treating each prior policy as an agent in the spirit of mixture of agent collaboration, we develop a decoding method that allows for inference-time alignment through a token-level selection strategy among multiple agents. For each token, the most suitable LLM is dynamically chosen from a pool of models based on a long-term utility metric. This policy-switching mechanism ensures optimal model selection at each step, enabling efficient collaboration and alignment among LLMs during decoding. Theoretical analysis of our proposed algorithm establishes optimal performance with respect to the target task represented via a target reward for the given off-the-shelf models. We conduct comprehensive empirical evaluations with open-source aligned models on diverse tasks and preferences, which demonstrate the merits of this approach over single-agent decoding baselines. Notably, Collab surpasses the current SoTA decoding strategy, achieving an improvement of up to 1.56x in average reward and 71.89% in GPT-4 based win-tie rate.
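
The token-level selection described in the abstract can be sketched as a simple loop. This is only an illustration under stated assumptions: each agent is modeled as a function that proposes one next token, and `utility` is a hypothetical stand-in for the paper's long-term utility metric, whose actual form the abstract does not give.

```python
def collab_decode(agents, utility, prompt, max_tokens, eos="<eos>"):
    """Greedy token-level policy switching among several agents.

    agents:  callables mapping the current token list to a proposed next token.
    utility: callable (tokens, candidate) -> float; hypothetical stand-in for
             the long-term utility metric mentioned in the abstract.
    """
    tokens = list(prompt)
    for _ in range(max_tokens):
        proposals = [agent(tokens) for agent in agents]
        # Keep the proposal with the highest estimated utility at this step.
        best = max(proposals, key=lambda candidate: utility(tokens, candidate))
        if best == eos:
            break
        tokens.append(best)
    return tokens


# Toy setup: two fixed agents and a utility that rewards alternation.
def agent_a(tokens):
    return "a"


def agent_b(tokens):
    return "b"


def alternation_utility(tokens, candidate):
    return 1.0 if not tokens or candidate != tokens[-1] else 0.0


print(collab_decode([agent_a, agent_b], alternation_utility, ["a"], 3))
# ['a', 'b', 'a', 'b']
```

Neither toy agent alone can produce the alternating sequence, so the example shows why switching policies per token can beat any single agent under a given utility.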

HORT: Monocular Hand-held Objects Reconstruction with Transformers

Zerui Chen,Rolandos Alexandros Potamias,Shizhe Chen,Cordelia Schmid

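Task: Efficiently reconstruct dense 3D point clouds of hand-held objects from monocular images.

Motivation: Existing methods rely on implicit 3D representations, which yield overly smooth and time-consuming reconstructions, while diffusion-based methods are inefficient at high resolution because of multi-step denoising.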

Details

Method: Propose a transformer-based model with a coarse-to-fine strategy that first generates a sparse point cloud and then progressively refines it into a dense one using pixel-aligned image features, incorporating 3D hand geometry to improve reconstruction accuracy. Result: Achieves state-of-the-art accuracy and faster inference on synthetic and real datasets, and generalizes to in-the-wild images. Conclusion: The method outperforms existing methods in both efficiency and accuracy and is suitable for practical applications. Abstract: Reconstructing hand-held objects in 3D from monocular images remains a significant challenge in computer vision. Most existing approaches rely on implicit 3D representations, which produce overly smooth reconstructions and are time-consuming to generate explicit 3D shapes. While more recent methods directly reconstruct point clouds with diffusion models, the multi-step denoising makes high-resolution reconstruction inefficient. To address these limitations, we propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method follows a coarse-to-fine strategy, first generating a sparse point cloud from the image and progressively refining it into a dense representation using pixel-aligned image features. To enhance reconstruction accuracy, we integrate image features with 3D hand geometry to jointly predict the object point cloud and its pose relative to the hand. Our model is trained end-to-end for optimal performance. Experimental results on both synthetic and real datasets demonstrate that our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.

ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation

Zhicheng Lee,Shulin Cao,Jinxin Liu,Jiajie Zhang,Weichuan Liu,Xiaoyin Che,Lei Hou,Juanzi Li

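Task: Propose ReaRAG to improve the factual accuracy of large reasoning models (LRMs) and address the overthinking and robustness problems of existing retrieval-augmented models.

Motivation: Existing RL-based retrieval-augmented models suffer from overthinking and insufficiently robust reasoning in question answering, which limits their factual accuracy.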

Details

Method: Propose ReaRAG, which combines a data construction framework with a predefined action space (Search and Finish), iteratively executing Search actions and using the retrieved results to guide reasoning until a Finish action is chosen. Result: ReaRAG outperforms existing baselines on multi-hop QA and shows a strong ability to recognize errors and refine its reasoning trajectory. Conclusion: ReaRAG improves LRM factuality while effectively integrating robust reasoning into retrieval-augmented generation. Abstract: Large Reasoning Models (LRMs) exhibit remarkable reasoning abilities but rely primarily on parametric knowledge, limiting factual accuracy. While recent works equip reinforcement learning (RL)-based LRMs with retrieval capabilities, they suffer from overthinking and lack robustness in reasoning, reducing their effectiveness in question answering (QA) tasks. To address this, we propose ReaRAG, a factuality-enhanced reasoning model that explores diverse queries without excessive iterations. Our solution includes a novel data construction framework with an upper bound on the reasoning chain length. Specifically, we first leverage an LRM to generate deliberate thinking, then select an action from a predefined action space (Search and Finish). For the Search action, a query is executed against the RAG engine, where the result is returned as observation to guide reasoning steps later. This process iterates until a Finish action is chosen. Benefiting from ReaRAG's strong reasoning capabilities, our approach outperforms existing baselines on multi-hop QA. Further analysis highlights its strong reflective ability to recognize errors and refine its reasoning trajectory. Our study enhances LRMs' factuality while effectively integrating robust reasoning for Retrieval-Augmented Generation (RAG).
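
The Search/Finish loop with a bounded chain length can be sketched as follows. This is a minimal illustration, assuming the reasoning model and the RAG engine are exposed as two plain functions; `think`, `retrieve`, `max_steps`, and the toy knowledge base are hypothetical names introduced here, not the authors' interface.

```python
def rearag_loop(question, think, retrieve, max_steps=4):
    """Iterate think -> act until Finish or the chain-length bound is reached.

    think(question, trace) returns ("search", query) or ("finish", answer).
    retrieve(query) returns an observation string from the retrieval engine.
    """
    trace = []  # (query, observation) pairs gathered so far
    for _ in range(max_steps):
        action, argument = think(question, trace)
        if action == "finish":
            return argument, trace
        # Search action: run the query and keep the result as an observation.
        trace.append((argument, retrieve(argument)))
    return None, trace  # bound reached without a Finish action


# Toy stand-ins for the reasoning model and the retrieval engine.
KNOWLEDGE = {"capital of France": "Paris"}


def toy_think(question, trace):
    if not trace:
        return "search", "capital of France"
    return "finish", trace[-1][1]


def toy_retrieve(query):
    return KNOWLEDGE.get(query, "no result")


answer, trace = rearag_loop("What is the capital of France?", toy_think, toy_retrieve)
print(answer)  # Paris
```

The explicit `max_steps` bound mirrors the upper limit on reasoning chain length that the abstract credits with curbing excessive iterations.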

DuckSegmentation: A segmentation model based on the AnYue Hemp Duck Dataset

Ling Feng,Tianyu Xie,Wei Ma,Ruijie Fu,Yingxiao Zhang,Jun Li,Bei Zhou

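Task: Build the AnYue Shelduck dataset and develop DuckProcessing, an efficient duck identification module, for duck recognition and segmentation in smart farming.

Motivation: Large models cannot be deployed in practice in agriculture because of their poor interpretability and heavy computation.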

Details

Method: Perform object detection and segmentation with YOLOv8 and the DuckSegmentation model, and optimize the model through knowledge distillation. Result: On the test set, YOLOv8 reaches 98.10% Precision, 96.53% Recall, and an F1 score of 0.95; DuckSegmentation reaches 96.43% mIoU; the student model (Deeplabv3 r50) reaches 94.49% mIoU. Conclusion: DuckProcessing provides an efficient and practical solution for duck identification in smart farming. Abstract: The modernization of smart farming is a way to improve agricultural production efficiency and the agricultural production environment. Although many large models have achieved high accuracy in object recognition and segmentation, they cannot really be put into use in the farming industry because of their poor interpretability and heavy computational cost. In this paper, we build the AnYue Shelduck Dataset, which contains a total of 1951 Shelduck samples, with target detection and segmentation annotations made with the help of professional annotators. Based on the AnYue Shelduck Dataset, this paper describes DuckProcessing, an efficient and powerful module for duck identification based on real shelduck farms. First, a YOLOv8 module designed to distinguish individual ducks reaches 98.10% Precision, 96.53% Recall, and an F1 score of 0.95 on the test set. Second, the DuckSegmentation segmentation model reaches 96.43% mIoU. Finally, DuckSegmentation is used as the teacher model and, through knowledge distillation, Deeplabv3 r50 is used as the student model; the final student model achieves 94.49% mIoU on the test set. The method provides a new way of thinking about practical hemp duck smart farming.

Effective Skill Unlearning through Intervention and Abstention

Yongce Li,Chung-En Sun,Tsui-Wei Weng

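Task: Study methods for unlearning a specific skill in large language models (LLMs) while retaining their overall capabilities.

Motivation: Understanding the mechanisms behind LLM abilities and gaining control over them is essential for developing better models.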

Details

Method: Propose two lightweight, training-free skill unlearning techniques: Neuron Adjust and Key Space Detection. Result: On unlearning math-solving, Python-coding, and comprehension skills, Key Space Detection achieves over 80% relative performance drop on the target skill and less than 10% drop on other skills and on the model's general knowledge (MMLU). Conclusion: The proposed methods perform well on skill-specific unlearning and provide an effective tool for controlling LLMs. Abstract: Large language models (LLMs) have demonstrated remarkable skills across various domains. Understanding the mechanisms behind their abilities and implementing controls over them is becoming increasingly important for developing better models. In this paper, we focus on skill unlearning in LLMs, specifically unlearning a particular skill while retaining their overall capabilities. We introduce two lightweight, training-free machine skill unlearning techniques for LLMs. First, we observe that the pre-activation distribution of neurons in each Feed-Forward Layer (FFL) differs when the model demonstrates different skills. Additionally, we find that queries triggering the same skill cluster within the FFL key space and can be separated from other queries using a hypercube. Based on these observations, we propose two lightweight, training-free skill unlearning methods via intervention and abstention respectively: Neuron Adjust and Key Space Detection. We evaluate our methods on unlearning math-solving, Python-coding, and comprehension skills across seven different languages. The results demonstrate their strong unlearning capabilities for the designated skills. Specifically, Key Space Detection achieves over 80% relative performance drop on the forgetting skill and less than 10% relative performance drop on other skills and the model's general knowledge (MMLU) for most unlearning tasks. Our code is available at https://github.com/Trustworthy-ML-Lab/effective_skill_unlearning
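
The hypercube idea behind abstention can be illustrated with a small sketch. It assumes the hypercube is an axis-aligned box of mean plus or minus `alpha` standard deviations per dimension, fitted on key vectors of queries that trigger the skill to be forgotten; this box construction, the `alpha` value, and the toy 2-D keys are assumptions for illustration and may differ from the authors' implementation.

```python
from statistics import mean, pstdev


def fit_hypercube(keys, alpha=2.0):
    """Per-dimension (low, high) bounds around key vectors of the target skill."""
    bounds = []
    for values in zip(*keys):  # iterate over dimensions
        center, spread = mean(values), pstdev(values)
        bounds.append((center - alpha * spread, center + alpha * spread))
    return bounds


def inside(bounds, key):
    """True when the key vector falls inside the box on every dimension."""
    return all(low <= x <= high for (low, high), x in zip(bounds, key))


def respond(model, key_of, bounds, query):
    """Abstain when the query's key lies in the forgotten skill's hypercube."""
    if inside(bounds, key_of(query)):
        return "[abstained]"
    return model(query)


# Toy 2-D key vectors for queries that trigger the skill to be forgotten.
math_keys = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]]
box = fit_hypercube(math_keys)
print(inside(box, [1.1, 0.9]))   # True
print(inside(box, [3.0, -1.0]))  # False
```

Because queries outside the box are passed to the model unchanged, this style of abstention leaves other skills untouched by construction, which is consistent with the small drop on other skills reported above.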

UGNA-VPR: A Novel Training Paradigm for Visual Place Recognition Based on Uncertainty-Guided NeRF Augmentation

Yehui Shen,Lei Zhang,Qingqiu Li,Xiongwei Zhao,Yue Wang,Huimin Lu,Xieyuanli Chen

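Task: Improve the performance of visual place recognition (VPR) networks through uncertainty estimation and NeRF-based data augmentation.

Motivation: Most existing VPR datasets cover only single-viewpoint scenarios, which reduces recognition accuracy in multi-directional driving or feature-sparse scenes, and collecting additional data is expensive.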

Details

Method: Combine NeRF training with a self-supervised uncertainty estimation network to generate new synthetic observations, and improve the data storage method. Result: Experiments on three datasets and three VPR backbone networks show that the method significantly improves VPR performance. Conclusion: The proposed training paradigm makes full use of existing data, outperforms other training approaches, and is validated on self-recorded datasets. Abstract: Visual place recognition (VPR) is crucial for robots to identify previously visited locations, playing an important role in autonomous navigation in both indoor and outdoor environments. However, most existing VPR datasets are limited to single-viewpoint scenarios, leading to reduced recognition accuracy, particularly in multi-directional driving or feature-sparse scenes. Moreover, obtaining additional data to mitigate these limitations is often expensive. This paper introduces a novel training paradigm to improve the performance of existing VPR networks by enhancing multi-view diversity within current datasets through uncertainty estimation and NeRF-based data augmentation. Specifically, we initially train NeRF using the existing VPR dataset. Then, our devised self-supervised uncertainty estimation network identifies places with high uncertainty. The poses of these uncertain places are input into NeRF to generate new synthetic observations for further training of VPR networks. Additionally, we propose an improved storage method for efficient organization of augmented and original training data. We conducted extensive experiments on three datasets and tested three different VPR backbone networks. The results demonstrate that our proposed training paradigm significantly improves VPR performance by fully utilizing existing data, outperforming other training approaches. We further validated the effectiveness of our approach on self-recorded indoor and outdoor datasets, consistently demonstrating superior results. Our dataset and code have been released at https://github.com/nubot-nudt/UGNA-VPR.

MemInsight: Autonomous Memory Augmentation for LLM Agents

Rana Salama,Jason Cai,Michelle Yuan,Anna Currey,Monica Sunkara,Yi Zhang,Yassine Benajiba

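Task: Propose MemInsight, an autonomous memory augmentation approach that improves semantic data representation and retrieval for LLM agents.

Motivation: LLM agents need long-term memory to draw on historical interactions and knowledge, but growing memory size and the need for semantic structuring pose challenges.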

Details

Method: Autonomously augments historical interactions to improve semantic data representation and retrieval mechanisms. Result: Validated on conversational recommendation, question answering, and event summarization; recommendation persuasiveness improves by up to 14%, and recall on LoCoMo retrieval exceeds a RAG baseline by 34%. Conclusion: MemInsight can significantly improve the contextual performance of LLM agents across multiple tasks. Abstract: Large language model (LLM) agents have evolved to intelligently process information, make decisions, and interact with users or tools. A key capability is the integration of long-term memory, enabling these agents to draw upon historical interactions and knowledge. However, the growing memory size and need for semantic structuring pose significant challenges. In this work, we propose an autonomous memory augmentation approach, MemInsight, to enhance semantic data representation and retrieval mechanisms. By leveraging autonomous augmentation to historical interactions, LLM agents are shown to deliver more accurate and contextualized responses. We empirically validate the efficacy of our proposed approach in three task scenarios: conversational recommendation, question answering and event summarization. On the LLM-REDIAL dataset, MemInsight boosts persuasiveness of recommendations by up to 14%. Moreover, it outperforms a RAG baseline by 34% in recall for LoCoMo retrieval. Our empirical results show the potential of MemInsight to enhance the contextual performance of LLM agents across multiple tasks.
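
The abstract reports retrieval quality as recall but does not spell out the evaluation protocol. As a reference point, the sketch below computes the standard recall@k for a single query; treating the paper's metric as recall@k is an assumption, and the memory identifiers are invented for the example.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved items."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("relevant must contain at least one item")
    top_k = set(retrieved[:k])
    return len(top_k & relevant_set) / len(relevant_set)


# Hypothetical memory IDs: the retriever returns a ranked list, and two
# memories are labeled as relevant to the query.
ranked_memories = ["m3", "m1", "m7", "m2"]
relevant_memories = ["m1", "m2"]
recall_top2 = recall_at_k(ranked_memories, relevant_memories, k=2)
recall_top4 = recall_at_k(ranked_memories, relevant_memories, k=4)
```

Here only `m1` is found in the top two results, so recall@2 is 0.5, while both relevant memories appear in the top four, so recall@4 is 1.0.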

LandMarkSystem Technical Report

Zhenxiang Ma,Zhenyu Yang,Miao Tao,Yuanzhen Zhou,Zeyu He,Yuchang Zhang,Rong Fu,Hengjie Li

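Task: Propose LandMarkSystem, a new computing framework for enhancing multi-scale scene reconstruction and rendering.

Motivation: Traditional deep learning frameworks struggle to meet growing demands for scene quality and scale, while advanced techniques such as NeRF and 3DGS need more efficient support.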

Details

Method: Supports various NeRF and 3DGS structures through a componentized model adaptation layer, and optimizes computational efficiency with distributed parallel computing and model parameter offloading. Result: The system provides dedicated operators for complex 3D sparse computations, enables efficient training and fast inference, and its capabilities are verified across multiple representative algorithms. Conclusion: Through a modular architecture and a dynamic loading strategy, LandMarkSystem improves the efficiency and effectiveness of 3D reconstruction tasks, and it is open-sourced to encourage further research. Abstract: 3D reconstruction is vital for applications in autonomous driving, virtual reality, augmented reality, and the metaverse. Recent advancements such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have transformed the field, yet traditional deep learning frameworks struggle to meet the increasing demands for scene quality and scale. This paper introduces LandMarkSystem, a novel computing framework designed to enhance multi-scale scene reconstruction and rendering. By leveraging a componentized model adaptation layer, LandMarkSystem supports various NeRF and 3DGS structures while optimizing computational efficiency through distributed parallel computing and model parameter offloading. Our system addresses the limitations of existing frameworks, providing dedicated operators for complex 3D sparse computations, thus facilitating efficient training and rapid inference over extensive scenes. Key contributions include a modular architecture, a dynamic loading strategy for limited resources, and proven capabilities across multiple representative algorithms. This comprehensive solution aims to advance the efficiency and effectiveness of 3D reconstruction tasks. To facilitate further research and collaboration, the source code and documentation for the LandMarkSystem project are publicly available in an open-source repository at https://github.com/InternLandMark/LandMarkSystem.

Jaco: An Offline Running Privacy-aware Voice Assistant

Daniel Bermuth,Alexander Poeppel,Wolfgang Reif

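Task: Design and implement Jaco, a new voice assistant that runs offline, protects privacy, supports multiple languages, and is extensible.

Motivation: Most existing voice assistants are online cloud services that offer insufficient protection of user privacy; Jaco aims to address this.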

Details

Method: Presents Jaco's architecture, which supports offline operation, skill-based extension, privacy protection, and multiple languages. Result: Jaco runs well on low-resource devices and is competitive with other voice assistants in functionality while protecting user privacy. Conclusion: Jaco combines and extends the advantages of other voice assistants, offering a privacy-friendly offline solution. Abstract: With the recent advance in speech technology, smart voice assistants have been improved and are now used by many people. But often these assistants are running online as a cloud service and are not always known for a good protection of users' privacy. This paper presents the architecture of a novel voice assistant, called Jaco, with the following features: (a) It can run completely offline, even on low resource devices like a Raspberry Pi. (b) Through a skill concept it can be easily extended. (c) The architectural focus is on protecting users' privacy, but without restricting capabilities for developers. (d) It supports multiple languages. (e) It is competitive with other voice assistant solutions. In this respect the assistant combines and extends the advantages of other approaches.

Multimodal surface defect detection from wooden logs for sawing optimization

Bořek Reich,Matej Kunda,Fedor Zolotarev,Tuomas Eerola,Pavel Zemčík,Tomi Kauppi

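Task: Propose a method for detecting knots on the surface of wooden logs based on multimodal data fusion.

Motivation: Knots are a primary factor affecting the quality of sawn timber, but their detection in existing systems is often limited by the insufficient accuracy of single-modality data.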

Details

Method: Uses a multimodal data fusion pipeline for RGB and point cloud data with a late fusion module to raise detection accuracy, and proposes a sawing angle optimization method based on surface knot detections. Result: Multimodal fusion improves knot detection accuracy significantly over either single modality, and the sawing angle optimization reduces unwanted arris knots. Conclusion: Multimodal data fusion and sawing angle optimization provide an efficient and practical solution for timber quality inspection and processing. Abstract: We propose a novel, good-quality, and less demanding method for detecting knots on the surface of wooden logs using multimodal data fusion. Knots are a primary factor affecting the quality of sawn timber, making their detection fundamental to any timber grading or cutting optimization system. While X-ray computed tomography provides accurate knot locations and internal structures, it is often too slow or expensive for practical use. An attractive alternative is to use fast and cost-effective log surface measurements, such as laser scanners or RGB cameras, to detect surface knots and estimate the internal structure of wood. However, due to the small size of knots and noise caused by factors such as bark and other natural variations, detection accuracy often remains low when only one measurement modality is used. In this paper, we demonstrate that by using a data fusion pipeline consisting of separate streams for RGB and point cloud data, combined by a late fusion module, higher knot detection accuracy can be achieved compared to using either modality alone. We further propose a simple yet efficient sawing angle optimization method that utilizes surface knot detections and cross-correlation to minimize the amount of unwanted arris knots, demonstrating its benefits over randomized sawing angles.

Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models

Pin-Yu Chen,Han Shen,Payel Das,Tianyi Chen

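Task: Study the trade-off between safety and capability in fine-tuning large language models (LLMs).

Motivation: Empirical observations show that fine-tuning on task-specific datasets compromises model safety, so a theoretical framework is needed to understand this safety-capability trade-off.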

Details

Method: Proposes a theoretical framework that analyzes how data similarity, context overlap, and the alignment loss landscape affect the safety-capability trade-off. Result: The theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning and are validated by numerical experiments. Conclusion: Provides theoretical support for understanding the safety-capability trade-off in LLM fine-tuning and reveals its fundamental limits. Abstract: Fine-tuning Large Language Models (LLMs) on some task-specific datasets has been a primary use of LLMs. However, it has been empirically observed that this approach to enhancing capability inevitably compromises safety, a phenomenon also known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning, which are also validated by numerical experiments.

Unsupervised Real-World Denoising: Sparsity is All You Need

Hamadi Chihaoui,Paolo Favaro

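Task: Propose an input-sparsification-based method (MID) that addresses the distribution mismatch between synthetic and real noise to improve denoising performance.

Motivation: Because large datasets of paired noisy and clean images are hard to collect, existing methods use unpaired data to generate synthetic clean-noisy pairs, but the mismatch between synthetic and real noise distributions limits performance.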

Details

Method: Proposes MID, which sparsifies the input with random input masking, trains a denoiser to denoise and inpaint the sparse input simultaneously, and iteratively refines the noise sampler. Result: Experiments on real-world noisy image datasets show that MID performs competitively in unsupervised denoising. Conclusion: Through input sparsification and iterative refinement of the noise sampler, MID narrows the distribution gap between synthetic and real noise and improves denoising performance. Abstract: Supervised training for real-world denoising presents challenges due to the difficulty of collecting large datasets of paired noisy and clean images. Recent methods have attempted to address this by utilizing unpaired datasets of clean and noisy images. Some approaches leverage such unpaired data to train denoisers in a supervised manner by generating synthetic clean-noisy pairs. However, these methods often fall short due to the distribution gap between synthetic and real noisy images. To mitigate this issue, we propose a solution based on input sparsification, specifically using random input masking. Our method, which we refer to as Mask, Inpaint and Denoise (MID), trains a denoiser to simultaneously denoise and inpaint synthetic clean-noisy pairs. On one hand, input sparsification reduces the gap between synthetic and real noisy images. On the other hand, an inpainter trained in a supervised manner can still accurately reconstruct sparse inputs by predicting missing clean pixels using the remaining unmasked pixels. Our approach begins with a synthetic Gaussian noise sampler and iteratively refines it using a noise dataset derived from the denoiser's predictions. The noise dataset is created by subtracting predicted pseudo-clean images from real noisy images at each iteration. The core intuition is that improving the denoiser results in a more accurate noise dataset and, consequently, a better noise sampler. We validate our method through extensive experiments on real-world noisy image datasets, demonstrating competitive performance compared to existing unsupervised denoising methods.
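
Two operations named in the abstract are simple enough to sketch directly: random input masking, and building a noise sample by subtracting a predicted pseudo-clean image from a real noisy one. The code below is a minimal illustration on nested lists, not the authors' implementation; the function names, the `keep_prob` parameter, and the choice to zero out masked pixels are assumptions.

```python
import random


def random_mask(image, keep_prob, rng):
    """Sparsify an image: keep each pixel with probability keep_prob.

    Returns (sparse_image, mask). Masked pixels are set to 0.0, which is an
    assumption of this sketch rather than a detail from the abstract.
    """
    mask = [[1 if rng.random() < keep_prob else 0 for _ in row] for row in image]
    sparse = [[p * m for p, m in zip(row, mrow)] for row, mrow in zip(image, mask)]
    return sparse, mask


def residual_noise(noisy, pseudo_clean):
    """Noise sample = real noisy image minus the denoiser's pseudo-clean prediction."""
    return [[n - c for n, c in zip(nrow, crow)] for nrow, crow in zip(noisy, pseudo_clean)]


rng = random.Random(0)
noisy_image = [[1.0, 2.0], [3.0, 4.0]]
pseudo_clean_image = [[0.5, 2.5], [3.0, 3.0]]  # stand-in for a denoiser output

sparse_image, mask = random_mask(noisy_image, keep_prob=0.5, rng=rng)
noise_sample = residual_noise(noisy_image, pseudo_clean_image)
```

In the iterative scheme the abstract describes, residuals like `noise_sample` would be collected into a noise dataset that replaces the initial Gaussian sampler in the next round.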

Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road Ahead

Viktor Schlegel,Anil A Bharath,Zilong Zhao,Kevin Yee

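Task: Survey the theory and methods of privacy-preserving synthetic data and evaluate their practical performance in specialized domains.

Motivation: Address data segregation in high-stakes domains while balancing data utility and privacy protection.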

Details

Method: Combines generative models with differential privacy theory, reviews existing methods, and empirically evaluates four leading methods on five real-world datasets. Result: Under strict privacy constraints (ε ≤ 4), method performance degrades significantly, revealing a gap between general-domain benchmarks and specialized-domain data. Conclusion: More robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques are needed to meet the requirements of privacy-sensitive fields. Abstract: Privacy-preserving synthetic data offers a promising solution to harness segregated data in high-stakes domains where information is compartmentalized for regulatory, privacy, or institutional reasons. This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy followed by a review of state-of-the-art methods across tabular data, images, and text. Our synthesis of evaluation approaches highlights the fundamental trade-off between utility for down-stream tasks and privacy guarantees, while identifying critical research gaps: the lack of realistic benchmarks representing specialized domains and insufficient empirical evaluations required to contextualise formal guarantees. Through empirical analysis of four leading methods on five real-world datasets from specialized domains, we demonstrate significant performance degradation under realistic privacy constraints ($\epsilon \leq 4$), revealing a substantial gap between results reported on general domain benchmarks and performance on domain-specific data. Our findings highlight key challenges including unaccounted privacy leakage, insufficient empirical verification of formal guarantees, and a critical deficit of realistic benchmarks. These challenges underscore the need for robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques to address the unique requirements of privacy-sensitive fields such that this technology can deliver on its considerable potential.
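
To make the role of the privacy budget ε concrete, the sketch below applies the textbook Laplace mechanism to a counting query and compares the average error at a small and a larger ε. It illustrates the utility-privacy trade-off in general; it is not one of the four synthetic-data methods evaluated in the survey, and the helper names and toy records are assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    A counting query has sensitivity 1, so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon, trials=2000, seed=0):
    """Average absolute error of the private count over repeated releases."""
    rng = random.Random(seed)
    records = list(range(100))          # toy dataset: the integers 0..99
    is_even = lambda r: r % 2 == 0      # the true count is 50
    errors = [abs(private_count(records, is_even, epsilon, rng) - 50) for _ in range(trials)]
    return sum(errors) / trials


error_strict = mean_abs_error(epsilon=0.1)  # stronger privacy, more noise
error_loose = mean_abs_error(epsilon=4.0)   # weaker privacy, less noise
```

The expected absolute error of this mechanism is 1/ε, so the stricter budget yields a much noisier answer, which is the same direction of effect the survey reports for synthetic data under tight privacy constraints.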

VALLR: Visual ASR Language Model for Lip Reading

Marshall Thomas,Edward Fish,Richard Bowden

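Task: Propose a novel two-stage, phoneme-centric framework for visual automatic speech recognition (V-ASR) to address the high error rates of existing methods on visually similar phonemes.

Motivation: Because auditory information is absent and overlapping visemes are hard to tell apart visually, conventional methods that predict words or characters directly suffer high error rates.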

Details

Method: A two-stage framework: a Video Transformer with a CTC head first predicts a phoneme sequence from visual input, then a fine-tuned large language model (LLM) reconstructs coherent words and sentences. Result: Achieves state-of-the-art performance on two challenging datasets, LRS2 and LRS3, with a word error rate (WER) of 18.7 on LRS3 while using 99.4% less labelled data than the next best approach. Conclusion: By explicitly encoding intermediate linguistic structure, the framework significantly improves V-ASR performance and data efficiency. Abstract: Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. This task is especially challenging due to the absence of auditory information and the inherent ambiguity when visually distinguishing phonemes that have overlapping visemes where different phonemes appear identical on the lips. Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently encounters high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for Visual Automatic Speech Recognition (V-ASR) that addresses these longstanding challenges. First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head, thereby reducing the task complexity and achieving robust speaker invariance. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences by leveraging broader linguistic context. Unlike existing methods that either predict words directly, often faltering on visually similar phonemes, or rely on large-scale multimodal pre-training, our approach explicitly encodes intermediate linguistic structure while remaining highly data efficient. We demonstrate state-of-the-art performance on two challenging datasets, LRS2 and LRS3, where our method achieves significant reductions in Word Error Rate (WER), achieving a SOTA WER of 18.7 on LRS3 despite using 99.4% less labelled data than the next best approach.
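
The first stage ends in a CTC head, whose per-frame outputs must be collapsed into a phoneme sequence. The sketch below shows standard greedy CTC decoding (take the best label per frame, merge consecutive repeats, drop blanks). It is a generic illustration, not the VALLR code; the blank symbol `_`, the phoneme labels, and the frame scores are invented for the example, and the paper may use a different decoding strategy.

```python
from itertools import groupby

BLANK = "_"  # hypothetical CTC blank symbol


def ctc_greedy_decode(frame_scores, labels):
    """Greedy CTC decoding.

    frame_scores: one list of scores per video frame, aligned with labels.
    Returns the label sequence after merging repeats and removing blanks.
    """
    # 1. Pick the highest-scoring label for each frame.
    best = [labels[max(range(len(f)), key=f.__getitem__)] for f in frame_scores]
    # 2. Merge consecutive duplicates, then 3. drop the blank symbol.
    collapsed = [symbol for symbol, _ in groupby(best)]
    return [s for s in collapsed if s != BLANK]


labels = [BLANK, "HH", "AH", "L", "OW"]
frames = [
    [0.10, 0.80, 0.05, 0.03, 0.02],  # HH
    [0.10, 0.70, 0.10, 0.05, 0.05],  # HH (repeat, merged)
    [0.60, 0.10, 0.10, 0.10, 0.10],  # blank
    [0.10, 0.10, 0.60, 0.10, 0.10],  # AH
    [0.10, 0.00, 0.10, 0.70, 0.10],  # L
    [0.70, 0.10, 0.10, 0.05, 0.05],  # blank separates two genuine L's
    [0.10, 0.00, 0.10, 0.70, 0.10],  # L
    [0.10, 0.00, 0.10, 0.10, 0.70],  # OW
]
phonemes = ctc_greedy_decode(frames, labels)  # ['HH', 'AH', 'L', 'L', 'OW']
```

The blank between the two `L` frames is what lets CTC emit a repeated phoneme; without it the two frames would be merged into one. A sequence like `phonemes` is what the second-stage LLM would receive as input.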

VinaBench: Benchmark for Faithful and Consistent Visual Narratives

Silin Gao,Sheryl Mathew,Li Mi,Sepideh Mamooler,Mengjie Zhao,Hiromi Wakaki,Yuki Mitsufuji,Syrielle Montariol,Antoine Bosselut

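Task: Propose a new benchmark, VinaBench, to address the challenges of faithfulness and self-consistency in visual narrative generation.

Motivation: Current visual narrative generation lacks knowledge constraints for planning stories, so outputs are unfaithful to the input text and inconsistent across images.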

Details

Method: Annotates the commonsense and discourse constraints in visual narrative samples to provide systematic learning scaffolds, and proposes new evaluation metrics based on them. Result: Experiments show that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives. Conclusion: VinaBench provides effective knowledge constraints and evaluation methods for visual narrative generation and significantly improves generation quality. Abstract: Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.

Diffusion Image Prior

Hamadi Chihaoui,Paolo Favaro

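Task: Propose a zero-shot image restoration method (DIIP) based on pretrained diffusion models that requires no explicit degradation model.

Motivation: Degradations in real-world scenarios may be too complex to define explicitly, so a method that needs no prior knowledge of the degradation model is required.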

Details

Method: Uses a pretrained diffusion model as a stronger prior, reconstructs the clean image through an optimization process, and applies early stopping to avoid overfitting. Result: DIIP achieves state-of-the-art results on various degradation-blind image restoration tasks (JPEG artifact removal, waterdrop removal, denoising, and super-resolution). Conclusion: DIIP offers a general image restoration method that needs no prior on the degradation model and outperforms existing methods. Abstract: Zero-shot image restoration (IR) methods based on pretrained diffusion models have recently achieved significant success. These methods typically require at least a parametric form of the degradation model. However, in real-world scenarios, the degradation may be too complex to define explicitly. To handle this general case, we introduce the Diffusion Image Prior (DIIP). We take inspiration from the Deep Image Prior (DIP) [16], since it can be used to remove artifacts without the need for an explicit degradation model. However, in contrast to DIP, we find that pretrained diffusion models offer a much stronger prior, despite being trained without knowledge from corrupted data. We show that the optimization process in DIIP first reconstructs a clean version of the image before eventually overfitting to the degraded input, but it does so for a broader range of degradations than DIP. In light of this result, we propose a blind image restoration (IR) method based on early stopping, which does not require prior knowledge of the degradation model. We validate DIIP on various degradation-blind IR tasks, including JPEG artifact removal, waterdrop removal, denoising and super-resolution with state-of-the-art results.

D4R -- Exploring and Querying Relational Graphs Using Natural Language and Large Language Models -- the Case of Historical Documents

Michel Boeglin,David Kahn,Josiane Mothe,Diego Ortiz,David Panzoli

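Task: Design a digital platform (D4R) that helps non-technical users, such as historians, explore textual documents through advanced graphical tools for text analysis and knowledge extraction.

Motivation: Bridge the gap between AI technologies and historical research, while extending the platform's capabilities to other domains.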

Details

Method: Uses a large language model to translate natural language questions into Cypher queries, retrieves data from a Neo4J database, and provides a user-friendly graphical interface. Result: Developed an intuitive platform that lets users navigate and analyse complex relational data extracted from unstructured text. Conclusion: D4R gives non-technical users a powerful text analysis tool and shows potential for cross-domain use. Abstract: D4R is a digital platform designed to assist non-technical users, particularly historians, in exploring textual documents through advanced graphical tools for text analysis and knowledge extraction. By leveraging a large language model, D4R translates natural language questions into Cypher queries, enabling the retrieval of data from a Neo4J database. A user-friendly graphical interface allows for intuitive interaction, enabling users to navigate and analyse complex relational data extracted from unstructured textual documents. Originally designed to bridge the gap between AI technologies and historical research, D4R's capabilities extend to various other domains. A demonstration video and a live software demo are available.

Dual-Task Learning for Dead Tree Detection and Segmentation with Hybrid Self-Attention U-Nets in Aerial Imagery

Anis Ur Rahman,Einari Heinaro,Mete Ahishali,Samuli Junttila

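Task: Develop a hybrid postprocessing framework that refines deep learning-based tree segmentation to precisely identify standing dead trees.

Motivation: Dense canopy structures, spectral overlap between living and dead vegetation, and over-segmentation limit the reliability of existing methods; more precise tree segmentation is needed to support forest health assessment and ecological monitoring.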

Details

Method: A hybrid postprocessing framework that combines watershed algorithms with adaptive filtering to refine boundary delineation and reduce false positives in complex forest environments. Result: Tested on high-resolution aerial imagery of boreal forests, the framework improved instance-level segmentation accuracy by 41.5% and reduced positional errors by 57%. Conclusion: The framework performs well in densely vegetated regions and supports large-scale ecological monitoring and forest management applications such as wildfire risk assessment and carbon stock estimation. Abstract: Mapping standing dead trees is critical for assessing forest health, monitoring biodiversity, and mitigating wildfire risks, for which aerial imagery has proven useful. However, dense canopy structures, spectral overlaps between living and dead vegetation, and over-segmentation errors limit the reliability of existing methods. This study introduces a hybrid postprocessing framework that refines deep learning-based tree segmentation by integrating watershed algorithms with adaptive filtering, enhancing boundary delineation, and reducing false positives in complex forest environments. Tested on high-resolution aerial imagery from boreal forests, the framework improved instance-level segmentation accuracy by 41.5% and reduced positional errors by 57%, demonstrating robust performance in densely vegetated regions. By balancing detection accuracy and over-segmentation artifacts, the method enabled the precise identification of individual dead trees, which is critical for ecological monitoring. The framework's computational efficiency supports scalable applications, such as wall-to-wall tree mortality mapping over large geographic regions using aerial or satellite imagery. These capabilities directly benefit wildfire risk assessment (identifying fuel accumulations), carbon stock estimation (tracking emissions from decaying biomass), and precision forestry (targeting salvage logging). By bridging advanced remote sensing techniques with practical forest management needs, this work advances tools for large-scale ecological conservation and climate resilience planning.

ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer

Michael Brown,Sofia Martinez,Priya Singh

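Task: Propose ReverBERT, an efficient framework for text-driven speech style transfer.

Motivation: Existing methods are computationally expensive, so a more efficient solution is needed.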

Details

Method: Combines a state space model (SSM) with a discrete Fourier transform and introduces a Transformer-based SSM layer. Result: Significantly outperforms baselines in naturalness, expressiveness, and computational efficiency. Conclusion: ReverBERT offers an efficient, high-quality solution for text-driven speech style transfer. Abstract: Text-driven speech style transfer aims to mold the intonation, pace, and timbre of a spoken utterance to match stylistic cues from text descriptions. While existing methods leverage large-scale neural architectures or pre-trained language models, the computational costs often remain high. In this paper, we present ReverBERT, an efficient framework for text-driven speech style transfer that draws inspiration from a state space model (SSM) paradigm, loosely motivated by the image-based method of Wang and Liu [wang2024stylemamba]. Unlike image domain techniques, our method operates in the speech space and integrates a discrete Fourier transform of latent speech features to enable smooth and continuous style modulation. We also propose a novel Transformer-based SSM layer for bridging textual style descriptors with acoustic attributes, dramatically reducing inference time while preserving high-quality speech characteristics. Extensive experiments on benchmark speech corpora demonstrate that ReverBERT significantly outperforms baselines in terms of naturalness, expressiveness, and computational efficiency. We release our model and code publicly to foster further research in text-driven speech style transfer.

Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving

Lucas Nunes,Rodrigo Marcuzzi,Jens Behley,Cyrill Stachniss

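Task: Propose a new method that generates 3D semantic scene-scale data without relying on projections or decoupled multi-resolution models.

Motivation: Address the annotation bottleneck for 3D data and narrow the domain gap between synthetic and real data.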

Details

Method: Uses a diffusion model to generate 3D semantic scene-scale data directly, avoiding the errors introduced by intermediate representations. Result: The generated synthetic data is of higher quality and improves model performance when used to train a semantic segmentation network. Conclusion: The method shows the potential of generated scene-scale point clouds to extend existing datasets and reduce annotation effort. Abstract: Semantic scene understanding is crucial for robotics and computer vision applications. In autonomous driving, 3D semantic segmentation plays an important role for enabling safe navigation. Despite significant advances in the field, the complexity of collecting and annotating 3D data is a bottleneck in these developments. To overcome that data annotation limitation, synthetic simulated data has been used to generate annotated data on demand. However, there is still a domain gap between real and simulated data. More recently, diffusion models have been in the spotlight, enabling close-to-real data synthesis. Those generative models have been recently applied to the 3D data domain for generating scene-scale data with semantic annotations. Still, those methods either rely on image projection or decoupled models trained with different resolutions in a coarse-to-fine manner. Such intermediary representations impact the generated data quality due to errors added in those transformations. In this work, we propose a novel approach able to generate 3D semantic scene-scale data without relying on any projection or decoupled trained multi-resolution models, achieving more realistic semantic scene data generation compared to previous state-of-the-art methods. Besides improving 3D semantic scene-scale data synthesis, we thoroughly evaluate the use of the synthetic scene samples as labeled data to train a semantic segmentation network. In our experiments, we show that using the synthetic annotated data generated by our method as training data together with the real semantic segmentation labels leads to an improvement in the semantic segmentation model performance. Our results show the potential of generated scene-scale point clouds to generate more training data to extend existing datasets, reducing the data annotation effort.
Our code is available at https://github.com/PRBonn/3DiSS.

AskSport: Web Application for Sports Question-Answering

Enzo B Onofre,Leonardo M P Moraes,Cristina D Aguiar

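Task: Introduce AskSport, a natural-language question-answering web application about sports.

Motivation: Give users a convenient way to get sports-related answers and information by asking questions in natural language.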

Details

Method: Describes the functionalities and characteristics of AskSport, including use cases that show how it returns names and numerical values. Result: AskSport returns the three most relevant answers together with related documents, and its implementation is publicly available on HuggingFace. Conclusion: AskSport is an effective sports question-answering tool that meets users' information needs. Abstract: This paper introduces AskSport, a question-answering web application about sports. It allows users to ask questions using natural language and retrieve the three most relevant answers, including related information and documents. The paper describes the characteristics and functionalities of the application, including use cases demonstrating its ability to return names and numerical values. AskSport and its implementation are available for public access on HuggingFace.

FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs

Xiaoqin Wang,Xusen Ma,Xianxu Hou,Meidan Ding,Yudong Li,Junliang Chen,Wenting Chen,Xiaoyang Peng,Linlin Shen

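Task: Evaluate the capabilities of multimodal large language models (MLLMs) on face perception tasks.

Motivation: Evaluation of MLLMs on face perception remains under-explored, and a comprehensive evaluation method is needed.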

Details

Method: Proposes the FaceBench dataset, with hierarchical multi-view and multi-level attributes, and develops the Face-LLaVA model as a baseline. Result: Existing MLLMs perform poorly at understanding fine-grained facial attributes; Face-LLaVA outperforms open-source models and is comparable to commercial ones. Conclusion: FaceBench and Face-LLaVA provide effective tools for evaluating the face perception abilities of MLLMs. Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in various tasks. However, effectively evaluating these MLLMs on face perception remains largely unexplored. To address this gap, we introduce FaceBench, a dataset featuring hierarchical multi-view and multi-level attributes specifically designed to assess the comprehensive face perception abilities of MLLMs. Initially, we construct a hierarchical facial attribute structure, which encompasses five views with up to three levels of attributes, totaling over 210 attributes and 700 attribute values. Based on the structure, the proposed FaceBench consists of 49,919 visual question-answering (VQA) pairs for evaluation and 23,841 pairs for fine-tuning. Moreover, we further develop a robust face perception MLLM baseline, Face-LLaVA, by training with our proposed face VQA data. Extensive experiments on various mainstream MLLMs and Face-LLaVA are conducted to test their face perception ability, with results also compared against human performance. The results reveal that the existing MLLMs are far from satisfactory in understanding the fine-grained facial attributes, while our Face-LLaVA significantly outperforms existing open-source models with a small amount of training data and is comparable to commercial ones like GPT-4o and Gemini. The dataset will be released at https://github.com/CVI-SZU/FaceBench.

Rerouting Connection: Hybrid Computer Vision Analysis Reveals Visual Similarity Between Indus and Tibetan-Yi Corridor Writing Systems

Ooha Lakkadi Reddy

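Task: Investigate historical connections between the Indus Valley script and the pictographic writing systems of the Tibetan-Yi Corridor.

Motivation: Explore visual and morphological similarities between the Indus Valley script and Tibetan-Yi Corridor pictographic scripts, challenging conventional views of isolated script development.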

Details

Method: Uses a hybrid CNN-Transformer architecture together with an anthropological framework, analyzing three target scripts with an ensemble of 15 independently trained models. Result: Tibetan-Yi Corridor scripts show markedly higher visual similarity to the Indus script (61.7%-63.5%) than to Bronze Age Proto-Cuneiform (10.2%-10.9%) or Proto-Elamite (7.6%-8.7%). Conclusion: The results indicate notable similarity between the Indus Valley script and Tibetan-Yi Corridor scripts, supporting complex cultural transmission networks between ancient South and East Asia. Abstract: This thesis employs a hybrid CNN-Transformer architecture, in conjunction with a detailed anthropological framework, to investigate potential historical connections between the visual morphology of the Indus Valley script and pictographic systems of the Tibetan-Yi Corridor. Through an ensemble methodology of three target scripts across 15 independently trained models, we demonstrate that Tibetan-Yi Corridor scripts exhibit approximately six-fold higher visual similarity to the Indus script (61.7%-63.5%) than to the Bronze Age Proto-Cuneiform (10.2%-10.9%) or Proto-Elamite (7.6%-8.7%) systems. Additionally, and contrary to our current understanding of the networks of the Indus Valley Civilization, the Indus script unexpectedly maps closer to Tibetan-Yi Corridor scripts, with a mean cosine similarity of 0.629, than to the aforementioned contemporaneous West Asian signaries, which recorded mean cosine similarities of 0.104 and 0.080, respectively, despite their close geographic proximity and evident trade relations. Across various dimensionality reduction practices and clustering methodologies, the Indus script consistently clusters closest to Tibetan-Yi Corridor scripts. Our computational results align with qualitative observations of specific pictorial parallels in numeral systems, gender markers, and key iconographic elements; this is further supported by archaeological evidence of sustained contact networks along the ancient Shu-Shendu road in tandem with the Indus Valley Civilization's decline, providing a plausible transmission pathway.
While alternative explanations cannot be ruled out, the specificity and consistency of observed similarities challenge conventional narratives of isolated script development and suggest more complex ancient cultural transmission networks between South and East Asia than previously recognized.
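
The abstract summarizes script similarity as a mean cosine similarity between learned representations. As a reference for how such a number is formed, the sketch below averages cosine similarity over all cross-set pairs of embedding vectors. The pairing scheme and the tiny two-dimensional vectors are assumptions made for illustration; the thesis may aggregate its embeddings differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_cross_similarity(set_a, set_b):
    """Mean cosine similarity over every (a, b) pair drawn from the two sets."""
    sims = [cosine_similarity(a, b) for a in set_a for b in set_b]
    return sum(sims) / len(sims)


# Toy embeddings (hypothetical): one glyph from script A, two from script B.
script_a = [[1.0, 0.0]]
script_b = [[1.0, 0.0], [0.0, 1.0]]
mean_sim = mean_cross_similarity(script_a, script_b)
```

In this toy case one pair is identical (similarity 1.0) and one is orthogonal (similarity 0.0), so the mean is 0.5.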

Chirag Parikh,Deepti Rawat,Rakshitha R. T.,Tathagata Ghosh,Ravi Kiran Sarvadevabhatla

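Task: Build RoadSocial, a large-scale, diverse VideoQA dataset for generic road event understanding.

Motivation: Overcome the limits that regional bias, viewpoint bias, and expert-driven annotation impose on existing datasets, and capture the complexity of road events worldwide.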

Details

Method: A semi-automatic annotation framework that uses Text LLMs and Video LLMs to generate question-answer pairs covering 12 challenging QA tasks. Result: Built a dataset with 13.2K videos, 674 tags, and 260K high-quality QA pairs, and evaluated 18 Video LLMs on it. Conclusion: RoadSocial improves the road event understanding of general-purpose Video LLMs and advances research in this area. Abstract: We introduce RoadSocial, a large-scale, diverse VideoQA dataset tailored for generic road event understanding from social media narratives. Unlike existing datasets limited by regional bias, viewpoint bias and expert-driven annotations, RoadSocial captures the global complexity of road events with varied geographies, camera viewpoints (CCTV, handheld, drones) and rich social discourse. Our scalable semi-automatic annotation framework leverages Text LLMs and Video LLMs to generate comprehensive question-answer pairs across 12 challenging QA tasks, pushing the boundaries of road event understanding. RoadSocial is derived from social media videos spanning 14M frames and 414K social comments, resulting in a dataset with 13.2K videos, 674 tags and 260K high-quality QA pairs. We evaluate 18 Video LLMs (open-source and proprietary, driving-specific and general-purpose) on our road event understanding benchmark. We also demonstrate RoadSocial's utility in improving road event understanding capabilities of general-purpose Video LLMs.

Measuring and Analyzing Subjective Uncertainty in Scientific Communications

Jamshid Sourati,Grace Shao

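Task: Study the use of subjective uncertainty language in scientific papers and its impact on scientific communities.

Motivation: Uncertainty in scientific findings is usually reported through statistical metrics, but subjective uncertainty in language may shape the impact of science on the public, and its effect within scientific communities has not been studied enough.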

Details

Method: Measures and analyzes the use of subjective uncertainty language in papers across disciplines, publication years, and geographical locations, and studies its correlation with bibliographical metrics (number and gender of authors, centrality of the field, citation count, and so on). Result: The use of subjective uncertainty language varies significantly across fields, publication years, and geographical locations, and correlates with bibliographical metrics. Conclusion: The study reveals patterns of language use in scientific communication that help identify and document the linguistic norms of different scientific communities. Abstract: Uncertainty of scientific findings is typically reported through statistical metrics such as $p$-values, confidence intervals, etc. The magnitude of this objective uncertainty is reflected in the language used by the authors to report their findings, primarily through expressions carrying uncertainty-inducing terms or phrases. This language uncertainty is a subjective concept and is highly dependent on the writing style of the authors. There is evidence that such subjective uncertainty influences the impact of science on public audiences. In this work, we turned our focus to scientists themselves, and measured/analyzed the subjective uncertainty and its impact within scientific communities across different disciplines. We showed that the level of this type of uncertainty varies significantly across different fields, years of publication and geographical locations. We also studied the correlation between subjective uncertainty and several bibliographical metrics, such as number/gender of authors, centrality of the field's community, citation count, etc. The underlying patterns identified in this work are useful in identification and documentation of linguistic norms in scientific communication in different communities/societies.

Retinal Fundus Multi-Disease Image Classification using Hybrid CNN-Transformer-Ensemble Architectures

Deependra Singh,Saksham Agarwal,Subhankar Mishra

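Task: Develop a comprehensive diagnostic system that accurately predicts retinal diseases from fundus images alone.

Motivation: Address the large global population affected by retinal diseases and the uneven availability of specialized medical expertise, particularly in non-urban areas.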

Details

Method: Hybrid models combining deep Convolutional Neural Networks (CNNs), Transformer encoders, and ensemble architectures classify fundus images into 20 disease labels. Result: The C-Tran ensemble model performs best, with a model score of 0.9166 against a baseline of 0.9; the IEViT model also shows promising results with improved computational efficiency. Conclusion: The study offers an efficient and accurate solution for retinal disease diagnosis, especially suited to regions with scarce medical resources. Abstract: Our research is motivated by the urgent global issue of a large population affected by retinal diseases, which are evenly distributed but underserved by specialized medical expertise, particularly in non-urban areas. Our primary objective is to bridge this healthcare gap by developing a comprehensive diagnostic system capable of accurately predicting retinal diseases solely from fundus images. However, we faced significant challenges due to limited, diverse datasets and imbalanced class distributions. To overcome these issues, we have devised innovative strategies. Our research introduces novel approaches, utilizing hybrid models combining deeper Convolutional Neural Networks (CNNs), Transformer encoders, and ensemble architectures sequentially and in parallel to classify retinal fundus images into 20 disease labels. Our overarching goal is to assess these advanced models' potential in practical applications, with a strong focus on enhancing retinal disease diagnosis accuracy across a broader spectrum of conditions. Importantly, our efforts have surpassed baseline model results, with the C-Tran ensemble model emerging as the leader, achieving a remarkable model score of 0.9166, surpassing the baseline score of 0.9. Additionally, experiments with the IEViT model showcased equally promising outcomes with improved computational efficiency. We've also demonstrated the effectiveness of dynamic patch extraction and the integration of domain knowledge in computer vision tasks. In summary, our research strives to contribute significantly to retinal disease diagnosis, addressing the critical need for accessible healthcare solutions in underserved regions while aiming for comprehensive and accurate disease prediction.

VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation

Alan Dao,Norapat Buppodom

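Task: Propose a new approach that uses a Vision-Language Model (VLM) to extract high-level semantic information (such as object identity, color, and location) from voxel data.

Motivation: Voxel grids offer a structured representation of 3D space, but extracting high-level semantic information from them remains challenging.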

Details

Method: The voxel space is sliced systematically along a primary axis (e.g., the Z-axis), and the resulting 2D slices are fed into the image encoder of a standard VLM, so that a pre-trained 2D VLM can be used for efficient 3D semantic understanding. Result: The model learns to aggregate information across slices and to associate spatial patterns with the semantic concepts provided by the language component. Conclusion: The slicing strategy makes effective use of pre-trained 2D VLMs and achieves efficient 3D semantic understanding directly from voxel representations. Abstract: Comprehending 3D environments is vital for intelligent systems in domains like robotics and autonomous navigation. Voxel grids offer a structured representation of 3D space, but extracting high-level semantic meaning remains challenging. This paper proposes a novel approach utilizing a Vision-Language Model (VLM) to extract "voxel semantics" (object identity, color, and location) from voxel data. Critically, instead of employing complex 3D networks, our method processes the voxel space by systematically slicing it along a primary axis (e.g., the Z-axis, analogous to CT scan slices). These 2D slices are then formatted and sequentially fed into the image encoder of a standard VLM. The model learns to aggregate information across slices and correlate spatial patterns with semantic concepts provided by the language component. This slice-based strategy aims to leverage the power of pre-trained 2D VLMs for efficient 3D semantic understanding directly from voxel representations.
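
The slicing step described in the abstract can be illustrated with a minimal sketch. It assumes a voxel grid stored as nested lists indexed `grid[x][y][z]`, with `0` meaning empty and any other integer standing for an object or color id; this layout and the helper name `slice_along_z` are hypothetical and not taken from the paper, which does not specify how slices are formatted for the VLM.

```python
from typing import List

Grid = List[List[List[int]]]  # hypothetical layout: grid[x][y][z], 0 = empty


def slice_along_z(grid: Grid) -> List[List[List[int]]]:
    """Return one 2D slice per z index; each slice is indexed slice[y][x]."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    return [
        [[grid[x][y][z] for x in range(nx)] for y in range(ny)]
        for z in range(nz)
    ]


# Build a small 2 x 3 x 4 grid with a single occupied voxel.
nx, ny, nz = 2, 3, 4
grid = [[[0] * nz for _ in range(ny)] for _ in range(nx)]
grid[1][2][3] = 5  # made-up object id at x=1, y=2, z=3

slices = slice_along_z(grid)  # four 3 x 2 images, ordered by increasing z
```

Each element of `slices` would then be rendered as an image and passed to the VLM in z order; that rendering step is left out here because the source does not describe it.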

Fine-Grained Behavior and Lane Constraints Guided Trajectory Prediction Method

Wenyi Xiong,Jian Chen,Ziheng Qi

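Task: Propose BLNet, a dual-stream architecture for trajectory prediction in autonomous driving systems that combines behavioral intention recognition with lane constraint modeling.

Motivation: Existing algorithms cannot provide a fine-grained, continuous description of the target vehicle's future behaviors and lane constraints, which degrades prediction accuracy.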

Details

Method: Parallel dual-stream attention mechanisms generate behavior state queries and lane queries, and a two-stage decoder generates and then refines trajectories. Result: Experiments on the nuScenes and Argoverse datasets show that BLNet clearly outperforms existing direct regression and goal-based algorithms. Conclusion: By combining behavioral intention and lane constraints, BLNet substantially improves trajectory prediction accuracy. Abstract: Trajectory prediction, as a critical component of autonomous driving systems, has attracted the attention of many researchers. Existing prediction algorithms focus on extracting more detailed scene features or selecting more reasonable trajectory destinations. However, in the face of dynamic and evolving future movements of the target vehicle, these algorithms cannot provide a fine-grained and continuous description of future behaviors and lane constraints, which degrades the prediction accuracy. To address this challenge, we present BLNet, a novel dual-stream architecture that synergistically integrates behavioral intention recognition and lane constraint modeling through parallel attention mechanisms. The framework generates fine-grained behavior state queries (capturing spatial-temporal movement patterns) and lane queries (encoding lane topology constraints), supervised by two auxiliary losses, respectively. Subsequently, a two-stage decoder first produces trajectory proposals, then performs point-level refinement by jointly incorporating both the continuity of passed lanes and future motion features. Extensive experiments on two large datasets, nuScenes and Argoverse, show that our network exhibits significant performance gains over existing direct regression and goal-based algorithms.

Bias-Aware Agent: Enhancing Fairness in AI-Driven Knowledge Retrieval

Karanbir Singh,William Ngu

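Task: Propose a new approach to bias-aware knowledge retrieval based on an agentic framework and bias detection tools.

Motivation: Although large language models (LLMs) and AI agents have advanced information retrieval considerably, they remain susceptible to bias and fairness issues rooted in the knowledge base and in the training of LLMs.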

Details

Method: An agentic framework uses bias detectors as tools to identify and highlight inherent biases in retrieved content. Result: By giving users transparency and awareness, the approach aims to foster more equitable information systems and promote the development of responsible AI. Conclusion: The study offers a new approach to bias-aware information retrieval that helps build fairer and more transparent AI systems. Abstract: Advancements in retrieving accessible information have evolved faster in the last few years compared to the decades since the internet's creation. Search engines, like Google, have been the number one way to find relevant data. They have always relied on the user's abilities to find the best information in its billions of links and sources at everybody's fingertips. The advent of large language models (LLMs) has completely transformed the field of information retrieval. LLMs excel not only at retrieving relevant knowledge but also at summarizing it effectively, making information more accessible and consumable for users. On top of this, the rise of AI agents has introduced another aspect to information retrieval, i.e., dynamic information retrieval, which enables the integration of real-time data, such as weather forecasts and financial data, with the knowledge base to curate context-aware knowledge. However, despite these advancements, the agents remain susceptible to issues of bias and fairness, challenges deeply rooted within the knowledge base and training of LLMs. This study introduces a novel approach to bias-aware knowledge retrieval by leveraging an agentic framework and the innovative use of bias detectors as tools to identify and highlight inherent biases in the retrieved content. By empowering users with transparency and awareness, this approach aims to foster more equitable information systems and promote the development of responsible AI.

BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding

Shuming Liu,Chen Zhao,Tianqi Xu,Bernard Ghanem

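Task: Improve the performance of large video-language models (VLMs) on long-form video analysis, without additional training, by studying frame selection strategies.

Motivation: Traditional approaches such as uniform frame sampling are limited in long-form video analysis because they spend resources on irrelevant content, and they perform especially poorly in noisy contexts.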

Details

Method: BOLT introduces a multi-source retrieval evaluation setting and studies frame selection strategies based on query-frame similarity (such as inverse transform sampling). Result: Inverse transform sampling gives the largest improvement, raising accuracy on the Video-MME benchmark from 53.8% to 56.1% and on the MLVU benchmark from 58.9% to 63.4%. Conclusion: By optimizing frame selection, BOLT effectively improves large VLMs on long-form video analysis. Abstract: Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. However, their effectiveness in long-form video analysis is constrained by limited context windows. Traditional approaches, such as uniform frame sampling, often inevitably allocate resources to irrelevant content, diminishing their effectiveness in real-world scenarios. In this paper, we introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies. First, to enable a more realistic evaluation of VLMs in long-form video understanding, we propose a multi-source retrieval evaluation setting. Our findings reveal that uniform sampling performs poorly in noisy contexts, underscoring the importance of selecting the right frames. Second, we explore several frame selection strategies based on query-frame similarity and analyze their effectiveness at inference time. Our results show that inverse transform sampling yields the most significant performance improvement, increasing accuracy on the Video-MME benchmark from 53.8% to 56.1% and MLVU benchmark from 58.9% to 63.4%. Our code is available at https://github.com/sming256/BOLT.
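
To make the idea of inverse transform sampling over query-frame similarities concrete, here is a minimal sketch. It is one plausible reading of the technique, not the authors' implementation: the function name `inverse_transform_select`, the clamping of negative scores to zero, and the use of evenly spaced quantiles instead of random draws are all assumptions made for illustration.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import List, Sequence


def inverse_transform_select(scores: Sequence[float], k: int) -> List[int]:
    """Pick up to k frame indices, denser where query-frame similarity is high."""
    if k <= 0 or not scores:
        return []
    # Assumption: negative similarities carry no weight.
    weights = [max(s, 0.0) for s in scores]
    total = sum(weights)
    if total == 0.0:
        # No signal at all: fall back to uniform weights.
        weights = [1.0] * len(scores)
        total = float(len(scores))
    # Cumulative distribution function over frame indices.
    cdf = list(accumulate(w / total for w in weights))
    picks = set()
    for i in range(k):
        u = (i + 0.5) / k  # evenly spaced quantiles in (0, 1)
        # Invert the CDF: first frame whose cumulative mass reaches u.
        picks.add(min(bisect_left(cdf, u), len(cdf) - 1))
    return sorted(picks)


frame_scores = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]  # made-up similarities
selected = inverse_transform_select(frame_scores, k=2)
```

With these made-up scores, both selected frames fall in the high-similarity region, whereas uniform sampling would spread them across the whole clip regardless of relevance.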

Composable Prompting Workspaces for Creative Writing: Exploration and Iteration Using Dynamic Widgets

Rifat Mehreen Amin,Oliver Hans Kühle,Daniel Buschek,Andreas Butz

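Task: Propose the concept of a composable prompting canvas for text exploration and iteration.

Motivation: Current graphical user interfaces for generative AI models lack support for iterative exploration because they do not represent prompts as actionable interface objects.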

Details

Method: A composable prompting canvas with dynamic widgets, in which users create widgets through system suggestions, prompting, or manual creation to capture task-relevant facets. Result: In a comparative study, 18 participants completed writing tasks with the system; they reported more control over the generated text, and the system significantly outperformed the baseline on the Creativity Support Index. Conclusion: The work highlights the need for GUIs that support user-driven customization and (re-)structuring to increase the flexibility and efficiency of prompting. Abstract: Generative AI models offer many possibilities for text creation and transformation. Current graphical user interfaces (GUIs) for prompting them lack support for iterative exploration, as they do not represent prompts as actionable interface objects. We propose the concept of a composable prompting canvas for text exploration and iteration using dynamic widgets. Users generate widgets through system suggestions, prompting, or manually to capture task-relevant facets that affect the generated text. In a comparative study with a baseline (conversational UI), 18 participants worked on two writing tasks, creating diverse prompting environments with custom widgets and spatial layouts. They reported having more control over the generated text and preferred our system over the baseline. Our design significantly outperformed the baseline on the Creativity Support Index, and participants felt the results were worth the effort. This work highlights the need for GUIs that support user-driven customization and (re-)structuring to increase both the flexibility and efficiency of prompting.

Hamadi Chihaoui,Paolo Favaro

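Task: Propose Invert2Restore, a zero-shot, training-free method that handles the degradation operator in image restoration under fully blind or partially blind settings.

Motivation: Address two main challenges of image restoration in real-world scenarios, the accurate characterization of an image prior and the precise modeling of the degradation operator, especially when the degradation model is unknown or only partially known.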

Details

Method: A pre-trained diffusion model is used as a deterministic mapping, and the degraded image is restored by guiding its input noise toward a higher-probability-density region. Result: Experiments show that Invert2Restore achieves high-fidelity restoration when the degradation operator is unknown or partially known, and performs well across several types of image degradation. Conclusion: Invert2Restore is an efficient and general method that achieves high-quality image restoration with limited information about the degradation operator. Abstract: Two of the main challenges of image restoration in real-world scenarios are the accurate characterization of an image prior and the precise modeling of the image degradation operator. Pre-trained diffusion models have been very successfully used as image priors in zero-shot image restoration methods. However, how to best handle the degradation operator is still an open problem. In real-world data, methods that rely on specific parametric assumptions about the degradation model often face limitations in their applicability. To address this, we introduce Invert2Restore, a zero-shot, training-free method that operates in both fully blind and partially blind settings -- requiring no prior knowledge of the degradation model or only partial knowledge of its parametric form without known parameters. Despite this, Invert2Restore achieves high-fidelity results and generalizes well across various types of image degradation. It leverages a pre-trained diffusion model as a deterministic mapping between normal samples and undistorted image samples. The key insight is that the input noise mapped by a diffusion model to a degraded image lies in a low-probability density region of the standard normal distribution. Thus, we can restore the degraded image by carefully guiding its input noise toward a higher-density region. We experimentally validate Invert2Restore across several image restoration tasks, demonstrating that it achieves state-of-the-art performance in scenarios where the degradation operator is either unknown or partially known.

debug-gym: A Text-Based Environment for Interactive Debugging

Xingdi Yuan,Morgane M Moss,Charbel El Feghali,Chinmay Singh,Darya Moldavskaya,Drew MacPhee,Lucas Caccia,Matheus Pereira,Minseon Kim,Alessandro Sordoni,Marc-Alexandre Côté

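Task: Explore how large language models (LLMs) can gather task-relevant information by interactively exploring a codebase.

Motivation: For coding tasks, LLMs are usually assumed to have all relevant information available in context or in their training data, but in practice interactive exploration may be needed.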

Details

Method: A lightweight textual environment, debug-gym, equipped with tools such as a Python debugger, supports interactive debugging by LLM-based agents. Result: The debug-gym environment lets LLM-based agents gather information interactively during coding and debugging tasks. Conclusion: The approach applies beyond coding and can be generalized to other tasks that benefit from information-seeking behavior by an LLM agent. Abstract: Large Language Models (LLMs) are increasingly relied upon for coding tasks, yet in most scenarios it is assumed that all relevant information can be either accessed in context or matches their training data. We posit that LLMs can benefit from the ability to interactively explore a codebase to gather the information relevant to their task. To achieve this, we present a textual environment, namely debug-gym, for developing LLM-based agents in an interactive coding setting. Our environment is lightweight and provides a preset of useful tools, such as a Python debugger (pdb), designed to facilitate an LLM-based agent's interactive debugging. Beyond coding and debugging tasks, this approach can be generalized to other tasks that would benefit from information-seeking behavior by an LLM agent.

Shape Modeling of Longitudinal Medical Images: From Diffeomorphic Metric Mapping to Deep Learning

Edwin Tay,Nazli Tümer,Amir A. Zadpoor

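Task: Review methods for modeling longitudinal shape change in biological tissue.

Motivation: Shape changes in biological tissue matter for diagnosis, prognosis, and treatment, but their inherently nonlinear nature makes them challenging to model.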

Details

Method: The review covers a range of methods, from diffeomorphic metric mapping to deep-learning based approaches (e.g., autoencoders, generative networks, recurrent neural networks). Result: It summarizes synergistic combinations of existing technologies and identifies key deficiencies in current research. Conclusion: It highlights potential directions for future research to close the gaps in existing work. Abstract: Living biological tissue is a complex system, constantly growing and changing in response to external and internal stimuli. These processes lead to remarkable and intricate changes in shape. Modeling and understanding both natural and pathological (or abnormal) changes in the shape of anatomical structures is highly relevant, with applications in diagnostic, prognostic, and therapeutic healthcare. Nevertheless, modeling the longitudinal shape change of biological tissue is a non-trivial task due to its inherent nonlinear nature. In this review, we highlight several existing methodologies and tools for modeling longitudinal shape change (i.e., spatiotemporal shape modeling). These methods range from diffeomorphic metric mapping to deep-learning based approaches (e.g., autoencoders, generative networks, recurrent neural networks, etc.). We discuss the synergistic combinations of existing technologies and potential directions for future research, underscoring key deficiencies in the current research landscape.

Model Assembly Learning with Heterogeneous Layer Weight Merging

Yi-Kai Zhang,Jin Wang,Xu-Xiang Zhong,De-Chuan Zhan,Han-Jia Ye

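Task: Propose Model Assembly Learning (MAL), a new paradigm that enhances a base model's capabilities by iteratively integrating parameters from different models.

Motivation: Existing merging methods require models with identical architectures, which limits flexibility; MAL aims to lift this restriction by supporting heterogeneous architectures and selective parameter merging.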

Details

Method: Parameters from diverse models in an open-ended model zoo are integrated iteratively, with selective merging across layers, and the conditions and settings of heterogeneous parameter merging are studied systematically. Result: Key laws for heterogeneous parameter merging are established, together with practical guidelines for implementing MAL. Conclusion: MAL offers a flexible and efficient new approach to model merging that supports heterogeneous architectures and cross-layer parameter integration. Abstract: Model merging acquires general capabilities without extra data or training by combining multiple models' parameters. Previous approaches achieve linear mode connectivity by aligning parameters into the same loss basin using permutation invariance. In this paper, we introduce Model Assembly Learning (MAL), a novel paradigm for model merging that iteratively integrates parameters from diverse models in an open-ended model zoo to enhance the base model's capabilities. Unlike previous works that require identical architectures, MAL allows the merging of heterogeneous architectures and selective parameters across layers. Specifically, the base model can incorporate parameters from different layers of multiple pre-trained models. We systematically investigate the conditions and fundamental settings of heterogeneous parameter merging, addressing all possible mismatches in layer widths between the base and target models. Furthermore, we establish key laws and provide practical guidelines for effectively implementing MAL.

ICG-MVSNet: Learning Intra-view and Cross-view Relationships for Guidance in Multi-View Stereo

Yuxi Hu,Jun Zhang,Zhe Zhang,Rafael Weilharter,Yuchen Rao,Kuangyi Chen,Runze Yuan,Friedrich Fraundorfer

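Task: Propose ICG-MVSNet, which improves depth estimation in multi-view stereo (MVS) by explicitly integrating intra-view and cross-view relationships.

Motivation: Current learning-based MVS frameworks overlook the geometric information embedded in features and correlations, which leads to weak cost matching.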

Details

Method: An intra-view feature fusion module exploits feature coordinate correlations within a single image, and a lightweight cross-view aggregation module exploits contextual information from volume correlations. Result: The method performs competitively on the DTU dataset and the Tanks and Temples benchmark while requiring fewer computational resources. Conclusion: By integrating geometric information, ICG-MVSNet improves both the accuracy and the efficiency of depth estimation. Abstract: Multi-view Stereo (MVS) aims to estimate depth and reconstruct 3D point clouds from a series of overlapping images. Recent learning-based MVS frameworks overlook the geometric information embedded in features and correlations, leading to weak cost matching. In this paper, we propose ICG-MVSNet, which explicitly integrates intra-view and cross-view relationships for depth estimation. Specifically, we develop an intra-view feature fusion module that leverages the feature coordinate correlations within a single image to enhance robust cost matching. Additionally, we introduce a lightweight cross-view aggregation module that efficiently utilizes the contextual information from volume correlations to guide regularization. Our method is evaluated on the DTU dataset and Tanks and Temples benchmark, consistently achieving competitive performance against state-of-the-art works, while requiring lower computational resources.

LLM-Gomoku: A Large Language Model-Based System for Strategic Gomoku with Self-Play and Reinforcement Learning

Hui Wang

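Task: Develop a Gomoku AI system based on large language models (LLMs) that simulates the way humans learn to play.

Motivation: LLMs perform well in natural language processing, but using them for strategic planning and decision-making in games such as Gomoku remains challenging.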

Details

Method: The model is enabled to "read the board," "understand the rules," "select strategies," and "evaluate positions," and its abilities are strengthened through self-play and reinforcement learning. Result: The approach significantly improves the selection of move positions, resolves the issue of generating illegal positions, and reduces processing time through parallel position evaluation. Conclusion: After extensive self-play training, the model's Gomoku-playing ability improves notably. Abstract: In recent years, large language models (LLMs) have shown significant advancements in natural language processing (NLP), with strong capabilities in generation, comprehension, and reasoning. These models have found applications in education, intelligent decision-making, and gaming. However, effectively utilizing LLMs for strategic planning and decision-making in the game of Gomoku remains a challenge. This study aims to develop a Gomoku AI system based on LLMs, simulating the human learning process of playing chess. The system is designed to understand and apply Gomoku strategies and logic to make rational decisions. The research methods include enabling the model to "read the board," "understand the rules," "select strategies," and "evaluate positions," while enhancing its abilities through self-play and reinforcement learning. The results demonstrate that this approach significantly improves the selection of move positions, resolves the issue of generating illegal positions, and reduces process time through parallel position evaluation. After extensive self-play training, the model's Gomoku-playing capabilities have been notably enhanced.
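
The abstract mentions resolving illegal move generation but does not say how. One simple way such a guard could look is sketched below, purely as an illustration: the 15 x 15 board, the `"."` marker for empty cells, and the helper names `is_legal` and `filter_legal` are assumptions, not details from the paper.

```python
from typing import List, Sequence, Tuple

EMPTY = "."  # hypothetical marker for an unoccupied intersection
Move = Tuple[int, int]


def is_legal(board: Sequence[Sequence[str]], row: int, col: int) -> bool:
    """A move is legal if it lies on the board and the cell is empty."""
    size = len(board)
    return 0 <= row < size and 0 <= col < size and board[row][col] == EMPTY


def filter_legal(board: Sequence[Sequence[str]], candidates: Sequence[Move]) -> List[Move]:
    """Drop candidate moves (e.g. proposed by a language model) that are illegal."""
    return [(r, c) for r, c in candidates if is_legal(board, r, c)]


board = [[EMPTY] * 15 for _ in range(15)]  # assumed 15 x 15 board
board[7][7] = "X"  # centre is already taken

proposed = [(7, 7), (7, 8), (15, 0), (-1, 3)]  # made-up model proposals
playable = filter_legal(board, proposed)
```

Only the proposals that survive this filter would be passed on to position evaluation.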

LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing

Achint Soni,Meet Soni,Sirisha Rambhatla

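Task: Modify specific regions of an image according to natural language instructions while preserving the overall structure and background fidelity.

Motivation: Existing methods use cross-attention maps generated by diffusion models to identify the regions to modify, but because cross-attention focuses on semantic relevance, it struggles to preserve image integrity, which leads to editing artifacts and distortions.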

Details

Method: LOCATEdit enhances cross-attention maps with a graph-based approach that uses self-attention-derived patch relationships to keep attention smooth and coherent across image regions, so that edits stay within the designated regions and the surrounding structure is retained. Result: It substantially outperforms existing baselines on PIE-Bench, showing state-of-the-art performance and effectiveness across editing tasks. Conclusion: By improving the attention mechanism, LOCATEdit overcomes the limitations of existing methods and achieves higher-quality image editing. Abstract: Text-guided image editing aims to modify specific regions of an image according to natural language instructions while maintaining the general structure and the background fidelity. Existing methods utilize masks derived from cross-attention maps generated from diffusion models to identify the target regions for modification. However, since cross-attention mechanisms focus on semantic relevance, they struggle to maintain the image integrity. As a result, these methods often lack spatial consistency, leading to editing artifacts and distortions. In this work, we address these limitations and introduce LOCATEdit, which enhances cross-attention maps through a graph-based approach utilizing self-attention-derived patch relationships to maintain smooth, coherent attention across image regions, ensuring that alterations are limited to the designated items while retaining the surrounding structure. LOCATEdit consistently and substantially outperforms existing baselines on PIE-Bench, demonstrating its state-of-the-art performance and effectiveness on various editing tasks. Code can be found on https://github.com/LOCATEdit/LOCATEdit/

Learning to Represent Individual Differences for Choice Decision Making

Yan-Ying Chen,Yue Weng,Alexandre Filipowicz,Rumen Iliev,Francine Chen,Shabnam Hakimi,Yanxia Zhang,Matthew Lee,Kent Lyons,Charlene Wu

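Task: Investigate how representation learning can measure individual differences from behavioral experiment data to improve the prediction of human decisions.

Motivation: Human decisions are shaped by many complex factors and differ considerably between individuals, while existing measures (such as questionnaires and behavioral models) are usually low-dimensional and not tailored to specific prediction tasks.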

Details

Method: Representation learning creates individual embeddings from structured and unstructured data to capture individual differences flexibly. Result: Models that use representation learning predict decisions better than models without it, and even outperform theory-based behavioral models. Conclusion: Representation learning is an effective and flexible tool for capturing individual differences and improving decision prediction. Abstract: Human decision making can be challenging to predict because decisions are affected by a number of complex factors. Adding to this complexity, decision-making processes can differ considerably between individuals, and methods aimed at predicting human decisions need to take individual differences into account. Behavioral science offers methods by which to measure individual differences (e.g., questionnaires, behavioral models), but these are often narrowed down to low dimensions and not tailored to specific prediction tasks. This paper investigates the use of representation learning to measure individual differences from behavioral experiment data. Representation learning offers a flexible approach to create individual embeddings from data that are both structured (e.g., demographic information) and unstructured (e.g., free text), where the flexibility provides more options for individual difference measures for personalization, e.g., free text responses may allow for open-ended questions that are less privacy-sensitive. In the current paper we use representation learning to characterize individual differences in human performance on an economic decision-making task. We demonstrate that models using representation learning to capture individual differences consistently improve decision predictions over models without representation learning, and even outperform well-known theory-based behavioral models used in these environments. Our results suggest that representation learning offers a useful and flexible tool to capture individual differences.

uLayout: Unified Room Layout Estimation for Perspective and Panoramic Images

Jonathan Lee,Bolivar Solarte,Chin-Hsuan Wu,Jin-Cheng Jhang,Fu-En Wang,Yi-Hsuan Tsai,Min Sun

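Task: Propose uLayout, a unified model for estimating room layout geometry from both perspective and panoramic images.

Motivation: Traditional solutions require a different model design for each image type, whereas uLayout aims to simplify the pipeline with a single design for both.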

Details

Method: Both image types are unified in the equirectangular projection, and a shared feature extractor with an extra 1D-Convolution layer handles the Field-of-View difference between the input domains. Result: uLayout performs well on several real-world datasets, is competitive with current state-of-the-art solutions, and is the first single end-to-end model for both image types. Conclusion: With a simple yet effective approach, uLayout unifies layout estimation for perspective and panoramic images. Abstract: We present uLayout, a unified model for estimating room layout geometries from both perspective and panoramic images, whereas traditional solutions require different model designs for each image type. The key idea of our solution is to unify both domains into the equirectangular projection, in particular by allocating perspective images into the most suitable latitude coordinate to effectively exploit both domains seamlessly. To address the Field-of-View (FoV) difference between the input domains, we design uLayout with a shared feature extractor with an extra 1D-Convolution layer to condition each domain input differently. This conditioning allows us to efficiently formulate a column-wise feature regression problem regardless of the FoV input. This simple yet effective approach achieves competitive performance with current state-of-the-art solutions and shows for the first time a single end-to-end model for both domains. Extensive experiments on the real-world datasets LSUN, Matterport3D, PanoContext, and Stanford 2D-3D demonstrate the contribution of our approach. Code is available at https://github.com/JonathanLee112/uLayout.

Elementwise Layer Normalization

Felix Stollenwerk

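Task: Propose Elementwise Layer Normalization (ELN), an element-wise transformation that serves as an alternative to Layer Normalization.

Motivation: Dynamic Tanh (DyT) works well empirically but lacks a theoretical foundation.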

Details

Method: DyT is derived mathematically, the derivation is shown to require a well-defined approximation, and dropping that approximation yields ELN. Result: ELN resembles Layer Normalization more accurately than DyT does. Conclusion: ELN is a theoretically grounded and more accurate alternative to Layer Normalization. Abstract: A recent paper proposed Dynamic Tanh (DyT) as a drop-in replacement for Layer Normalization. Although the method is empirically well-motivated and appealing from a practical point of view, it lacks a theoretical foundation. In this work, we derive DyT mathematically and show that a well-defined approximation is needed to do so. By dropping said approximation, an alternative element-wise transformation is obtained, which we call Elementwise Layer Normalization (ELN). We demonstrate that ELN resembles Layer Normalization more accurately than DyT does.
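
For orientation, the two transformations being compared can be written out in a few lines. This is a minimal sketch of standard Layer Normalization and of DyT in its commonly cited form, y = gamma * tanh(alpha * x) + beta; the scalar `gamma` and `beta` (per-channel vectors in practice) and the default `alpha=0.5` are simplifying assumptions. The ELN formula itself is not given in the abstract and is therefore not reproduced here.

```python
import math
from typing import List, Sequence


def layer_norm(x: Sequence[float], gamma: float = 1.0, beta: float = 0.0,
               eps: float = 1e-5) -> List[float]:
    """Standard Layer Normalization: depends on the mean and variance of x."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]


def dyt(x: Sequence[float], alpha: float = 0.5, gamma: float = 1.0,
        beta: float = 0.0) -> List[float]:
    """Dynamic Tanh: purely element-wise, no statistics over x are computed."""
    return [gamma * math.tanh(alpha * v) + beta for v in x]


activations = [1.0, 2.0, 3.0, 4.0]  # made-up activations
normalized = layer_norm(activations)
squashed = dyt(activations)
```

The structural difference is visible in the code: `layer_norm` needs the mean and variance of the whole vector, while `dyt` transforms each element independently.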

Bearing fault diagnosis based on multi-scale spectral images and convolutional neural network

Tongchao Luo,Mingquan Qiu,Zhenyu Wu,Zebo Zhao,Dingyou Zhang

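Task: Propose a new bearing fault diagnosis method based on multi-scale spectrum feature images and deep learning.

Motivation: Address the low diagnostic accuracy of traditional bearing fault diagnosis methods.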

Details

Method: Vibration signals are preprocessed by mean removal, converted into multi-length spectra with the fast Fourier transform (FFT), assembled into multi-scale spectral images (MSSI), and classified with a convolutional neural network (CNN). Result: Experiments show that the proposed method significantly improves fault diagnosis accuracy. Conclusion: The method is effective and outperforms traditional approaches in bearing fault diagnosis. Abstract: To address the challenges of low diagnostic accuracy in traditional bearing fault diagnosis methods, this paper proposes a novel fault diagnosis approach based on multi-scale spectrum feature images and deep learning. Firstly, the vibration signals are preprocessed through mean removal and then converted to multi-length spectra with fast Fourier transforms (FFT). Secondly, a novel feature called multi-scale spectral image (MSSI) is constructed by a multi-length spectrum paving scheme. Finally, a deep learning framework, convolutional neural network (CNN), is formulated to diagnose the bearing faults. Two experimental cases are utilized to verify the effectiveness of the proposed method. Experimental results demonstrate that the proposed method significantly improves the accuracy of fault diagnosis.
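
The first preprocessing stage (mean removal followed by spectra of several lengths) can be sketched as follows. This is an illustration under assumptions: the helper names, the choice of taking the first `n` samples for each length, and the segment lengths are hypothetical, and a naive discrete Fourier transform is used only to keep the example dependency-free (a real pipeline would call an FFT routine). The paving scheme that turns the spectra into an MSSI is not described in the abstract and is not attempted here.

```python
import cmath
import math
from typing import Dict, List, Sequence


def remove_mean(signal: Sequence[float]) -> List[float]:
    """Subtract the signal mean so the DC component does not dominate."""
    mean = sum(signal) / len(signal)
    return [v - mean for v in signal]


def magnitude_spectrum(segment: Sequence[float]) -> List[float]:
    """One-sided magnitude spectrum via a naive O(n^2) DFT, scaled by 1/n."""
    n = len(segment)
    return [
        abs(sum(segment[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) / n
        for k in range(n // 2)
    ]


def multi_length_spectra(signal: Sequence[float], lengths: Sequence[int]) -> Dict[int, List[float]]:
    """Spectra of the first n samples for each requested length n (assumed scheme)."""
    centred = remove_mean(signal)
    return {n: magnitude_spectrum(centred[:n]) for n in lengths if n <= len(centred)}


# Made-up test signal: 4 cycles of a sine over 64 samples, plus a constant offset.
signal = [math.sin(2 * math.pi * 4 * t / 64) + 3.0 for t in range(64)]
spectra = multi_length_spectra(signal, lengths=[32, 64])
```

The same tone lands in a different bin at each length, which is the kind of multi-resolution view that a multi-scale image built from these spectra would expose to the CNN.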

GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics

Arsham Gholamzadeh Khoee,Shuai Wang,Yinan Yu,Robert Feldt,Dhasarathy Parthasarathy

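Task: Develop GateLens, an LLM-based tool for analyzing tabular data in the automotive domain to support software release decisions.

Motivation: Traditional manual analysis is slow and costly in safety-critical domains such as automotive systems, while existing LLMs are limited in handling structured data and complex queries.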

Details

Method: GateLens translates natural language queries into Relational Algebra (RA) expressions and then generates optimized Python code. Result: GateLens outperforms the baseline system on benchmarking datasets, with higher F1 scores and more robust handling of complex and ambiguous queries; industrial evaluations show analysis time reduced by over 80%. Conclusion: By automating test result analysis, GateLens enables faster and more dependable release decisions and improves software scalability and reliability in automotive systems. Abstract: Ensuring the reliability and effectiveness of software release decisions is critical, particularly in safety-critical domains like automotive systems. Precise analysis of release validation data, often presented in tabular form, plays a pivotal role in this process. However, traditional methods that rely on manual analysis of extensive test datasets and validation metrics are prone to delays and high costs. Large Language Models (LLMs) offer a promising alternative but face challenges in analytical reasoning, contextual understanding, handling out-of-scope queries, and processing structured test data consistently; limitations that hinder their direct application in safety-critical scenarios. This paper introduces GateLens, an LLM-based tool for analyzing tabular data in the automotive domain. GateLens translates natural language queries into Relational Algebra (RA) expressions and then generates optimized Python code. It outperforms the baseline system on benchmarking datasets, achieving higher F1 scores and handling complex and ambiguous queries with greater robustness. Ablation studies confirm the critical role of the RA module, with performance dropping sharply when omitted. Industrial evaluations reveal that GateLens reduces analysis time by over 80% while maintaining high accuracy and reliability. As demonstrated by presented results, GateLens achieved high performance without relying on few-shot examples, showcasing strong generalization across various query types from diverse company roles. Insights from deploying GateLens with a partner automotive company offer practical guidance for integrating AI into critical workflows such as release validation. Results show that by automating test result analysis, GateLens enables faster, more informed, and dependable release decisions, and can thus advance software scalability and reliability in automotive systems.
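
To illustrate what an intermediate Relational Algebra step between a question and Python code might look like, here is a small hand-written sketch. Everything in it is hypothetical: the table, the column names, the example question, and the `select` and `project` helpers are made up for illustration and are not GateLens output or its API.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, str]


def select(rows: Sequence[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Relational selection (sigma): keep the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]


def project(rows: Sequence[Row], columns: Sequence[str]) -> List[Row]:
    """Relational projection (pi): keep only the named columns."""
    return [{c: r[c] for c in columns} for r in rows]


# Hypothetical release validation table.
results = [
    {"test_id": "T1", "release": "R1", "status": "pass"},
    {"test_id": "T2", "release": "R2", "status": "pass"},
    {"test_id": "T3", "release": "R2", "status": "fail"},
]

# Hypothetical question: "Which tests failed for release R2?"
# RA form: pi_{test_id}( sigma_{release = 'R2' AND status = 'fail'}( results ) )
failed = project(
    select(results, lambda r: r["release"] == "R2" and r["status"] == "fail"),
    ["test_id"],
)
```

Writing the query as selection followed by projection makes each step checkable before any code is generated, which is one way to read the abstract's finding that the RA module matters.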

AlignDiff: Learning Physically-Grounded Camera Alignment via Diffusion

Liuyue Xie,Jiancong Guo,Ozan Cakmakci,Andre Araujo,Laszlo A. Jeni,Zhiheng Jia

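Task: Propose a new framework that addresses camera calibration under complex optical distortions by jointly modeling camera intrinsic and extrinsic parameters.

Motivation: Existing methods rely on pre-rectified images or calibration patterns, which limits their applicability and flexibility.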

Details

Method: AlignDiff is a diffusion model conditioned on geometric priors that uses edge-aware attention to estimate camera distortions and scene geometry simultaneously. Result: Experiments show that the method reduces the angular error of estimated ray bundles by about 8.2 degrees and outperforms existing approaches on real-world datasets. Conclusion: By modeling geometric features and drawing on a large lens database, AlignDiff improves the accuracy and generalizability of camera calibration. Abstract: Accurate camera calibration is a fundamental task for 3D perception, especially when dealing with real-world, in-the-wild environments where complex optical distortions are common. Existing methods often rely on pre-rectified images or calibration patterns, which limits their applicability and flexibility. In this work, we introduce a novel framework that addresses these challenges by jointly modeling camera intrinsic and extrinsic parameters using a generic ray camera model. Unlike previous approaches, AlignDiff shifts focus from semantic to geometric features, enabling more accurate modeling of local distortions. We propose AlignDiff, a diffusion model conditioned on geometric priors, enabling the simultaneous estimation of camera distortions and scene geometry. To enhance distortion prediction, we incorporate edge-aware attention, focusing the model on geometric features around image edges, rather than semantic content. Furthermore, to enhance generalizability to real-world captures, we incorporate a large database of ray-traced lenses containing over three thousand samples. This database characterizes the distortion inherent in a diverse variety of lens forms. Our experiments demonstrate that the proposed method significantly reduces the angular error of estimated ray bundles by ~8.2 degrees and improves overall calibration accuracy, outperforming existing approaches on challenging, real-world datasets.

Ziyu Guo,Young Yoon Lee,Joseph Liu,Yizhak Ben-Shabat,Victor Zordan,Mubbasir Kapadia

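Task: Propose StyleMotif, a new Stylized Motion Latent Diffusion model that generates motion conditioned on both content and style from multi-modal inputs.

Motivation: Existing approaches focus either on generating diverse motion content or on transferring style from sequences, whereas StyleMotif aims to synthesize motion across a wide range of content while incorporating stylistic cues from multi-modal inputs.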

Details

Method: A style-content cross fusion mechanism is introduced, and a style encoder is aligned with a pre-trained multi-modal model so that generated motion captures the reference style while preserving realism. Result: Experiments show that StyleMotif surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization. Conclusion: StyleMotif enables more nuanced motion synthesis; source code and pre-trained models will be released upon acceptance. Abstract: We present StyleMotif, a novel Stylized Motion Latent Diffusion model, generating motion conditioned on both content and style from multiple modalities. Unlike existing approaches that either focus on generating diverse motion content or transferring style from sequences, StyleMotif seamlessly synthesizes motion across a wide range of content while incorporating stylistic cues from multi-modal inputs, including motion, text, image, video, and audio. To achieve this, we introduce a style-content cross fusion mechanism and align a style encoder with a pre-trained multi-modal model, ensuring that the generated motion accurately captures the reference style while preserving realism. Extensive experiments demonstrate that our framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Source code and pre-trained models will be released upon acceptance. Project Page: https://stylemotif.github.io

FusionSegReID: Advancing Person Re-Identification with Multimodal Retrieval and Precise Segmentation

Jincheng Yan,Yun Wang,Xiaoyan Luo,Yu-Wing Tai

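Task: Propose FusionSegReID, a multimodal model that combines image and text inputs to improve person re-identification (ReID) performance.

Motivation: Traditional unimodal ReID methods are limited in complex scenarios such as occlusion, lighting changes, and pose variation, and multimodal fusion remains under-explored.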

Details

Method: Designs FusionSegReID to combine the complementary strengths of the image and text modalities, improving matching accuracy and robustness. Result: Experiments show significant gains in Top-1 accuracy and mAP, and better results in challenging scenarios such as occlusion and low-quality images. Conclusion: FusionSegReID outperforms traditional unimodal models, offering a more robust and flexible solution for real-world ReID tasks. Abstract: Person re-identification (ReID) plays a critical role in applications like security surveillance and criminal investigations by matching individuals across large image galleries captured by non-overlapping cameras. Traditional ReID methods rely on unimodal inputs, typically images, but face limitations due to challenges like occlusions, lighting changes, and pose variations. While advancements in image-based and text-based ReID systems have been made, the integration of both modalities has remained under-explored. This paper presents FusionSegReID, a multimodal model that combines both image and text inputs for enhanced ReID performance. By leveraging the complementary strengths of these modalities, our model improves matching accuracy and robustness, particularly in complex, real-world scenarios where one modality may struggle. Our experiments show significant improvements in Top-1 accuracy and mean Average Precision (mAP) for ReID, as well as better segmentation results in challenging scenarios like occlusion and low-quality images. Ablation studies further confirm that multimodal fusion and segmentation modules contribute to enhanced re-identification and mask accuracy. The results show that FusionSegReID outperforms traditional unimodal models, offering a more robust and flexible solution for real-world person ReID tasks.
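The two retrieval metrics named above, Top-1 accuracy and mean Average Precision, can be sketched as follows. This is a generic illustration, not the paper's evaluation protocol: the function names are hypothetical, and the sketch omits the same-camera filtering that ReID benchmarks usually apply.

```python
def average_precision(ranked_ids, query_id):
    """AP for one query: mean of precision@k taken at each correct match."""
    hits = 0
    precisions = []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def evaluate_retrieval(rankings):
    """rankings: list of (query_id, gallery ids sorted by descending similarity).

    Returns (top1_accuracy, mean_average_precision).
    """
    top1 = sum(1 for q, ranked in rankings if ranked and ranked[0] == q) / len(rankings)
    mean_ap = sum(average_precision(ranked, q) for q, ranked in rankings) / len(rankings)
    return top1, mean_ap
```

Top-1 only looks at the best match, while mAP rewards placing every correct gallery image early in the ranking, which is why the two are usually reported together.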

Audio-driven Gesture Generation via Deviation Feature in the Latent Space

Jiahui Chen,Yang Huan,Runhua Shi,Chanfan Ding,Xiaoqi Mo,Siyu Xiong,Yinong He

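Task: Propose a weakly supervised framework for generating videos of hand gestures and mouth movements synchronized with speech.

Motivation: Gestures are essential for enhancing spoken communication, but existing methods concentrate on point-level motion or fully supervised data-driven approaches and pay little attention to pixel-level motion deviations.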

Details

Method: Uses a diffusion model to integrate latent motion features and learns latent representation deviations through weak supervision, enabling more precise generation of hand gestures and mouth movements. Result: Experiments show the method significantly improves video quality and surpasses current state-of-the-art techniques. Conclusion: Weakly supervised learning combined with latent-space deviations can effectively generate realistic co-speech gesture videos. Abstract: Gestures are essential for enhancing co-speech communication, offering visual emphasis and complementing verbal interactions. While prior work has concentrated on point-level motion or fully supervised data-driven methods, we focus on co-speech gestures, advocating for weakly supervised learning and pixel-level motion deviations. We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation. Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation. By leveraging weakly supervised deviations in latent space, we effectively generate hand gestures and mouth movements, crucial for realistic video production. Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques.

The MVTec AD 2 Dataset: Advanced Scenarios for Unsupervised Anomaly Detection

Lars Heckler-Kram,Jan-Hendrik Neudeck,Ulla Scheler,Rebecca König,Carsten Steger

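Task: Present and evaluate the MVTec AD 2 dataset to address performance saturation on existing anomaly detection benchmarks.

Motivation: Performance on existing anomaly detection benchmarks such as MVTec AD and VisA has saturated; the resulting lack of discriminatory power hinders model comparison and progress in the field.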

Details

Method: Builds MVTec AD 2, a dataset of more than 8000 high-resolution images covering eight industrial inspection scenarios, including challenging use cases such as transparent and overlapping objects and dark-field and back light illumination. Result: State-of-the-art methods remain below 60% average AU-PRO; the dataset also provides test scenarios with lighting condition changes. Conclusion: MVTec AD 2 offers a more challenging and diverse benchmark for anomaly detection, supporting further progress in the field. Abstract: In recent years, performance on existing anomaly detection benchmarks like MVTec AD and VisA has started to saturate in terms of segmentation AU-PRO, with state-of-the-art models often competing in the range of less than one percentage point. This lack of discriminatory power prevents a meaningful comparison of models and thus hinders progress of the field, especially when considering the inherent stochastic nature of machine learning results. We present MVTec AD 2, a collection of eight anomaly detection scenarios with more than 8000 high-resolution images. It comprises challenging and highly relevant industrial inspection use cases that have not been considered in previous datasets, including transparent and overlapping objects, dark-field and back light illumination, objects with high variance in the normal data, and extremely small defects. We provide comprehensive evaluations of state-of-the-art methods and show that their performance remains below 60% average AU-PRO. Additionally, our dataset provides test scenarios with lighting condition changes to assess the robustness of methods under real-world distribution shifts. We host a publicly accessible evaluation server that holds the pixel-precise ground truth of the test set (https://benchmark.mvtec.com/). All image data is available at https://www.mvtec.com/company/research/datasets/mvtec-ad-2.

InteractionMap: Improving Online Vectorized HDMap Construction with Interaction

Kuang Wu,Chuan Yang,Zhanbin Li

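Task: Improve DETR-based vectorized HD map construction by fully leveraging local-to-global information interaction in both time and space.

Motivation: Current HD map vectorization methods are mainly based on DETR-like frameworks but do not fully exploit the shape priors of map elements or spatio-temporal information interaction.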

Details

Method: Proposes InteractionMap, comprising an explicit position relation prior, a key-frame-based hierarchical temporal fusion module, and a geometric-aware classification loss and matching cost. Result: Achieves state-of-the-art performance on the nuScenes and Argoverse2 benchmarks. Conclusion: InteractionMap significantly improves HD map vectorization through local-to-global information interaction. Abstract: Vectorized high-definition (HD) maps are essential for an autonomous driving system. Recent state-of-the-art map vectorization methods are mainly based on DETR-like frameworks that generate HD maps in an end-to-end manner. In this paper, we propose InteractionMap, which improves previous map vectorization methods by fully leveraging local-to-global information interaction in both time and space. Firstly, we explore enhancing DETR-like detectors with an explicit position relation prior from point-level to instance-level, since map elements contain strong shape priors. Secondly, we propose a key-frame-based hierarchical temporal fusion module, which interacts temporal information from local to global. Lastly, the separate classification branch and regression branch lead to the problem of misalignment in the output distribution. We interact semantic information with geometric information by introducing a novel geometric-aware classification loss in optimization and a geometric-aware matching cost in label assignment. InteractionMap achieves state-of-the-art performance on both nuScenes and Argoverse2 benchmarks.

CMED: A Child Micro-Expression Dataset

Nikin Matharaarachchi,Muhammad Fermi Pasha,Sonya Coleman,Kah Peng Wong

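Task: Build and analyze the first child micro-expression dataset, explore the key feature differences between child and adult micro-expressions, and establish baselines for automated spotting and recognition of micro-expressions in children.

Motivation: Existing research on micro-expression detection focuses on adults, whose micro-expressions differ in character from those of children, and the lack of a child dataset has left the area under-studied.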

Details

Method: Captures spontaneous child micro-expression videos in the wild using video conferencing software to build the first child micro-expression dataset, and establishes automated spotting and recognition baselines using hand-created and learning-based approaches. Result: Compiles the first child micro-expression dataset, identifies key feature differences between child and adult micro-expressions, and establishes baselines for automated spotting and recognition. Conclusion: The study fills a gap in child micro-expression research, provides an important tool for psychotherapy, and lays a foundation for future work. Abstract: Micro-expressions are short bursts of emotion that are difficult to hide. Their detection in children is an important cue to assist psychotherapists in conducting better therapy. However, existing research on the detection of micro-expressions has focused on adults, whose expressions differ in their characteristics from those of children. The lack of research is a direct consequence of the lack of a child-based micro-expressions dataset as it is much more challenging to capture children's facial expressions due to the lack of predictability and controllability. This study compiles a dataset of spontaneous child micro-expression videos, the first of its kind, to the best of the authors' knowledge. The dataset is captured in the wild using video conferencing software. This dataset enables us to then explore key features and differences between adult and child micro-expressions. This study also establishes a baseline for the automated spotting and recognition of micro-expressions in children using three approaches comprising hand-created and learning-based approaches.

RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a Millisecond

Daniel Bermuth,Alexander Poeppel,Wolfgang Reif

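Task: Propose a new algorithm that improves multi-view multi-person pose estimation, focusing on fast triangulation speed and good generalization.

Motivation: The integration of multi-view imaging and pose estimation has brought significant advances to computer vision applications and opens new possibilities for understanding human movement and interaction.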

Details

Method: Extends to whole-body pose estimation, capturing details from facial expressions to finger movements across multiple individuals and viewpoints. Result: Performs strongly across unseen datasets and configurations, demonstrating the method's adaptability. Conclusion: All of the work is publicly available to support further progress in the field. Abstract: The integration of multi-view imaging and pose estimation represents a significant advance in computer vision applications, offering new possibilities for understanding human movement and interactions. This work presents a new algorithm that improves multi-view multi-person pose estimation, focusing on fast triangulation speeds and good generalization capabilities. The approach extends to whole-body pose estimation, capturing details from facial expressions to finger movements across multiple individuals and viewpoints. Adaptability to different settings is demonstrated through strong performance across unseen datasets and configurations. To support further progress in this field, all of this work is publicly accessible.
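The abstract does not describe the triangulation step itself. As a generic illustration of what triangulating one joint from two views involves, the sketch below uses the textbook midpoint method; it is not the paper's algorithm, the function name is hypothetical, and the viewing rays are treated as infinite lines.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triangulate_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between lines o1 + s*d1 and o2 + t*d2.

    o1, o2 are camera centers; d1, d2 are viewing-ray directions toward the
    same 2D keypoint observed in two views.
    """
    w = [a - b for a, b in zip(o1, o2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth is not observable")
    # Closed-form minimizer of |(o1 + s*d1) - (o2 + t*d2)|^2.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [o + s * k for o, k in zip(o1, d1)]
    p2 = [o + t * k for o, k in zip(o2, d2)]
    return [(x + y) / 2 for x, y in zip(p1, p2)]
```

Because the solution is closed-form, the cost per joint is a handful of dot products, which hints at why triangulation-based pipelines can be made very fast.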

AMA-SAM: Adversarial Multi-Domain Alignment of Segment Anything Model for High-Fidelity Histology Nuclei Segmentation

Jiahe Qian,Yaoyu Fang,Jinkui Hao,Bo Zhou

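Task: Extend the Segment Anything Model (SAM) to address cell nuclei segmentation under multi-dataset learning.

Motivation: Existing nuclei segmentation methods consider only a single dataset and neglect auxiliary-domain data that could reduce overfitting and improve performance; using multiple datasets, however, can exacerbate performance drops caused by domain shift.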

Details

Method: Proposes AMA-SAM, which includes a Conditional Gradient Reversal Layer (CGRL) for multi-domain alignment and a High-Resolution Decoder (HR-Decoder) to raise output resolution. Result: Validated on several public datasets, with significant improvements over existing methods. Conclusion: According to the authors, AMA-SAM is the first approach to adapt SAM for multi-dataset learning in nuclei segmentation, and it effectively addresses domain shift and low output resolution. Abstract: Accurate segmentation of cell nuclei in histopathology images is essential for numerous biomedical research and clinical applications. However, existing cell nucleus segmentation methods only consider a single dataset (i.e., primary domain), while neglecting to leverage supplementary data from diverse sources (i.e., auxiliary domains) to reduce overfitting and enhance the performance. Although incorporating multiple datasets could alleviate overfitting, it often exacerbates performance drops caused by domain shifts. In this work, we introduce Adversarial Multi-domain Alignment of Segment Anything Model (AMA-SAM) that extends the Segment Anything Model (SAM) to overcome these obstacles through two key innovations. First, we propose a Conditional Gradient Reversal Layer (CGRL), a multi-domain alignment module that harmonizes features from diverse domains to promote domain-invariant representation learning while preserving crucial discriminative features for the primary dataset. Second, we address SAM's inherent low-resolution output by designing a High-Resolution Decoder (HR-Decoder), which directly produces fine-grained segmentation maps in order to capture intricate nuclei boundaries in high-resolution histology images. To the best of our knowledge, this is the first attempt to adapt SAM for multi-dataset learning with application to histology nuclei segmentation. We validate our method on several publicly available datasets, demonstrating consistent and significant improvements over state-of-the-art approaches.

Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance

Jaywon Koo,Jefferson Hernandez,Moayed Haji-Ali,Ziyan Yang,Vicente Ordonez

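Task: Propose cFreD, a metric based on Conditional Fréchet Distance, for evaluating text-to-image synthesis models.

Motivation: Existing metrics (such as IS, FID, and CLIPScore) assess either image quality or image-text alignment but not both, which limits their correlation with human preferences.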

Details

Method: Uses the Conditional Fréchet Distance (cFreD) to measure visual fidelity and text-prompt alignment jointly. Result: cFreD shows higher correlation with human judgments across multiple text-to-image models and diverse prompt datasets. Conclusion: cFreD is a robust, future-proof metric for systematically evaluating text-to-image models and standardizing benchmarking in this rapidly evolving field. Abstract: Evaluating text-to-image synthesis is challenging due to misalignment between established metrics and human preferences. We propose cFreD, a metric based on the notion of Conditional Fréchet Distance that explicitly accounts for both visual fidelity and text-prompt alignment. Existing metrics such as Inception Score (IS), Fréchet Inception Distance (FID) and CLIPScore assess either image quality or image-text alignment but not both, which limits their correlation with human preferences. Scoring models explicitly trained to replicate human preferences require constant updates and may not generalize to novel generation techniques or out-of-domain inputs. Through extensive experiments across multiple recently proposed text-to-image models and diverse prompt datasets, we demonstrate that cFreD exhibits a higher correlation with human judgments compared to statistical metrics, including metrics trained with human preferences. Our findings validate cFreD as a robust, future-proof metric for the systematic evaluation of text-to-image models, standardizing benchmarking in this rapidly evolving field. We release our evaluation toolkit and benchmark in the appendix.
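For background, the unconditional Fréchet distance between two Gaussians, which FID is built on, can be sketched as below. The abstract does not give the conditional formulation, so this is not cFreD itself; the diagonal-covariance restriction and the function name are simplifying assumptions made to keep the example dependency-free.

```python
import math

def frechet_distance_diag(mean1, var1, mean2, var2):
    """Squared Fréchet distance between Gaussians with diagonal covariances.

    General form: |m1 - m2|^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2)).
    With diagonal covariances the trace term reduces to a per-dimension sum.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean1, mean2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term
```

In practice the means and covariances are estimated from feature embeddings of real and generated images, and full covariance matrices require a matrix square root rather than the elementwise one used here.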

OccRobNet: Occlusion Robust Network for Accurate 3D Interacting Hand-Object Pose Estimation

Mallika Garg,Debashis Ghosh,Pyari Mohan Pradhan

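Task: Propose an occlusion-robust and accurate method for estimating 3D hand-object pose from an input RGB image.

Motivation: Occlusion is a challenging problem in 3D hand pose estimation, especially when a hand interacts with an object or two hands interact; prior work has paid little attention to occluded regions, although these regions contain important information.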

Details

Method: First localizes hand joints with a CNN-based model, then refines them by extracting contextual information, uses a self-attention transformer to identify specific joints along with hand identity, and finally estimates the pose with a cross-attention mechanism. Result: Achieves state-of-the-art results on the InterHand2.6M, HO3D, and H2O3D datasets. Conclusion: By identifying joints in occluded regions, the method is robust to occlusion and achieves highly accurate 3D hand pose estimation. Abstract: Occlusion is one of the challenging issues when estimating 3D hand pose. This problem becomes more prominent when a hand interacts with an object or two hands are involved. In past works, little attention has been given to these occluded regions. But these regions contain important and beneficial information that is vital for 3D hand pose estimation. Thus, in this paper, we propose an occlusion robust and accurate method for the estimation of 3D hand-object pose from the input RGB image. Our method includes first localising the hand joints using a CNN-based model and then refining them by extracting contextual information. The self attention transformer then identifies the specific joints along with the hand identity. This helps the model to identify the hand belongingness of a particular joint which helps to detect the joint even in the occluded region. Further, these joints with hand identity are then used to estimate the pose using a cross attention mechanism. Thus, by identifying the joints in the occluded region, the obtained network becomes robust to occlusion. Hence, this network achieves state-of-the-art results when evaluated on the InterHand2.6M, HO3D and H$_2$O3D datasets.

SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling

Xianglong He,Zi-Xin Zou,Chia-Hao Chen,Yuan-Chen Guo,Ding Liang,Chun Yuan,Wanli Ouyang,Yan-Pei Cao,Yangguang Li

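Task: Propose SparseFlex, a sparse-structured isosurface representation that enables differentiable mesh reconstruction at high resolution (up to 1024^3) directly from rendering losses.

Motivation: Existing implicit field methods require costly, detail-degrading watertight conversion, while other approaches struggle at high resolutions.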

Details

Method: Combines the accuracy of Flexicubes with a sparse voxel structure and introduces a frustum-aware sectional voxel training strategy that activates only the voxels relevant to rendering, substantially reducing memory consumption. Result: Experiments show state-of-the-art reconstruction accuracy, with a ~82% reduction in Chamfer Distance and a ~88% increase in F-score, and generation of high-resolution 3D shapes with arbitrary topology. Conclusion: By enabling high-resolution differentiable mesh reconstruction and generation, SparseFlex significantly advances the state of the art in 3D shape representation and modeling. Abstract: Creating high-fidelity 3D meshes with arbitrary topology, including open surfaces and complex interiors, remains a significant challenge. Existing implicit field methods often require costly and detail-degrading watertight conversion, while other approaches struggle with high resolutions. This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to $1024^3$ directly from rendering losses. SparseFlex combines the accuracy of Flexicubes with a sparse voxel structure, focusing computation on surface-adjacent regions and efficiently handling open surfaces. Crucially, we introduce a frustum-aware sectional voxel training strategy that activates only relevant voxels during rendering, dramatically reducing memory consumption and enabling high-resolution training. This also allows, for the first time, the reconstruction of mesh interiors using only rendering supervision. Building upon this, we demonstrate a complete shape modeling pipeline by training a variational autoencoder (VAE) and a rectified flow transformer for high-quality 3D shape generation. Our experiments show state-of-the-art reconstruction accuracy, with a ~82% reduction in Chamfer Distance and a ~88% increase in F-score compared to previous methods, and demonstrate the generation of high-resolution, detailed 3D shapes with arbitrary topology. By enabling high-resolution, differentiable mesh reconstruction and generation with rendering losses, SparseFlex significantly advances the state of the art in 3D shape representation and modeling.
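The two reconstruction metrics quoted above can be sketched on small point sets as follows. Conventions vary between papers (squared versus unsquared distances, sum versus mean, choice of threshold), so the definitions, function names, and the threshold parameter `tau` here are assumptions rather than the paper's exact protocol. The brute-force nearest-neighbor search is only suitable for tiny inputs.

```python
import math

def _nearest(p, points):
    """Distance from point p to its nearest neighbor in points."""
    return min(math.dist(p, q) for q in points)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    a_to_b = sum(_nearest(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest(q, a) for q in b) / len(b)
    return a_to_b + b_to_a

def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(_nearest(p, gt) <= tau for p in pred) / len(pred)
    recall = sum(_nearest(q, pred) <= tau for q in gt) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Lower Chamfer distance and higher F-score both indicate that the reconstructed surface samples lie closer to the ground-truth samples.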

3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models

Yuhan Zhang,Mengchen Zhang,Tong Wu,Tengfei Wang,Gordon Wetzstein,Dahua Lin,Ziwei Liu

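Task: Develop an automated evaluation system (3DGen-Score and 3DGen-Eval) that unifies quality evaluation of text-to-3D and image-to-3D generation.

Motivation: 3D generation is advancing rapidly, but automatic evaluation has not stayed aligned with human perception, and the field lacks a comprehensive preference dataset.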

Details

Method: Builds the 3DGen-Arena platform to collect human preference data (3DGen-Bench), then trains a CLIP-based scoring model and an MLLM-based automatic evaluator on it. Result: The scoring model predicts human preferences effectively and correlates with human rankings better than existing metrics. Conclusion: The 3DGen-Bench dataset and automated evaluation system should foster more equitable evaluation in 3D generation and promote the development of generative models and their downstream applications. Abstract: 3D generation is experiencing rapid advancements, while the development of 3D evaluation has not kept pace. How to keep automatic evaluation equitably aligned with human perception has become a well-recognized challenge. Recent advances in the field of language and image generation have explored human preferences and showcased respectable fitting ability. However, the 3D domain still lacks such a comprehensive preference dataset over generative models. To mitigate this absence, we develop 3DGen-Arena, an integrated platform in a battle manner. Then, we carefully design diverse text and image prompts and leverage the arena platform to gather human preferences from both public users and expert annotators, resulting in a large-scale multi-dimension human preference dataset 3DGen-Bench. Using this dataset, we further train a CLIP-based scoring model, 3DGen-Score, and an MLLM-based automatic evaluator, 3DGen-Eval. These two models innovatively unify the quality evaluation of text-to-3D and image-to-3D generation, and jointly form our automated evaluation system with their respective strengths. Extensive experiments demonstrate the efficacy of our scoring model in predicting human preferences, exhibiting a superior correlation with human ranks compared to existing metrics. We believe that our 3DGen-Bench dataset and automated evaluation system will foster a more equitable evaluation in the field of 3D generation, further promoting the development of 3D generative models and their downstream applications.
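Correlation with human ranks is commonly measured with a rank correlation coefficient. The sketch below computes Spearman's rho, which is one plausible choice; the abstract does not say which coefficient the authors use, the function names are hypothetical, and the shortcut formula assumes there are no tied values.

```python
def ranks(values):
    """1-based ranks of values in ascending order (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(metric_scores, human_scores):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(metric_scores)
    rank_x, rank_y = ranks(metric_scores), ranks(human_scores)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

A value near 1 means the automatic metric orders the generated assets almost exactly as human annotators do; a value near 0 means the two orderings are unrelated.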

CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

Aniket Didolkar,Andrii Zadaianchuk,Rabiul Awal,Maximilian Seitzer,Efstratios Gavves,Aishwarya Agrawal

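Task: Propose CTRL-O, a user-controllable object-centric representation learning approach in which language descriptions guide object representations.

Motivation: Existing object-centric models lack controllability: user input cannot guide which objects are represented, which limits their range of applications.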

Details

Method: Conditions object representations (slots) on language descriptions, binding objects to language without mask supervision. Result: CTRL-O achieves targeted object-language binding in complex real-world scenes and performs strongly on text-to-image generation and visual question answering. Conclusion: By introducing user controllability, CTRL-O extends the capabilities of object-centric models and opens new possibilities for downstream tasks. Abstract: Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success in object discovery in diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. Next, we apply these controllable slot representations on two downstream vision language tasks: text-to-image generation and visual question answering. The proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.

LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis

Shitian Zhao,Qilong Wu,Xinyue Li,Bo Zhang,Ming Li,Qi Qin,Dongyang Liu,Kaipeng Zhang,Hongsheng Li,Yu Qiao,Peng Gao,Bin Fu,Zhen Li

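Task: Develop the LeX-Art suite to improve prompt expressiveness and text rendering fidelity in text-to-image synthesis.

Motivation: Bridge the gap between prompt expressiveness and text rendering fidelity by providing high-quality data and models.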

Details

Method: Takes a data-centric approach: constructs the high-quality LeX-10K dataset, develops the LeX-Enhancer model, and trains the LeX-FLUX and LeX-Lumina models. Result: LeX-Lumina achieves a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforms baselines in color, positional, and font accuracy. Conclusion: The LeX-Art suite significantly improves text-to-image synthesis; code, models, and datasets are publicly available. Abstract: We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on Deepseek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024$\times$1024 images. Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance. To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation. Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). Our codes, models, datasets, and demo are publicly available.
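PNED builds on normalized edit distance. The abstract does not define the pairwise matching step, so the sketch below shows only the plain building block, a Levenshtein distance divided by the longer string's length; the function names and this normalization are assumptions, not the paper's definition.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        cur = [i]
        for j, char_b in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                       # delete char_a
                cur[j - 1] + 1,                    # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute or match
            ))
        prev = cur
    return prev[-1]

def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

For rendered-text evaluation, `a` would be the text requested in the prompt and `b` the text read back from the generated image by an OCR system; 0 means a perfect match.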

Reconstructing Humans with a Biomechanically Accurate Skeleton

Yan Xia,Xiaowei Zhou,Etienne Vouga,Qixing Huang,Georgios Pavlakos

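Task: Reconstruct biomechanically accurate 3D humans from a single image.

Motivation: Address the weak performance of existing methods under extreme 3D poses and viewpoints and their violation of joint angle limits.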

Details

Method: Trains a transformer-based model to estimate the parameters of a biomechanically accurate skeleton model, using pseudo-label generation and iterative refinement. Result: Competitive on standard benchmarks and significantly better than existing methods under extreme poses and viewpoints, while producing more natural joint rotations. Conclusion: The proposed method achieves higher biomechanical accuracy and performance in 3D human mesh recovery; code and models are open-sourced. Abstract: In this paper, we introduce a method for reconstructing 3D humans from a single image using a biomechanically accurate skeleton model. To achieve this, we train a transformer that takes an image as input and estimates the parameters of the model. Due to the lack of training data for this task, we build a pipeline to produce pseudo ground truth model parameters for single images and implement a training procedure that iteratively refines these pseudo labels. Compared to state-of-the-art methods for 3D human mesh recovery, our model achieves competitive performance on standard benchmarks, while it significantly outperforms them in settings with extreme 3D poses and viewpoints. Additionally, we show that previous reconstruction methods frequently violate joint angle limits, leading to unnatural rotations. In contrast, our approach leverages the biomechanically plausible degrees of freedom making more realistic joint rotation estimates. We validate our approach across multiple human pose estimation benchmarks. We make the code, models and data available at: https://isshikihugh.github.io/HSMR/

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

Dian Zheng,Ziqi Huang,Hongbo Liu,Kai Zou,Yinan He,Fan Zhang,Yuanhan Zhang,Jingwen He,Wei-Shi Zheng,Yu Qiao,Ziwei Liu

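Task: Develop VBench-2.0, a next-generation benchmark for evaluating the intrinsic faithfulness of video generative models.

Motivation: Existing benchmarks (such as VBench) mainly assess whether videos are visually convincing and temporally consistent and overlook whether they adhere to real-world principles (physical laws, commonsense reasoning, and so on), so a more comprehensive evaluation tool is needed.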

Details

Method: VBench-2.0 evaluates video generative models along five key dimensions (Human Fidelity, Controllability, Creativity, Physics, and Commonsense), combining generalist models (such as VLMs and LLMs) with specialist methods (such as anomaly detection) for multi-dimensional analysis. Result: Through fine-grained capability breakdowns and extensive human annotation, VBench-2.0 provides a more comprehensive evaluation framework aimed at pushing video generative models toward intrinsic faithfulness. Conclusion: VBench-2.0 aims to set a new evaluation standard for the next generation of video generative models, emphasizing intrinsic faithfulness, which matters for applications such as AI-assisted filmmaking and simulated world modeling. Abstract: Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focuses on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real "world models" through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI-assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities. Tailored for individual dimensions, our evaluation framework integrates generalists such as state-of-the-art VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive annotations to ensure alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models in pursuit of intrinsic faithfulness.

Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck

Adrian Bulat,Yassine Ouali,Georgios Tzimiropoulos

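Task: Compress the vision tokens of a Large Vision Language Model (LVLM) into a representation that suits both generative and discriminative tasks, is nearly lossless, and is storage-efficient.

Motivation: Make visual information compression general across generative and discriminative tasks while keeping it efficient and nearly lossless.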

Details

Method: Proposes Fwd2Bot, which uses a "double-forward pass" training strategy that combines an autoregressive loss with a contrastive loss and is further enhanced by stage-specific adapters. Result: Fwd2Bot produces highly informative compressed representations, with a 2x higher compression rate on generative tasks and new state-of-the-art results on image retrieval and compositionality for discriminative tasks. Conclusion: Fwd2Bot is an efficient visual information compression method that applies to a range of tasks and performs strongly. Abstract: In this work, we aim to compress the vision tokens of a Large Vision Language Model (LVLM) into a representation that is simultaneously suitable for (a) generative and (b) discriminative tasks, (c) is nearly lossless, and (d) is storage-efficient. We propose a novel compression approach, called Fwd2Bot, that uses the LVLM itself to compress the visual information in a task-agnostic manner. At the core of Fwd2Bot is a "double-forward pass" training strategy, whereby, during the first forward pass, the LLM (of the LVLM) creates a bottleneck by condensing the visual information into a small number of summary tokens. Then, using the same LLM, the second forward pass processes the language instruction(s) alongside the summary tokens, used as a direct replacement for the image ones. The training signal is provided by two losses: an autoregressive one applied after the second pass that provides a direct optimization objective for compression, and a contrastive loss, applied after the first pass, that further boosts the representation strength, especially for discriminative tasks. The training is further enhanced by stage-specific adapters. We accompany the proposed method with an in-depth ablation study. Overall, Fwd2Bot results in highly informative compressed representations suitable for both generative and discriminative tasks. For generative tasks, we offer a 2x higher compression rate without compromising the generative capabilities, setting a new state-of-the-art result. For discriminative tasks, we set a new state-of-the-art on image retrieval and compositionality.

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Qi Qin,Le Zhuo,Yi Xin,Ruoyi Du,Zhen Li,Bin Fu,Yiting Lu,Jiakang Yuan,Xinyue Li,Dongyang Liu,Xiangyang Zhu,Manyuan Zhang,Will Beddow,Erwann Millon,Victor Perez,Wenhai Wang,Conghui He,Bo Zhang,Xiaohong Liu,Hongsheng Li,Yu Qiao,Chang Xu,Peng Gao

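Task: Introduce Lumina-Image 2.0, an advanced text-to-image generation framework that makes significant progress over its predecessor, Lumina-Next.

Motivation: Improve the quality and efficiency of text-to-image generation through a unified architecture and efficient training strategies.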

Details

Method: Adopts a unified architecture (Unified Next-DiT) and a unified captioning system (UniCap), combined with multi-stage progressive training and inference acceleration techniques. Result: Performs strongly on academic benchmarks and public text-to-image arenas with only 2.6B parameters. Conclusion: Lumina-Image 2.0 demonstrates scalability and design efficiency; code and models are open-sourced. Abstract: We introduce Lumina-Image 2.0, an advanced text-to-image generation framework that achieves significant progress compared to previous work, Lumina-Next. Lumina-Image 2.0 is built upon two key principles: (1) Unification - it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. Besides, since high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), specifically designed for T2I generation tasks. UniCap excels at generating comprehensive and accurate captions, accelerating convergence and enhancing prompt adherence. (2) Efficiency - to improve the efficiency of our proposed model, we develop multi-stage progressive training strategies and introduce inference acceleration techniques without compromising image quality. Extensive evaluations on academic benchmarks and public text-to-image arenas show that Lumina-Image 2.0 delivers strong performance even with only 2.6B parameters, highlighting its scalability and design efficiency. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-Image-2.0.

Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video

David Yifan Yao,Albert J. Zhai,Shenlong Wang

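Task: Propose Uni4D, a unified approach to understanding dynamic scenes, covering static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking.

Motivation: Harness the capabilities of pretrained visual foundation models to address the difficulty of training a single model for comprehensive 4D understanding.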

Details

Method: A multi-stage optimization framework that combines multiple pretrained models without retraining or fine-tuning. Result: Achieves state-of-the-art performance in dynamic 4D modeling with superior visual quality. Conclusion: Uni4D demonstrates the effectiveness of repurposing visual foundation models for 4D understanding. Abstract: This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.

Exploring the Evolution of Physics Cognition in Video Generation: A Survey

Minghui Lin,Xiang Wang,Yishan Wang,Shu Wang,Fengqi Dai,Pengxiang Ding,Cunxiang Wang,Zhengrong Zuo,Nong Sang,Siteng Huang,Donglin Wang

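Task: Systematically summarize architecture designs for physical cognition in video generation and their applications.

Motivation: Although video generation has advanced in visual realism, generated content often violates physical laws and lacks physical fidelity; a systematic survey is needed to fill this gap.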

Details

Method: Organizes the evolution of physical cognition from a cognitive science perspective and proposes a three-tier taxonomy covering state-of-the-art methods, classical paradigms, and benchmarks. Result: Summarizes the key challenges of physical cognition in video generation and directions for future research. Conclusion: Through interdisciplinary analysis, the survey aims to move video generation from "visual mimicry" toward a new phase of "human-like physical comprehension". Abstract: Video generation has made significant progress recently, especially with the rapid advancement of diffusion models. Despite this, their deficiencies in physical cognition have gradually received widespread attention - generated content often violates the fundamental laws of physics, falling into the dilemma of "visual realism but physical absurdity". Researchers have increasingly recognized the importance of physical fidelity in video generation and have attempted to integrate heuristic physical cognition such as motion representations and physical knowledge into generative systems to simulate real-world dynamic scenarios. Considering the lack of a systematic overview in this field, this survey aims to provide a comprehensive summary of architecture designs and their applications to fill this gap. Specifically, we discuss and organize the evolutionary process of physical cognition in video generation from a cognitive science perspective, while proposing a three-tier taxonomy: 1) basic schema perception for generation, 2) passive cognition of physical knowledge for generation, and 3) active cognition for world simulation, encompassing state-of-the-art methods, classical paradigms, and benchmarks. Subsequently, we emphasize the inherent key challenges in this domain and delineate potential pathways for future research, contributing to advancing the frontiers of discussion in both academia and industry. Through structured review and interdisciplinary analysis, this survey aims to provide directional guidance for developing interpretable, controllable, and physically consistent video generation paradigms, thereby propelling generative models from the stage of "visual mimicry" towards a new phase of "human-like physical comprehension".

Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence

Haolin Liu,Xiaohang Zhan,Zizheng Yan,Zhongjin Luo,Yuxin Wen,Xiaoguang Han

Task: Establish 3D shape correspondence, particularly in complex scenarios.

Motivation: Existing functional map methods perform poorly in complex scenarios such as non-isometric shape discrepancies, so a more stable method for shape correspondence estimation is needed.

Details

Method: Proposes the Stable-SCore framework, which combines a foundation model for 2D character correspondence with a Semantic Flow Guided Registration approach. Result: Significantly surpasses existing methods in challenging scenarios and supports a wide range of real applications. Conclusion: Stable-SCore offers a stable and efficient solution to complex 3D shape correspondence problems. Abstract: Establishing character shape correspondence is a critical and fundamental task in computer vision and graphics, with diverse applications including re-topology, attribute transfer, and shape interpolation. Current dominant functional map methods, while effective in controlled scenarios, struggle in real situations with more complex challenges such as non-isometric shape discrepancies. In response, we revisit registration-for-correspondence methods and tap their potential for more stable shape correspondence estimation. To overcome their common issues including unstable deformations and the necessity for careful pre-alignment or high-quality initial 3D correspondences, we introduce Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence. We first re-purpose a foundation model for 2D character correspondence that ensures reliable and stable 2D mappings. Crucially, we propose a novel Semantic Flow Guided Registration approach that leverages 2D correspondence to guide mesh deformations. Our framework significantly surpasses existing methods in challenging scenarios, and brings possibilities for a wide array of real applications, as demonstrated in our results.

Semantic Consistent Language Gaussian Splatting for Point-Level Open-vocabulary Querying

Hairong Yin,Huangying Zhan,Yi Xu,Raymond A. Yeh

Task: Open-vocabulary querying in 3D Gaussian Splatting to identify semantically relevant regions based on a text query.

Motivation: Prior methods like LangSplat and OpenGaussian have limitations in directly querying 3D Gaussians or achieving semantic consistency.

Details

Method: A point-level querying method leveraging SAM2 masklets for semantic ground truth and a novel two-step querying approach. Result: Achieves better performance, e.g., a +20.42 mIoU improvement on the 3D-OVS dataset. Conclusion: The proposed method outperforms state-of-the-art approaches in open-vocabulary querying. Abstract: Open-vocabulary querying in 3D Gaussian Splatting aims to identify semantically relevant regions within a 3D Gaussian representation based on a given text query. Prior work, such as LangSplat, addressed this task by retrieving these regions in the form of segmentation masks on 2D renderings. More recently, OpenGaussian introduced point-level querying, which directly selects a subset of 3D Gaussians. In this work, we propose a point-level querying method that builds upon LangSplat's framework. Our approach improves the framework in two key ways: (a) we leverage masklets from the Segment Anything Model 2 (SAM2) to establish semantically consistent ground truth for distilling the language Gaussians; (b) we introduce a novel two-step querying approach that first retrieves the distilled ground truth and subsequently uses the ground truth to query the individual Gaussians. Experimental evaluations on three benchmark datasets demonstrate that the proposed method achieves better performance compared to state-of-the-art approaches. For instance, our method achieves an mIoU improvement of +20.42 on the 3D-OVS dataset.

Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting

Anand Bhattad,Konpat Preechakul,Alexei A. Efros

Task: Propose a new scene understanding task called Visual Jenga, which reveals the intrinsic relationships between scene elements by progressively removing objects from an image.

Motivation: Inspired by the game Jenga, explore how removing objects affects scene coherence in order to understand the structural dependencies between scene elements.

Details

Method: A simple, data-driven, training-free approach that exploits the asymmetry in pairwise relationships between objects and uses a large inpainting model to generate counterfactuals that quantify the asymmetry. Result: The approach performs well on real-world images, supporting its effectiveness. Conclusion: The Visual Jenga task offers a new perspective on scene understanding, and the proposed approach lays a foundation for future research. Abstract: This paper proposes a novel scene understanding task called Visual Jenga. Drawing inspiration from the game Jenga, the proposed task involves progressively removing objects from a single image until only the background remains. Just as Jenga players must understand structural dependencies to maintain tower stability, our task reveals the intrinsic relationships between scene elements by systematically exploring which objects can be removed while preserving scene coherence in both a physical and a geometric sense. As a starting point for tackling the Visual Jenga task, we propose a simple, data-driven, training-free approach that is surprisingly effective on a range of real-world images. The principle behind our approach is to utilize the asymmetry in the pairwise relationships between objects within a scene and employ a large inpainting model to generate a set of counterfactuals to quantify the asymmetry.

A Unified Image-Dense Annotation Generation Model for Underwater Scenes

Hongkai Lin,Dingkang Liang,Zhenghao Qi,Xiang Bai

Task: Propose TIDE, a unified text-to-image and dense annotation generation method for underwater scenes.

Motivation: High-quality, large-scale underwater datasets with dense annotations are scarce because of the complex environment and high data collection costs.

Details

Method: Unifies text-to-image and text-to-dense-annotation generation within a single model, and introduces the Implicit Layout Sharing mechanism (ILS) and Time Adaptive Normalization (TAN) to optimize consistency. Result: A large-scale synthetic underwater dataset shows that the method improves existing underwater dense prediction models and mitigates data scarcity. Conclusion: TIDE offers a new perspective on alleviating data scarcity in other fields. Abstract: Underwater dense prediction, especially depth estimation and semantic segmentation, is crucial for gaining a comprehensive understanding of underwater scenes. Nevertheless, high-quality and large-scale underwater datasets with dense annotations remain scarce because of the complex environment and the exorbitant data collection costs. This paper proposes a unified Text-to-Image and DEnse annotation generation method (TIDE) for underwater scenes. It relies solely on text as input to simultaneously generate realistic underwater images and multiple highly consistent dense annotations. Specifically, we unify the generation of text-to-image and text-to-dense annotations within a single model. The Implicit Layout Sharing mechanism (ILS) and a cross-modal interaction method called Time Adaptive Normalization (TAN) are introduced to jointly optimize the consistency between image and dense annotations. We synthesize a large-scale underwater dataset using TIDE to validate the effectiveness of our method in underwater dense prediction tasks. The results demonstrate that our method effectively improves the performance of existing underwater dense prediction models and mitigates the scarcity of underwater data with dense annotations. We hope our method can offer new perspectives on alleviating data scarcity issues in other fields. The code is available at https://github.com/HongkLin/TIDE.

LOCORE: Image Re-ranking with Long-Context Sequence Modeling

Zilin Xiao,Pavel Suma,Ayush Sachdeva,Hao-Jen Wang,Giorgos Kordopatis-Zilos,Giorgos Tolias,Vicente Ordonez

Task: Propose LOCORE, a long-context re-ranking model for image retrieval.

Motivation: Existing methods perform either pair-wise similarity estimation with local descriptors or list-wise re-ranking with global descriptors; LOCORE is the first method to perform list-wise re-ranking with local descriptors.

Details

Method: Uses efficient long-context sequence models to capture dependencies between the query and gallery images at the local-descriptor level, and processes long shortlists at test time with a sliding window strategy. Result: Outperforms other re-rankers on image retrieval benchmarks (ROxf, RPar, SOP, In-Shop, and CUB-200) with latency comparable to pair-wise local descriptor re-rankers. Conclusion: With long-context sequence modeling and a sliding window strategy, LOCORE achieves efficient, high-performing list-wise re-ranking for image retrieval. Abstract: We introduce LOCORE, Long-Context Re-ranker, a model that takes as input local descriptors corresponding to an image query and a list of gallery images and outputs similarity scores between the query and each gallery image. This model is used for image retrieval, where typically a first ranking is performed with an efficient similarity measure, and then a shortlist of top-ranked images is re-ranked based on a more fine-grained similarity measure. Compared to existing methods that perform pair-wise similarity estimation with local descriptors or list-wise re-ranking with global descriptors, LOCORE is the first method to perform list-wise re-ranking with local descriptors. To achieve this, we leverage efficient long-context sequence models to effectively capture the dependencies between query and gallery images at the local-descriptor level. During testing, we process long shortlists with a sliding window strategy that is tailored to overcome the context size limitations of sequence models. Our approach achieves superior performance compared with other re-rankers on established image retrieval benchmarks of landmarks (ROxf and RPar), products (SOP), fashion items (In-Shop), and bird species (CUB-200) while having comparable latency to the pair-wise local descriptor re-rankers.
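
The abstract does not spell out how the sliding window moves over the shortlist, so the following Python sketch shows only one plausible arrangement. It assumes a window that starts at the tail of the shortlist and moves toward the head with overlap, so that a strong candidate found late can climb across windows. The names `sliding_window_rerank` and `score_window`, and the `window` and `stride` parameters, are hypothetical and do not come from the paper; `score_window` stands in for a list-wise re-ranker that can only see `window` gallery images at once.

```python
from typing import Callable, List, Sequence


def sliding_window_rerank(
    shortlist: Sequence[str],
    score_window: Callable[[Sequence[str]], List[float]],
    window: int = 4,
    stride: int = 2,
) -> List[str]:
    """Re-rank a long shortlist with a scorer limited to `window` items.

    Assumption (not from the paper): the window slides from the tail of the
    list to the head, and consecutive windows overlap by `window - stride`.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    ranked = list(shortlist)
    start = max(len(ranked) - window, 0)
    while True:
        chunk = ranked[start:start + window]
        scores = score_window(chunk)  # one list-wise call per window
        order = sorted(range(len(chunk)), key=lambda i: scores[i], reverse=True)
        ranked[start:start + window] = [chunk[i] for i in order]
        if start == 0:
            break
        start = max(start - stride, 0)
    return ranked
```

With a toy scorer that looks up a fixed relevance value per image, the best candidate reaches the top even when it starts at the end of the shortlist, although items outside the last window processed are not guaranteed to be globally sorted.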

Optimal Stepsize for Diffusion Sampling

Jianning Pei,Han Hu,Shuyang Gu

Task: Propose a dynamic programming framework that optimizes the stepsize schedule of diffusion models by distilling knowledge from reference trajectories.

Motivation: Diffusion models achieve excellent generation quality but suffer from computationally intensive sampling due to suboptimal step discretization.

Details

Method: Uses a dynamic programming framework that optimizes the stepsize schedule through recursive error minimization and exploits optimal substructure to guarantee global discretization bounds. Result: Experiments show a 10x speedup in text-to-image generation while preserving 99.4% of the performance. Conclusion: The proposed Optimal Stepsize Distillation is robust across architectures, ODE solvers, and noise schedules, and significantly improves the sampling efficiency of diffusion models. Abstract: Diffusion models achieve remarkable generation quality but suffer from computationally intensive sampling due to suboptimal step discretization. While existing works focus on optimizing denoising directions, we address the principled design of stepsize schedules. This paper proposes Optimal Stepsize Distillation, a dynamic programming framework that extracts theoretically optimal schedules by distilling knowledge from reference trajectories. By reformulating stepsize optimization as recursive error minimization, our method guarantees global discretization bounds through optimal substructure exploitation. Crucially, the distilled schedules demonstrate strong robustness across architectures, ODE solvers, and noise schedules. Experiments show 10x accelerated text-to-image generation while preserving 99.4% performance on GenEval. Our code is available at https://github.com/bebebe666/OptimalSteps.
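
The recursive error minimization mentioned in the abstract can be illustrated with a simplified dynamic program. The Python sketch below assumes, as a simplification not stated in the source, that the error of a schedule is the sum of independent per-jump costs over a fine reference grid. `optimal_schedule` and `cost` are hypothetical names, and the paper's actual objective against reference trajectories may differ.

```python
from typing import Callable, List, Tuple


def optimal_schedule(
    n: int, k: int, cost: Callable[[int, int], float]
) -> Tuple[float, List[int]]:
    """Choose exactly k jumps over grid indices 0..n with minimal summed cost.

    cost(i, j) is a stand-in for the discretization error of jumping directly
    from fine-grid index i to index j (assumed additive across jumps).
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    inf = float("inf")
    # best[s][j]: minimal cost of reaching grid index j with exactly s jumps.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    prev = [[-1] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for s in range(1, k + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):
                if best[s - 1][i] == inf:
                    continue
                cand = best[s - 1][i] + cost(i, j)
                if cand < best[s][j]:
                    best[s][j] = cand
                    prev[s][j] = i
    # Backtrack from the final grid index to recover the chosen schedule.
    path = [n]
    j = n
    for s in range(k, 0, -1):
        j = prev[s][j]
        path.append(j)
    path.reverse()
    return best[k][n], path
```

The optimal-substructure property is what makes the table valid: the best way to reach index `j` in `s` jumps extends the best way to reach some earlier index in `s - 1` jumps. With a quadratic cost on a uniform grid, the program recovers an evenly spaced schedule, as expected.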

Ziyu Guo,Young Yoon Lee,Joseph Liu,Yizhak Ben-Shabat,Victor Zordan,Mubbasir Kapadia

Task: Present StyleMotif, a model that generates motion conditioned on both content and multi-modal style.

Motivation: Existing approaches either focus on generating diverse motion content or transfer style from sequences, whereas StyleMotif aims to seamlessly synthesize motion across a wide range of content while incorporating stylistic cues from multi-modal inputs.

Details

Method: Introduces a style-content cross fusion mechanism and aligns a style encoder with a pre-trained multi-modal model so that generated motion accurately captures the reference style while preserving realism. Result: Experiments show that the framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization. Conclusion: StyleMotif enables more nuanced motion synthesis; code and pre-trained models will be released upon acceptance. Abstract: We present StyleMotif, a novel Stylized Motion Latent Diffusion model, generating motion conditioned on both content and style from multiple modalities. Unlike existing approaches that either focus on generating diverse motion content or transferring style from sequences, StyleMotif seamlessly synthesizes motion across a wide range of content while incorporating stylistic cues from multi-modal inputs, including motion, text, image, video, and audio. To achieve this, we introduce a style-content cross fusion mechanism and align a style encoder with a pre-trained multi-modal model, ensuring that the generated motion accurately captures the reference style while preserving realism. Extensive experiments demonstrate that our framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Source code and pre-trained models will be released upon acceptance. Project Page: https://stylemotif.github.io

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng,Kaixiong Gong,Bohao Li,Zonghao Guo,Yibing Wang,Tianshuo Peng,Benyou Wang,Xiangyu Yue

Task: Explore how the R1 paradigm can elicit video reasoning in multimodal large language models (MLLMs).

Motivation: Inspired by DeepSeek-R1's success in eliciting reasoning through rule-based reinforcement learning (RL); however, directly applying RL training to video reasoning faces a lack of temporal modeling and a scarcity of data.

Details

Method: Proposes the T-GRPO algorithm to exploit temporal information in videos and incorporates high-quality image-reasoning data into training. Two datasets are constructed: Video-R1-COT-165k and Video-R1-260k. Result: Video-R1 improves significantly on video reasoning benchmarks (e.g., VideoMMMU, VSI-Bench) and general video benchmarks (e.g., MVBench, TempCompass); Video-R1-7B reaches 35.8% accuracy on VSI-Bench, surpassing GPT-4o. Conclusion: Video-R1 addresses these challenges in video reasoning and advances the field by releasing its code, models, and data. Abstract: Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for eliciting video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-COT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass. Notably, Video-R1-7B attains a 35.8% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released.

Test-Time Visual In-Context Tuning

Jiahao Xie,Alessio Tonioni,Nathalie Rauschmayr,Federico Tombari,Bernt Schiele

Task: Propose test-time Visual In-Context Tuning (VICT), a method that improves the generalization of visual in-context learning (VICL) under distribution shifts.

Motivation: The existing VICL paradigm generalizes poorly under distribution shifts, so a method that adapts on the fly to new test samples is needed.

Details

Method: Flips the roles of the task prompts and the test sample and uses a cycle consistency loss to reconstruct the original task prompt output. Result: Experiments on six representative vision tasks with 15 common corruptions show that VICT significantly improves the generalizability of VICL to new domains. Conclusion: Beyond improving the generalizability of VICL, VICT shows potential for adapting to unseen tasks at test time. Abstract: Visual in-context learning (VICL), as a new paradigm in computer vision, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing VICL paradigm exhibits poor generalizability under distribution shifts. In this work, we propose test-time Visual In-Context Tuning (VICT), a method that can adapt VICL models on the fly with a single test sample. Specifically, we flip the role between the task prompts and the test sample and use a cycle consistency loss to reconstruct the original task prompt output. Our key insight is that a model should be aware of a new test distribution if it can successfully recover the original task prompts. Extensive experiments on six representative vision tasks ranging from high-level visual understanding to low-level image processing, with 15 common corruptions, demonstrate that our VICT can improve the generalizability of VICL to unseen new domains. In addition, we show the potential of applying VICT for unseen tasks at test time. Code: https://github.com/Jiahao000/VICT.

HS-SLAM: Hybrid Representation with Structural Supervision for Improved Dense SLAM

Ziren Gong,Fabio Tosi,Youmin Zhang,Stefano Mattoccia,Matteo Poggi

Task: Propose HS-SLAM to address the challenges NeRF-based SLAM faces in scene representation, structural information capture, and global consistency.

Motivation: Existing methods fall short in scene representation, structural information capture, and global consistency, especially in scenes with significant movement or scenes that have been forgotten.

Details

Method: Proposes a hybrid encoding network (combining hash-grid, tri-planes, and one-blob), structural supervision that samples patches of non-local pixels, and an active global bundle adjustment (BA) to eliminate camera drift. Result: Experiments show that HS-SLAM outperforms the baselines in tracking and reconstruction accuracy while maintaining the efficiency required for robotics. Conclusion: By improving scene representation and global consistency, HS-SLAM significantly improves NeRF-based SLAM. Abstract: NeRF-based SLAM has recently achieved promising results in tracking and reconstruction. However, existing methods face challenges in providing sufficient scene representation, capturing structural information, and maintaining global consistency in scenes with significant movement or scenes that have been forgotten. To this end, we present HS-SLAM to tackle these problems. To enhance scene representation capacity, we propose a hybrid encoding network that combines the complementary strengths of hash-grid, tri-planes, and one-blob, improving the completeness and smoothness of reconstruction. Additionally, we introduce structural supervision by sampling patches of non-local pixels rather than individual rays to better capture the scene structure. To ensure global consistency, we implement an active global bundle adjustment (BA) to eliminate camera drifts and mitigate accumulative errors. Experimental results demonstrate that HS-SLAM outperforms the baselines in tracking and reconstruction accuracy while maintaining the efficiency required for robotics.

X$^{2}$-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction

Weihao Yu,Yuanhao Cai,Ruyi Zha,Zhiwen Fan,Chenxin Li,Yixuan Yuan

Task: Propose X²-Gaussian, a new framework for continuous-time 4D-CT reconstruction.

Motivation: Conventional phase-binning workflows in 4D-CT reconstruction suffer from motion misalignment and limited clinical practicality.

Details

Method: Integrates dynamic radiative Gaussian splatting with self-supervised respiratory motion learning, predicting time-varying Gaussian deformations with a spatiotemporal encoder-decoder architecture and eliminating phase discretization. Result: Experiments show a 9.93 dB PSNR gain over traditional methods and a 2.25 dB gain over prior Gaussian splatting techniques. Conclusion: By combining continuous motion modeling with hardware-free period learning, X²-Gaussian advances high-fidelity 4D-CT reconstruction for dynamic clinical imaging. Abstract: Four-dimensional computed tomography (4D CT) reconstruction is crucial for capturing dynamic anatomical changes but faces inherent limitations from conventional phase-binning workflows. Current methods discretize temporal resolution into fixed phases with respiratory gating devices, introducing motion misalignment and restricting clinical practicality. In this paper, we propose X$^2$-Gaussian, a novel framework that enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning. Our approach models anatomical dynamics through a spatiotemporal encoder-decoder architecture that predicts time-varying Gaussian deformations, eliminating phase discretization. To remove dependency on external gating devices, we introduce a physiology-driven periodic consistency loss that learns patient-specific breathing cycles directly from projections via differentiable optimization. Extensive experiments demonstrate state-of-the-art performance, achieving a 9.93 dB PSNR gain over traditional methods and 2.25 dB improvement against prior Gaussian splatting techniques. By unifying continuous motion modeling with hardware-free period learning, X$^2$-Gaussian advances high-fidelity 4D CT reconstruction for dynamic clinical imaging. Project website at: https://x2-gaussian.github.io/.

Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation

Reza Qorbani,Gianluca Villani,Theodoros Panagiotakopoulos,Marc Botet Colomer,Linus Härenstam-Nielsen,Mattia Segu,Pier Luigi Dovesi,Jussi Karlgren,Daniel Cremers,Federico Tombari,Matteo Poggi

Task: Propose SemLA, a training-free, test-time domain adaptation framework for open-vocabulary semantic segmentation.

Motivation: Address the performance drop of open-vocabulary semantic segmentation models under large shifts between training and test domains, without requiring fine-tuning.

Details

Method: Uses a library of LoRA-based adapters and dynamically selects and merges the most relevant ones via CLIP embeddings, building a tailored model for each input. Result: Performs strongly on a 20-domain benchmark, with high adaptability and significant performance gains. Conclusion: SemLA sets a new standard for domain adaptation in open-vocabulary semantic segmentation, with advantages in efficiency, explainability, and privacy protection. Abstract: Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries, providing versatile performance on novel datasets. However, large shifts between training and test domains degrade their performance, requiring fine-tuning for effective real-world applications. We introduce Semantic Library Adaptation (SemLA), a novel framework for training-free, test-time domain adaptation. SemLA leverages a library of LoRA-based adapters indexed with CLIP embeddings, dynamically merging the most relevant adapters based on proximity to the target domain in the embedding space. This approach constructs an ad-hoc model tailored to each specific input without additional training. Our method scales efficiently, enhances explainability by tracking adapter contributions, and inherently protects data privacy, making it ideal for sensitive applications. Comprehensive experiments on a 20-domain benchmark built over 10 standard datasets demonstrate SemLA's superior adaptability and performance across diverse settings, establishing a new standard in domain adaptation for open-vocabulary semantic segmentation.
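
The retrieve-then-merge idea can be sketched in a few lines of Python. The abstract only says that adapters are merged "based on proximity to the target domain in the embedding space", so the specific choices here are assumptions for illustration: cosine similarity, a top-k cut, and softmax weights with a temperature. `retrieve_and_fuse` and the flat-list adapter format are hypothetical, and real LoRA adapters would be merged per weight matrix rather than as one flat vector.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_and_fuse(
    query: Sequence[float],
    library: Dict[str, Tuple[Sequence[float], Sequence[float]]],
    top_k: int = 2,
    temperature: float = 0.1,
) -> Tuple[List[float], Dict[str, float]]:
    """library maps adapter name -> (index embedding, adapter weights).

    Returns the fused weights and the per-adapter contribution, which is the
    quantity one would log to explain which domains shaped the output.
    """
    sims = sorted(
        ((cosine(query, emb), name) for name, (emb, _) in library.items()),
        reverse=True,
    )[:top_k]
    top = max(s for s, _ in sims)
    exps = [(math.exp((s - top) / temperature), name) for s, name in sims]
    z = sum(e for e, _ in exps)
    contributions = {name: e / z for e, name in exps}
    dim = len(next(iter(library.values()))[1])
    fused = [0.0] * dim
    for name, weight in contributions.items():
        for i, value in enumerate(library[name][1]):
            fused[i] += weight * value
    return fused, contributions
```

Returning the contribution dictionary alongside the fused weights mirrors the explainability claim in the abstract: each output can be traced back to the adapters that were retrieved for it.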

VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models

Chi-Pin Huang,Yen-Siang Wu,Hung-Kai Chung,Kai-Po Chang,Fu-En Yang,Yu-Chiang Frank Wang

Task: Propose VideoMage, a unified framework for video customization over multiple subjects and their interactive motions.

Motivation: Existing methods mainly personalize a single concept (subject identity or motion pattern) and cannot effectively handle multiple subjects and their interactive motions.

Details

Method: Uses subject and motion LoRAs to capture personalized content from user-provided images and videos, and disentangles motion patterns from visual appearance with an appearance-agnostic motion learning approach. A spatial-temporal composition scheme then guides interactions among subjects within the desired motion patterns. Result: Experiments show that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions. Conclusion: VideoMage is an effective unified framework that handles multiple subjects and their interactive motions together and generates high-quality videos. Abstract: Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose a unified framework VideoMage for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.

Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model

Abdelrahman Shaker,Muhammad Maaz,Chenhui Gou,Hamid Rezatofighi,Salman Khan,Fahad Shahbaz Khan

Task: Propose Mobile-VideoGPT, an efficient multimodal framework that addresses the high computational requirements and low efficiency of video understanding models.

Motivation: Traditional video large multimodal models (LMMs) have high computational requirements, large parameter counts, and slow inference, which makes them hard to use in practice.

Details

Method: Uses lightweight dual visual encoders, efficient projectors, and a small language model (SLM), combined with an Attention-Based Frame Scoring mechanism and an efficient token projector. Result: Mobile-VideoGPT-0.5B performs strongly on multiple benchmarks, with high throughput and fewer parameters. Conclusion: Mobile-VideoGPT is an efficient and practical solution for video understanding. Abstract: Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism to select the key-frames, along with an efficient token projector that prunes redundant visual tokens and preserves essential contextual cues. We evaluate our model across six well-established video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average with 40% fewer parameters and more than 2x higher throughput. Our code and models are publicly available at: https://github.com/Amshaker/Mobile-VideoGPT.

Dynamic Allocation Hypernetwork with Adaptive Model Recalibration for Federated Continual Learning

Xiaoming Qi,Jingyang Zhang,Huazhu Fu,Guanyu Yang,Shuo Li,Yueming Jin

Task: Propose FedDAH, a novel server-side federated continual learning pattern for collaborative learning under dynamic, asynchronous task streams in the medical domain.

Motivation: Existing server-side federated continual learning methods suffer from catastrophic forgetting and biased optimization under dynamic, asynchronous task streams, which limits their use in medical scenarios.

Details

Method: Proposes a dynamic allocation hypernetwork (DAHyper) that manages the mapping between tasks and model parameters, and introduces adaptive model recalibration (AMR) to address biased optimization. Result: Experiments on the AMOS dataset show that FedDAH outperforms other federated continual learning methods. Conclusion: With the dynamic allocation hypernetwork and adaptive model recalibration, FedDAH mitigates catastrophic forgetting and biased optimization and improves collaborative learning in medical scenarios. Abstract: Federated continual learning (FCL) offers an emerging pattern to facilitate the applicability of federated learning (FL) in real-world scenarios, where tasks evolve dynamically and asynchronously across clients, especially in medical scenarios. Existing server-side FCL methods in the natural-image domain construct a continually learnable server model by client aggregation on all involved tasks. However, they are challenged by: (1) Catastrophic forgetting for previously learned tasks, leading to error accumulation in the server model and making it difficult to sustain comprehensive knowledge across all tasks. (2) Biased optimization due to asynchronous tasks handled across different clients, leading to the collision of optimization targets of different clients at the same time steps. In this work, we take the first step to propose a novel server-side FCL pattern in the medical domain, Dynamic Allocation Hypernetwork with adaptive model recalibration (FedDAH). It facilitates collaborative learning under the distinct and dynamic task streams across clients. To alleviate catastrophic forgetting, we propose a dynamic allocation hypernetwork (DAHyper) where a continually updated hypernetwork is designed to manage the mapping between task identities and their associated model parameters, enabling the dynamic allocation of the model across clients. For the biased optimization, we introduce a novel adaptive model recalibration (AMR) to incorporate the candidate changes of historical models into current server updates, and assign weights to identical tasks across different time steps based on the similarity for continual optimization. Extensive experiments on the AMOS dataset demonstrate the superiority of our FedDAH to other FCL methods on sites with different task streams. The code is available at https://github.com/jinlab-imvr/FedDAH.

Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road Ahead

Viktor Schlegel,Anil A Bharath,Zilong Zhao,Kevin Yee

Task: Survey the state of research on privacy-preserving synthetic data and its application in specialized domains.

Motivation: Address data segregation caused by regulatory, privacy, or institutional reasons while balancing data utility against privacy protection.

Details

Method: Reviews the theoretical foundations of generative models and differential privacy, and evaluates four leading methods on five real-world datasets. Result: Under strict privacy constraints (ε ≤ 4), performance degrades significantly, revealing a gap between general-domain benchmarks and specialized-domain data. Conclusion: More robust evaluation frameworks, standardized benchmarks, and improved techniques are needed for privacy-preserving synthetic data to reach its potential. Abstract: Privacy-preserving synthetic data offers a promising solution to harness segregated data in high-stakes domains where information is compartmentalized for regulatory, privacy, or institutional reasons. This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy followed by a review of state-of-the-art methods across tabular data, images, and text. Our synthesis of evaluation approaches highlights the fundamental trade-off between utility for down-stream tasks and privacy guarantees, while identifying critical research gaps: the lack of realistic benchmarks representing specialized domains and insufficient empirical evaluations required to contextualise formal guarantees. Through empirical analysis of four leading methods on five real-world datasets from specialized domains, we demonstrate significant performance degradation under realistic privacy constraints ($\epsilon \leq 4$), revealing a substantial gap between results reported on general domain benchmarks and performance on domain-specific data. Our findings highlight key challenges including unaccounted privacy leakage, insufficient empirical verification of formal guarantees, and a critical deficit of realistic benchmarks. These challenges underscore the need for robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques to address the unique requirements of privacy-sensitive fields such that this technology can deliver on its considerable potential.
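
For readers unfamiliar with the privacy budget, the constraint $\epsilon \leq 4$ can be read against the standard definition of differential privacy. The block below states the usual $(\epsilon, \delta)$ formulation from the literature; the abstract does not say which variant the survey uses, so the notation ($\mathcal{M}$, $D$, $D'$, $S$, $\delta$) is an assumption here.

```latex
% A randomized mechanism M is (epsilon, delta)-differentially private if,
% for all datasets D and D' that differ in a single record and for every
% measurable set of outputs S:
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta .
\]
% Pure epsilon-DP is the special case delta = 0.
% For the bound quoted in the abstract:
\[
  \epsilon = 4 \;\Longrightarrow\; e^{\epsilon} = e^{4} \approx 54.6 .
\]
```

In words, with $\delta = 0$ and $\epsilon = 4$, adding or removing one record can change the probability of any output by a factor of at most about 54.6, and smaller $\epsilon$ tightens this factor toward 1.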

CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis

Youngkyoon Jang,Eduardo Pérez-Pellitero

Task: Propose Covisibility Map-based Gaussian Splatting (CoMapGS) to recover underrepresented regions in sparse novel view synthesis.

Motivation: Address the recovery of both high- and low-uncertainty regions in sparse novel view synthesis by enhancing initial point clouds with covisibility maps and applying uncertainty-aware weighted supervision.

Details

Method: Constructs covisibility maps, enhances initial point clouds, and applies uncertainty-aware weighted supervision using a proximity classifier. Result: CoMapGS outperforms existing methods on datasets including Mip-NeRF 360 and LLFF. Conclusion: With covisibility maps and adaptive supervision, CoMapGS significantly improves the recovery of sparse regions and reconstruction quality. Abstract: We propose Covisibility Map-based Gaussian Splatting (CoMapGS), designed to recover underrepresented sparse regions in sparse novel view synthesis. CoMapGS addresses both high- and low-uncertainty regions by constructing covisibility maps, enhancing initial point clouds, and applying uncertainty-aware weighted supervision using a proximity classifier. Our contributions are threefold: (1) CoMapGS reframes novel view synthesis by leveraging covisibility maps as a core component to address region-specific uncertainty; (2) Enhanced initial point clouds for both low- and high-uncertainty regions compensate for sparse COLMAP-derived point clouds, improving reconstruction quality and benefiting few-shot 3DGS methods; (3) Adaptive supervision with covisibility-score-based weighting and proximity classification achieves consistent performance gains across scenes with varying sparsity scores derived from covisibility maps. Experimental results demonstrate that CoMapGS outperforms state-of-the-art methods on datasets including Mip-NeRF 360 and LLFF.

Operating Room Workflow Analysis via Reasoning Segmentation over Digital Twins

Yiqing Shen,Chenjia Li,Bohan Liu,Cheng-Yi Li,Tito Porras,Mathias Unberath

Task: Propose ORDiRS, an LLM-tuning-free reasoning segmentation framework based on a digital twin representation, for flexible analysis of operating room workflows.

Motivation: Existing methods rely on end-to-end deep neural networks, lack flexibility, and struggle to accommodate the needs of different operating room scenarios.

Details

Method: Proposes a digital twin representation that preserves semantic and spatial relationships, and designs the ORDiRS framework around a "reason-retrieval-synthesize" paradigm. Result: On an in-house and a public dataset, ORDiRS improves cIoU by 6.12%-9.74% over existing methods. Conclusion: ORDiRS offers a more flexible and efficient solution for operating room workflow analysis. Abstract: Analyzing operating room (OR) workflows to derive quantitative insights into OR efficiency is important for hospitals to maximize patient care and financial sustainability. Prior work on OR-level workflow analysis has relied on end-to-end deep neural networks. While these approaches work well in constrained settings, they are limited to the conditions specified at development time and do not offer the flexibility necessary to accommodate the OR workflow analysis needs of various OR scenarios (e.g., large academic center vs. rural provider) without data collection, annotation, and retraining. Reasoning segmentation (RS) based on foundation models offers this flexibility by enabling automated analysis of OR workflows from OR video feeds given only an implicit text query related to the objects of interest. Due to the reliance on large language model (LLM) fine-tuning, current RS approaches struggle with reasoning about semantic/spatial relationships and show limited generalization to OR video due to variations in visual characteristics and domain-specific terminology. To address these limitations, we first propose a novel digital twin (DT) representation that preserves both semantic and spatial relationships between the various OR components. Then, building on this foundation, we propose ORDiRS (Operating Room Digital twin representation for Reasoning Segmentation), an LLM-tuning-free RS framework that reformulates RS into a "reason-retrieval-synthesize" paradigm. Finally, we present ORDiRS-Agent, an LLM-based agent that decomposes OR workflow analysis queries into manageable RS sub-queries and generates responses by combining detailed textual explanations with supporting visual evidence from RS. Experimental results on both an in-house and a public OR dataset demonstrate that our ORDiRS achieves a cIoU improvement of 6.12%-9.74% compared to the existing state of the art.

ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging

Haoming Xu,Shuxun Wang,Yanqiu Zhao,Yi Zhong,Ziyan Jiang,Ningyuan Zhao,Shumin Deng,Huajun Chen,Ningyu Zhang

Task: Selectively erase sensitive knowledge from large language models while avoiding both over-forgetting and under-forgetting.

Motivation: Address the unlearning of sensitive content in large language models with a more balanced unlearning method.

Details

Method: Uses model merging (specifically TIES-Merging) to combine two specialized models into a more balanced unlearned model. Result: Ranked second among 26 teams, with a Task Aggregate score of 0.944 and an overall Aggregate score of 0.487. Conclusion: More comprehensive evaluation methodologies and a rethinking of unlearning objectives are needed; MIA scores and ROUGE-based metrics alone are insufficient to fully evaluate unlearning. Abstract: This paper presents the ZJUKLAB team's submission for SemEval-2025 Task 4: Unlearning Sensitive Content from Large Language Models. This task aims to selectively erase sensitive knowledge from large language models, avoiding both over-forgetting and under-forgetting issues. We propose an unlearning system that leverages Model Merging (specifically TIES-Merging), combining two specialized models into a more balanced unlearned model. Our system achieves competitive results, ranking second among 26 teams, with an online score of 0.944 for Task Aggregate and 0.487 for overall Aggregate. In this paper, we also conduct local experiments and perform a comprehensive analysis of the unlearning process, examining performance trajectories, loss dynamics, and weight perspectives, along with several supplementary experiments, to understand the effectiveness of our method. Furthermore, we analyze the shortcomings of our method and evaluation metrics, emphasizing that MIA scores and ROUGE-based metrics alone are insufficient to fully evaluate successful unlearning. Finally, we emphasize the need for more comprehensive evaluation methodologies and rethinking of unlearning objectives in future research. Code is available at https://github.com/zjunlp/unlearn/tree/main/semeval25.
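
For context, TIES-Merging combines fine-tuned models in three steps: trim each task vector to its largest-magnitude entries, elect a sign per parameter, and average only the entries that agree with the elected sign. The Python sketch below illustrates those steps on flat parameter lists. It is a toy illustration rather than the team's implementation: `ties_merge`, `trim`, `keep_fraction`, and `scale` are hypothetical names, and the two input models stand in for the "two specialized models" mentioned in the abstract.

```python
from typing import List, Sequence


def trim(task_vector: Sequence[float], keep_fraction: float) -> List[float]:
    """Keep the largest-magnitude entries and zero the rest.

    Entries tied with the threshold are all kept, so slightly more than
    keep_fraction of the entries may survive.
    """
    k = max(1, round(len(task_vector) * keep_fraction))
    threshold = sorted((abs(v) for v in task_vector), reverse=True)[k - 1]
    return [v if abs(v) >= threshold else 0.0 for v in task_vector]


def ties_merge(
    base: Sequence[float],
    finetuned: Sequence[Sequence[float]],
    keep_fraction: float = 0.5,
    scale: float = 1.0,
) -> List[float]:
    """Merge fine-tuned models into the base model with trim/elect/merge."""
    # Task vector = fine-tuned parameters minus base parameters.
    task_vectors = [[f - b for f, b in zip(model, base)] for model in finetuned]
    trimmed = [trim(tv, keep_fraction) for tv in task_vectors]
    merged = []
    for values in zip(*trimmed):
        total = sum(values)
        if total == 0:
            merged.append(0.0)  # no surviving update, or a perfect conflict
            continue
        sign = 1.0 if total > 0 else -1.0  # elected sign for this parameter
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing))  # disjoint mean
    return [b + scale * m for b, m in zip(base, merged)]
```

The sign election is what keeps two models with opposing updates from simply cancelling each other through averaging, which is one reason a merge of this kind can balance an over-forgetting model against an under-forgetting one.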

VideoMix: Aggregating How-To Videos for Task-Oriented Learning

Saelyne Yang,Anh Truong,Juho Kim,Dingzeyu Li

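Task: Develop a system (VideoMix) that helps users gain a holistic understanding of a task by aggregating information from multiple tutorial videos.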

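Motivation: Learners often need to watch several tutorial videos to learn a new task, but the videos are scattered and hard to skim, which makes the process inefficient.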

Details

Method: Uses a vision-language model pipeline to extract and organize information from the videos, presenting textual summaries alongside relevant video clips. Result: A user study shows that VideoMix is more efficient than watching videos independently and yields a more comprehensive understanding of the task. Conclusion: VideoMix demonstrates the potential of a task-oriented, multi-video approach for improving learning efficiency. Abstract: Tutorial videos are a valuable resource for people looking to learn new tasks. People often learn these skills by viewing multiple tutorial videos to get an overall understanding of a task by looking at different approaches to achieve the task. However, navigating through multiple videos can be time-consuming and mentally demanding as these videos are scattered and not easy to skim. We propose VideoMix, a system that helps users gain a holistic understanding of a how-to task by aggregating information from multiple videos on the task. Insights from our formative study (N=12) reveal that learners value understanding potential outcomes, required materials, alternative methods, and important details shared by different videos. Powered by a Vision-Language Model pipeline, VideoMix extracts and organizes this information, presenting concise textual summaries alongside relevant video clips, enabling users to quickly digest and navigate the content. A comparative user study (N=12) demonstrated that VideoMix enabled participants to gain a more comprehensive understanding of tasks with greater efficiency than a baseline video interface, where videos are viewed independently. Our findings highlight the potential of a task-oriented, multi-video approach where videos are organized around a shared goal, offering an enhanced alternative to conventional video-based learning.

UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning

Hongxuan Tang,Hao Liu,Xinyan Xiao

Task: Propose UGen, a unified autoregressive multimodal model that handles text and image tasks simultaneously.

Motivation: Address the challenges of handling text and image tasks in a unified way in multimodal learning.

Details

Method: Through a progressive vocabulary learning mechanism, visual token IDs are incrementally activated and integrated, and a single transformer generates text and images autoregressively. Result: On comprehensive text and image tasks, UGen improves overall performance by 13.3% over the vanilla unified autoregressive method and is competitive with task-specific models on all tasks. Conclusion: Progressive vocabulary learning substantially improves the effectiveness of unified multimodal learning. Abstract: We introduce UGen, a unified autoregressive multimodal model that demonstrates strong performance across text processing, image understanding, and image generation tasks simultaneously. UGen converts both texts and images into discrete token sequences and utilizes a single transformer to generate them uniformly in an autoregressive manner. To address the challenges associated with unified multimodal learning, UGen is trained using a novel mechanism, namely progressive vocabulary learning. In this process, visual token IDs are incrementally activated and integrated into the training phase, ultimately enhancing the effectiveness of unified multimodal learning. Experiments on comprehensive text and image tasks show that UGen achieves a significant overall performance improvement of 13.3% compared to the vanilla unified autoregressive method, and it also delivers competitive results across all tasks against several task-specific models.

WVSC: Wireless Video Semantic Communication with Multi-frame Compensation

Bingyan Xie,Yongpeng Wu,Yuxuan Shi,Biqian Feng,Wenjun Zhang,Jihong Park,Tony Q. S. Quek

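Task: Propose a wireless video semantic communication framework (WVSC) that brings the idea of semantic communication to wireless video transmission.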

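Motivation: Existing wireless video transmission schemes perform video coding directly at the pixel level and neglect the semantic information contained in videos.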

Details

Method: WVSC encodes original video frames as semantic frames, performs video coding on these compact representations, introduces a reference semantic frame in place of the motion vectors of conventional video coding, and applies multi-frame compensation (MFC) at the receiver to improve bandwidth efficiency. Result: Experiments show that WVSC outperforms other DL-based methods (e.g., DVSC) by about 1 dB and traditional schemes by about 2 dB in PSNR. Conclusion: Semantic-level coding and multi-frame compensation improve the bandwidth efficiency and performance of wireless video transmission. Abstract: Existing wireless video transmission schemes directly conduct video coding at the pixel level, while neglecting the inner semantics contained in videos. In this paper, we propose a wireless video semantic communication framework, abbreviated as WVSC, which integrates the idea of semantic communication into wireless video transmission scenarios. WVSC first encodes original video frames as semantic frames and then conducts video coding based on such compact representations, enabling video coding at the semantic level rather than the pixel level. Moreover, to further reduce the communication overhead, a reference semantic frame is introduced to substitute for the motion vectors of each frame in common video coding methods. At the receiver, multi-frame compensation (MFC) is proposed to produce the compensated current semantic frame with a multi-frame fusion attention module. With both the reference frame transmission and MFC, the bandwidth efficiency improves while video transmission performance remains satisfactory. Experimental results verify the performance gain of WVSC over other DL-based methods (e.g., DVSC) by about 1 dB and over traditional schemes by about 2 dB in terms of PSNR.
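The reported gains are expressed in PSNR. As a reminder of what a 1 dB or 2 dB difference means, here is a minimal sketch of the standard PSNR formula for 8-bit samples; the function name and the toy pixel values are illustrative assumptions and are not taken from the paper.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy frame of four 8-bit pixels, each off by one intensity level (MSE = 1).
reference = [10, 20, 30, 40]
reconstructed = [11, 19, 31, 39]
value = psnr(reference, reconstructed)
print(f"PSNR = {value:.2f} dB")
```

Since PSNR is logarithmic in the mean squared error, a gain of 1 dB corresponds to an MSE that is lower by a factor of 10^0.1, roughly 21% smaller, and 2 dB to roughly 37% smaller.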

PLAIN: Scalable Estimation Architecture for Integrated Sensing and Communication

Bashar Tahir,Philipp Svoboda,Markus Rupp

Task: Propose PLAIN, a tensor-based estimation architecture for high-dimensional parameter estimation in integrated sensing and communication (ISAC).

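Motivation: ISAC holds great promise for next-generation mobile networks, but the computational complexity of high-dimensional parameter estimation and the limited measurement time window are major challenges.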

Details

Method: PLAIN consists of three stages (a compression stage, a decoupled estimation stage, and an input-based fusion stage) and draws on tools from tensor algebra, subspace processing, and compressed sensing. Result: PLAIN scales flexibly with dimensionality while keeping complexity low and maintaining super-resolution, outperforming practical sequential and joint estimation baselines. Conclusion: PLAIN provides an efficient and scalable solution to high-dimensional parameter estimation in ISAC. Abstract: Integrated sensing and communication (ISAC) is envisioned to be one of the paradigms upon which next-generation mobile networks will be built, extending localization and tracking capabilities, as well as giving birth to environment-aware wireless access. A key aspect of sensing integration is parameter estimation, which involves extracting information about the surrounding environment, such as the direction, distance, and velocity of various objects within. This is typically of a high-dimensional nature, which leads to significant computational complexity if performed jointly across multiple sensing dimensions, such as space, frequency, and time. Additionally, due to the incorporation of sensing on top of the data transmission, the time window available for sensing is likely to be short, resulting in an estimation problem where only a single snapshot is accessible. In this work, we propose PLAIN, a tensor-based estimation architecture that flexibly scales with multiple sensing dimensions and can handle high dimensionality, limited measurement time, and super-resolution requirements. It consists of three stages: a compression stage, where the high-dimensional input is converted into lower dimensionality without sacrificing resolution; a decoupled estimation stage, where the parameters across the different dimensions are estimated in parallel with low complexity; and an input-based fusion stage, where the decoupled parameters are fused together to form a paired multidimensional estimate. We investigate the performance of the architecture for different configurations and compare it against practical sequential and joint estimation baselines, as well as theoretical bounds.
Our results show that PLAIN, using tools from tensor algebra, subspace-based processing, and compressed sensing, can scale flexibly with dimensionality, while operating with low complexity and maintaining super-resolution.

ProHOC: Probabilistic Hierarchical Out-of-Distribution Classification via Multi-Depth Networks

Erik Wallin,Fredrik Kahl,Lars Hammarstrand

Task: Propose a framework for detecting and classifying out-of-distribution (OOD) samples within a given class hierarchy.

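Motivation: Traditional OOD detection is framed as a binary task and ignores the semantic relationships between OOD samples and the in-distribution (ID) classes.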

Details

Method: Uses the class hierarchy to build a probabilistic model, implemented with networks trained for ID classification at multiple hierarchy depths. Result: The method's effectiveness is shown on three datasets with predefined class hierarchies. Conclusion: The framework detects and classifies OOD samples effectively by exploiting the class hierarchy. Abstract: Out-of-distribution (OOD) detection in deep learning has traditionally been framed as a binary task, where samples are either classified as belonging to the known classes or marked as OOD, with little attention given to the semantic relationships between OOD samples and the in-distribution (ID) classes. We propose a framework for detecting and classifying OOD samples in a given class hierarchy. Specifically, we aim to predict OOD data to their correct internal nodes of the class hierarchy, whereas the known ID classes should be predicted as their corresponding leaf nodes. Our approach leverages the class hierarchy to create a probabilistic model and we implement this model by using networks trained for ID classification at multiple hierarchy depths. We conduct experiments on three datasets with predefined class hierarchies and show the effectiveness of our method. Our code is available at https://github.com/walline/prohoc.

STAMICS: Splat, Track And Map with Integrated Consistency and Semantics for Dense RGB-D SLAM

Yongxu Wang,Xu Cao,Weiyun Yi,Zhaoxin Fan

Task: Propose STAMICS, a method that combines semantic information with 3D Gaussian representations to improve SLAM localization and mapping accuracy.

Motivation: Current SLAM methods rely mainly on geometric cues and struggle to maintain semantic consistency in dynamic or densely populated scenes.

Details

Method: STAMICS has three key components: a 3D Gaussian-based scene representation, a graph-based clustering technique that enforces temporal semantic consistency, and an open-vocabulary system for classifying unseen objects. Result: Experiments show that STAMICS significantly improves camera pose estimation and map quality, outperforming existing methods while reducing reconstruction errors. Conclusion: By integrating semantic information, STAMICS addresses semantic consistency in SLAM and improves performance. Abstract: Simultaneous Localization and Mapping (SLAM) is a critical task in robotics, enabling systems to autonomously navigate and understand complex environments. Current SLAM approaches predominantly rely on geometric cues for mapping and localization, but they often fail to ensure semantic consistency, particularly in dynamic or densely populated scenes. To address this limitation, we introduce STAMICS, a novel method that integrates semantic information with 3D Gaussian representations to enhance both localization and mapping accuracy. STAMICS consists of three key components: a 3D Gaussian-based scene representation for high-fidelity reconstruction, a graph-based clustering technique that enforces temporal semantic consistency, and an open-vocabulary system that allows for the classification of unseen objects. Extensive experiments show that STAMICS significantly improves camera pose estimation and map quality, outperforming state-of-the-art methods while reducing reconstruction errors. Code will be publicly available.

RainyGS: Efficient Rain Synthesis with Physically-Based Gaussian Splatting

Qiyu Dai,Xingyu Ni,Qianfan Shen,Wenzheng Chen,Baoquan Chen,Mengyu Chu

Task: Add dynamic rain effects to open-world scenes in a physically accurate manner.

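Motivation: Existing methods such as NeRF and 3DGS perform well at novel view synthesis but fall short on physics-based simulation such as rain effects; traditional physics-based simulation produces realistic effects but relies on manual setup and lacks flexibility.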

Details

Method: Combines physics-based simulation with 3DGS in the proposed RainyGS method, simulating raindrops and shallow water physically and rendering them efficiently within the 3DGS framework. Result: RainyGS synthesizes realistic rain effects at over 30 fps, supports flexible control from light drizzle to heavy downpour, and outperforms existing methods on real-world outdoor scenes and large-scale driving scenarios. Conclusion: RainyGS combines the strengths of physics-based simulation and 3DGS to generate dynamic rain effects that are efficient, photorealistic, and physically accurate. Abstract: We consider the problem of adding dynamic rain effects to in-the-wild scenes in a physically-correct manner. Recent advances in scene modeling have made significant progress, with NeRF and 3DGS techniques emerging as powerful tools for reconstructing complex scenes. However, while effective for novel view synthesis, these methods typically struggle with challenging scene editing tasks, such as physics-based rain simulation. In contrast, traditional physics-based simulations can generate realistic rain effects, such as raindrops and splashes, but they often rely on skilled artists to carefully set up high-fidelity scenes. This process lacks flexibility and scalability, limiting its applicability to broader, open-world environments. In this work, we introduce RainyGS, a novel approach that leverages the strengths of both physics-based modeling and 3DGS to generate photorealistic, dynamic rain effects in open-world scenes with physical accuracy. At the core of our method is the integration of physically-based raindrop and shallow water simulation techniques within the fast 3DGS rendering framework, enabling realistic and efficient simulations of raindrop behavior, splashes, and reflections. Our method supports synthesizing rain effects at over 30 fps, offering users flexible control over rain intensity -- from light drizzles to heavy downpours. We demonstrate that RainyGS performs effectively for both real-world outdoor scenes and large-scale driving scenarios, delivering more photorealistic and physically-accurate rain effects compared to state-of-the-art methods. The project page can be found at https://pku-vcl-geometry.github.io/RainyGS/

Sparse Bayesian Learning for Label Efficiency in Cardiac Real-Time MRI

Felix Terhag,Philipp Knechtges,Achim Basermann,Anja Bach,Darius Gerlach,Jens Tank,Raúl Tempone

Task: Use sparse Bayesian learning (SBL) to predict the ventricular volume on outer cardiac slices, reducing the need for manual labeling.

Motivation: Real-time cardiac MRI produces a large number of images, but neural network predictions on outer slices are unreliable.

Details

Method: Uses sparse Bayesian learning (SBL) to identify the sparse frequencies that dominate the ventricular volume, optimizing hyperparameters and automatically pruning irrelevant components, and uses these frequencies to guide which outer-slice images to label. Result: Experiments show that only a few labeled images are needed for accurate volume prediction and that the labeling procedure is efficient. Conclusion: SBL reduces the labeling effort and provides uncertainty estimates, making it suitable for cardiac MRI analysis. Abstract: Cardiac real-time magnetic resonance imaging (MRI) is an emerging technology that images the heart at up to 50 frames per second, offering insight into the respiratory effects on the heartbeat. However, this method significantly increases the number of images that must be segmented to derive critical health indicators. Although neural networks perform well on inner slices, predictions on outer slices are often unreliable. This work proposes sparse Bayesian learning (SBL) to predict the ventricular volume on outer slices with minimal manual labeling to address this challenge. The ventricular volume over time is assumed to be dominated by sparse frequencies corresponding to the heart and respiratory rates. Moreover, SBL identifies these sparse frequencies on well-segmented inner slices by optimizing hyperparameters via type-II likelihood, automatically pruning irrelevant components. The identified sparse frequencies guide the selection of outer slice images for labeling, minimizing posterior variance. This work provides performance guarantees for the greedy algorithm. Testing on patient data demonstrates that only a few labeled images are necessary for accurate volume prediction. The labeling procedure effectively avoids selecting inefficient images. Furthermore, the Bayesian approach provides uncertainty estimates, highlighting unreliable predictions (e.g., when choosing suboptimal labels).

Embedding Compression Distortion in Video Coding for Machines

Yuxiao Sun,Yao Zhao,Meiqin Liu,Chao Yao,Weisi Lin

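Task: Propose a Compression Distortion Representation Embedding (CDRE) framework to mitigate the impact of video compression on machine vision task performance.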

Motivation: Existing codecs are optimized mainly for the pixel domain and the human visual system, neglecting the needs of machine vision tasks.

Details

Method: Designs a compression-sensitive extractor that analyzes compression degradation in the feature domain, introduces a lightweight distortion codec to compress the distortion information, and then embeds it into the downstream model. Result: Experiments show that the framework significantly improves the rate-task performance of existing codecs with minimal overhead. Conclusion: CDRE addresses the information lost to video compression for machine vision tasks and improves task performance. Abstract: Currently, video transmission serves not only the Human Visual System (HVS) for viewing but also machine perception for analysis. However, existing codecs are primarily optimized for pixel-domain and HVS-perception metrics rather than the needs of machine vision tasks. To address this issue, we propose a Compression Distortion Representation Embedding (CDRE) framework, which extracts machine-perception-related distortion representation and embeds it into downstream models, addressing the information lost during compression and improving task performance. Specifically, to better analyze the machine-perception-related distortion, we design a compression-sensitive extractor that identifies compression degradation in the feature domain. For efficient transmission, a lightweight distortion codec is introduced to compress the distortion information into a compact representation. Subsequently, the representation is progressively embedded into the downstream model, enabling it to be better informed about compression degradation and enhancing performance. Experiments across various codecs and downstream tasks demonstrate that our framework can effectively boost the rate-task performance of existing codecs with minimal overhead in terms of bitrate, execution time, and number of parameters. Our codes and supplementary materials are released at https://github.com/Ws-Syx/CDRE/.

Brett Levac,Ajil Jalal,Kannan Ramchandran,Jonathan I. Tamir

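Task: Propose an AmbientGAN-based generative technique that identifies the distribution of parameters of an unknown imaging system from unpaired clean images and corrupted measurements.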

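Motivation: Address the uncertainty in imaging system parameters in blind inverse problems without requiring paired clean images and samples of the imaging system.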

Details

Method: Uses an AmbientGAN-based generative technique to learn the distribution of parameters of an unknown imaging system and applies it in model-based recovery algorithms. Result: Demonstrates learning Gaussian blur and motion blur priors from noisy measurements and using them to solve blind deconvolution with diffusion posterior sampling. Conclusion: The method offers an effective approach to blind inverse problems, learning a prior over the imaging system without paired data. Abstract: Blind inverse problems in imaging arise from uncertainties in the system used to collect (noisy) measurements of images. Recovering clean images from these measurements typically requires identifying the imaging system, either implicitly or explicitly. A common solution leverages generative models as priors for both the images and the imaging system parameters (e.g., a class of point spread functions). To learn these priors in a straightforward manner requires access to a dataset of clean images as well as samples of the imaging system. We propose an AmbientGAN-based generative technique to identify the distribution of parameters in unknown imaging systems, using only unpaired clean images and corrupted measurements. This learned distribution can then be used in model-based recovery algorithms to solve blind inverse problems such as blind deconvolution. We successfully demonstrate our technique for learning Gaussian blur and motion blur priors from noisy measurements and show their utility in solving blind deconvolution with diffusion posterior sampling.

Keyword-Oriented Multimodal Modeling for Euphemism Identification

Yuxue Hu,Junsong Li,Meixuan Chen,Dongyu Su,Tongguan Wang,Ying Sha

Task: Identify euphemisms and their corresponding target keywords, particularly in multimodal (text, image, audio) data.

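Motivation: Existing methods are mainly text-based, while the rise of social media highlights the need for multimodal analysis; however, multimodal euphemism datasets are currently lacking.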

Details

Method: Introduces a keyword-oriented multimodal corpus of euphemisms (KOM-Euph) and proposes a multimodal euphemism identification method (KOM-EI) that uses cross-modal feature alignment and dynamic fusion modules. Result: KOM-EI outperforms state-of-the-art models and large language models in experiments, which also confirm the importance of the multimodal datasets. Conclusion: Multimodal methods offer clear advantages for euphemism identification, and KOM-Euph and KOM-EI provide valuable resources and methods for related research. Abstract: Euphemism identification deciphers the true meaning of euphemisms, such as linking "weed" (euphemism) to "marijuana" (target keyword) in illicit texts, aiding content moderation and combating underground markets. While existing methods are primarily text-based, the rise of social media highlights the need for multimodal analysis, incorporating text, images, and audio. However, the lack of multimodal datasets for euphemisms limits further research. To address this, we regard euphemisms and their corresponding target keywords as keywords and first introduce a keyword-oriented multimodal corpus of euphemisms (KOM-Euph), involving three datasets (Drug, Weapon, and Sexuality), including text, images, and speech. We further propose a keyword-oriented multimodal euphemism identification method (KOM-EI), which uses cross-modal feature alignment and dynamic fusion modules to explicitly utilize the visual and audio features of the keywords for efficient euphemism identification. Extensive experiments demonstrate that KOM-EI outperforms state-of-the-art models and large language models, and show the importance of our multimodal datasets.

Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving

Yue Li,Meng Tian,Zhenyu Lin,Jiangtong Zhu,Dechang Zhu,Haiqiang Liu,Zining Wang,Yueyi Zhang,Zhiwei Xiong,Xinhai Zhao

Task: Propose VLADBench, a fine-grained dataset for evaluating the capabilities of vision-language models in autonomous driving scenarios.

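Motivation: Existing benchmarks evaluate models mainly through coarse-grained, open-form visual question answering, which is insufficient for complex driving scenarios.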

Details

Method: Builds the VLADBench dataset of closed-form QAs spanning 5 key domains, further divided into 11 secondary aspects and 29 tertiary tasks, and also trains domain-specific models. Result: Experiments show that VLADBench enables a more comprehensive assessment of model capabilities and reveals the strengths and limitations of existing models. Conclusion: VLADBench provides an important foundation for evaluating and developing vision-language models for autonomous driving. Abstract: Existing benchmarks for Vision-Language Models (VLMs) in autonomous driving (AD) primarily assess interpretability through open-form visual question answering (QA) within coarse-grained tasks, which remain insufficient to assess capabilities in complex driving scenarios. To this end, we introduce VLADBench, a challenging and fine-grained dataset featuring closed-form QAs that progress from static foundational knowledge and elements to advanced reasoning for dynamic on-road situations. The elaborate VLADBench spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning. These domains are further broken down into 11 secondary aspects and 29 tertiary tasks for a granular evaluation. A thorough assessment of general and domain-specific (DS) VLMs on this benchmark reveals both their strengths and critical limitations in AD contexts. To further exploit the cognitive and reasoning interactions among the 5 domains for AD understanding, we start from a small-scale VLM and train the DS models on individual domain datasets (collected from 1.4M DS QAs across public sources). The experimental results demonstrate that the proposed benchmark provides a crucial step toward a more comprehensive assessment of VLMs in AD, paving the way for the development of more cognitively sophisticated and reasoning-capable AD systems.

Uncertainty-aware Bayesian machine learning modelling of land cover classification

Samuel Bilson,Anna Pustogvar

Task: Propose a Bayesian classification framework that accounts for input measurement uncertainty to improve land cover classification.

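Motivation: Current machine learning classification models do not account for input measurement uncertainty, which is vital for traceability in metrology.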

Details

Method: Adopts a Bayesian classification framework based on generative modelling, specifically Bayesian quadratic discriminant analysis, and evaluates it on Copernicus Sentinel-2 data. Result: The Bayesian models are more trustworthy, explicitly model input uncertainty, maintain predictive performance, and are computationally efficient. Conclusion: The Bayesian classification framework is advantageous for land cover classification, particularly in handling uncertainty and in performance across datasets. Abstract: Land cover classification involves the production of land cover maps, which determine the type of land through remote sensing imagery. In recent years, such classification has been performed by machine learning classification models, which can give highly accurate predictions on land cover per pixel using large quantities of input training data. However, such models do not currently take account of input measurement uncertainty, which is vital for traceability in metrology. In this work we propose a Bayesian classification framework using generative modelling to take account of input measurement uncertainty. We take the specific case of Bayesian quadratic discriminant analysis, and apply it to land cover datasets from Copernicus Sentinel-2 in 2020 and 2021. We benchmark the performance of the model against more popular classification models used in land cover maps such as random forests and neural networks. We find that such Bayesian models are more trustworthy, in the sense that they are more interpretable, explicitly model the input measurement uncertainty, and maintain predictive performance of class probability outputs across datasets of different years and sizes, whilst also being computationally efficient.

SyncSDE: A Probabilistic Framework for Diffusion Synchronization

Hyunjun Lee,Hyunsoo Lee,Sookwan Han

Task: Propose a probabilistic framework that explains why diffusion synchronization works and reveals where heuristics should be focused.

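Motivation: Existing methods rely on naive heuristics such as averaging without considering task specificity, so they perform poorly when applied blindly.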

Details

Method: Proposes a probabilistic framework that models the correlations between multiple trajectories and adapts them to each specific task. Result: Identifies the optimal correlation model for each task and achieves better results than previous methods. Conclusion: Task-specific correlation modeling is the key to successful diffusion synchronization. Abstract: There have been many attempts to leverage multiple diffusion models for collaborative generation, extending beyond the original domain. A prominent approach involves synchronizing multiple diffusion trajectories by mixing the estimated scores to artificially correlate the generation processes. However, existing methods rely on naive heuristics, such as averaging, without considering task specificity. These approaches do not clarify why such methods work and often fail when a heuristic suitable for one task is blindly applied to others. In this paper, we present a probabilistic framework for analyzing why diffusion synchronization works and reveal where heuristics should be focused - modeling correlations between multiple trajectories and adapting them to each specific task. We further identify optimal correlation models per task, achieving better results than previous approaches that apply a single heuristic across all tasks without justification.

When Astronomy Meets AI: Manazel For Crescent Visibility Prediction in Morocco

Yassir Lairgi

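Task: Improve the accuracy of predicting the start of Hijri (Islamic calendar) months by combining the Arc of Vision (ARCV) and total crescent width (W) features with a logistic regression algorithm.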

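Motivation: Accurately determining the start of each Hijri month is essential for religious, cultural, and administrative purposes, especially in Morocco.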

Details

Method: Uses 13 years of crescent visibility data together with the ARCV and W features and applies logistic regression for classification. Result: Achieves a predictive accuracy of 98.83%, providing a reliable data-driven framework for determining the start of Hijri months. Conclusion: Machine learning performs well in astronomical applications, and crescent visibility models can be refined further. Abstract: The accurate determination of the beginning of each Hijri month is essential for religious, cultural, and administrative purposes. Manazel (the code and datasets are available at https://github.com/lairgiyassir/manazel) addresses this challenge in Morocco by leveraging 13 years of crescent visibility data to refine the ODEH criterion, a widely used standard for lunar crescent visibility prediction. The study integrates two key features, the Arc of Vision (ARCV) and the total width of the crescent (W), to enhance the accuracy of lunar visibility assessments. A machine learning approach utilizing the Logistic Regression algorithm is employed to classify crescent visibility conditions, achieving a predictive accuracy of 98.83%. This data-driven methodology offers a robust and reliable framework for determining the start of the Hijri month, comparing different data classification tools, and improving the consistency of lunar calendar calculations in Morocco. The findings demonstrate the effectiveness of machine learning in astronomical applications and highlight the potential for further enhancements in the modeling of crescent visibility.
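The classification step described above, logistic regression on the two features ARCV and W, can be sketched in a few lines. This is a minimal illustration only: the observations below are invented placeholder values, not Manazel data, and the plain gradient-descent fit, the function names, and the hyperparameters are assumptions standing in for whatever library implementation the authors used.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(samples, labels, lr=0.1, epochs=2000):
    """Fit a logistic regression by batch gradient descent on mean-centered features."""
    n = len(samples)
    dims = len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(dims)]
    centered = [[s[j] - means[j] for j in range(dims)] for s in samples]
    weights = [0.0] * dims
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dims
        grad_b = 0.0
        for x, y in zip(centered, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j in range(dims):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias, means


def predict_proba(model, sample):
    """Probability that the crescent is visible for one (ARCV, W) observation."""
    weights, bias, means = model
    z = sum(w * (v - m) for w, v, m in zip(weights, sample, means)) + bias
    return sigmoid(z)


# Hypothetical observations: (ARCV in degrees, W in arcminutes), label 1 = crescent seen.
samples = [(12.0, 0.9), (10.5, 0.7), (9.0, 1.1), (11.0, 0.5),
           (3.0, 0.1), (4.5, 0.2), (5.0, 0.15), (2.0, 0.3)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = fit_logistic(samples, labels)
print(predict_proba(model, (11.5, 0.8)))  # a high, wide crescent
print(predict_proba(model, (3.5, 0.2)))   # a low, thin crescent
```

A probability threshold of 0.5 then gives a visible / not visible decision. In practice one would more likely call an existing implementation such as scikit-learn's `LogisticRegression` and evaluate on held-out observations.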

Cognitive Science-Inspired Evaluation of Core Capabilities for Object Understanding in AI

Danaja Rutar,Alva Markelius,Konstantinos Voudouris,José Hernández-Orallo,Lucy Cheke

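Task: Review the main theoretical frameworks in objecthood research and evaluate how current AI paradigms perform on objecthood capabilities.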

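Motivation: Objecthood is a core component of world models and is essential for understanding objects, space, and causality, yet there is no unified theoretical account of it.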

Details

Method: First reviews the main theoretical frameworks of Gestalt psychology, enactive cognition, and developmental psychology, then evaluates how current AI paradigms handle objecthood capabilities. Result: Current AI benchmarks can detect isolated objecthood capabilities but cannot detect a lack of functional integration across them. Conclusion: Proposes new evaluation approaches to move AI from isolated capabilities toward integrated object understanding in real-world contexts. Abstract: One of the core components of our world models is 'intuitive physics' - an understanding of objects, space, and causality. This capability enables us to predict events, plan actions, and navigate environments, all of which rely on a composite sense of objecthood. Despite its importance, there is no single, unified account of objecthood, though multiple theoretical frameworks provide insights. In the first part of this paper, we present a comprehensive overview of the main theoretical frameworks in objecthood research - Gestalt psychology, enactive cognition, and developmental psychology - and identify the core capabilities each framework attributes to object understanding, as well as what functional roles they play in shaping world models in biological agents. Given the foundational role of objecthood in world modelling, understanding objecthood is also essential in AI. In the second part of the paper, we evaluate how current AI paradigms approach and test objecthood capabilities compared to those in cognitive science. We define an AI paradigm as a combination of how objecthood is conceptualised, the methods used for studying objecthood, the data utilised, and the evaluation techniques. We find that, whilst benchmarks can detect that AI systems model isolated aspects of objecthood, the benchmarks cannot detect when AI systems lack functional integration across these capabilities, leaving the objecthood challenge not fully solved. Finally, we explore novel evaluation approaches that align with the integrated vision of objecthood outlined in this paper. These methods are promising candidates for advancing from isolated object capabilities toward general-purpose AI with genuine object understanding in real-world contexts.

Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data

Zhiyuan Ma,Xinyue Liang,Rongyuan Wu,Xiangyu Zhu,Zhen Lei,Lei Zhang

Task: Propose a training scheme called Progressive Rendering Distillation (PRD) for generating high-quality 3D meshes from text prompts.

Motivation: Address the poor generation quality of existing methods caused by the lack of high-quality 3D training data.

Details

Method: Uses Progressive Rendering Distillation (PRD), which distills multi-view diffusion models and adapts Stable Diffusion (SD) for 3D generation without 3D ground-truth data. Result: The resulting TriplaneTurbo model generates high-quality 3D meshes in 1.2 seconds and outperforms existing methods in both efficiency and quality. Conclusion: PRD addresses the data shortage and substantially improves the efficiency and quality of text-to-3D generation. Abstract: It is highly desirable to obtain a model that can generate high-quality 3D meshes from text prompts in just seconds. While recent attempts have adapted pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into generators of 3D representations (e.g., Triplane), they often suffer from poor quality due to the lack of sufficient high-quality 3D training data. Aiming at overcoming the data shortage, we propose a novel training scheme, termed Progressive Rendering Distillation (PRD), eliminating the need for 3D ground-truths by distilling multi-view diffusion models and adapting SD into a native 3D generator. In each iteration of training, PRD uses the U-Net to progressively denoise the latent from random noise for a few steps, and in each step it decodes the denoised latent into 3D output. Multi-view diffusion models, including MVDream and RichDreamer, are used jointly with SD to distill text-consistent textures and geometries into the 3D outputs through score distillation. Since PRD supports training without 3D ground-truths, we can easily scale up the training data and improve generation quality for challenging text prompts with creative concepts. Meanwhile, PRD can accelerate the inference speed of the generation model in just a few steps. With PRD, we train a Triplane generator, namely TriplaneTurbo, which adds only 2.5% trainable parameters to adapt SD for Triplane generation. TriplaneTurbo outperforms previous text-to-3D generators in both efficiency and quality. Specifically, it can produce high-quality 3D meshes in 1.2 seconds and generalize well for challenging text input. The code is available at https://github.com/theEricMa/TriplaneTurbo.

Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Wenqi Zhang,Mengna Wang,Gangao Liu,Xu Huixin,Yiwei Jiang,Yongliang Shen,Guiyang Hou,Zhe Zheng,Hang Zhang,Xin Li,Weiming Lu,Peng Li,Yueting Zhuang

Task: Extend deep thinking models to embodied tasks that require continuous interaction with the environment.

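Motivation: Current deep thinking models excel at mathematical and coding tasks, but their effectiveness on tasks requiring continuous interaction with embodied environments remains largely unexplored.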

Details

Method: Proposes the Embodied Reasoner model, synthesizes 9.3k coherent Observation-Thought-Action trajectories, and adopts a three-stage training pipeline (imitation learning, self-exploration, self-correction). Result: The model significantly outperforms other advanced visual reasoning models (such as OpenAI o1, o3-mini, and Claude-3.7) on embodied search tasks and does better on complex long-horizon tasks. Conclusion: By combining spatial understanding, temporal reasoning, and ongoing self-reflection, Embodied Reasoner improves reasoning in embodied tasks. Abstract: Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning that relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. The evaluation shows that our model significantly outperforms advanced visual reasoning models, e.g., it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9%, +24%, and +13%. Analysis reveals our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. Real-world environments also show our superiority while exhibiting fewer repeated searches and logical inconsistency cases.

MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX

Liuyue Xie,George Z. Wei,Avik Kuthiala,Ce Zheng,Ananya Bal,Mosam Dabhi,Liting Wen,Taru Rustagi,Ethan Lai,Sushil Khyalia,Rohan Choudhury,Morteza Ziyadi,Xu Zhang,Hao Yang,László A. Jeni

Task: Propose MAVERIX, a benchmark for evaluating multimodal models on tasks that require integrating video and audio information.

Motivation: The field lacks a standardized evaluation framework for thoroughly assessing the cross-modality perception of multimodal models.

Details

Method: MAVERIX comprises 700 videos and 2,556 questions designed to test how well models integrate video and audio information. Result: Experiments show that state-of-the-art models (such as Gemini 1.5 Pro and o1) approach human-level performance (around 70% accuracy), while human experts reach near-ceiling performance (95.1%). Conclusion: With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX provides a challenging testbed for advancing audiovisual multimodal intelligence. Abstract: Frontier models have either been language-only or have primarily focused on vision and language modalities. Although recent advancements in models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modality perception performance. We introduce MAVERIX (Multimodal Audio-Visual Evaluation Reasoning IndeX), a novel benchmark with 700 videos and 2,556 questions explicitly designed to evaluate multimodal models through tasks that necessitate close integration of video and audio information. MAVERIX uniquely provides models with audiovisual tasks, closely mimicking the multimodal perceptual experiences available to humans during inference and decision-making processes. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration. Experiments with state-of-the-art models, including Gemini 1.5 Pro and o1, show performance approaching human levels (around 70% accuracy), while human experts reach near-ceiling performance (95.1%). With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.