2025 03 29

ECLAIR: Enhanced Clarification for Interactive Responses in an Enterprise AI Assistant

John Murzaku,Zifan Liu,Vaishnavi Muppala,Md Mehrab Tanjim,Xiang Chen,Yunyao Li

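Task: Propose ECLAIR, a framework that uses multi-agent interaction to resolve the ambiguity large language models face in real-world enterprise interactions.

Motivation: Large language models have made notable progress in understanding and generating natural language, but in real-world enterprise interactions they often fail to resolve ambiguity because they lack context and domain knowledge.

Details

Method: ECLAIR performs interactive disambiguation by defining custom agents, having the agents reason about ambiguity, generating clarification questions, and using user feedback to refine the final response.

Result: On real-world customer data, ECLAIR significantly outperforms standard few-shot methods at clarification question generation.

Conclusion: Through multi-agent interaction, ECLAIR improves the ability of large language models to resolve ambiguity in complex scenarios.

Abstract: Large language models (LLMs) have shown remarkable progress in understanding and generating natural language across various applications. However, they often struggle with resolving ambiguities in real-world, enterprise-level interactions, where context and domain-specific knowledge play a crucial role. In this demonstration, we introduce ECLAIR (Enhanced CLArification for Interactive Responses), a multi-agent framework for interactive disambiguation. ECLAIR enhances ambiguous user query clarification through an interactive process where custom agents are defined, ambiguity reasoning is conducted by the agents, clarification questions are generated, and user feedback is leveraged to refine the final response. When tested on real-world customer data, ECLAIR demonstrates significant improvements in clarification question generation compared to standard few-shot methods.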

Can Zero-Shot Commercial APIs Deliver Regulatory-Grade Clinical Text De-Identification?

Veysel Kocaman,Muhammed Santas,Yigit Gul,Mehmet Butgul,David Talby

Task: Evaluate three API-based de-identification systems (Azure Health Data Services, AWS Comprehend Medical, and OpenAI GPT-4o) against the authors' own system, Healthcare NLP, on clinical document de-identification.

Motivation: Test whether commercial APIs meet the regulatory requirements of clinical de-identification in accuracy, adaptability, and cost-efficiency, and show the advantages of the authors' system.

Details

Method: Entity-level and token-level performance analysis of the four systems on 48 clinical documents annotated by medical experts.

Result: Healthcare NLP reaches a 96% F1-score, significantly ahead of Azure (91%), AWS (83%), and GPT-4o (79%), and reduces processing costs by over 80% compared to Azure and GPT-4o.

Conclusion: Zero-shot commercial APIs fall short of what regulatory-grade clinical de-identification requires; Healthcare NLP is the more viable option given its performance, customization capabilities, and cost.

Abstract: We systematically assess the performance of three leading API-based de-identification systems - Azure Health Data Services, AWS Comprehend Medical, and OpenAI GPT-4o - against our de-identification systems on a ground truth dataset of 48 clinical documents annotated by medical experts. Our analysis, conducted at both entity-level and token-level, demonstrates that our solution, Healthcare NLP, achieves the highest accuracy, with a 96% F1-score in protected health information (PHI) detection, significantly outperforming Azure (91%), AWS (83%), and GPT-4o (79%). Beyond accuracy, Healthcare NLP is also the most cost-effective solution, reducing processing costs by over 80% compared to Azure and GPT-4o. Its fixed-cost local deployment model avoids the escalating per-request fees of cloud-based services, making it a scalable and economical choice. Our results underscore a critical limitation: zero-shot commercial APIs fail to meet the accuracy, adaptability, and cost-efficiency required for regulatory-grade clinical de-identification. Healthcare NLP's superior performance, customization capabilities, and economic advantages position it as the more viable solution for healthcare organizations seeking compliance and scalability in clinical NLP workflows.
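
The difference between entity-level and token-level scoring mentioned in this entry can be illustrated with a minimal sketch. It assumes gold and predicted PHI annotations are available both as per-token labels ("O" for non-PHI) and as exact (start, end, label) spans. These representations, the function names, and the toy data are illustrative assumptions, not the paper's evaluation code.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def token_level_scores(gold, pred):
    """Score per-token labels; "O" marks a non-PHI token."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g != "O" and p == g)
    fp = sum(1 for g, p in zip(gold, pred) if p != "O" and p != g)
    fn = sum(1 for g, p in zip(gold, pred) if g != "O" and p != g)
    return prf(tp, fp, fn)


def entity_level_scores(gold_spans, pred_spans):
    """Score (start, end, label) spans; only exact matches count."""
    gold_set, pred_set = set(gold_spans), set(pred_spans)
    tp = len(gold_set & pred_set)
    return prf(tp, len(pred_set - gold_set), len(gold_set - pred_set))


# Toy document of five tokens (hypothetical data).
gold_tokens = ["O", "NAME", "NAME", "O", "DATE"]
pred_tokens = ["O", "NAME", "O", "DATE", "DATE"]
gold_spans = [(1, 3, "NAME"), (4, 5, "DATE")]
pred_spans = [(1, 2, "NAME"), (3, 5, "DATE")]

token_scores = token_level_scores(gold_tokens, pred_tokens)
entity_scores = entity_level_scores(gold_spans, pred_spans)
print("token-level P/R/F1:", token_scores)
print("entity-level P/R/F1:", entity_scores)
```

In this toy case the prediction overlaps both gold entities but gets neither boundary exactly right, so the token-level score gives partial credit while the exact-match entity-level score gives none. That gap is one reason to report both views.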

"Whose Side Are You On?" Estimating Ideology of Political and News Content Using Large Language Models and Few-shot Demonstration Selection

Muhammad Haroon,Magdalena Wojcieszak,Anshuman Chhabra

Task: Use in-context learning (ICL) with large language models (LLMs) to classify the political ideology of online content along the two-party US political spectrum.

Motivation: Existing ideology classification methods require extensive manual labeling and adapt poorly to evolving ideological contexts.

Details

Method: ICL with label-balanced demonstration selection, evaluated on three datasets of news articles and YouTube videos.

Result: The approach significantly outperforms zero-shot and traditional supervised methods; the influence of metadata on classification is also evaluated.

Conclusion: LLMs show promise for political ideology classification, and providing the content source influences the classification result.

Abstract: The rapid growth of social media platforms has led to concerns about radicalization, filter bubbles, and content bias. Existing approaches to classifying ideology are limited in that they require extensive human effort, the labeling of large datasets, and are not able to adapt to evolving ideological contexts. This paper explores the potential of Large Language Models (LLMs) for classifying the political ideology of online content in the context of the two-party US political spectrum through in-context learning (ICL). Our extensive experiments involving demonstration selection in label-balanced fashion, conducted on three datasets comprising news articles and YouTube videos, reveal that our approach significantly outperforms zero-shot and traditional supervised methods. Additionally, we evaluate the influence of metadata (e.g., content source and descriptions) on ideological classification and discuss its implications. Finally, we show how providing the source for political and non-political content influences the LLM's classification.
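
Label-balanced demonstration selection can be sketched as follows. This is a minimal illustration, assuming random sampling within each label. The abstract does not describe the paper's actual selection strategy or prompt wording, so the label names ("liberal", "conservative"), the prompt template, and the function names here are hypothetical.

```python
import random
from collections import defaultdict


def select_balanced_demos(pool, k_per_label, seed=0):
    """Pick the same number of (text, label) demonstrations for every label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in pool:
        by_label[label].append((text, label))
    demos = []
    for label in sorted(by_label):
        candidates = by_label[label]
        if len(candidates) < k_per_label:
            raise ValueError(f"not enough examples for label {label!r}")
        demos.extend(rng.sample(candidates, k_per_label))
    rng.shuffle(demos)  # avoid grouping demonstrations by label in the prompt
    return demos


def build_prompt(demos, query):
    """Assemble a few-shot prompt (hypothetical template)."""
    parts = ["Classify the political ideology of each text."]
    for text, label in demos:
        parts.append(f"Text: {text}\nIdeology: {label}")
    parts.append(f"Text: {query}\nIdeology:")
    return "\n\n".join(parts)


# Hypothetical labeled pool.
pool = [
    ("Example article A", "liberal"),
    ("Example article B", "liberal"),
    ("Example article C", "liberal"),
    ("Example article D", "conservative"),
    ("Example article E", "conservative"),
    ("Example article F", "conservative"),
]
demos = select_balanced_demos(pool, k_per_label=2)
prompt = build_prompt(demos, "Unlabeled article text")
print(prompt)
```

Keeping the per-label count equal prevents the demonstrations themselves from skewing the model toward the majority label, which is the point of balancing.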

SE-GNN: Seed Expanded-Aware Graph Neural Network with Iterative Optimization for Semi-supervised Entity Alignment

Tao Meng,Shuo Shan,Hongen Shao,Yuntao Shou,Wei Ai,Keqin Li

Task: Propose SE-GNN, a seed expanded-aware graph neural network for semi-supervised entity alignment.

Motivation: Manually annotating pre-aligned seed pairs becomes difficult as knowledge graphs (KGs) grow, and existing methods are affected by structural heterogeneity and by the embedding distortion that noisy seed pairs produce.

Details

Method: Combine semantic attributes and structural features to obtain high-quality initial potential seed pairs, design a local and global awareness mechanism to improve the embedding representation, and apply a threshold nearest neighbor embedding correction strategy to eliminate embedding distortion.

Result: SE-GNN alleviates the impact of KG structural heterogeneity and improves the quality of potential seed pairs through iterative optimization.

Conclusion: SE-GNN addresses the limitations of existing entity alignment methods and improves semi-supervised entity alignment.

Abstract: Entity alignment aims to use pre-aligned seed pairs to find other equivalent entities from different knowledge graphs (KGs) and is widely used in graph fusion-related fields. However, as the scale of KGs increases, manually annotating pre-aligned seed pairs becomes difficult. Existing research utilizes entity embeddings obtained by aggregating single structural information to identify potential seed pairs, thus reducing the reliance on pre-aligned seed pairs. However, due to the structural heterogeneity of KGs, the quality of potential seed pairs obtained using only a single structural information is not ideal. In addition, although existing research improves the quality of potential seed pairs through semi-supervised iteration, they underestimate the impact of embedding distortion produced by noisy seed pairs on the alignment effect. In order to solve the above problems, we propose a seed expanded-aware graph neural network with iterative optimization for semi-supervised entity alignment, named SE-GNN. First, we utilize the semantic attributes and structural features of entities, combined with a conditional filtering mechanism, to obtain high-quality initial potential seed pairs. Next, we designed a local and global awareness mechanism. It introduces initial potential seed pairs and combines local and global information to obtain a more comprehensive entity embedding representation, which alleviates the impact of KGs structural heterogeneity and lays the foundation for the optimization of initial potential seed pairs. Then, we designed the threshold nearest neighbor embedding correction strategy. It combines the similarity threshold and the bidirectional nearest neighbor method as a filtering mechanism to select iterative potential seed pairs and also uses an embedding correction strategy to eliminate the embedding distortion.
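
The filtering idea described at the end of the abstract, a similarity threshold combined with bidirectional nearest neighbors, can be sketched in a few lines. This is a generic mutual-nearest-neighbor filter over a precomputed similarity matrix, offered under stated assumptions: the function name, the matrix layout, and the toy values are illustrative, and the embedding correction step is not shown.

```python
def mutual_nn_pairs(sim, threshold):
    """Select candidate seed pairs from a similarity matrix.

    sim[i][j] is the similarity between entity i of the first KG and
    entity j of the second KG. A pair (i, j) is kept only if j is the
    nearest neighbor of i, i is the nearest neighbor of j, and the
    similarity reaches the threshold.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    best_col_for_row = [
        max(range(n_cols), key=lambda j: sim[i][j]) for i in range(n_rows)
    ]
    best_row_for_col = [
        max(range(n_rows), key=lambda i: sim[i][j]) for j in range(n_cols)
    ]
    return [
        (i, j)
        for i, j in enumerate(best_col_for_row)
        if best_row_for_col[j] == i and sim[i][j] >= threshold
    ]


# Toy similarity matrix (hypothetical values).
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.4, 0.8],
    [0.2, 0.5, 0.6],
]
print(mutual_nn_pairs(sim, threshold=0.7))
```

Entity 2 of the first graph prefers entity 2 of the second, but that entity's own nearest neighbor is entity 1, so the pair is rejected. Raising the threshold trades recall of candidate pairs for fewer noisy seeds.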

Multimodal Image Matching based on Frequency-domain Information of Local Energy Response

Meng Yang,Jun Chen,Wenping Gong,Longsheng Wei,Xin Tian

Task: Address nonlinear intensity differences, local geometric distortions, noise, and rotation transformations in multimodal image matching.

Motivation: Multimodal image matching faces nonlinear intensity differences, local geometric distortions, noise, and rotation transformations, so a robust and general method is needed.

Details

Method: Propose FILER, a method based on frequency-domain information of local energy response, comprising a local energy response model, an edge structure enhanced feature detector, and a convolutional feature weighted descriptor.

Result: In experiments on multimodal image pairs, FILER outperforms other state-of-the-art algorithms and shows good robustness and universality.

Conclusion: FILER handles the main challenges of multimodal image matching, and the experiments confirm its advantage.

Abstract: Complicated nonlinear intensity differences, nonlinear local geometric distortions, noise and rotation transformation are the main challenges in multimodal image matching. In order to solve these problems, we propose a method based on Frequency-domain Information of Local Energy Response called FILER. The core of FILER is the local energy response model based on frequency-domain information, which can overcome the effect of nonlinear intensity differences. To improve the robustness to local nonlinear geometric distortions and noise, we design a new edge structure enhanced feature detector and convolutional feature weighted descriptor, respectively. In addition, FILER overcomes the sensitivity of the frequency-domain information to the rotation angle and achieves rotation invariance. Extensive experiments on multimodal image pairs show that FILER outperforms other state-of-the-art algorithms and has good robustness and universality.

Comprehensive Manuscript Assessment with Text Summarization Using 69707 articles

Qichen Sun,Yuxing Lu,Kun Xia,Li Chen,He Sun,Jinzhuo Wang

Task: Predict the future impact of scientific papers before publication (both the impact of the publishing journal and of the paper itself).

Motivation: Existing work is mostly limited to a specific academic field or depends on early citation data, which makes it unsuitable for early-stage assessment of papers.

Details

Method: Build a multidisciplinary dataset from Scopus, extract semantic features with deep learning and a Transformer model, and design a text fusion layer to capture information shared between titles and abstracts.

Result: Experiments show that the proposed model performs strongly on impact prediction tasks and has potential for generating feedback and improvement suggestions.

Conclusion: The method offers an effective tool for early-stage impact assessment of papers and shows potential for multidisciplinary use.

Abstract: Rapid and efficient assessment of the future impact of research articles is a significant concern for both authors and reviewers. The most common standard for measuring the impact of academic papers is the number of citations. In recent years, numerous efforts have been undertaken to predict citation counts within various citation windows. However, most of these studies focus solely on a specific academic field or require early citation counts for prediction, rendering them impractical for the early-stage evaluation of papers. In this work, we harness Scopus to curate a comprehensive and large-scale dataset of information from 69707 scientific articles sourced from 99 journals spanning multiple disciplines. We propose a deep learning methodology for the impact-based classification tasks, which leverages semantic features extracted from the manuscripts and paper metadata. To summarize the semantic features, such as titles and abstracts, we employ a Transformer-based language model to encode semantic features and design a text fusion layer to capture shared information between titles and abstracts. We specifically focus on the following impact-based prediction tasks using information of scientific manuscripts in pre-publication stage: (1) The impact of journals in which the manuscripts will be published. (2) The future impact of manuscripts themselves. Extensive experiments on our datasets demonstrate the superiority of our proposed model for impact-based prediction tasks. We also demonstrate its potential for generating manuscript feedback and improvement suggestions.

MedSegNet10: A Publicly Accessible Network Repository for Split Federated Medical Image Segmentation

Chamani Shiranthika,Zahra Hafezi Kafshgari,Hadi Hadizadeh,Parvaneh Saeedi

Task: Introduce MedSegNet10, a publicly accessible repository for medical image segmentation based on split federated learning.

Motivation: Address data privacy concerns, limited annotated data, and inadequate training data in medical image segmentation.

Details

Method: Use split federated learning (SplitFed/SFL) to provide pre-trained neural network architectures that support a variety of medical image types.

Result: The MedSegNet10 repository lets researchers and practitioners train collaboratively while preserving data privacy.

Conclusion: MedSegNet10 advances medical image segmentation through SplitFed while protecting the privacy and integrity of patient data.

Abstract: Machine Learning (ML) and Deep Learning (DL) have shown significant promise in healthcare, particularly in medical image segmentation, which is crucial for accurate disease diagnosis and treatment planning. Despite their potential, challenges such as data privacy concerns, limited annotated data, and inadequate training data persist. Decentralized learning approaches such as federated learning (FL), split learning (SL), and split federated learning (SplitFed/SFL) address these issues effectively. This paper introduces "MedSegNet10," a publicly accessible repository designed for medical image segmentation using split-federated learning. MedSegNet10 provides a collection of pre-trained neural network architectures optimized for various medical image types, including microscopic images of human blastocysts, dermatoscopic images of skin lesions, and endoscopic images of lesions, polyps, and ulcers, with applications extending beyond these examples. By leveraging SplitFed's benefits, MedSegNet10 allows collaborative training on privately stored, horizontally split data, ensuring privacy and integrity. This repository supports researchers, practitioners, trainees, and data scientists, aiming to advance medical image segmentation while maintaining patient data privacy. The repository is available at: https://vault.sfu.ca/index.php/s/ryhf6t12O0sobuX (password upon request to the authors).

Named Entity Recognition in Context

Colin Brisson,Ayoub Kahfy,Marc Bui,Frédéric Constant

Task: Develop a named entity recognition system for the EvaHan2025 competition.

Motivation: Improve entity recognition in Classical Chinese texts by combining a modern Transformer architecture, a retrieval module, and a generative reasoning step.

Details

Method: Combine Pindola (a Transformer-based bidirectional encoder), a retrieval module, and a generative reasoning step.

Result: An average F1 score of 85.58, nearly 5 points above the competition baseline.

Conclusion: The approach performs well on Classical Chinese named entity recognition and clearly beats the baseline.

Abstract: We present the Named Entity Recognition system developed by the Edit Dunhuang team for the EvaHan2025 competition. Our approach integrates three core components: (1) Pindola, a modern transformer-based bidirectional encoder pretrained on a large corpus of Classical Chinese texts; (2) a retrieval module that fetches relevant external context for each target sequence; and (3) a generative reasoning step that summarizes retrieved context in Classical Chinese for more robust entity disambiguation. Using this approach, we achieve an average F1 score of 85.58, improving upon the competition baseline by nearly 5 points.

Unified Multimodal Discrete Diffusion

Alexander Swerdlow,Mihir Prabhudesai,Siddharth Gandhi,Deepak Pathak,Katerina Fragkiadaki

Task: Explore discrete diffusion models as a unified generative formulation for the joint text and image domain.

Motivation: Autoregressive models dominate multimodal generation, but discrete diffusion models offer advantages in controlling quality versus diversity, in joint multimodal inpainting, and in controllability of generation.

Details

Method: Propose UniDisc, a Unified Multimodal Discrete Diffusion model for jointly understanding and generating text and images.

Result: UniDisc outperforms multimodal autoregressive models in performance, inference-time compute, controllability, editability, and inpainting.

Conclusion: Discrete diffusion models have clear advantages for multimodal generation, and UniDisc offers an efficient and controllable approach to joint text and image generation.

Abstract: Multimodal generative models that can understand and generate across multiple modalities are dominated by autoregressive (AR) approaches, which process tokens sequentially from left to right, or top to bottom. These models jointly handle images, text, video, and audio for various tasks such as image captioning, question answering, and image generation. In this work, we explore discrete diffusion models as a unified generative formulation in the joint text and image domain, building upon their recent success in text generation. Discrete diffusion models offer several advantages over AR models, including improved control over quality versus diversity of generated samples, the ability to perform joint multimodal inpainting (across both text and image domains), and greater controllability in generation through guidance. Leveraging these benefits, we present the first Unified Multimodal Discrete Diffusion (UniDisc) model which is capable of jointly understanding and generating text and images for a variety of downstream tasks. We compare UniDisc to multimodal AR models, performing a scaling analysis and demonstrating that UniDisc outperforms them in terms of both performance and inference-time compute, enhanced controllability, editability, inpainting, and flexible trade-off between inference time and generation quality. Code and additional visualizations are available at https://unidisc.github.io.

Both Direct and Indirect Evidence Contribute to Dative Alternation Preferences in Language Models

Qing Yao,Kanishka Misra,Leonie Weissweiler,Kyle Mahowald

Task: Investigate where language models' preferences in the English dative alternation come from, separating the effects of direct and indirect evidence.

Motivation: Determine whether language models' preferences on syntactic phenomena stem from direct exposure to those phenomena or from more general properties of language.

Details

Method: Train small language models on controlled input, analyze how length and animacy affect the choice of dative alternant, and perturb the datasets to manipulate global length effects.

Result: Direct evidence affects the preferences, but they persist even without it; indirect evidence can also give rise to them.

Conclusion: Language models' syntactic preferences come from a mix of direct and indirect evidence.

Abstract: Language models (LMs) tend to show human-like preferences on a number of syntactic phenomena, but the extent to which these are attributable to direct exposure to the phenomena or more general properties of language is unclear. We explore this with the English dative alternation (DO: "gave Y the X" vs. PO: "gave the X to Y"), using a controlled rearing paradigm wherein we iteratively train small LMs on systematically manipulated input. We focus on properties that affect the choice of alternant: length and animacy. Both properties are directly present in datives but also reflect more global tendencies for shorter elements to precede longer ones and animates to precede inanimates. First, by manipulating and ablating datives for these biases in the input, we show that direct evidence of length and animacy matters, but easy-first preferences persist even without such evidence. Then, using LMs trained on systematically perturbed datasets to manipulate global length effects (re-linearizing sentences globally while preserving dependency structure), we find that dative preferences can emerge from indirect evidence. We conclude that LMs' emergent syntactic preferences come from a mix of direct and indirect sources.

VinaBench: Benchmark for Faithful and Consistent Visual Narratives

Silin Gao,Sheryl Mathew,Li Mi,Sepideh Mamooler,Mengjie Zhao,Hiromi Wakaki,Yuki Mitsufuji,Syrielle Montariol,Antoine Bosselut

Task: Propose VinaBench, a new benchmark that addresses the challenges of faithfulness and consistency in visual narrative generation.

Motivation: Because knowledge constraints for planning stories are lacking, generating visual narratives that are faithful to the input text and self-consistent across images remains an open problem.

Details

Method: Annotate the commonsense and discourse constraints in visual narrative samples to provide systematic scaffolds for learning, and propose new evaluation metrics.

Result: Experiments on three generative vision models show that learning with VinaBench's knowledge constraints improves the faithfulness and cohesion of generated visual narratives.

Conclusion: VinaBench provides effective knowledge constraints and evaluation methods for visual narrative generation and improves generation quality.

Abstract: Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.

GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations

Yupei Li,Qiyang Sun,Sunil Munthumoduku Krishna Murthy,Emran Alturki,Björn W. Schuller

Task: Propose GatedxLSTM, a new multimodal emotion recognition model for recognizing dynamic emotions in conversations.

Motivation: Existing multimodal emotion recognition methods do not adequately capture how emotions change during a conversation and do not explain why emotions evolve.

Details

Method: Combine speech and text modalities, use CLAP for cross-modal alignment, apply a gating mechanism to emphasise emotionally impactful utterances, and add the Dialogical Emotion Decoder (DED) to model contextual dependencies.

Result: On the IEMOCAP dataset, GatedxLSTM achieves state-of-the-art performance among open-source methods in four-class emotion classification.

Conclusion: GatedxLSTM improves performance and also offers an interpretation of emotional shifts from a psychological perspective, which makes it suitable for conversational emotion recognition applications.

Abstract: Affective Computing (AC) is essential for advancing Artificial General Intelligence (AGI), with emotion recognition serving as a key component. However, human emotions are inherently dynamic, influenced not only by an individual's expressions but also by interactions with others, and single-modality approaches often fail to capture their full dynamics. Multimodal Emotion Recognition (MER) leverages multiple signals but traditionally relies on utterance-level analysis, overlooking the dynamic nature of emotions in conversations. Emotion Recognition in Conversation (ERC) addresses this limitation, yet existing methods struggle to align multimodal features and explain why emotions evolve within dialogues. To bridge this gap, we propose GatedxLSTM, a novel speech-text multimodal ERC model that explicitly considers voice and transcripts of both the speaker and their conversational partner(s) to identify the most influential sentences driving emotional shifts. By integrating Contrastive Language-Audio Pretraining (CLAP) for improved cross-modal alignment and employing a gating mechanism to emphasise emotionally impactful utterances, GatedxLSTM enhances both interpretability and performance. Additionally, the Dialogical Emotion Decoder (DED) refines emotion predictions by modelling contextual dependencies. Experiments on the IEMOCAP dataset demonstrate that GatedxLSTM achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification. These results validate its effectiveness for ERC applications and provide an interpretability analysis from a psychological perspective.

BioX-CPath: Biologically-driven Explainable Diagnostics for Multistain IHC Computational Pathology

Amaya Gallagher-Syed,Henry Senior,Omnia Alwazzan,Elena Pontarini,Michele Bombardieri,Costantino Pitzalis,Myles J. Lewis,Michael R. Barnes,Luca Rossi,Gregory Slabaugh

Task: Develop BioX-CPath, an explainable graph neural network architecture for whole slide image (WSI) classification in multistain immunohistochemistry (IHC) analysis.

Motivation: Address the key challenge of developing biologically interpretable and explainable models in computational pathology, particularly for multistain IHC analysis.

Details

Method: BioX-CPath uses a novel Stain-Aware Attention Pooling (SAAP) module that combines spatial and semantic features to generate biologically meaningful patient embeddings.

Result: State-of-the-art performance on Rheumatoid Arthritis and Sjogren's Disease multistain datasets, together with interpretability analysis through stain attention scores, entropy measures, and stain interaction scores.

Conclusion: BioX-CPath combines strong performance with biological interpretability, which suits clinical applications where model interpretability is required.

Abstract: The development of biologically interpretable and explainable models remains a key challenge in computational pathology, particularly for multistain immunohistochemistry (IHC) analysis. We present BioX-CPath, an explainable graph neural network architecture for whole slide image (WSI) classification that leverages both spatial and semantic features across multiple stains. At its core, BioX-CPath introduces a novel Stain-Aware Attention Pooling (SAAP) module that generates biologically meaningful, stain-aware patient embeddings. Our approach achieves state-of-the-art performance on both Rheumatoid Arthritis and Sjogren's Disease multistain datasets. Beyond performance metrics, BioX-CPath provides interpretable insights through stain attention scores, entropy measures, and stain interaction scores, that permit measuring model alignment with known pathological mechanisms. This biological grounding, combined with strong classification performance, makes BioX-CPath particularly suitable for clinical applications where interpretability is key. Source code and documentation can be found at: https://github.com/AmayaGS/BioX-CPath.

Hacia la interpretabilidad de la detección anticipada de riesgos de depresión utilizando grandes modelos de lenguaje

Horacio Thompson,Maximiliano Sapino,Edgardo Ferretti,Marcelo Errecalde

Task: Use large language models (LLMs) to solve depression-related Early Detection of Risks (EDR) on Spanish texts.

Motivation: LLMs perform well on many linguistic tasks, but their reasoning ability in specific domains such as depression risk detection still needs to be assessed.

Details

Method: Define a reasoning criterion for analyzing users, apply in-context learning to the Gemini model, and evaluate it quantitatively and qualitatively.

Result: The method yields accurate predictions supported by interpretable reasoning, which gives a deeper understanding of the solution.

Conclusion: The method offers a new perspective on solving EDR problems with LLMs.

Abstract: Early Detection of Risks (EDR) on the Web involves identifying at-risk users as early as possible. Although Large Language Models (LLMs) have proven to solve various linguistic tasks efficiently, assessing their reasoning ability in specific domains is crucial. In this work, we propose a method for solving depression-related EDR using LLMs on Spanish texts, with responses that can be interpreted by humans. We define a reasoning criterion to analyze users through a specialist, apply in-context learning to the Gemini model, and evaluate its performance both quantitatively and qualitatively. The results show that accurate predictions can be obtained, supported by explanatory reasoning, providing a deeper understanding of the solution. Our approach offers new perspectives for addressing EDR problems by leveraging the power of LLMs.

Feature Modulation for Semi-Supervised Domain Generalization without Domain Labels

Venuri Amarasinghe,Asini Jayakody,Isun Randila,Kalinga Bandara,Chamuditha Jayanga Galappaththige,Ranga Rodrigo

Task: Propose a semi-supervised domain generalization method that needs no domain labels and improves model performance through feature modulation and dynamic loss scaling.

Motivation: Existing semi-supervised domain generalization methods rely on pseudo-labels and domain labels, but domain noise makes pseudo-labels inconsistent and degrades model performance.

Details

Method: Use a feature modulation strategy that enhances class-discriminative features and suppresses domain information, and introduce a dynamic loss-scaling function that improves the use of pseudo-labels.

Result: Significant improvements on four major domain generalization benchmarks.

Conclusion: Without domain labels, feature modulation and dynamic loss scaling effectively improve semi-supervised domain generalization.

Abstract: Semi-supervised domain generalization (SSDG) leverages a small fraction of labeled data alongside unlabeled data to enhance model generalization. Most of the existing SSDG methods rely on pseudo-labeling (PL) for unlabeled data, often assuming access to domain labels, a privilege not always available. However, domain shifts introduce domain noise, leading to inconsistent PLs that degrade model performance. Methods derived from FixMatch suffer particularly from lower PL accuracy, reducing the effectiveness of unlabeled data. To address this, we tackle the more challenging domain-label agnostic SSDG, where domain labels for unlabeled data are not available during training. First, we propose a feature modulation strategy that enhances class-discriminative features while suppressing domain-specific information. This modulation shifts features toward Similar Average Representations (a modified version of class prototypes) that are robust across domains, encouraging the classifier to distinguish between closely related classes and the feature extractor to form tightly clustered, domain-invariant representations. Second, to mitigate domain noise and improve pseudo-label accuracy, we introduce a loss-scaling function that dynamically lowers the fixed confidence threshold for pseudo-labels, optimizing the use of unlabeled data. With these key innovations, our approach achieves significant improvements on four major domain generalization benchmarks, even without domain labels. We will make the code available.
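
To make the role of the confidence threshold concrete, here is a minimal sketch of FixMatch-style pseudo-label masking, where an unlabeled sample contributes only if its top predicted probability reaches a threshold. The abstract does not give the paper's loss-scaling function, so this shows only the generic masking step and what happens when the effective threshold is lowered. The function name, probabilities, and threshold values are hypothetical.

```python
def pseudo_label_mask(probs, threshold):
    """Return (pseudo_labels, mask) for a batch of class-probability rows.

    mask[i] is True when the most likely class of sample i has
    probability >= threshold, i.e. the pseudo-label is trusted.
    """
    labels = [max(range(len(p)), key=lambda c: p[c]) for p in probs]
    mask = [p[label] >= threshold for p, label in zip(probs, labels)]
    return labels, mask


# Hypothetical softmax outputs for three unlabeled samples.
probs = [
    [0.97, 0.03],
    [0.40, 0.60],
    [0.15, 0.85],
]

labels, strict_mask = pseudo_label_mask(probs, threshold=0.95)
_, relaxed_mask = pseudo_label_mask(probs, threshold=0.80)
print(labels, strict_mask, relaxed_mask)
```

With the stricter threshold only the first sample is used; lowering it admits the third sample as well. This is the trade-off the abstract describes: more unlabeled data is used at the cost of accepting less confident pseudo-labels.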

Clean & Clear: Feasibility of Safe LLM Clinical Guidance

Julia Ive,Felix Jozsa,Nick Jackson,Paulina Bondaronek,Ciaran Scott Hill,Richard Dobson

Task: Develop and run a preliminary assessment of an LLM-based chatbot that reliably answers clinical guideline questions.

Motivation: Use the potential of LLMs in medical question answering to provide fast and accurate clinical guideline information and make it easier for healthcare professionals to access locally relevant information.

Details

Method: Use the open-weight Llama-3.1-8B LLM to extract information from UCLH clinical guidelines to answer questions, prioritising the safety and reliability of referencing information; seven doctors assess its performance.

Result: The chatbot performs well on relevance (~73% of responses rated very relevant), recall (0.98), completeness (~78% rated satisfactory), and efficiency (10 seconds on average), but a small share of responses (~14.5%) contain unnecessary information.

Conclusion: The chatbot shows significant potential to speed up and improve how healthcare professionals access locally relevant clinical information.

Abstract: Background: Clinical guidelines are central to safe evidence-based medicine in modern healthcare, providing diagnostic criteria, treatment options and monitoring advice for a wide range of illnesses. LLM-empowered chatbots have shown great promise in Healthcare Q&A tasks, offering the potential to provide quick and accurate responses to medical inquiries. Our main objective was the development and preliminary assessment of an LLM-empowered chatbot software capable of reliably answering clinical guideline questions using University College London Hospital (UCLH) clinical guidelines. Methods: We used the open-weight Llama-3.1-8B LLM to extract relevant information from the UCLH guidelines to answer questions. Our approach highlights the safety and reliability of referencing information over its interpretation and response generation. Seven doctors from the ward assessed the chatbot's performance by comparing its answers to the gold standard. Results: Our chatbot demonstrates promising performance in terms of relevance, with ~73% of its responses rated as very relevant, showcasing a strong understanding of the clinical context. Importantly, our chatbot achieves a recall of 0.98 for extracted guideline lines, substantially minimising the risk of missing critical information. Approximately 78% of responses were rated satisfactory in terms of completeness. A small portion (~14.5%) contained minor unnecessary information, indicating occasional lapses in precision. The chatbot showed high efficiency, with an average completion time of 10 seconds, compared to 30 seconds for human respondents. Evaluation of clinical reasoning showed that 72% of the chatbot's responses were without flaws. Our chatbot demonstrates significant potential to speed up and improve the process of accessing locally relevant clinical information for healthcare professionals.
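
The recall figure reported for extracted guideline lines can be read as a set overlap between what the chatbot extracted and what the gold standard marks as relevant. A minimal sketch follows, assuming each guideline line has an integer identifier. The function name and the toy identifiers are illustrative assumptions, not the study's evaluation code.

```python
def line_recall_precision(extracted, gold):
    """Recall and precision of extracted guideline line identifiers."""
    extracted_set, gold_set = set(extracted), set(gold)
    hits = len(extracted_set & gold_set)
    recall = hits / len(gold_set) if gold_set else 0.0
    precision = hits / len(extracted_set) if extracted_set else 0.0
    return recall, precision


# Hypothetical line identifiers for one question.
gold_lines = [12, 13, 14, 20]
extracted_lines = [12, 13, 14, 20, 31]

recall, precision = line_recall_precision(extracted_lines, gold_lines)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Here every gold line is recovered, so recall is perfect, while the one extra line lowers precision. This mirrors the pattern in the abstract, where high recall coexists with a small share of responses containing unnecessary information.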

Prototype Guided Backdoor Defense

Venkat Adithya Amula,Sunayana Samavedam,Saurabh Saini,Avani Gupta,Narayanan P J

Task: Propose Prototype Guided Backdoor Defense (PGBD), a post-hoc defense against backdoor attacks on deep learning models.

Motivation: Deep learning models are vulnerable to backdoor attacks, especially those using semantic triggers, and generative AI has made the problem worse, so a robust defense that handles many trigger types is needed.

Details

Method: PGBD uses geometric displacements in activation space to penalize movements toward the trigger, implemented as a novel sanitization loss in a post-hoc fine-tuning step.

Result: PGBD performs better across all settings and presents the first defense against a new semantic attack on celebrity face images.

Conclusion: PGBD is a robust and scalable backdoor defense that handles many attack types, including semantic triggers.

Abstract: Deep learning models are susceptible to backdoor attacks involving malicious attackers perturbing a small subset of training data with a trigger to cause misclassifications. Various triggers have been used, including semantic triggers that are easily realizable without requiring the attacker to manipulate the image. The emergence of generative AI has eased the generation of varied poisoned samples. Robustness across types of triggers is crucial to effective defense. We propose Prototype Guided Backdoor Defense (PGBD), a robust post-hoc defense that scales across different trigger types, including previously unsolved semantic triggers. PGBD exploits displacements in the geometric spaces of activations to penalize movements toward the trigger. This is done using a novel sanitization loss of a post-hoc fine-tuning step. The geometric approach scales easily to all types of attacks. PGBD achieves better performance across all settings. We also present the first defense against a new semantic attack on celebrity face images. Project page: https://venkatadithya9.github.io/pgbd.github.io/.

Sociotechnical Effects of Machine Translation

Joss Moorkens,Andy Way,Séamus Lankford

Task: Discuss the side-effects and risks of machine translation (MT) and how they might be mitigated.

Motivation: With the move to neural MT and large language models (LLMs), their impact on climate change and their potential negative effects on translators and users have drawn attention.

Details

Method: Analyze the advantages of smaller models and pre-trained models, discuss data copyright, ownership, and ethical issues, and propose a method for using MT in crisis scenarios.

Result: Smaller models and tuned pre-trained models can be built with much lower carbon footprints, and MT in crisis scenarios can save lives.

Conclusion: Using MT properly and attending to ethical issues can mitigate its negative effects and bring out its benefits.

Abstract: While the previous chapters have shown how machine translation (MT) can be useful, in this chapter we discuss some of the side-effects and risks that are associated, and how they might be mitigated. With the move to neural MT and approaches using Large Language Models (LLMs), there is an associated impact on climate change, as the models built by multinational corporations are massive. They are hugely expensive to train, consume large amounts of electricity, and output huge volumes of kgCO2 to boot. However, smaller models which still perform to a high level of quality can be built with much lower carbon footprints, and tuning pre-trained models saves on the requirement to train from scratch. We also discuss the possible detrimental effects of MT on translators and other users. The topics of copyright and ownership of data are discussed, as well as ethical considerations on data and MT use. Finally, we show how if done properly, using MT in crisis scenarios can save lives, and we provide a method of how this might be done.

LATTE-MV: Learning to Anticipate Table Tennis Hits from Monocular Videos

Daniel Etaat,Dvij Kalaria,Nima Rahmanian,Shankar Sastry

Task: Design a table tennis agent that can anticipate the opponent's intent.

Motivation: In fast-paced, highly dynamic table tennis, champions buy reaction time by anticipating their opponent's intent, while existing systems make little use of anticipation.

Details

Method: Propose a scalable system for 3D reconstruction from monocular video and an uncertainty-aware controller that anticipates opponent actions.

Result: In simulation, the policy improves the ball return rate against high-speed hits from 49.9% to 59.0%.

Conclusion: Anticipating opponent actions significantly improves the performance of the table tennis agent.

Abstract: Physical agility is a necessary skill in competitive table tennis, but by no means sufficient. Champions excel in this fast-paced and highly dynamic environment by anticipating their opponent's intent - buying themselves the necessary time to react. In this work, we take one step towards designing such an anticipatory agent. Previous works have developed systems capable of real-time table tennis gameplay, though they often do not leverage anticipation. Among the works that forecast opponent actions, their approaches are limited by dataset size and variety. Our paper contributes (1) a scalable system for reconstructing monocular video of table tennis matches in 3D and (2) an uncertainty-aware controller that anticipates opponent actions. We demonstrate in simulation that our policy improves the ball return rate against high-speed hits from 49.9% to 59.0% as compared to a baseline non-anticipatory policy.

Arnav Arora,Srishti Yadav,Maria Antoniak,Serge Belongie,Isabelle Augenstein

Task: Develop a multi-modal, multi-label framing analysis method for the text and images in news.

Motivation: Existing studies usually focus only on text and ignore the editorial choices reflected in images, leaving the analysis incomplete.

Details

Method: Uses large (vision-)language models for multi-modal framing analysis, contrasting the latent meaning conveyed by text and by images. Result: The method identifies highly partisan framing and gives a more complete picture of media bias. Conclusion: The approach offers a scalable solution for multi-modal framing analysis of news. Abstract: Automated frame analysis of political communication is a popular task in computational social science that is used to study how authors select aspects of a topic to frame its reception. So far, such studies have been narrow, in that they use a fixed set of pre-defined frames and focus only on the text, ignoring the visual contexts in which those texts appear. Especially for framing in the news, this leaves out valuable information about editorial choices, which include not just the written article but also accompanying photographs. To overcome such limitations, we present a method for conducting multi-modal, multi-label framing analysis at scale using large (vision-)language models. Grounding our work in framing theory, we extract latent meaning embedded in images used to convey a certain point and contrast that to the text by comparing the respective frames used. We also identify highly partisan framing of topics with issue-specific frame analysis found in prior qualitative work. We demonstrate a method for doing scalable integrative framing analysis of both text and image in news, providing a more complete picture for understanding media bias.

Eyes Tell the Truth: GazeVal Highlights Shortcomings of Generative AI in Medical Imaging

David Wong,Bin Wang,Gorkem Durak,Marouane Tliba,Akshay Chaudhari,Aladine Chetouani,Ahmet Enis Cetin,Cagdas Topel,Nicolo Gennaro,Camila Lopes Vendrami,Tugce Agirlar Trabzonlu,Amir Ali Rahsepar,Laetitia Perronne,Matthew Antalek,Onural Ozturk,Gokcan Okur,Andrew C. Gordon,Ayis Pyrros,Frank H. Miller,Amir Borhani,Hatice Savas,Eric Hart,Drew Torigian,Jayaram K. Udupa,Elizabeth Krupinski,Ulas Bagci

Task: Propose GazeVal, a framework that combines expert eye-tracking data with radiological evaluation to assess the quality of synthetic medical images.

Motivation: Synthetic medical images are currently evaluated mainly with computational metrics, which do not reflect clinical authenticity and undermine the reliability of AI-driven medical tools.

Details

Method: GazeVal combines expert eye-tracking data with direct radiological evaluation to analyze how experts perceive and interact with synthetic images. Result: In the experiments, radiologists identified 96.6% of the images generated by the most recent AI algorithm as fake, exposing the limits of generative AI in clinical accuracy. Conclusion: GazeVal provides an evaluation method for synthetic images that is closer to clinical practice and shows that generative AI still needs improvement in the medical domain. Abstract: The demand for high-quality synthetic data for model training and augmentation has never been greater in medical imaging. However, current evaluations predominantly rely on computational metrics that fail to align with human expert recognition. This leads to synthetic images that may appear realistic numerically but lack clinical authenticity, posing significant challenges in ensuring the reliability and effectiveness of AI-driven medical tools. To address this gap, we introduce GazeVal, a practical framework that synergizes expert eye-tracking data with direct radiological evaluations to assess the quality of synthetic medical images. GazeVal leverages gaze patterns of radiologists as they provide a deeper understanding of how experts perceive and interact with synthetic data in different tasks (i.e., diagnostic or Turing tests). Experiments with sixteen radiologists revealed that 96.6% of the generated images (by the most recent state-of-the-art AI algorithm) were identified as fake, demonstrating the limitations of generative AI in producing clinically accurate images.

ScreenLLM: Stateful Screen Schema for Efficient Action Understanding and Prediction

Yiqiao Jin,Stefano Petrangeli,Yu Shen,Gang Wu

Task: Develop an efficient training approach for GUI agents that addresses sparse supervision signals, large datasets, and the need for nuanced user understanding.

Motivation: GUI agents show promise for intelligent user assistance and automation, but training them is hampered by sparse supervision signals, large datasets, and the need for nuanced user understanding.

Details

Method: Proposes the stateful screen schema as an efficient representation of GUI interactions and builds on it ScreenLLM, a set of multimodal large language models for advanced UI understanding and action prediction. Result: Experiments show that ScreenLLM accurately models user behavior and predicts actions. Conclusion: The work lays the foundation for scalable, robust, and intelligent GUI agents that improve user interaction across software environments. Abstract: Graphical User Interface (GUI) agents are autonomous systems that interpret and generate actions, enabling intelligent user assistance and automation. Effective training of these agents presents unique challenges, such as sparsity in supervision signals, scalability for large datasets, and the need for nuanced user understanding. We propose stateful screen schema, an efficient representation of GUI interactions that captures key user actions and intentions over time. Building on this foundation, we introduce ScreenLLM, a set of multimodal large language models (MLLMs) tailored for advanced UI understanding and action prediction. Extensive experiments on both open-source and proprietary models show that ScreenLLM accurately models user behavior and predicts actions. Our work lays the foundation for scalable, robust, and intelligent GUI agents that enhance user interaction in diverse software environments.

MVFNet: Multipurpose Video Forensics Network using Multiple Forms of Forensic Evidence

Tai D. Nguyen,Matthew C. Stamm

Task: Develop a multipurpose video forensics network (MVFNet) that can detect multiple types of video manipulation, including inpainting, deepfakes, splicing, and editing.

Motivation: Existing forensic networks usually detect only a single manipulation type, whereas in practice manipulations are varied and unknown in advance, so a general-purpose solution is needed.

Details

Method: Extracts and jointly analyzes multiple forensic feature modalities that capture spatial and temporal anomalies, and uses a Multi-Scale Hierarchical Transformer module to detect inconsistencies across spatial scales. Result: Experiments show that MVFNet achieves state-of-the-art performance when multiple manipulations are possible and rivals specialized detectors in targeted scenarios. Conclusion: MVFNet is an effective multipurpose video forensics network that can handle manipulation types that are not known in advance. Abstract: While videos can be falsified in many different ways, most existing forensic networks are specialized to detect only a single manipulation type (e.g. deepfake, inpainting). This poses a significant issue as the manipulation used to falsify a video is not known a priori. To address this problem, we propose MVFNet - a multipurpose video forensics network capable of detecting multiple types of manipulations including inpainting, deepfakes, splicing, and editing. Our network does this by extracting and jointly analyzing a broad set of forensic feature modalities that capture both spatial and temporal anomalies in falsified videos. To reliably detect and localize fake content of all shapes and sizes, our network employs a novel Multi-Scale Hierarchical Transformer module to identify forensic inconsistencies across multiple spatial scales. Experimental results show that our network obtains state-of-the-art performance in general scenarios where multiple different manipulations are possible, and rivals specialized detectors in targeted scenarios.

Patients Speak, AI Listens: LLM-based Analysis of Online Reviews Uncovers Key Drivers for Urgent Care Satisfaction

Xiaoran Xu,Zhaoqian Xue,Chi Zhang,Jhonatan Medri,Junjie Xiong,Jiayan Zhou,Jin Jin,Yongfeng Zhang,Siyuan Ma,Lingyao Li

Task: Analyze the public's experience of urgent care facilities to support community healthcare development.

Motivation: Traditional surveys fall short in scope, time, and spatial coverage, whereas crowdsourcing from online reviews or social media can provide more comprehensive insights.

Details

Method: Collects Google Maps reviews, applies prompt engineering with a GPT model to analyze aspect-based sentiment toward urgent care, and examines geospatial patterns and socioeconomic characteristics. Result: Interpersonal factors and operational efficiency are the main determinants of patient satisfaction, while technical quality, finances, and facilities show no significant independent effects in multivariate models; population density has a weak association with patient ratings. Conclusion: Crowdsourcing can reveal the key factors behind public satisfaction and gives stakeholders valuable insights for improving urgent care. Abstract: Investigating the public experience of urgent care facilities is essential for promoting community healthcare development. Traditional survey methods often fall short due to limited scope, time, and spatial coverage. Crowdsourcing through online reviews or social media offers a valuable approach to gaining such insights. With recent advancements in large language models (LLMs), extracting nuanced perceptions from reviews has become feasible. This study collects Google Maps reviews across the DMV and Florida areas and conducts prompt engineering with the GPT model to analyze the aspect-based sentiment of urgent care. We first analyze the geospatial patterns of various aspects, including interpersonal factors, operational efficiency, technical quality, finances, and facilities. Next, we determine Census Block Group (CBG)-level characteristics underpinning differences in public perception, including population density, median income, GINI Index, rent-to-income ratio, household below poverty rate, no insurance rate, and unemployment rate. Our results show that interpersonal factors and operational efficiency emerge as the strongest determinants of patient satisfaction in urgent care, while technical quality, finances, and facilities show no significant independent effects when adjusted for in multivariate models. Among socioeconomic and demographic factors, only population density demonstrates a significant but modest association with patient ratings, while the remaining factors exhibit no significant correlations. Overall, this study highlights the potential of crowdsourcing to uncover the key factors that matter to residents and provide valuable insights for stakeholders to improve public satisfaction with urgent care.

Forensic Self-Descriptions Are All You Need for Zero-Shot Detection, Open-Set Source Attribution, and Clustering of AI-generated Images

Tai D. Nguyen,Aref Azizpour,Matthew C. Stamm

Task: Propose a new method that models the microstructures of images to detect synthetic images and attribute them to their source.

Motivation: Traditional methods generalize poorly to unseen generators because they rely on features learned from known sources during training.

Details

Method: Learns a set of diverse predictive filters in a self-supervised manner, extracts multi-scale residuals that capture the microstructures, and builds a unique forensic self-description for each image. Result: The method performs strongly on zero-shot detection, open-set source attribution, and clustering, outperforming existing techniques. Conclusion: The method improves the accuracy and adaptability of synthetic media forensics and advances the state of the art. Abstract: The emergence of advanced AI-based tools to generate realistic images poses significant challenges for forensic detection and source attribution, especially as new generative techniques appear rapidly. Traditional methods often fail to generalize to unseen generators due to reliance on features specific to known sources during training. To address this problem, we propose a novel approach that explicitly models forensic microstructures - subtle, pixel-level patterns unique to the image creation process. Using only real images in a self-supervised manner, we learn a set of diverse predictive filters to extract residuals that capture different aspects of these microstructures. By jointly modeling these residuals across multiple scales, we obtain a compact model whose parameters constitute a unique forensic self-description for each image. This self-description enables us to perform zero-shot detection of synthetic images, open-set source attribution of images, and clustering based on source without prior knowledge. Extensive experiments demonstrate that our method achieves superior accuracy and adaptability compared to competing techniques, advancing the state of the art in synthetic media forensics.

Hannah Kim,Sofia Martinez,Jason Lee

Task: Propose a Cross-Modal State-Space Graph Reasoning (CSS-GR) framework for extracting compact, meaningful summaries from large-scale multimodal data.

Motivation: Existing cross-modal summarization methods suffer from high computational overhead and poor interpretability, so a more efficient and interpretable solution is needed.

Details

Method: Combines a state-space model with graph reasoning, building a graph that captures inter- and intra-modal relationships to reason more holistically over textual and visual streams. Result: On standard multimodal summarization benchmarks, the approach significantly improves summarization quality and interpretability while maintaining computational efficiency. Conclusion: By combining graph reasoning with a state-space model, CSS-GR addresses the computational and interpretability problems of cross-modal summarization. Abstract: The ability to extract compact, meaningful summaries from large-scale and multimodal data is critical for numerous applications, ranging from video analytics to medical reports. Prior methods in cross-modal summarization have often suffered from high computational overheads and limited interpretability. In this paper, we propose a Cross-Modal State-Space Graph Reasoning (CSS-GR) framework that incorporates a state-space model with graph-based message passing, inspired by prior work on efficient state-space models. Unlike existing approaches relying on purely sequential models, our method constructs a graph that captures inter- and intra-modal relationships, allowing more holistic reasoning over both textual and visual streams. We demonstrate that our approach significantly improves summarization quality and interpretability while maintaining computational efficiency, as validated on standard multimodal summarization benchmarks. We also provide a thorough ablation study to highlight the contributions of each component.

Reconstructing Gridded Data from Higher Autocorrelations

W. Riley Casper,Bobby Orozco

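Task: Reconstruct gridded data from its higher-order autocorrelations.

Motivation: Higher-order autocorrelations are widely used in X-ray crystallography, computer vision, correlation tomography, and related fields, but how to reconstruct the original data from them is an important open question.

Details

Method: Describes an explicit reconstruction algorithm and proves that the autocorrelations up to order 3r + 3 suffice to determine the data, where r is the dimension of the grid. Result: Autocorrelations up to order 3r + 3 always determine the data up to translation, and examples show that order 3r + 2 is not sufficient. Conclusion: Higher-order autocorrelations play an important role in reconstructing gridded data, and order 3r + 3 is a sufficient condition. Abstract: The higher-order autocorrelations of integer-valued or rational-valued gridded data sets appear naturally in X-ray crystallography, and have applications in computer vision systems, correlation tomography, correlation spectroscopy, and pattern recognition. In this paper, we consider the problem of reconstructing a gridded data set from its higher-order autocorrelations. We describe an explicit reconstruction algorithm, and prove that the autocorrelations up to order 3r + 3 are always sufficient to determine the data up to translation, where r is the dimension of the grid. We also provide examples of rational-valued gridded data sets which are not determined by their autocorrelations up to order 3r + 2.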
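To make "determined up to translation" concrete, the sketch below computes higher-order autocorrelations of a small integer-valued data set on a 2D grid and checks that they do not change when the data is shifted. It is only an illustration, not the paper's reconstruction algorithm. It assumes the common convention that the order-k autocorrelation takes k - 1 shifts and sums f(x) f(x + s_1) ... f(x + s_{k-1}) over all grid points x; the paper's exact indexing may differ. The names `autocorrelation` and `translate` are hypothetical.

```python
def autocorrelation(data, shifts):
    """Sum over x of f(x) * f(x + s_1) * ... * f(x + s_{k-1}).

    data   : dict mapping grid points (tuples of ints) to values; missing points are 0
    shifts : tuple of k - 1 shift vectors, giving an order-k autocorrelation
             (assumed convention)
    """
    total = 0
    for x, value in data.items():
        product = value
        for s in shifts:
            shifted = tuple(a + b for a, b in zip(x, s))
            product *= data.get(shifted, 0)
        total += product
    return total


def translate(data, offset):
    """Shift every grid point of the data set by the same offset."""
    return {tuple(a + b for a, b in zip(x, offset)): v for x, v in data.items()}


# A tiny data set on a 2D grid (r = 2).
data = {(0, 0): 2, (1, 0): 1, (0, 1): 3}
moved = translate(data, (5, -2))

order2 = autocorrelation(data, ((1, 0),))         # one shift  -> order 2
order3 = autocorrelation(data, ((1, 0), (0, 1)))  # two shifts -> order 3

print(order2, order3)
print(order3 == autocorrelation(moved, ((1, 0), (0, 1))))
```

Because every autocorrelation is a sum over all grid points, translating the data leaves each value unchanged, which is why reconstruction can only ever recover the data up to translation.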

Multi-head Reward Aggregation Guided by Entropy

Xiaomin Li,Xupeng Chen,Jingxuan Fan,Eric Hanchen Jiang,Mingye Gao

Task: Propose ENCORE, an entropy-guided method for composing multi-head rewards, to improve the safety alignment of large language models (LLMs).

Motivation: Human ratings against safety rules are inconsistent, and rules with high rating entropy are less reliable at identifying the responses humans prefer.

Details

Method: Uses an entropy-guided scheme that downweights high-entropy rules when composing multi-head rewards. Result: The method significantly outperforms several baselines on RewardBench safety tasks. Conclusion: ENCORE is a training-free, broadly applicable, and interpretable approach to multi-attribute reward modeling. Abstract: Aligning large language models (LLMs) with safety guidelines typically involves reinforcement learning from human feedback (RLHF), relying on human-generated preference annotations. However, assigning consistent overall quality ratings is challenging, prompting recent research to shift towards detailed evaluations based on multiple specific safety criteria. This paper uncovers a consistent observation: safety rules characterized by high rating entropy are generally less reliable in identifying responses preferred by humans. Leveraging this finding, we introduce ENCORE, a straightforward entropy-guided approach that composes multi-head rewards by downweighting rules exhibiting high rating entropy. Theoretically, we demonstrate that rules with elevated entropy naturally receive minimal weighting in the Bradley-Terry optimization framework, justifying our entropy-based penalization. Through extensive experiments on RewardBench safety tasks, our method significantly surpasses several competitive baselines, including random weighting, uniform weighting, single-head Bradley-Terry models, and LLM-based judging methods. Our proposed approach is training-free, broadly applicable to various datasets, and maintains interpretability, offering a practical and effective solution for multi-attribute reward modeling.
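The idea of downweighting high-entropy rules can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the abstract does not give the weighting formula, so the `exp(-entropy / temperature)` weighting, the discrete rating scale, the rule names, and all function names here are assumptions.

```python
import math
from collections import Counter


def rating_entropy(ratings):
    """Shannon entropy (in nats) of the empirical distribution of discrete ratings."""
    counts = Counter(ratings)
    total = len(ratings)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def entropy_weights(ratings_per_rule, temperature=1.0):
    """Assumed scheme: weight proportional to exp(-entropy / temperature), normalized to 1."""
    raw = {
        rule: math.exp(-rating_entropy(ratings) / temperature)
        for rule, ratings in ratings_per_rule.items()
    }
    normalizer = sum(raw.values())
    return {rule: value / normalizer for rule, value in raw.items()}


def aggregate_reward(head_scores, weights):
    """Combine per-rule reward-head scores into one scalar reward."""
    return sum(weights[rule] * score for rule, score in head_scores.items())


# Hypothetical ratings collected for two safety rules.
ratings = {
    "no_violence": [1, 1, 1, 1, 5, 1, 1, 1],  # ratings mostly agree: low entropy
    "politeness": [1, 2, 3, 4, 5, 1, 3, 5],   # ratings spread out: high entropy
}
weights = entropy_weights(ratings)
reward = aggregate_reward({"no_violence": 0.9, "politeness": 0.2}, weights)
print(weights, reward)
```

In this toy example the rule with the more consistent ratings receives the larger weight, so its reward head dominates the aggregated score.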

What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning

Chi-Hsi Kung,Frangil Ramirez,Juhyung Ha,Yi-Ting Chen,David Crandall,Yi-Hsuan Tsai

Task: Study how to learn procedure-aware video representations by combining LLM-generated state-change descriptions with counterfactual reasoning.

Motivation: Existing work does not explicitly learn scene changes (state changes), yet understanding a procedural activity requires modeling how action steps change the scene and how scene changes influence the sequence of actions.

Details

Method: Uses state-change descriptions generated by large language models as supervision signals for video encoders, and generates state-change counterfactuals that simulate hypothesized failure outcomes. Result: The approach yields significant improvements on tasks such as temporal action segmentation and error detection. Conclusion: The proposed state-change descriptions and their counterfactuals effectively improve the model's procedure awareness. Abstract: Understanding a procedural activity requires modeling both how action steps transform the scene, and how evolving scene transformations can influence the sequence of action steps, even those that are accidental or erroneous. Existing work has studied procedure-aware video representations by proposing novel approaches such as modeling the temporal order of actions but has not explicitly learned the state changes (scene transformations). In this work, we study procedure-aware video representation learning by incorporating state-change descriptions generated by Large Language Models (LLMs) as supervision signals for video encoders. Moreover, we generate state-change counterfactuals that simulate hypothesized failure outcomes, allowing models to learn by imagining the unseen "What if" scenarios. This counterfactual reasoning facilitates the model's ability to understand the cause and effect of each step in an activity. To verify the procedure awareness of our model, we conduct extensive experiments on procedure-aware tasks, including temporal action segmentation and error detection. Our results demonstrate the effectiveness of the proposed state-change descriptions and their counterfactuals and achieve significant improvements on multiple tasks. We will make our source code and data publicly available soon.

Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters

Mahmoud Alwakeel,Emory Buck,Jonathan G. Martin,Imran Aslam,Sudarshan Rajagopal,Jian Pei,Mihai V. Podgoreanu,Christopher J. Lindsell,An-Kwok Ian Wong

Task: Evaluate the accuracy of large language models (LLMs) in extracting pulmonary embolism (PE)-related concepts from CTPE reports.

Motivation: Pulmonary embolism (PE) is a leading cause of cardiovascular mortality, but understanding of its optimal management is limited by heterogeneous and hard-to-access radiology documentation. The PERT Consortium registry standardizes PE management data but depends on resource-intensive manual abstraction.

Details

Method: Retrospectively analyzes CTPE reports from MIMIC-IV and Duke Health and compares multiple LLaMA models. Result: Larger models (70B) outperformed smaller ones (8B), reaching high kappa values for PE detection, PE location, right heart strain, and image artifacts. A dual-model review framework achieved >80-90% precision. Conclusion: LLMs show strong potential for automating PE registry abstraction, reducing manual workload while preserving accuracy. Abstract: Pulmonary embolism (PE) is a leading cause of cardiovascular mortality, yet our understanding of optimal management remains limited due to heterogeneous and inaccessible radiology documentation. The PERT Consortium registry standardizes PE management data but depends on resource-intensive manual abstraction. Large language models (LLMs) offer a scalable alternative for automating concept extraction from computed tomography PE (CTPE) reports. This study evaluated the accuracy of LLMs in extracting PE-related concepts compared to a human-curated criterion standard. We retrospectively analyzed MIMIC-IV and Duke Health CTPE reports using multiple LLaMA models. Larger models (70B) outperformed smaller ones (8B), achieving kappa values of 0.98 (PE detection), 0.65-0.75 (PE location), 0.48-0.51 (right heart strain), and 0.65-0.70 (image artifacts). Moderate temperature tuning (0.2-0.5) improved accuracy, while excessive in-context examples reduced performance. A dual-model review framework achieved >80-90% precision. LLMs demonstrate strong potential for automating PE registry abstraction, minimizing manual workload while preserving accuracy.
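The kappa values reported above measure agreement beyond chance between model output and the human-curated standard. The abstract does not say which kappa variant was used, so the sketch below assumes unweighted Cohen's kappa for two raters; the function name and the example labels are hypothetical and not taken from the study data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for eight reports.
human = ["PE", "PE", "no PE", "no PE", "PE", "no PE", "no PE", "no PE"]
model = ["PE", "PE", "no PE", "no PE", "no PE", "no PE", "no PE", "no PE"]
kappa = cohens_kappa(human, model)
print(round(kappa, 3))
```

Here the raters agree on 7 of 8 reports (0.875), chance agreement is 0.5625, and kappa is therefore about 0.714, noticeably lower than raw agreement.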

Online Reasoning Video Segmentation with Just-in-Time Digital Twins

Yiqing Shen,Bohan Liu,Chenjia Li,Lalithkumar Seenivasan,Mathias Unberath

Task: Develop an agent framework for online video reasoning segmentation (RS) that requires no fine-tuning of multimodal large language models (LLMs).

Motivation: Current reasoning segmentation methods depend on the visual perception of multimodal LLMs, which makes multi-step reasoning difficult, requires frequent fine-tuning, and scales poorly to online video data.

Details

Method: Proposes an agent framework built on a just-in-time digital twin that decouples perception from reasoning: specialist vision models build a low-level scene representation, and the LLM reasons over it. Result: Introduces a comprehensive video reasoning segmentation benchmark of 200 videos and 895 implicit text queries, covering semantic, spatial, and temporal reasoning. Conclusion: The framework addresses the limitations of current methods and enables efficient online video reasoning segmentation without LLM fine-tuning. Abstract: Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high-level commands without requiring explicit step-by-step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or those that involve complex spatial/temporal relationships. Second, they necessitate LLM fine-tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase risks of catastrophic forgetting during fine-tuning. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning. Our innovation is the introduction of a just-in-time digital twin concept, where -- given an implicit query -- an LLM plans the construction of a low-level scene representation from high-level video using specialist vision models. We refer to this approach to creating a digital twin as "just-in-time" because the LLM planner will anticipate the need for specific information and only request this limited subset instead of always evaluating every specialist model. The LLM then performs reasoning on this digital twin representation to identify target objects. To evaluate our approach, we introduce a new comprehensive video reasoning segmentation benchmark comprising 200 videos with 895 implicit text queries. The benchmark spans three reasoning categories (semantic, spatial, and temporal) with three different levels of reasoning chain complexity.

Can Large Language Models Predict Associations Among Human Attitudes?

Ana Ma,Derek Powell

Task: Investigate whether large language models (LLMs) can predict associations among human attitudes across different topics.

Motivation: Explore whether LLMs capture the deeper associations among human attitudes, not only those between superficially similar attitudes.

Details

Method: Uses GPT-4o to analyze human responses to diverse attitude statements and tests its predictive ability when surface similarity is absent. Result: GPT-4o recreates the correlations among attitudes and still produces meaningful social inferences without surface similarity. Conclusion: LLMs capture the deeper, latent structure of human belief systems. Abstract: Prior work has shown that large language models (LLMs) can predict human attitudes based on other attitudes, but this work has largely focused on predictions from highly similar and interrelated attitudes. In contrast, human attitudes are often strongly associated even across disparate and dissimilar topics. Using a novel dataset of human responses toward diverse attitude statements, we found that a frontier language model (GPT-4o) was able to recreate the pairwise correlations among individual attitudes and to predict individuals' attitudes from one another. Crucially, in an advance over prior work, we tested GPT-4o's ability to predict in the absence of surface similarity between attitudes, finding that while surface similarity improves prediction accuracy, the model was still highly capable of generating meaningful social inferences between dissimilar attitudes. Altogether, our findings indicate that LLMs capture crucial aspects of the deeper, latent structure of human belief systems.

Neural Architecture Search by Learning a Hierarchical Search Space

Mehraveh Javan Roshtkhari,Matthew Toews,Marco Pedersoli

Task: Study the application of Monte-Carlo Tree Search (MCTS) to Neural Architecture Search (NAS), specifically for image classification.

Motivation: The performance of MCTS depends heavily on the order in which nodes are branched, and since only the final architecture matters in NAS, optimizing the branching order can make the search more efficient.

Details

Method: Proposes learning the branching order by hierarchical clustering of architectures based on their similarity, measured as the pairwise distance between the architectures' output vectors. Result: Experiments on CIFAR10 and ImageNet show that MCTS, given a good branching hierarchy, finds promising solutions more efficiently than other NAS methods. Conclusion: MCTS is promising for NAS, and an optimized branching order in particular can substantially improve search efficiency. Abstract: Monte-Carlo Tree Search (MCTS) is a powerful tool for many non-differentiable search-related problems such as adversarial games. However, the performance of such an approach highly depends on the order of the nodes that are considered at each branching of the tree. If the first branches cannot distinguish between promising and deceiving configurations for the final task, the efficiency of the search is exponentially reduced. In Neural Architecture Search (NAS), as only the final architecture matters, the visiting order of the branching can be optimized to improve learning. In this paper, we study the application of MCTS to NAS for image classification. We analyze several sampling methods and branching alternatives for MCTS and propose to learn the branching by hierarchical clustering of architectures based on their similarity. The similarity is measured by the pairwise distance of output vectors of architectures. Extensive experiments on two challenging benchmarks on CIFAR10 and ImageNet show that MCTS, if provided with a good branching hierarchy, can yield promising solutions more efficiently than other approaches for NAS problems.
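The step of building a branching hierarchy from output-vector distances can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's procedure: the abstract only says that architectures are clustered hierarchically by the pairwise distance of their output vectors, so the Euclidean distance, the average linkage, the function names, and the toy output vectors are all assumptions.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two output vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def build_hierarchy(outputs):
    """Agglomerative clustering with average linkage (assumed linkage).

    outputs : dict mapping architecture name -> output vector
    returns : nested tuples; leaves are names, internal nodes are (left, right)
    """
    # Each cluster is a pair (subtree, list of member architecture names).
    clusters = [(name, [name]) for name in outputs]

    def linkage(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(euclidean(outputs[a], outputs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical softmax outputs of four candidate architectures on one probe input.
outputs = {
    "arch_a": [0.9, 0.1, 0.0],
    "arch_b": [0.8, 0.2, 0.0],
    "arch_c": [0.1, 0.1, 0.8],
    "arch_d": [0.0, 0.3, 0.7],
}
tree = build_hierarchy(outputs)
print(tree)
```

In such a tree, the first branching decision separates groups of architectures that behave differently, so an early choice in the search already distinguishes between dissimilar candidates.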

Enhancing Korean Dependency Parsing with Morphosyntactic Features

Jungyeul Park,Yige Chen,Kyuwon Kim,KyungTae Lim,Chulwoo Park

Task: Propose the UniDive framework, which integrates Universal Dependencies and Universal Morphology to improve the morphosyntactic representation and processing of Korean.

Motivation: Korean's rich inflectional morphology and flexible word order challenge existing frameworks and lead to inconsistencies between morphological and syntactic analysis.

Details

Method: Combines UD syntactic dependencies with UniMorph morphological features to build a uniformly annotated dataset and applies it to dependency parsing. Result: Experiments show that enriched morphosyntactic features improve dependency parsing accuracy, particularly in distinguishing grammatical relations influenced by morphology. Conclusion: Explicit morphological information supports more accurate syntactic analysis, and UniDive improves the consistency of Korean language processing. Abstract: This paper introduces UniDive for Korean, an integrated framework that bridges Universal Dependencies (UD) and Universal Morphology (UniMorph) to enhance the representation and processing of Korean morphosyntax. Korean's rich inflectional morphology and flexible word order pose challenges for existing frameworks, which often treat morphology and syntax separately, leading to inconsistencies in linguistic analysis. UniDive unifies syntactic and morphological annotations by preserving syntactic dependencies while incorporating UniMorph-derived features, improving consistency in annotation. We construct an integrated dataset and apply it to dependency parsing, demonstrating that enriched morphosyntactic features enhance parsing accuracy, particularly in distinguishing grammatical relations influenced by morphology. Our experiments, conducted with both encoder-only and decoder-only models, confirm that explicit morphological information contributes to more accurate syntactic analysis.

Efficient Multi-Instance Generation with Janus-Pro-Driven Prompt Parsing

Fan Qi,Yu Duan,Changsheng Xu

Task: Improve text-guided diffusion models with Janus-Pro-driven prompt parsing and the MIGLoRA module so that they can generate complex multi-object scenes.

Motivation: Existing text-guided diffusion models suffer from imprecise spatial grounding and limited scalability when generating complex multi-object scenes.

Details

Method: Proposes a Janus-Pro-driven prompt parsing module and MIGLoRA, a parameter-efficient plug-in that integrates LoRA into UNet and DiT backbones. Result: The method achieves state-of-the-art performance on the COCO and LVIS benchmarks while maintaining parameter efficiency and high layout fidelity. Conclusion: The method substantially improves layout fidelity and scalability for complex scene generation. Abstract: Recent advances in text-guided diffusion models have revolutionized conditional image generation, yet they struggle to synthesize complex scenes with multiple objects due to imprecise spatial grounding and limited scalability. We address these challenges through two key modules: 1) Janus-Pro-driven Prompt Parsing, a prompt-layout parsing module that bridges text understanding and layout generation via a compact 1B-parameter architecture, and 2) MIGLoRA, a parameter-efficient plug-in integrating Low-Rank Adaptation (LoRA) into UNet (SD1.5) and DiT (SD3) backbones. MIGLoRA is capable of preserving the base model's parameters and ensuring plug-and-play adaptability, minimizing architectural intrusion while enabling efficient fine-tuning. To support a comprehensive evaluation, we create DescripBox and DescripBox-1024, benchmarks that span diverse scenes and resolutions. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency, demonstrating superior layout fidelity and scalability for open-world synthesis.

Shared Global and Local Geometry of Language Model Embeddings

Andrew Lee,Melanie Weber,Fernanda Viégas,Martin Wattenberg

Task: Explore the common geometric structure of token embeddings in language models and its transferability across models.

Motivation: Identify representational structure shared across language models and use it to build interpretability applications.

Details

Method: Analyzes the geometry of token embeddings through global similarity and local geometric features (Locally Linear Embeddings and intrinsic dimension). Result: Token embeddings lie on a lower-dimensional manifold, and tokens with low intrinsic dimension form semantically coherent clusters; the alignment of token embeddings persists through the hidden states, which supports transferring steering vectors between models. Conclusion: Token embeddings of language models share common geometric structure that can be used for cross-model interpretability applications. Abstract: Researchers have recently suggested that models share common representations. In this work, we find that the token embeddings of language models exhibit common geometric structure. First, we find "global" similarities: token embeddings often share similar relative orientations. Next, we characterize local geometry in two ways: (1) by using Locally Linear Embeddings, and (2) by defining a simple measure for the intrinsic dimension of each token embedding. Our intrinsic dimension measure demonstrates that token embeddings lie on a lower dimensional manifold. We qualitatively show that tokens with lower intrinsic dimensions often have semantically coherent clusters, while those with higher intrinsic dimensions do not. Both characterizations allow us to find similarities in the local geometry of token embeddings. Perhaps most surprisingly, we find that alignment in token embeddings persists through the hidden states of language models, allowing us to develop an application for interpretability. Namely, we empirically demonstrate that steering vectors from one language model can be transferred to another, despite the two models having different dimensions.

HSLiNets: Evaluating Band Ordering Strategies in Hyperspectral and LiDAR Fusion

Judy X Yang,Jing Wang,Zhuanfeng Li,Chenhong Sui,Zekun Long,Jun Zhou

Task: Study how band order affects classification performance when fusing hyperspectral imaging (HSI) and LiDAR data, and propose a new fusion architecture.

Motivation: Prior studies have overlooked the importance of band order in HSI-LiDAR fusion, although experiments show that it significantly affects classification accuracy.

Details

Method: Proposes a novel fusion architecture that learns from multiple band order configurations and adaptively fuses the different spectral sequences. Result: On the Houston 2013 and Trento datasets, the proposed method outperforms existing fusion models. Conclusion: Band order is a key factor in HSI-LiDAR fusion performance, and the new method improves classification accuracy through adaptive fusion. Abstract: The integration of hyperspectral imaging (HSI) and Light Detection and Ranging (LiDAR) data provides complementary spectral and spatial information for remote sensing applications. While previous studies have explored the role of band selection and grouping in HSI classification, little attention has been given to how the spectral sequence or band order affects classification outcomes when fused with LiDAR. In this work, we systematically investigate the influence of band order on HSI-LiDAR fusion performance. Through extensive experiments, we demonstrate that band order significantly impacts classification accuracy, revealing a previously overlooked factor in fusion-based models. Motivated by this observation, we propose a novel fusion architecture that not only integrates HSI and LiDAR data but also learns from multiple band order configurations. The proposed method enhances feature representation by adaptively fusing different spectral sequences, leading to improved classification accuracy. Experimental results on the Houston 2013 and Trento datasets show that our approach outperforms state-of-the-art fusion models. Data and code are available at https://github.com/Judyxyang/HSLiNets.

EQ-Negotiator: An Emotion-Reasoning LLM Agent in Credit Dialogues

Yuhan Liu,Yunbo Long

Task: Develop an EQ-negotiator that combines emotion sensing with emotional reasoning to improve the dynamic emotional expression of LLM chatbots in credit dialogues.

Motivation: Existing chatbots rely mainly on passive empathy and lack affective reasoning, so they cannot adapt their strategy to steer the conversation when a client is persistently negative.

Details

Method: Combines emotion sensing from pre-trained language models (PLMs) with emotional reasoning based on Game Theory and Hidden Markov Models, and fine-tunes the PLMs on public emotion datasets. Result: The EQ-negotiator captures shifts in client emotions and adjusts its response tone according to the emotion decision policies, improving satisfaction with credit services. Conclusion: The EQ-negotiator gives credit agencies a tool for strengthening client relationships and shows the potential of emotional intelligence in financial negotiation. Abstract: While large language model (LLM)-based chatbots have been applied for effective engagement in credit dialogues, their capacity for dynamic emotional expression remains limited. Current agents primarily rely on passive empathy rather than affective reasoning. For instance, when faced with persistent client negativity, the agent should employ strategic emotional adaptation by expressing measured anger to discourage counterproductive behavior and guide the conversation toward resolution. This context-aware emotional modulation is essential for imitating the nuanced decision-making of human negotiators. This paper introduces an EQ-negotiator that combines emotion sensing from pre-trained language models (PLMs) with emotional reasoning based on Game Theory and Hidden Markov Models. It takes into account both the current and historical emotions of the client to better manage and address negative emotions during interactions. By fine-tuning pre-trained language models (PLMs) on public emotion datasets and validating them on the credit dialogue datasets, our approach enables LLM-based agents to effectively capture shifts in client emotions and dynamically adjust their response tone based on our emotion decision policies in real-world financial negotiations. This EQ-negotiator can also help credit agencies foster positive client relationships, enhancing satisfaction in credit services.

Rerouting Connection: Hybrid Computer Vision Analysis Reveals Visual Similarity Between Indus and Tibetan-Yi Corridor Writing Systems

Ooha Lakkadi Reddy

Task: 研究印度河谷文字与藏彝走廊象形文字系统之间的历史联系。

Motivation: 探索印度河谷文字与藏彝走廊象形文字之间的视觉形态相似性，挑战传统孤立文字发展的观点。

Details

Method: 采用混合CNN-Transformer架构和人类学框架，通过15个独立训练模型对三种目标文字进行集成分析。 Result: 藏彝走廊文字与印度河谷文字的视觉相似性（61.7%-63.5%）显著高于与青铜时代原始楔形文字（10.2%-10.9%）或原始埃兰文字（7.6%-8.7%）的相似性。 Conclusion: 研究结果表明古代南亚与东亚之间存在复杂的文化传播网络，挑战了传统孤立发展的观点。 Abstract: This thesis employs a hybrid CNN-Transformer architecture, in conjunction with a detailed anthropological framework, to investigate potential historical connections between the visual morphology of the Indus Valley script and pictographic systems of the Tibetan-Yi Corridor. Through an ensemble methodology of three target scripts across 15 independently trained models, we demonstrate that Tibetan-Yi Corridor scripts exhibit approximately six-fold higher visual similarity to the Indus script (61.7%-63.5%) than to the Bronze Age Proto-Cuneiform (10.2%-10.9%) or Proto-Elamite (7.6%-8.7%) systems. Additionally and contrarily to our current understanding of the networks of the Indus Valley Civilization, the Indus script unexpectedly maps closer to Tibetan-Yi Corridor scripts, with a mean cosine similarity of 0.629, than to the aforementioned contemporaneous West Asian signaries, both of which recorded mean cosine similarities of 0.104 and 0.080 despite their close geographic proximity and evident trade relations. Across various dimensionality reduction practices and clustering methodologies, the Indus script consistently clusters closest to Tibetan-Yi Corridor scripts. Our computational results align with qualitative observations of specific pictorial parallels in numeral systems, gender markers, and key iconographic elements; this is further supported by archaeological evidence of sustained contact networks along the ancient Shu-Shendu road in tandem with the Indus Valley Civilization's decline, providing a plausible transmission pathway. 
While alternative explanations cannot be ruled out, the specificity and consistency of observed similarities challenge conventional narratives of isolated script development and suggest more complex ancient cultural transmission networks between South and East Asia than previously recognized.
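
The mean cosine similarities quoted in the abstract can be illustrated with a minimal sketch. It assumes each script sign has already been embedded as a numeric vector; the function names and the toy vectors are hypothetical and are not taken from the thesis.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(set_a, set_b):
    """Average cosine similarity over every (a, b) pair of embeddings."""
    sims = [cosine_similarity(u, v) for u in set_a for v in set_b]
    return sum(sims) / len(sims)


# Toy 2-D "embeddings" (illustrative only, not real script features).
script_a = [[1.0, 0.0]]
script_b = [[1.0, 0.0], [0.0, 1.0]]
score = mean_pairwise_cosine(script_a, script_b)
```

A score near 1 means the two sets of embeddings point in similar directions on average, while a score near 0 means they are close to orthogonal.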

ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging

Haoming Xu,Shuxun Wang,Yanqiu Zhao,Yi Zhong,Ziyan Jiang,Ningyuan Zhao,Shumin Deng,Huajun Chen,Ningyu Zhang

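Task: Selectively erase sensitive knowledge from large language models.

Motivation: Address the over-forgetting and under-forgetting that can occur when removing sensitive content from large language models.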

Details

Method: Uses model merging (specifically TIES-Merging) to combine two specialized models into a more balanced unlearned model. Result: Ranked second among 26 teams, with a Task Aggregate score of 0.944 and an overall Aggregate score of 0.487. Conclusion: Current evaluation metrics such as MIA and ROUGE are insufficient to fully assess unlearning; future work needs more comprehensive evaluation methods and a rethinking of unlearning objectives. Abstract: This paper presents the ZJUKLAB team's submission for SemEval-2025 Task 4: Unlearning Sensitive Content from Large Language Models. This task aims to selectively erase sensitive knowledge from large language models, avoiding both over-forgetting and under-forgetting issues. We propose an unlearning system that leverages Model Merging (specifically TIES-Merging), combining two specialized models into a more balanced unlearned model. Our system achieves competitive results, ranking second among 26 teams, with an online score of 0.944 for Task Aggregate and 0.487 for overall Aggregate. In this paper, we also conduct local experiments and perform a comprehensive analysis of the unlearning process, examining performance trajectories, loss dynamics, and weight perspectives, along with several supplementary experiments, to understand the effectiveness of our method. Furthermore, we analyze the shortcomings of our method and evaluation metrics, emphasizing that MIA scores and ROUGE-based metrics alone are insufficient to fully evaluate successful unlearning. Finally, we emphasize the need for more comprehensive evaluation methodologies and rethinking of unlearning objectives in future research. Code is available at https://github.com/zjunlp/unlearn/tree/main/semeval25.
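
As a rough illustration of the merging step, the sketch below follows the general TIES-Merging recipe (trim small-magnitude updates, elect a sign per parameter, average only the agreeing updates) on flat parameter lists. It is not the team's implementation; the function names, the `density` and `scale` parameters, and the toy weights are assumptions made for the example.

```python
def trim(task_vector, density):
    """Keep the largest-magnitude entries and zero out the rest.

    `density` is the fraction of entries to keep; ties at the cutoff
    may keep slightly more than that fraction.
    """
    k = max(1, int(round(density * len(task_vector))))
    threshold = sorted((abs(x) for x in task_vector), reverse=True)[k - 1]
    return [x if abs(x) >= threshold else 0.0 for x in task_vector]


def ties_merge(base, finetuned_models, density=0.2, scale=1.0):
    """Merge fine-tuned models into one set of weights, TIES-style."""
    # Task vector = fine-tuned weights minus base weights.
    task_vectors = [[w - b for w, b in zip(model, base)]
                    for model in finetuned_models]
    trimmed = [trim(tv, density) for tv in task_vectors]

    merged = []
    for i, b in enumerate(base):
        values = [tv[i] for tv in trimmed]
        total = sum(values)
        # Elect the sign carrying the larger total magnitude.
        sign = 1.0 if total > 0 else (-1.0 if total < 0 else 0.0)
        # Disjoint merge: average only the updates that agree with it.
        agreeing = [v for v in values if v * sign > 0]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + scale * delta)
    return merged


# Toy example with two hypothetical specialized models.
base = [0.0, 0.0, 0.0, 0.0]
model_a = [1.0, -2.0, 0.1, 0.0]
model_b = [3.0, 1.0, 0.0, -0.2]
merged = ties_merge(base, [model_a, model_b], density=0.5)
```

In the toy example the second parameter shows the point of sign election: the two models disagree in direction, so only the update on the winning side is kept instead of letting the two partly cancel.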

KAC: Kolmogorov-Arnold Classifier for Continual Learning

Yusong Hu,Zichen Liang,Fei Yang,Qibin Hou,Xialei Liu,Ming-Ming Cheng

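Task: Explore the potential of KAC, a new classifier based on Kolmogorov-Arnold Networks (KAN), for continual learning.

Motivation: Existing linear classifiers struggle to maintain a stable classification space during continual learning, whereas KAN has shown stability on simple continual regression tasks.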

Details

Method: Proposes the Kolmogorov-Arnold Classifier (KAC), which combines the KAN structure with Radial Basis Functions (RBF) for better compatibility with continual learning. Result: Replacing linear classifiers with KAC improves performance across multiple continual learning benchmarks. Conclusion: KAC is effective and robust for continual learning. Abstract: Continual learning requires models to train continuously across consecutive tasks without forgetting. Most existing methods utilize linear classifiers, which struggle to maintain a stable classification space while learning new tasks. Inspired by the success of Kolmogorov-Arnold Networks (KAN) in preserving learning stability during simple continual regression tasks, we set out to explore their potential in more complex continual learning scenarios. In this paper, we introduce the Kolmogorov-Arnold Classifier (KAC), a novel classifier developed for continual learning based on the KAN structure. We delve into the impact of KAN's spline functions and introduce Radial Basis Functions (RBF) for improved compatibility with continual learning. We replace linear classifiers with KAC in several recent approaches and conduct experiments across various continual learning benchmarks, all of which demonstrate performance improvements, highlighting the effectiveness and robustness of KAC in continual learning. The code is available at https://github.com/Ethanhuhuhu/KAC.
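
One way to picture a KAN-style classifier with RBFs is sketched below: each (class, feature) edge carries its own learnable univariate function, written as a weighted sum of Gaussian basis functions, and a class logit is the sum of its edge outputs. The shapes, names, and the shared `centers` and `width` are assumptions for illustration; the released KAC code may differ.

```python
import math


def rbf_features(x, centers, width):
    """Gaussian basis responses of a scalar input at each center."""
    return [math.exp(-((x - c) / width) ** 2) for c in centers]


def kan_rbf_logits(features, coeffs, centers, width):
    """Compute class logits from a feature vector.

    coeffs[j][i][k] is the (hypothetical) learnable weight of basis
    function k on the edge from input feature i to class j.
    """
    logits = []
    for class_coeffs in coeffs:
        total = 0.0
        for x, edge_coeffs in zip(features, class_coeffs):
            basis = rbf_features(x, centers, width)
            # Edge function: weighted sum of RBF responses for this input.
            total += sum(w * b for w, b in zip(edge_coeffs, basis))
        logits.append(total)
    return logits


# Toy setup: one input feature, two classes, two RBF centers.
centers = [0.0, 1.0]
coeffs = [
    [[1.0, 0.0]],  # class 0 responds to inputs near 0.0
    [[0.0, 1.0]],  # class 1 responds to inputs near 1.0
]
logits = kan_rbf_logits([0.0], coeffs, centers, width=1.0)
```

Because each Gaussian is only active near its center, updating the weights for one region of the input space leaves the response elsewhere largely unchanged, which is the kind of locality the abstract associates with stability across tasks.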

Function Alignment: A New Theory for Mind and Intelligence, Part I: Foundations

Gus G. Xia

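Task: Propose function alignment, a novel theory of mind and intelligence that aims to explain how meaning, interpretation, and analogy emerge from interactions among layered representations.

Motivation: Provide a unified explanation for fragmented concepts in cognitive science (such as bounded rationality, symbol grounding, and analogy-making) and connect computational architecture, psychological theory, and contemplative traditions such as Zen.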

Details

Method: Explicitly models the interactions among layered representations through the theory of function alignment, forming a coherent framework. Result: Function alignment can not only model minds but also serve as a blueprint for building them, and it yields bounded interpretability as a key theoretical insight. Conclusion: Function alignment offers a structural foundation for understanding the mind without relying on any philosophical system, supporting reconstruction from multiple disciplines. Abstract: This paper introduces function alignment, a novel theory of mind and intelligence that is both intuitively compelling and structurally grounded. It explicitly models how meaning, interpretation, and analogy emerge from interactions among layered representations, forming a coherent framework capable not only of modeling minds but also of serving as a blueprint for building them. One of the key theoretical insights derived from function alignment is bounded interpretability, which provides a unified explanation for previously fragmented ideas in cognitive science, such as bounded rationality, symbol grounding, and analogy-making. Beyond modeling, the function alignment framework bridges disciplines often kept apart, linking computational architecture, psychological theory, and even contemplative traditions such as Zen. Rather than building on any philosophical systems, it offers a structural foundation upon which multiple ways of understanding the mind may be reconstructed.

Can Video Diffusion Model Reconstruct 4D Geometry?

Jinjie Mai,Wenxuan Zhu,Haozhe Liu,Bing Li,Cheng Zheng,Jürgen Schmidhuber,Bernard Ghanem

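Task: Reconstruct dynamic 3D scenes (i.e., 4D geometry) from monocular video.

Motivation: Conventional multiview geometry methods struggle with dynamic motion, while learning-based methods require specialized 4D representations or sophisticated optimization.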

Details

Method: Proposes Sora3R, a framework that exploits the spatiotemporal priors of large-scale video diffusion models to infer 4D pointmaps directly through a two-stage pipeline: (1) adapt a pointmap VAE from a pretrained video VAE; (2) finetune a diffusion model in the combined video and pointmap latent space. Result: Without external modules or iterative global alignment, Sora3R reliably recovers camera poses and detailed scene geometry, performing on par with state-of-the-art dynamic 4D reconstruction methods. Conclusion: Sora3R provides an efficient, fully feedforward approach to dynamic 4D reconstruction. Abstract: Reconstructing dynamic 3D scenes (i.e., 4D geometry) from monocular video is an important yet challenging problem. Conventional multiview geometry-based approaches often struggle with dynamic motion, whereas recent learning-based methods either require specialized 4D representation or sophisticated optimization. In this paper, we present Sora3R, a novel framework that taps into the rich spatiotemporal priors of large-scale video diffusion models to directly infer 4D pointmaps from casual videos. Sora3R follows a two-stage pipeline: (1) we adapt a pointmap VAE from a pretrained video VAE, ensuring compatibility between the geometry and video latent spaces; (2) we finetune a diffusion backbone in combined video and pointmap latent space to generate coherent 4D pointmaps for every frame. Sora3R operates in a fully feedforward manner, requiring no external modules (e.g., depth, optical flow, or segmentation) or iterative global alignment. Extensive experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction across diverse scenarios.

Leveraging Large Language Models for Risk Assessment in Hyperconnected Logistic Hub Network Deployment

Yinzhu Quan,Yujia Xu,Guanlin Chen,Frederick Benaben,Benoit Montreuil

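Task: Design a large language model (LLM)-based risk assessment framework for evaluating deployment risks in hyperconnected logistic hub networks.

Motivation: Global supply chains place growing emphasis on energy efficiency and environmental sustainability, traditional methods struggle to handle unstructured information, and dynamic risk assessment has become essential in VUCA environments.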

Details

Method: Combines multiple analytical tools, uses LLMs to analyze unstructured data (such as geopolitical instability, financial trends, and historical storm events), and guides the LLMs through prompt design to assess risk types and levels. Result: LLMs systematically identify potential risks and cluster logistic hubs with similar risk profiles, supporting a data-driven decision-making process. Conclusion: The framework offers scalability with long-term memory, enhances decision-making through explanation and interpretation, and provides comprehensive risk assessment for hyperconnected supply chain networks. Abstract: The growing emphasis on energy efficiency and environmental sustainability in global supply chains introduces new challenges in the deployment of hyperconnected logistic hub networks. In current volatile, uncertain, complex, and ambiguous (VUCA) environments, dynamic risk assessment becomes essential to ensure successful hub deployment. However, traditional methods often struggle to effectively capture and analyze unstructured information. In this paper, we design a Large Language Model (LLM)-driven risk assessment pipeline integrated with multiple analytical tools to evaluate logistic hub deployment. This framework enables LLMs to systematically identify potential risks by analyzing unstructured data, such as geopolitical instability, financial trends, historical storm events, traffic conditions, and emerging risks from news sources. These data are processed through a suite of analytical tools, which are automatically called by LLMs to support a structured and data-driven decision-making process for logistic hub selection. In addition, we design prompts that instruct LLMs to leverage these tools for assessing the feasibility of hub selection by evaluating various risk types and levels. Through risk-based similarity analysis, LLMs cluster logistic hubs with comparable risk profiles, enabling a structured approach to risk assessment. In conclusion, the framework incorporates scalability with long-term memory and enhances decision-making through explanation and interpretation, enabling comprehensive risk assessments for logistic hub deployment in hyperconnected supply chain networks.

Learning Class Prototypes for Unified Sparse Supervised 3D Object Detection

Yun Zhu,Le Hui,Hang Yang,Jianjun Qian,Jin Xie,Jian Yang

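Task: Propose a unified sparse supervised 3D object detection method for both indoor and outdoor scenes.

Motivation: Existing sparse supervised 3D object detection methods focus only on outdoor scenes and neglect indoor settings.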

Details

Method: Exploits unlabeled objects by learning class prototypes, with a prototype-based object mining module and a multi-label cooperative refinement module. Result: On ScanNet V2, SUN RGB-D, and KITTI, the method reaches about 78%, 90%, and 96% of the fully supervised detector's performance, respectively. Conclusion: The method performs well under sparse supervision and scales well. Abstract: Both indoor and outdoor scene perceptions are essential for embodied intelligence. However, current sparse supervised 3D object detection methods focus solely on outdoor scenes without considering indoor settings. To this end, we propose a unified sparse supervised 3D object detection method for both indoor and outdoor scenes through learning class prototypes to effectively utilize unlabeled objects. Specifically, we first propose a prototype-based object mining module that converts the unlabeled object mining into a matching problem between class prototypes and unlabeled features. By using optimal transport matching results, we assign prototype labels to high-confidence features, thereby achieving the mining of unlabeled objects. We then present a multi-label cooperative refinement module to effectively recover missed detections through pseudo label quality control and prototype label cooperation. Experiments show that our method achieves state-of-the-art performance under the one object per scene sparse supervised setting across indoor and outdoor datasets. With only one labeled object per scene, our method achieves about 78%, 90%, and 96% performance compared to the fully supervised detector on ScanNet V2, SUN RGB-D, and KITTI, respectively, highlighting the scalability of our method. Code is available at https://github.com/zyrant/CPDet3D.

Collaborative Evolution: Multi-Round Learning Between Large and Small Language Models for Emergent Fake News Detection

Ziyi Zhou,Xiaoming Zhang,Shenghan Tan,Litian Zhang,Chaozhuo Li

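Task: Propose Multi-Round Collaboration Detection (MRCD), a new framework for more effective detection of fake news on social media.

Motivation: Conventional small language models (SLMs) need extensive supervised training and adapt poorly to rapidly changing environments, while large language models (LLMs) underperform at fake news detection because they lack relevant demonstrations and up-to-date knowledge.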

Details

Method: MRCD combines the generalization ability of LLMs with the specialized functionality of SLMs, uses a two-stage retrieval module to select relevant and up-to-date demonstrations and knowledge, and applies a multi-round learning framework to make detection results more reliable. Result: MRCD achieves SOTA results on the Pheme and Twitter16 datasets, improving accuracy by 7.4% and 12.8%, respectively, over using SLMs alone. Conclusion: MRCD addresses the limitations of current models and improves the detection of emergent fake news. Abstract: The proliferation of fake news on social media platforms has exerted a substantial influence on society, leading to discernible impacts and deleterious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from the necessity for extensive supervised training and the challenge of adapting to rapidly evolving circumstances. Large language models (LLMs), despite their robust zero-shot capabilities, have fallen short in effectively identifying fake news due to a lack of pertinent demonstrations and the dynamic nature of knowledge. In this paper, a novel framework, Multi-Round Collaboration Detection (MRCD), is proposed to address these limitations. The MRCD framework is capable of enjoying the merits from both LLMs and SLMs by integrating their generalization abilities and specialized functionalities, respectively. Our approach features a two-stage retrieval module that selects relevant and up-to-date demonstrations and knowledge, enhancing in-context learning for better detection of emerging news events. We further design a multi-round learning framework to ensure more reliable detection results. Our framework MRCD achieves SOTA results on two real-world datasets, Pheme and Twitter16, with accuracy improvements of 7.4% and 12.8% compared to using only SLMs, which effectively addresses the limitations of current models and improves the detection of emergent fake news.

StyledStreets: Multi-style Street Simulator with Spatial and Temporal Consistency

Yuyin Chen,Yida Wang,Xueyang Zhang,Kun Zhan,Peng Jia,Yifei Zhan,Xianpeng Lang

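Task: Propose StyledStreets, a multi-style street simulator that supports instruction-driven scene editing with spatial and temporal consistency.

Motivation: Urban scene reconstruction must model both static infrastructure and dynamic elements while supporting diverse environmental conditions.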

Details

Method: Builds on a Gaussian Splatting framework with pose optimization and multi-view training, and achieves style transfer through a hybrid embedding scheme, uncertainty-aware rendering, and a unified parametric model. Result: Achieves photorealistic style transfer across seasons, weather conditions, and camera setups while preserving geometric accuracy and multi-view consistency. Conclusion: The approach brings new capabilities to urban simulation, with applications in autonomous vehicle testing and augmented reality systems. Abstract: Urban scene reconstruction requires modeling both static infrastructure and dynamic elements while supporting diverse environmental conditions. We present StyledStreets, a multi-style street simulator that achieves instruction-driven scene editing with guaranteed spatial and temporal consistency. Building on a state-of-the-art Gaussian Splatting framework for street scenarios enhanced by our proposed pose optimization and multi-view training, our method enables photorealistic style transfers across seasons, weather conditions, and camera setups through three key innovations: First, a hybrid embedding scheme disentangles persistent scene geometry from transient style attributes, allowing realistic environmental edits while preserving structural integrity. Second, uncertainty-aware rendering mitigates supervision noise from diffusion priors, enabling robust training across extreme style variations. Third, a unified parametric model prevents geometric drift through regularized updates, maintaining multi-view consistency across seven vehicle-mounted cameras. Our framework preserves the original scene's motion patterns and geometric relationships. Qualitative results demonstrate plausible transitions between diverse conditions (snow, sandstorm, night), while quantitative evaluations show state-of-the-art geometric accuracy under style transfers. The approach establishes new capabilities for urban simulation, with applications in autonomous vehicle testing and augmented reality systems requiring reliable environmental consistency. Codes will be publicly available upon publication.

UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning

Hongxuan Tang,Hao Liu,Xinyan Xiao

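Task: Propose UGen, a unified autoregressive multimodal model that handles text processing, image understanding, and image generation simultaneously.

Motivation: Address the challenges of unified multimodal learning and improve performance across multiple tasks.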

Details

Method: Uses progressive vocabulary learning to incrementally activate and integrate visual token IDs, and a single transformer to generate text and images uniformly in an autoregressive manner. Result: On comprehensive text and image tasks, UGen improves overall performance by 13.3% over the vanilla unified autoregressive method and is competitive with task-specific models on all tasks. Conclusion: UGen demonstrates the effectiveness of unified multimodal learning and delivers significant gains across tasks. Abstract: We introduce UGen, a unified autoregressive multimodal model that demonstrates strong performance across text processing, image understanding, and image generation tasks simultaneously. UGen converts both texts and images into discrete token sequences and utilizes a single transformer to generate them uniformly in an autoregressive manner. To address the challenges associated with unified multimodal learning, UGen is trained using a novel mechanism, namely progressive vocabulary learning. In this process, visual token IDs are incrementally activated and integrated into the training phase, ultimately enhancing the effectiveness of unified multimodal learning. Experiments on comprehensive text and image tasks show that UGen achieves a significant overall performance improvement of 13.3% compared to the vanilla unified autoregressive method, and it also delivers competitive results across all tasks against several task-specific models.
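
The abstract does not specify the activation schedule, so the following is only a hypothetical sketch of what incrementally activating visual token IDs could look like: a linear schedule that exposes a growing slice of the visual vocabulary as training proceeds. The function name, the linear schedule, and the ID layout (visual IDs placed after text IDs) are all assumptions.

```python
def active_visual_ids(step, total_steps, text_vocab_size, visual_vocab_size):
    """Return the visual token IDs that are usable at a training step.

    Assumes a linear schedule and a shared vocabulary in which visual
    IDs occupy [text_vocab_size, text_vocab_size + visual_vocab_size).
    """
    fraction = min(1.0, max(0.0, step / total_steps))
    n_active = int(fraction * visual_vocab_size)
    return range(text_vocab_size, text_vocab_size + n_active)


# Toy sizes: 1000 text tokens, 200 visual tokens, 100 training steps.
at_start = active_visual_ids(0, 100, 1000, 200)
halfway = active_visual_ids(50, 100, 1000, 200)
at_end = active_visual_ids(100, 100, 1000, 200)
```

In a training loop, IDs outside the active range would be kept out of the inputs and targets (for example by masking their logits) until the schedule reaches them.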

One Snapshot is All You Need: A Generalized Method for mmWave Signal Generation

Teng Huang,Han Ding,Wenxin Sun,Cui Zhao,Ge Wang,Fei Wang,Kun Zhao,Zhi Wang,Wei Xi

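Task: Propose mmGen, a framework for full-scene mmWave signal generation.

Motivation: Existing mmWave datasets are scarce and inconsistently formatted, which limits the development of diverse applications.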

Details

Method: Constructs physical signal transmission models to synthesize human-reflected and environment-reflected mmWave signals from 3D meshes, accounting for material properties, antenna gains, and multipath reflections. Result: The average similarity between synthesized and real-captured signals exceeds 0.91 for Range-Angle signatures and 0.89 for micro-Doppler signatures. Conclusion: mmGen is effective and practically applicable. Abstract: Wireless sensing systems, particularly those using mmWave technology, offer distinct advantages over traditional vision-based approaches, such as enhanced privacy and effectiveness in poor lighting conditions. These systems, leveraging FMCW signals, have shown success in human-centric applications such as localization and gesture recognition. However, comprehensive mmWave datasets for diverse applications are scarce, often constrained by pre-processed signatures (e.g., point clouds or RA heatmaps) and inconsistent annotation formats. To overcome these limitations, we propose mmGen, a novel and generalized framework tailored for full-scene mmWave signal generation. By constructing physical signal transmission models, mmGen synthesizes human-reflected and environment-reflected mmWave signals from the constructed 3D meshes. Additionally, we incorporate methods to account for material properties, antenna gains, and multipath reflections, enhancing the realism of the synthesized signals. We conduct extensive experiments using a prototype system with commercial mmWave devices and Kinect sensors. The results show that the average similarity of Range-Angle and micro-Doppler signatures between the synthesized and real-captured signals across three different environments exceeds 0.91 and 0.89, respectively, demonstrating the effectiveness and practical applicability of mmGen.

LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models

Hengyuan Zhao,Ziqin Wang,Qixin Sun,Kaiyou Song,Yilin Li,Xiaolin Hu,Qingpei Guo,Si Liu

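Task: Propose LLaVA-CMoE, an innovative framework for continual learning in large language models.

Motivation: Address two major challenges in continual learning: parameter expansion that makes models excessively large, and router parameter updates that cause forgetting of prior knowledge.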

Details

Method: Uses Probe-Guided Knowledge Extension (PGKE) to assess whether additional knowledge is needed, and introduces Probabilistic Task Locator (PTL), a hierarchical routing algorithm. Result: Substantially improves model performance on the Coin benchmark while keeping a reasonable parameter count. Conclusion: LLaVA-CMoE addresses the challenges of continual learning and improves model efficiency and performance. Abstract: Although applying Mixture of Experts to large language models for learning new tasks is widely regarded as an effective strategy for continuous learning, there still remain two major challenges: (1) As the number of tasks grows, simple parameter expansion strategies can lead to excessively large models. (2) Modifying the parameters of the existing router results in the erosion of previously acquired knowledge. In this paper, we present an innovative framework named LLaVA-CMoE, which is a continuous Mixture of Experts (MoE) architecture without any replay data. Specifically, we have developed a method called Probe-Guided Knowledge Extension (PGKE), which employs probe experts to assess whether additional knowledge is required for a specific layer. This approach enables the model to adaptively expand its network parameters based on task distribution, thereby significantly improving the efficiency of parameter expansion. Additionally, we introduce a hierarchical routing algorithm called Probabilistic Task Locator (PTL), where high-level routing captures inter-task information and low-level routing focuses on intra-task details, ensuring that new task experts do not interfere with existing ones. Our experiments show that our efficient architecture has substantially improved model performance on the Coin benchmark while maintaining a reasonable parameter count.

AdaMHF: Adaptive Multimodal Hierarchical Fusion for Survival Prediction

Shuaiyu Zhang,Xun Lin,Rongxiang Zhang,Yu Bai,Yong Xu,Tao Tan,Xunbin Zheng,Zitong Yu

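Task: Propose AdaMHF, a framework for multimodal fusion of pathology images and genomic data for survival analysis.

Motivation: Current methods ignore biological characteristics such as heterogeneity and sparsity, limiting their adaptability to clinical practice.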

Details

Method: AdaMHF extracts features with an experts expansion and residual structure, refines them through selection and aggregation, and then fuses them hierarchically. Result: Experiments on TCGA datasets show that AdaMHF outperforms existing methods in both complete and incomplete modality settings. Conclusion: AdaMHF extracts and fuses features efficiently and comprehensively, adapting to challenging scenarios in clinical practice. Abstract: The integration of pathologic images and genomic data for survival analysis has gained increasing attention with advances in multimodal learning. However, current methods often ignore biological characteristics, such as heterogeneity and sparsity, both within and across modalities, ultimately limiting their adaptability to clinical practice. To address these challenges, we propose AdaMHF: Adaptive Multimodal Hierarchical Fusion, a framework designed for efficient, comprehensive, and tailored feature extraction and fusion. AdaMHF is specifically adapted to the uniqueness of medical data, enabling accurate predictions with minimal resource consumption, even under challenging scenarios with missing modalities. Initially, AdaMHF employs an experts expansion and residual structure to activate specialized experts for extracting heterogeneous and sparse features. Extracted tokens undergo refinement via selection and aggregation, reducing the weight of non-dominant features while preserving comprehensive information. Subsequently, the encoded features are hierarchically fused, allowing multi-grained interactions across modalities to be captured. Furthermore, we introduce a survival prediction benchmark designed to resolve scenarios with missing modalities, mirroring real-world clinical conditions. Extensive experiments on TCGA datasets demonstrate that AdaMHF surpasses current state-of-the-art (SOTA) methods, showcasing exceptional performance in both complete and incomplete modality settings.

ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

Yujie Liu,Zonglin Yang,Tong Xie,Jinjie Ni,Ben Gao,Yuqiang Li,Shixiang Tang,Wanli Ouyang,Erik Cambria,Dongzhan Zhou

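Task: Evaluate large language models (LLMs) on scientific discovery tasks, particularly hypothesis generation.

Motivation: Fill the gap left by the lack of a dedicated benchmark for assessing LLMs in scientific discovery.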

Details

Method: Develops an automated framework that extracts critical components from papers across 12 disciplines, and introduces a benchmark covering inspiration retrieval, hypothesis composition, and hypothesis ranking. Result: LLMs perform well at inspiration retrieval, indicating an ability to surface novel knowledge associations. Conclusion: LLMs can act as "research hypothesis mines", generating innovative hypotheses at scale and advancing automated scientific discovery. Abstract: Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery: inspiration retrieval, hypothesis composition, and hypothesis ranking. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on papers published in 2024, ensuring minimal overlap with LLM pretraining data. Our evaluation reveals that LLMs perform well in retrieving inspirations, an out-of-distribution task, suggesting their ability to surface novel knowledge associations. This positions LLMs as "research hypothesis mines", capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.

Omni-AD: Learning to Reconstruct Global and Local Features for Multi-class Anomaly Detection

Jiajie Quan,Ao Tong,Yuxuan Cai,Xinwei He,Yulong Wang,Yang Zhou

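Task: Address the learning shortcut problem of reconstruction-based methods in multi-class unsupervised anomaly detection (MUAD).

Motivation: Existing methods easily fall into the learning shortcut problem when reconstructing input images, failing to distinguish normal from anomalous samples.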

Details

Method: Proposes a two-branch decoder block (Omni-block) that learns global and local features separately, and stacks these blocks to build the Omni-AD framework. Result: Outperforms existing methods on public anomaly detection benchmarks. Conclusion: By learning global and local features, Omni-AD mitigates the learning shortcut problem and improves anomaly detection performance. Abstract: In multi-class unsupervised anomaly detection (MUAD), reconstruction-based methods learn to map input images to normal patterns to identify anomalous pixels. However, this strategy easily falls into the well-known "learning shortcut" issue when decoders fail to capture normal patterns and reconstruct both normal and abnormal samples naively. To address that, we propose to learn the input features in global and local manners, forcing the network to memorize the normal patterns more comprehensively. Specifically, we design a two-branch decoder block, named Omni-block. One branch corresponds to global feature learning, where we serialize two self-attention blocks but replace the query and (key, value) with learnable tokens, respectively, thus capturing global features of normal patterns concisely and thoroughly. The local branch comprises depth-separable convolutions, whose locality enables effective and efficient learning of local features for normal patterns. By stacking Omni-blocks, we build a framework, Omni-AD, to learn normal patterns of different granularity and reconstruct them progressively. Comprehensive experiments on public anomaly detection benchmarks show that our method outperforms state-of-the-art approaches in MUAD. Code is available at https://github.com/easyoo/Omni-AD.git.

Cultivating Game Sense for Yourself: Making VLMs Gaming Experts

Wenxuan Lu,Jiangyang He,Zhanqiu Zhang,Yiwen Guo,Tianning Zang

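Task: Develop an agent capable of fluid gameplay in first/third-person games without API access.

Motivation: Current approaches that use vision language models (VLMs) as direct controllers are inefficient and cannot handle tasks requiring high reactivity or dynamic adaptability.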

Details

Method: Proposes a new paradigm for gameplay agent design in which the VLM develops specialized execution modules (such as shooting and combat modules) that handle real-time game interaction, while the VLM acts as a high-level developer. Result: The proposed GameSense framework is the first to achieve fluent gameplay across diverse genres, including ACT, FPS, and Flappy Bird. Conclusion: GameSense sets a new benchmark for game-playing agents and performs efficiently across diverse game tasks. Abstract: Developing agents capable of fluid gameplay in first/third-person games without API access remains a critical challenge in Artificial General Intelligence (AGI). Recent efforts leverage Vision Language Models (VLMs) as direct controllers, frequently pausing the game to analyze screens and plan action through language reasoning. However, this inefficient paradigm fundamentally restricts agents to basic and non-fluent interactions: relying on isolated VLM reasoning for each action makes it impossible to handle tasks requiring high reactivity (e.g., FPS shooting) or dynamic adaptability (e.g., ACT combat). To handle this, we propose a paradigm shift in gameplay agent design: instead of directly controlling gameplay, VLM develops specialized execution modules tailored for tasks like shooting and combat. These modules handle real-time game interactions, elevating VLM to a high-level developer. Building upon this paradigm, we introduce GameSense, a gameplay agent framework where VLM develops task-specific game sense modules by observing task execution and leveraging vision tools and neural network training pipelines. These modules encapsulate action-feedback logic, ranging from direct action rules to neural network-based decisions. Experiments demonstrate that our framework is the first to achieve fluent gameplay in diverse genres, including ACT, FPS, and Flappy Bird, setting a new benchmark for game-playing agents.

Recurrent Feature Mining and Keypoint Mixup Padding for Category-Agnostic Pose Estimation

Junjie Chen,Weilong Chen,Yifan Zuo,Yuming Fang

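Task: Propose a novel framework for category-agnostic pose estimation that recurrently mines fine-grained and structure-aware (FGSA) features.

Motivation: Existing methods neglect mining fine-grained and structure-aware features from support and query images, although these features are crucial for pixel-level keypoint localization.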

Details

Method: Designs an FGSA mining module based on deformable attention, mining fine-grained features over multi-scale feature maps and structure-aware features by offsetting keypoint reference points to their linked keypoints. Result: Substantially outperforms existing methods on the large-scale MP-100 dataset (+3.2% PCK@0.05). Conclusion: By recurrently mining FGSA features and padding with mixup keypoints, the framework significantly improves category-agnostic pose estimation. Abstract: Category-agnostic pose estimation aims to locate keypoints on query images according to a few annotated support images for arbitrary novel classes. Existing methods generally extract support features via heatmap pooling, and obtain interacted features from support and query via cross-attention. Hence, these works neglect to mine fine-grained and structure-aware (FGSA) features from both support and query images, which are crucial for pixel-level keypoint localization. To this end, we propose a novel yet concise framework, which recurrently mines FGSA features from both support and query images. Specifically, we design an FGSA mining module based on a deformable attention mechanism. On the one hand, we mine fine-grained features by applying deformable attention heads over multi-scale feature maps. On the other hand, we mine structure-aware features by offsetting the reference points of keypoints to their linked keypoints. By means of the above module, we recurrently mine FGSA features from support and query images, and thus obtain better support features and query estimations. In addition, we propose to use mixup keypoints to pad various classes to a unified keypoint number, which could provide richer supervision than the zero padding used in existing works. We conduct extensive experiments and in-depth studies on the large-scale MP-100 dataset, and outperform the SOTA method dramatically (+3.2% PCK@0.05). Code is available at https://github.com/chenbys/FMMP.
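
For readers unfamiliar with the PCK@0.05 metric, a minimal sketch follows. It assumes the common convention that a predicted keypoint counts as correct when its distance to the ground truth is at most 0.05 times a reference size such as the longest bounding-box side; the function name and the choice of reference size are assumptions, not details from the paper.

```python
import math


def pck(pred, gt, ref_size, threshold=0.05):
    """Fraction of keypoints within threshold * ref_size of ground truth.

    pred and gt are equal-length lists of (x, y) pixel coordinates.
    """
    max_dist = threshold * ref_size
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= max_dist
    )
    return correct / len(gt)


# Toy example: a 100-pixel object, so the tolerance is 5 pixels.
predicted = [(0.0, 0.0), (10.0, 10.0)]
ground_truth = [(0.0, 3.0), (10.0, 20.0)]
score = pck(predicted, ground_truth, ref_size=100.0)
```

The first keypoint is 3 pixels off and counts as correct, while the second is 10 pixels off and does not, so half of the keypoints pass at this threshold.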

R-PRM: Reasoning-Driven Process Reward Modeling

Shuaijie She,Junxiao Liu,Yifeng Liu,Jiajun Chen,Xin Huang,Shujian Huang

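Task: Propose R-PRM, a method that improves how process reward models (PRMs) evaluate step-by-step mathematical reasoning.

Motivation: Existing PRMs output evaluation scores directly, which limits learning efficiency and evaluation accuracy, and annotated data is scarce.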

Details

Method: Uses stronger LLMs to generate seed data, enhances performance through preference optimization, and introduces inference-time scaling. Result: Surpasses baselines by 11.9 and 8.5 F1 points on ProcessBench and PRMBench, respectively, and improves accuracy by over 8.5 points across six datasets. Conclusion: R-PRM shows more comprehensive evaluation and stronger generalization, indicating significant potential. Abstract: Large language models (LLMs) inevitably make mistakes when performing step-by-step mathematical reasoning. Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. However, existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy, which is further exacerbated by the scarcity of annotated data. To address these issues, we propose Reasoning-Driven Process Reward Modeling (R-PRM). First, we leverage stronger LLMs to generate seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities and enabling comprehensive step-by-step evaluation. Second, we further enhance performance through preference optimization, without requiring additional annotated data. Third, we introduce inference-time scaling to fully harness the model's reasoning potential. Extensive experiments demonstrate R-PRM's effectiveness: on ProcessBench and PRMBench, it surpasses strong baselines by 11.9 and 8.5 points in F1 scores, respectively. When applied to guide mathematical reasoning, R-PRM achieves consistent accuracy improvements of over 8.5 points across six challenging datasets. Further analysis reveals that R-PRM exhibits more comprehensive evaluation and stronger generalization capabilities, thereby highlighting its significant potential.

ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model

Jinwei Qi,Chaonan Ji,Sheng Xu,Peng Zhang,Bang Zhang,Liefeng Bo

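Task: Propose a novel framework for stylized real-time portrait video generation that supports interactive video chat from talking head to upper body.

Motivation: Existing methods focus on real-time generation of head movements but struggle to produce body motion synchronized with head actions, and fine-grained control over speaking style and facial expressions remains challenging.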

Details

Method: Uses a two-stage approach: the first stage generates diverse facial expressions and synchronized head and body motion with hierarchical motion diffusion models; the second stage injects explicit hand control signals to generate portrait video with upper-body movements. Result: Experiments show that the method produces portrait videos with rich expressiveness and natural upper-body movements and supports real-time interactive video chat. Conclusion: The framework overcomes the limitations of existing methods, achieving stylized, expressive, and synchronized real-time portrait video generation. Abstract: Real-time interactive video-chat portraits have been increasingly recognized as the future trend, particularly due to the remarkable progress made in text and voice chat technologies. However, existing methods primarily focus on real-time generation of head movements, but struggle to produce synchronized body motions that match these head actions. Additionally, achieving fine-grained control over the speaking style and nuances of facial expressions remains a challenge. To address these limitations, we introduce a novel framework for stylized real-time portrait video generation, enabling expressive and flexible video chat that extends from talking head to upper-body interaction. Our approach consists of the following two stages. The first stage involves efficient hierarchical motion diffusion models that take both explicit and implicit motion representations into account based on audio inputs, which can generate a diverse range of facial expressions with stylistic control and synchronization between head and body movements. The second stage aims to generate portrait video featuring upper-body movements, including hand gestures. We inject explicit hand control signals into the generator to produce more detailed hand movements, and further perform face refinement to enhance the overall realism and expressiveness of the portrait video. Additionally, our approach supports efficient and continuous generation of upper-body portrait video at a maximum resolution of 512 * 768 and up to 30 fps on a 4090 GPU, supporting interactive video-chat in real-time. Experimental results demonstrate the capability of our approach to produce portrait videos with rich expressiveness and natural upper-body movements.

Taewon Yun,Jihwan Oh,Hyangsuk Min,Yuho Lee,Jihwan Bang,Jason Cai,Hwanjun Song

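Task: Propose ReFeed, a pipeline that enhances multi-dimensional summarization refinement through reflective reasoning on feedback.

Motivation: Address the challenges of multi-dimensional summarization refinement, in particular the trade-offs between dimensions.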

Details

Method: Introduce the SumFeed-CoT dataset and use it to train a lightweight model with reflective reasoning. Result: Experiments show that reflective reasoning and handling multiple feedback simultaneously are crucial to performance, and that ReFeed is robust to noisy feedback and feedback order. Conclusion: Creating data with a proper goal and guideline is the foundation of effective reasoning; the dataset and model will be released. Abstract: Summarization refinement faces challenges when extending to multiple dimensions. In this paper, we introduce ReFeed, a powerful summarization refinement pipeline that enhances multiple dimensions through reflective reasoning on feedback. To achieve this, we release SumFeed-CoT, a large-scale Long-CoT-based dataset optimized for training a lightweight model with reflective reasoning. Our experiments reveal how the number of dimensions, feedback exposure, and reasoning policy influence refinement performance, highlighting that reflective reasoning and simultaneously addressing multiple feedback are crucial to mitigating the trade-off between dimensions. Furthermore, ReFeed is robust to noisy feedback and feedback order. Lastly, our findings emphasize that creating data with a proper goal and guideline constitutes a fundamental pillar of effective reasoning. The dataset and model will be released.

The Devil is in Low-Level Features for Cross-Domain Few-Shot Segmentation

Yuhan Liu,Yixiong Zou,Yuhua Li,Ruixuan Li

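Task: Study the phenomenon in cross-domain few-shot segmentation (CDFSS) where performance peaks early and then declines, and propose a solution.

Motivation: In CDFSS, target-domain performance drops sharply after the early training epochs; the paper attributes this to the sensitivity of low-level features to domain shift.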

Details

Method: Two modules are proposed: a novel sharpness-aware minimization method that flattens the loss landscape, and a calibration module based on low-level features. Result: Validated on four target datasets, the method significantly outperforms existing methods, improving average MIoU by 3.71% and 5.34% in the 1-shot and 5-shot scenarios, respectively. Conclusion: Addressing the domain shift of low-level features significantly improves CDFSS performance. Abstract: Cross-Domain Few-Shot Segmentation (CDFSS) is proposed to transfer the pixel-level segmentation capabilities learned from large-scale source-domain datasets to downstream target-domain datasets, with only a few annotated images per class. In this paper, we focus on a well-observed but unresolved phenomenon in CDFSS: for target domains, particularly those distant from the source domain, segmentation performance peaks at the very early epochs, and declines sharply as the source-domain training proceeds. We delve into this phenomenon for an interpretation: low-level features are vulnerable to domain shifts, leading to sharper loss landscapes during the source-domain training, which is the devil of CDFSS. Based on this phenomenon and interpretation, we further propose a method that includes two plug-and-play modules: one to flatten the loss landscapes for low-level features during source-domain training as a novel sharpness-aware minimization method, and the other to directly supplement target-domain information to the model during target-domain testing by low-level-based calibration. Extensive experiments on four target datasets validate our rationale and demonstrate that our method surpasses the state-of-the-art method in CDFSS significantly by 3.71% and 5.34% average MIoU in 1-shot and 5-shot scenarios, respectively.
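
The first module above builds on sharpness-aware minimization (SAM). As a rough illustration of the generic SAM update only, and not of the paper's low-level-feature variant, the following pure-Python sketch applies one SAM step to a toy quadratic loss. The names `sam_step`, `toy_loss`, and `toy_grad`, and the values of `lr` and `rho`, are illustrative assumptions.

```python
import math
from typing import Callable, List

Vector = List[float]


def sam_step(w: Vector, grad_fn: Callable[[Vector], Vector],
             lr: float = 0.1, rho: float = 0.05) -> Vector:
    """One generic SAM update: ascend to a nearby 'sharp' point, then descend."""
    g = grad_fn(w)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        # Zero gradient: no ascent direction, so leave the weights unchanged.
        return list(w)
    # Step 1: move a distance rho along the normalized gradient (worst-case direction).
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Step 2: evaluate the gradient at the perturbed point.
    g_adv = grad_fn(w_adv)
    # Step 3: apply that gradient to the ORIGINAL weights.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def toy_loss(w: Vector) -> float:
    # Hypothetical loss L(w) = 0.5 * (10 * w0^2 + w1^2), sharp along w0.
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)


def toy_grad(w: Vector) -> Vector:
    # Analytic gradient of toy_loss.
    return [10.0 * w[0], w[1]]


if __name__ == "__main__":
    w0 = [1.0, 1.0]
    w1 = sam_step(w0, toy_grad)
    print(toy_loss(w0), toy_loss(w1))
```

In a real network the gradient would come from backpropagation, and the perturbation could be restricted to a subset of parameters; how the paper restricts it to low-level features is not specified in this summary.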

Fine-Tuning LLMs on Small Medical Datasets: Text Classification and Normalization Effectiveness on Cardiology reports and Discharge records

Noah Losch,Lucas Plagwitz,Antonius Büscher,Julian Varghese

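Task: Study the effectiveness of fine-tuning large language models (LLMs) on small medical datasets for text classification and named entity recognition tasks.

Motivation: Explore whether locally fine-tuning small LLMs on limited training data can improve performance enough to serve as an alternative to larger models.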

Details

Method: Experiments use a German cardiology report dataset and the i2b2 Smoking Challenge dataset. Result: Fine-tuning significantly improves task performance, reaching results comparable to larger models with as few as 200-300 training examples. Conclusion: Task-specific fine-tuning of LLMs shows potential for automating clinical workflows and efficiently extracting structured medical data. Abstract: We investigate the effectiveness of fine-tuning large language models (LLMs) on small medical datasets for text classification and named entity recognition tasks. Using a German cardiology report dataset and the i2b2 Smoking Challenge dataset, we demonstrate that fine-tuning small LLMs locally on limited training data can improve performance, achieving results comparable to larger models. Our experiments show that fine-tuning improves performance on both tasks, with notable gains observed with as few as 200-300 training examples. Overall, the study highlights the potential of task-specific fine-tuning of LLMs for automating clinical workflows and efficiently extracting structured data from unstructured medical text.

Integrating Travel Behavior Forecasting and Generative Modeling for Predicting Future Urban Mobility and Spatial Transformations

Eugene Denteh,Andrews Danyo,Joshua Kofi Asamoah,Blessing Agyei Kyem,Twitchell Addai,Armstrong Aboah

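Task: Integrate a Temporal Fusion Transformer and a Generative Adversarial Network to predict travel patterns and future urban layouts.

Motivation: Traditional transportation planning methods struggle to accurately predict long-term urban growth and transportation demand, which can lead to wasted infrastructure.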

Details

Method: A Temporal Fusion Transformer predicts travel behavior, and a Generative Adversarial Network predicts future urban satellite imagery. Result: Travel behavior prediction reaches an R-squared of 0.76, and the generated satellite images reach a Structural Similarity Index of 0.81. Conclusion: Data-driven methods significantly improve decision-making and promote sustainable urban development. Abstract: Transportation planning plays a critical role in shaping urban development, economic mobility, and infrastructure sustainability. However, traditional planning methods often struggle to accurately predict long-term urban growth and transportation demands. This may sometimes result in infrastructure demolition to make room for current transportation planning demands. This study integrates a Temporal Fusion Transformer to predict travel patterns from demographic data with a Generative Adversarial Network to predict future urban settings through satellite imagery. The framework achieved a 0.76 R-square score in travel behavior prediction and generated high-fidelity satellite images with a Structural Similarity Index of 0.81. The results demonstrate that integrating predictive analytics and spatial visualization can significantly improve the decision-making process, fostering more sustainable and efficient urban development. This research highlights the importance of data-driven methodologies in modern transportation planning and presents a step toward optimizing infrastructure placement, capacity, and long-term viability.
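
For readers unfamiliar with the two reported metrics, the sketch below shows how an R-squared score and a simplified Structural Similarity Index can be computed. It is an illustration under stated assumptions, not the study's evaluation code: `r_squared` and `global_ssim` are hypothetical helper names, and `global_ssim` treats the whole image as a single window, whereas standard SSIM averages the same formula over local windows.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant (otherwise SS_tot is zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def global_ssim(x: Sequence[float], y: Sequence[float],
                data_range: float = 255.0) -> float:
    """Single-window SSIM over two flattened grayscale images of equal size."""
    # Stabilizing constants with the commonly used defaults K1 = 0.01, K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    print(r_squared([1, 2, 3, 4], [1, 2, 3, 5]))
    print(global_ssim([0, 50, 100, 200], [0, 50, 100, 200]))
```

An R-squared of 0.76 therefore means the predictions leave 24% of the variance in the target unexplained, and an SSIM of 1.0 would indicate identical images.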

From User Preferences to Optimization Constraints Using Large Language Models

Manuela Sanguinetti,Alessandra Perniciano,Luca Zedda,Andrea Loddo,Cecilia Di Ruberto,Maurizio Atzori

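Task: Use large language models (LLMs) to translate user preferences into constraints for home energy optimization.

Motivation: Within a renewable energy community (REC) and in the Italian scenario, translate natural language user requests into formal constraints for smart appliances.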

Details

Method: Evaluate several LLMs available for Italian in zero-shot, one-shot, and few-shot learning settings, using a dataset of Italian user requests paired with formal constraint representations. Result: Establishes a baseline performance for the task, publicly releases the dataset and code, and summarizes observed best practices and limitations of LLMs in this domain. Conclusion: LLMs show potential for translating user preferences into energy optimization constraints, but further research is needed to overcome their limitations. Abstract: This work explores using Large Language Models (LLMs) to translate user preferences into energy optimization constraints for home appliances. We describe a task where natural language user utterances are converted into formal constraints for smart appliances, within the broader context of a renewable energy community (REC) and in the Italian scenario. We evaluate the effectiveness of various LLMs currently available for Italian in translating these preferences, resorting to classical zero-shot, one-shot, and few-shot learning settings, using a pilot dataset of Italian user requests paired with corresponding formal constraint representation. Our contributions include establishing a baseline performance for this task, publicly releasing the dataset and code for further research, and providing insights on observed best practices and limitations of LLMs in this particular domain.

Adversarial Wear and Tear: Exploiting Natural Damage for Generating Physical-World Adversarial Examples

Samra Irshad,Seungkyu Lee,Nassir Navab,Hong Joo Lee,Seong Tae Kim

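Task: Propose AdvWT, a new method for generating physical-world adversarial examples that simulates natural wear and tear to fool deep neural networks.

Motivation: Most existing physical adversarial example methods rely on temporary modifications (such as shadows or stickers) and lack naturalness and generality.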

Details

Method: A two-step approach: (1) model natural wear and tear with a GAN; (2) introduce adversarial perturbations into the damage style code. Result: AdvWT effectively fools DNNs in both the digital and physical domains, with a high attack success rate and a natural appearance. Conclusion: AdvWT not only improves the naturalness and robustness of adversarial examples but also enhances a model's generalization to real-world damaged signs. Abstract: The presence of adversarial examples in the physical world poses significant challenges to the deployment of Deep Neural Networks in safety-critical applications such as autonomous driving. Most existing methods for crafting physical-world adversarial examples are ad-hoc, relying on temporary modifications like shadows, laser beams, or stickers that are tailored to specific scenarios. In this paper, we introduce a new class of physical-world adversarial examples, AdvWT, which draws inspiration from the naturally occurring phenomenon of 'wear and tear', an inherent property of physical objects. Unlike manually crafted perturbations, 'wear and tear' emerges organically over time due to environmental degradation, as seen in the gradual deterioration of outdoor signboards. To achieve this, AdvWT follows a two-step approach. First, a GAN-based, unsupervised image-to-image translation network is employed to model these naturally occurring damages, particularly in the context of outdoor signboards. The translation network encodes the characteristics of damaged signs into a latent 'damage style code'. In the second step, we introduce adversarial perturbations into the style code, strategically optimizing its transformation process. This manipulation subtly alters the damage style representation, guiding the network to generate adversarial images where the appearance of damages remains perceptually realistic, while simultaneously ensuring their effectiveness in misleading neural networks. Through comprehensive experiments on two traffic sign datasets, we show that AdvWT effectively misleads DNNs in both digital and physical domains. AdvWT achieves an effective attack success rate, greater robustness, and a more natural appearance compared to existing physical-world adversarial examples. Additionally, integrating AdvWT into training enhances a model's generalizability to real-world damaged signs.

Retrieving Time-Series Differences Using Natural Language Queries

Kota Dohi,Tomoya Nishida,Harsh Purohit,Takashi Endo,Yohei Kawaguchi

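Task: Propose a natural language query-based method for retrieving pairs of time-series data according to the differences specified in the query.

Motivation: Traditional methods require domain expertise to define search criteria, while existing natural language search methods struggle to handle differences between time-series data.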

Details

Method: Define six key characteristics of time-series differences, construct a corresponding dataset, and develop a contrastive learning-based model to align time-series differences with query texts. Result: The model achieves an overall mAP of 0.994 in retrieving time-series pairs. Conclusion: The method effectively addresses retrieval of differences between time-series data and performs strongly. Abstract: Effectively searching time-series data is essential for system analysis; however, traditional methods often require domain expertise to define search criteria. Recent advancements have enabled natural language-based search, but these methods struggle to handle differences between time-series data. To address this limitation, we propose a natural language query-based approach for retrieving pairs of time-series data based on differences specified in the query. Specifically, we define six key characteristics of differences, construct a corresponding dataset, and develop a contrastive learning-based model to align differences between time-series data with query texts. Experimental results demonstrate that our model achieves an overall mAP score of 0.994 in retrieving time-series pairs.
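
The reported mAP (mean average precision) summarizes ranking quality across queries. The sketch below shows one common way to compute it from binary relevance labels; the exact protocol used in the paper is not given in this summary, so `average_precision` and `mean_average_precision` are illustrative helpers that assume every relevant item appears somewhere in the ranked list.

```python
from typing import List, Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """AP for one query; relevance[i] is 1 if the item at rank i+1 is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    # No relevant item retrieved: define AP as 0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query: List[Sequence[int]]) -> float:
    """Mean of per-query AP values."""
    return sum(average_precision(r) for r in per_query) / len(per_query)


if __name__ == "__main__":
    # Query 1 ranks relevant items at positions 1 and 3; query 2 at position 1.
    print(mean_average_precision([[1, 0, 1], [1]]))
```

A value close to 1.0, such as the 0.994 reported, means relevant time-series pairs are almost always ranked ahead of irrelevant ones.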

VADMamba: Exploring State Space Models for Fast Video Anomaly Detection

Jiahao Lyu,Minghua Zhao,Jing Hu,Xuewen Huang,Yifei Chen,Shuangli Du

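Task: Propose VADMamba, a Mamba-based video anomaly detection method that uses a multi-task learning framework to predict frames and reconstruct optical flow.

Motivation: Existing video anomaly detection methods (CNN- or Transformer-based) trade off detection accuracy against inference speed, while the Mamba model shows promise in computational efficiency and long-range modeling.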

Details

Method: Propose the VQ-Mamba Unet (VQ-MaU) framework, which combines a vector quantization layer with a Mamba-based NVSS block; two networks separately predict frames and reconstruct optical flow, and a clip-level fusion strategy improves accuracy. Result: VADMamba's effectiveness is validated on three benchmark datasets, with inference speed superior to previous work. Conclusion: By combining Mamba with multi-task learning, VADMamba delivers an efficient, high-performing solution for video anomaly detection. Abstract: Video anomaly detection (VAD) methods are mostly CNN-based or Transformer-based, achieving impressive results, but the focus on detection accuracy often comes at the expense of inference speed. The emergence of state space models in computer vision, exemplified by the Mamba model, demonstrates improved computational efficiency through selective scans and showcases the great potential for long-range modeling. Our study pioneers the application of Mamba to VAD, dubbed VADMamba, which is based on multi-task learning for frame prediction and optical flow reconstruction. Specifically, we propose the VQ-Mamba Unet (VQ-MaU) framework, which incorporates a Vector Quantization (VQ) layer and Mamba-based Non-negative Visual State Space (NVSS) block. Furthermore, two individual VQ-MaU networks separately predict frames and reconstruct corresponding optical flows, further boosting accuracy through a clip-level fusion evaluation strategy. Experimental results validate the efficacy of the proposed VADMamba across three benchmark datasets, demonstrating superior performance in inference speed compared to previous work. Code is available at https://github.com/jLooo/VADMamba.
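
The core operation of a vector quantization layer is a nearest-neighbor lookup in a codebook. The sketch below illustrates that generic lookup only; it is not taken from the VADMamba code, and the `quantize` helper and the tiny two-entry codebook are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return the index and a copy of the codebook entry closest to `vector`."""
    best_index = min(range(len(codebook)),
                     key=lambda i: squared_distance(vector, codebook[i]))
    return best_index, list(codebook[best_index])


if __name__ == "__main__":
    # Hypothetical 2-entry codebook of 2-dimensional codes.
    codebook = [[0.0, 0.0], [1.0, 1.0]]
    print(quantize([0.9, 0.8], codebook))
```

In a trained VQ layer the codebook entries are learned, and gradients are usually passed through the non-differentiable lookup with a straight-through estimator; whether VADMamba follows that recipe is not stated in the abstract.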

Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models

Haoxiang Sun,Yingqian Min,Zhipeng Chen,Wayne Xin Zhao,Zheng Liu,Zhongyuan Wang,Lei Fang,Ji-Rong Wen

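Task: Introduce OlymMATH, an Olympiad-level mathematical benchmark for rigorously evaluating the complex reasoning capabilities of large reasoning models.

Motivation: Existing benchmarks for mathematical reasoning have become saturated by the rapid development of large reasoning models, creating an urgent need for more challenging and rigorous evaluation frameworks.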

Details

Method: OlymMATH contains 200 carefully curated problems in two difficulty tiers (AIME-level and harder), covers four core mathematical fields, and provides verifiable numerical solutions. Result: State-of-the-art models (such as DeepSeek-R1 and OpenAI's o3-mini) show notably limited accuracy on the hard subset. Conclusion: OlymMATH fills the gap in bilingual assessment left by mainstream mathematical reasoning benchmarks and offers a new tool for rigorously testing model reasoning. Abstract: In recent years, the rapid development of large reasoning models has resulted in the saturation of existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark, designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two distinct difficulty tiers: (1) AIME-level problems (easy) that establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard) designed to push the boundaries of current state-of-the-art models. In our benchmark, these problems span four core mathematical fields, each including a verifiable numerical solution to enable objective, rule-based evaluation. Empirical results underscore the significant challenge presented by OlymMATH, with state-of-the-art models including DeepSeek-R1 and OpenAI's o3-mini demonstrating notably limited accuracy on the hard subset. Furthermore, the benchmark facilitates comprehensive bilingual assessment of mathematical reasoning abilities, a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the OlymMATH benchmark at the STILL project: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.
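
Because every problem has a verifiable numerical solution, answers can be graded by rule rather than by a judge model. The sketch below shows one minimal way such a check could work, comparing answers as exact rationals. It is an assumption-laden illustration, not the benchmark's grader: `parse_number`, `is_correct`, and `accuracy` are hypothetical names, and real Olympiad answers often involve radicals or symbolic expressions that this simple parser does not handle.

```python
from fractions import Fraction
from typing import Optional, Sequence


def parse_number(text: str) -> Optional[Fraction]:
    """Parse strings such as '3', '0.25' or '1/4' into an exact rational."""
    try:
        return Fraction(text.strip())
    except (ValueError, ZeroDivisionError):
        # Not a plain number (or a zero denominator): treat as unparseable.
        return None


def is_correct(prediction: str, reference: str) -> bool:
    """True only if both strings parse and denote exactly the same number."""
    pred = parse_number(prediction)
    ref = parse_number(reference)
    return pred is not None and ref is not None and pred == ref


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their reference."""
    correct = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return correct / len(references)


if __name__ == "__main__":
    print(is_correct("0.5", "1/2"))
    print(accuracy(["4", "0.25", "x"], ["4", "1/4", "7"]))
```

Exact rational comparison avoids floating-point tolerance choices: "0.5" and "1/2" match, while a rounded "0.33" does not match "1/3".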

Model as a Game: On Numerical and Spatial Consistency for Generative Games

Jingye Chen,Yuzhong Zhao,Yupan Huang,Lei Cui,Li Dong,Tengchao Lv,Qifeng Chen,Furu Wei

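Task: Explore how dedicated numerical and spatial modules can improve the numerical and spatial consistency of generative models in game generation.

Motivation: Existing generative models produce high-quality graphics in game generation but struggle to maintain numerical and spatial consistency, which harms the gameplay experience.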

Details

Method: Building on the DiT architecture, design a numerical module (LogicNet) and a spatial module (map maintenance), and validate them experimentally. Result: The integrated modules significantly outperform baselines on consistency metrics with minimal inference-time overhead. Conclusion: The proposed modules effectively address numerical and spatial consistency in generative games, offering a feasible approach to Model as a Game (MaaG). Abstract: Recent advances in generative models have significantly impacted game generation. However, despite producing high-quality graphics and adequately receiving player input, existing models often fail to maintain fundamental game properties such as numerical and spatial consistency. Numerical consistency ensures gameplay mechanics correctly reflect score changes and other quantitative elements, while spatial consistency prevents jarring scene transitions, providing seamless player experiences. In this paper, we revisit the paradigm of generative games to explore what truly constitutes a Model as a Game (MaaG) with a well-developed mechanism. We begin with an empirical study on "Traveler", a 2D game created by an LLM featuring minimalist rules yet challenging generative models in maintaining consistency. Based on the DiT architecture, we design two specialized modules: (1) a numerical module that integrates a LogicNet to determine event triggers, with calculations processed externally as conditions for image generation; and (2) a spatial module that maintains a map of explored areas, retrieving location-specific information during generation and linking new observations to ensure continuity. Experiments across three games demonstrate that our integrated modules significantly enhance performance on consistency metrics compared to baselines, while incurring minimal time overhead during inference.

Controlling Large Language Model with Latent Actions

Chengxing Jia,Ziniu Li,Pengyuan Wang,Yi-Chen Li,Zhenyu Hou,Yuxiao Dong,Yang Yu

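Task: Learn a compact latent action space to enhance the controllability and exploration of reinforcement learning (RL) for large language models (LLMs).

Motivation: LLMs lack a clearly structured action space for RL training, which limits their potential in downstream tasks.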

Details

Method: Propose the CoLA framework, which integrates a latent action space into pre-trained LLMs, and apply it to the Llama-3.1-8B model. Result: CoLA scores 42.4 on the math500 benchmark, above the baseline's 38.2; it improves performance consistently on agent-based tasks and halves computation time in tasks involving enhanced thinking prompts. Conclusion: The CoLA framework shows significant potential for improving the RL adaptation of LLMs and their downstream task performance. Abstract: Adapting Large Language Models (LLMs) to downstream tasks using Reinforcement Learning (RL) has proven to be an effective approach. However, LLMs do not inherently define the structure of an agent for RL training, particularly in terms of defining the action space. This paper studies learning a compact latent action space to enhance the controllability and exploration of RL for LLMs. We propose Controlling Large Language Models with Latent Actions (CoLA), a framework that integrates a latent action space into pre-trained LLMs. We apply CoLA to the Llama-3.1-8B model. Our experiments demonstrate that, compared to RL with token-level actions, CoLA's latent action enables greater semantic diversity in text generation. For enhancing downstream tasks, we show that CoLA with RL achieves a score of 42.4 on the math500 benchmark, surpassing the baseline score of 38.2, and reaches 68.2 when augmented with a Monte Carlo Tree Search variant. Furthermore, CoLA with RL consistently improves performance on agent-based tasks without degrading the pre-trained LLM's capabilities, unlike the baseline. Finally, CoLA reduces computation time by half in tasks involving enhanced thinking prompts for LLMs by RL. These results highlight CoLA's potential to advance RL-based adaptation of LLMs for downstream applications.

DGSUnet: An Improved Unet Model with DINO-Guided SAM2 for Multi-Scale Feature Collaboration

Yimin Xu

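Task: Propose a multi-scale feature collaboration framework based on DINOv2 and SAM2 to address the performance limitations of general image segmentation models in specialized fields.

Motivation: General image segmentation models (such as SAM and DINOv2) perform poorly in specialized fields, mainly because of high training costs and an insufficient ability to represent domain-specific characteristics.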

Details

Method: A feature collaboration mechanism, lightweight adapter modules, and a U-shaped network structure enable cross-domain knowledge injection and adaptive aggregation of multi-granularity features. Result: The framework surpasses existing methods on downstream tasks such as camouflage target detection and salient object detection without costly training. Conclusion: The framework offers a technical pathway for efficient deployment of visual image segmentation and has broad application value in specialized fields. Abstract: Despite the significant advancements in general image segmentation achieved by large-scale pre-trained foundation models (such as Meta's Segment Anything Model (SAM) series and DINOv2), their performance in specialized fields remains limited by two critical issues: the excessive training costs due to large model parameters, and the insufficient ability to represent specific domain characteristics. This paper proposes a multi-scale feature collaboration framework guided by DINOv2 for SAM2, with core innovations in three aspects: (1) Establishing a feature collaboration mechanism between DINOv2 and SAM2 backbones, where high-dimensional semantic features extracted by the self-supervised model guide multi-scale feature fusion; (2) Designing lightweight adapter modules and cross-modal, cross-layer feature fusion units to inject cross-domain knowledge while freezing the base model parameters; (3) Constructing a U-shaped network structure based on U-net, which utilizes attention mechanisms to achieve adaptive aggregation decoding of multi-granularity features. This framework surpasses existing state-of-the-art methods in downstream tasks such as camouflage target detection and salient object detection, without requiring costly training processes. It provides a technical pathway for efficient deployment of visual image segmentation, demonstrating significant application value in a wide range of downstream tasks and specialized fields within image segmentation. Project page: https://github.com/CheneyXuYiMin/SAM2DINO-Seg

An evaluation of LLMs and Google Translate for translation of selected Indian languages via sentiment and semantic analyses

Rohitash Chandra,Aryan Chaudhary,Yeshwanth Rayavarapu

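Task: Evaluate the quality of large language model (LLM) translations of Indian languages (such as Sanskrit, Telugu, and Hindi) using semantic and sentiment analysis.

Motivation: Few studies have assessed the quality of translations generated by LLMs (such as Gemini and GPT) and by Google Translate, especially for low-resource languages.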

Details

Method: Select texts that have been translated by experts, use LLMs to generate English translations, and compare them with the expert translations through semantic and sentiment analysis. Result: LLMs have made significant progress in translation accuracy, but preserving sentiment and semantic integrity remains challenging, especially in figurative and philosophical contexts. GPT-4o and GPT-3.5 preserve sentiment better than Google Translate. Conclusion: LLMs capture sentiment better than Google Translate but still need improvement in complex contexts. Abstract: Large language models (LLMs) have been prominent for language translation, including low-resource languages. There has been limited study about the assessment of the quality of translations generated by LLMs, including Gemini, GPT and Google Translate. In this study, we address this limitation by using semantic and sentiment analysis of selected LLMs for Indian languages, including Sanskrit, Telugu and Hindi. We select prominent texts that have been well translated by experts and use LLMs to generate their translations to English, and then we provide a comparison with selected expert (human) translations. Our findings suggest that while LLMs have made significant progress in translation accuracy, challenges remain in preserving sentiment and semantic integrity, especially in figurative and philosophical contexts. The sentiment analysis revealed that GPT-4o and GPT-3.5 are better at preserving the sentiments for the Bhagavad Gita (Sanskrit-English) translations when compared to Google Translate. We observed a similar trend for the case of Tamas (Hindi-English) and Maha P (Telugu-English) translations. GPT-4o performs similarly to GPT-3.5 in the translation in terms of sentiments for the three languages. We found that LLMs are generally better at translation for capturing sentiments when compared to Google Translate.

Erika Mori,Yue Qiu,Hirokatsu Kataoka,Yoshimitsu Aoki

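Task: Propose the Looped Video Debating (LVD) framework, which combines large language models (LLMs) with visual information to improve the transparency and reliability of question-answering tasks on human interaction videos.

Motivation: As robots and AI systems become more prevalent in caregiving, healthcare, and education, demand grows for AI that interacts naturally with humans, yet existing methods struggle to integrate multimodal information such as vision and speech.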

Details

Method: Propose the LVD framework, which combines LLMs with visual information (such as facial expressions and body movements) for question answering on human interaction videos. Result: On the Social-IQ 2.0 benchmark, LVD achieves state-of-the-art performance without fine-tuning, and supplementary human annotations provide insight into the model's accuracy. Conclusion: The LVD framework points to improvements in AI-driven social intelligence and shows potential in multimodal tasks. Abstract: Social intelligence, the ability to interpret emotions, intentions, and behaviors, is essential for effective communication and adaptive responses. As robots and AI systems become more prevalent in caregiving, healthcare, and education, the demand for AI that can interact naturally with humans grows. However, creating AI that seamlessly integrates multiple modalities, such as vision and speech, remains a challenge. Current video-based methods for social intelligence rely on general video recognition or emotion recognition techniques and often overlook the unique elements inherent in human interactions. To address this, we propose the Looped Video Debating (LVD) framework, which integrates Large Language Models (LLMs) with visual information, such as facial expressions and body movements, to enhance the transparency and reliability of question-answering tasks involving human interaction videos. Our results on the Social-IQ 2.0 benchmark show that LVD achieves state-of-the-art performance without fine-tuning. Furthermore, supplementary human annotations on existing datasets provide insights into the model's accuracy, guiding future improvements in AI-driven social intelligence.

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

Junyu Luo,Weizhi Zhang,Ye Yuan,Yusheng Zhao,Junwei Yang,Yiyang Gu,Bohan Wu,Binqi Chen,Ziyue Qiao,Qingqing Long,Rongcheng Tu,Xiao Luo,Wei Ju,Zhiping Xiao,Yifan Wang,Meng Xiao,Chenwu Liu,Jingyang Yuan,Shichang Zhang,Yiqiao Jin,Fan Zhang,Xian Wu,Hanqing Zhao,Dacheng Tao,Philip S. Yu,Ming Zhang

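Task: Systematically deconstruct large language model (LLM) agent systems through a methodology-centered taxonomy that links architectural foundations, collaboration mechanisms, and evolutionary pathways.

Motivation: Reveal the fundamental connections between agent design principles and their emergent behaviors in complex environments, unify fragmented research threads, and give researchers a structured taxonomy.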

Details

Method: Use a methodology-centered taxonomy to analyze how agents are constructed, how they collaborate, and how they evolve, while also covering evaluation methodologies, tool applications, practical challenges, and diverse application domains. Result: Provides a unified architectural perspective, summarizes the latest developments in LLM agents, and identifies promising directions for future research. Conclusion: LLM agents potentially represent a critical pathway toward artificial general intelligence; the survey offers a structured framework for understanding these systems and for guiding future research. Abstract: The era of intelligent agents is upon us, driven by revolutionary advancements in large language models. Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy, linking architectural foundations, collaboration mechanisms, and evolutionary pathways. We unify fragmented research threads by revealing fundamental connections between agent design principles and their emergent behaviors in complex environments. Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time, while also addressing evaluation methodologies, tool applications, practical challenges, and diverse application domains. By surveying the latest developments in this rapidly evolving field, we offer researchers a structured taxonomy for understanding LLM agents and identify promising directions for future research. The collection is available at https://github.com/luo-junyu/Awesome-Agent-Papers.

An improved EfficientNetV2 for garbage classification

Wenxuan Qiu,Chengxin Xie,Jingui Huang

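Task: Propose an enhanced garbage classification framework based on EfficientNetV2 that addresses challenges in data acquisition cost, generalization, and real-time performance.

Motivation: Address the high cost of data acquisition, insufficient model generalization, and real-time performance requirements in garbage classification.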

Details

Method: Propose a CE-Attention module that reduces feature loss during global pooling, develop a lightweight multi-scale spatial feature extraction module (SAFM) that uses depthwise separable convolutions to lower model complexity, and apply data augmentation strategies to improve generalization. Result: On the Huawei Cloud garbage classification dataset, classification accuracy reaches 95.4%, 3.2% above the baseline and better than mainstream models. Conclusion: The method balances accuracy and efficiency and suits practical garbage classification scenarios. Abstract: This paper presents an enhanced waste classification framework based on EfficientNetV2 to address challenges in data acquisition cost, generalization, and real-time performance. We propose a Channel-Efficient Attention (CE-Attention) module that mitigates feature loss during global pooling without introducing dimensional scaling, effectively enhancing critical feature extraction. Additionally, a lightweight multi-scale spatial feature extraction module (SAFM) is developed by integrating depthwise separable convolutions, significantly reducing model complexity. Comprehensive data augmentation strategies are further employed to improve generalization. Experiments on the Huawei Cloud waste classification dataset demonstrate that our method achieves a classification accuracy of 95.4%, surpassing the baseline by 3.2% and outperforming mainstream models. The results validate the effectiveness of our approach in balancing accuracy and efficiency for practical waste classification scenarios.
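
The complexity reduction from depthwise separable convolutions can be seen with simple parameter arithmetic. The sketch below compares a standard convolution with its depthwise separable counterpart; the layer sizes are hypothetical, bias terms are ignored, and nothing here reflects the actual SAFM configuration.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1 x 1 conv that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
    full = standard_conv_params(3, 64, 128)
    separable = depthwise_separable_params(3, 64, 128)
    print(full, separable, separable / full)
```

The ratio simplifies to 1/c_out + 1/k^2, so for 3 x 3 kernels and wide layers the separable form needs a little over one ninth of the weights.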

Harnessing Chain-of-Thought Metadata for Task Routing and Adversarial Prompt Detection

Ryan Marinelli,Josef Pichlmeier,Tamas Bisztray

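Task: Propose a metric called Number of Thoughts (NofT) for assessing task difficulty and supporting the use of large language models (LLMs) in production.

Motivation: Quantify task difficulty in order to optimize prompt routing, reduce latency, and effectively detect adversarial prompt attacks.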

Details

Method: Set thresholds on the number of thoughts and use them for prompt routing and adversarial prompt detection. Result: Achieves a 2% reduction in latency on the MathInstruct dataset and 95% accuracy in adversarial prompt detection. Conclusion: The NofT metric has practical value for optimizing LLM performance and security. Abstract: In this work, we propose a metric called Number of Thoughts (NofT) to determine the difficulty of tasks pre-prompting and support Large Language Models (LLMs) in production contexts. By setting thresholds based on the number of thoughts, this metric can discern the difficulty of prompts and support more effective prompt routing. A 2% decrease in latency is achieved when routing prompts from the MathInstruct dataset through quantized, distilled versions of Deepseek with 1.7 billion, 7 billion, and 14 billion parameters. Moreover, this metric can be used to detect adversarial prompts used in prompt injection attacks with high efficacy. The Number of Thoughts can inform a classifier that achieves 95% accuracy in adversarial prompt detection. Our experiments and datasets used are available on our GitHub page: https://github.com/rymarinelli/Number_Of_Thoughts/tree/main.
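
Threshold-based routing of the kind described above can be sketched in a few lines. The thresholds and tier labels below are hypothetical placeholders, not values from the paper, and `route_prompt` is an illustrative helper that assumes the number of thoughts for a prompt has already been estimated.

```python
from bisect import bisect_left
from typing import Sequence

# Hypothetical tiers standing in for the three model sizes mentioned in the abstract.
MODEL_TIERS = ("small", "medium", "large")
# Hypothetical cut-offs: up to 3 thoughts -> small, up to 8 -> medium, more -> large.
THRESHOLDS = (3, 8)


def route_prompt(num_thoughts: int,
                 thresholds: Sequence[int] = THRESHOLDS,
                 tiers: Sequence[str] = MODEL_TIERS) -> str:
    """Pick a model tier from the estimated number of thoughts.

    `thresholds` must be sorted ascending and one shorter than `tiers`.
    """
    # bisect_left counts how many thresholds are strictly below num_thoughts,
    # so a value equal to a threshold stays in the cheaper tier.
    return tiers[bisect_left(thresholds, num_thoughts)]


if __name__ == "__main__":
    for n in (1, 3, 4, 8, 12):
        print(n, route_prompt(n))
```

Sending easy prompts to the smallest model is what yields the latency saving; the same count could also be passed as a feature to a separate classifier for adversarial prompt detection, as the abstract describes.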

FakeReasoning: Towards Generalizable Forgery Detection and Reasoning

Yueying Gao,Dongliang Chang,Bingyao Yu,Haotian Qin,Lei Chen,Kongming Liang,Zhanyu Ma

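Task: Develop an interpretable method for detecting AI-generated images that addresses the domain gap among generative models.

Motivation: Given the widespread use of AI-generated images and the risk of their misuse, a method is needed that detects them accurately and explains its decisions.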

Details

Method: Propose the FDR-Task, which uses vision-language models (VLMs) for structured reasoning, and introduce the MMFR-Dataset and the FakeReasoning framework. Result: FakeReasoning performs well across multiple generative models and outperforms existing methods on both detection and reasoning tasks. Conclusion: By combining contrastive learning with classification probability mapping, FakeReasoning achieves robust detection and explanation of AI-generated images. Abstract: Accurate and interpretable detection of AI-generated images is essential for mitigating risks associated with AI misuse. However, the substantial domain gap among generative models makes it challenging to develop a generalizable forgery detection model. Moreover, since every pixel in an AI-generated image is synthesized, traditional saliency-based forgery explanation methods are not well suited for this task. To address these challenges, we propose modeling AI-generated image detection and explanation as a Forgery Detection and Reasoning task (FDR-Task), leveraging vision-language models (VLMs) to provide accurate detection through structured and reliable reasoning over forgery attributes. To facilitate this task, we introduce the Multi-Modal Forgery Reasoning dataset (MMFR-Dataset), a large-scale dataset containing 100K images across 10 generative models, with 10 types of forgery reasoning annotations, enabling comprehensive evaluation of FDR-Task. Additionally, we propose FakeReasoning, a forgery detection and reasoning framework with two key components. First, Forgery-Aligned Contrastive Learning enhances VLMs' understanding of forgery-related semantics through both cross-modal and intra-modal contrastive learning between images and forgery attribute reasoning. Second, a Classification Probability Mapper bridges the optimization gap between forgery detection and language modeling by mapping the output logits of VLMs to calibrated binary classification probabilities. Experiments across multiple generative models demonstrate that FakeReasoning not only achieves robust generalization but also outperforms state-of-the-art methods on both detection and reasoning tasks.

OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs

John Murzaku,Owen Rambow

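Task: Systematically evaluate four omni-LLMs on zero-shot emotion recognition.

Motivation: The use of omni-LLMs for multimodal cognitive state tasks, especially those involving speech, remains underexplored.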

Details

Method: Present OmniVox, an evaluation of omni-LLMs on two multimodal emotion benchmarks, IEMOCAP and MELD, and introduce an acoustic prompting strategy. Result: Zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models, and a context window analysis shows that using context improves performance. Conclusion: Acoustic prompting and context analysis offer effective ways to apply omni-LLMs to multimodal emotion tasks. Abstract: The use of omni-LLMs (large language models that accept any modality as input), particularly for multimodal cognitive state tasks involving speech, is understudied. We present OmniVox, the first systematic evaluation of four omni-LLMs on the zero-shot emotion recognition task. We evaluate on two widely used multimodal emotion benchmarks: IEMOCAP and MELD, and find zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models. Alongside our audio-only evaluation, we also evaluate omni-LLMs on text only and text and audio. We present acoustic prompting, an audio-specific prompting strategy for omni-LLMs which focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning. We compare our acoustic prompting to minimal prompting and full chain-of-thought prompting techniques. We perform a context window analysis on IEMOCAP and MELD, and find that using context helps, especially on IEMOCAP. We conclude with an error analysis on the generated acoustic reasoning outputs from the omni-LLMs.

VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation

Alan Dao,Norapat Buppodom

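Task: Propose a new method that uses a vision-language model (VLM) to extract high-level semantic information (such as object identity, color, and location) from voxel data.

Motivation: Understanding 3D environments is vital for intelligent systems such as robots and autonomous navigation, but extracting high-level semantic information from voxel grids remains challenging.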

Details

Method: Systematically slice the voxel space along a primary axis (such as the Z-axis) and feed the 2D slices into the image encoder of a standard VLM, using a pre-trained 2D VLM for 3D semantic understanding. Result: The method extracts semantic information from voxel representations efficiently and avoids complex 3D networks. Conclusion: The slice-based strategy makes effective use of pre-trained 2D VLMs and offers an efficient route to 3D semantic understanding directly from voxel representations. Abstract: Comprehending 3D environments is vital for intelligent systems in domains like robotics and autonomous navigation. Voxel grids offer a structured representation of 3D space, but extracting high-level semantic meaning remains challenging. This paper proposes a novel approach utilizing a Vision-Language Model (VLM) to extract "voxel semantics" (object identity, color, and location) from voxel data. Critically, instead of employing complex 3D networks, our method processes the voxel space by systematically slicing it along a primary axis (e.g., the Z-axis, analogous to CT scan slices). These 2D slices are then formatted and sequentially fed into the image encoder of a standard VLM. The model learns to aggregate information across slices and correlate spatial patterns with semantic concepts provided by the language component. This slice-based strategy aims to leverage the power of pre-trained 2D VLMs for efficient 3D semantic understanding directly from voxel representations.
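
The slicing step described above is straightforward to picture in code. The sketch below cuts a voxel grid into 2D slices along the Z-axis; the `grid[x][y][z]` indexing convention and the `slice_along_z` helper are assumptions for illustration, and how the paper formats slices before feeding them to the VLM is not covered here.

```python
from typing import List

Grid3D = List[List[List[int]]]   # assumed indexing: grid[x][y][z]
Slice2D = List[List[int]]        # resulting indexing: slice[x][y]


def slice_along_z(grid: Grid3D) -> List[Slice2D]:
    """Return one 2D slice per Z index, ordered from z = 0 upward."""
    size_x = len(grid)
    size_y = len(grid[0])
    size_z = len(grid[0][0])
    return [
        [[grid[x][y][z] for y in range(size_y)] for x in range(size_x)]
        for z in range(size_z)
    ]


if __name__ == "__main__":
    # Toy 2 x 2 x 2 grid whose voxel value encodes its own coordinates.
    grid = [[[100 * x + 10 * y + z for z in range(2)] for y in range(2)]
            for x in range(2)]
    for z, plane in enumerate(slice_along_z(grid)):
        print(z, plane)
```

Each resulting slice can then be rendered as an image and passed to the model in sequence, analogous to paging through CT scan slices.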

OpenHuEval: Evaluating Large Language Model on Hungarian Specifics

Haote Yang,Xingjian Wei,Jiang Wu,Noémi Ligeti-Nagy,Jiaxing Sun,Yinfan Wang,Zijian Győző Yang,Junyuan Gao,Jingchao Wang,Bowen Jiang,Shasha Wang,Nanjun Yu,Zihao Zhang,Shixin Hong,Hongwei Liu,Wei Li,Songyang Zhang,Dahua Lin,Lijun Wu,Gábor Prószéky,Conghui He

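Task: Build OpenHuEval, the first LLM benchmark focused on the Hungarian language and Hungarian specifics.

Motivation: Provide a comprehensive, in-depth, and scientifically accurate tool for assessing LLM performance on the Hungarian language and its specifics.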

Details

Method: Draws on Hungarian materials from multiple sources and applies recent design principles for LLM evaluation (real user queries, assessment of generative capabilities, and LLM-as-judge). Result: OpenHuEval covers 8 dimensions, 5 tasks, and 3953 questions; the evaluation of mainstream LLMs shows the need for optimization tailored to Hungarian. Conclusion: OpenHuEval provides a scientific framework for evaluating and optimizing LLMs on Hungarian and reveals the thinking patterns of reasoning models in a non-English language. Abstract: We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs' generative capabilities, and employing LLM-as-judge to enhance the multidimensionality and accuracy of evaluations. Ultimately, OpenHuEval encompasses eight Hungarian-specific dimensions, featuring five tasks and 3953 questions. Consequently, OpenHuEval provides a comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the context of the Hungarian language and its specifics. We evaluated current mainstream LLMs, including both traditional LLMs and recently developed Large Reasoning Models (LRMs). The results demonstrate the significant necessity for evaluation and model optimization tailored to the Hungarian language and specifics. We also established a framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian serving as a representative example. We will release OpenHuEval at https://github.com/opendatalab/OpenHuEval .

GenFusion: Closing the Loop between Reconstruction and Generation via Videos

Sibo Wu,Congrong Xu,Binbin Huang,Andreas Geiger,Anpei Chen

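Task: Propose a reconstruction-driven video diffusion model that addresses the conditioning gap between 3D reconstruction and 3D generation.

Motivation: A conditioning gap separates existing 3D reconstruction and generation methods: for example, reconstruction requires densely captured views while generation needs only a single view, which limits their applications.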

Details

Method: Proposes a reconstruction-driven video diffusion model and a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set. Result: View synthesis from sparse-view and masked inputs validates the effectiveness of the method. Conclusion: By combining reconstruction and generation, the method addresses the conditioning gap and improves performance. Abstract: Recently, 3D reconstruction and generation have demonstrated impressive novel view synthesis results, achieving high fidelity and efficiency. However, a notable conditioning gap can be observed between these two fields, e.g., scalable 3D scene reconstruction often requires densely captured views, whereas 3D generation typically relies on a single or no input view, which significantly limits their applications. We found that the source of this phenomenon lies in the misalignment between 3D constraints and generative priors. To address this problem, we propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. Moreover, we propose a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set, enabling progressive expansion and addressing the viewpoint saturation limitations seen in previous reconstruction and generation pipelines. Our evaluation, including view synthesis from sparse-view and masked inputs, validates the effectiveness of our approach.

Keyword-Oriented Multimodal Modeling for Euphemism Identification

Yuxue Hu,Junsong Li,Meixuan Chen,Dongyu Su,Tongguan Wang,Ying Sha

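Task: Identify the true meaning of euphemisms, for example linking "weed" (euphemism) to "marijuana" (target keyword).

Motivation: Existing methods are primarily text-based, while the rise of social media highlights the need for multimodal analysis that combines text, images, and audio; the lack of multimodal euphemism datasets limits further research.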

Details

Method: Introduces a keyword-oriented multimodal corpus of euphemisms (KOM-Euph) and proposes a keyword-oriented multimodal euphemism identification method (KOM-EI) that uses cross-modal feature alignment and dynamic fusion modules. Result: KOM-EI outperforms state-of-the-art models and large language models in experiments, which also confirm the importance of the multimodal datasets. Conclusion: By introducing a multimodal dataset and method, this work provides a more efficient solution for euphemism identification. Abstract: Euphemism identification deciphers the true meaning of euphemisms, such as linking "weed" (euphemism) to "marijuana" (target keyword) in illicit texts, aiding content moderation and combating underground markets. While existing methods are primarily text-based, the rise of social media highlights the need for multimodal analysis, incorporating text, images, and audio. However, the lack of multimodal datasets for euphemisms limits further research. To address this, we regard euphemisms and their corresponding target keywords as keywords and first introduce a keyword-oriented multimodal corpus of euphemisms (KOM-Euph), involving three datasets (Drug, Weapon, and Sexuality), including text, images, and speech. We further propose a keyword-oriented multimodal euphemism identification method (KOM-EI), which uses cross-modal feature alignment and dynamic fusion modules to explicitly utilize the visual and audio features of the keywords for efficient euphemism identification. Extensive experiments demonstrate that KOM-EI outperforms state-of-the-art models and large language models, and show the importance of our multimodal datasets.

Frequency-Aware Gaussian Splatting Decomposition

Yishai Lavi,Leo Segre,Shai Avidan

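Task: Propose a frequency-decomposed 3D Gaussian Splatting framework that separates low-frequency structure from high-frequency detail.

Motivation: 3D Gaussian Splatting (3D-GS) lacks frequency interpretability, which makes it difficult to separate low-frequency structure from high-frequency detail.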

Details

Method: Groups 3D Gaussians according to the subbands of a Laplacian pyramid and introduces regularization and a progressive training scheme. Result: Achieves frequency separation and supports advanced 3D editing, dynamic level-of-detail control, and interactive rendering. Conclusion: The method provides improved control and flexibility for scene editing and interactive rendering. Abstract: 3D Gaussian Splatting (3D-GS) has revolutionized novel view synthesis with its efficient, explicit representation. However, it lacks frequency interpretability, making it difficult to separate low-frequency structures from fine details. We introduce a frequency-decomposed 3D-GS framework that groups 3D Gaussians that correspond to subbands in the Laplacian Pyramids of the input images. Our approach enforces coherence within each subband (i.e., group of 3D Gaussians) through dedicated regularization, ensuring well-separated frequency components. We extend color values to both positive and negative ranges, allowing higher-frequency layers to add or subtract residual details. To stabilize optimization, we employ a progressive training scheme that refines details in a coarse-to-fine manner. Beyond interpretability, this frequency-aware design unlocks a range of practical benefits. Explicit frequency separation enables advanced 3D editing and stylization, allowing precise manipulation of specific frequency bands. It also supports dynamic level-of-detail control for progressive rendering, streaming, foveated rendering and fast geometry interaction. Through extensive experiments, we demonstrate that our method provides improved control and flexibility for emerging applications in scene editing and interactive rendering. Our code will be made publicly available.
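To make the Laplacian pyramid idea concrete, here is a minimal sketch on a 1D signal rather than an image. The pair-averaging downsampler and sample-repeating upsampler are simplifying assumptions (real pyramids typically use Gaussian filtering), and none of the function names come from the paper. It shows why the residual bands take both positive and negative values, which mirrors the abstract's point that higher-frequency layers add or subtract detail.

```python
from typing import List


def downsample(signal: List[float]) -> List[float]:
    """Halve the length by averaging adjacent pairs (a crude low-pass filter)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def upsample(signal: List[float]) -> List[float]:
    """Double the length by repeating each sample."""
    return [value for value in signal for _ in range(2)]


def laplacian_pyramid(signal: List[float], levels: int) -> List[List[float]]:
    """Return [band_0, ..., band_{levels-1}, base], finest residual band first.

    Assumes len(signal) is divisible by 2 ** levels.
    """
    bands = []
    current = list(signal)
    for _ in range(levels):
        coarse = downsample(current)
        predicted = upsample(coarse)
        # Residual = what the coarse level cannot explain; may be negative.
        bands.append([c - p for c, p in zip(current, predicted)])
        current = coarse
    bands.append(current)  # low-frequency base
    return bands


def reconstruct(pyramid: List[List[float]]) -> List[float]:
    """Rebuild the signal coarse-to-fine by adding each residual band back."""
    current = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        current = [u + r for u, r in zip(upsample(current), band)]
    return current
```

Dropping the finest bands before calling `reconstruct` yields a smoother approximation, which is the same intuition behind level-of-detail control by frequency band.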

Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving

Yue Li,Meng Tian,Zhenyu Lin,Jiangtong Zhu,Dechang Zhu,Haiqiang Liu,Zining Wang,Yueyi Zhang,Zhiwei Xiong,Xinhai Zhao

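Task: Propose VLADBench, a fine-grained vision-language model (VLM) benchmark for evaluating capabilities in autonomous driving (AD) scenarios.

Motivation: Existing VLM benchmarks mainly assess interpretability through coarse-grained, open-form visual question answering (QA), which is insufficient for complex driving scenarios.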

Details

Method: Builds the VLADBench dataset of close-form QA tasks across 5 key domains, further divided into 11 secondary aspects and 29 tertiary tasks, and trains domain-specific (DS) models to evaluate their performance. Result: Experiments show that VLADBench assesses VLM capabilities in AD more comprehensively and reveals their strengths and limitations. Conclusion: VLADBench provides an important foundation for developing AD systems with stronger cognitive and reasoning abilities. Abstract: Existing benchmarks for Vision-Language Models (VLMs) on autonomous driving (AD) primarily assess interpretability through open-form visual question answering (QA) within coarse-grained tasks, which remain insufficient to assess capabilities in complex driving scenarios. To this end, we introduce VLADBench, a challenging and fine-grained dataset featuring close-form QAs that progress from static foundational knowledge and elements to advanced reasoning for dynamic on-road situations. The elaborate VLADBench spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning. These domains are further broken down into 11 secondary aspects and 29 tertiary tasks for a granular evaluation. A thorough assessment of general and domain-specific (DS) VLMs on this benchmark reveals both their strengths and critical limitations in AD contexts. To further exploit the cognitive and reasoning interactions among the 5 domains for AD understanding, we start from a small-scale VLM and train the DS models on individual domain datasets (collected from 1.4M DS QAs across public sources). The experimental results demonstrate that the proposed benchmark provides a crucial step toward a more comprehensive assessment of VLMs in AD, paving the way for the development of more cognitively sophisticated and reasoning-capable AD systems.

Clean Image May be Dangerous: Data Poisoning Attacks Against Deep Hashing

Shuai Li,Jie Zhang,Yuang Qi,Kejiang Chen,Tianwei Zhang,Weiming Zhang,Nenghai Yu

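Task: Study data poisoning attacks against deep hashing (PADHASH).

Motivation: Deep hashing methods are vulnerable to malicious attacks, but existing attacks typically modify the query images, whereas in practice malicious retrieval results can be induced without modifying the query image.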

Details

Method: First trains a surrogate model to simulate the behavior of the target deep hashing model, then proposes a strict gradient matching strategy to generate the poisoned images. Result: Experiments across different models, datasets, hash methods, and hash code lengths demonstrate the effectiveness and generality of the attack. Conclusion: This is the first study of data poisoning attacks against deep hashing, and it confirms the threat they pose in practical scenarios. Abstract: Large-scale image retrieval using deep hashing has become increasingly popular due to the exponential growth of image data and the remarkable feature extraction capabilities of deep neural networks (DNNs). However, deep hashing methods are vulnerable to malicious attacks, including adversarial and backdoor attacks. It is worth noting that these attacks typically involve altering the query images, which is not a practical concern in real-world scenarios. In this paper, we point out that even clean query images can be dangerous, inducing malicious target retrieval results, like undesired or illegal images. To the best of our knowledge, we are the first to study data poisoning attacks against deep hashing (PADHASH). Specifically, we first train a surrogate model to simulate the behavior of the target deep hashing model. Then, a strict gradient matching strategy is proposed to generate the poisoned images. Extensive experiments on different models, datasets, hash methods, and hash code lengths demonstrate the effectiveness and generality of our attack method.

Ana-Maria Bucur,Andreea-Codrina Moldovan,Krutika Parvatikar,Marcos Zampieri,Ashiqur R. KhudaBukhsh,Liviu P. Dinu

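Task: Provide a list of social media datasets for analyzing and predicting depression.

Motivation: Depression is the most common mental health disorder and its prevalence rose during the COVID-19 pandemic; the research community hopes to use social media data to enhance traditional screening methods.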

Details

Method: Surveys datasets published between 2019 and 2024 and provides a continuously updated online resource. Result: Delivers a comprehensive list of datasets that supports early-career researchers in interdisciplinary work. Conclusion: The resource is expected to facilitate further research into the linguistic expressions of depression on social media. Abstract: Depression is the most common mental health disorder, and its prevalence increased during the COVID-19 pandemic. Because depression is one of the most extensively researched psychological conditions, recent research has increasingly focused on leveraging social media data to enhance traditional methods of depression screening. This paper addresses the growing interest in interdisciplinary research on depression, and aims to support early-career researchers by providing a comprehensive and up-to-date list of datasets for analyzing and predicting depression through social media data. We present an overview of datasets published between 2019 and 2024. We also make the comprehensive list of datasets available online as a continuously updated resource, with the hope that it will facilitate further interdisciplinary research into the linguistic expressions of depression on social media.

DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation

Haoyu Zhao,Zhongang Qi,Cong Wang,Qingping Zheng,Guansong Lu,Fei Chen,Hang Xu,Zuxuan Wu

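Task: Propose the DynamiCtrl framework to address architectural limitations and the underuse of textual information in human image animation.

Motivation: Existing methods rely on the U-Net architecture and neglect textual information, which limits the performance and controllability of the generative model.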

Details

Method: Adopts the MM-DiT architecture, introduces a Shared VAE encoder and Pose-adaptive Layer Norm (PadaLN), and aligns textual and visual features. Result: Experiments show that DynamiCtrl performs strongly in identity preservation, heterogeneous character driving, background controllability, and high-quality synthesis. Conclusion: By combining text and pose control, DynamiCtrl substantially improves the quality and controllability of human image animation. Abstract: Human image animation has recently gained significant attention due to advancements in generative models. However, existing methods still face two major challenges: (1) architectural limitations: most models rely on U-Net, which underperforms compared to MM-DiT; and (2) the neglect of textual information, which can enhance controllability. In this work, we introduce DynamiCtrl, a novel framework that not only explores different pose-guided control structures in MM-DiT, but also reemphasizes the crucial role of text in this task. Specifically, we employ a Shared VAE encoder for both reference images and driving pose videos, eliminating the need for an additional pose encoder and simplifying the overall framework. To incorporate pose features into the full attention blocks, we propose Pose-adaptive Layer Norm (PadaLN), which utilizes adaptive layer normalization to encode sparse pose features. The encoded features are directly added to the visual input, preserving the spatiotemporal consistency of the backbone while effectively introducing pose control into MM-DiT. Furthermore, within the full attention mechanism, we align textual and visual features to enhance controllability. By leveraging text, we not only enable fine-grained control over the generated content, but also, for the first time, achieve simultaneous control over both background and motion. Experimental results verify the superiority of DynamiCtrl on benchmark datasets, demonstrating its strong identity preservation, heterogeneous character driving, background controllability, and high-quality synthesis. The project page is available at https://gulucaptain.github.io/DynamiCtrl/.

Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models

Umer Butt,Stalin Veranasi,Günter Neumann

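Task: Propose a transformer-based method for transliteration between Urdu and its Romanized form.

Motivation: Address the poor domain adaptability and limited evaluation of prior work on transliteration for low-resource languages such as Urdu.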

Details

Method: Uses the m2m100 multilingual translation model with masked language modeling (MLM) pretraining and fine-tuning on the Roman-Urdu-Parl and Dakshina datasets. Result: The model achieves strong Char-BLEU scores (96.37 for Urdu->Roman-Urdu and 97.44 for Roman-Urdu->Urdu), outperforming RNN baselines and GPT-4o Mini. Conclusion: Multilingual transfer learning is effective for low-resource transliteration tasks. Abstract: As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity, addressing the needs of low-resource languages remains a significant challenge. Transliteration between Urdu and its Romanized form, Roman Urdu, remains underexplored despite the widespread use of both scripts in South Asia. Prior work using RNNs on the Roman-Urdu-Parl dataset showed promising results but suffered from poor domain adaptability and limited evaluation. We propose a transformer-based approach using the m2m100 multilingual translation model, enhanced with masked language modeling (MLM) pretraining and fine-tuning on both Roman-Urdu-Parl and the domain-diverse Dakshina dataset. To address previous evaluation flaws, we introduce rigorous dataset splits and assess performance using BLEU, character-level BLEU, and CHRF. Our model achieves strong transliteration performance, with Char-BLEU scores of 96.37 for Urdu->Roman-Urdu and 97.44 for Roman-Urdu->Urdu. These results outperform both RNN baselines and GPT-4o Mini and demonstrate the effectiveness of multilingual transfer learning for low-resource transliteration tasks.

Orange Quality Grading with Deep Learning

Mohamed Lamine Mekhalfi,Paul Chippendale,Francisco Fraile,Marcos Rico

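Task: Implement a deep learning-based, multi-view method for grading oranges.

Motivation: Orange grading is crucial to the fruit industry, and automated grading improves speed and precision while reducing human labor.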

Details

Method: Captures multi-view images with machine vision, composes them into a single collage image, and grades the result with a convolutional neural network (CNN). Result: Experiments show that multi-view grading is superior to single-view grading. Conclusion: The multi-view deep learning approach is an effective solution for orange grading. Abstract: Orange grading is a crucial step in the fruit industry, as it helps to sort oranges according to different criteria such as size, quality, ripeness, and health condition, ensuring safety for human consumption and better price allocation and client satisfaction. Automated grading enables faster processing, precision, and reduced human labor. In this paper, we implement a deep learning-based solution for orange grading via machine vision. Unlike typical grading systems that analyze fruits from a single view, we capture multi-view images of each orange in order to enable a richer representation. Afterwards, we compose the acquired images into one collage. This enables the analysis of the whole orange skin. We train a convolutional neural network (CNN) on the composed images to grade the oranges into three classes, namely good, bad, and undefined. We also evaluate the performance with two different CNNs (ResNet-18 and SqueezeNet). We show experimentally that multi-view grading is superior to single-view grading.
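The collage step can be sketched as tiling equally sized views into a grid before the image is passed to a classifier. The grid layout, the zero padding for unused cells, and the name `make_collage` are assumptions for illustration; the abstract does not state how the views are arranged.

```python
from typing import List

Image = List[List[int]]  # a grayscale image as rows of pixel values


def make_collage(views: List[Image], cols: int) -> Image:
    """Tile equally sized views into a grid with `cols` tiles per row.

    Unused cells in the last grid row are filled with zero-valued pixels.
    """
    height, width = len(views[0]), len(views[0][0])
    rows = -(-len(views) // cols)  # ceiling division
    blank = [[0] * width for _ in range(height)]
    padded = list(views) + [blank] * (rows * cols - len(views))

    collage: Image = []
    for r in range(rows):
        tiles = padded[r * cols:(r + 1) * cols]
        for y in range(height):
            line: List[int] = []
            for tile in tiles:
                line.extend(tile[y])  # concatenate row y of each tile
            collage.append(line)
    return collage
```

With four views and `cols=2`, the result is a 2x2 grid, so a single forward pass of the CNN sees the whole skin of one orange.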

SWI: Speaking with Intent in Large Language Models

Yuwei Yin,EunJeong Hwang,Giuseppe Carenini

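Task: Propose and validate the concept of Speaking with Intent (SWI) in large language models (LLMs) to enhance their reasoning ability and generation quality.

Motivation: By emulating deliberate, purposeful human thought, SWI aims to give LLMs explicit intent and high-level planning, improving their analysis and communication.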

Details

Method: Runs extensive experiments on mathematical reasoning, question answering, and text summarization, comparing SWI with a baseline and other prompting methods. Result: SWI outperforms the baseline as well as Chain-of-Thought and Plan-and-Solve on mathematical reasoning benchmarks, and consistently improves over the baseline on question answering and text summarization. Conclusion: SWI offers a new, cognitively motivated route to enhancing LLM reasoning, and the intent it generates is coherent, effective, and interpretable. Abstract: Intent, typically clearly formulated and planned, functions as a cognitive framework for reasoning and problem-solving. This paper introduces the concept of Speaking with Intent (SWI) in large language models (LLMs), where the explicitly generated intent encapsulates the model's underlying intention and provides high-level planning to guide subsequent analysis and communication. By emulating deliberate and purposeful thoughts in the human mind, SWI is hypothesized to enhance the reasoning capabilities and generation quality of LLMs. Extensive experiments on mathematical reasoning benchmarks consistently demonstrate the superiority of Speaking with Intent over Baseline (i.e., generation without explicit intent). Moreover, SWI outperforms answer-trigger prompting methods Chain-of-Thought and Plan-and-Solve and maintains competitive performance with the strong method ARR (Analyzing, Retrieving, and Reasoning). Additionally, the effectiveness and generalizability of SWI are solidified on reasoning-intensive question answering (QA) and text summarization benchmarks, where SWI brings consistent improvement to the Baseline generation. In text summarization, SWI-generated summaries exhibit greater accuracy, conciseness, and factual correctness, with fewer hallucinations. Furthermore, human evaluations verify the coherence, effectiveness, and interpretability of the intent produced by SWI. This proof-of-concept study creates a novel avenue for enhancing LLMs' reasoning abilities with cognitive notions.

Vision-to-Music Generation: A Survey

Zhaokai Wang,Chenxi Bao,Le Zhuo,Jingrui Han,Yang Yue,Yihong Tang,Victor Shea-Jay Huang,Yue Liao

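Task: Systematically survey research progress in vision-to-music generation, covering input types, output types, methods, datasets, and evaluation metrics.

Motivation: Vision-to-music generation is a significant branch of multimodal artificial intelligence with broad application prospects, but research is still at a preliminary stage and lacks a comprehensive discussion.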

Details

Method: Analyzes technical characteristics and core challenges, summarizes existing methods, and reviews datasets and evaluation metrics in detail. Result: Provides a systematic survey of vision-to-music generation that summarizes current challenges and future directions. Conclusion: The authors hope the survey will inspire further innovation in vision-to-music generation and multimodal generation more broadly. Abstract: Vision-to-music Generation, including video-to-music and image-to-music tasks, is a significant branch of multimodal artificial intelligence demonstrating vast application prospects in fields such as film scoring, short video creation, and dance music synthesis. However, compared to the rapid development of modalities like text and images, research in vision-to-music is still in its preliminary stage due to its complex internal structure and the difficulty of modeling dynamic relationships with video. Existing surveys focus on general music generation without comprehensive discussion on vision-to-music. In this paper, we systematically review the research progress in the field of vision-to-music generation. We first analyze the technical characteristics and core challenges for three input types: general videos, human movement videos, and images, as well as two output types of symbolic music and audio music. We then summarize the existing methodologies on vision-to-music generation from the architecture perspective. A detailed review of common datasets and evaluation metrics is provided. Finally, we discuss current challenges and promising directions for future research. We hope our survey can inspire further innovation in vision-to-music generation and the broader field of multimodal generation in academic research and industrial applications. To follow the latest work and foster further innovation in this field, we are continuously maintaining a GitHub repository at https://github.com/wzk1015/Awesome-Vision-to-Music-Generation.

Evaluating book summaries from internal knowledge in Large Language Models: a cross-model and semantic consistency approach

Javier Coronado-Blázquez

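Task: Study the ability of large language models (LLMs) to generate comprehensive and accurate book summaries from internal knowledge alone.

Motivation: Examine whether LLMs can synthesize narratives that align with human interpretations without access to the original text.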

Details

Method: Uses a diverse set of books and multiple LLM architectures, evaluated with an LLM-as-a-judge paradigm that includes cross-model assessment, with alignment quantified by ROUGE and BERTScore. Result: Reveals nuanced differences among models in content representation and stylistic preference, and shows the strengths and limitations of relying on internal knowledge for summarization. Conclusion: The findings deepen the understanding of how LLMs internally encode factual information and of the dynamics of cross-model evaluation, with implications for building more robust natural language generation systems. Abstract: We study the ability of large language models (LLMs) to generate comprehensive and accurate book summaries solely from their internal knowledge, without recourse to the original text. Employing a diverse set of books and multiple LLM architectures, we examine whether these models can synthesize meaningful narratives that align with established human interpretations. Evaluation is performed with an LLM-as-a-judge paradigm: each AI-generated summary is compared against a high-quality, human-written summary via a cross-model assessment, where all participating LLMs evaluate not only their own outputs but also those produced by others. This methodology enables the identification of potential biases, such as the proclivity for models to favor their own summarization style over others. In addition, alignment between the human-crafted and LLM-generated summaries is quantified using ROUGE and BERTScore metrics, assessing the depth of grammatical and semantic correspondence. The results reveal nuanced variations in content representation and stylistic preferences among the models, highlighting both strengths and limitations inherent in relying on internal knowledge for summarization tasks. These findings contribute to a deeper understanding of LLM internal encodings of factual information and the dynamics of cross-model evaluation, with implications for the development of more robust natural language generative systems.
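Two of the measurements mentioned in this abstract can be sketched in a few lines. The first function is a simplified ROUGE-1 F1 based on whitespace tokens (standard ROUGE implementations add stemming and more careful tokenization). The second computes a self-preference gap from a cross-model score table; the table layout `scores[judge][author]`, the model names in the usage example, and the gap definition are assumptions chosen for illustration, not the paper's exact protocol.

```python
from collections import Counter
from typing import Dict


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def self_preference_gap(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """For each judge: score given to its own summary minus the mean score
    it gave to the other models' summaries.

    scores[judge][author] is the score `judge` assigned to `author`'s summary.
    A positive gap suggests the judge favors its own output. Requires at
    least two models, and every judge must also appear as an author.
    """
    gaps = {}
    for judge, row in scores.items():
        others = [s for author, s in row.items() if author != judge]
        gaps[judge] = row[judge] - sum(others) / len(others)
    return gaps
```

For example, with hypothetical judges `m1` and `m2`, a table where `m1` rates its own summary 9 and `m2`'s summary 6 gives `m1` a gap of 3.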

Learn by Reasoning: Analogical Weight Generation for Few-Shot Class-Incremental Learning

Jizhou Han,Chenhao Ding,Yuhang He,Songlin Dong,Qiang Wang,Xinyuan Gao,Yihong Gong

Task: Few-shot class-incremental Learning (FSCIL) enables models to learn new classes from limited data while retaining performance on previously learned classes.

Motivation: Traditional FSCIL methods suffer from a separation between learning new classes and utilizing old knowledge, and require fine-tuning parameters with limited new class data.

Details

Method: Proposed a novel analogical generative method inspired by human brain analogical learning mechanisms, including the Brain-Inspired Analogical Generator (BiAG) with three components: Weight Self-Attention Module (WSA), Weight & Prototype Analogical Attention Module (WPAA), and Semantic Conversion Module (SCM). Result: Experiments on miniImageNet, CUB-200, and CIFAR-100 datasets show higher final and average accuracy compared to SOTA methods. Conclusion: The proposed method effectively addresses the limitations of traditional FSCIL by generating new class weights without fine-tuning, leveraging analogical learning. Abstract: Few-shot class-incremental Learning (FSCIL) enables models to learn new classes from limited data while retaining performance on previously learned classes. Traditional FSCIL methods often require fine-tuning parameters with limited new class data and suffer from a separation between learning new classes and utilizing old knowledge. Inspired by the analogical learning mechanisms of the human brain, we propose a novel analogical generative method. Our approach includes the Brain-Inspired Analogical Generator (BiAG), which derives new class weights from existing classes without parameter fine-tuning during incremental stages. BiAG consists of three components: Weight Self-Attention Module (WSA), Weight & Prototype Analogical Attention Module (WPAA), and Semantic Conversion Module (SCM). SCM uses Neural Collapse theory for semantic conversion, WSA supplements new class weights, and WPAA computes analogies to generate new class weights. Experiments on miniImageNet, CUB-200, and CIFAR-100 datasets demonstrate that our method achieves higher final and average accuracy compared to SOTA methods.

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond

Xiaoye Qu,Yafu Li,Zhaochen Su,Weigao Sun,Jianhao Yan,Dongrui Liu,Ganqu Cui,Daizong Liu,Shuxian Liang,Junxian He,Peng Li,Wei Wei,Jing Shao,Chaochao Lu,Yue Zhang,Xian-Sheng Hua,Bowen Zhou,Yu Cheng

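Task: Survey recent research on improving reasoning efficiency in Large Reasoning Models (LRMs).

Motivation: Although LRMs gain performance by scaling up reasoning length, the excessively long reasoning traces they produce create efficiency problems for training, inference, and real-world deployment that need to be addressed.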

Details

Method: Analyzes patterns of inefficiency across the LRM lifecycle (from pretraining to inference), summarizes existing methods, and discusses future research directions. Result: Provides a comprehensive overview of reasoning efficiency in LRMs and maintains a continuously updated GitHub repository to track progress. Conclusion: The survey aims to serve as a foundation for follow-up research and to inspire innovation in this rapidly evolving area. Abstract: Recent Large Reasoning Models (LRMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference. However, a growing concern lies in their tendency to produce excessively long reasoning traces, which are often filled with redundant content (e.g., repeated definitions), over-analysis of simple problems, and superficial exploration of multiple reasoning paths for harder tasks. This inefficiency introduces significant challenges for training, inference, and real-world deployment (e.g., in agent-based systems), where token economy is critical. In this survey, we provide a comprehensive overview of recent efforts aimed at improving reasoning efficiency in LRMs, with a particular focus on the unique challenges that arise in this new paradigm. We identify common patterns of inefficiency, examine methods proposed across the LRM lifecycle, i.e., from pretraining to inference, and discuss promising future directions for research. To support ongoing development, we also maintain a real-time GitHub repository tracking recent progress in the field. We hope this survey serves as a foundation for further exploration and inspires innovation in this rapidly evolving area.

Reducing CT Metal Artifacts by Learning Latent Space Alignment with Gemstone Spectral Imaging Data

Wencheng Han,Dongqian Guo,Xiao Chen,Pang Lyu,Yi Jin,Jianbing Shen

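Task: Propose the Latent Gemstone Spectral Imaging (GSI) Alignment Framework for reducing metal artifacts in CT slices.

Motivation: Metal artifacts degrade CT image quality and hinder accurate diagnosis of tissues adjacent to metal implants.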

Details

Method: Adjusts the representation of ordinary CT images to match GSI CT sequences, suppressing metal artifacts while clearly revealing detailed structure. Result: Experiments show that the method significantly reduces metal artifacts and greatly improves the readability of CT slices. Conclusion: The proposed Alignment Framework effectively addresses metal artifacts without introducing extraneous noise information. Abstract: Metal artifacts in CT slices have long posed challenges in medical diagnostics. These artifacts degrade image quality, resulting in suboptimal visualization and complicating the accurate interpretation of tissues adjacent to metal implants. To address these issues, we introduce the Latent Gemstone Spectral Imaging (GSI) Alignment Framework, which effectively reduces metal artifacts while avoiding the introduction of noise information. Our work is based on a key finding that even artifact-affected ordinary CT sequences contain sufficient information to discern detailed structures. The challenge lies in the inability to clearly represent this information. To address this issue, we developed an Alignment Framework that adjusts the representation of ordinary CT images to match GSI CT sequences. GSI is an advanced imaging technique using multiple energy levels to mitigate artifacts caused by metal implants. By aligning the representation to GSI data, we can effectively suppress metal artifacts while clearly revealing detailed structure, without introducing extraneous information into CT sequences. To facilitate the application, we propose a new dataset, Artifacts-GSI, captured from real patients with metal implants, and establish a new benchmark based on this dataset. Experimental results show that our method significantly reduces metal artifacts and greatly enhances the readability of CT slices. All our code and data are available at: https://um-lab.github.io/GSI-MAR/

COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing

Rajvee Sheth,Himanshu Beniwal,Mayank Singh

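Task: Build and release the COMI-LINGUA dataset, which supports five fundamental NLP tasks: language identification, matrix language identification, part-of-speech tagging, named entity recognition, and translation.

Motivation: Address the shortcomings of existing datasets in capturing real-world code-mixing (such as Hindi-English), including reliance on synthetic data and limited scope.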

Details

Method: Creates and manually annotates the COMI-LINGUA dataset of 100,970 instances, evaluated by three expert annotators in both Devanagari and Roman scripts. Result: Evaluating large language models on COMI-LINGUA reveals limitations in current multilingual modeling strategies. Conclusion: COMI-LINGUA is an important, publicly available resource for improving the processing of code-mixed text. Abstract: The rapid growth of digital communication has driven the widespread use of code-mixing, particularly Hindi-English, in multilingual communities. Existing datasets often focus on romanized text, have limited scope, or rely on synthetic data, which fails to capture real-world language nuances. Human annotations are crucial for assessing the naturalness and acceptability of code-mixed text. To address these challenges, we introduce COMI-LINGUA, the largest manually annotated dataset for code-mixed text, comprising 100,970 instances evaluated by three expert annotators in both Devanagari and Roman scripts. The dataset supports five fundamental NLP tasks: Language Identification, Matrix Language Identification, Part-of-Speech Tagging, Named Entity Recognition, and Translation. We evaluate LLMs on these tasks using COMI-LINGUA, revealing limitations in current multilingual modeling strategies and emphasizing the need for improved code-mixed text processing capabilities. COMI-LINGUA is publicly available at: https://huggingface.co/datasets/LingoIITGN/COMI-LINGUA.

vGamba: Attentive State Space Bottleneck for efficient Long-range Dependencies in Visual Recognition

Yunusa Haruna,Adamu Lawan

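Task: Propose vGamba, a vision backbone that combines state-space models (SSMs) with attention mechanisms to capture long-range dependencies efficiently.

Motivation: Existing methods are limited in modeling long-range dependencies: CNNs have restricted receptive fields, ViTs are computationally expensive, and the use of SSMs in vision remains underexplored.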

Details

Method: Designs vGamba, which combines the Gamba Cell (a Mamba adaptation for 2D spatial structures), Multi-Head Self-Attention (MHSA), and a Gated Fusion Module to model long-range dependencies efficiently. Result: On classification, detection, and segmentation tasks, vGamba achieves a better trade-off between accuracy and computational efficiency than several existing models. Conclusion: By combining SSMs with attention, vGamba models long-range dependencies both efficiently and accurately. Abstract: Capturing long-range dependencies efficiently is essential for visual recognition tasks, yet existing methods face limitations. Convolutional neural networks (CNNs) struggle with restricted receptive fields, while Vision Transformers (ViTs) achieve global context and long-range modeling at a high computational cost. State-space models (SSMs) offer an alternative, but their application in vision remains underexplored. This work introduces vGamba, a hybrid vision backbone that integrates SSMs with attention mechanisms to enhance efficiency and expressiveness. At its core is the Gamba bottleneck block, which includes the Gamba Cell, an adaptation of Mamba for 2D spatial structures, alongside a Multi-Head Self-Attention (MHSA) mechanism and a Gated Fusion Module for effective feature representation. The interplay of these components ensures that vGamba leverages the low computational demands of SSMs while maintaining the accuracy of attention mechanisms for modeling long-range dependencies in vision tasks. Additionally, the Fusion module enables seamless interaction between these components. Extensive experiments on classification, detection, and segmentation tasks demonstrate that vGamba achieves a superior trade-off between accuracy and computational efficiency, outperforming several existing models.
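A gated fusion of two feature streams is commonly written as a per-element convex combination controlled by a sigmoid gate. The sketch below shows that generic form only; the abstract does not give vGamba's actual Gated Fusion Module equations, so the formula, the argument names, and the idea that the gate logits come from some learned projection are all assumptions.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    ssm_features: List[float],
    attn_features: List[float],
    gate_logits: List[float],
) -> List[float]:
    """Blend two feature vectors element-wise: g * ssm + (1 - g) * attn.

    g = sigmoid(gate_logit). In a trained network the logits would be
    produced by learned weights; here they are passed in directly.
    """
    fused = []
    for s, a, logit in zip(ssm_features, attn_features, gate_logits):
        g = sigmoid(logit)
        fused.append(g * s + (1.0 - g) * a)
    return fused
```

A logit of 0 weights both branches equally, while large positive or negative logits pass through almost only the SSM branch or the attention branch, respectively.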

How do language models learn facts? Dynamics, curricula and hallucinations

Nicolas Zucchet,Jörg Bornschein,Stephanie Chan,Andrew Lampinen,Razvan Pascanu,Soham De

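Task: Study the learning dynamics of language models on a synthetic factual recall task.

Motivation: Understand the dynamics that govern knowledge acquisition during language model pre-training.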

Details

Method: Analyze how language models learn on a synthetic factual recall task, focusing on the role of attention mechanisms and the training data distribution. Result: Learning proceeds in three phases; the data distribution shapes the learning dynamics; hallucinations emerge simultaneously with knowledge; and integrating new knowledge is difficult. Conclusion: The data distribution is critical to knowledge acquisition, and the results suggest new data scheduling strategies to accelerate training. Abstract: Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language models on a synthetic factual recall task, uncovering three key findings: First, language models learn in three phases, exhibiting a performance plateau before acquiring precise factual knowledge. Mechanistically, this plateau coincides with the formation of attention-based circuits that support recall. Second, the training data distribution significantly impacts learning dynamics, as imbalanced distributions lead to shorter plateaus. Finally, hallucinations emerge simultaneously with knowledge, and integrating new knowledge into the model through fine-tuning is challenging, as it quickly corrupts its existing parametric memories. Our results emphasize the importance of data distribution in knowledge acquisition and suggest novel data scheduling strategies to accelerate neural network training.

Ming Yan,Xincheng Lin,Yuhua Luo,Shuqi Fan,Yudi Dai,Qixin Zhong,Lincai Zhong,Yuexin Ma,Lan Xu,Chenglu Wen,Siqi Shen,Cheng Wang

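Task: Study how to recover the 3D motion of rock climbing.

Motivation: Existing human motion recovery research focuses mainly on ground-based motion; climbing motion capture is little studied, and large-scale annotated datasets are lacking.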

Details

Method: Collect and annotate AscendMotion, a large-scale climbing motion dataset, and propose ClimbingCap, a method that combines the RGB and LiDAR modalities. Result: Demonstrates the quality of the AscendMotion dataset and reports promising results from ClimbingCap. Conclusion: AscendMotion and ClimbingCap fill a gap in climbing motion recovery research, and the dataset and source code are publicly released. Abstract: Human Motion Recovery (HMR) research mainly focuses on ground-based motions such as running. The study on capturing climbing motion, an off-ground motion, is sparse. This is partly due to the limited availability of climbing motion datasets, especially large-scale and challenging 3D labeled datasets. To address the insufficiency of climbing motion datasets, we collect AscendMotion, a large-scale, well-annotated, and challenging climbing motion dataset. It consists of 412k RGB, LiDAR frames, and IMU measurements, including the challenging climbing motions of 22 skilled climbing coaches across 12 different rock walls. Capturing the climbing motions is challenging as it requires precise recovery of not only the complex pose but also the global position of climbers. Although multiple global HMR methods have been proposed, they cannot faithfully capture climbing motions. To address the limitations of HMR methods for climbing, we propose ClimbingCap, a motion recovery method that reconstructs continuous 3D human climbing motion in a global coordinate system. One key insight is to use the RGB and LiDAR modalities to separately reconstruct motions in camera coordinates and global coordinates and to optimize them jointly. We demonstrate the quality of the AscendMotion dataset and present promising results from ClimbingCap. The AscendMotion dataset and source code are publicly released at http://www.lidarhumanmotion.net/climbingcap/.

JiraiBench: A Bilingual Benchmark for Evaluating Large Language Models' Detection of Human Self-Destructive Behavior Content in Jirai Community

Yunze Xiao,Tingyu He,Lionel Z. Wang,Yiming Ma,Xingyu Song,Xiaohang Xu,Irene Li,Ka Chung Ng

Task: Introducing JiraiBench, a bilingual benchmark for evaluating large language models' effectiveness in detecting self-destructive content in Chinese and Japanese social media communities.

Motivation: Addressing the transnational 'Jirai' online subculture and its associated self-destructive behaviors, the study aims to incorporate linguistic and cultural dimensions into content moderation.

Details

Method: A comprehensive evaluation framework with a dataset of 10,419 Chinese and 5,000 Japanese posts, annotated across three behavioral categories, and tested on four state-of-the-art models. Result: Japanese prompts outperformed Chinese prompts when processing Chinese content, indicating cultural proximity can outweigh linguistic similarity. Cross-lingual transfer experiments showed potential for knowledge transfer without explicit target language training. Conclusion: The findings emphasize the need for culturally-informed approaches in multilingual content moderation and highlight the importance of cultural context in developing effective detection systems for vulnerable online communities. Abstract: This paper introduces JiraiBench, the first bilingual benchmark for evaluating large language models' effectiveness in detecting self-destructive content across Chinese and Japanese social media communities. Focusing on the transnational "Jirai" (landmine) online subculture that encompasses multiple forms of self-destructive behaviors including drug overdose, eating disorders, and self-harm, we present a comprehensive evaluation framework incorporating both linguistic and cultural dimensions. Our dataset comprises 10,419 Chinese posts and 5,000 Japanese posts with multidimensional annotation along three behavioral categories, achieving substantial inter-annotator agreement. Experimental evaluations across four state-of-the-art models reveal significant performance variations based on instructional language, with Japanese prompts unexpectedly outperforming Chinese prompts when processing Chinese content. This emergent cross-cultural transfer suggests that cultural proximity can sometimes outweigh linguistic similarity in detection tasks. Cross-lingual transfer experiments with fine-tuned models further demonstrate the potential for knowledge transfer between these language systems without explicit target language training. 
These findings highlight the need for culturally-informed approaches to multilingual content moderation and provide empirical evidence for the importance of cultural context in developing more effective detection systems for vulnerable online communities.

Delving Deep into Semantic Relation Distillation

Zhaoyi Yan,Kangjun Liu,Qixiang Ye

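Task: Propose Semantics-based Relation Knowledge Distillation (SeRKD), a new distillation method built on semantic relations, to address the shortcomings of traditional instance-level knowledge distillation.

Motivation: Traditional knowledge distillation transfers knowledge only at the instance level and fails to capture semantic relationships within the data, which limits how effectively knowledge is transferred.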

Details

Method: Combine superpixel-based semantic extraction with relation-based knowledge distillation into SeRKD, which is particularly suited to Vision Transformers (ViTs). Result: Experiments on benchmark datasets show that SeRKD outperforms existing methods in model performance and generalization. Conclusion: SeRKD reframes knowledge distillation through the lens of semantic relations, offering a more effective approach to model compression and knowledge transfer. Abstract: Knowledge distillation has become a cornerstone technique in deep learning, facilitating the transfer of knowledge from complex models to lightweight counterparts. Traditional distillation approaches focus on transferring knowledge at the instance level, but fail to capture nuanced semantic relationships within the data. In response, this paper introduces a novel methodology, Semantics-based Relation Knowledge Distillation (SeRKD), which reimagines knowledge distillation through a semantics-relation lens among each sample. By leveraging semantic components, i.e., superpixels, SeRKD enables a more comprehensive and context-aware transfer of knowledge, which skillfully integrates superpixel-based semantic extraction with relation-based knowledge distillation for sophisticated model compression and distillation. Particularly, the proposed method is naturally relevant in the domain of Vision Transformers (ViTs), where visual tokens serve as fundamental units of representation. Experimental evaluations on benchmark datasets demonstrate the superiority of SeRKD over existing methods, underscoring its efficacy in enhancing model performance and generalization capabilities.
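
To make "relation-based" distillation concrete, the sketch below compares pairwise cosine-similarity matrices of teacher and student features instead of the features themselves. It is a generic relational distillation loss, not SeRKD's actual objective, which the abstract does not specify. Treating each feature vector as one superpixel-level token is an assumption, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def relation_matrix(feats):
    """Pairwise cosine similarities between all feature vectors."""
    return [[cosine(u, v) for v in feats] for u in feats]


def relation_distillation_loss(teacher_feats, student_feats):
    """Mean squared difference between teacher and student relation matrices.

    The two models may use different feature widths; only the number of
    tokens (assumed here to be superpixel-level features) must match.
    """
    if len(teacher_feats) != len(student_feats):
        raise ValueError("teacher and student need the same number of tokens")
    rt = relation_matrix(teacher_feats)
    rs = relation_matrix(student_feats)
    n = len(rt)
    total = sum((rt[i][j] - rs[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)
```

Because only the relation structure is compared, a student whose features are a rescaled copy of the teacher's incurs (numerically) zero loss.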

Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Wenqi Zhang,Mengna Wang,Gangao Liu,Xu Huixin,Yiwei Jiang,Yongliang Shen,Guiyang Hou,Zhe Zheng,Hang Zhang,Xin Li,Weiming Lu,Peng Li,Yueting Zhuang

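Task: Extend deep thinking models to embodied search tasks, addressing their limited reasoning ability in environments that require continuous interaction.

Motivation: Existing deep thinking models excel at mathematical and coding tasks, but their effectiveness in embodied domains that require continuous interaction with the environment remains largely unexplored.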

Details

Method: Propose Embodied Reasoner, trained on 9.3k synthesized coherent Observation-Thought-Action trajectories with a three-stage pipeline (imitation learning, self-exploration, and self-correction). Result: The model significantly outperforms other advanced visual reasoning models on embodied search tasks, with gains of 9% to 24%, and does better on complex long-horizon tasks. Conclusion: Embodied Reasoner shows stronger reasoning and fewer logical inconsistencies in embodied domains and is suited to complex interactive environments. Abstract: Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains which require continuous interaction with environments through image-action interleaved trajectories remains largely unexplored. We present Embodied Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning that relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. The evaluation shows that our model significantly outperforms advanced visual reasoning models, e.g., it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9%, +24%, and +13%. Analysis reveals our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. In real-world environments, our model also shows superior performance, with fewer repeated searches and logical inconsistencies.

Zero-Shot Visual Concept Blending Without Text Guidance

Hiroya Makino,Takahiro Yamaguchi,Hiroyuki Sakai

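Task: Propose Visual Concept Blending, a zero-shot image generation technique that provides fine-grained control over features from multiple reference images.

Motivation: With a single reference image it is hard to isolate which elements should be transferred, whereas multiple reference images make it possible to distinguish common from unique features.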

Details

Method: Operate in a partially disentangled CLIP embedding space (from IP-Adapter) to flexibly transfer texture, shape, motion, style, and more abstract concepts without additional training or text prompts. Result: The technique performs well on style transfer, form metamorphosis, and conceptual transformation; in a user study, participants accurately recognized which features were intended to be transferred. Conclusion: Its simplicity, flexibility, and high-level control make the technique valuable for creative fields such as art, design, and content creation. Abstract: We propose a novel, zero-shot image generation technique called "Visual Concept Blending" that provides fine-grained control over which features from multiple reference images are transferred to a source image. If only a single reference image is available, it is difficult to isolate which specific elements should be transferred. However, using multiple reference images, the proposed approach distinguishes between common and unique features by selectively incorporating them into a generated output. By operating within a partially disentangled Contrastive Language-Image Pre-training (CLIP) embedding space (from IP-Adapter), our method enables the flexible transfer of texture, shape, motion, style, and more abstract conceptual transformations without requiring additional training or text prompts. We demonstrate its effectiveness across a diverse range of tasks, including style transfer, form metamorphosis, and conceptual transformations, showing how subtle or abstract attributes (e.g., brushstroke style, aerodynamic lines, and dynamism) can be seamlessly combined into a new image. In a user study, participants accurately recognized which features were intended to be transferred. Its simplicity, flexibility, and high-level control make Visual Concept Blending valuable for creative fields such as art, design, and content creation, where combining specific visual qualities from multiple inspirations is crucial.

As easy as PIE: understanding when pruning causes language models to disagree

Pietro Tropeano,Maria Maistro,Tuukka Ruotsalo,Christina Lioma

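Task: Study how pruning language models affects a particular subset of data points (PIEs) and how this plays out in NLP.

Motivation: Pruning work usually focuses on efficiency gains and overlooks the negative effect on certain data points; PIEs in particular have not been studied in NLP.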

Details

Method: Analyze the impact of PIEs on inference quality across various NLP datasets, pruning methods, and levels of compression. Result: PIEs considerably affect inference quality, BERT is more prone to this than BiLSTM, and PIEs contain many of the data points that matter most for generalization. Conclusion: Pruning can severely hurt the data points that matter most, which offers a new perspective on language model pruning. Abstract: Language Model (LM) pruning compresses the model by removing weights, nodes, or other parts of its architecture. Typically, pruning focuses on the resulting efficiency gains at the cost of effectiveness. However, when looking at how individual data points are affected by pruning, it turns out that a particular subset of data points always bears most of the brunt (in terms of reduced accuracy) when pruning, but this effect goes unnoticed when reporting the mean accuracy of all data points. These data points are called PIEs and have been studied in image processing, but not in NLP. In a study of various NLP datasets, pruning methods, and levels of compression, we find that PIEs impact inference quality considerably, regardless of class frequency, and that BERT is more prone to this than BiLSTM. We also find that PIEs contain a large number of the data points that have the largest influence on how well the model generalises to unseen data. This means that when pruning, with seemingly moderate loss to accuracy across all data points, we in fact hurt tremendously those data points that matter the most. We trace what makes PIEs both hard and impactful to inference to their overall longer and more semantically complex text. These findings are novel and contribute to understanding how LMs are affected by pruning. The code is available at: https://github.com/pietrotrope/AsEasyAsPIE
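
The abstract does not define PIEs. A minimal sketch follows, assuming the definition used in the image-classification literature: an example is a PIE when the modal (most frequent) prediction over a population of pruned models differs from the modal prediction over a population of unpruned models. The function names and the list-of-runs input layout are hypothetical.

```python
from collections import Counter


def modal_label(predictions):
    """Most common label among one example's predictions."""
    return Counter(predictions).most_common(1)[0][0]


def find_pies(dense_preds, pruned_preds):
    """Return indices of examples whose modal prediction changes under pruning.

    dense_preds[m][i] is the label that unpruned model m assigns to example i;
    pruned_preds has the same layout for the pruned models.
    """
    n_examples = len(dense_preds[0])
    pies = []
    for i in range(n_examples):
        dense_mode = modal_label([run[i] for run in dense_preds])
        pruned_mode = modal_label([run[i] for run in pruned_preds])
        if dense_mode != pruned_mode:
            pies.append(i)
    return pies
```

Note that this criterion looks at disagreement between the two populations, not at the gold label, so it can flag examples even when mean accuracy barely moves.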

Multi-Scale Invertible Neural Network for Wide-Range Variable-Rate Learned Image Compression

Hanyue Tu,Siqi Wu,Li Li,Wengang Zhou,Houqiang Li

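Task: Propose a variable-rate image compression model based on an invertible transform.

Motivation: Autoencoder structures incur information loss in image compression, which limits their performance at high bit rates and their flexibility of rate adaptation.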

Details

Method: Design a lightweight multi-scale invertible neural network that bijectively maps the input image into multi-scale latent representations, with a multi-scale spatial-channel context model to improve compression efficiency. Result: Experiments show state-of-the-art performance among variable-rate methods; it is the first learned method to outperform VVC across a very wide range of bit rates with a single model, especially at high bit rates. Conclusion: A single model achieves high-performance compression over a wide range of bit rates. Abstract: Autoencoder-based structures have dominated recent learned image compression methods. However, the inherent information loss associated with autoencoders limits their rate-distortion performance at high bit rates and restricts their flexibility of rate adaptation. In this paper, we present a variable-rate image compression model based on an invertible transform to overcome these limitations. Specifically, we design a lightweight multi-scale invertible neural network, which bijectively maps the input image into multi-scale latent representations. To improve the compression efficiency, a multi-scale spatial-channel context model with extended gain units is devised to estimate the entropy of the latent representation from high to low levels. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods, and remains competitive with recent multi-model approaches. Notably, our method is the first learned image compression solution that outperforms VVC across a very wide range of bit rates using a single model, especially at high bit rates. The source code is available at https://github.com/hytu99/MSINN-VRLIC.
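
Why an invertible transform avoids the information loss of an autoencoder can be seen in the simplest invertible building block, an additive coupling layer. This is a generic textbook construction shown as an assumption-laden sketch, not the paper's network; `shift` stands in for an arbitrary learned sub-network and does not itself need to be invertible.

```python
def shift(half):
    """Arbitrary, non-invertible function standing in for a learned sub-network."""
    return [v * v + 1 for v in half]


def coupling_forward(x):
    """y1 = x1, y2 = x2 + shift(x1) for an even-length input split in halves."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    mid = len(x) // 2
    x1, x2 = x[:mid], x[mid:]
    return x1 + [b + s for b, s in zip(x2, shift(x1))]


def coupling_inverse(y):
    """Exact inverse: x1 = y1, x2 = y2 - shift(y1)."""
    if len(y) % 2 != 0:
        raise ValueError("input length must be even")
    mid = len(y) // 2
    y1, y2 = y[:mid], y[mid:]
    return y1 + [b - s for b, s in zip(y2, shift(y1))]
```

Since the first half passes through unchanged, the inverse can recompute the same shift and subtract it, so the mapping is bijective regardless of what `shift` computes.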

CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?

Jiefu Ou,William Gantt Walden,Kate Sanders,Zhengping Jiang,Kaiser Sun,Jeffrey Cheng,William Jurayj,Miriam Wanner,Shaobo Liang,Candice Morgan,Seunghoon Han,Weiqi Wang,Chandler May,Hannah Recknor,Daniel Khashabi,Benjamin Van Durme

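Task: Introduce CLAIMCHECK, a dataset for evaluating LLMs on scientific paper reviewing.

Motivation: Address the challenge of ensuring that automatically generated reviews are grounded in the claims a paper makes.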

Details

Method: Build CLAIMCHECK from NeurIPS 2023 and 2024 papers and reviews, annotating the weaknesses raised in reviews with the paper claims they dispute and with labels for validity, objectivity, and type. Result: Experiments show that cutting-edge LLMs can predict weakness labels but still underperform human experts on all other tasks. Conclusion: CLAIMCHECK provides a benchmark for evaluating LLMs in scientific reviewing, and LLMs still need to improve to match human experts. Abstract: A core part of scientific peer review involves providing expert critiques that directly assess the scientific claims a paper makes. While it is now possible to automatically generate plausible (if generic) reviews, ensuring that these reviews are sound and grounded in the papers' claims remains challenging. To facilitate LLM benchmarking on these challenges, we introduce CLAIMCHECK, an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. CLAIMCHECK is richly annotated by ML experts for weakness statements in the reviews and the paper claims that they dispute, as well as fine-grained labels of the validity, objectivity, and type of the identified weaknesses. We benchmark several LLMs on three claim-centric tasks supported by CLAIMCHECK, requiring models to (1) associate weaknesses with the claims they dispute, (2) predict fine-grained labels for weaknesses and rewrite the weaknesses to enhance their specificity, and (3) verify a paper's claims with grounded reasoning. Our experiments reveal that cutting-edge LLMs, while capable of predicting weakness labels in (2), continue to underperform relative to human experts on all other tasks.

InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression

Dongchen Lu,Yuyao Sun,Zilu Zhang,Leping Huang,Jianliang Zeng,Mao Shu,Huo Cao

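Task: Propose InternVL-X, a multimodal large language model that improves performance and efficiency through three visual token compression methods.

Motivation: Existing MLLMs process visual tokens as a text-like sequence, which greatly increases computational resource and time requirements.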

Details

Method: Combine three visual token compression methods (PVTC, LVTC, and RVTC) to improve visual feature conversion and computational efficiency. Result: InternVL-X achieves state-of-the-art performance on 7 public MLLM benchmarks and improves the average metric by 2.34% across 12 tasks. Conclusion: Efficient visual token compression lets InternVL-X improve both performance and efficiency. Abstract: Most multimodal large language models (MLLMs) treat visual tokens as "a sequence of text", integrating them with text tokens into a large language model (LLM). However, a great quantity of visual tokens significantly increases the demand for computational resources and time. In this paper, we propose InternVL-X, which outperforms the InternVL model in both performance and efficiency by incorporating three visual token compression methods. First, we propose a novel vision-language projector, PVTC. This component integrates adjacent visual embeddings to form a local query and utilizes the transformed CLS token as a global query, then performs point-to-region cross-attention through these local and global queries to more effectively convert visual features. Second, we present a layer-wise visual token compression module, LVTC, which compresses tokens in the LLM shallow layers and then expands them through upsampling and residual connections in the deeper layers. This significantly enhances the model's computational efficiency. Furthermore, we propose an efficient high-resolution slicing method, RVTC, which dynamically adjusts the number of visual tokens based on image area or length filtering. RVTC greatly enhances training efficiency with only a slight reduction in performance. By utilizing 20% or fewer visual tokens, InternVL-X achieves state-of-the-art performance on 7 public MLLM benchmarks, and improves the average metric by 2.34% across 12 tasks.

Outlier dimensions favor frequent tokens in language models

Iuri Macocco,Nora Graichen,Gemma Boleda,Marco Baroni

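Task: Study last-layer outlier dimensions in modern language models and their function.

Motivation: Explore how widespread outlier dimensions are across language models and how they relate to the heuristic of predicting frequent words.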

Details

Method: Analyze the function of outlier dimensions, study how a model counterbalances them by assigning weight mass to the remaining dimensions, and investigate which parameters boost them and when they arise during training. Result: Outlier dimensions turn out to be a specialized mechanism that many distinct models discover to implement a useful token prediction heuristic. Conclusion: Outlier dimensions are an effective mechanism by which language models implement the frequent-token prediction heuristic. Abstract: We study last-layer outlier dimensions, i.e., dimensions that display extreme activations for the majority of inputs. We show that outlier dimensions arise in many different modern language models, and trace their function back to the heuristic of constantly predicting frequent words. We further show how a model can block this heuristic when it is not contextually appropriate, by assigning a counterbalancing weight mass to the remaining dimensions, and we investigate which model parameters boost outlier dimensions and when they arise during training. We conclude that outlier dimensions are a specialized mechanism discovered by many distinct models to implement a useful token prediction heuristic.
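
The definition above ("extreme activations for the majority of inputs") can be turned into a simple detector. The sketch below is one possible reading, not the paper's criterion: an activation counts as extreme when it lies more than `k` standard deviations from the mean of its own vector, and a dimension is an outlier when that happens for more than a `majority` fraction of inputs. Both thresholds and the function name are assumptions.

```python
import statistics


def outlier_dimensions(activations, k=3.0, majority=0.5):
    """Return dimensions whose activation is extreme for most inputs.

    activations[i][d] is the last-layer activation of dimension d for input i.
    k and majority are illustrative thresholds, not values from the paper.
    """
    n_inputs = len(activations)
    n_dims = len(activations[0])
    extreme_counts = [0] * n_dims
    for vec in activations:
        mu = statistics.mean(vec)
        sigma = statistics.pstdev(vec)
        if sigma == 0:
            continue  # a constant vector has no extreme entries
        for d, a in enumerate(vec):
            if abs(a - mu) > k * sigma:
                extreme_counts[d] += 1
    return [d for d, c in enumerate(extreme_counts) if c / n_inputs > majority]
```

A single entry can only exceed three standard deviations of its own vector when the vector has more than ten dimensions, so the default `k` is meaningful only for realistically wide hidden states.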

FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval

Zixu Li,Zhiheng Fu,Yupeng Hu,Zhiwei Chen,Haokun Wen,Liqiang Nie

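Task: Develop a fine-grained composed image retrieval (CIR) framework and data annotation pipeline to address the problems caused by coarse-grained modification text (CoarseMT) in existing CIR datasets.

Motivation: Existing CIR datasets use coarse-grained modification text that cannot accurately capture fine-grained retrieval intents, leading to imprecise retrieval results and greater ambiguity.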

Details

Method: Develop a fine-grained CIR data annotation pipeline, use it to refine FashionIQ and CIRR into two fine-grained CIR datasets (Fine-FashionIQ and Fine-CIRR), and propose FineCIR, a framework designed to explicitly parse the modification text. Result: FineCIR outperforms state-of-the-art CIR baselines on both fine-grained and traditional CIR benchmark datasets. Conclusion: The FineCIR framework and the fine-grained datasets markedly improve the precision of composed image retrieval and address the limitations of coarse-grained approaches. Abstract: Composed Image Retrieval (CIR) facilitates image retrieval through a multimodal query consisting of a reference image and modification text. The reference image defines the retrieval context, while the modification text specifies desired alterations. However, existing CIR datasets predominantly employ coarse-grained modification text (CoarseMT), which inadequately captures fine-grained retrieval intents. This limitation introduces two key challenges: (1) ignoring detailed differences leads to imprecise positive samples, and (2) greater ambiguity arises when retrieving visually similar images. These issues degrade retrieval accuracy, necessitating manual result filtering or repeated queries. To address these limitations, we develop a robust fine-grained CIR data annotation pipeline that minimizes imprecise positive samples and enhances CIR systems' ability to discern modification intents accurately. Using this pipeline, we refine the FashionIQ and CIRR datasets to create two fine-grained CIR datasets: Fine-FashionIQ and Fine-CIRR. Furthermore, we introduce FineCIR, the first CIR framework explicitly designed to parse the modification text. FineCIR effectively captures fine-grained modification semantics and aligns them with ambiguous visual entities, enhancing retrieval precision. Extensive experiments demonstrate that FineCIR consistently outperforms state-of-the-art CIR baselines on both fine-grained and traditional CIR benchmark datasets. Our FineCIR code and fine-grained CIR datasets are available at https://github.com/SDU-L/FineCIR.git.

Collab: Controlled Decoding using Mixture of Agents for LLM Alignment

Souradip Chakraborty,Sujay Bhatt,Udari Madhushani Sehwag,Soumya Suvra Ghosal,Jiahao Qiu,Mengdi Wang,Dinesh Manocha,Furong Huang,Alec Koppel,Sumitra Ganesh

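Task: Propose a decoding method based on multi-agent collaboration to improve the alignment of large language models at inference time.

Motivation: Reinforcement learning from human feedback (RLHF) is computationally expensive, while single-agent decoding methods struggle to adapt to the complexity of diverse tasks.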

Details

Method: Achieve token-level collaborative decoding by dynamically selecting among the policies of multiple pre-aligned models. Result: Collab significantly outperforms single-agent baselines, with improvements of up to 1.56x in average reward and 71.89% in GPT-4 based win-tie rate. Conclusion: Multi-agent collaborative decoding is an efficient and flexible alignment method that suits diverse tasks. Abstract: Alignment of large language models (LLMs) is crucial for safe and trustworthy deployment in applications. Reinforcement learning from human feedback (RLHF) has emerged as an effective technique to align LLMs to human preferences and broader utilities, but it requires updating billions of model parameters, which is computationally expensive. Controlled Decoding, by contrast, provides a mechanism for aligning a model at inference time without retraining. However, single-agent decoding approaches often struggle to adapt to diverse tasks due to the complexity and variability inherent in these tasks. To strengthen the test-time performance w.r.t. the target task, we propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies. Treating each prior policy as an agent in the spirit of mixture of agent collaboration, we develop a decoding method that allows for inference-time alignment through a token-level selection strategy among multiple agents. For each token, the most suitable LLM is dynamically chosen from a pool of models based on a long-term utility metric. This policy-switching mechanism ensures optimal model selection at each step, enabling efficient collaboration and alignment among LLMs during decoding. Theoretical analysis of our proposed algorithm establishes optimal performance with respect to the target task represented via a target reward for the given off-the-shelf models. We conduct comprehensive empirical evaluations with open-source aligned models on diverse tasks and preferences, which demonstrates the merits of this approach over single-agent decoding baselines. Notably, Collab surpasses the current SoTA decoding strategy, achieving an improvement of up to 1.56x in average reward and 71.89% in GPT-4 based win-tie rate.
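
The token-level selection described above can be sketched as a loop in which every agent proposes a next token and a utility function picks among the proposals. This is a toy illustration under stated assumptions: the `Agent` and `Utility` signatures, the `collab_decode` name, and the greedy one-proposal-per-agent scheme are hypothetical, and the paper's long-term utility metric is replaced by whatever callable the caller supplies.

```python
from typing import Callable, List, Sequence

Agent = Callable[[List[str]], str]            # prefix -> proposed next token
Utility = Callable[[List[str], str], float]   # (prefix, token) -> utility estimate


def collab_decode(
    agents: Sequence[Agent],
    utility: Utility,
    prompt: Sequence[str],
    max_new_tokens: int,
    eos: str = "<eos>",
) -> List[str]:
    """At every step, keep the proposal with the highest estimated utility."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        proposals = [agent(tokens) for agent in agents]
        # max() returns the first maximal proposal, so earlier agents win ties.
        best = max(proposals, key=lambda tok: utility(tokens, tok))
        tokens.append(best)
        if best == eos:
            break
    return tokens
```

The chosen agent can change from one token to the next, which is what distinguishes this from picking a single best model per prompt.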

HORT: Monocular Hand-held Objects Reconstruction with Transformers

Zerui Chen,Rolandos Alexandros Potamias,Shizhe Chen,Cordelia Schmid

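Task: Efficiently reconstruct 3D point clouds of hand-held objects from monocular images.

Motivation: Existing methods either rely on implicit 3D representations, which yield overly smooth reconstructions and are slow, or reconstruct point clouds directly with diffusion models, whose multi-step denoising is inefficient.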

Details

Method: Propose a transformer-based model that follows a coarse-to-fine strategy and combines image features with 3D hand geometry to jointly predict the object point cloud and its pose. Result: Achieves state-of-the-art accuracy on synthetic and real datasets with faster inference, and generalizes well. Conclusion: The method is efficient, accurate, and applicable to real-world scenes. Abstract: Reconstructing hand-held objects in 3D from monocular images remains a significant challenge in computer vision. Most existing approaches rely on implicit 3D representations, which produce overly smooth reconstructions and are time-consuming to generate explicit 3D shapes. While more recent methods directly reconstruct point clouds with diffusion models, the multi-step denoising makes high-resolution reconstruction inefficient. To address these limitations, we propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method follows a coarse-to-fine strategy, first generating a sparse point cloud from the image and progressively refining it into a dense representation using pixel-aligned image features. To enhance reconstruction accuracy, we integrate image features with 3D hand geometry to jointly predict the object point cloud and its pose relative to the hand. Our model is trained end-to-end for optimal performance. Experimental results on both synthetic and real datasets demonstrate that our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.

ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation

Zhicheng Lee,Shulin Cao,Jinxin Liu,Jiajie Zhang,Weichuan Liu,Xiaoyin Che,Lei Hou,Juanzi Li

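Task: Propose ReaRAG, a model that improves the factual accuracy of large reasoning models (LRMs) while avoiding overthinking.

Motivation: Existing RL-based LRMs with retrieval capabilities suffer from overthinking and insufficiently robust reasoning, which hurts accuracy on question answering tasks.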

Details

Method: Use an LRM to generate deliberate reasoning steps, select an action from a predefined action space (Search and Finish), and feed the results returned by the RAG engine back to guide subsequent reasoning. Result: ReaRAG outperforms existing baselines on multi-hop question answering and shows the ability to recognize errors and refine its reasoning trajectory. Conclusion: ReaRAG markedly improves the factual accuracy of LRMs while combining robust reasoning with retrieval-augmented generation. Abstract: Large Reasoning Models (LRMs) exhibit remarkable reasoning abilities but rely primarily on parametric knowledge, limiting factual accuracy. While recent works equip reinforcement learning (RL)-based LRMs with retrieval capabilities, they suffer from overthinking and lack robustness in reasoning, reducing their effectiveness in question answering (QA) tasks. To address this, we propose ReaRAG, a factuality-enhanced reasoning model that explores diverse queries without excessive iterations. Our solution includes a novel data construction framework with an upper bound on the reasoning chain length. Specifically, we first leverage an LRM to generate deliberate thinking, then select an action from a predefined action space (Search and Finish). For the Search action, a query is executed against the RAG engine, where the result is returned as observation to guide reasoning steps later. This process iterates until a Finish action is chosen. Benefiting from ReaRAG's strong reasoning capabilities, our approach outperforms existing baselines on multi-hop QA. Further analysis highlights its strong reflective ability to recognize errors and refine its reasoning trajectory. Our study enhances LRMs' factuality while effectively integrating robust reasoning for Retrieval-Augmented Generation (RAG).
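
The Search/Finish loop with a bounded chain length can be sketched as follows. This is a schematic control loop only: `policy` stands in for the reasoning model and `search` for the RAG engine, both hypothetical callables supplied by the caller, and the tuple layout of the history is an assumption.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str, str]  # (thought, query, observation)
# policy(question, history) -> (thought, action, argument), action in {"search", "finish"}
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]


def iterative_retrieval(
    question: str,
    policy: Policy,
    search: Callable[[str], str],
    max_steps: int = 4,
) -> Tuple[Optional[str], List[Step]]:
    """Iterate thought -> action -> observation until Finish or the step bound."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, argument = policy(question, history)
        if action == "finish":
            return argument, history          # argument carries the final answer
        if action != "search":
            raise ValueError(f"unknown action: {action!r}")
        observation = search(argument)        # argument carries the search query
        history.append((thought, argument, observation))
    return None, history                      # bound reached without a Finish action
```

The `max_steps` bound plays the role of the upper limit on reasoning chain length: a policy that never chooses Finish is cut off instead of looping indefinitely.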

DuckSegmentation: A segmentation model based on the AnYue Hemp Duck Dataset

Ling Feng,Tianyu Xie,Wei Ma,Ruijie Fu,Yingxiao Zhang,Jun Li,Bei Zhou

Task: 提出一种基于真实养殖场的鸭子识别与分割方法，以解决智能农业中目标检测和分割模型的实用性问题。

Motivation: 尽管现有大模型在目标识别和分割任务中精度高，但由于其可解释性差和计算量大，难以实际应用于农业领域。

Details

Method: Build the AnYue Shelduck dataset, perform object detection and segmentation with YOLOv8 and the DuckSegmentation model, and optimize the model through knowledge distillation. Result: The YOLOv8 module reaches 98.10% Precision, 96.53% Recall, and an F1 score of 0.95; DuckSegmentation reaches 96.43% mIoU; the student model (Deeplabv3 r50) reaches 94.49% mIoU. Conclusion: The method offers a new practical solution for duck farming in smart agriculture. Abstract: The modernization of smart farming is a way to improve agricultural production efficiency and the agricultural production environment. Although many large models have achieved high accuracy in object recognition and segmentation, they cannot really be put into use in the farming industry because of their poor interpretability and computational cost. In this paper, we built the AnYue Shelduck Dataset, which contains a total of 1951 Shelduck data samples, and performed target detection and segmentation annotation with the help of professional annotators. Based on the AnYue Shelduck Dataset, this paper describes DuckProcessing, an efficient and powerful module for duck identification based on real shelduck farms. First, using the YOLOv8 module to detect the ducks, Precision reached 98.10%, Recall reached 96.53%, and the F1 score reached 0.95 on the test set. Next, the DuckSegmentation segmentation model reached 96.43% mIoU. Finally, DuckSegmentation was used as the teacher model and Deeplabv3 r50 as the student model; through knowledge distillation, the final student model achieved 94.49% mIoU on the test set. The method provides a new way of thinking for practical hemp duck smart farming.

Effective Skill Unlearning through Intervention and Abstention

Yongce Li,Chung-En Sun,Tsui-Wei Weng

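Task: Study methods for unlearning specific skills in large language models (LLMs).

Motivation: Understand the mechanisms behind LLM capabilities and gain control over them in order to develop better models.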

Details

Method: Propose two lightweight, training-free skill unlearning techniques: Neuron Adjust and Key Space Detection. Result: The methods perform well on unlearning math-solving, Python-coding, and comprehension skills; Key Space Detection achieves over 80% relative performance drop on the target skill and less than 10% relative drop on other skills and general knowledge (MMLU) for most unlearning tasks. Conclusion: The proposed methods effectively unlearn specific skills in LLMs while retaining the model's overall capabilities. Abstract: Large language models (LLMs) have demonstrated remarkable skills across various domains. Understanding the mechanisms behind their abilities and implementing controls over them is becoming increasingly important for developing better models. In this paper, we focus on skill unlearning in LLMs, specifically unlearning a particular skill while retaining their overall capabilities. We introduce two lightweight, training-free machine skill unlearning techniques for LLMs. First, we observe that the pre-activation distribution of neurons in each Feed-Forward Layer (FFL) differs when the model demonstrates different skills. Additionally, we find that queries triggering the same skill cluster within the FFL key space and can be separated from other queries using a hypercube. Based on these observations, we propose two lightweight, training-free skill unlearning methods via intervention and abstention respectively: Neuron Adjust and Key Space Detection. We evaluate our methods on unlearning math-solving, Python-coding, and comprehension skills across seven different languages. The results demonstrate their strong unlearning capabilities for the designated skills. Specifically, Key Space Detection achieves over 80% relative performance drop on the forgetting skill and less than 10% relative performance drop on other skills and the model's general knowledge (MMLU) for most unlearning tasks. Our code is available at https://github.com/Trustworthy-ML-Lab/effective_skill_unlearning
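
The abstention idea behind Key Space Detection (skill queries cluster in key space and can be boxed in by a hypercube) can be illustrated with an axis-aligned box fitted to example key vectors. The sketch below is a guess at the general shape of such a detector, not the released implementation: the mean plus or minus `alpha` standard deviations rule, the `alpha` value, and all function names are assumptions.

```python
import statistics


def fit_hypercube(skill_keys, alpha=2.0):
    """Per-dimension bounds mean +/- alpha * std over key vectors of skill queries."""
    lower, upper = [], []
    for values in zip(*skill_keys):           # iterate over dimensions
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)
        lower.append(mu - alpha * sigma)
        upper.append(mu + alpha * sigma)
    return lower, upper


def inside(key, cube):
    """True when the key vector lies within the box on every dimension."""
    lower, upper = cube
    return all(lo <= k <= hi for k, lo, hi in zip(key, lower, upper))


def answer_or_abstain(key, cube, generate):
    """Abstain when the query's key falls in the forgotten skill's region."""
    return "[abstain]" if inside(key, cube) else generate()
```

Because the model weights are untouched, queries whose keys fall outside the box are answered exactly as before, which is consistent with the small drop on other skills reported above.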

UGNA-VPR: A Novel Training Paradigm for Visual Place Recognition Based on Uncertainty-Guided NeRF Augmentation

Yehui Shen,Lei Zhang,Qingqiu Li,Xiongwei Zhao,Yue Wang,Huimin Lu,Xieyuanli Chen

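Task: Propose a new training paradigm that improves visual place recognition (VPR) through uncertainty estimation and NeRF-based data augmentation.

Motivation: Most existing VPR datasets cover single-viewpoint scenarios, which reduces recognition accuracy in multi-directional driving or feature-sparse scenes, and collecting additional data is expensive.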

Details

Method: Use NeRF to generate synthetic data, identify high-uncertainty places with a self-supervised uncertainty estimation network, and render new observations at those places to augment training. Result: Experiments on three datasets and three VPR backbone networks show that the method significantly improves VPR performance and outperforms other training approaches. Conclusion: By fully utilizing existing data, the proposed training paradigm significantly improves VPR performance, and its effectiveness is further validated on self-recorded datasets. Abstract: Visual place recognition (VPR) is crucial for robots to identify previously visited locations, playing an important role in autonomous navigation in both indoor and outdoor environments. However, most existing VPR datasets are limited to single-viewpoint scenarios, leading to reduced recognition accuracy, particularly in multi-directional driving or feature-sparse scenes. Moreover, obtaining additional data to mitigate these limitations is often expensive. This paper introduces a novel training paradigm to improve the performance of existing VPR networks by enhancing multi-view diversity within current datasets through uncertainty estimation and NeRF-based data augmentation. Specifically, we initially train NeRF using the existing VPR dataset. Then, our devised self-supervised uncertainty estimation network identifies places with high uncertainty. The poses of these uncertain places are input into NeRF to generate new synthetic observations for further training of VPR networks. Additionally, we propose an improved storage method for efficient organization of augmented and original training data. We conducted extensive experiments on three datasets and tested three different VPR backbone networks. The results demonstrate that our proposed training paradigm significantly improves VPR performance by fully utilizing existing data, outperforming other training approaches. We further validated the effectiveness of our approach on self-recorded indoor and outdoor datasets, consistently demonstrating superior results. Our dataset and code have been released at https://github.com/nubot-nudt/UGNA-VPR.

MemInsight: Autonomous Memory Augmentation for LLM Agents

Rana Salama,Jason Cai,Michelle Yuan,Anna Currey,Monica Sunkara,Yi Zhang,Yassine Benajiba

Task: Propose MemInsight, an autonomous memory augmentation approach that improves semantic data representation and retrieval for LLM agents.

Motivation: LLM agents need long-term memory to draw on historical interactions and knowledge, but growing memory size and the need for semantic structuring pose challenges.

Details

Method: MemInsight autonomously augments historical interactions to improve semantic data representation and retrieval. Result: MemInsight is validated in three task scenarios (conversational recommendation, question answering, and event summarization); it raises recommendation persuasiveness by up to 14% and outperforms a RAG baseline by 34% in recall for LoCoMo retrieval. Conclusion: MemInsight can significantly improve the contextual performance of LLM agents across multiple tasks. Abstract: Large language model (LLM) agents have evolved to intelligently process information, make decisions, and interact with users or tools. A key capability is the integration of long-term memory, enabling these agents to draw upon historical interactions and knowledge. However, the growing memory size and need for semantic structuring pose significant challenges. In this work, we propose an autonomous memory augmentation approach, MemInsight, to enhance semantic data representation and retrieval mechanisms. By applying autonomous augmentation to historical interactions, LLM agents are shown to deliver more accurate and contextualized responses. We empirically validate the efficacy of our proposed approach in three task scenarios: conversational recommendation, question answering, and event summarization. On the LLM-REDIAL dataset, MemInsight boosts persuasiveness of recommendations by up to 14%. Moreover, it outperforms a RAG baseline by 34% in recall for LoCoMo retrieval. Our empirical results show the potential of MemInsight to enhance the contextual performance of LLM agents across multiple tasks.
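The retrieval result above is reported as recall. As a minimal sketch of how a recall-at-k score is commonly computed for memory retrieval, the following assumes a ranked list of retrieved memory IDs and a set of relevant IDs; the function name, the ID strings, and the choice of recall@k are illustrative assumptions, not the paper's stated evaluation protocol.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved items.

    `retrieved` is a ranked list of memory IDs, `relevant` a collection of
    ground-truth IDs. Both are hypothetical inputs for illustration.
    """
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Toy example: two relevant memories, one of which is ranked in the top 3.
ranked_ids = ["m3", "m1", "m7", "m2"]
gold_ids = {"m1", "m2"}
score = recall_at_k(ranked_ids, gold_ids, k=3)
print(score)
```

Averaging this score over all queries gives a corpus-level recall that can be compared between a baseline retriever and an augmented one.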

LandMarkSystem Technical Report

Zhenxiang Ma,Zhenyu Yang,Miao Tao,Yuanzhen Zhou,Zeyu He,Yuchang Zhang,Rong Fu,Hengjie Li

Task: Propose LandMarkSystem, a novel computing framework for enhancing multi-scale scene reconstruction and rendering.

Motivation: Traditional deep learning frameworks struggle to meet the growing demands for scene quality and scale, particularly in 3D reconstruction.

Details

Method: A componentized model adaptation layer supports a variety of NeRF and 3DGS structures, while distributed parallel computing and model parameter offloading optimize computational efficiency. Result: The system addresses the limitations of existing frameworks and provides dedicated operators for complex 3D sparse computations, enabling efficient training and fast inference. Conclusion: With its modular architecture and dynamic loading strategy, LandMarkSystem improves the efficiency and effectiveness of 3D reconstruction tasks, and it is open-sourced to support further research. Abstract: 3D reconstruction is vital for applications in autonomous driving, virtual reality, augmented reality, and the metaverse. Recent advancements such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have transformed the field, yet traditional deep learning frameworks struggle to meet the increasing demands for scene quality and scale. This paper introduces LandMarkSystem, a novel computing framework designed to enhance multi-scale scene reconstruction and rendering. By leveraging a componentized model adaptation layer, LandMarkSystem supports various NeRF and 3DGS structures while optimizing computational efficiency through distributed parallel computing and model parameter offloading. Our system addresses the limitations of existing frameworks, providing dedicated operators for complex 3D sparse computations, thus facilitating efficient training and rapid inference over extensive scenes. Key contributions include a modular architecture, a dynamic loading strategy for limited resources, and proven capabilities across multiple representative algorithms. This comprehensive solution aims to advance the efficiency and effectiveness of 3D reconstruction tasks. To facilitate further research and collaboration, the source code and documentation for the LandMarkSystem project are publicly available in an open-source repository at https://github.com/InternLandMark/LandMarkSystem.

Jaco: An Offline Running Privacy-aware Voice Assistant

Daniel Bermuth,Alexander Poeppel,Wolfgang Reif

Task: Present the architecture of Jaco, a novel voice assistant that runs offline, is extensible, supports multiple languages, and focuses on protecting user privacy.

Motivation: Most existing voice assistants are cloud services that offer weak privacy protection and limited support for low-resource devices.

Details

Method: Design and implement the Jaco voice assistant architecture, which supports offline operation, skill-based extension, and multiple languages while protecting user privacy. Result: Jaco runs well on low-resource devices, is easy to extend, offers strong privacy protection, and is competitive with other voice assistants. Conclusion: Jaco combines and extends the advantages of existing voice assistants, providing a privacy-friendly and capable offline solution. Abstract: With the recent advances in speech technology, smart voice assistants have improved and are now used by many people. But these assistants often run online as a cloud service and are not always known for good protection of users' privacy. This paper presents the architecture of a novel voice assistant, called Jaco, with the following features: (a) It can run completely offline, even on low-resource devices like a Raspberry Pi. (b) Through a skill concept it can be easily extended. (c) The architectural focus is on protecting users' privacy, but without restricting capabilities for developers. (d) It supports multiple languages. (e) It is competitive with other voice assistant solutions. In this respect the assistant combines and extends the advantages of other approaches.

Multimodal surface defect detection from wooden logs for sawing optimization

Bořek Reich,Matej Kunda,Fedor Zolotarev,Tuomas Eerola,Pavel Zemčík,Tomi Kauppi

Task: Propose a method for detecting knots on the surface of wooden logs using multimodal data fusion.

Motivation: Knots are a primary factor affecting the quality of sawn timber. X-ray computed tomography is accurate but expensive and slow, while single-modality surface measurements yield low detection accuracy because of noise and the small size of knots.

Details

Method: A multimodal data fusion pipeline for RGB and point cloud data, combined by a late fusion module, improves knot detection accuracy; a sawing angle optimization method based on surface knot detections and cross-correlation is also proposed. Result: Multimodal data fusion achieves higher detection accuracy than either modality alone, and the sawing angle optimization effectively reduces unwanted arris knots. Conclusion: Multimodal data fusion and sawing angle optimization provide an efficient and cost-effective solution for timber quality inspection and processing. Abstract: We propose a novel, good-quality, and less demanding method for detecting knots on the surface of wooden logs using multimodal data fusion. Knots are a primary factor affecting the quality of sawn timber, making their detection fundamental to any timber grading or cutting optimization system. While X-ray computed tomography provides accurate knot locations and internal structures, it is often too slow or expensive for practical use. An attractive alternative is to use fast and cost-effective log surface measurements, such as laser scanners or RGB cameras, to detect surface knots and estimate the internal structure of wood. However, due to the small size of knots and noise caused by factors such as bark and other natural variations, detection accuracy often remains low when only one measurement modality is used. In this paper, we demonstrate that by using a data fusion pipeline consisting of separate streams for RGB and point cloud data, combined by a late fusion module, higher knot detection accuracy can be achieved compared to using either modality alone. We further propose a simple yet efficient sawing angle optimization method that utilizes surface knot detections and cross-correlation to minimize the amount of unwanted arris knots, demonstrating its benefits over randomized sawing angles.
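The abstract names cross-correlation as the basis of the sawing angle optimization but gives no formulation. The sketch below shows one way such a search could look, under loud assumptions: knot detections are binned by angle around the log circumference (`knot_density`), the board edges (arrises) for a zero-degree rotation are marked in a same-length binary template (`arris_template`), and the rotation with the lowest circular cross-correlation score is chosen. All names, the binning, and the scoring rule are hypothetical and are not taken from the paper.

```python
def circular_cross_correlation(signal, template):
    """Circular cross-correlation of two equal-length sequences.

    Entry `shift` is sum_i signal[i] * template[(i - shift) % n], i.e. the
    overlap after rotating the template forward by `shift` bins.
    """
    n = len(signal)
    if len(template) != n:
        raise ValueError("signal and template must have the same length")
    return [
        sum(signal[i] * template[(i - shift) % n] for i in range(n))
        for shift in range(n)
    ]


def best_rotation(knot_density, arris_template):
    """Return the rotation (in bins) that minimizes knot/arris overlap."""
    scores = circular_cross_correlation(knot_density, arris_template)
    return min(range(len(scores)), key=scores.__getitem__)


# Toy example with 8 angular bins of 45 degrees each (assumed layout).
knot_density = [3, 0, 1, 0, 2, 0, 0, 0]    # detected knots per bin
arris_template = [1, 0, 1, 0, 1, 0, 1, 0]  # arrises at 0, 90, 180, 270 degrees
shift = best_rotation(knot_density, arris_template)
print(shift, shift * 45)
```

A randomized sawing angle, the comparison condition mentioned in the abstract, corresponds to picking `shift` uniformly at random instead of minimizing the score.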

Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models

Pin-Yu Chen,Han Shen,Payel Das,Tianyi Chen

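Task: Study the trade-off between safety and capability in fine-tuning large language models (LLMs).

Motivation: Fine-tuning on task-specific datasets has been observed to improve capability while compromising safety, and a theoretical framework is needed to explain this phenomenon.

Details

Method: Propose a theoretical framework for analyzing two primary safety-aware LLM fine-tuning strategies, examining the effects of data similarity, context overlap, and the alignment loss landscape. Result: The theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning and are validated by numerical experiments. Conclusion: The work offers a new theoretical perspective, with experimental support, on the relationship between safety and capability in LLM fine-tuning. Abstract: Fine-tuning Large Language Models (LLMs) on task-specific datasets has been a primary use of LLMs. However, it has been empirically observed that this approach to enhancing capability inevitably compromises safety, a phenomenon also known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and the alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning, which are also validated by numerical experiments.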

Unsupervised Real-World Denoising: Sparsity is All You Need

Hamadi Chihaoui,Paolo Favaro

Task: Propose an input-sparsification method (MID) to address the distribution gap between synthetic and real noisy images.

Motivation: Large datasets of paired noisy and clean images are hard to collect, and existing methods suffer from a distribution gap between synthetic and real noisy images, which degrades performance.

Details

Method: Sparsify the input with random masking, train a model (MID) to denoise and inpaint simultaneously, and iteratively refine the noise sampler to improve denoising. Result: Experiments on real-world noisy image datasets show that the method is competitive in unsupervised denoising. Conclusion: Through sparsification and iterative refinement of the noise sampler, MID effectively narrows the distribution gap between synthetic and real noisy images and improves denoising performance. Abstract: Supervised training for real-world denoising presents challenges due to the difficulty of collecting large datasets of paired noisy and clean images. Recent methods have attempted to address this by utilizing unpaired datasets of clean and noisy images. Some approaches leverage such unpaired data to train denoisers in a supervised manner by generating synthetic clean-noisy pairs. However, these methods often fall short due to the distribution gap between synthetic and real noisy images. To mitigate this issue, we propose a solution based on input sparsification, specifically using random input masking. Our method, which we refer to as Mask, Inpaint and Denoise (MID), trains a denoiser to simultaneously denoise and inpaint synthetic clean-noisy pairs. On one hand, input sparsification reduces the gap between synthetic and real noisy images. On the other hand, an inpainter trained in a supervised manner can still accurately reconstruct sparse inputs by predicting missing clean pixels using the remaining unmasked pixels. Our approach begins with a synthetic Gaussian noise sampler and iteratively refines it using a noise dataset derived from the denoiser's predictions. The noise dataset is created by subtracting predicted pseudo-clean images from real noisy images at each iteration. The core intuition is that improving the denoiser results in a more accurate noise dataset and, consequently, a better noise sampler. We validate our method through extensive experiments on real-world noisy image datasets, demonstrating competitive performance compared to existing unsupervised denoising methods.
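Two operations in the abstract are concrete enough to sketch: random input masking, and building the noise dataset by subtracting a predicted pseudo-clean image from a real noisy image. The following is a minimal pure-Python illustration on nested lists. The zero-fill for masked pixels, the `keep_prob` parameter, and the function names are assumptions made for the example; the denoiser itself is not modelled.

```python
import random


def random_mask(image, keep_prob, rng):
    """Sparsify an image by keeping each pixel with probability keep_prob.

    Returns (sparse_image, mask). Masked-out pixels are set to 0.0, which is
    an assumed fill value for this sketch.
    """
    mask = [[1 if rng.random() < keep_prob else 0 for _ in row] for row in image]
    sparse = [
        [pixel * keep for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]
    return sparse, mask


def extract_noise(real_noisy, pseudo_clean):
    """Noise sample = real noisy image minus the predicted pseudo-clean image."""
    return [
        [noisy - clean for noisy, clean in zip(noisy_row, clean_row)]
        for noisy_row, clean_row in zip(real_noisy, pseudo_clean)
    ]


rng = random.Random(0)
real_noisy = [[0.5, 0.7], [0.2, 0.9]]
pseudo_clean = [[0.4, 0.6], [0.3, 0.8]]  # stand-in for a denoiser prediction

sparse_input, mask = random_mask(real_noisy, keep_prob=0.5, rng=rng)
noise_sample = extract_noise(real_noisy, pseudo_clean)
print(mask)
print(noise_sample)
```

In the iterative scheme the abstract describes, noise samples collected this way would replace the initial Gaussian sampler when generating the next round of synthetic clean-noisy pairs.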

Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road Ahead

Viktor Schlegel,Anil A Bharath,Zilong Zhao,Kevin Yee

Task: Provide a comprehensive framework for privacy-preserving synthetic data and evaluate its practical performance in specialized domains.

Motivation: Address the problem of segregated data in high-stakes domains while balancing privacy protection and data utility.

Details

Method: Review the theoretical foundations of generative models and differential privacy, and evaluate four leading methods on five real-world datasets. Result: Performance degrades significantly under strict privacy constraints (ε ≤ 4), revealing a gap between general-domain benchmarks and specialized-domain data. Conclusion: More robust evaluation frameworks, standardized benchmarks, and improved techniques are needed for privacy-preserving synthetic data to reach its full potential. Abstract: Privacy-preserving synthetic data offers a promising solution to harness segregated data in high-stakes domains where information is compartmentalized for regulatory, privacy, or institutional reasons. This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy followed by a review of state-of-the-art methods across tabular data, images, and text. Our synthesis of evaluation approaches highlights the fundamental trade-off between utility for down-stream tasks and privacy guarantees, while identifying critical research gaps: the lack of realistic benchmarks representing specialized domains and insufficient empirical evaluations required to contextualise formal guarantees. Through empirical analysis of four leading methods on five real-world datasets from specialized domains, we demonstrate significant performance degradation under realistic privacy constraints ($\epsilon \leq 4$), revealing a substantial gap between results reported on general domain benchmarks and performance on domain-specific data. Our findings highlight key challenges including unaccounted privacy leakage, insufficient empirical verification of formal guarantees, and a critical deficit of realistic benchmarks. These challenges underscore the need for robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques to address the unique requirements of privacy-sensitive fields such that this technology can deliver on its considerable potential.
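To make the constraint ε ≤ 4 concrete, the block below restates the standard definition of (ε, δ)-differential privacy and evaluates the multiplicative bound at ε = 4. The abstract does not say which variant (pure or approximate) each surveyed method uses, so treating the bound with δ = 0 is an assumption made only for the worked number.

```latex
% Standard definition: a randomized mechanism M is (\varepsilon, \delta)-DP if,
% for all neighbouring datasets D, D' (differing in one record)
% and all measurable output sets S,
\[
  \Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] + \delta .
\]
% Worked bound for the threshold quoted in the abstract (assuming \delta = 0):
\[
  \varepsilon = 4 \quad\Longrightarrow\quad e^{\varepsilon} = e^{4} \approx 54.6 .
\]
```

Read this way, ε = 4 still allows the probability of any output to differ by a factor of roughly 55 between neighbouring datasets, which gives some intuition for why the abstract calls such budgets realistic rather than strict in the theoretical sense.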

VALLR: Visual ASR Language Model for Lip Reading

Marshall Thomas,Edward Fish,Richard Bowden

Task: Propose a novel two-stage, phoneme-centric framework for visual automatic speech recognition (V-ASR).

Motivation: Address the high error rates that existing methods incur because of coarticulation effects and viseme ambiguity.

Details

Method: A Video Transformer with a CTC head predicts a phoneme sequence, and a fine-tuned large language model (LLM) then reconstructs coherent words and sentences. Result: The method achieves state-of-the-art performance on two datasets (LRS2 and LRS3), reaching a word error rate (WER) of 18.7% on LRS3 while using 99.4% less labelled data than the next best approach. Conclusion: By explicitly encoding intermediate linguistic structure, the method significantly improves V-ASR performance and data efficiency. Abstract: Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. This task is especially challenging due to the absence of auditory information and the inherent ambiguity when visually distinguishing phonemes that have overlapping visemes, where different phonemes appear identical on the lips. Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently encounters high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for Visual Automatic Speech Recognition (V-ASR) that addresses these longstanding challenges. First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head, thereby reducing the task complexity and achieving robust speaker invariance. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences by leveraging broader linguistic context. Unlike existing methods that either predict words directly (often faltering on visually similar phonemes) or rely on large-scale multimodal pre-training, our approach explicitly encodes intermediate linguistic structure while remaining highly data efficient. We demonstrate state-of-the-art performance on two challenging datasets, LRS2 and LRS3, where our method achieves significant reductions in Word Error Rate (WER), achieving a SOTA WER of 18.7 on LRS3 despite using 99.4% less labelled data than the next best approach.
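The first stage emits per-frame outputs from a CTC head, which must be collapsed into a compact phoneme sequence before the LLM sees it. The abstract does not state which decoding rule is used; the sketch below shows the standard greedy (best-path) CTC collapse as an assumed stand-in. The blank symbol `"_"` and the ARPAbet-style labels are illustrative choices, not details from the paper.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real systems usually use an index


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Standard CTC best-path collapse.

    1. Merge runs of identical consecutive labels into one label.
    2. Remove blank symbols.
    A blank between two identical labels keeps them as separate outputs.
    """
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


# Per-frame argmax labels for a short clip (made-up example).
frames = ["_", "HH", "HH", "_", "AH", "L", "L", "_", "OW", "OW"]
phonemes = ctc_greedy_decode(frames)
print(" ".join(phonemes))
```

The resulting phoneme string is the kind of compact intermediate representation that the second stage would then map to words, for instance by placing it in the prompt of a fine-tuned LLM.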

VinaBench: Benchmark for Faithful and Consistent Visual Narratives

Silin Gao,Sheryl Mathew,Li Mi,Sepideh Mamooler,Mengjie Zhao,Hiromi Wakaki,Yuki Mitsufuji,Syrielle Montariol,Antoine Bosselut

Task: Propose VinaBench, a new benchmark that addresses the challenges of faithfulness and self-consistency in visual narrative generation.

Motivation: Visual narrative generation lacks knowledge constraints for planning stories, so generated images are often unfaithful to the input text and inconsistent with one another.

Details

Method: Annotate the commonsense and discourse constraints underlying visual narrative samples to provide systematic scaffolds for learning, and propose new evaluation metrics. Result: Experiments show that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives. Conclusion: VinaBench provides effective knowledge constraints and evaluation methods for visual narrative generation and significantly improves generation quality. Abstract: Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.

Diffusion Image Prior

Hamadi Chihaoui,Paolo Favaro

Task: Propose DIIP, a zero-shot image restoration method based on pretrained diffusion models, for handling degradations too complex to define explicitly.

Motivation: In real-world scenarios the degradation model may be too complex to define explicitly, so a method that does not require an explicit degradation model is needed.

Details

Method: Use a pretrained diffusion model as a prior (DIIP) and perform blind image restoration through an early stopping strategy. Result: DIIP achieves state-of-the-art results on a range of degradation-blind image restoration tasks, including JPEG artifact removal, waterdrop removal, denoising, and super-resolution. Conclusion: Pretrained diffusion models provide a stronger prior than DIP, and DIIP is an effective blind image restoration method. Abstract: Zero-shot image restoration (IR) methods based on pretrained diffusion models have recently achieved significant success. These methods typically require at least a parametric form of the degradation model. However, in real-world scenarios, the degradation may be too complex to define explicitly. To handle this general case, we introduce the Diffusion Image Prior (DIIP). We take inspiration from the Deep Image Prior (DIP) [16], since it can be used to remove artifacts without the need for an explicit degradation model. However, in contrast to DIP, we find that pretrained diffusion models offer a much stronger prior, despite being trained without knowledge from corrupted data. We show that the optimization process in DIIP first reconstructs a clean version of the image before eventually overfitting to the degraded input, but it does so for a broader range of degradations than DIP. In light of this result, we propose a blind image restoration (IR) method based on early stopping, which does not require prior knowledge of the degradation model. We validate DIIP on various degradation-blind IR tasks, including JPEG artifact removal, waterdrop removal, denoising and super-resolution, with state-of-the-art results.

D4R -- Exploring and Querying Relational Graphs Using Natural Language and Large Language Models -- the Case of Historical Documents

Michel Boeglin,David Kahn,Josiane Mothe,Diego Ortiz,David Panzoli

Task: Design a digital platform (D4R) that helps non-technical users, particularly historians, explore textual documents through graphical tools.

Motivation: Bridge the gap between AI technologies and historical research, with the potential to extend to other domains.

Details

Method: Use a large language model to translate natural language questions into Cypher queries that retrieve data from a Neo4J database, and provide a user-friendly graphical interface. Result: A capable platform was developed that lets users intuitively analyse and navigate complex relational data. Conclusion: D4R is applicable beyond historical research and can be extended to other domains, showing broad application potential. Abstract: D4R is a digital platform designed to assist non-technical users, particularly historians, in exploring textual documents through advanced graphical tools for text analysis and knowledge extraction. By leveraging a large language model, D4R translates natural language questions into Cypher queries, enabling the retrieval of data from a Neo4J database. A user-friendly graphical interface allows for intuitive interaction, enabling users to navigate and analyse complex relational data extracted from unstructured textual documents. Originally designed to bridge the gap between AI technologies and historical research, D4R's capabilities extend to various other domains. A demonstration video and a live software demo are available.

Dual-Task Learning for Dead Tree Detection and Segmentation with Hybrid Self-Attention U-Nets in Aerial Imagery

Anis Ur Rahman,Einari Heinaro,Mete Ahishali,Samuli Junttila

Task: Develop a hybrid postprocessing framework that refines deep learning-based tree segmentation to precisely identify standing dead trees.

Motivation: Dense canopy structures, spectral overlap between living and dead vegetation, and over-segmentation limit the reliability of existing methods.

Details

Method: A hybrid postprocessing framework combining watershed algorithms with adaptive filtering refines boundary delineation and reduces false positives. Result: Tested on high-resolution aerial imagery of boreal forests, the framework improved instance-level segmentation accuracy by 41.5% and reduced positional errors by 57%. Conclusion: The method performs well in complex forest environments and supports large-scale ecological monitoring and forest management. Abstract: Mapping standing dead trees is critical for assessing forest health, monitoring biodiversity, and mitigating wildfire risks, for which aerial imagery has proven useful. However, dense canopy structures, spectral overlaps between living and dead vegetation, and over-segmentation errors limit the reliability of existing methods. This study introduces a hybrid postprocessing framework that refines deep learning-based tree segmentation by integrating watershed algorithms with adaptive filtering, enhancing boundary delineation, and reducing false positives in complex forest environments. Tested on high-resolution aerial imagery from boreal forests, the framework improved instance-level segmentation accuracy by 41.5% and reduced positional errors by 57%, demonstrating robust performance in densely vegetated regions. By balancing detection accuracy and over-segmentation artifacts, the method enabled the precise identification of individual dead trees, which is critical for ecological monitoring. The framework's computational efficiency supports scalable applications, such as wall-to-wall tree mortality mapping over large geographic regions using aerial or satellite imagery. These capabilities directly benefit wildfire risk assessment (identifying fuel accumulations), carbon stock estimation (tracking emissions from decaying biomass), and precision forestry (targeting salvage loggings). By bridging advanced remote sensing techniques with practical forest management needs, this work advances tools for large-scale ecological conservation and climate resilience planning.

ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer

Michael Brown,Sofia Martinez,Priya Singh

Task: Text-driven speech style transfer, which adjusts the intonation, pace, and timbre of speech according to a text description.

Motivation: Existing methods typically rely on large-scale neural networks or pre-trained language models and incur high computational cost.

Details

Method: Propose ReverBERT, an efficient framework based on a state space model (SSM) that combines a discrete Fourier transform with Transformer-based SSM layers to achieve smooth style modulation. Result: Experiments on benchmark speech corpora show that ReverBERT significantly outperforms baselines in naturalness, expressiveness, and computational efficiency. Conclusion: ReverBERT offers an efficient, high-quality solution for text-driven speech style transfer, and the model and code are released publicly to support further research. Abstract: Text-driven speech style transfer aims to mold the intonation, pace, and timbre of a spoken utterance to match stylistic cues from text descriptions. While existing methods leverage large-scale neural architectures or pre-trained language models, the computational costs often remain high. In this paper, we present ReverBERT, an efficient framework for text-driven speech style transfer that draws inspiration from a state space model (SSM) paradigm, loosely motivated by the image-based method of Wang and Liu. Unlike image domain techniques, our method operates in the speech space and integrates a discrete Fourier transform of latent speech features to enable smooth and continuous style modulation. We also propose a novel Transformer-based SSM layer for bridging textual style descriptors with acoustic attributes, dramatically reducing inference time while preserving high-quality speech characteristics. Extensive experiments on benchmark speech corpora demonstrate that ReverBERT significantly outperforms baselines in terms of naturalness, expressiveness, and computational efficiency. We release our model and code publicly to foster further research in text-driven speech style transfer.

Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving

Lucas Nunes,Rodrigo Marcuzzi,Jens Behley,Cyrill Stachniss

Task: Propose a new approach for generating 3D semantic scene-scale data without relying on projections or decoupled multi-resolution models.

Motivation: Address the complexity of annotating 3D data and the domain gap between simulated and real data.

Details

Method: Generate 3D semantic scene data directly, avoiding the errors introduced by intermediary representations. Result: The generated data is of higher quality, and using it to train a semantic segmentation network improves performance. Conclusion: The approach shows the potential of generated scene-scale point clouds to reduce data annotation effort. Abstract: Semantic scene understanding is crucial for robotics and computer vision applications. In autonomous driving, 3D semantic segmentation plays an important role in enabling safe navigation. Despite significant advances in the field, the complexity of collecting and annotating 3D data is a bottleneck in these developments. To overcome that data annotation limitation, synthetic simulated data has been used to generate annotated data on demand. There is still, however, a domain gap between real and simulated data. More recently, diffusion models have been in the spotlight, enabling close-to-real data synthesis. Those generative models have been recently applied to the 3D data domain for generating scene-scale data with semantic annotations. Still, those methods either rely on image projection or decoupled models trained with different resolutions in a coarse-to-fine manner. Such intermediary representations impact the generated data quality due to errors added in those transformations. In this work, we propose a novel approach able to generate 3D semantic scene-scale data without relying on any projection or decoupled trained multi-resolution models, achieving more realistic semantic scene data generation compared to previous state-of-the-art methods. Besides improving 3D semantic scene-scale data synthesis, we thoroughly evaluate the use of the synthetic scene samples as labeled data to train a semantic segmentation network. In our experiments, we show that using the synthetic annotated data generated by our method as training data together with the real semantic segmentation labels leads to an improvement in the semantic segmentation model performance. Our results show the potential of generated scene-scale point clouds to generate more training data to extend existing datasets, reducing the data annotation effort. Our code is available at https://github.com/PRBonn/3DiSS.

AskSport: Web Application for Sports Question-Answering

Enzo B Onofre,Leonardo M P Moraes,Cristina D Aguiar

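Task: Introduce AskSport, a natural-language question-answering web application about sports.

Motivation: Give users a convenient way to get answers to sports questions, accepting natural language input and returning relevant answers and information.

Details

Method: Describe AskSport's characteristics and functionalities, including use cases that demonstrate its ability to return names and numerical values. Result: AskSport returns the three most relevant answers along with related documents, and its implementation is publicly available on HuggingFace. Conclusion: AskSport is an effective sports question-answering tool that supports natural language interaction and is publicly available. Abstract: This paper introduces AskSport, a question-answering web application about sports. It allows users to ask questions using natural language and retrieve the three most relevant answers, including related information and documents. The paper describes the characteristics and functionalities of the application, including use cases demonstrating its ability to return names and numerical values. AskSport and its implementation are available for public access on HuggingFace.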

FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs

Xiaoqin Wang,Xusen Ma,Xianxu Hou,Meidan Ding,Yudong Li,Junliang Chen,Wenting Chen,Xiaoyang Peng,Linlin Shen

Task: Evaluate the performance of multimodal large language models (MLLMs) on face perception tasks.

Motivation: Evaluation of MLLMs on face perception remains largely unexplored, and dedicated datasets and methods are needed to fill this gap.

Details

Method: Propose the FaceBench dataset, with hierarchical multi-view and multi-level attributes, and develop Face-LLaVA as a baseline model. Result: Existing MLLMs perform poorly at understanding fine-grained facial attributes, while Face-LLaVA significantly outperforms open-source models and approaches commercial models such as GPT-4o and Gemini. Conclusion: FaceBench provides an effective tool for evaluating the face perception abilities of MLLMs, and Face-LLaVA shows strong performance with a small amount of training data. Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in various tasks. However, effectively evaluating these MLLMs on face perception remains largely unexplored. To address this gap, we introduce FaceBench, a dataset featuring hierarchical multi-view and multi-level attributes specifically designed to assess the comprehensive face perception abilities of MLLMs. Initially, we construct a hierarchical facial attribute structure, which encompasses five views with up to three levels of attributes, totaling over 210 attributes and 700 attribute values. Based on the structure, the proposed FaceBench consists of 49,919 visual question-answering (VQA) pairs for evaluation and 23,841 pairs for fine-tuning. We further develop a robust face perception MLLM baseline, Face-LLaVA, by training with our proposed face VQA data. Extensive experiments on various mainstream MLLMs and Face-LLaVA are conducted to test their face perception ability, with results also compared against human performance. The results reveal that existing MLLMs are far from satisfactory in understanding fine-grained facial attributes, while our Face-LLaVA significantly outperforms existing open-source models with a small amount of training data and is comparable to commercial ones like GPT-4o and Gemini. The dataset will be released at https://github.com/CVI-SZU/FaceBench.

Rerouting Connection: Hybrid Computer Vision Analysis Reveals Visual Similarity Between Indus and Tibetan-Yi Corridor Writing Systems

Ooha Lakkadi Reddy

Task: Investigate potential historical connections between the Indus Valley script and the pictographic writing systems of the Tibetan-Yi Corridor.

Motivation: Explore under-recognized similarities in visual morphology between the Indus Valley Civilization and Tibetan-Yi Corridor cultures, challenging conventional views of isolated script development.

Details

Method: A hybrid CNN-Transformer architecture and an anthropological framework are used to compare three target scripts through an ensemble of 15 independently trained models. Result: Tibetan-Yi Corridor scripts show markedly higher visual similarity to the Indus script (61.7%-63.5%) than to Bronze Age Proto-Cuneiform (10.2%-10.9%) or Proto-Elamite (7.6%-8.7%). Conclusion: The results indicate significant similarity between the Indus script and Tibetan-Yi Corridor scripts, supporting complex cultural transmission networks between ancient South and East Asia. Abstract: This thesis employs a hybrid CNN-Transformer architecture, in conjunction with a detailed anthropological framework, to investigate potential historical connections between the visual morphology of the Indus Valley script and pictographic systems of the Tibetan-Yi Corridor. Through an ensemble methodology of three target scripts across 15 independently trained models, we demonstrate that Tibetan-Yi Corridor scripts exhibit approximately six-fold higher visual similarity to the Indus script (61.7%-63.5%) than to the Bronze Age Proto-Cuneiform (10.2%-10.9%) or Proto-Elamite (7.6%-8.7%) systems. Additionally, and contrary to our current understanding of the networks of the Indus Valley Civilization, the Indus script unexpectedly maps closer to Tibetan-Yi Corridor scripts, with a mean cosine similarity of 0.629, than to the aforementioned contemporaneous West Asian signaries, which recorded mean cosine similarities of 0.104 and 0.080, respectively, despite their close geographic proximity and evident trade relations. Across various dimensionality reduction practices and clustering methodologies, the Indus script consistently clusters closest to Tibetan-Yi Corridor scripts. Our computational results align with qualitative observations of specific pictorial parallels in numeral systems, gender markers, and key iconographic elements; this is further supported by archaeological evidence of sustained contact networks along the ancient Shu-Shendu road in tandem with the Indus Valley Civilization's decline, providing a plausible transmission pathway. While alternative explanations cannot be ruled out, the specificity and consistency of observed similarities challenge conventional narratives of isolated script development and suggest more complex ancient cultural transmission networks between South and East Asia than previously recognized.
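The headline numbers in this abstract are mean cosine similarities between script embeddings. As a minimal sketch of that metric, the code below computes cosine similarity and its mean over all cross-set pairs, then checks the ratio between the reported means. The embedding vectors are toy values and the pairwise-mean aggregation is an assumption; the thesis may aggregate differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(set_a, set_b):
    """Mean cosine similarity over every (a, b) pair across two sets."""
    sims = [cosine_similarity(a, b) for a in set_a for b in set_b]
    return sum(sims) / len(sims)


# Toy glyph embeddings (hypothetical values, two dimensions for readability).
script_a = [[1.0, 0.0]]
script_b = [[1.0, 0.0], [0.0, 1.0]]
print(mean_pairwise_similarity(script_a, script_b))

# Ratio of the means reported in the abstract.
ratio_proto_cuneiform = 0.629 / 0.104
ratio_proto_elamite = 0.629 / 0.080
print(ratio_proto_cuneiform, ratio_proto_elamite)
```

The first ratio is close to six, which matches the "approximately six-fold" wording; the second is closer to eight, so the six-fold figure is best read as the lower end of the gap.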

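RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives

Chirag Parikh,Deepti Rawat,Rakshitha R. T.,Tathagata Ghosh,Ravi Kiran Sarvadevabhatla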

Task: Build RoadSocial, a large-scale, diverse VideoQA dataset for generic road event understanding.

Motivation: Existing datasets are limited by regional bias, viewpoint bias, and expert-driven annotations, and cannot fully capture the complexity of road events.

Details

Method: A scalable semi-automatic annotation framework leverages Text LLMs and Video LLMs to generate question-answer pairs across 12 challenging QA tasks. Result: The resulting dataset contains 13.2K videos, 674 tags, and 260K high-quality QA pairs, and 18 Video LLMs are evaluated on it. Conclusion: RoadSocial advances research on road event understanding and also improves the capabilities of general-purpose Video LLMs. Abstract: We introduce RoadSocial, a large-scale, diverse VideoQA dataset tailored for generic road event understanding from social media narratives. Unlike existing datasets limited by regional bias, viewpoint bias and expert-driven annotations, RoadSocial captures the global complexity of road events with varied geographies, camera viewpoints (CCTV, handheld, drones) and rich social discourse. Our scalable semi-automatic annotation framework leverages Text LLMs and Video LLMs to generate comprehensive question-answer pairs across 12 challenging QA tasks, pushing the boundaries of road event understanding. RoadSocial is derived from social media videos spanning 14M frames and 414K social comments, resulting in a dataset with 13.2K videos, 674 tags and 260K high-quality QA pairs. We evaluate 18 Video LLMs (open-source and proprietary, driving-specific and general-purpose) on our road event understanding benchmark. We also demonstrate RoadSocial's utility in improving road event understanding capabilities of general-purpose Video LLMs.

Measuring and Analyzing Subjective Uncertainty in Scientific Communications

Jamshid Sourati,Grace Shao

Task: Study the use of subjective uncertainty language in scientific papers and its impact within scientific communities.

Motivation: The uncertainty of scientific findings is usually reported through statistical metrics, but subjective uncertainty expressed in language may affect science communication and public understanding.

Details

Method: Measure and analyze how subjective uncertainty language varies across disciplines, publication years, and geographical locations, and study its correlation with bibliometric indicators. Result: The use of subjective uncertainty language varies significantly across fields, publication years, and geographical locations, and correlates with indicators such as number and gender of authors, field centrality, and citation count. Conclusion: The findings help identify and document linguistic norms in different scientific communities and are relevant to science communication. Abstract: Uncertainty of scientific findings is typically reported through statistical metrics such as $p$-values, confidence intervals, etc. The magnitude of this objective uncertainty is reflected in the language used by the authors to report their findings, primarily through expressions carrying uncertainty-inducing terms or phrases. This language uncertainty is a subjective concept and is highly dependent on the writing style of the authors. There is evidence that such subjective uncertainty influences the impact of science on public audiences. In this work, we turned our focus to scientists themselves, and measured and analyzed the subjective uncertainty and its impact within scientific communities across different disciplines. We showed that the level of this type of uncertainty varies significantly across different fields, years of publication and geographical locations. We also studied the correlation between subjective uncertainty and several bibliographical metrics, such as number/gender of authors, centrality of the field's community, citation count, etc. The underlying patterns identified in this work are useful for identifying and documenting linguistic norms in scientific communication in different communities/societies.

Retinal Fundus Multi-Disease Image Classification using Hybrid CNN-Transformer-Ensemble Architectures

Deependra Singh,Saksham Agarwal,Subhankar Mishra

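Task: Develop a fundus-image-based diagnostic system for retinal diseases that accurately predicts 20 disease labels.

Motivation: Address the problem that retinal diseases affect a large global population while medical resources are unevenly distributed, particularly in non-urban areas.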

Details

Method: Hybrid models combine deep convolutional neural networks (CNNs), Transformer encoders, and ensemble architectures, applied sequentially and in parallel for classification. Result: The C-Tran ensemble model performs best with a score of 0.9166, exceeding the baseline of 0.9; the IEViT model also shows a marked improvement in computational efficiency. Conclusion: The study provides an efficient and accurate solution for retinal disease diagnosis, especially suited to regions with scarce medical resources. Abstract: Our research is motivated by the urgent global issue of a large population affected by retinal diseases, which are evenly distributed but underserved by specialized medical expertise, particularly in non-urban areas. Our primary objective is to bridge this healthcare gap by developing a comprehensive diagnostic system capable of accurately predicting retinal diseases solely from fundus images. However, we faced significant challenges due to limited, diverse datasets and imbalanced class distributions. To overcome these issues, we have devised innovative strategies. Our research introduces novel approaches, utilizing hybrid models combining deeper Convolutional Neural Networks (CNNs), Transformer encoders, and ensemble architectures sequentially and in parallel to classify retinal fundus images into 20 disease labels. Our overarching goal is to assess these advanced models' potential in practical applications, with a strong focus on enhancing retinal disease diagnosis accuracy across a broader spectrum of conditions. Importantly, our efforts have surpassed baseline model results, with the C-Tran ensemble model emerging as the leader, achieving a remarkable model score of 0.9166, surpassing the baseline score of 0.9. Additionally, experiments with the IEViT model showcased equally promising outcomes with improved computational efficiency. We've also demonstrated the effectiveness of dynamic patch extraction and the integration of domain knowledge in computer vision tasks. In summary, our research strives to contribute significantly to retinal disease diagnosis, addressing the critical need for accessible healthcare solutions in underserved regions while aiming for comprehensive and accurate disease prediction.

VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation

Alan Dao,Norapat Buppodom

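Task: Propose a new method that uses a vision-language model (VLM) to extract high-level semantic information (such as object identity, color, and location) from voxel data.

Motivation: Voxel grids provide a structured representation of 3D space, but extracting high-level semantic information from them remains challenging.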

Details

Method: The voxel space is systematically sliced along a primary axis (e.g., the Z-axis), and the 2D slices are fed into the image encoder of a standard VLM, so that a pre-trained 2D VLM can be used for 3D semantic understanding. Result: The model aggregates information across slices and associates spatial patterns with the semantic concepts provided by the language component. Conclusion: The slice-based approach makes effective use of pre-trained 2D VLMs to achieve efficient 3D semantic understanding directly from voxel representations. Abstract: Comprehending 3D environments is vital for intelligent systems in domains like robotics and autonomous navigation. Voxel grids offer a structured representation of 3D space, but extracting high-level semantic meaning remains challenging. This paper proposes a novel approach utilizing a Vision-Language Model (VLM) to extract "voxel semantics" (object identity, color, and location) from voxel data. Critically, instead of employing complex 3D networks, our method processes the voxel space by systematically slicing it along a primary axis (e.g., the Z-axis, analogous to CT scan slices). These 2D slices are then formatted and sequentially fed into the image encoder of a standard VLM. The model learns to aggregate information across slices and correlate spatial patterns with semantic concepts provided by the language component. This slice-based strategy aims to leverage the power of pre-trained 2D VLMs for efficient 3D semantic understanding directly from voxel representations.
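
The slicing step described in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical voxel grid stored as nested lists indexed `grid[x][y][z]`, where 0 means empty and any other integer is a color or class id; the paper's actual slice formatting and VLM input pipeline are not given in the abstract and are not reproduced here.

```python
from typing import List

Grid = List[List[List[int]]]  # hypothetical layout: grid[x][y][z], 0 = empty voxel


def slice_along_z(grid: Grid) -> List[List[List[int]]]:
    """Cut the voxel grid into one 2D image per z index.

    Each returned slice is indexed slice[y][x], like the rows and columns of an image.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    return [
        [[grid[x][y][z] for x in range(nx)] for y in range(ny)]
        for z in range(nz)
    ]


def non_empty_slices(slices: List[List[List[int]]]) -> List[int]:
    """Return the z indices whose slice contains at least one occupied voxel."""
    return [
        z for z, plane in enumerate(slices)
        if any(value != 0 for row in plane for value in row)
    ]


# A 2 x 2 x 3 grid with a single occupied voxel at (x=1, y=0, z=2).
grid = [[[0, 0, 0] for _ in range(2)] for _ in range(2)]
grid[1][0][2] = 5
slices = slice_along_z(grid)
```

In this sketch each slice would then be rendered as an image and passed to the VLM in z order; skipping empty slices is an illustrative choice, not something the abstract states.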

Fine-Grained Behavior and Lane Constraints Guided Trajectory Prediction Method

Wenyi Xiong,Jian Chen,Ziheng Qi

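Task: Propose BLNet, a dual-stream architecture that addresses the lack of fine-grained, continuous descriptions in target-vehicle trajectory prediction in dynamic environments.

Motivation: Existing prediction algorithms cannot provide a fine-grained, continuous description of the target vehicle's future behaviors and lane constraints, which degrades prediction accuracy.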

Details

Method: Parallel attention mechanisms jointly integrate behavioral intention recognition and lane constraint modeling, generating fine-grained behavior state queries and lane queries; a two-stage decoder then generates and refines trajectories. Result: Experiments on the nuScenes and Argoverse datasets show that BLNet significantly outperforms existing direct regression and goal-based algorithms. Conclusion: With its dual-stream architecture and two-stage decoder, BLNet effectively improves the accuracy and continuity of trajectory prediction in dynamic environments. Abstract: Trajectory prediction, as a critical component of autonomous driving systems, has attracted the attention of many researchers. Existing prediction algorithms focus on extracting more detailed scene features or selecting more reasonable trajectory destinations. However, in the face of dynamic and evolving future movements of the target vehicle, these algorithms cannot provide a fine-grained and continuous description of future behaviors and lane constraints, which degrades the prediction accuracy. To address this challenge, we present BLNet, a novel dual-stream architecture that synergistically integrates behavioral intention recognition and lane constraint modeling through parallel attention mechanisms. The framework generates fine-grained behavior state queries (capturing spatial-temporal movement patterns) and lane queries (encoding lane topology constraints), supervised by two auxiliary losses, respectively. Subsequently, a two-stage decoder first produces trajectory proposals, then performs point-level refinement by jointly incorporating both the continuity of passed lanes and future motion features. Extensive experiments on two large datasets, nuScenes and Argoverse, show that our network exhibits significant performance gains over existing direct regression and goal-based algorithms.

Bias-Aware Agent: Enhancing Fairness in AI-Driven Knowledge Retrieval

Karanbir Singh,William Ngu

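Task: Propose a new approach to bias-aware knowledge retrieval based on an agentic framework and bias detection tools.

Motivation: Although large language models (LLMs) and AI agents have made significant progress in information retrieval, they remain subject to bias and fairness issues rooted in the knowledge base and in LLM training.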

Details

Method: An agentic framework and a novel use of bias detectors as tools identify and highlight inherent biases in retrieved content. Result: By giving users greater transparency and awareness, the approach aims to foster more equitable information systems and the development of responsible AI. Conclusion: The study offers a new approach to bias-aware information retrieval that can help improve the fairness of information systems and the accountability of AI. Abstract: Advancements in retrieving accessible information have evolved faster in the last few years compared to the decades since the internet's creation. Search engines, like Google, have been the number one way to find relevant data. They have always relied on the user's abilities to find the best information in its billions of links and sources at everybody's fingertips. The advent of large language models (LLMs) has completely transformed the field of information retrieval. LLMs excel not only at retrieving relevant knowledge but also at summarizing it effectively, making information more accessible and consumable for users. On top of this, the rise of AI agents has introduced another aspect to information retrieval, namely dynamic information retrieval, which enables the integration of real-time data, such as weather forecasts and financial data, with the knowledge base to curate context-aware knowledge. However, despite these advancements, the agents remain susceptible to issues of bias and fairness, challenges deeply rooted within the knowledge base and training of LLMs. This study introduces a novel approach to bias-aware knowledge retrieval by leveraging an agentic framework and the innovative use of bias detectors as tools to identify and highlight inherent biases in the retrieved content. By empowering users with transparency and awareness, this approach aims to foster more equitable information systems and promote the development of responsible AI.

BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding

Shuming Liu,Chen Zhao,Tianqi Xu,Bernard Ghanem

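Task: Propose BOLT, a method that improves the performance of large video-language models (VLMs) on long-form video analysis through frame selection strategies.

Motivation: Existing VLMs perform poorly on long-form video analysis because of limited context windows, and conventional uniform frame sampling performs badly in noisy settings.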

Details

Method: Several frame selection strategies based on query-frame similarity are studied, and a multi-source retrieval evaluation setting is proposed. Result: Inverse transform sampling yields the largest gain, raising accuracy on the Video-MME benchmark from 53.8% to 56.1% and on the MLVU benchmark from 58.9% to 63.4%. Conclusion: BOLT effectively improves VLM performance on long-form video analysis without additional training. Abstract: Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. However, their effectiveness in long-form video analysis is constrained by limited context windows. Traditional approaches, such as uniform frame sampling, often inevitably allocate resources to irrelevant content, diminishing their effectiveness in real-world scenarios. In this paper, we introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies. First, to enable a more realistic evaluation of VLMs in long-form video understanding, we propose a multi-source retrieval evaluation setting. Our findings reveal that uniform sampling performs poorly in noisy contexts, underscoring the importance of selecting the right frames. Second, we explore several frame selection strategies based on query-frame similarity and analyze their effectiveness at inference time. Our results show that inverse transform sampling yields the most significant performance improvement, increasing accuracy on the Video-MME benchmark from 53.8% to 56.1% and MLVU benchmark from 58.9% to 63.4%. Our code is available at https://github.com/sming256/BOLT.
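
To make the idea of inverse transform sampling over query-frame similarities concrete, here is a minimal sketch of one deterministic variant. It is an illustration under stated assumptions, not the BOLT implementation: the similarity scores are taken as given, they are shifted to be positive and normalized into a distribution, and frames are picked where evenly spaced quantiles fall on the cumulative distribution. The function name and the normalization are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import List, Sequence


def inverse_transform_select(scores: Sequence[float], k: int) -> List[int]:
    """Pick frame indices by inverting the CDF of normalized similarity scores.

    Frames with higher query-frame similarity occupy a larger share of the CDF,
    so evenly spaced quantiles land on them more often than on low-score frames.
    Duplicate picks are merged, so fewer than k indices may be returned.
    """
    if k <= 0 or not scores:
        return []
    lowest = min(scores)
    # Assumption: shift scores so every weight is strictly positive.
    weights = [s - lowest + 1e-8 for s in scores]
    total = sum(weights)
    cdf = list(accumulate(w / total for w in weights))
    picks = set()
    for i in range(k):
        u = (i + 0.5) / k  # evenly spaced quantiles in (0, 1)
        picks.add(min(bisect_left(cdf, u), len(cdf) - 1))
    return sorted(picks)


# Hypothetical similarities: frames 2 and 3 match the query, the rest do not.
selected = inverse_transform_select([0.1, 0.1, 0.9, 0.9, 0.1, 0.1], k=4)
```

With these scores the selection concentrates on the two relevant frames, whereas uniform sampling would spread the budget over all six.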

Composable Prompting Workspaces for Creative Writing: Exploration and Iteration Using Dynamic Widgets

Rifat Mehreen Amin,Oliver Hans Kühle,Daniel Buschek,Andreas Butz

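Task: Propose the concept of a composable prompting canvas for text exploration and iteration using dynamic widgets.

Motivation: Current graphical user interfaces for generative AI models lack support for iterative exploration because they do not represent prompts as actionable interface objects.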

Details

Method: A composable prompting canvas is designed in which users generate widgets through system suggestions, prompting, or manually, capturing task-relevant facets of text generation. Result: In a comparative study, 18 participants completed writing tasks with the system and reported more control over the generated text; the design significantly outperformed the baseline (a conversational UI). Conclusion: GUI designs that support user-driven customization and restructuring can increase the flexibility and efficiency of prompting. Abstract: Generative AI models offer many possibilities for text creation and transformation. Current graphical user interfaces (GUIs) for prompting them lack support for iterative exploration, as they do not represent prompts as actionable interface objects. We propose the concept of a composable prompting canvas for text exploration and iteration using dynamic widgets. Users generate widgets through system suggestions, prompting, or manually to capture task-relevant facets that affect the generated text. In a comparative study with a baseline (conversational UI), 18 participants worked on two writing tasks, creating diverse prompting environments with custom widgets and spatial layouts. They reported having more control over the generated text and preferred our system over the baseline. Our design significantly outperformed the baseline on the Creativity Support Index, and participants felt the results were worth the effort. This work highlights the need for GUIs that support user-driven customization and (re-)structuring to increase both the flexibility and efficiency of prompting.

Hamadi Chihaoui,Paolo Favaro

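Task: Propose Invert2Restore, a zero-shot, training-free method that addresses the challenges of image restoration in real-world scenarios.

Motivation: The two main challenges of real-world image restoration are accurately characterizing the image prior and precisely modeling the image degradation operator; existing methods are limited in how they model the degradation operator.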

Details

Method: A pre-trained diffusion model is used as a deterministic mapping, and the degraded image is restored by guiding its input noise toward a higher-probability-density region. Result: Experiments show that Invert2Restore achieves state-of-the-art performance when the degradation operator is unknown or only partially known. Conclusion: Invert2Restore is an efficient and general method applicable to many types of image degradation. Abstract: Two of the main challenges of image restoration in real-world scenarios are the accurate characterization of an image prior and the precise modeling of the image degradation operator. Pre-trained diffusion models have been very successfully used as image priors in zero-shot image restoration methods. However, how to best handle the degradation operator is still an open problem. In real-world data, methods that rely on specific parametric assumptions about the degradation model often face limitations in their applicability. To address this, we introduce Invert2Restore, a zero-shot, training-free method that operates in both fully blind and partially blind settings -- requiring no prior knowledge of the degradation model or only partial knowledge of its parametric form without known parameters. Despite this, Invert2Restore achieves high-fidelity results and generalizes well across various types of image degradation. It leverages a pre-trained diffusion model as a deterministic mapping between normal samples and undistorted image samples. The key insight is that the input noise mapped by a diffusion model to a degraded image lies in a low-probability density region of the standard normal distribution. Thus, we can restore the degraded image by carefully guiding its input noise toward a higher-density region. We experimentally validate Invert2Restore across several image restoration tasks, demonstrating that it achieves state-of-the-art performance in scenarios where the degradation operator is either unknown or partially known.

debug-gym: A Text-Based Environment for Interactive Debugging

Xingdi Yuan,Morgane M Moss,Charbel El Feghali,Chinmay Singh,Darya Moldavskaya,Drew MacPhee,Lucas Caccia,Matheus Pereira,Minseon Kim,Alessandro Sordoni,Marc-Alexandre Côté

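Task: Develop an interactive environment (debug-gym) that helps LLM-based agents explore information in a codebase.

Motivation: LLMs rely on in-context information or training data for coding tasks but lack the ability to interactively explore a codebase, which limits the information they can access.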

Details

Method: A lightweight text-based environment (debug-gym) is proposed, with built-in tools such as a Python debugger, to support interactive debugging by LLM agents. Result: The debug-gym environment supports information exploration and debugging by LLM agents in interactive coding tasks. Conclusion: Beyond coding and debugging, the approach can be generalized to other tasks that require information-seeking behavior from LLM agents. Abstract: Large Language Models (LLMs) are increasingly relied upon for coding tasks, yet in most scenarios it is assumed that all relevant information can be either accessed in context or matches their training data. We posit that LLMs can benefit from the ability to interactively explore a codebase to gather the information relevant to their task. To achieve this, we present a textual environment, namely debug-gym, for developing LLM-based agents in an interactive coding setting. Our environment is lightweight and provides a preset of useful tools, such as a Python debugger (pdb), designed to facilitate an LLM-based agent's interactive debugging. Beyond coding and debugging tasks, this approach can be generalized to other tasks that would benefit from information-seeking behavior by an LLM agent.

Shape Modeling of Longitudinal Medical Images: From Diffeomorphic Metric Mapping to Deep Learning

Edwin Tay,Nazli Tümer,Amir A. Zadpoor

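Task: Review methods for modeling longitudinal shape change in biological tissue.

Motivation: Shape changes in biological tissue matter for medical diagnosis, prognosis, and treatment, but their nonlinear nature makes them challenging to model.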

Details

Method: The review covers a range of methods, from diffeomorphic metric mapping to deep-learning-based approaches (e.g., autoencoders, generative networks, and recurrent neural networks). Result: It summarizes synergistic combinations of existing technologies and identifies key deficiencies in current research. Conclusion: It outlines potential directions for future research and stresses the need for further development. Abstract: Living biological tissue is a complex system, constantly growing and changing in response to external and internal stimuli. These processes lead to remarkable and intricate changes in shape. Modeling and understanding both natural and pathological (or abnormal) changes in the shape of anatomical structures is highly relevant, with applications in diagnostic, prognostic, and therapeutic healthcare. Nevertheless, modeling the longitudinal shape change of biological tissue is a non-trivial task due to its inherent nonlinear nature. In this review, we highlight several existing methodologies and tools for modeling longitudinal shape change (i.e., spatiotemporal shape modeling). These methods range from diffeomorphic metric mapping to deep-learning based approaches (e.g., autoencoders, generative networks, recurrent neural networks, etc.). We discuss the synergistic combinations of existing technologies and potential directions for future research, underscoring key deficiencies in the current research landscape.

Model Assembly Learning with Heterogeneous Layer Weight Merging

Yi-Kai Zhang,Jin Wang,Xu-Xiang Zhong,De-Chuan Zhan,Han-Jia Ye

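Task: Propose Model Assembly Learning (MAL), a new paradigm for merging the parameters of pre-trained models with heterogeneous architectures.

Motivation: Enhance the capabilities of a base model by iteratively integrating parameters from diverse models in a model zoo, without extra data or training.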

Details

Method: MAL is introduced to support merging of heterogeneous architectures and selective parameters across layers, and the conditions and settings for merging are investigated systematically. Result: Key laws for heterogeneous parameter merging are established and practical guidelines are provided. Conclusion: MAL offers a flexible and efficient new approach to model merging that applies to heterogeneous architectures. Abstract: Model merging acquires general capabilities without extra data or training by combining multiple models' parameters. Previous approaches achieve linear mode connectivity by aligning parameters into the same loss basin using permutation invariance. In this paper, we introduce Model Assembly Learning (MAL), a novel paradigm for model merging that iteratively integrates parameters from diverse models in an open-ended model zoo to enhance the base model's capabilities. Unlike previous works that require identical architectures, MAL allows the merging of heterogeneous architectures and selective parameters across layers. Specifically, the base model can incorporate parameters from different layers of multiple pre-trained models. We systematically investigate the conditions and fundamental settings of heterogeneous parameter merging, addressing all possible mismatches in layer widths between the base and target models. Furthermore, we establish key laws and provide practical guidelines for effectively implementing MAL.

ICG-MVSNet: Learning Intra-view and Cross-view Relationships for Guidance in Multi-View Stereo

Yuxi Hu,Jun Zhang,Zhe Zhang,Rafael Weilharter,Yuchen Rao,Kuangyi Chen,Runze Yuan,Friedrich Fraundorfer

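Task: Estimate depth and reconstruct 3D point clouds from a series of overlapping images via multi-view stereo (MVS).

Motivation: Current learning-based MVS frameworks overlook the geometric information embedded in features and correlations, leading to weak cost matching.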

Details

Method: ICG-MVSNet explicitly integrates intra-view and cross-view relationships for depth estimation, using an intra-view feature fusion module and a lightweight cross-view aggregation module. Result: The method performs competitively on the DTU dataset and the Tanks and Temples benchmark while requiring fewer computational resources. Conclusion: ICG-MVSNet is competitive with state-of-the-art methods in accuracy while being more efficient. Abstract: Multi-view Stereo (MVS) aims to estimate depth and reconstruct 3D point clouds from a series of overlapping images. Recent learning-based MVS frameworks overlook the geometric information embedded in features and correlations, leading to weak cost matching. In this paper, we propose ICG-MVSNet, which explicitly integrates intra-view and cross-view relationships for depth estimation. Specifically, we develop an intra-view feature fusion module that leverages the feature coordinate correlations within a single image to enhance robust cost matching. Additionally, we introduce a lightweight cross-view aggregation module that efficiently utilizes the contextual information from volume correlations to guide regularization. Our method is evaluated on the DTU dataset and Tanks and Temples benchmark, consistently achieving competitive performance against state-of-the-art works, while requiring lower computational resources.

LLM-Gomoku: A Large Language Model-Based System for Strategic Gomoku with Self-Play and Reinforcement Learning

Hui Wang

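Task: Develop a Gomoku AI system based on large language models (LLMs) that simulates the way humans learn to play.

Motivation: Although LLMs excel at natural language processing, applying them to strategic planning and decision-making in games such as Gomoku remains challenging.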

Details

Method: The model is enabled to "read the board," "understand the rules," "select strategies," and "evaluate positions," and its abilities are improved through self-play and reinforcement learning. Result: The approach significantly improves the selection of move positions, resolves the issue of generating illegal positions, and reduces processing time through parallel position evaluation. Conclusion: After extensive self-play training, the model's Gomoku-playing ability improves notably. Abstract: In recent years, large language models (LLMs) have shown significant advancements in natural language processing (NLP), with strong capabilities in generation, comprehension, and reasoning. These models have found applications in education, intelligent decision-making, and gaming. However, effectively utilizing LLMs for strategic planning and decision-making in the game of Gomoku remains a challenge. This study aims to develop a Gomoku AI system based on LLMs, simulating the human learning process of playing chess. The system is designed to understand and apply Gomoku strategies and logic to make rational decisions. The research methods include enabling the model to "read the board," "understand the rules," "select strategies," and "evaluate positions," while enhancing its abilities through self-play and reinforcement learning. The results demonstrate that this approach significantly improves the selection of move positions, resolves the issue of generating illegal positions, and reduces process time through parallel position evaluation. After extensive self-play training, the model's Gomoku-playing capabilities have been notably enhanced.

LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing

Achint Soni,Meet Soni,Sirisha Rambhatla

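Task: Modify specific regions of an image according to natural language instructions while preserving the overall structure and background fidelity.

Motivation: Existing methods use cross-attention maps generated by diffusion models to identify target regions for modification, but because cross-attention focuses on semantic relevance, they struggle to preserve image integrity, leading to editing artifacts and distortions.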

Details

Method: LOCATEdit enhances cross-attention maps with a graph-based approach that uses self-attention-derived patch relationships to keep attention smooth and coherent across image regions, so that changes stay confined to the designated targets and the surrounding structure is retained. Result: It significantly outperforms existing baselines on PIE-Bench, demonstrating state-of-the-art performance and effectiveness across a variety of editing tasks. Conclusion: By improving the attention mechanism, LOCATEdit addresses the shortcomings of existing methods in spatial consistency and image integrity and achieves higher-quality text-guided image editing. Abstract: Text-guided image editing aims to modify specific regions of an image according to natural language instructions while maintaining the general structure and the background fidelity. Existing methods utilize masks derived from cross-attention maps generated from diffusion models to identify the target regions for modification. However, since cross-attention mechanisms focus on semantic relevance, they struggle to maintain the image integrity. As a result, these methods often lack spatial consistency, leading to editing artifacts and distortions. In this work, we address these limitations and introduce LOCATEdit, which enhances cross-attention maps through a graph-based approach utilizing self-attention-derived patch relationships to maintain smooth, coherent attention across image regions, ensuring that alterations are limited to the designated items while retaining the surrounding structure. LOCATEdit consistently and substantially outperforms existing baselines on PIE-Bench, demonstrating its state-of-the-art performance and effectiveness on various editing tasks. Code can be found at https://github.com/LOCATEdit/LOCATEdit/
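
As a rough illustration of what graph-Laplacian smoothing of an attention map can look like, the sketch below uses a standard regularization that is assumed here for illustration and is not taken from the paper: given raw attention values a and a symmetric, non-negative patch affinity matrix W (standing in for self-attention-derived relationships), it finds x minimizing ||x - a||^2 + lam * x^T L x with L = D - W, which is the solution of (I + lam * L) x = a. The function name, the Gauss-Seidel solver, and all parameters are hypothetical.

```python
from typing import List


def laplacian_smooth(attn: List[float], weights: List[List[float]],
                     lam: float = 1.0, iters: int = 200) -> List[float]:
    """Solve (I + lam * L) x = attn by Gauss-Seidel iteration, with L = D - W.

    `weights` must be symmetric and non-negative; diagonal entries are ignored.
    The system is strictly diagonally dominant, so the iteration converges.
    """
    n = len(attn)
    degree = [sum(weights[i][j] for j in range(n) if j != i) for i in range(n)]
    x = list(attn)
    for _ in range(iters):
        for i in range(n):
            neighbor_sum = sum(weights[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (attn[i] + lam * neighbor_sum) / (1.0 + lam * degree[i])
    return x


# Four patches in a chain: 0-1 and 2-3 are strongly related, 1-2 only weakly.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.01, 0.0],
    [0.0, 0.01, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
raw_attention = [1.0, 0.0, 0.0, 0.0]  # only patch 0 is attended initially
smoothed = laplacian_smooth(raw_attention, W)
```

In this toy case attention spreads from patch 0 to its strongly related neighbor 1 but barely leaks across the weak 1-2 edge, which is the kind of region-coherent behavior the abstract describes.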

Learning to Represent Individual Differences for Choice Decision Making

Yan-Ying Chen,Yue Weng,Alexandre Filipowicz,Rumen Iliev,Francine Chen,Shabnam Hakimi,Yanxia Zhang,Matthew Lee,Kent Lyons,Charlene Wu

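Task: Study how representation learning can measure individual differences from behavioral experiment data in order to predict human decisions.

Motivation: Human decisions are affected by many complex factors and differ considerably between individuals; existing measures (e.g., questionnaires, behavioral models) are usually restricted to low dimensions and not tailored to specific prediction tasks.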

Details

Method: Representation learning creates individual embeddings from structured and unstructured data to measure individual differences flexibly. Result: Models that use representation learning predict decisions better than models without it and even outperform theory-based behavioral models. Conclusion: Representation learning is an effective and flexible tool for capturing individual differences. Abstract: Human decision making can be challenging to predict because decisions are affected by a number of complex factors. Adding to this complexity, decision-making processes can differ considerably between individuals, and methods aimed at predicting human decisions need to take individual differences into account. Behavioral science offers methods by which to measure individual differences (e.g., questionnaires, behavioral models), but these are often narrowed down to low dimensions and not tailored to specific prediction tasks. This paper investigates the use of representation learning to measure individual differences from behavioral experiment data. Representation learning offers a flexible approach to create individual embeddings from data that are both structured (e.g., demographic information) and unstructured (e.g., free text), where the flexibility provides more options for individual difference measures for personalization, e.g., free text responses may allow for open-ended questions that are less privacy-sensitive. In the current paper we use representation learning to characterize individual differences in human performance on an economic decision-making task. We demonstrate that models using representation learning to capture individual differences consistently improve decision predictions over models without representation learning, and even outperform well-known theory-based behavioral models used in these environments. Our results suggest that representation learning offers a useful and flexible tool to capture individual differences.

uLayout: Unified Room Layout Estimation for Perspective and Panoramic Images

Jonathan Lee,Bolivar Solarte,Chin-Hsuan Wu,Jin-Cheng Jhang,Fu-En Wang,Yi-Hsuan Tsai,Min Sun

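Task: Propose uLayout, a unified model for estimating room layout geometry from both perspective and panoramic images.

Motivation: Traditional solutions require a different model design for each image type; uLayout aims to simplify this by handling both image types in one model.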

Details

Method: Both image types are projected into the equirectangular projection, and a shared feature extractor with an extra 1D convolution layer conditions each input domain differently to handle the field-of-view difference. Result: uLayout performs well on several real-world datasets, is competitive with current state-of-the-art methods, and is the first single end-to-end model for both image types. Conclusion: With a simple yet effective approach, uLayout handles perspective and panoramic images in a unified way and offers a new solution for room layout estimation. Abstract: We present uLayout, a unified model for estimating room layout geometries from both perspective and panoramic images, whereas traditional solutions require different model designs for each image type. The key idea of our solution is to unify both domains into the equirectangular projection, in particular by allocating perspective images to the most suitable latitude coordinate to effectively exploit both domains seamlessly. To address the Field-of-View (FoV) difference between the input domains, we design uLayout with a shared feature extractor with an extra 1D-Convolution layer to condition each domain input differently. This conditioning allows us to efficiently formulate a column-wise feature regression problem regardless of the FoV input. This simple yet effective approach achieves competitive performance with current state-of-the-art solutions and shows for the first time a single end-to-end model for both domains. Extensive experiments on the real-world datasets LSUN, Matterport3D, PanoContext, and Stanford 2D-3D demonstrate the contribution of our approach. Code is available at https://github.com/JonathanLee112/uLayout.

Elementwise Layer Normalization

Felix Stollenwerk

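Task: Analyze Dynamic Tanh (DyT), a recently proposed replacement for Layer Normalization, and derive its mathematical foundation.

Motivation: DyT works well empirically but lacks theoretical support, so its justification needs to be derived mathematically.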

Details

Method: DyT is derived mathematically, which shows that an approximation is required; dropping that approximation yields Elementwise Layer Normalization (ELN). Result: ELN resembles the behavior of Layer Normalization more accurately than DyT does. Conclusion: ELN is an alternative that is closer to Layer Normalization and has a theoretical basis. Abstract: A recent paper proposed Dynamic Tanh (DyT) as a drop-in replacement for Layer Normalization. Although the method is empirically well-motivated and appealing from a practical point of view, it lacks a theoretical foundation. In this work, we derive DyT mathematically and show that a well-defined approximation is needed to do so. By dropping said approximation, an alternative element-wise transformation is obtained, which we call Elementwise Layer Normalization (ELN). We demonstrate that ELN resembles Layer Normalization more accurately than DyT does.
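
For readers unfamiliar with the two transformations being compared, the sketch below contrasts standard Layer Normalization with DyT as it is commonly written, gamma * tanh(alpha * x) + beta with a learnable scalar alpha. The parameter defaults are illustrative assumptions. The ELN formula is not stated in this abstract, so it is deliberately not shown.

```python
import math
from typing import List


def layer_norm(x: List[float], gamma: float = 1.0, beta: float = 0.0,
               eps: float = 1e-5) -> List[float]:
    """Standard Layer Normalization over one feature vector.

    Every output depends on the mean and variance of the whole vector.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]


def dyt(x: List[float], alpha: float = 0.5, gamma: float = 1.0,
        beta: float = 0.0) -> List[float]:
    """Dynamic Tanh: a purely element-wise squashing with a scalar alpha.

    Each output depends only on its own input, not on the rest of the vector.
    """
    return [gamma * math.tanh(alpha * v) + beta for v in x]


# Changing the second element alters LayerNorm's first output but not DyT's.
ln_a, ln_b = layer_norm([1.0, 5.0]), layer_norm([1.0, -3.0])
dyt_a, dyt_b = dyt([1.0, 5.0]), dyt([1.0, -3.0])
```

The comparison highlights the structural difference the abstract builds on: Layer Normalization couples all elements through shared statistics, whereas DyT is element-wise.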

Bearing fault diagnosis based on multi-scale spectral images and convolutional neural network

Tongchao Luo,Mingquan Qiu,Zhenyu Wu,Zebo Zhao,Dingyou Zhang

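Task: Propose a new bearing fault diagnosis method based on multi-scale spectral feature images and deep learning.

Motivation: Address the low diagnostic accuracy of traditional bearing fault diagnosis methods.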

Details

Method: Vibration signals are preprocessed by mean removal, converted into multi-length spectra with the fast Fourier transform (FFT), assembled into a multi-scale spectral image (MSSI), and classified with a convolutional neural network (CNN). Result: Experimental results show that the method significantly improves fault diagnosis accuracy. Conclusion: The proposed method is efficient and accurate for bearing fault diagnosis. Abstract: To address the challenges of low diagnostic accuracy in traditional bearing fault diagnosis methods, this paper proposes a novel fault diagnosis approach based on multi-scale spectrum feature images and deep learning. Firstly, the vibration signals are preprocessed through mean removal and then converted to multi-length spectra with fast Fourier transforms (FFT). Secondly, a novel feature called multi-scale spectral image (MSSI) is constructed by a multi-length spectrum paving scheme. Finally, a deep learning framework, convolutional neural network (CNN), is formulated to diagnose the bearing faults. Two experimental cases are utilized to verify the effectiveness of the proposed method. Experimental results demonstrate that the proposed method significantly improves the accuracy of fault diagnosis.
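
The first two preprocessing steps named in this abstract, mean removal and spectra computed at several lengths, can be sketched as follows. This is an assumption-laden illustration: it uses a naive discrete Fourier transform from the standard library in place of an optimized FFT, the segment lengths are hypothetical, and the paper's "paving scheme" for arranging the spectra into an MSSI is not described in the abstract and is not implemented here.

```python
import cmath
import math
from typing import Dict, List, Sequence


def remove_mean(signal: Sequence[float]) -> List[float]:
    """Subtract the signal mean so the DC component does not dominate the spectrum."""
    mean = sum(signal) / len(signal)
    return [s - mean for s in signal]


def magnitude_spectrum(segment: Sequence[float]) -> List[float]:
    """One-sided magnitude spectrum via a naive O(n^2) DFT (stand-in for an FFT)."""
    n = len(segment)
    spectrum = []
    for k in range(n // 2):
        acc = sum(segment[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) / n)
    return spectrum


def multi_length_spectra(signal: Sequence[float],
                         lengths: Sequence[int]) -> Dict[int, List[float]]:
    """Compute a spectrum from the first n samples for each requested length n."""
    centered = remove_mean(signal)
    return {n: magnitude_spectrum(centered[:n]) for n in lengths if n <= len(centered)}


# Hypothetical vibration signal: a constant offset plus one sinusoid (4 cycles per 32 samples).
signal = [1.0 + math.sin(2 * math.pi * 4 * t / 32) for t in range(64)]
spectra = multi_length_spectra(signal, [16, 32, 64])
```

Each length gives a different frequency resolution for the same signal; in the paper these spectra would then be arranged into an image for the CNN.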

GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics

Arsham Gholamzadeh Khoee,Shuai Wang,Yinan Yu,Robert Feldt,Dhasarathy Parthasarathy

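Task: Develop GateLens, an LLM-based tool for analyzing tabular data in the automotive domain to support software release decisions.

Motivation: Traditional manual analysis is slow and costly in safety-critical domains such as automotive systems, while existing LLMs have limitations in analytical reasoning, contextual understanding, and related areas.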

Details

Method: GateLens translates natural language queries into relational algebra expressions and then generates optimized Python code. Result: GateLens outperforms the baseline system on benchmarking datasets, with higher F1 scores and more robust handling of complex and ambiguous queries; industrial evaluation shows analysis time reduced by over 80%. Conclusion: By automating test result analysis, GateLens enables faster, more reliable, and better-informed release decisions, improving the scalability and reliability of automotive software. Abstract: Ensuring the reliability and effectiveness of software release decisions is critical, particularly in safety-critical domains like automotive systems. Precise analysis of release validation data, often presented in tabular form, plays a pivotal role in this process. However, traditional methods that rely on manual analysis of extensive test datasets and validation metrics are prone to delays and high costs. Large Language Models (LLMs) offer a promising alternative but face challenges in analytical reasoning, contextual understanding, handling out-of-scope queries, and processing structured test data consistently, limitations that hinder their direct application in safety-critical scenarios. This paper introduces GateLens, an LLM-based tool for analyzing tabular data in the automotive domain. GateLens translates natural language queries into Relational Algebra (RA) expressions and then generates optimized Python code. It outperforms the baseline system on benchmarking datasets, achieving higher F1 scores and handling complex and ambiguous queries with greater robustness. Ablation studies confirm the critical role of the RA module, with performance dropping sharply when omitted. Industrial evaluations reveal that GateLens reduces analysis time by over 80% while maintaining high accuracy and reliability. As the presented results demonstrate, GateLens achieved high performance without relying on few-shot examples, showcasing strong generalization across various query types from diverse company roles. Insights from deploying GateLens with a partner automotive company offer practical guidance for integrating AI into critical workflows such as release validation. Results show that by automating test result analysis, GateLens enables faster, more informed, and dependable release decisions, and can thus advance software scalability and reliability in automotive systems.
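
To illustrate what an intermediate relational algebra step buys, the sketch below expresses a question as a selection followed by a projection and evaluates it over a small table. Everything here is hypothetical: the table schema, the operator helpers, and the query are invented for illustration, and the abstract does not describe GateLens's actual RA representation or generated code.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def select(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Relational selection (sigma): keep rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows: List[Row], columns: Sequence[str]) -> List[Row]:
    """Relational projection (pi): keep the named columns and drop duplicate rows."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append({c: row[c] for c in columns})
    return result


# Hypothetical release-validation table.
results = [
    {"test_id": "T1", "component": "brakes", "status": "pass"},
    {"test_id": "T2", "component": "brakes", "status": "fail"},
    {"test_id": "T3", "component": "infotainment", "status": "fail"},
    {"test_id": "T4", "component": "brakes", "status": "fail"},
]

# "Which components have failing tests?" written as
# pi_component(sigma_{status = 'fail'}(results)).
failing_components = project(
    select(results, lambda r: r["status"] == "fail"), ["component"]
)
```

Writing the query as explicit operators first makes the intended semantics checkable before any code is generated, which is one plausible reading of why the abstract reports the RA module as critical.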

AlignDiff: Learning Physically-Grounded Camera Alignment via Diffusion

Liuyue Xie,Jiancong Guo,Ozan Cakmakci,Andre Araujo,Laszlo A. Jeni,Zhiheng Jia

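Task: Propose a new framework that addresses camera calibration under complex optical distortions by jointly modeling camera intrinsic and extrinsic parameters.

Motivation: Existing methods rely on pre-rectified images or calibration patterns, which limits their applicability and flexibility.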

Details

Method: AlignDiff, a diffusion model conditioned on geometric priors, is combined with edge-aware attention and a large database of ray-traced lenses. Result: The angular error of estimated ray bundles is reduced by about 8.2 degrees, and the method outperforms existing approaches on real-world datasets. Conclusion: Through geometric feature modeling and a large lens database, AlignDiff improves the accuracy and generalization of camera calibration. Abstract: Accurate camera calibration is a fundamental task for 3D perception, especially when dealing with real-world, in-the-wild environments where complex optical distortions are common. Existing methods often rely on pre-rectified images or calibration patterns, which limits their applicability and flexibility. In this work, we introduce a novel framework that addresses these challenges by jointly modeling camera intrinsic and extrinsic parameters using a generic ray camera model. Unlike previous approaches, AlignDiff shifts focus from semantic to geometric features, enabling more accurate modeling of local distortions. We propose AlignDiff, a diffusion model conditioned on geometric priors, enabling the simultaneous estimation of camera distortions and scene geometry. To enhance distortion prediction, we incorporate edge-aware attention, focusing the model on geometric features around image edges, rather than semantic content. Furthermore, to enhance generalizability to real-world captures, we incorporate a large database of ray-traced lenses containing over three thousand samples. This database characterizes the distortion inherent in a diverse variety of lens forms. Our experiments demonstrate that the proposed method significantly reduces the angular error of estimated ray bundles by ~8.2 degrees and improves overall calibration accuracy, outperforming existing approaches on challenging, real-world datasets.

Ziyu Guo,Young Yoon Lee,Joseph Liu,Yizhak Ben-Shabat,Victor Zordan,Mubbasir Kapadia

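Task: Propose StyleMotif, a novel Stylized Motion Latent Diffusion model that generates motion conditioned on multi-modal content and style.

Motivation: Existing approaches focus either on generating diverse motion content or on transferring style from sequences, whereas StyleMotif aims to synthesize motion across a wide range of content while incorporating stylistic cues from multi-modal inputs such as motion, text, image, video, and audio.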

Details

Method: A style-content cross fusion mechanism is introduced and a style encoder is aligned with a pre-trained multi-modal model, so that generated motion accurately captures the reference style while remaining realistic. Result: Experiments show that the framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Conclusion: StyleMotif achieves high-quality stylized motion generation from multi-modal inputs and has broad application potential. Abstract: We present StyleMotif, a novel Stylized Motion Latent Diffusion model, generating motion conditioned on both content and style from multiple modalities. Unlike existing approaches that either focus on generating diverse motion content or transferring style from sequences, StyleMotif seamlessly synthesizes motion across a wide range of content while incorporating stylistic cues from multi-modal inputs, including motion, text, image, video, and audio. To achieve this, we introduce a style-content cross fusion mechanism and align a style encoder with a pre-trained multi-modal model, ensuring that the generated motion accurately captures the reference style while preserving realism. Extensive experiments demonstrate that our framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Source code and pre-trained models will be released upon acceptance. Project Page: https://stylemotif.github.io

FusionSegReID: Advancing Person Re-Identification with Multimodal Retrieval and Precise Segmentation

Jincheng Yan,Yun Wang,Xiaoyan Luo,Yu-Wing Tai

Task: Propose FusionSegReID, a multimodal model that combines image and text inputs to improve person re-identification (ReID) performance.

Motivation: Traditional ReID methods rely on a single modality (typically images) and struggle with occlusions, lighting changes, and pose variations; the fusion of multiple modalities remains under-explored.

Details

Method: Develops the FusionSegReID model, which combines the complementary strengths of the image and text modalities to improve matching accuracy and robustness. Result: Experiments show significant gains in Top-1 accuracy and mAP, with better results in complex scenarios such as occlusion and low-quality images. Conclusion: FusionSegReID outperforms traditional unimodal models and offers a more robust and flexible solution for real-world ReID tasks. Abstract: Person re-identification (ReID) plays a critical role in applications like security surveillance and criminal investigations by matching individuals across large image galleries captured by non-overlapping cameras. Traditional ReID methods rely on unimodal inputs, typically images, but face limitations due to challenges like occlusions, lighting changes, and pose variations. While advancements in image-based and text-based ReID systems have been made, the integration of both modalities has remained under-explored. This paper presents FusionSegReID, a multimodal model that combines both image and text inputs for enhanced ReID performance. By leveraging the complementary strengths of these modalities, our model improves matching accuracy and robustness, particularly in complex, real-world scenarios where one modality may struggle. Our experiments show significant improvements in Top-1 accuracy and mean Average Precision (mAP) for ReID, as well as better segmentation results in challenging scenarios like occlusion and low-quality images. Ablation studies further confirm that multimodal fusion and segmentation modules contribute to enhanced re-identification and mask accuracy. The results show that FusionSegReID outperforms traditional unimodal models, offering a more robust and flexible solution for real-world person ReID tasks.
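
The abstract above reports gains in Top-1 accuracy and mean Average Precision (mAP). As a reading aid, here is a minimal sketch of how these two retrieval metrics are commonly computed from ranked gallery lists. The function names and the toy data are hypothetical, and common ReID protocol details (for example, filtering same-camera matches) are left out; this is not the paper's evaluation code.

```python
def average_precision(query_id, ranked_gallery_ids):
    """Average precision for one query: mean of precision@k at every correct match."""
    hits = 0
    precisions = []
    for rank, gallery_id in enumerate(ranked_gallery_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / rank)
    # A query with no correct match in the gallery contributes 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def top1_and_map(rankings):
    """rankings: list of (query_id, gallery ids sorted from best to worst match)."""
    n = len(rankings)
    top1_hits = sum(1 for qid, ranked in rankings if ranked and ranked[0] == qid)
    aps = [average_precision(qid, ranked) for qid, ranked in rankings]
    return top1_hits / n, sum(aps) / n


if __name__ == "__main__":
    # Hypothetical toy rankings for two queries.
    toy = [("a", ["a", "b", "a"]), ("b", ["a", "b"])]
    print(top1_and_map(toy))
```

Top-1 only checks the first retrieved item, while mAP rewards ranking every correct match early, which is why the two are usually reported together.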

Audio-driven Gesture Generation via Deviation Feature in the Latent Space

Jiahui Chen,Yang Huan,Runhua Shi,Chanfan Ding,Xiaoqi Mo,Siyu Xiong,Yinong He

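Task: Propose a weakly supervised framework for generating co-speech gesture videos.

Motivation: Gestures play an important role in enhancing spoken communication, but existing methods mostly focus on point-level motion or fully supervised learning and do not explore pixel-level motion deviations.

Details

Method: Uses a diffusion model to integrate latent motion features and learns latent representation deviations through weak supervision, producing more precise and nuanced gestures. Result: Experiments show that the method significantly improves video quality and surpasses current state-of-the-art techniques. Conclusion: Weakly supervised learning combined with latent-space deviations can effectively generate realistic hand gestures and mouth movements, offering a new approach to video generation. Abstract: Gestures are essential for enhancing co-speech communication, offering visual emphasis and complementing verbal interactions. While prior work has concentrated on point-level motion or fully supervised data-driven methods, we focus on co-speech gestures, advocating for weakly supervised learning and pixel-level motion deviations. We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation. Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation. By leveraging weakly supervised deviations in latent space, we effectively generate hand gestures and mouth movements, crucial for realistic video production. Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques.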

The MVTec AD 2 Dataset: Advanced Scenarios for Unsupervised Anomaly Detection

Lars Heckler-Kram,Jan-Hendrik Neudeck,Ulla Scheler,Rebecca König,Carsten Steger

Task: Present and evaluate the MVTec AD 2 dataset to address performance saturation on existing anomaly detection benchmarks.

Motivation: Performance on existing anomaly detection benchmarks (such as MVTec AD and VisA) is close to saturation in terms of segmentation AU-PRO, which makes model comparison difficult and hinders progress in the field.

Details

Method: Collects more than 8000 high-resolution images across eight anomaly detection scenarios, covering challenging industrial inspection use cases not considered in previous datasets, such as transparent objects, overlapping objects, and extremely small defects. Result: State-of-the-art methods remain below 60% average AU-PRO; the dataset also provides test scenarios with lighting condition changes to assess robustness. Conclusion: MVTec AD 2 gives the anomaly detection field a more challenging and diverse benchmark, supporting model comparison and progress in the field. Abstract: In recent years, performance on existing anomaly detection benchmarks like MVTec AD and VisA has started to saturate in terms of segmentation AU-PRO, with state-of-the-art models often competing in the range of less than one percentage point. This lack of discriminatory power prevents a meaningful comparison of models and thus hinders progress of the field, especially when considering the inherent stochastic nature of machine learning results. We present MVTec AD 2, a collection of eight anomaly detection scenarios with more than 8000 high-resolution images. It comprises challenging and highly relevant industrial inspection use cases that have not been considered in previous datasets, including transparent and overlapping objects, dark-field and back light illumination, objects with high variance in the normal data, and extremely small defects. We provide comprehensive evaluations of state-of-the-art methods and show that their performance remains below 60% average AU-PRO. Additionally, our dataset provides test scenarios with lighting condition changes to assess the robustness of methods under real-world distribution shifts. We host a publicly accessible evaluation server that holds the pixel-precise ground truth of the test set (https://benchmark.mvtec.com/). All image data is available at https://www.mvtec.com/company/research/datasets/mvtec-ad-2.

InteractionMap: Improving Online Vectorized HDMap Construction with Interaction

Kuang Wu,Chuan Yang,Zhanbin Li

Task: Improve DETR-based HD map vectorization methods by fully leveraging local-to-global information interaction in both time and space.

Motivation: Current HD map vectorization methods mainly rely on DETR-like frameworks but do not fully exploit local-to-global information interaction, which limits their performance.

Details

Method: Proposes InteractionMap, which includes an explicit position relation prior, a key-frame-based hierarchical temporal fusion module, and a geometric-aware classification loss and matching cost. Result: Achieves state-of-the-art performance on the nuScenes and Argoverse2 benchmarks. Conclusion: InteractionMap significantly improves HD map vectorization through local-to-global information interaction. Abstract: Vectorized high-definition (HD) maps are essential for an autonomous driving system. Recently, state-of-the-art map vectorization methods have mainly been based on DETR-like frameworks that generate HD maps in an end-to-end manner. In this paper, we propose InteractionMap, which improves previous map vectorization methods by fully leveraging local-to-global information interaction in both time and space. Firstly, we explore enhancing DETR-like detectors with an explicit position relation prior from point-level to instance-level, since map elements contain strong shape priors. Secondly, we propose a key-frame-based hierarchical temporal fusion module, which interacts temporal information from local to global. Lastly, the separate classification branch and regression branch lead to the problem of misalignment in the output distribution. We interact semantic information with geometric information by introducing a novel geometric-aware classification loss in optimization and a geometric-aware matching cost in label assignment. InteractionMap achieves state-of-the-art performance on both nuScenes and Argoverse2 benchmarks.

CMED: A Child Micro-Expression Dataset

Nikin Matharaarachchi,Muhammad Fermi Pasha,Sonya Coleman,Kah Peng Wong

Task: Study methods for detecting micro-expressions in children and build the first child micro-expression dataset.

Motivation: Existing micro-expression research focuses on adults, whose expressions differ in their characteristics from those of children; datasets and studies for children are lacking.

Details

Method: Captures spontaneous child micro-expression videos using video conferencing software to build the first child micro-expression dataset, then applies hand-crafted and learning-based approaches for detection. Result: Builds the first child micro-expression dataset, explores key differences between adult and child micro-expressions, and establishes a baseline for automated spotting and recognition of micro-expressions in children. Conclusion: The study fills a gap in child micro-expression research and provides an important tool for psychotherapy. Abstract: Micro-expressions are short bursts of emotion that are difficult to hide. Their detection in children is an important cue to assist psychotherapists in conducting better therapy. However, existing research on the detection of micro-expressions has focused on adults, whose expressions differ in their characteristics from those of children. The lack of research is a direct consequence of the lack of a child-based micro-expressions dataset as it is much more challenging to capture children's facial expressions due to the lack of predictability and controllability. This study compiles a dataset of spontaneous child micro-expression videos, the first of its kind, to the best of the authors' knowledge. The dataset is captured in the wild using video conferencing software. This dataset enables us to then explore key features and differences between adult and child micro-expressions. This study also establishes a baseline for the automated spotting and recognition of micro-expressions in children using three approaches comprising hand-created and learning-based approaches.

RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a Millisecond

Daniel Bermuth,Alexander Poeppel,Wolfgang Reif

Task: Propose a new algorithm that improves multi-view multi-person pose estimation, focusing on fast triangulation speeds and good generalization.

Motivation: The integration of multi-view imaging and pose estimation has brought significant advances to computer vision applications and opens new possibilities for understanding human movement and interactions.

Details

Method: Extends to whole-body pose estimation, capturing details from facial expressions to finger movements across multiple individuals and viewpoints. Result: Shows strong performance on unseen datasets and configurations, demonstrating its adaptability. Conclusion: All of the work is publicly available to support further progress in the field. Abstract: The integration of multi-view imaging and pose estimation represents a significant advance in computer vision applications, offering new possibilities for understanding human movement and interactions. This work presents a new algorithm that improves multi-view multi-person pose estimation, focusing on fast triangulation speeds and good generalization capabilities. The approach extends to whole-body pose estimation, capturing details from facial expressions to finger movements across multiple individuals and viewpoints. Adaptability to different settings is demonstrated through strong performance across unseen datasets and configurations. To support further progress in this field, all of this work is publicly accessible.
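
The abstract does not describe the triangulation algorithm itself. For readers unfamiliar with the underlying step, the sketch below shows a generic textbook midpoint triangulation of one joint from two camera rays. The function name and the ray-based input format are assumptions made for illustration; this is not the authors' method.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Return the 3D point midway between the closest points of two rays.

    Each ray is c + t * d, where c is a camera center and d points toward
    the back-projected 2D joint detection (hypothetical input format).
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w = [a - b for a, b in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; depth is not observable")
    # Ray parameters of the mutually closest points on each ray.
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom
    p1 = [ci + t1 * di for ci, di in zip(c1, d1)]
    p2 = [ci + t2 * di for ci, di in zip(c2, d2)]
    return [(x + y) / 2 for x, y in zip(p1, p2)]


if __name__ == "__main__":
    # Two hypothetical cameras one unit apart, both looking at the same joint.
    print(triangulate_midpoint([0, 0, 0], [1, 2, 5], [1, 0, 0], [0, 2, 5]))
```

With more than two views, a least-squares formulation over all rays is the usual generalization; the two-ray case is shown only because it fits in a few lines.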

AMA-SAM: Adversarial Multi-Domain Alignment of Segment Anything Model for High-Fidelity Histology Nuclei Segmentation

Jiahe Qian,Yaoyu Fang,Jinkui Hao,Bo Zhou

Task: Extend the Segment Anything Model (SAM) to address domain shift in multi-dataset learning for cell nuclei segmentation.

Motivation: Existing nuclei segmentation methods consider only a single dataset and neglect auxiliary-domain data that could reduce overfitting and improve performance, while multi-dataset learning can worsen the performance drop caused by domain shift.

Details

Method: Proposes Adversarial Multi-domain Alignment of Segment Anything Model (AMA-SAM), which includes a Conditional Gradient Reversal Layer (CGRL) for multi-domain alignment and a High-Resolution Decoder (HR-Decoder) for producing high-resolution segmentation maps. Result: Validated on several public datasets, where it outperforms existing methods. Conclusion: AMA-SAM is the first method to adapt SAM to multi-dataset learning for nuclei segmentation, and it significantly improves performance. Abstract: Accurate segmentation of cell nuclei in histopathology images is essential for numerous biomedical research and clinical applications. However, existing cell nucleus segmentation methods only consider a single dataset (i.e., primary domain), while neglecting to leverage supplementary data from diverse sources (i.e., auxiliary domains) to reduce overfitting and enhance the performance. Although incorporating multiple datasets could alleviate overfitting, it often exacerbates performance drops caused by domain shifts. In this work, we introduce Adversarial Multi-domain Alignment of Segment Anything Model (AMA-SAM) that extends the Segment Anything Model (SAM) to overcome these obstacles through two key innovations. First, we propose a Conditional Gradient Reversal Layer (CGRL), a multi-domain alignment module that harmonizes features from diverse domains to promote domain-invariant representation learning while preserving crucial discriminative features for the primary dataset. Second, we address SAM's inherent low-resolution output by designing a High-Resolution Decoder (HR-Decoder), which directly produces fine-grained segmentation maps in order to capture intricate nuclei boundaries in high-resolution histology images. To the best of our knowledge, this is the first attempt to adapt SAM for multi-dataset learning with application to histology nuclei segmentation. We validate our method on several publicly available datasets, demonstrating consistent and significant improvements over state-of-the-art approaches.

Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance

Jaywon Koo,Jefferson Hernandez,Moayed Haji-Ali,Ziyan Yang,Vicente Ordonez

Task: Propose cFreD, an evaluation metric for text-to-image synthesis based on a conditional Fréchet distance.

Motivation: Existing metrics (such as IS, FID, and CLIPScore) assess either image quality or image-text alignment but not both, which limits their correlation with human preferences.

Details

Method: cFreD uses a conditional Fréchet distance to measure visual fidelity and text-prompt alignment at the same time. Result: Experiments show that cFreD correlates better with human judgments than statistical metrics and metrics trained on human preferences. Conclusion: cFreD is a robust, future-proof metric for evaluating text-to-image models in a rapidly evolving field. Abstract: Evaluating text-to-image synthesis is challenging due to misalignment between established metrics and human preferences. We propose cFreD, a metric based on the notion of Conditional Fréchet Distance that explicitly accounts for both visual fidelity and text-prompt alignment. Existing metrics such as Inception Score (IS), Fréchet Inception Distance (FID) and CLIPScore assess either image quality or image-text alignment but not both, which limits their correlation with human preferences. Scoring models explicitly trained to replicate human preferences require constant updates and may not generalize to novel generation techniques or out-of-domain inputs. Through extensive experiments across multiple recently proposed text-to-image models and diverse prompt datasets, we demonstrate that cFreD exhibits a higher correlation with human judgments compared to statistical metrics, including metrics trained with human preferences. Our findings validate cFreD as a robust, future-proof metric for the systematic evaluation of text-to-image models, standardizing benchmarking in this rapidly evolving field. We release our evaluation toolkit and benchmark in the appendix.
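
The abstract does not spell out how the conditional distance is computed. As background only, the sketch below evaluates the standard (unconditional) squared Fréchet distance between two Gaussians, the quantity FID is built on, restricted to diagonal covariances so that it stays dependency-free. The function name and the diagonal simplification are assumptions; this is not the cFreD definition.

```python
import math


def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Fréchet distance between N(mu1, diag(var1)) and N(mu2, diag(var2)).

    General form: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)).
    With diagonal covariances the matrix square root reduces to
    elementwise sqrt(var1 * var2).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term


if __name__ == "__main__":
    # Hypothetical feature statistics for real and generated images.
    print(frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0]))
```

A conditional variant would compare image-feature distributions given the prompt rather than marginally; how cFreD does this is specified in the paper, not here.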

OccRobNet: Occlusion Robust Network for Accurate 3D Interacting Hand-Object Pose Estimation

Mallika Garg,Debashis Ghosh,Pyari Mohan Pradhan

Task: Propose a robust and accurate method for estimating 3D hand-object pose from RGB images.

Motivation: Occlusion is a challenging problem in 3D hand pose estimation, especially when a hand interacts with an object or two hands interact. Past work has paid little attention to occluded regions, even though they contain important information.

Details

Method: First localizes the hand joints with a CNN-based model, then refines them by extracting contextual information. A self-attention transformer identifies the specific joints along with the hand identity, and a cross-attention mechanism is used for pose estimation. Result: Achieves state-of-the-art results on the InterHand2.6M, HO3D and H$_2$O3D datasets. Conclusion: By identifying joints in occluded regions, the method is robust to occlusion and achieves highly accurate 3D hand pose estimation. Abstract: Occlusion is one of the challenging issues when estimating 3D hand pose. This problem becomes more prominent when a hand interacts with an object or two hands are involved. Past works have paid little attention to these occluded regions, yet these regions contain important information that is vital for 3D hand pose estimation. Thus, in this paper, we propose an occlusion-robust and accurate method for the estimation of 3D hand-object pose from the input RGB image. Our method first localises the hand joints using a CNN-based model and then refines them by extracting contextual information. The self-attention transformer then identifies the specific joints along with the hand identity. This helps the model to identify which hand a particular joint belongs to, which helps to detect the joint even in the occluded region. Further, these joints with hand identity are then used to estimate the pose using a cross-attention mechanism. Thus, by identifying the joints in the occluded region, the obtained network becomes robust to occlusion. Hence, this network achieves state-of-the-art results when evaluated on the InterHand2.6M, HO3D and H$_2$O3D datasets.

SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling

Xianglong He,Zi-Xin Zou,Chia-Hao Chen,Yuan-Chen Guo,Ding Liang,Chun Yuan,Wanli Ouyang,Yan-Pei Cao,Yangguang Li

Task: Propose SparseFlex, a sparse-structured isosurface representation that enables differentiable mesh reconstruction at high resolution (up to 1024^3) directly from rendering losses.

Motivation: Existing implicit field methods require costly and detail-degrading watertight conversion, while other approaches struggle with high resolutions, so a new method is needed that efficiently handles open surfaces and complex interiors.

Details

Method: Combines the accuracy of Flexicubes with a sparse voxel structure and proposes a frustum-aware sectional voxel training strategy that activates only the voxels relevant to rendering, greatly reducing memory consumption. Result: Experiments show state-of-the-art reconstruction accuracy, with about an 82% reduction in Chamfer Distance and about an 88% increase in F-score, and the method can generate high-resolution 3D shapes with arbitrary topology. Conclusion: By supporting high-resolution differentiable mesh reconstruction and generation, SparseFlex significantly advances the state of the art in 3D shape representation and modeling. Abstract: Creating high-fidelity 3D meshes with arbitrary topology, including open surfaces and complex interiors, remains a significant challenge. Existing implicit field methods often require costly and detail-degrading watertight conversion, while other approaches struggle with high resolutions. This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to $1024^3$ directly from rendering losses. SparseFlex combines the accuracy of Flexicubes with a sparse voxel structure, focusing computation on surface-adjacent regions and efficiently handling open surfaces. Crucially, we introduce a frustum-aware sectional voxel training strategy that activates only relevant voxels during rendering, dramatically reducing memory consumption and enabling high-resolution training. This also allows, for the first time, the reconstruction of mesh interiors using only rendering supervision. Building upon this, we demonstrate a complete shape modeling pipeline by training a variational autoencoder (VAE) and a rectified flow transformer for high-quality 3D shape generation. Our experiments show state-of-the-art reconstruction accuracy, with a ~82% reduction in Chamfer Distance and a ~88% increase in F-score compared to previous methods, and demonstrate the generation of high-resolution, detailed 3D shapes with arbitrary topology. By enabling high-resolution, differentiable mesh reconstruction and generation with rendering losses, SparseFlex significantly advances the state-of-the-art in 3D shape representation and modeling.
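
The reconstruction results above are reported in terms of Chamfer Distance. For orientation, the sketch below computes one common variant, the symmetric mean of squared nearest-neighbor distances between two point sets sampled from the surfaces. Conventions differ across papers (squared versus unsquared, sum versus mean), and the abstract does not say which one is used, so the choice here is an assumption, as is the function name.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance using squared Euclidean distances (brute force)."""
    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    def one_way(src, dst):
        # Mean over src of the squared distance to the nearest point in dst.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


if __name__ == "__main__":
    # Hypothetical points sampled from a reconstructed and a reference mesh.
    recon = [(0.0, 0.0, 0.0)]
    reference = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    print(chamfer_distance(recon, reference))
```

The brute-force search is quadratic in the number of points; practical evaluations use a spatial index such as a k-d tree for the nearest-neighbor queries.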

3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models

Yuhan Zhang,Mengchen Zhang,Tong Wu,Tengfei Wang,Gordon Wetzstein,Dahua Lin,Ziwei Liu

Task: Develop an automated evaluation system (3DGen-Score and 3DGen-Eval) that unifies quality evaluation for text-to-3D and image-to-3D generation.

Motivation: 3D generation is advancing rapidly, but automatic evaluation has not stayed aligned with human perception, and a comprehensive preference dataset is missing.

Details

Method: Builds the 3DGen-Arena platform to collect human preference data (3DGen-Bench), then trains a CLIP-based scoring model and an MLLM-based automatic evaluator on it. Result: Experiments show that the scoring model predicts human preferences effectively and correlates better with human rankings than existing metrics. Conclusion: The 3DGen-Bench dataset and automated evaluation system should enable fairer evaluation in 3D generation and advance generative models and their applications. Abstract: 3D generation is experiencing rapid advancements, while the development of 3D evaluation has not kept pace. How to keep automatic evaluation equitably aligned with human perception has become a well-recognized challenge. Recent advances in the field of language and image generation have explored human preferences and showcased respectable fitting ability. However, the 3D domain still lacks such a comprehensive preference dataset over generative models. To mitigate this absence, we develop 3DGen-Arena, an integrated platform in a battle manner. Then, we carefully design diverse text and image prompts and leverage the arena platform to gather human preferences from both public users and expert annotators, resulting in a large-scale multi-dimension human preference dataset 3DGen-Bench. Using this dataset, we further train a CLIP-based scoring model, 3DGen-Score, and an MLLM-based automatic evaluator, 3DGen-Eval. These two models innovatively unify the quality evaluation of text-to-3D and image-to-3D generation, and jointly form our automated evaluation system with their respective strengths. Extensive experiments demonstrate the efficacy of our scoring model in predicting human preferences, exhibiting a superior correlation with human ranks compared to existing metrics. We believe that our 3DGen-Bench dataset and automated evaluation system will foster a more equitable evaluation in the field of 3D generation, further promoting the development of 3D generative models and their downstream applications.

CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

Aniket Didolkar,Andrii Zadaianchuk,Rabiul Awal,Maximilian Seitzer,Efstratios Gavves,Aishwarya Agrawal

Task: Propose a user-controllable object-centric representation learning method in which language descriptions guide the object representations.

Motivation: Existing object-centric models lack controllability and cannot use user input to guide which objects are represented.

Details

Method: Proposes CTRL-O, which binds object representations to language descriptions without requiring mask supervision. Result: Performs strongly on text-to-image generation and visual question answering. Conclusion: CTRL-O makes object representations user-controllable, and its effectiveness is validated on downstream tasks. Abstract: Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success in object discovery in diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. Next, we apply these controllable slot representations on two downstream vision language tasks: text-to-image generation and visual question answering. The proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.

LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis

Shitian Zhao,Qilong Wu,Xinyue Li,Bo Zhang,Ming Li,Qi Qin,Dongyang Liu,Kaipeng Zhang,Hongsheng Li,Yu Qiao,Peng Gao,Bin Fu,Zhen Li

Task: Develop LeX-Art, a suite for high-quality text-image synthesis that aims to improve prompt expressiveness and text rendering fidelity.

Motivation: Close the gap between prompt expressiveness and text rendering fidelity in text-image synthesis.

Details

Method: Follows a data-centric approach: builds a high-quality data synthesis pipeline to curate the LeX-10K dataset, develops the LeX-Enhancer prompt enrichment model, and trains two text-to-image models, LeX-FLUX and LeX-Lumina. Result: LeX-Lumina achieves a 79.81% PNED gain on CreateBench, and LeX-FLUX improves color, positional, and font accuracy by 3.18%, 4.45%, and 3.81%, respectively. Conclusion: LeX-Art reaches state-of-the-art text rendering performance and provides a systematic evaluation method through LeX-Bench and the PNED metric. Abstract: We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on Deepseek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024$\times$1024 images. Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance. To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation. Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). Our codes, models, datasets, and demo are publicly available.
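
The abstract names Pairwise Normalized Edit Distance (PNED) but does not define it. The sketch below shows only the basic building block such a metric presumably rests on: a Levenshtein edit distance between a target string and the rendered (for example, OCR-extracted) string, normalized by the longer length. The pairwise matching step of PNED is not reproduced, and the function names are hypothetical.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(s, t):
    """Edit distance divided by the longer length; 0.0 is identical, 1.0 is fully different."""
    longest = max(len(s), len(t))
    return 0.0 if longest == 0 else levenshtein(s, t) / longest


if __name__ == "__main__":
    # Hypothetical target text versus text read back from a generated image.
    print(normalized_edit_distance("OPEN 24 HOURS", "OPEN 24 HOURZ"))
```

Normalizing by length keeps scores comparable between short and long prompts, which matters when a benchmark mixes single words with full sentences.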

Reconstructing Humans with a Biomechanically Accurate Skeleton

Yan Xia,Xiaowei Zhou,Etienne Vouga,Qixing Huang,Georgios Pavlakos

Task: Reconstruct a biomechanically accurate 3D human model from a single image.

Motivation: Existing methods perform poorly under extreme 3D poses and viewpoints and often violate joint angle limits, leading to unnatural rotations.

Details

Method: Uses a transformer-based model to estimate model parameters from an image and iteratively refines pseudo labels for the training data. Result: Performs competitively on standard benchmarks, significantly outperforms existing methods under extreme poses and viewpoints, and produces more realistic joint rotations. Conclusion: The proposed method is competitive for 3D human mesh recovery and stands out in biomechanical plausibility. Abstract: In this paper, we introduce a method for reconstructing 3D humans from a single image using a biomechanically accurate skeleton model. To achieve this, we train a transformer that takes an image as input and estimates the parameters of the model. Due to the lack of training data for this task, we build a pipeline to produce pseudo ground truth model parameters for single images and implement a training procedure that iteratively refines these pseudo labels. Compared to state-of-the-art methods for 3D human mesh recovery, our model achieves competitive performance on standard benchmarks, while it significantly outperforms them in settings with extreme 3D poses and viewpoints. Additionally, we show that previous reconstruction methods frequently violate joint angle limits, leading to unnatural rotations. In contrast, our approach leverages the biomechanically plausible degrees of freedom, making more realistic joint rotation estimates. We validate our approach across multiple human pose estimation benchmarks. We make the code, models and data available at: https://isshikihugh.github.io/HSMR/

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

Dian Zheng,Ziqi Huang,Hongbo Liu,Kai Zou,Yinan He,Fan Zhang,Yuanhan Zhang,Jingwen He,Wei-Shi Zheng,Yu Qiao,Ziwei Liu

Task: Develop VBench-2.0, a next-generation benchmark for evaluating the intrinsic faithfulness of video generation models.

Motivation: Existing benchmarks mainly assess superficial faithfulness (such as visual realism) and overlook intrinsic faithfulness (such as physical laws and commonsense reasoning), which limits the potential of video generation models in real-world applications.

Details

Method: VBench-2.0 evaluates five key dimensions (Human Fidelity, Controllability, Creativity, Physics, and Commonsense), combining generalist models (such as VLMs and LLMs) with specialist methods (such as anomaly detection), supported by human annotation, to enable automatic evaluation. Result: VBench-2.0 provides a more comprehensive evaluation framework for video generation models, aiming to move them from superficial toward intrinsic faithfulness. Conclusion: By focusing on intrinsic faithfulness, VBench-2.0 sets a new standard for the next generation of video generation models and supports their use in real-world applications. Abstract: Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focuses on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real "world models" through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI-assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities. Tailored for individual dimensions, our evaluation framework integrates generalists such as state-of-the-art VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive annotations to ensure alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models.

Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck

Adrian Bulat,Yassine Ouali,Georgios Tzimiropoulos

Task: Compress the vision tokens of a Large Vision Language Model (LVLM) into a representation suitable for both generative and discriminative tasks.

Motivation: Achieve storage efficiency while keeping the information nearly lossless and usable across multiple tasks.

Details

Method: Proposes Fwd2Bot, which uses a "double-forward pass" training strategy and optimizes the compressed representation with an autoregressive loss and a contrastive loss. Result: Achieves a 2x higher compression rate without compromising generative capabilities and sets a new state of the art on image retrieval and compositionality. Conclusion: The compressed representations produced by Fwd2Bot perform well on both generative and discriminative tasks, showing efficiency and generality. Abstract: In this work, we aim to compress the vision tokens of a Large Vision Language Model (LVLM) into a representation that is simultaneously suitable for (a) generative and (b) discriminative tasks, (c) is nearly lossless, and (d) is storage-efficient. We propose a novel compression approach, called Fwd2Bot, that uses the LVLM itself to compress the visual information in a task-agnostic manner. At the core of Fwd2Bot is a "double-forward pass" training strategy, whereby, during the first forward pass, the LLM (of the LVLM) creates a bottleneck by condensing the visual information into a small number of summary tokens. Then, using the same LLM, the second forward pass processes the language instruction(s) alongside the summary tokens, used as a direct replacement for the image ones. The training signal is provided by two losses: an autoregressive one applied after the second pass that provides a direct optimization objective for compression, and a contrastive loss, applied after the first pass, that further boosts the representation strength, especially for discriminative tasks. The training is further enhanced by stage-specific adapters. We accompany the proposed method with an in-depth ablation study. Overall, Fwd2Bot results in highly-informative compressed representations suitable for both generative and discriminative tasks. For generative tasks, we offer a 2x higher compression rate without compromising the generative capabilities, setting a new state-of-the-art result. For discriminative tasks, we set a new state-of-the-art on image retrieval and compositionality.

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Qi Qin,Le Zhuo,Yi Xin,Ruoyi Du,Zhen Li,Bin Fu,Yiting Lu,Jiakang Yuan,Xinyue Li,Dongyang Liu,Xiangyang Zhu,Manyuan Zhang,Will Beddow,Erwann Millon,Victor Perez,Wenhai Wang,Conghui He,Bo Zhang,Xiaohong Liu,Hongsheng Li,Yu Qiao,Chang Xu,Peng Gao

Task: Introduce Lumina-Image 2.0, an advanced text-to-image generation framework.

Motivation: Improve the quality and efficiency of text-to-image generation through a unified architecture and efficient training strategies.

Details

Method: Adopts a unified architecture (Unified Next-DiT) and a unified captioning system (UniCap), combined with multi-stage progressive training and inference acceleration techniques. Result: Performs strongly on academic benchmarks and public text-to-image arenas with only 2.6B parameters. Conclusion: Lumina-Image 2.0 demonstrates scalability and design efficiency; the code and models are open source. Abstract: We introduce Lumina-Image 2.0, an advanced text-to-image generation framework that achieves significant progress compared to previous work, Lumina-Next. Lumina-Image 2.0 is built upon two key principles: (1) Unification - it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. In addition, since high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), specifically designed for T2I generation tasks. UniCap excels at generating comprehensive and accurate captions, accelerating convergence and enhancing prompt adherence. (2) Efficiency - to improve the efficiency of our proposed model, we develop multi-stage progressive training strategies and introduce inference acceleration techniques without compromising image quality. Extensive evaluations on academic benchmarks and public text-to-image arenas show that Lumina-Image 2.0 delivers strong performance even with only 2.6B parameters, highlighting its scalability and design efficiency. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-Image-2.0.

Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video

David Yifan Yao,Albert J. Zhai,Shenlong Wang

Task: Propose a unified approach (Uni4D) for understanding dynamic scenes from casual videos.

Motivation: Large pretrained vision foundation models (such as vision-language, video depth prediction, motion tracking, and segmentation models) are promising, but training a single model for comprehensive 4D understanding remains challenging.

Details

Method: Introduces Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Result: Shows state-of-the-art performance in dynamic 4D modeling with superior visual quality. Conclusion: Uni4D requires no retraining or fine-tuning, which demonstrates the effectiveness of repurposing visual foundation models for 4D understanding. Abstract: This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.

Exploring the Evolution of Physics Cognition in Video Generation: A Survey

Minghui Lin,Xiang Wang,Yishan Wang,Shu Wang,Fengqi Dai,Pengxiang Ding,Cunxiang Wang,Zhengrong Zuo,Nong Sang,Siteng Huang,Donglin Wang

Task: Systematically summarize architecture designs for physical cognition in video generation and their applications.

Motivation: Current video generation techniques have advanced in visual realism but lack physical consistency, so generated content often violates physical laws; a systematic survey is needed to fill this gap.

Details

Method: Organizes the evolution of physical cognition in video generation from a cognitive science perspective and proposes a three-tier taxonomy (basic schema perception, passive cognition of physical knowledge, active cognition for world simulation) covering state-of-the-art methods, classical paradigms, and benchmarks. Result: Summarizes existing techniques and outlines potential pathways for future research, moving video generation from "visual mimicry" toward "human-like physical comprehension". Conclusion: Through structured review and interdisciplinary analysis, the survey offers directional guidance for developing interpretable, controllable, and physically consistent video generation paradigms. Abstract: Video generation has recently made significant progress, especially with the rapid advancement of diffusion models. Despite this, the deficiencies of these models in physical cognition have gradually received widespread attention: generated content often violates the fundamental laws of physics, falling into the dilemma of "visual realism but physical absurdity". Researchers have increasingly recognized the importance of physical fidelity in video generation and have attempted to integrate heuristic physical cognition, such as motion representations and physical knowledge, into generative systems to simulate real-world dynamic scenarios. Considering the lack of a systematic overview in this field, this survey aims to provide a comprehensive summary of architecture designs and their applications to fill this gap. Specifically, we discuss and organize the evolutionary process of physical cognition in video generation from a cognitive science perspective, while proposing a three-tier taxonomy: 1) basic schema perception for generation, 2) passive cognition of physical knowledge for generation, and 3) active cognition for world simulation, encompassing state-of-the-art methods, classical paradigms, and benchmarks. Subsequently, we emphasize the inherent key challenges in this domain and delineate potential pathways for future research, contributing to advancing the frontiers of discussion in both academia and industry. Through structured review and interdisciplinary analysis, this survey aims to provide directional guidance for developing interpretable, controllable, and physically consistent video generation paradigms, thereby propelling generative models from the stage of "visual mimicry" towards a new phase of "human-like physical comprehension".

Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence

Haolin Liu,Xiaohang Zhan,Zizheng Yan,Zhongjin Luo,Yuxin Wen,Xiaoguang Han

Task: Propose Stable-SCore, a stable registration-based framework for 3D shape correspondence that handles complex scenarios.

Motivation: The currently dominant functional map methods perform poorly in complex scenarios (e.g., non-isometric shape discrepancies), so a more stable approach to shape correspondence estimation is needed.

Details

Method: Re-purposes a foundation model for 2D character correspondence and proposes a novel Semantic Flow Guided Registration approach that uses 2D correspondence to guide mesh deformation. Result: Stable-SCore significantly outperforms existing methods in challenging scenarios and shows potential for a wide range of real applications. Conclusion: The framework offers a stable and effective solution to complex shape correspondence problems. Abstract: Establishing character shape correspondence is a critical and fundamental task in computer vision and graphics, with diverse applications including re-topology, attribute transfer, and shape interpolation. Current dominant functional map methods, while effective in controlled scenarios, struggle in real situations with more complex challenges such as non-isometric shape discrepancies. In response, we revisit registration-for-correspondence methods and tap their potential for more stable shape correspondence estimation. To overcome their common issues including unstable deformations and the necessity for careful pre-alignment or high-quality initial 3D correspondences, we introduce Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence. We first re-purpose a foundation model for 2D character correspondence that ensures reliable and stable 2D mappings. Crucially, we propose a novel Semantic Flow Guided Registration approach that leverages 2D correspondence to guide mesh deformations. Our framework significantly surpasses existing methods in challenging scenarios, and brings possibilities for a wide array of real applications, as demonstrated in our results.

Semantic Consistent Language Gaussian Splatting for Point-Level Open-vocabulary Querying

Hairong Yin,Huangying Zhan,Yi Xu,Raymond A. Yeh

Task: Open-vocabulary querying in 3D Gaussian Splatting to identify semantically relevant regions based on text queries.

Motivation: Prior methods like LangSplat and OpenGaussian have limitations in directly querying 3D Gaussians or achieving semantic consistency.

Details

Method: A point-level querying method leveraging SAM2 masklets for semantic ground-truth and a two-step querying approach. Result: Achieves better performance, e.g., +20.42 mIoU improvement on the 3D-OVS dataset. Conclusion: The proposed method outperforms state-of-the-art approaches in open-vocabulary querying. Abstract: Open-vocabulary querying in 3D Gaussian Splatting aims to identify semantically relevant regions within a 3D Gaussian representation based on a given text query. Prior work, such as LangSplat, addressed this task by retrieving these regions in the form of segmentation masks on 2D renderings. More recently, OpenGaussian introduced point-level querying, which directly selects a subset of 3D Gaussians. In this work, we propose a point-level querying method that builds upon LangSplat's framework. Our approach improves the framework in two key ways: (a) we leverage masklets from the Segment Anything Model 2 (SAM2) to establish semantically consistent ground-truth for distilling the language Gaussians; (b) we introduce a novel two-step querying approach that first retrieves the distilled ground-truth and subsequently uses the ground-truth to query the individual Gaussians. Experimental evaluations on three benchmark datasets demonstrate that the proposed method achieves better performance compared to state-of-the-art approaches. For instance, our method achieves an mIoU improvement of +20.42 on the 3D-OVS dataset.

Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting

Anand Bhattad,Konpat Preechakul,Alexei A. Efros

Task: Propose Visual Jenga, a new scene understanding task that reveals the intrinsic relationships between scene elements by progressively removing objects from an image.

Motivation: Inspired by the game Jenga, the work explores how removing objects affects scene coherence, revealing physical and geometric dependencies.

Details

Method: A simple, data-driven, training-free approach that exploits the asymmetry in pairwise relationships between objects and uses a large inpainting model to generate counterfactuals that quantify this asymmetry. Result: The approach performs well on a range of real-world images. Conclusion: The Visual Jenga task offers a new perspective on scene understanding, and the proposed approach demonstrates its potential. Abstract: This paper proposes a novel scene understanding task called Visual Jenga. Drawing inspiration from the game Jenga, the proposed task involves progressively removing objects from a single image until only the background remains. Just as Jenga players must understand structural dependencies to maintain tower stability, our task reveals the intrinsic relationships between scene elements by systematically exploring which objects can be removed while preserving scene coherence in both a physical and a geometric sense. As a starting point for tackling the Visual Jenga task, we propose a simple, data-driven, training-free approach that is surprisingly effective on a range of real-world images. The principle behind our approach is to utilize the asymmetry in the pairwise relationships between objects within a scene and employ a large inpainting model to generate a set of counterfactuals to quantify the asymmetry.

A Unified Image-Dense Annotation Generation Model for Underwater Scenes

Hongkai Lin,Dingkang Liang,Zhenghao Qi,Xiang Bai

Task: Propose TIDE, a unified text-to-image and dense annotation generation method for underwater scenes.

Motivation: High-quality, large-scale underwater datasets with dense annotations are scarce because of the complex environment and the high cost of data collection.

Details

Method: Unifies text-to-image and text-to-dense-annotation generation in a single model, introducing the Implicit Layout Sharing mechanism (ILS) and Time Adaptive Normalization (TAN) to optimize consistency between image and annotations. Result: A dataset synthesized with TIDE validates the method on underwater dense prediction tasks, improving existing models and mitigating data scarcity. Conclusion: TIDE offers a new perspective on alleviating data scarcity in other fields. Abstract: Underwater dense prediction, especially depth estimation and semantic segmentation, is crucial for gaining a comprehensive understanding of underwater scenes. Nevertheless, high-quality and large-scale underwater datasets with dense annotations remain scarce because of the complex environment and the exorbitant data collection costs. This paper proposes a unified Text-to-Image and DEnse annotation generation method (TIDE) for underwater scenes. It relies solely on text as input to simultaneously generate realistic underwater images and multiple highly consistent dense annotations. Specifically, we unify the generation of text-to-image and text-to-dense annotations within a single model. The Implicit Layout Sharing mechanism (ILS) and a cross-modal interaction method called Time Adaptive Normalization (TAN) are introduced to jointly optimize the consistency between image and dense annotations. We synthesize a large-scale underwater dataset using TIDE to validate the effectiveness of our method in underwater dense prediction tasks. The results demonstrate that our method effectively improves the performance of existing underwater dense prediction models and mitigates the scarcity of underwater data with dense annotations. We hope our method can offer new perspectives on alleviating data scarcity issues in other fields. The code is available at https://github.com/HongkLin/TIDE.

LOCORE: Image Re-ranking with Long-Context Sequence Modeling

Zilin Xiao,Pavel Suma,Ayush Sachdeva,Hao-Jen Wang,Giorgos Kordopatis-Zilos,Giorgos Tolias,Vicente Ordonez

Task: Propose LOCORE, a long-context re-ranking model for image retrieval that performs list-wise re-ranking with local descriptors.

Motivation: Existing methods perform either pair-wise similarity estimation with local descriptors or list-wise re-ranking with global descriptors; LOCORE is the first to perform list-wise re-ranking with local descriptors.

Details

Method: Uses efficient long-context sequence models to capture dependencies between the query and gallery images at the local-descriptor level, and processes long shortlists with a sliding window strategy. Result: Outperforms other re-rankers on several image retrieval benchmarks (ROxf, RPar, SOP, In-Shop, and CUB-200) with latency comparable to pair-wise local descriptor re-rankers. Conclusion: LOCORE is an efficient, high-performing image re-ranking method and the first to perform list-wise re-ranking with local descriptors. Abstract: We introduce LOCORE, Long-Context Re-ranker, a model that takes as input local descriptors corresponding to an image query and a list of gallery images and outputs similarity scores between the query and each gallery image. This model is used for image retrieval, where typically a first ranking is performed with an efficient similarity measure, and then a shortlist of top-ranked images is re-ranked based on a more fine-grained similarity measure. Compared to existing methods that perform pair-wise similarity estimation with local descriptors or list-wise re-ranking with global descriptors, LOCORE is the first method to perform list-wise re-ranking with local descriptors. To achieve this, we leverage efficient long-context sequence models to effectively capture the dependencies between query and gallery images at the local-descriptor level. During testing, we process long shortlists with a sliding window strategy that is tailored to overcome the context size limitations of sequence models. Our approach achieves superior performance compared with other re-rankers on established image retrieval benchmarks of landmarks (ROxf and RPar), products (SOP), fashion items (In-Shop), and bird species (CUB-200) while having comparable latency to the pair-wise local descriptor re-rankers.
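
The sliding-window idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the window size, the stride, or how scores from overlapping windows are combined, so the averaging rule below and the names `sliding_window_rerank` and `score_window` are assumptions for illustration, not LOCORE's actual procedure. `score_window` stands in for any list-wise scorer that can only see a limited number of gallery items at once.

```python
from typing import Callable, List, Sequence, Tuple


def sliding_window_rerank(
    query,
    gallery: Sequence,
    score_window: Callable[[object, Sequence], List[float]],
    window: int,
    stride: int,
) -> Tuple[List[int], List[float]]:
    """Re-rank a long shortlist with a scorer limited to `window` items.

    Hypothetical combination rule: items covered by several overlapping
    windows receive the average of their per-window scores.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must satisfy 0 < stride <= window")
    n = len(gallery)
    if n == 0:
        return [], []
    sums = [0.0] * n
    counts = [0] * n
    start = 0
    while True:
        end = min(start + window, n)
        scores = score_window(query, gallery[start:end])
        for offset, s in enumerate(scores):
            sums[start + offset] += s
            counts[start + offset] += 1
        if end == n:  # the last window reached the end of the shortlist
            break
        start += stride
    avg = [s / c for s, c in zip(sums, counts)]
    # Highest average score first.
    order = sorted(range(n), key=lambda i: avg[i], reverse=True)
    return order, avg


# Toy usage: "similarity" is the negative distance between two numbers.
toy_scorer = lambda q, items: [-abs(x - q) for x in items]
order, avg = sliding_window_rerank(4, [6, 1, 9, 3, 4], toy_scorer, window=3, stride=2)
```

Overlapping windows let every gallery item be scored in the context of some neighbors while keeping each call within the scorer's context limit; other combination rules (for example, keeping the maximum score) would fit the same skeleton.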

Optimal Stepsize for Diffusion Sampling

Jianning Pei,Han Hu,Shuyang Gu

Task: Propose Optimal Stepsize Distillation, a dynamic programming framework for designing stepsize schedules in diffusion models.

Motivation: Diffusion models achieve excellent generation quality but suffer from computationally intensive sampling due to suboptimal step discretization.

Details

Method: Extracts theoretically optimal stepsize schedules from reference trajectories with a dynamic programming framework, reformulating stepsize optimization as recursive error minimization. Result: Experiments show 10x faster text-to-image generation while preserving 99.4% of performance on GenEval. Conclusion: The method is robust across architectures, ODE solvers, and noise schedules, and significantly improves the sampling efficiency of diffusion models. Abstract: Diffusion models achieve remarkable generation quality but suffer from computationally intensive sampling due to suboptimal step discretization. While existing works focus on optimizing denoising directions, we address the principled design of stepsize schedules. This paper proposes Optimal Stepsize Distillation, a dynamic programming framework that extracts theoretically optimal schedules by distilling knowledge from reference trajectories. By reformulating stepsize optimization as recursive error minimization, our method guarantees global discretization bounds through optimal substructure exploitation. Crucially, the distilled schedules demonstrate strong robustness across architectures, ODE solvers, and noise schedules. Experiments show 10x accelerated text-to-image generation while preserving 99.4% performance on GenEval. Our code is available at https://github.com/bebebe666/OptimalSteps.
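
To make the "optimal substructure" idea concrete, here is a toy dynamic program that picks a coarse schedule from a fine reference grid. It is a sketch under stated assumptions: the per-jump error table `cost[i][j]` is hypothetical (in practice it would be measured against a reference trajectory), errors are assumed to add up across jumps, and the function name `optimal_schedule` is illustrative. The paper's actual recursion may differ.

```python
import math
from typing import List, Sequence, Tuple


def optimal_schedule(cost: Sequence[Sequence[float]], num_steps: int) -> Tuple[List[int], float]:
    """Choose `num_steps` jumps from grid index 0 to index n with minimal total error.

    cost[i][j] (for i < j) is an assumed error of taking one coarse step
    from fine-grid index i directly to index j.
    """
    n = len(cost) - 1
    if not 1 <= num_steps <= n:
        raise ValueError("num_steps must be between 1 and the number of fine steps")
    # best[k][j]: minimal accumulated error to reach index j with exactly k jumps.
    best = [[math.inf] * (n + 1) for _ in range(num_steps + 1)]
    prev = [[-1] * (n + 1) for _ in range(num_steps + 1)]
    best[0][0] = 0.0
    for k in range(1, num_steps + 1):
        for j in range(1, n + 1):
            for i in range(j):
                candidate = best[k - 1][i] + cost[i][j]
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    prev[k][j] = i
    # Backtrack from the final index to recover the chosen grid indices.
    path = [n]
    j = n
    for k in range(num_steps, 0, -1):
        j = prev[k][j]
        path.append(j)
    return path[::-1], best[num_steps][n]


# Toy usage: the error of a jump grows quadratically with its length.
toy_cost = [[(j - i) ** 2 for j in range(5)] for i in range(5)]
path, total = optimal_schedule(toy_cost, num_steps=2)
```

With a quadratic toy cost the best two-step schedule splits the interval evenly; a measured cost table would generally produce a non-uniform schedule.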

Ziyu Guo,Young Yoon Lee,Joseph Liu,Yizhak Ben-Shabat,Victor Zordan,Mubbasir Kapadia

Task: Propose StyleMotif, a novel Stylized Motion Latent Diffusion model that generates motion conditioned on content and style from multiple modalities.

Motivation: Existing approaches focus either on generating diverse motion content or on transferring style from sequences; StyleMotif aims to seamlessly synthesize motion across a wide range of content while incorporating stylistic cues from multi-modal inputs (motion, text, image, video, and audio).

Details

Method: Introduces a style-content cross fusion mechanism and aligns a style encoder with a pre-trained multi-modal model, so that generated motion accurately captures the reference style while preserving realism. Result: Experiments show that the framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Conclusion: StyleMotif generates high-quality stylized motion, supports multi-modal inputs, and opens new possibilities for motion synthesis. Abstract: We present StyleMotif, a novel Stylized Motion Latent Diffusion model, generating motion conditioned on both content and style from multiple modalities. Unlike existing approaches that either focus on generating diverse motion content or transferring style from sequences, StyleMotif seamlessly synthesizes motion across a wide range of content while incorporating stylistic cues from multi-modal inputs, including motion, text, image, video, and audio. To achieve this, we introduce a style-content cross fusion mechanism and align a style encoder with a pre-trained multi-modal model, ensuring that the generated motion accurately captures the reference style while preserving realism. Extensive experiments demonstrate that our framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Source code and pre-trained models will be released upon acceptance. Project Page: https://stylemotif.github.io

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng,Kaixiong Gong,Bohao Li,Zonghao Guo,Yibing Wang,Tianshuo Peng,Benyou Wang,Xiangyu Yue

Task: Explore the R1 paradigm for eliciting video reasoning in multimodal large language models (MLLMs).

Motivation: The work is inspired by DeepSeek-R1's use of rule-based reinforcement learning (RL) to elicit reasoning, but directly applying RL training (e.g., the GRPO algorithm) to video reasoning faces two main challenges: a lack of temporal modeling and the scarcity of high-quality video-reasoning data.

Details

Method: Proposes the T-GRPO algorithm to exploit temporal information in videos, incorporates high-quality image-reasoning data into training, and constructs two datasets (Video-R1-COT-165k and Video-R1-260k). Result: Video-R1 improves significantly on video reasoning benchmarks (e.g., VideoMMMU and VSI-Bench) and on general video benchmarks (e.g., MVBench and TempCompass); Video-R1-7B reaches 35.8% accuracy on VSI-Bench, surpassing GPT-4o. Conclusion: Video-R1 addresses the challenges of video reasoning and advances the field by releasing its code, models, and data. Abstract: Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for eliciting video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-COT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass. Notably, Video-R1-7B attains a 35.8% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released.

Test-Time Visual In-Context Tuning

Jiahao Xie,Alessio Tonioni,Nathalie Rauschmayr,Federico Tombari,Bernt Schiele

Task: Propose VICT, a method that adapts visual in-context learning (VICL) models at test time to improve their generalization under distribution shifts.

Motivation: The existing VICL paradigm performs poorly under distribution shifts, so a method that quickly adapts to new test samples is needed.

Details

Method: Flips the roles of the task prompts and the test sample and uses a cycle consistency loss to reconstruct the original task prompt output. Result: Across six representative vision tasks and 15 common corruptions, VICT improves the generalizability of VICL to unseen new domains. Conclusion: VICT improves the generalization of VICL and shows potential for adapting to unseen tasks at test time. Abstract: Visual in-context learning (VICL), as a new paradigm in computer vision, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing VICL paradigm exhibits poor generalizability under distribution shifts. In this work, we propose test-time Visual In-Context Tuning (VICT), a method that can adapt VICL models on the fly with a single test sample. Specifically, we flip the role between the task prompts and the test sample and use a cycle consistency loss to reconstruct the original task prompt output. Our key insight is that a model should be aware of a new test distribution if it can successfully recover the original task prompts. Extensive experiments on six representative vision tasks ranging from high-level visual understanding to low-level image processing, with 15 common corruptions, demonstrate that our VICT can improve the generalizability of VICL to unseen new domains. In addition, we show the potential of applying VICT for unseen tasks at test time. Code: https://github.com/Jiahao000/VICT.

HS-SLAM: Hybrid Representation with Structural Supervision for Improved Dense SLAM

Ziren Gong,Fabio Tosi,Youmin Zhang,Stefano Mattoccia,Matteo Poggi

Task: Propose HS-SLAM to address the challenges that NeRF-based SLAM faces in scene representation, structural information capture, and global consistency.

Motivation: Existing methods struggle to provide sufficient scene representation, capture structural information, and maintain global consistency, especially in scenes with significant movement or scenes that are being forgotten.

Details

Method: Uses a hybrid encoding network that combines the strengths of hash-grid, tri-planes, and one-blob; introduces structural supervision by sampling patches of non-local pixels; and runs an active global bundle adjustment (BA) to eliminate camera drift and accumulated error. Result: Experiments show that HS-SLAM outperforms the baselines in tracking and reconstruction accuracy while maintaining the efficiency required for robotics. Conclusion: By improving scene representation and global consistency, HS-SLAM significantly improves the performance of NeRF-based SLAM. Abstract: NeRF-based SLAM has recently achieved promising results in tracking and reconstruction. However, existing methods face challenges in providing sufficient scene representation, capturing structural information, and maintaining global consistency in scenes with significant movement or scenes that are being forgotten. To this end, we present HS-SLAM to tackle these problems. To enhance scene representation capacity, we propose a hybrid encoding network that combines the complementary strengths of hash-grid, tri-planes, and one-blob, improving the completeness and smoothness of reconstruction. Additionally, we introduce structural supervision by sampling patches of non-local pixels rather than individual rays to better capture the scene structure. To ensure global consistency, we implement an active global bundle adjustment (BA) to eliminate camera drifts and mitigate accumulative errors. Experimental results demonstrate that HS-SLAM outperforms the baselines in tracking and reconstruction accuracy while maintaining the efficiency required for robotics.

X$^{2}$-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction

Weihao Yu,Yuanhao Cai,Ruyi Zha,Zhiwen Fan,Chenxin Li,Yixuan Yuan

Task: Propose X²-Gaussian, a novel framework for continuous-time 4D-CT reconstruction.

Motivation: Conventional phase-binning methods for 4D-CT reconstruction suffer from motion misalignment and limited clinical practicality.

Details

Method: Combines dynamic radiative Gaussian splatting with self-supervised respiratory motion learning, predicting time-varying Gaussian deformations with a spatiotemporal encoder-decoder architecture. Result: Experiments show that X²-Gaussian improves PSNR by 9.93 dB over traditional methods and by 2.25 dB over prior Gaussian splatting techniques. Conclusion: By unifying continuous motion modeling with hardware-free period learning, X²-Gaussian advances high-fidelity 4D-CT reconstruction. Abstract: Four-dimensional computed tomography (4D CT) reconstruction is crucial for capturing dynamic anatomical changes but faces inherent limitations from conventional phase-binning workflows. Current methods discretize temporal resolution into fixed phases with respiratory gating devices, introducing motion misalignment and restricting clinical practicality. In this paper, we propose X$^2$-Gaussian, a novel framework that enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning. Our approach models anatomical dynamics through a spatiotemporal encoder-decoder architecture that predicts time-varying Gaussian deformations, eliminating phase discretization. To remove dependency on external gating devices, we introduce a physiology-driven periodic consistency loss that learns patient-specific breathing cycles directly from projections via differentiable optimization. Extensive experiments demonstrate state-of-the-art performance, achieving a 9.93 dB PSNR gain over traditional methods and 2.25 dB improvement against prior Gaussian splatting techniques. By unifying continuous motion modeling with hardware-free period learning, X$^2$-Gaussian advances high-fidelity 4D CT reconstruction for dynamic clinical imaging. Project website at: https://x2-gaussian.github.io/.
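
For readers less used to decibel figures, the reported PSNR gains can be translated into an approximate reduction in mean squared error. The sketch below relies on the standard definition PSNR = 10 log10(MAX² / MSE) and assumes that both methods are evaluated with the same peak value and that an averaged gain can be read as if it applied to a single reconstruction; the function name is illustrative and the conversion is only a rough interpretation aid.

```python
def mse_ratio_from_psnr_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain in dB.

    Follows from PSNR = 10 * log10(MAX**2 / MSE) with a shared peak value MAX:
    a gain of `gain_db` corresponds to MSE_old / MSE_new = 10 ** (gain_db / 10).
    """
    return 10 ** (gain_db / 10)


# Gains reported in the abstract above.
vs_traditional = mse_ratio_from_psnr_gain(9.93)  # roughly a 9.8x lower MSE
vs_prior_gs = mse_ratio_from_psnr_gain(2.25)     # roughly a 1.7x lower MSE
```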

Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation

Reza Qorbani,Gianluca Villani,Theodoros Panagiotakopoulos,Marc Botet Colomer,Linus Härenstam-Nielsen,Mattia Segu,Pier Luigi Dovesi,Jussi Karlgren,Daniel Cremers,Federico Tombari,Matteo Poggi

Task: Propose SemLA, a novel framework for training-free, test-time domain adaptation in open-vocabulary semantic segmentation.

Motivation: Address the performance degradation of open-vocabulary semantic segmentation models under large shifts between training and test domains, without requiring fine-tuning.

Details

Method: Uses a library of LoRA-based adapters indexed with CLIP embeddings and dynamically merges the most relevant adapters to build an ad-hoc model for each input. Result: Shows superior adaptability and performance on a 20-domain benchmark. Conclusion: SemLA sets a new standard for domain adaptation in open-vocabulary semantic segmentation and is well suited to sensitive applications. Abstract: Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries, providing versatile performance on novel datasets. However, large shifts between training and test domains degrade their performance, requiring fine-tuning for effective real-world applications. We introduce Semantic Library Adaptation (SemLA), a novel framework for training-free, test-time domain adaptation. SemLA leverages a library of LoRA-based adapters indexed with CLIP embeddings, dynamically merging the most relevant adapters based on proximity to the target domain in the embedding space. This approach constructs an ad-hoc model tailored to each specific input without additional training. Our method scales efficiently, enhances explainability by tracking adapter contributions, and inherently protects data privacy, making it ideal for sensitive applications. Comprehensive experiments on a 20-domain benchmark built over 10 standard datasets demonstrate SemLA's superior adaptability and performance across diverse settings, establishing a new standard in domain adaptation for open-vocabulary semantic segmentation.
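
The retrieve-then-merge step described in the abstract can be sketched as follows. This is an illustration under assumptions rather than SemLA's implementation: cosine similarity as the proximity measure, a temperature-scaled softmax for the merge weights, and adapters represented as flat lists of weight deltas are all choices made here for brevity, and `retrieve_and_fuse` is a hypothetical name.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_and_fuse(
    query_emb: Sequence[float],
    library: List[Tuple[str, Sequence[float], Sequence[float]]],
    top_k: int = 2,
    temperature: float = 0.1,
) -> Tuple[List[float], Dict[str, float]]:
    """Merge the top_k adapters closest to the query embedding.

    Each library entry is (name, domain_embedding, weight_delta). The weight
    delta is a flat list standing in for a LoRA update (an assumption).
    """
    ranked = sorted(library, key=lambda e: cosine(query_emb, e[1]), reverse=True)[:top_k]
    sims = [cosine(query_emb, e[1]) for e in ranked]
    # Softmax over similarities; subtracting the max keeps exp() stable.
    peak = max(sims)
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(ranked[0][2])
    fused = [sum(w * e[2][d] for w, e in zip(weights, ranked)) for d in range(dim)]
    # Returning per-adapter weights makes each adapter's contribution traceable.
    contributions = {e[0]: w for e, w in zip(ranked, weights)}
    return fused, contributions


# Toy usage with made-up 2-D embeddings and 2-element weight deltas.
toy_library = [
    ("fog", [1.0, 0.0], [1.0, 0.0]),
    ("night", [0.0, 1.0], [0.0, 1.0]),
    ("rain", [1.0, 1.0], [2.0, 2.0]),
]
fused, contributions = retrieve_and_fuse([1.0, 0.0], toy_library)
```

Because the merge weights are returned alongside the fused delta, the sketch also shows how adapter contributions could be tracked for a given input, which the abstract cites as a source of explainability.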

VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models

Chi-Pin Huang,Yen-Siang Wu,Hung-Kai Chung,Kai-Po Chang,Fu-En Yang,Yu-Chiang Frank Wang

Task: Propose VideoMage, a unified framework for video generation customized over both multiple subjects and their interactive motions.

Motivation: Existing methods mainly personalize a single concept (subject identity or motion pattern), which limits their effectiveness for multiple subjects and interactive motions.

Details

Method: Uses subject and motion LoRAs to capture personalized content from user-provided images and videos, an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance, and a spatial-temporal composition scheme to guide interactions among subjects. Result: Experiments show that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions. Conclusion: VideoMage offers an effective solution for video customization over multiple subjects and interactive motions. Abstract: Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose a unified framework VideoMage for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.

Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model

Abdelrahman Shaker,Muhammad Maaz,Chenhui Gou,Hamid Rezatofighi,Salman Khan,Fahad Shahbaz Khan

Task: Propose Mobile-VideoGPT, an efficient multimodal framework that addresses the high computational requirements, large parameter counts, and slow inference of video understanding models.

Motivation: Traditional video large multimodal models (LMMs) consume substantial compute and are hard to deploy in practice, so a more efficient solution is needed.

Details

Method: Uses lightweight dual visual encoders, efficient projectors, and a small language model (SLM), combined with an Attention-Based Frame Scoring mechanism and visual token pruning. Result: Mobile-VideoGPT-0.5B outperforms existing 0.5B-parameter models on multiple video understanding benchmarks by 6 points on average, with 40% fewer parameters and more than 2x higher throughput. Conclusion: Mobile-VideoGPT is an efficient and practical video understanding framework suitable for real-time applications. Abstract: Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism to select the key-frames, along with an efficient token projector that prunes redundant visual tokens and preserves essential contextual cues. We evaluate our model across six well-established video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average with 40% fewer parameters and more than 2x higher throughput. Our code and models are publicly available at: https://github.com/Amshaker/Mobile-VideoGPT.
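
A minimal sketch of attention-based key-frame selection is given below. The abstract does not say how frame scores are derived from attention, so the scoring rule here (average attention per visual token, summed over the tokens of each frame) and both function names are assumptions for illustration only.

```python
from typing import List, Sequence


def frame_scores_from_attention(attn: Sequence[Sequence[float]], tokens_per_frame: int) -> List[float]:
    """Score each frame by the attention its tokens receive.

    attn[q][t] is the attention weight from query q to visual token t, and
    tokens are assumed to be laid out frame by frame.
    """
    num_tokens = len(attn[0])
    num_frames = num_tokens // tokens_per_frame
    # Mean attention each token receives across all queries.
    token_mean = [sum(row[t] for row in attn) / len(attn) for t in range(num_tokens)]
    return [
        sum(token_mean[f * tokens_per_frame:(f + 1) * tokens_per_frame])
        for f in range(num_frames)
    ]


def select_key_frames(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames in temporal order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy usage: 2 queries attending to 3 frames of 2 tokens each.
toy_attn = [
    [0.1, 0.1, 0.3, 0.3, 0.1, 0.1],
    [0.0, 0.2, 0.2, 0.2, 0.2, 0.2],
]
scores = frame_scores_from_attention(toy_attn, tokens_per_frame=2)
kept = select_key_frames(scores, k=2)
```

Selected indices are re-sorted so that the kept frames stay in their original temporal order, which is usually what a downstream language model expects.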

Dynamic Allocation Hypernetwork with Adaptive Model Recalibration for Federated Continual Learning

Xiaoming Qi,Jingyang Zhang,Huazhu Fu,Guanyu Yang,Shuo Li,Yueming Jin

Task: Propose FedDAH, a novel server-side federated continual learning (FCL) pattern for collaborative learning under dynamic and asynchronous task streams in the medical domain.

Motivation: Address the catastrophic forgetting and the biased optimization caused by asynchronous tasks that existing server-side FCL methods face in medical scenarios.

Details

Method: Proposes a dynamic allocation hypernetwork (DAHyper) to alleviate catastrophic forgetting and an adaptive model recalibration (AMR) technique to counter biased optimization. Result: Experiments on the AMOS dataset show that FedDAH outperforms other FCL methods. Conclusion: FedDAH offers an effective solution for dynamic task streams in the medical domain, addressing both catastrophic forgetting and biased optimization. Abstract: Federated continual learning (FCL) offers an emerging pattern to facilitate the applicability of federated learning (FL) in real-world scenarios, where tasks evolve dynamically and asynchronously across clients, especially in medical scenarios. Existing server-side FCL methods in the natural domain construct a continually learnable server model by client aggregation on all-involved tasks. However, they are challenged by: (1) Catastrophic forgetting for previously learned tasks, leading to error accumulation in the server model, making it difficult to sustain comprehensive knowledge across all tasks. (2) Biased optimization due to asynchronous tasks handled across different clients, leading to the collision of optimization targets of different clients at the same time steps. In this work, we take the first step to propose a novel server-side FCL pattern in the medical domain, Dynamic Allocation Hypernetwork with adaptive model recalibration (FedDAH). It facilitates collaborative learning under the distinct and dynamic task streams across clients. To alleviate the catastrophic forgetting, we propose a dynamic allocation hypernetwork (DAHyper) where a continually updated hypernetwork is designed to manage the mapping between task identities and their associated model parameters, enabling the dynamic allocation of the model across clients. For the biased optimization, we introduce a novel adaptive model recalibration (AMR) to incorporate the candidate changes of historical models into current server updates, and assign weights to identical tasks across different time steps based on the similarity for continual optimization. Extensive experiments on the AMOS dataset demonstrate the superiority of FedDAH over other FCL methods on sites with different task streams. The code is available at https://github.com/jinlab-imvr/FedDAH.

Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road Ahead

Viktor Schlegel,Anil A Bharath,Zilong Zhao,Kevin Yee

Task: Examine the state of privacy-preserving synthetic data and its applications and challenges in specialized domains.

Motivation: Address data segregation in high-stakes domains while balancing privacy protection and data utility.

Details

Method: Reviews the theoretical foundations of generative models and differential privacy and evaluates four leading methods on five real-world datasets. Result: Under realistic privacy constraints ($\epsilon \leq 4$), performance degrades significantly, revealing a gap between general-domain benchmarks and domain-specific data. Conclusion: More robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques are needed to meet the unique requirements of privacy-sensitive fields. Abstract: Privacy-preserving synthetic data offers a promising solution to harness segregated data in high-stakes domains where information is compartmentalized for regulatory, privacy, or institutional reasons. This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy followed by a review of state-of-the-art methods across tabular data, images, and text. Our synthesis of evaluation approaches highlights the fundamental trade-off between utility for downstream tasks and privacy guarantees, while identifying critical research gaps: the lack of realistic benchmarks representing specialized domains and insufficient empirical evaluations required to contextualise formal guarantees. Through empirical analysis of four leading methods on five real-world datasets from specialized domains, we demonstrate significant performance degradation under realistic privacy constraints ($\epsilon \leq 4$), revealing a substantial gap between results reported on general domain benchmarks and performance on domain-specific data. Our findings highlight key challenges including unaccounted privacy leakage, insufficient empirical verification of formal guarantees, and a critical deficit of realistic benchmarks. These challenges underscore the need for robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques to address the unique requirements of privacy-sensitive fields such that this technology can deliver on its considerable potential.
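
As a reminder of what the privacy budget $\epsilon$ controls, the sketch below shows the classic Laplace mechanism for a single numeric query. It is not one of the synthetic-data methods evaluated in the survey (those are not detailed in the abstract); it only illustrates, under that caveat, how a smaller $\epsilon$ forces more noise. The function names and the example values are assumptions.

```python
import random


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale of the Laplace mechanism: sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_mechanism(value: float, sensitivity: float, epsilon: float, rng: random.Random) -> float:
    """Release `value` with epsilon-differential privacy via Laplace noise."""
    scale = laplace_scale(sensitivity, epsilon)
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return value + noise


# Toy usage: a counting query (sensitivity 1) released at epsilon = 4 and epsilon = 0.5.
rng = random.Random(0)
scale_loose = laplace_scale(1.0, 4.0)
scale_strict = laplace_scale(1.0, 0.5)
noisy_count = laplace_mechanism(100.0, 1.0, 4.0, rng)
```

Tightening the budget from 4 to 0.5 multiplies the noise scale by eight in this simple setting, which gives some intuition for why utility drops as the constraint becomes stricter.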

CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis

Youngkyoon Jang,Eduardo Pérez-Pellitero

Task: Recover underrepresented sparse regions in sparse novel view synthesis with Covisibility Map-based Gaussian Splatting (CoMapGS).

Motivation: Address both high- and low-uncertainty regions in sparse novel view synthesis to improve reconstruction quality.

Details

Method: Constructs covisibility maps, enhances initial point clouds, and applies uncertainty-aware weighted supervision with a proximity classifier. Result: CoMapGS outperforms existing methods on datasets including Mip-NeRF 360 and LLFF. Conclusion: With covisibility maps and adaptive supervision, CoMapGS significantly improves synthesis quality in sparse regions. Abstract: We propose Covisibility Map-based Gaussian Splatting (CoMapGS), designed to recover underrepresented sparse regions in sparse novel view synthesis. CoMapGS addresses both high- and low-uncertainty regions by constructing covisibility maps, enhancing initial point clouds, and applying uncertainty-aware weighted supervision using a proximity classifier. Our contributions are threefold: (1) CoMapGS reframes novel view synthesis by leveraging covisibility maps as a core component to address region-specific uncertainty; (2) Enhanced initial point clouds for both low- and high-uncertainty regions compensate for sparse COLMAP-derived point clouds, improving reconstruction quality and benefiting few-shot 3DGS methods; (3) Adaptive supervision with covisibility-score-based weighting and proximity classification achieves consistent performance gains across scenes with varying sparsity scores derived from covisibility maps. Experimental results demonstrate that CoMapGS outperforms state-of-the-art methods on datasets including Mip-NeRF 360 and LLFF.

Operating Room Workflow Analysis via Reasoning Segmentation over Digital Twins

Yiqing Shen,Chenjia Li,Bohan Liu,Cheng-Yi Li,Tito Porras,Mathias Unberath

Task: Propose ORDiRS, a reasoning segmentation framework that requires no LLM fine-tuning, together with a digital twin representation, for analyzing operating room workflows.

Motivation: Existing methods rely on end-to-end deep neural networks and lack the flexibility to meet the needs of different operating room scenarios, while current LLM-based reasoning segmentation methods are limited in reasoning about semantic/spatial relationships and in generalization.

Details

Method: Proposes a digital twin representation that preserves semantic and spatial relationships and designs the ORDiRS framework, which reformulates reasoning segmentation into a "reason-retrieval-synthesize" paradigm; also introduces ORDiRS-Agent to decompose queries and generate responses. Result: On an in-house and a public dataset, ORDiRS improves cIoU by 6.12%-9.74% over existing methods. Conclusion: ORDiRS addresses the limitations of existing methods and improves the flexibility and accuracy of operating room workflow analysis. Abstract: Analyzing operating room (OR) workflows to derive quantitative insights into OR efficiency is important for hospitals to maximize patient care and financial sustainability. Prior work on OR-level workflow analysis has relied on end-to-end deep neural networks. While these approaches work well in constrained settings, they are limited to the conditions specified at development time and do not offer the flexibility necessary to accommodate the OR workflow analysis needs of various OR scenarios (e.g., large academic center vs. rural provider) without data collection, annotation, and retraining. Reasoning segmentation (RS) based on foundation models offers this flexibility by enabling automated analysis of OR workflows from OR video feeds given only an implicit text query related to the objects of interest. Due to the reliance on large language model (LLM) fine-tuning, current RS approaches struggle with reasoning about semantic/spatial relationships and show limited generalization to OR video due to variations in visual characteristics and domain-specific terminology. To address these limitations, we first propose a novel digital twin (DT) representation that preserves both semantic and spatial relationships between the various OR components. Then, building on this foundation, we propose ORDiRS (Operating Room Digital twin representation for Reasoning Segmentation), an LLM-tuning-free RS framework that reformulates RS into a "reason-retrieval-synthesize" paradigm. Finally, we present ORDiRS-Agent, an LLM-based agent that decomposes OR workflow analysis queries into manageable RS sub-queries and generates responses by combining detailed textual explanations with supporting visual evidence from RS. Experimental results on both an in-house and a public OR dataset demonstrate that our ORDiRS achieves a cIoU improvement of 6.12%-9.74% compared to the existing state of the art.

ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging

Haoming Xu,Shuxun Wang,Yanqiu Zhao,Yi Zhong,Ziyan Jiang,Ningyuan Zhao,Shumin Deng,Huajun Chen,Ningyu Zhang

Task: Selectively erase sensitive knowledge from large language models while avoiding both over-forgetting and under-forgetting.

Motivation: Address the unlearning of sensitive content in large language models by proposing a more balanced unlearning method.

Details

Method: Uses Model Merging (specifically TIES-Merging) to combine two specialized models into a more balanced unlearned model. Result: Ranked second in SemEval-2025 Task 4, with a Task Aggregate score of 0.944 and an overall Aggregate score of 0.487. Conclusion: Current evaluation metrics (such as MIA scores and ROUGE-based metrics) are insufficient to fully evaluate unlearning; future work needs more comprehensive evaluation methodologies and a rethinking of unlearning objectives. Abstract: This paper presents the ZJUKLAB team's submission for SemEval-2025 Task 4: Unlearning Sensitive Content from Large Language Models. This task aims to selectively erase sensitive knowledge from large language models, avoiding both over-forgetting and under-forgetting issues. We propose an unlearning system that leverages Model Merging (specifically TIES-Merging), combining two specialized models into a more balanced unlearned model. Our system achieves competitive results, ranking second among 26 teams, with an online score of 0.944 for Task Aggregate and 0.487 for overall Aggregate. In this paper, we also conduct local experiments and perform a comprehensive analysis of the unlearning process, examining performance trajectories, loss dynamics, and weight perspectives, along with several supplementary experiments, to understand the effectiveness of our method. Furthermore, we analyze the shortcomings of our method and evaluation metrics, emphasizing that MIA scores and ROUGE-based metrics alone are insufficient to fully evaluate successful unlearning. Finally, we emphasize the need for more comprehensive evaluation methodologies and rethinking of unlearning objectives in future research. Code is available at https://github.com/zjunlp/unlearn/tree/main/semeval25.
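
For readers unfamiliar with TIES-Merging, the sketch below walks through its three steps (trim, elect sign, disjoint merge) on flat parameter lists. It illustrates the general technique only: the `density` and `lam` values, the flat-list representation, and the function name are assumptions, and nothing here reflects the ZJUKLAB team's actual configuration.

```python
from typing import List, Sequence


def ties_merge(
    base: Sequence[float],
    finetuned: Sequence[Sequence[float]],
    density: float = 0.2,
    lam: float = 1.0,
) -> List[float]:
    """Merge fine-tuned models into `base` with TIES-style trimming and sign election."""
    if not 0 < density <= 1:
        raise ValueError("density must be in (0, 1]")
    # Task vectors: what each fine-tuned model changed relative to the base.
    task_vectors = [[f - b for f, b in zip(model, base)] for model in finetuned]

    # Step 1 (trim): keep only the largest-magnitude entries of each task vector.
    # Entries tied with the threshold are also kept.
    trimmed = []
    for tv in task_vectors:
        keep = max(1, int(round(density * len(tv))))
        threshold = sorted((abs(x) for x in tv), reverse=True)[keep - 1]
        trimmed.append([x if abs(x) >= threshold else 0.0 for x in tv])

    merged = []
    for values in zip(*trimmed):
        # Step 2 (elect sign): the sign carrying the larger total magnitude wins.
        sign = 1.0 if sum(values) >= 0 else -1.0
        # Step 3 (disjoint merge): average only the entries that agree with that sign.
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)

    return [b + lam * m for b, m in zip(base, merged)]


# Toy usage: two "models" with four parameters each, starting from a zero base.
merged = ties_merge(
    base=[0.0, 0.0, 0.0, 0.0],
    finetuned=[[1.0, 0.1, -2.0, 0.0], [3.0, -0.2, 1.0, 0.05]],
    density=0.5,
)
```

In the toy example the third parameter shows the effect of sign election: the two models disagree (-2.0 versus 1.0), the negative direction has the larger magnitude, and only that value survives instead of being averaged toward zero.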

VideoMix: Aggregating How-To Videos for Task-Oriented Learning

Saelyne Yang,Anh Truong,Juho Kim,Dingzeyu Li

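Task: Develop a system (VideoMix) that helps users gain a holistic understanding of a task by aggregating information from multiple tutorial videos.

Motivation: Learners typically watch several tutorial videos when picking up a new task, but the videos are scattered and hard to skim, which makes the process inefficient.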

Details

Method: Uses a vision-language model pipeline to extract and organize information from the videos, presenting textual summaries alongside relevant video clips. Result: A user study shows that VideoMix is more efficient than watching videos independently and yields a more comprehensive understanding of the task. Conclusion: VideoMix demonstrates the potential of a task-oriented, multi-video approach to video-based learning, offering an improvement over the conventional approach. Abstract: Tutorial videos are a valuable resource for people looking to learn new tasks. People often learn these skills by viewing multiple tutorial videos to get an overall understanding of a task by looking at different approaches to achieve the task. However, navigating through multiple videos can be time-consuming and mentally demanding as these videos are scattered and not easy to skim. We propose VideoMix, a system that helps users gain a holistic understanding of a how-to task by aggregating information from multiple videos on the task. Insights from our formative study (N=12) reveal that learners value understanding potential outcomes, required materials, alternative methods, and important details shared by different videos. Powered by a Vision-Language Model pipeline, VideoMix extracts and organizes this information, presenting concise textual summaries alongside relevant video clips, enabling users to quickly digest and navigate the content. A comparative user study (N=12) demonstrated that VideoMix enabled participants to gain a more comprehensive understanding of tasks with greater efficiency than a baseline video interface, where videos are viewed independently. Our findings highlight the potential of a task-oriented, multi-video approach where videos are organized around a shared goal, offering an enhanced alternative to conventional video-based learning.

UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning

Hongxuan Tang,Hao Liu,Xinyan Xiao

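Task: Propose UGen, a unified autoregressive multimodal model that handles text processing, image understanding, and image generation simultaneously.

Motivation: Address the challenges of unified multimodal learning and improve model performance across multiple tasks.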

Details

Method: Converts both text and images into discrete token sequences, generates them autoregressively with a single transformer, and trains with a progressive vocabulary learning mechanism. Result: On comprehensive text and image tasks, UGen improves overall performance by 13.3% over the vanilla unified autoregressive method and delivers competitive results across all tasks against several task-specific models. Conclusion: Progressive vocabulary learning effectively improves unified multimodal learning, making UGen a strong tool for multi-task processing. Abstract: We introduce UGen, a unified autoregressive multimodal model that demonstrates strong performance across text processing, image understanding, and image generation tasks simultaneously. UGen converts both texts and images into discrete token sequences and utilizes a single transformer to generate them uniformly in an autoregressive manner. To address the challenges associated with unified multimodal learning, UGen is trained using a novel mechanism, namely progressive vocabulary learning. In this process, visual token IDs are incrementally activated and integrated into the training phase, ultimately enhancing the effectiveness of unified multimodal learning. Experiments on comprehensive text and image tasks show that UGen achieves a significant overall performance improvement of 13.3% compared to the vanilla unified autoregressive method, and it also delivers competitive results across all tasks against several task-specific models.

WVSC: Wireless Video Semantic Communication with Multi-frame Compensation

Bingyan Xie,Yongpeng Wu,Yuxuan Shi,Biqian Feng,Wenjun Zhang,Jihong Park,Tony Q. S. Quek

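Task: Propose a wireless video semantic communication framework (WVSC) that integrates the idea of semantic communication into wireless video transmission.

Motivation: Existing wireless video transmission schemes perform video coding directly at the pixel level and neglect the semantic information contained in videos.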

Details

Method: WVSC encodes original video frames as semantic frames and performs video coding on this compact representation; a reference semantic frame reduces communication overhead, and multi-frame compensation (MFC) is applied at the receiver. Result: Experiments show that WVSC outperforms other deep-learning-based methods (e.g., DVSC) by about 1 dB and traditional schemes by about 2 dB in PSNR. Conclusion: Through semantic-level video coding and multi-frame compensation, WVSC improves bandwidth efficiency while maintaining good video transmission performance. Abstract: Existing wireless video transmission schemes directly conduct video coding at the pixel level, while neglecting the inner semantics contained in videos. In this paper, we propose a wireless video semantic communication framework, abbreviated as WVSC, which integrates the idea of semantic communication into wireless video transmission scenarios. WVSC first encodes original video frames as semantic frames and then conducts video coding based on such compact representations, enabling video coding at the semantic level rather than the pixel level. Moreover, to further reduce the communication overhead, a reference semantic frame is introduced to substitute motion vectors of each frame in common video coding methods. At the receiver, multi-frame compensation (MFC) is proposed to produce the compensated current semantic frame with a multi-frame fusion attention module. With both the reference frame transmission and MFC, the bandwidth efficiency improves while video transmission performance remains satisfactory. Experimental results verify that WVSC outperforms other DL-based methods, e.g., DVSC, by about 1 dB and traditional schemes by about 2 dB in terms of PSNR.
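
To put the reported 1 dB and 2 dB gains in context, recall the standard definition PSNR = 10 log10(MAX^2 / MSE), so a 1 dB gain corresponds to a mean squared error lower by a factor of 10^0.1, about 1.26. The sketch below computes PSNR for frames given as flat lists of pixel values. The 8-bit peak value of 255 is an assumption, since the abstract does not state the pixel range, and the function name is hypothetical.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    squared_errors = [(r - d) ** 2 for r, d in zip(reference, reconstructed)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


original_frame = [100, 100]
decoded_frame = [110, 90]  # toy values, each pixel off by 10
quality_db = psnr(original_frame, decoded_frame)
```

Since PSNR is logarithmic in the error, equal dB differences correspond to equal MSE ratios regardless of the absolute quality level.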

PLAIN: Scalable Estimation Architecture for Integrated Sensing and Communication

Bashar Tahir,Philipp Svoboda,Markus Rupp

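Task: Propose PLAIN, a tensor-based estimation architecture for high-dimensional parameter estimation in integrated sensing and communication (ISAC).

Motivation: ISAC requires high-dimensional parameter estimation, but conventional methods are computationally expensive and the time window available for sensing is short, so they struggle to meet the requirements.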

Details

Method: The PLAIN architecture has three stages: a compression stage, a decoupled estimation stage, and an input-based fusion stage, drawing on tools from tensor algebra, subspace-based processing, and compressed sensing. Result: PLAIN scales flexibly with dimensionality and maintains super-resolution at low complexity; it is compared against practical sequential and joint estimation baselines as well as theoretical bounds. Conclusion: PLAIN offers an efficient and flexible solution for high-dimensional parameter estimation in ISAC. Abstract: Integrated sensing and communication (ISAC) is envisioned to be one of the paradigms upon which next-generation mobile networks will be built, extending localization and tracking capabilities, as well as giving birth to environment-aware wireless access. A key aspect of sensing integration is parameter estimation, which involves extracting information about the surrounding environment, such as the direction, distance, and velocity of various objects within. This is typically of a high-dimensional nature, which leads to significant computational complexity if performed jointly across multiple sensing dimensions, such as space, frequency, and time. Additionally, due to the incorporation of sensing on top of the data transmission, the time window available for sensing is likely to be short, resulting in an estimation problem where only a single snapshot is accessible. In this work, we propose PLAIN, a tensor-based estimation architecture that flexibly scales with multiple sensing dimensions and can handle high dimensionality, limited measurement time, and super-resolution requirements. It consists of three stages: a compression stage, where the high-dimensional input is converted into lower dimensionality, without sacrificing resolution; a decoupled estimation stage, where the parameters across the different dimensions are estimated in parallel with low complexity; an input-based fusion stage, where the decoupled parameters are fused together to form a paired multidimensional estimate. We investigate the performance of the architecture for different configurations and compare it against practical sequential and joint estimation baselines, as well as theoretical bounds. 
Our results show that PLAIN, using tools from tensor algebra, subspace-based processing, and compressed sensing, can scale flexibly with dimensionality, while operating with low complexity and maintaining super-resolution.

ProHOC: Probabilistic Hierarchical Out-of-Distribution Classification via Multi-Depth Networks

Erik Wallin,Fredrik Kahl,Lars Hammarstrand

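Task: Propose a framework for detecting and classifying out-of-distribution (OOD) samples within a given class hierarchy.

Motivation: Conventional OOD detection is framed as a binary task and ignores the semantic relationships between OOD samples and the in-distribution (ID) classes.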

Details

Method: Uses the class hierarchy to build a probabilistic model, implemented with networks trained for ID classification at multiple hierarchy depths. Result: The method's effectiveness is shown on three datasets with predefined class hierarchies. Conclusion: The framework detects and classifies OOD samples effectively by exploiting the class hierarchy. Abstract: Out-of-distribution (OOD) detection in deep learning has traditionally been framed as a binary task, where samples are either classified as belonging to the known classes or marked as OOD, with little attention given to the semantic relationships between OOD samples and the in-distribution (ID) classes. We propose a framework for detecting and classifying OOD samples in a given class hierarchy. Specifically, we aim to predict OOD data to their correct internal nodes of the class hierarchy, whereas the known ID classes should be predicted as their corresponding leaf nodes. Our approach leverages the class hierarchy to create a probabilistic model and we implement this model by using networks trained for ID classification at multiple hierarchy depths. We conduct experiments on three datasets with predefined class hierarchies and show the effectiveness of our method. Our code is available at https://github.com/walline/prohoc.

STAMICS: Splat, Track And Map with Integrated Consistency and Semantics for Dense RGB-D SLAM

Yongxu Wang,Xu Cao,Weiyun Yi,Zhaoxin Fan

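Task: Propose STAMICS, a method that combines semantic information with 3D Gaussian representations to improve SLAM localization and mapping accuracy.

Motivation: Current SLAM methods rely mainly on geometric cues and struggle to maintain semantic consistency in dynamic or densely populated scenes.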

Details

Method: STAMICS has three key components: a 3D Gaussian-based scene representation, a graph-based clustering technique that enforces temporal semantic consistency, and an open-vocabulary system for classifying unseen objects. Result: Experiments show that STAMICS significantly improves camera pose estimation and map quality, outperforming existing methods while reducing reconstruction errors. Conclusion: By integrating semantic information, STAMICS improves SLAM performance in complex environments. Abstract: Simultaneous Localization and Mapping (SLAM) is a critical task in robotics, enabling systems to autonomously navigate and understand complex environments. Current SLAM approaches predominantly rely on geometric cues for mapping and localization, but they often fail to ensure semantic consistency, particularly in dynamic or densely populated scenes. To address this limitation, we introduce STAMICS, a novel method that integrates semantic information with 3D Gaussian representations to enhance both localization and mapping accuracy. STAMICS consists of three key components: a 3D Gaussian-based scene representation for high-fidelity reconstruction, a graph-based clustering technique that enforces temporal semantic consistency, and an open-vocabulary system that allows for the classification of unseen objects. Extensive experiments show that STAMICS significantly improves camera pose estimation and map quality, outperforming state-of-the-art methods while reducing reconstruction errors. Code will be publicly available.

RainyGS: Efficient Rain Synthesis with Physically-Based Gaussian Splatting

Qiyu Dai,Xingyu Ni,Qianfan Shen,Wenzheng Chen,Baoquan Chen,Mengyu Chu

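Task: Propose RainyGS, a method for adding dynamic rain effects to open-world scenes in a physically correct manner.

Motivation: Existing methods such as NeRF and 3DGS fall short on complex scene editing tasks like physics-based rain simulation, while traditional physics-based simulation relies on manual setup and lacks flexibility and scalability.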

Details

Method: Integrates physically based raindrop and shallow-water simulation with the fast 3DGS rendering framework for efficient, realistic rain simulation. Result: RainyGS synthesizes rain effects at over 30 fps, supports flexible control from light drizzle to heavy downpour, and performs well on real-world outdoor scenes and large-scale driving scenarios. Conclusion: RainyGS produces more photorealistic and physically accurate rain effects than existing methods. Abstract: We consider the problem of adding dynamic rain effects to in-the-wild scenes in a physically-correct manner. Recent advances in scene modeling have made significant progress, with NeRF and 3DGS techniques emerging as powerful tools for reconstructing complex scenes. However, while effective for novel view synthesis, these methods typically struggle with challenging scene editing tasks, such as physics-based rain simulation. In contrast, traditional physics-based simulations can generate realistic rain effects, such as raindrops and splashes, but they often rely on skilled artists to carefully set up high-fidelity scenes. This process lacks flexibility and scalability, limiting its applicability to broader, open-world environments. In this work, we introduce RainyGS, a novel approach that leverages the strengths of both physics-based modeling and 3DGS to generate photorealistic, dynamic rain effects in open-world scenes with physical accuracy. At the core of our method is the integration of physically-based raindrop and shallow water simulation techniques within the fast 3DGS rendering framework, enabling realistic and efficient simulations of raindrop behavior, splashes, and reflections. Our method supports synthesizing rain effects at over 30 fps, offering users flexible control over rain intensity -- from light drizzles to heavy downpours. We demonstrate that RainyGS performs effectively for both real-world outdoor scenes and large-scale driving scenarios, delivering more photorealistic and physically-accurate rain effects compared to state-of-the-art methods. The project page can be found at https://pku-vcl-geometry.github.io/RainyGS/

Sparse Bayesian Learning for Label Efficiency in Cardiac Real-Time MRI

Felix Terhag,Philipp Knechtges,Achim Basermann,Anja Bach,Darius Gerlach,Jens Tank,Raúl Tempone

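Task: Use sparse Bayesian learning (SBL) to predict the ventricular volume on outer slices in cardiac MRI while reducing the need for manual labeling.

Motivation: Real-time cardiac MRI produces a large number of images, but neural network predictions on outer slices are unreliable, so label efficiency must be addressed.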

Details

Method: SBL identifies the sparse frequencies (heart and respiratory rates) by optimizing hyperparameters, and these frequencies guide the automatic selection of outer-slice images for labeling. Result: Experiments show that only a few labeled images are needed for accurate volume prediction, and the Bayesian approach provides uncertainty estimates. Conclusion: The SBL approach reduces labeling requirements efficiently while providing reliable predictions and uncertainty estimates. Abstract: Cardiac real-time magnetic resonance imaging (MRI) is an emerging technology that images the heart at up to 50 frames per second, offering insight into the respiratory effects on the heartbeat. However, this method significantly increases the number of images that must be segmented to derive critical health indicators. Although neural networks perform well on inner slices, predictions on outer slices are often unreliable. This work proposes sparse Bayesian learning (SBL) to predict the ventricular volume on outer slices with minimal manual labeling to address this challenge. The ventricular volume over time is assumed to be dominated by sparse frequencies corresponding to the heart and respiratory rates. Moreover, SBL identifies these sparse frequencies on well-segmented inner slices by optimizing hyperparameters via type-II likelihood, automatically pruning irrelevant components. The identified sparse frequencies guide the selection of outer slice images for labeling, minimizing posterior variance. This work provides performance guarantees for the greedy algorithm. Testing on patient data demonstrates that only a few labeled images are necessary for accurate volume prediction. The labeling procedure effectively avoids selecting inefficient images. Furthermore, the Bayesian approach provides uncertainty estimates, highlighting unreliable predictions (e.g., when choosing suboptimal labels).

Embedding Compression Distortion in Video Coding for Machines

Yuxiao Sun,Yao Zhao,Meiqin Liu,Chao Yao,Weisi Lin

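Task: Propose a Compression Distortion Representation Embedding (CDRE) framework to improve video transmission performance for machine vision tasks.

Motivation: Existing codecs are optimized mainly for the pixel domain and the human visual system, and overlook the needs of machine vision tasks.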

Details

Method: Designs a compression-sensitive extractor that analyzes compression degradation in the feature domain, introduces a lightweight distortion codec to compress the distortion information, and embeds the resulting representation into the downstream model. Result: Experiments show that CDRE boosts the rate-task performance of existing codecs with minimal overhead in bitrate, execution time, and number of parameters. Conclusion: CDRE addresses the information lost during compression and improves machine vision task performance. Abstract: Currently, video transmission serves not only the Human Visual System (HVS) for viewing but also machine perception for analysis. However, existing codecs are primarily optimized for pixel-domain and HVS-perception metrics rather than the needs of machine vision tasks. To address this issue, we propose a Compression Distortion Representation Embedding (CDRE) framework, which extracts machine-perception-related distortion representation and embeds it into downstream models, addressing the information lost during compression and improving task performance. Specifically, to better analyze the machine-perception-related distortion, we design a compression-sensitive extractor that identifies compression degradation in the feature domain. For efficient transmission, a lightweight distortion codec is introduced to compress the distortion information into a compact representation. Subsequently, the representation is progressively embedded into the downstream model, enabling it to be better informed about compression degradation and enhancing performance. Experiments across various codecs and downstream tasks demonstrate that our framework can effectively boost the rate-task performance of existing codecs with minimal overhead in terms of bitrate, execution time, and number of parameters. Our codes and supplementary materials are released at https://github.com/Ws-Syx/CDRE/.

Brett Levac,Ajil Jalal,Kannan Ramchandran,Jonathan I. Tamir

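Task: Propose an AmbientGAN-based generative technique that identifies the distribution of parameters of an unknown imaging system from unpaired clean images and corrupted measurements.

Motivation: Address the uncertainty in imaging system parameters in blind inverse problems without requiring paired clean images and samples of the imaging system.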

Details

Method: Uses an AmbientGAN-based generative technique to learn the distribution of imaging system parameters, which is then used in model-based recovery algorithms. Result: Demonstrates learning Gaussian blur and motion blur priors from noisy measurements, and uses them to solve blind deconvolution with diffusion posterior sampling. Conclusion: The method offers an effective approach to blind inverse problems, learning the system parameter distribution without paired data. Abstract: Blind inverse problems in imaging arise from uncertainties in the system used to collect (noisy) measurements of images. Recovering clean images from these measurements typically requires identifying the imaging system, either implicitly or explicitly. A common solution leverages generative models as priors for both the images and the imaging system parameters (e.g., a class of point spread functions). To learn these priors in a straightforward manner requires access to a dataset of clean images as well as samples of the imaging system. We propose an AmbientGAN-based generative technique to identify the distribution of parameters in unknown imaging systems, using only unpaired clean images and corrupted measurements. This learned distribution can then be used in model-based recovery algorithms to solve blind inverse problems such as blind deconvolution. We successfully demonstrate our technique for learning Gaussian blur and motion blur priors from noisy measurements and show their utility in solving blind deconvolution with diffusion posterior sampling.

Keyword-Oriented Multimodal Modeling for Euphemism Identification

Yuxue Hu,Junsong Li,Meixuan Chen,Dongyu Su,Tongguan Wang,Ying Sha

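Task: Develop a multimodal euphemism identification method (KOM-EI) and build a multimodal euphemism corpus (KOM-Euph).

Motivation: Existing methods are mainly text-based, but the rise of multimodal content on social media calls for analysis that combines text, images, and audio, and the lack of suitable datasets limits research.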

Details

Method: Proposes KOM-EI, which uses cross-modal feature alignment and dynamic fusion modules to exploit the visual and audio features of keywords for euphemism identification; builds the KOM-Euph corpus covering three domains: drugs, weapons, and sexuality. Result: Experiments show that KOM-EI outperforms state-of-the-art models and large language models, and confirm the importance of the multimodal datasets. Conclusion: KOM-EI and KOM-Euph provide effective tools for euphemism identification and advance multimodal analysis for content moderation and combating underground markets. Abstract: Euphemism identification deciphers the true meaning of euphemisms, such as linking "weed" (euphemism) to "marijuana" (target keyword) in illicit texts, aiding content moderation and combating underground markets. While existing methods are primarily text-based, the rise of social media highlights the need for multimodal analysis, incorporating text, images, and audio. However, the lack of multimodal datasets for euphemisms limits further research. To address this, we regard euphemisms and their corresponding target keywords as keywords and first introduce a keyword-oriented multimodal corpus of euphemisms (KOM-Euph), involving three datasets (Drug, Weapon, and Sexuality), including text, images, and speech. We further propose a keyword-oriented multimodal euphemism identification method (KOM-EI), which uses cross-modal feature alignment and dynamic fusion modules to explicitly utilize the visual and audio features of the keywords for efficient euphemism identification. Extensive experiments demonstrate that KOM-EI outperforms state-of-the-art models and large language models, and show the importance of our multimodal datasets.

Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving

Yue Li,Meng Tian,Zhenyu Lin,Jiangtong Zhu,Dechang Zhu,Haiqiang Liu,Zining Wang,Yueyi Zhang,Zhiwei Xiong,Xinhai Zhao

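Task: Propose VLADBench, a fine-grained dataset for evaluating vision-language models (VLMs) in complex autonomous driving (AD) scenarios.

Motivation: Existing benchmarks mainly assess VLM interpretability through open-form visual question answering (QA) within coarse-grained tasks, which is insufficient for assessing capabilities in complex driving scenarios.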

Details

Method: Builds the VLADBench dataset, which spans 5 key domains (e.g., Traffic Knowledge Understanding and Target Attribute Comprehension), broken down into 11 secondary aspects and 29 tertiary tasks, and evaluates general and domain-specific VLMs. Result: The experiments indicate that VLADBench is a key step toward comprehensive assessment of VLMs in AD, revealing model strengths and limitations. Conclusion: VLADBench lays the groundwork for AD systems with stronger cognitive and reasoning capabilities. Abstract: Existing benchmarks for Vision-Language Models (VLMs) on autonomous driving (AD) primarily assess interpretability through open-form visual question answering (QA) within coarse-grained tasks, which remains insufficient to assess capabilities in complex driving scenarios. To this end, we introduce VLADBench, a challenging and fine-grained dataset featuring closed-form QAs that progress from static foundational knowledge and elements to advanced reasoning for dynamic on-road situations. The elaborate VLADBench spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning. These domains are further broken down into 11 secondary aspects and 29 tertiary tasks for a granular evaluation. A thorough assessment of general and domain-specific (DS) VLMs on this benchmark reveals both their strengths and critical limitations in AD contexts. To further exploit the cognitive and reasoning interactions among the 5 domains for AD understanding, we start from a small-scale VLM and train the DS models on individual domain datasets (collected from 1.4M DS QAs across public sources). The experimental results demonstrate that the proposed benchmark provides a crucial step toward a more comprehensive assessment of VLMs in AD, paving the way for the development of more cognitively sophisticated and reasoning-capable AD systems.

Uncertainty-aware Bayesian machine learning modelling of land cover classification

Samuel Bilson,Anna Pustogvar

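Task: Propose a land cover classification method based on a Bayesian classification framework that accounts for input measurement uncertainty.

Motivation: Existing machine learning classification models do not account for input measurement uncertainty, which is vital for traceability in metrology.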

Details

Method: Adopts a Bayesian classification framework with generative modelling, specifically Bayesian quadratic discriminant analysis, applied to Copernicus Sentinel-2 data. Result: The Bayesian models are more trustworthy: they are more interpretable, explicitly model input uncertainty, maintain predictive performance across datasets of different years and sizes, and are computationally efficient. Conclusion: The Bayesian classification framework is advantageous for land cover classification, particularly regarding measurement uncertainty and model trustworthiness. Abstract: Land cover classification involves the production of land cover maps, which determine the type of land through remote sensing imagery. In recent years, such classification has been performed by machine learning classification models, which can give highly accurate predictions on land cover per pixel using large quantities of input training data. However, such models do not currently take account of input measurement uncertainty, which is vital for traceability in metrology. In this work we propose a Bayesian classification framework using generative modelling to take account of input measurement uncertainty. We take the specific case of Bayesian quadratic discriminant analysis, and apply it to land cover datasets from Copernicus Sentinel-2 in 2020 and 2021. We benchmark the performance of the model against more popular classification models used in land cover maps such as random forests and neural networks. We find that such Bayesian models are more trustworthy, in the sense that they are more interpretable, explicitly model the input measurement uncertainty, and maintain predictive performance of class probability outputs across datasets of different years and sizes, whilst also being computationally efficient.
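
One way a generative classifier can absorb input measurement uncertainty is illustrated below. For a one-dimensional Gaussian class model and Gaussian measurement noise, the likelihood of the observed value is again Gaussian, with the class variance and the measurement variance added. This is a simplified plug-in sketch, not the paper's Bayesian quadratic discriminant analysis: the class names, means, variances, and the single reflectance-like feature are all hypothetical.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)


def class_posteriors(x, measurement_variance, class_params, priors):
    """Posterior class probabilities for a noisy observation x.

    class_params maps label -> (mean, variance) of the class-conditional Gaussian.
    Convolving that Gaussian with Gaussian measurement noise adds the variances.
    """
    joint = {
        label: priors[label] * gaussian_pdf(x, mean, variance + measurement_variance)
        for label, (mean, variance) in class_params.items()
    }
    evidence = sum(joint.values())
    return {label: value / evidence for label, value in joint.items()}


# Hypothetical class models for a single feature (illustrative numbers only).
params = {"water": (0.1, 0.01), "forest": (0.5, 0.04)}
priors = {"water": 0.5, "forest": 0.5}

precise = class_posteriors(0.1, 0.0, params, priors)    # exact measurement
uncertain = class_posteriors(0.1, 100.0, params, priors)  # very noisy measurement
```

With an exact measurement the posterior strongly favours the nearer class; as the measurement variance grows, the class probabilities move toward the priors, which is the behaviour one would want from a classifier that reports honest uncertainty.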

SyncSDE: A Probabilistic Framework for Diffusion Synchronization

Hyunjun Lee,Hyunsoo Lee,Sookwan Han

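Task: Analyze how diffusion synchronization works through a probabilistic framework and optimize correlation modeling across tasks.

Motivation: Existing methods rely on naive heuristics (such as averaging) without task-specific analysis, and so perform poorly when applied to different tasks.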

Details

Method: Presents a probabilistic framework that explains why diffusion synchronization works and tailors correlation modeling to each task. Result: Identifies optimal correlation models per task, achieving better results than previous approaches. Conclusion: Task-specific correlation modeling improves the effectiveness of diffusion synchronization. Abstract: There have been many attempts to leverage multiple diffusion models for collaborative generation, extending beyond the original domain. A prominent approach involves synchronizing multiple diffusion trajectories by mixing the estimated scores to artificially correlate the generation processes. However, existing methods rely on naive heuristics, such as averaging, without considering task specificity. These approaches do not clarify why such methods work and often fail when a heuristic suitable for one task is blindly applied to others. In this paper, we present a probabilistic framework for analyzing why diffusion synchronization works and reveal where heuristics should be focused - modeling correlations between multiple trajectories and adapting them to each specific task. We further identify optimal correlation models per task, achieving better results than previous approaches that apply a single heuristic across all tasks without justification.

When Astronomy Meets AI: Manazel For Crescent Visibility Prediction in Morocco

Yassir Lairgi

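Task: Refine the ODEH criterion with machine learning by integrating the Arc of Vision (ARCV) and the total crescent width (W), in order to determine the start of Hijri months more accurately.

Motivation: Accurately determining the start of each Hijri month is essential for religious, cultural, and administrative purposes, and the widely used ODEH criterion leaves room for refinement in predicting crescent visibility.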

Details

Method: Applies the Logistic Regression algorithm to 13 years of crescent visibility data to classify crescent visibility conditions. Result: Achieves a predictive accuracy of 98.83%, providing a reliable data-driven framework for determining the start of Hijri months. Conclusion: Machine learning is effective in astronomical applications, and crescent visibility modeling can be improved further. Abstract: The accurate determination of the beginning of each Hijri month is essential for religious, cultural, and administrative purposes. Manazel (The code and datasets are available at https://github.com/lairgiyassir/manazel) addresses this challenge in Morocco by leveraging 13 years of crescent visibility data to refine the ODEH criterion, a widely used standard for lunar crescent visibility prediction. The study integrates two key features, the Arc of Vision (ARCV) and the total width of the crescent (W), to enhance the accuracy of lunar visibility assessments. A machine learning approach utilizing the Logistic Regression algorithm is employed to classify crescent visibility conditions, achieving a predictive accuracy of 98.83%. This data-driven methodology offers a robust and reliable framework for determining the start of the Hijri month, comparing different data classification tools, and improving the consistency of lunar calendar calculations in Morocco. The findings demonstrate the effectiveness of machine learning in astronomical applications and highlight the potential for further enhancements in the modeling of crescent visibility.
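
The classification step described in the abstract, logistic regression on ARCV and W, can be sketched as follows. This is not the Manazel code: the eight observations are synthetic values invented for illustration, the units (ARCV in degrees, W in arcminutes) are an assumption, and the helper names, learning rate, and epoch count are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def standardize_params(rows):
    """Per-feature mean and (population) standard deviation."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(columns, means)]
    return means, stds


def scale(row, means, stds):
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


def fit_logistic(rows, labels, lr=0.5, epochs=1000):
    """Batch gradient descent on the mean logistic loss."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_visible(arcv, width, model):
    means, stds, weights, bias = model
    x = scale([arcv, width], means, stds)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


# Synthetic toy observations, NOT from the Manazel dataset: (ARCV, W).
TOY_ROWS = [(3.0, 0.2), (4.0, 0.3), (5.0, 0.2), (4.5, 0.4),
            (9.0, 0.6), (10.0, 0.8), (11.0, 0.5), (12.0, 1.0)]
TOY_LABELS = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = crescent seen, 0 = not seen

means, stds = standardize_params(TOY_ROWS)
weights, bias = fit_logistic([scale(r, means, stds) for r in TOY_ROWS], TOY_LABELS)
MODEL = (means, stds, weights, bias)
```

A call such as `predict_visible(13.0, 1.0, MODEL)` then returns a boolean visibility prediction. Standardizing the two features first keeps gradient descent well conditioned even though ARCV and W live on different scales.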

Cognitive Science-Inspired Evaluation of Core Capabilities for Object Understanding in AI

Danaja Rutar,Alva Markelius,Konstantinos Voudouris,José Hernández-Orallo,Lucy Cheke

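Task: Review theoretical frameworks of objecthood and evaluate how AI paradigms approach it.

Motivation: Objecthood is a core component of world models and is essential to both biological agents and AI systems, yet there is no unified theoretical account or adequate evaluation method.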

Details

Method: First reviews theoretical frameworks from Gestalt psychology, enactive cognition, and developmental psychology, then evaluates how current AI paradigms approach and test objecthood capabilities. Result: Finds that existing AI benchmarks detect only isolated aspects of objecthood and cannot detect a lack of functional integration across capabilities. Conclusion: Explores new evaluation approaches to move AI from isolated capabilities toward integrated object understanding in real-world contexts. Abstract: One of the core components of our world models is 'intuitive physics' - an understanding of objects, space, and causality. This capability enables us to predict events, plan actions, and navigate environments, all of which rely on a composite sense of objecthood. Despite its importance, there is no single, unified account of objecthood, though multiple theoretical frameworks provide insights. In the first part of this paper, we present a comprehensive overview of the main theoretical frameworks in objecthood research - Gestalt psychology, enactive cognition, and developmental psychology - and identify the core capabilities each framework attributes to object understanding, as well as what functional roles they play in shaping world models in biological agents. Given the foundational role of objecthood in world modelling, understanding objecthood is also essential in AI. In the second part of the paper, we evaluate how current AI paradigms approach and test objecthood capabilities compared to those in cognitive science. We define an AI paradigm as a combination of how objecthood is conceptualised, the methods used for studying objecthood, the data utilised, and the evaluation techniques. We find that, whilst benchmarks can detect that AI systems model isolated aspects of objecthood, the benchmarks cannot detect when AI systems lack functional integration across these capabilities and thus do not fully solve the objecthood challenge. Finally, we explore novel evaluation approaches that align with the integrated vision of objecthood outlined in this paper. These methods are promising candidates for advancing from isolated object capabilities toward general-purpose AI with genuine object understanding in real-world contexts.

Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data

Zhiyuan Ma,Xinyue Liang,Rongyuan Wu,Xiangyu Zhu,Zhen Lei,Lei Zhang

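Task: Propose a training scheme, Progressive Rendering Distillation (PRD), for fast generation of high-quality 3D meshes from text prompts.

Motivation: Address the poor generation quality of existing methods caused by the lack of high-quality 3D training data.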

Details

Method: Distills multi-view diffusion models (such as MVDream and RichDreamer) jointly with Stable Diffusion, progressively denoising and decoding 3D outputs without 3D ground truth. Result: The resulting TriplaneTurbo model produces high-quality 3D meshes in 1.2 seconds and outperforms previous methods in both efficiency and quality. Conclusion: PRD overcomes the data shortage and improves both generation speed and quality. Abstract: It is highly desirable to obtain a model that can generate high-quality 3D meshes from text prompts in just seconds. While recent attempts have adapted pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into generators of 3D representations (e.g., Triplane), they often suffer from poor quality due to the lack of sufficient high-quality 3D training data. Aiming at overcoming the data shortage, we propose a novel training scheme, termed Progressive Rendering Distillation (PRD), eliminating the need for 3D ground-truths by distilling multi-view diffusion models and adapting SD into a native 3D generator. In each iteration of training, PRD uses the U-Net to progressively denoise the latent from random noise for a few steps, and in each step it decodes the denoised latent into 3D output. Multi-view diffusion models, including MVDream and RichDreamer, are used jointly with SD to distill text-consistent textures and geometries into the 3D outputs through score distillation. Since PRD supports training without 3D ground-truths, we can easily scale up the training data and improve generation quality for challenging text prompts with creative concepts. Meanwhile, PRD can accelerate the inference speed of the generation model in just a few steps. With PRD, we train a Triplane generator, namely TriplaneTurbo, which adds only 2.5% trainable parameters to adapt SD for Triplane generation. TriplaneTurbo outperforms previous text-to-3D generators in both efficiency and quality. Specifically, it can produce high-quality 3D meshes in 1.2 seconds and generalize well for challenging text input. The code is available at https://github.com/theEricMa/TriplaneTurbo.

Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Wenqi Zhang,Mengna Wang,Gangao Liu,Xu Huixin,Yiwei Jiang,Yongliang Shen,Guiyang Hou,Zhe Zheng,Hang Zhang,Xin Li,Weiming Lu,Peng Li,Yueting Zhuang

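Task: Extend deep thinking models to embodied search tasks that require continuous interaction.

Motivation: Deep thinking models perform well on mathematical and coding tasks, but their effectiveness in embodied domains that require continuous interaction with the environment remains largely unexplored.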

Details

Method: Proposes Embodied Reasoner, trained on 9.3k synthesized, coherent Observation-Thought-Action trajectories with a three-stage pipeline (imitation learning, self-exploration, and self-correction). Result: The model significantly outperforms advanced visual reasoning models, exceeding OpenAI o1, o3-mini, and Claude-3.7 by 9%, 24%, and 13%, respectively. Conclusion: Embodied Reasoner performs especially well on complex long-horizon tasks, shows fewer repeated searches and logical inconsistencies, and carries over to real-world environments. Abstract: Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning that relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. The evaluation shows that our model significantly outperforms advanced visual reasoning models, e.g., it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9%, +24%, and +13%. Analysis reveals our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. Experiments in real-world environments also show our model's superiority, with fewer repeated searches and logical inconsistency cases.

MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX

Liuyue Xie,George Z. Wei,Avik Kuthiala,Ce Zheng,Ananya Bal,Mosam Dabhi,Liting Wen,Taru Rustagi,Ethan Lai,Sushil Khyalia,Rohan Choudhury,Morteza Ziyadi,Xu Zhang,Hao Yang,László A. Jeni

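Task: Evaluate multimodal models on tasks that require integrating video and audio information.

Motivation: The field lacks a standardized evaluation framework for thoroughly assessing the cross-modal perception of multimodal models.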

Details

Method: Proposes the MAVERIX benchmark, with 700 videos and 2,556 questions explicitly designed around tasks that require close integration of video and audio information. Result: Experiments show state-of-the-art models (such as Gemini 1.5 Pro and o1) reaching around 70% accuracy, described as approaching human levels, while human experts reach near-ceiling performance (95.1%). Conclusion: With standardized evaluation protocols and a public toolkit, MAVERIX provides a challenging testbed for advancing audiovisual multimodal intelligence. Abstract: Frontier models have either been language-only or have primarily focused on vision and language modalities. Although recent advancements in models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modality perception performance. We introduce MAVERIX (Multimodal Audio-Visual Evaluation Reasoning IndeX), a novel benchmark with 700 videos and 2,556 questions explicitly designed to evaluate multimodal models through tasks that necessitate close integration of video and audio information. MAVERIX uniquely provides models with audiovisual tasks, closely mimicking the multimodal perceptual experiences available to humans during inference and decision-making processes. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration. Experiments with state-of-the-art models, including Gemini 1.5 Pro and o1, show performance approaching human levels (around 70% accuracy), while human experts reach near-ceiling performance (95.1%). With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.