2025 03 22

Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings

Austin Xu,Srijan Bansal,Yifei Ming,Semih Yavuz,Shafiq Joty

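Task: Propose ContextualJudgeBench, a benchmark of 2,000 challenging response pairs for evaluating judge models in contextual settings.

Motivation: Existing judge models are typically evaluated only in non-contextual settings, which overlooks the role of contextual information even though contextual evaluation is increasingly common in practice.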

Details

Method: ContextualJudgeBench is built with a multi-pronged data construction pipeline that leverages existing human annotations and model-based perturbations. Result: A comprehensive study across 11 judge models and 9 general-purpose models shows that contextual information and its assessment criteria pose a significant challenge even to state-of-the-art models. Conclusion: Contextual evaluation is a challenging task, and existing models still have room to improve in contextual settings. Abstract: The large language model (LLM)-as-judge paradigm has been used to meet the demand for a cheap, reliable, and fast evaluation of model outputs during AI system development and post-deployment monitoring. While judge models -- LLMs finetuned to specialize in assessing and critiquing model outputs -- have been touted as general purpose evaluators, they are typically evaluated only on non-contextual scenarios, such as instruction following. The omission of contextual settings -- those where external information is used as context to generate an output -- is surprising given the increasing prevalence of retrieval-augmented generation (RAG) and summarization use cases. Contextual assessment is uniquely challenging, as evaluation often depends on practitioner priorities, leading to conditional evaluation criteria (e.g., comparing responses based on factuality and then considering completeness if they are equally factual). To address the gap, we propose ContextualJudgeBench, a judge benchmark with 2,000 challenging response pairs across eight splits inspired by real-world contextual evaluation scenarios. We build our benchmark with a multi-pronged data construction pipeline that leverages both existing human annotations and model-based perturbations. Our comprehensive study across 11 judge models and 9 general-purpose models reveals that the contextual information and its assessment criteria present a significant challenge to even state-of-the-art models. For example, OpenAI's o1, the best-performing model, barely reaches 55% consistent accuracy.
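
The conditional criterion mentioned in the abstract (factuality first, completeness only as a tie-breaker) can be read as a lexicographic comparison. The sketch below is illustrative only: the `Scores` fields, the tolerance, and the function names are hypothetical, and the definition of consistent accuracy used here (a pair counts only if the judge is right under both response orderings) is an assumption, since the abstract does not spell it out.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Scores:
    """Hypothetical per-response scores; higher is better."""
    factuality: float
    completeness: float


def prefer(a: Scores, b: Scores, tol: float = 1e-9) -> str:
    """Return "A", "B", or "tie" under a factuality-then-completeness rule."""
    # Factuality decides first.
    if abs(a.factuality - b.factuality) > tol:
        return "A" if a.factuality > b.factuality else "B"
    # Completeness is consulted only when factuality is (nearly) equal.
    if abs(a.completeness - b.completeness) > tol:
        return "A" if a.completeness > b.completeness else "B"
    return "tie"


def consistent_accuracy(verdicts: Sequence[Tuple[bool, bool]]) -> float:
    """Fraction of pairs judged correctly in BOTH presentation orders.

    Each item is (correct_in_original_order, correct_in_swapped_order).
    This definition is an assumption made for illustration.
    """
    if not verdicts:
        return 0.0
    hits = sum(1 for first, second in verdicts if first and second)
    return hits / len(verdicts)
```

Under this rule a less complete but more factual response still wins, which is the behavior the conditional criterion is meant to enforce.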

Enhancing Pancreatic Cancer Staging with Large Language Models: The Role of Retrieval-Augmented Generation

Hisashi Johno,Yuki Johno,Akitomo Amakawa,Junichi Sato,Ryota Tozuka,Atsushi Komaba,Hiroaki Watanabe,Hiroki Watanabe,Chihiro Goto,Hiroyuki Morisaka,Hiroshi Onishi,Kazunori Nakamoto

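Task: Compare NotebookLM (an LLM with RAG) against its internal LLM, Gemini 2.0 Flash, in a pancreatic cancer staging experiment to assess the effect of RAG on LLMs.

Motivation: To better isolate the impact of RAG from inherent model differences and to assess its utility across different cancers.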

Details

Method: Using a summary of Japan's pancreatic cancer staging guidelines as reliable external knowledge (REK), three groups were compared in staging 100 fictional pancreatic cancer cases: REK+/RAG+ (NotebookLM with REK), REK+/RAG- (Gemini 2.0 Flash with REK), and REK-/RAG- (Gemini 2.0 Flash without REK). Result: REK+/RAG+ achieved a staging accuracy of 70%, outperforming REK+/RAG- (38%) and REK-/RAG- (35%). For TNM classification, REK+/RAG+ reached 80% accuracy, exceeding REK+/RAG- (55%) and REK-/RAG- (50%). The retrieval accuracy of REK+/RAG+ was 92%. Conclusion: NotebookLM outperformed its internal LLM, Gemini 2.0 Flash, in the pancreatic cancer staging experiment, suggesting that RAG may improve the staging accuracy of LLMs. Its ability to retrieve and present REK excerpts also gives physicians transparency, highlighting its applicability to clinical diagnosis and classification. Abstract: Purpose: Retrieval-augmented generation (RAG) is a technology to enhance the functionality and reliability of large language models (LLMs) by retrieving relevant information from reliable external knowledge (REK). RAG has gained interest in radiology, and we previously reported the utility of NotebookLM, an LLM with RAG (RAG-LLM), for lung cancer staging. However, since the comparator LLM differed from NotebookLM's internal model, it remained unclear whether its advantage stemmed from RAG or inherent model differences. To better isolate RAG's impact and assess its utility across different cancers, we compared NotebookLM with its internal LLM, Gemini 2.0 Flash, in a pancreatic cancer staging experiment. Materials and Methods: A summary of Japan's pancreatic cancer staging guidelines was used as REK. We compared three groups - REK+/RAG+ (NotebookLM with REK), REK+/RAG- (Gemini 2.0 Flash with REK), and REK-/RAG- (Gemini 2.0 Flash without REK) - in staging 100 fictional pancreatic cancer cases based on CT findings. Staging criteria included TNM classification, local invasion factors, and resectability classification. In REK+/RAG+, retrieval accuracy was quantified based on the sufficiency of retrieved REK excerpts. Results: REK+/RAG+ achieved a staging accuracy of 70%, outperforming REK+/RAG- (38%) and REK-/RAG- (35%). For TNM classification, REK+/RAG+ attained 80% accuracy, exceeding REK+/RAG- (55%) and REK-/RAG- (50%). Additionally, REK+/RAG+ explicitly presented retrieved REK excerpts, achieving a retrieval accuracy of 92%. Conclusion: NotebookLM, a RAG-LLM, outperformed its internal LLM, Gemini 2.0 Flash, in a pancreatic cancer staging experiment, suggesting that RAG may improve LLMs' staging accuracy. Furthermore, its ability to retrieve and present REK excerpts provides transparency for physicians, highlighting its applicability for clinical diagnosis and classification.

Am I eligible? Natural Language Inference for Clinical Trial Patient Recruitment: the Patient's Point of View

Mathilde Aguiar,Pierre Zweigenbaum,Nona Naderi

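Task: Study the case where patients describe their medical profile in their own words to determine whether they are eligible for a clinical trial.

Motivation: Promoting clinical trials directly to patients through online recruitment may reach them more efficiently.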

Details

Method: A new dataset and task, Natural Language Inference for Patient Recruitment (NLI4PR), is built by adapting the TREC 2022 Clinical Trial Track dataset and manually rephrasing the patients' medical profiles in patient language. Result: With patient language, the best model loses only a little performance: F1 scores range from 56.5 to 71.8, against 64.7 to 73.1 with medical language. Conclusion: Starting recruitment from the patient can help recruit clinical trial patients with only a small loss in performance. Abstract: Recruiting patients to participate in clinical trials can be challenging and time-consuming. Usually, participation in a clinical trial is initiated by a healthcare professional and proposed to the patient. Promoting clinical trials directly to patients via online recruitment might help to reach them more efficiently. In this study, we address the case where a patient is initiating their own recruitment process and wants to determine whether they are eligible for a given clinical trial, using their own language to describe their medical profile. To study whether this creates difficulties in the patient trial matching process, we design a new dataset and task, Natural Language Inference for Patient Recruitment (NLI4PR), in which patient language profiles must be matched to clinical trials. We create it by adapting the TREC 2022 Clinical Trial Track dataset, which provides patients' medical profiles, and rephrasing them manually using patient language. We also use the associated clinical trial reports where the patients are either eligible or excluded. We prompt several open-source Large Language Models on our task and achieve F1 scores from 56.5 to 71.8 using patient language, against 64.7 to 73.1 for the same task using medical language. When using patient language, we observe only a small loss in performance for the best model, suggesting that having the patient as a starting point could be adopted to help recruit patients for clinical trials. The corpus and code bases are all freely available on our GitHub and Hugging Face repositories.

KoGNER: A Novel Framework for Knowledge Graph Distillation on Biomedical Named Entity Recognition

Heming Zhang,Wenyu Li,Di Huang,Yinjie Tang,Yixin Chen,Philip Payne,Fuhai Li

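Task: Propose KoGNER, a new approach that integrates knowledge graph distillation into NER models to improve entity recognition performance.

Motivation: Traditional deep learning NER models struggle with domain-specific generalization and data sparsity, so a new approach is needed to strengthen entity recognition.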

Details

Method: KoGNER uses a two-step process: (1) knowledge distillation, which distills external knowledge sources into a lightweight representation for seamless integration with NER models; and (2) entity-aware augmentation, which feeds contextual embeddings enriched with knowledge graph information directly into a GNN, improving the model's ability to understand and represent entity relationships. Result: Experiments on benchmark datasets show that KoGNER achieves state-of-the-art performance, significantly outperforming fine-tuned NER models and LLMs. Conclusion: Using knowledge graphs as auxiliary information can significantly improve NER accuracy, and KoGNER offers a promising direction for future research in knowledge-aware NLP. Abstract: Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that plays a crucial role in information extraction, question answering, and knowledge-based systems. Traditional deep learning-based NER models often struggle with domain-specific generalization and suffer from data sparsity issues. In this work, we introduce Knowledge Graph distilled for Named Entity Recognition (KoGNER), a novel approach that integrates Knowledge Graph (KG) distillation into NER models to enhance entity recognition performance. Our framework leverages structured knowledge representations from KGs to enrich contextual embeddings, thereby improving entity classification and reducing ambiguity in entity detection. KoGNER employs a two-step process: (1) Knowledge Distillation, where external knowledge sources are distilled into a lightweight representation for seamless integration with NER models, and (2) Entity-Aware Augmentation, which integrates contextual embeddings that have been enriched with knowledge graph information directly into a GNN, thereby improving the model's ability to understand and represent entity relationships. Experimental results on benchmark datasets demonstrate that KoGNER achieves state-of-the-art performance, outperforming finetuned NER models and LLMs by a significant margin. These findings suggest that leveraging knowledge graphs as auxiliary information can significantly improve NER accuracy, making KoGNER a promising direction for future research in knowledge-aware NLP.

CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation

Masud Ahmed,Zahid Hasan,Syed Arefinul Haque,Abu Zaher Md Faridee,Sanjay Purushotham,Suya You,Nirmalya Roy

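Task: Propose a semantic segmentation framework based on continuous-valued embeddings as an alternative to traditional quantized embeddings.

Motivation: Autoencoder accuracy on segmentation masks is 8% lower with quantized embeddings (e.g., VQ-VAE) than with continuous-valued embeddings (e.g., KL-VAE).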

Details

Method: Semantic mask generation is reformulated as a continuous image-to-embedding diffusion process, and a diffusion-guided autoregressive transformer learns a continuous semantic embedding space. Result: Experiments on multiple datasets (e.g., Cityscapes and domain-shifted variants) show state-of-the-art robustness to distribution shifts (such as adverse weather and viewpoint changes) and strong resilience to noise. Conclusion: The continuous embedding space gives the framework zero-shot domain adaptation capabilities, and it maintains high performance under a range of noise conditions. Abstract: Traditional transformer-based semantic segmentation relies on quantized embeddings. However, our analysis reveals that autoencoder accuracy on segmentation masks using quantized embeddings (e.g., VQ-VAE) is 8% lower than with continuous-valued embeddings (e.g., KL-VAE). Motivated by this, we propose a continuous-valued embedding framework for semantic segmentation. By reformulating semantic mask generation as a continuous image-to-embedding diffusion process, our approach eliminates the need for discrete latent representations while preserving fine-grained spatial and semantic details. Our key contribution includes a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space by modeling long-range dependencies in image features. Our framework contains a unified architecture combining a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. Our setting facilitates zero-shot domain adaptation capabilities enabled by the continuity of the embedding space. Experiments across diverse datasets (e.g., Cityscapes and domain-shifted variants) demonstrate state-of-the-art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint variations. Our model also exhibits strong noise resilience, achieving robust performance (≈ 95% AP compared to baseline) under Gaussian noise, moderate motion blur, and moderate brightness/contrast variations, while experiencing only a moderate impact (≈ 90% AP compared to baseline) from 50% salt and pepper noise, saturation and hue shifts. Code available: https://github.com/mahmed10/CAMSS.git

Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer

Alexandra DeLucia,Mark Dredze

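Task: Evaluate multi-document summarization (MDS) models across training approaches, domains, and dimensions (reference similarity, quality, and factuality), and analyze why models trained on one domain fail on another (News, Science, and Conversation) in the zero-shot domain transfer setting.

Motivation: Study how well MDS models transfer between domains and examine potential issues with existing summarization metrics.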

Details

Method: MDS models are evaluated across training approaches (end-to-end, chunk-then-summarize, extract-then-summarize, and inference with GPT-style models), and their zero-shot domain transfer is analyzed on the News, Science, and Conversation domains. Result: Models trained on one domain show lower summary quality, lower factuality, and higher deviation from the target on another domain. Conclusion: MDS models face significant challenges in cross-domain transfer, and existing summarization metrics may not suit every domain. Abstract: Abstractive multi-document summarization (MDS) is the task of automatically summarizing information in multiple documents, from news articles to conversations with multiple speakers. The training approaches for current MDS models can be grouped into four categories: end-to-end with special pre-training ("direct"), chunk-then-summarize, extract-then-summarize, and inference with GPT-style models. In this work, we evaluate MDS models across training approaches, domains, and dimensions (reference similarity, quality, and factuality), to analyze how and why models trained on one domain can fail to summarize documents from another (News, Science, and Conversation) in the zero-shot domain transfer setting. We define domain-transfer "failure" as a decrease in factuality, higher deviation from the target, and a general decrease in summary quality. In addition to exploring domain transfer for MDS models, we examine potential issues with applying popular summarization metrics out-of-the-box.

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

Federico Cocchi,Nicholas Moratelli,Davide Caffagni,Sara Sarto,Lorenzo Baraldi,Marcella Cornia,Rita Cucchiara

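Task: Explore the trade-offs between visual backbones and language models in Multimodal Large Language Models (MLLMs), and introduce the LLaVA-MORE model family.

Motivation: Prior work has mainly focused on scaling models to billions of parameters, while the trade-offs between model size, architecture, and performance remain underexplored; inconsistent training data and evaluation protocols also hinder direct comparisons.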

Details

Method: The LLaVA-MORE family is trained with a unified protocol and used to systematically analyze small- and medium-scale LLMs (such as Phi-4, LLaMA-3.1, and Gemma-2) and a range of visual encoders (such as CLIP, DINOv2, SigLIP, and SigLIP2). Result: The study provides insights into designing more effective MLLMs and a reproducible evaluation framework that enables direct comparisons and can guide future model development. Conclusion: The LLaVA-MORE family and its unified training protocol offer a new perspective and method for designing and evaluating MLLMs. Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs -- including Phi-4, LLaMA-3.1, and Gemma-2 -- to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: https://github.com/aimagelab/LLaVA-MORE.

Grammar and Gameplay-aligned RL for Game Description Generation with LLMs

Tsunehiko Tanaka,Edgar Simo-Serra

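Task: Generate game descriptions written in a Game Description Language (GDL) from natural language text.

Motivation: Existing methods struggle to accurately reproduce the game features of the game descriptions.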

Details

Method: Proposes reinforcement learning-based fine-tuning of LLMs for GDG (RLGDG), which combines grammar rewards and concept rewards and uses a two-stage training strategy (SFT followed by RL). Result: Experiments show that the proposed method significantly outperforms baselines that use SFT alone. Conclusion: RL fine-tuning is effective at improving both grammatical correctness and fidelity to game concepts. Abstract: Game Description Generation (GDG) is the task of generating a game description written in a Game Description Language (GDL) from natural language text. Previous studies have explored generation methods leveraging the contextual understanding capabilities of Large Language Models (LLMs); however, accurately reproducing the game features of the game descriptions remains a challenge. In this paper, we propose reinforcement learning-based fine-tuning of LLMs for GDG (RLGDG). Our training method simultaneously improves grammatical correctness and fidelity to game concepts by introducing both grammar rewards and concept rewards. Furthermore, we adopt a two-stage training strategy where Reinforcement Learning (RL) is applied following Supervised Fine-Tuning (SFT). Experimental results demonstrate that our proposed method significantly outperforms baseline methods using SFT alone.

EarthScape: A Multimodal Dataset for Surficial Geologic Mapping and Earth Surface Analysis

Matthew Massey,Abdullah-Al-Zubaer Imran

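Task: Introduce and validate EarthScape, a new multimodal dataset for surficial geologic mapping and Earth surface analysis.

Motivation: Traditional surficial geologic mapping is labor-intensive, which limits spatial coverage and introduces potential biases.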

Details

Method: EarthScape integrates high-resolution aerial RGB and near-infrared (NIR) imagery, digital elevation models (DEM), multi-scale DEM-derived terrain features, and hydrologic and infrastructure vector data, with detailed annotations for seven distinct surficial geologic classes. Result: A comprehensive data processing pipeline is presented, and baseline benchmarks using different spatial modalities demonstrate the utility of EarthScape. Conclusion: EarthScape bridges the gap between computer vision and Earth sciences and is a valuable resource for research in multimodal learning, geospatial analysis, and geological mapping. Abstract: Surficial geologic mapping is essential for understanding Earth surface processes, addressing modern challenges such as climate change and national security, and supporting common applications in engineering and resource management. However, traditional mapping methods are labor-intensive, limiting spatial coverage and introducing potential biases. To address these limitations, we introduce EarthScape, a novel, AI-ready multimodal dataset specifically designed for surficial geologic mapping and Earth surface analysis. EarthScape integrates high-resolution aerial RGB and near-infrared (NIR) imagery, digital elevation models (DEM), multi-scale DEM-derived terrain features, and hydrologic and infrastructure vector data. The dataset provides detailed annotations for seven distinct surficial geologic classes encompassing various geological processes. We present a comprehensive data processing pipeline using open-sourced raw data and establish baseline benchmarks using different spatial modalities to demonstrate the utility of EarthScape. As a living dataset with a vision for expansion, EarthScape bridges the gap between computer vision and Earth sciences, offering a valuable resource for advancing research in multimodal learning, geospatial analysis, and geological mapping. Our code is available at https://github.com/masseygeo/earthscape.

Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation

Shangqing Zhao,Yuhao Zhou,Yupei Ren,Zhe Chen,Chenghao Jia,Fang Zhe,Zhaogaung Long,Shu Liu,Man Lan

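Task: Evaluate the understanding and generation capabilities of large language models on classical Chinese text.

Motivation: Classical Chinese has distinct linguistic features, complex structural constraints, and rich cultural context; existing benchmarks mainly assess comprehension through multiple-choice questions, leaving a significant gap in assessing generation.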

Details

Method: Introduces the Fùxì benchmark, which covers 21 diverse tasks balanced between comprehension and generation, designs evaluation metrics specifically for classical Chinese text generation, and uses a systematic assessment framework that considers both linguistic accuracy and cultural authenticity. Result: Models perform well on comprehension tasks but poorly on generation tasks, particularly those requiring deep cultural knowledge and adherence to classical formats. Conclusion: The study exposes current limitations in ancient Chinese text processing and offers insights for future model development. The benchmark, evaluation toolkit, and baseline results are publicly available to facilitate research in this area. Abstract: Ancient Chinese text processing presents unique challenges for large language models (LLMs) due to its distinct linguistic features, complex structural constraints, and rich cultural context. While existing benchmarks have primarily focused on evaluating comprehension through multiple-choice questions, there remains a critical gap in assessing models' generative capabilities in classical Chinese. We introduce Fùxì, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. Our benchmark distinguishes itself through three key contributions: (1) balanced coverage of both comprehension and generation tasks, including novel tasks like poetry composition and couplet completion, (2) specialized evaluation metrics designed specifically for classical Chinese text generation, combining rule-based verification with fine-tuned LLM evaluators, and (3) a systematic assessment framework that considers both linguistic accuracy and cultural authenticity. Through extensive evaluation of state-of-the-art LLMs, we reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks, particularly those requiring deep cultural knowledge and adherence to classical formats. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development. The benchmark, evaluation toolkit, and baseline results are publicly available to facilitate research in this domain.

Vision-Speech Models: Teaching Speech Models to Converse about Images

Amélie Royer,Moritz Böhle,Gabriel de Marmiesse,Laurent Mazaré,Neil Zeghidour,Alexandre Défossez,Patrick Pérez

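Task: Build MoshiVis, a multimodal speech model that can freely converse about images.

Motivation: Address the challenges facing vision-speech models: scarce data, real-time latency, and preserving prosodic features.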

Details

Method: An existing dialogue speech LLM, Moshi, is augmented with lightweight adaptation modules, and a dynamic gating mechanism lets the model switch between visual inputs and unrelated conversation topics. Result: The model is evaluated on visual understanding tasks, and qualitative samples of interactions with MoshiVis are reported. Conclusion: MoshiVis advances vision-speech modeling, and the inference code and image-speech data will be made available. Abstract: The recent successes of Vision-Language models raise the question of how to equivalently imbue a pretrained speech model with vision understanding, an important milestone towards building a multimodal speech model able to freely converse about images. Building such a conversational Vision-Speech model brings its unique challenges: (i) paired image-speech datasets are much scarcer than their image-text counterparts, (ii) ensuring real-time latency at inference is crucial, thus bringing compute and memory constraints, and (iii) the model should preserve prosodic features (e.g., speaker tone) which cannot be inferred from text alone. In this work, we introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules. An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics. To reduce training costs, we design a simple one-stage, parameter-efficient fine-tuning pipeline in which we leverage a mixture of image-text (i.e., "speechless") and image-speech samples. We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis. Our inference code will be made available, as well as the image-speech data used for audio evaluation.

Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey

Xiaoou Liu,Tiejin Chen,Longchao Da,Chacha Chen,Zhen Lin,Hua Wei

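Task: Propose a new taxonomy that categorizes uncertainty quantification methods by computational efficiency and uncertainty dimension (input, reasoning, parameter, and prediction uncertainty).

Motivation: Large Language Models (LLMs) are increasingly used in high-stakes domains such as healthcare, law, and transportation, but their reliability is a major concern because they often produce plausible but incorrect responses. Uncertainty quantification (UQ) enhances trustworthiness by estimating confidence in outputs, enabling risk mitigation and selective prediction. However, traditional UQ methods are hard to apply to LLMs because of computational constraints and decoding inconsistencies. LLMs also introduce unique sources of uncertainty, such as input ambiguity, reasoning path divergence, and decoding stochasticity, that go beyond classical aleatoric and epistemic uncertainty.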

Details

Method: Introduces a new taxonomy, evaluates existing techniques, assesses their real-world applicability, and identifies open challenges. Result: Highlights the need for scalable, interpretable, and robust UQ approaches to enhance LLM reliability. Conclusion: The survey proposes a new taxonomy and stresses the importance of developing scalable, interpretable, and robust UQ methods to improve LLM reliability. Abstract: Large Language Models (LLMs) excel in text generation, reasoning, and decision-making, enabling their adoption in high-stakes domains such as healthcare, law, and transportation. However, their reliability is a major concern, as they often produce plausible but incorrect responses. Uncertainty quantification (UQ) enhances trustworthiness by estimating confidence in outputs, enabling risk mitigation and selective prediction. However, traditional UQ methods struggle with LLMs due to computational constraints and decoding inconsistencies. Moreover, LLMs introduce unique uncertainty sources, such as input ambiguity, reasoning path divergence, and decoding stochasticity, that extend beyond classical aleatoric and epistemic uncertainty. To address this, we introduce a new taxonomy that categorizes UQ methods based on computational efficiency and uncertainty dimensions (input, reasoning, parameter, and prediction uncertainty). We evaluate existing techniques, assess their real-world applicability, and identify open challenges, emphasizing the need for scalable, interpretable, and robust UQ approaches to enhance LLM reliability.

A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition

Ritabrata Chakraborty,Shivakumara Palaiahnakote,Umapada Pal,Cheng-Lin Liu

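Task: Propose a training-free, plug-and-play framework for modern scene text recognition systems.

Motivation: Existing systems cannot be deployed in real-time scenarios because of constraints on memory, computational resources, and latency.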

Details

Method: The framework leverages pre-trained text recognizers, introduces an attention-based segmentation stage that refines candidate text regions, and scores candidate texts through semantic and lexical evaluation. Result: On public benchmarks, the framework performs on par with state-of-the-art systems while requiring substantially fewer resources. Conclusion: The framework keeps performance high while significantly reducing resource consumption, making it suitable for real-time scene text recognition. Abstract: Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. In such cases, the deployment of heavy models becomes impractical due to constraints on memory, computational resources, and latency. To address these challenges, we propose a novel, training-free plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computations. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level, improving downstream recognition. Instead of performing traditional text detection, it follows a block-level comparison between feature map and source image and harnesses contextual information using pretrained captioners, allowing the framework to generate word predictions directly from scene context. Candidate texts are semantically and lexically evaluated to get a final score. Predictions that meet or exceed a pre-defined confidence threshold bypass the heavier process of end-to-end text STR profiling, ensuring faster inference and cutting down on unnecessary computations. Experiments on public benchmarks demonstrate that our paradigm achieves performance on par with state-of-the-art systems, yet requires substantially fewer resources.
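
The threshold-based bypass described in the abstract amounts to a confidence-gated early exit. The sketch below is a rough illustration under stated assumptions, not the paper's scoring: the equal-weight average, the `difflib`-based lexical score, the 0.8 threshold, and every function name are hypothetical, and the semantic score is taken as an input because the abstract does not say how it is computed.

```python
from difflib import SequenceMatcher
from typing import Sequence, Tuple


def lexical_score(candidate: str, lexicon: Sequence[str]) -> float:
    """Best string similarity (0..1) between the candidate and any lexicon word."""
    return max(
        (SequenceMatcher(None, candidate.lower(), word.lower()).ratio()
         for word in lexicon),
        default=0.0,
    )


def final_score(semantic: float, lexical: float, weight: float = 0.5) -> float:
    """Weighted average of the two scores (the weighting is an assumption)."""
    return weight * semantic + (1.0 - weight) * lexical


def route(candidate: str, semantic: float, lexicon: Sequence[str],
          threshold: float = 0.8) -> Tuple[str, float]:
    """Accept the cheap prediction if it clears the threshold, else fall back."""
    score = final_score(semantic, lexical_score(candidate, lexicon))
    decision = "accept" if score >= threshold else "fallback"
    return decision, score
```

A call such as `route("EXIT", 0.9, ["exit", "open"])` accepts the candidate, while a low-scoring candidate is routed to the heavier recognizer; only the second path pays the full inference cost.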

Typed-RAG: Type-aware Multi-Aspect Decomposition for Non-Factoid Question Answering

DongGeon Lee,Ahjeong Park,Hyeri Lee,Hyeonseo Nam,Yunho Maeng

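Task: Propose Typed-RAG, a type-aware multi-aspect decomposition framework for non-factoid question answering (NFQA).

Motivation: NFQA is open-ended, involves diverse intents, and requires multi-aspect reasoning, which makes conventional QA approaches, including retrieval-augmented generation (RAG), inadequate.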

Details

Method: Typed-RAG classifies non-factoid questions into distinct types (such as debate, experience, and comparison) and applies aspect-based decomposition to refine retrieval and generation strategies. Result: Experiments show that Typed-RAG outperforms baselines on the Wiki-NFQA benchmark dataset, highlighting the importance of type-aware decomposition for NFQA. Conclusion: Through type-aware multi-aspect decomposition, Typed-RAG generates more informative and contextually relevant responses, providing an effective approach to NFQA. Abstract: Non-factoid question-answering (NFQA) poses a significant challenge due to its open-ended nature, diverse intents, and the need for multi-aspect reasoning, which renders conventional factoid QA approaches, including retrieval-augmented generation (RAG), inadequate. Unlike factoid questions, non-factoid questions (NFQs) lack definitive answers and require synthesizing information from multiple sources across various reasoning dimensions. To address these limitations, we introduce Typed-RAG, a type-aware multi-aspect decomposition framework within the RAG paradigm for NFQA. Typed-RAG classifies NFQs into distinct types -- such as debate, experience, and comparison -- and applies aspect-based decomposition to refine retrieval and generation strategies. By decomposing multi-aspect NFQs into single-aspect sub-queries and aggregating the results, Typed-RAG generates more informative and contextually relevant responses. To evaluate Typed-RAG, we introduce Wiki-NFQA, a benchmark dataset covering diverse NFQ types. Experimental results demonstrate that Typed-RAG outperforms baselines, thereby highlighting the importance of type-aware decomposition for effective retrieval and generation in NFQA. Our code and dataset are available at https://github.com/TeamNLP/Typed-RAG.

Jumanh Atoum,Garrison L. H. Johnston,Nabil Simaan,Jie Ying Wu

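Task: Recognize surgical gestures in real time as a step towards automated activity recognition, skill assessment, intra-operative assistance, and eventually surgical automation.

Motivation: Current robotic surgical systems provide rich multi-modal data (such as video and kinematics), but existing approaches treat kinematics as independent signals and ignore the geometric relations between tool-tip poses.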

Details

Method: Combines motion invariant measures (curvature and torsion) with vision and kinematics data, using a relational graph network to capture the underlying relations between the data streams. Result: On the JIGSAWS suturing dataset, gesture recognition that combines invariant signals with tool position reaches 90.3% frame-wise accuracy. Conclusion: Motion invariant signals coupled with position represent gestures better than traditional position and quaternion representations, underscoring the importance of geometry-aware modeling of kinematics for gesture recognition. Abstract: Recognizing surgical gestures in real time is a stepping stone towards automated activity recognition, skill assessment, intra-operative assistance, and eventually surgical automation. The current robotic surgical systems provide us with rich multi-modal data such as video and kinematics. While some recent works in multi-modal neural networks learn the relationships between vision and kinematics data, current approaches treat kinematics information as independent signals, with no underlying relation between tool-tip poses. However, instrument poses are geometrically related, and the underlying geometry can aid neural networks in learning gesture representation. Therefore, we propose combining motion invariant measures (curvature and torsion) with vision and kinematics data using a relational graph network to capture the underlying relations between different data streams. We show that gesture recognition improves when combining invariant signals with tool position, achieving 90.3% frame-wise accuracy on the JIGSAWS suturing dataset. Our results show that motion invariant signals coupled with position are better representations of gesture motion compared to traditional position and quaternion representations. Our results highlight the need for geometric-aware modeling of kinematics for gesture recognition.
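
Curvature and torsion are called motion invariants because they do not change when the whole trajectory is rotated or translated. For a path r(t) the standard formulas are κ = |r' × r''| / |r'|³ and τ = (r' × r'') · r''' / |r' × r''|². The sketch below estimates both from uniformly sampled 3D tool-tip positions using central finite differences; the function name, the sampling assumptions, and the zero-return guards are illustrative choices, not the authors' pipeline.

```python
import math


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def curvature_torsion(points, h, i, eps=1e-12):
    """Estimate (curvature, torsion) at sample i of a uniformly sampled 3D path.

    points: sequence of (x, y, z) tuples sampled every h time units.
    Needs two neighbors on each side of i for the third derivative.
    """
    if i < 2 or i > len(points) - 3:
        raise IndexError("need two samples on each side of index i")
    p = points
    # Central finite differences for the first three derivatives.
    d1 = tuple((p[i + 1][k] - p[i - 1][k]) / (2 * h) for k in range(3))
    d2 = tuple((p[i + 1][k] - 2 * p[i][k] + p[i - 1][k]) / h ** 2
               for k in range(3))
    d3 = tuple((p[i + 2][k] - 2 * p[i + 1][k] + 2 * p[i - 1][k] - p[i - 2][k])
               / (2 * h ** 3) for k in range(3))
    speed = math.sqrt(_dot(d1, d1))
    if speed < eps:
        return 0.0, 0.0  # stationary sample: report zeros by convention
    c = _cross(d1, d2)
    c_norm = math.sqrt(_dot(c, c))
    kappa = c_norm / speed ** 3
    # Torsion is undefined on straight segments; report 0.0 by convention.
    tau = _dot(c, d3) / c_norm ** 2 if c_norm > eps else 0.0
    return kappa, tau
```

A quick sanity check is the helix (cos t, sin t, t), whose curvature and torsion are both exactly 1/2; sampled with h = 0.01, the estimates should agree to within the finite-difference error.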

Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models

Baolong Bi,Shenghua Liu,Yiwei Wang,Yilong Xu,Junfeng Fang,Lingrui Mei,Xueqi Cheng

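Task: Propose a method for controlling a large language model's (LLM's) reliance on parametric versus contextual knowledge, to address knowledge conflicts in retrieval-augmented generation (RAG).

Motivation: In RAG, conflicts between parametric knowledge and retrieved context make it hard for the model to decide which knowledge to rely on, particularly when the retrieved information is unreliable or the model's internal knowledge is outdated.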

Details

Method: Proposes CK-PLUG, which introduces a new knowledge consistency metric, Confidence Gain, to detect knowledge conflicts, and achieves fine-grained control over knowledge reliance by adjusting the probability distribution of tokens with negative confidence gain. Result: Experiments show that CK-PLUG can significantly regulate knowledge reliance in counterfactual RAG scenarios while maintaining generation fluency and knowledge accuracy. For example, on Llama3-8B the memory recall (MR) of RAG responses can be adjusted within 9.9%-71.9%, compared to a baseline of 42.1%. Conclusion: CK-PLUG supports adaptive control based on the model's confidence in its internal and external knowledge and achieves consistent performance improvements across various general RAG tasks. Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, conflicts between parametric knowledge and retrieved context pose challenges, particularly when retrieved information is unreliable or the model's internal knowledge is outdated. In such cases, LLMs struggle to determine whether to rely more on their own parameters or the conflicted context. To address this, we propose CK-PLUG, a plug-and-play method for controlling LLMs' reliance on parametric and contextual knowledge. We introduce a novel knowledge consistency metric, Confidence Gain, which detects knowledge conflicts by measuring entropy shifts in token probability distributions after context insertion. CK-PLUG then enables fine-grained control over knowledge preference by adjusting the probability distribution of tokens with negative confidence gain through a single tuning parameter. Experiments demonstrate CK-PLUG's ability to significantly regulate knowledge reliance in counterfactual RAG scenarios while maintaining generation fluency and knowledge accuracy. For instance, on Llama3-8B, memory recall (MR) of RAG response can be adjusted within a broad range (9.9%-71.9%), compared to the baseline of 42.1%. Moreover, CK-PLUG supports adaptive control based on the model's confidence in both internal and external knowledge, achieving consistent performance improvements across various general RAG tasks. Our code is available at: https://github.com/byronBBL/CK-PLUG.
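
The abstract describes Confidence Gain as an entropy shift in the next-token distribution once context is inserted. A minimal sketch of that idea follows; the sign convention (entropy without context minus entropy with context, so a negative value means the context made the model less certain) and the function names are assumptions, and the subsequent probability adjustment is left out because the abstract does not give its form.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence_gain(p_without_context: Sequence[float],
                    p_with_context: Sequence[float]) -> float:
    """Entropy drop caused by inserting context (assumed sign convention).

    Positive: the context sharpens the next-token distribution.
    Negative: the context makes it flatter, hinting at a knowledge conflict.
    """
    return entropy(p_without_context) - entropy(p_with_context)
```

With a uniform four-token distribution before context and a sharply peaked one after, the gain is positive; swapping the two arguments yields a negative gain, the case the abstract singles out for adjustment.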

Miguel Ureña Pliego,Rubén Martínez Marín,Nianfang Shi,Takeru Shibayama,Ulrich Leth,Miguel Marchamalo Sacristán

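Task: Explore the integration of machine learning into urban aerial image analysis, focusing on identifying infrastructure surfaces for cars and pedestrians and analyzing historical trends.

Motivation: Emphasize the transition from convolutional architectures to transformer-based pre-trained models and their potential in global geospatial analysis.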

Details

Method: Presents a workflow for automatically generating geospatial datasets, which can create semantic segmentation datasets from various sources, including WMS/WMTS links, vectorial cartography, and OpenStreetMap (OSM) overpass-turbo requests. Result: Using aerial imagery and vectorial data from the geographical offices of Madrid and Vienna, two datasets were generated for car and pedestrian surface detection. A transformer-based model was trained and evaluated for each city and showed good accuracy. Conclusion: The technique lets municipal governments gather valuable data at minimal cost. Abstract: This study explores the integration of machine learning into urban aerial image analysis, with a focus on identifying infrastructure surfaces for cars and pedestrians and analyzing historical trends. It emphasizes the transition from convolutional architectures to transformer-based pre-trained models, underscoring their potential in global geospatial analysis. A workflow is presented for automatically generating geospatial datasets, enabling the creation of semantic segmentation datasets from various sources, including WMS/WMTS links, vectorial cartography, and OpenStreetMap (OSM) overpass-turbo requests. The developed code allows a fast dataset generation process for training machine learning models using openly available data without manual labelling. Using aerial imagery and vectorial data from the respective geographical offices of Madrid and Vienna, two datasets were generated for car and pedestrian surface detection. A transformer-based model was trained and evaluated for each city, demonstrating good accuracy values. The historical trend analysis involved applying the trained model to earlier images predating the availability of vectorial data by 10 to 20 years, successfully identifying temporal trends in infrastructure for pedestrians and cars across different city areas. This technique is applicable for municipal governments to gather valuable data at a minimal cost.

From Structured Prompts to Open Narratives: Measuring Gender Bias in LLMs Through Open-Ended Storytelling

Evan Chen,Run-Jun Zhan,Yan-Bai Lin,Hung-Hsuan Chen

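Task: Introduce a new evaluation framework to uncover gender bias in large language models (LLMs), particularly in their occupational narratives.

Motivation: Although LLMs have revolutionized natural language processing, their tendency to reflect or amplify social biases in their training data remains a concern.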

Details

Method: Uses free-form storytelling, rather than structured scenarios or carefully crafted prompts, to reveal biases in the models. Result: Systematic analysis shows that female characters are overrepresented across occupations in six widely used LLMs. In addition, LLM-generated occupational gender rankings align more closely with human stereotypes than with actual labor statistics. Conclusion: These findings underscore the need for balanced mitigation strategies that ensure fairness while avoiding the reinforcement of new stereotypes. Abstract: Large Language Models (LLMs) have revolutionized natural language processing, yet concerns persist regarding their tendency to reflect or amplify social biases present in their training data. This study introduces a novel evaluation framework to uncover gender biases in LLMs, focusing on their occupational narratives. Unlike previous methods relying on structured scenarios or carefully crafted prompts, our approach leverages free-form storytelling to reveal biases embedded in the models. Systematic analyses show an overrepresentation of female characters across occupations in six widely used LLMs. Additionally, our findings reveal that LLM-generated occupational gender rankings align more closely with human stereotypes than with actual labor statistics. These insights underscore the need for balanced mitigation strategies to ensure fairness while avoiding the reinforcement of new stereotypes.

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Shravan Nayak,Xiangru Jian,Kevin Qinghong Lin,Juan A. Rodriguez,Montek Kalsi,Rabiul Awal,Nicolas Chapados,M. Tamer Özsu,Aishwarya Agrawal,David Vazquez,Christopher Pal,Perouz Taslakian,Spandana Gella,Sai Rajeswar

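Task: Introduce UI-Vision, a comprehensive, license-permissive benchmark for offline evaluation of computer use agents in real-world desktop environments.

Motivation: Existing research focuses mainly on online settings. Desktop environments are critical for many professional and everyday tasks but remain underexplored because of data collection challenges and licensing issues.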

Details

Method: UI-Vision provides dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs), and defines three fine-to-coarse-grained tasks: Element Grounding, Layout Grounding, and Action Prediction. Result: Evaluation reveals critical limitations of state-of-the-art models such as UI-TARS-72B, including difficulty with understanding professional software, spatial reasoning, and complex actions such as drag-and-drop. Conclusion: The release of UI-Vision aims to advance the development of more capable agents for real-world desktop tasks. Abstract: Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse-grained tasks (Element Grounding, Layout Grounding, and Action Prediction) with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer use agents. By releasing UI-Vision as open-source, we aim to advance the development of more capable agents for real-world desktop tasks.

Towards Automatic Continual Learning: A Self-Adaptive Framework for Continual Instruction Tuning

Peiyi Lin,Fukai Zhang,Kai Niu,Hao Fu

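Task: Propose an automated continual instruction tuning framework that dynamically filters incoming data to reduce redundancy.

Motivation: In domain-specific contexts, maintaining data quality and managing system constraints are key challenges. Existing methods focus mainly on how to retain old knowledge rather than on selecting which new knowledge to learn.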

Details

Method: Uses a small proxy model for efficient perplexity-based filtering and updates the proxy so that the filtering criteria stay aligned with the evolving state of the deployed model. Result: Evaluated in real-world medical scenarios, the system reduced computational costs by 66.7%, improved model performance, and achieved autonomous updates. Conclusion: The framework is effective for automatic continual instruction tuning, handles incrementally acquired data and shifting distributions, and addresses practical deployment challenges. Abstract: Continual instruction tuning enables large language models (LLMs) to learn incrementally while retaining past knowledge, whereas existing methods primarily focus on how to retain old knowledge rather than on selecting which new knowledge to learn. In domain-specific contexts, maintaining data quality and managing system constraints remain key challenges. To address these issues, we propose an automated continual instruction tuning framework that dynamically filters incoming data, identifying and reducing redundant data across successive updates. Our approach utilizes a small proxy model for efficient perplexity-based filtering, and updates the proxy to ensure that the filtering criteria remain aligned with the evolving state of the deployed model. Compared to existing static data selection methods, our framework can effectively handle incrementally acquired data and shifting distributions. Additionally, it addresses practical deployment challenges by enabling seamless model updates, supporting version rollback and incorporating automatic checkpoint evaluation. We evaluated the system in real-world medical scenarios. It reduced computational costs by 66.7%, improved model performance, and achieved autonomous updates, thus demonstrating its effectiveness for automatic continual instruction tuning.
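
A minimal sketch of perplexity-based filtering follows. It assumes a hypothetical proxy model that returns per-token log-probabilities for each sample; the band thresholds `low` and `high` are illustrative choices, since the abstract does not state the actual filtering criterion.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def filter_batch(samples, low, high):
    """Keep samples whose proxy perplexity falls in [low, high].

    `samples` is a list of (text, token_logprobs) pairs, where the
    log-probabilities come from a hypothetical small proxy model.
    Assumption: very low perplexity suggests the sample is redundant
    for the current model, very high perplexity suggests noise.
    """
    kept = []
    for text, logprobs in samples:
        if low <= perplexity(logprobs) <= high:
            kept.append(text)
    return kept
```

After each update of the deployed model, the proxy would be refreshed and the same filter re-applied to newly arriving data, which is how the filtering criterion can track the model's evolving state.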

Toward Scalable, Flexible Scene Flow for Point Clouds

Kyle Vedder

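Task: Describe 3D motion between temporally successive observations.

Motivation: Build scene flow estimators that are scalable and flexible, so that they work across a variety of domains and motion patterns without significant hyperparameter tuning.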

Details

Method: Builds and scales feedforward scene flow estimators through large-scale distillation from pseudolabels provided by strong unsupervised test-time optimization methods, introduces a new full-sequence problem formulation, and creates a benchmark to better measure estimate quality. Result: Presents a state-of-the-art unsupervised scene flow estimator that shows great promise in adjacent domains such as 3D point tracking. Conclusion: Lays the foundation for future development of scene flow and discusses its potential broader impacts. Abstract: Scene flow estimation is the task of describing 3D motion between temporally successive observations. This thesis aims to lay the foundation for building scene flow estimators with two important properties: they are scalable, i.e. they improve with access to more data and computation, and they are flexible, i.e. they work out-of-the-box in a variety of domains and on a variety of motion patterns without requiring significant hyperparameter tuning. In this dissertation we present several concrete contributions towards this. In Chapter 1 we contextualize scene flow and its prior methods. In Chapter 2 we present a blueprint to build and scale feedforward scene flow estimators without requiring expensive human annotations via large scale distillation from pseudolabels provided by strong unsupervised test-time optimization methods. In Chapter 3 we introduce a benchmark to better measure estimate quality across diverse object types, better bringing into focus what we care about and expect from scene flow estimators, and use this benchmark to host a public challenge that produced significant progress. In Chapter 4 we present a state-of-the-art unsupervised scene flow estimator that introduces a new, full sequence problem formulation and exhibits great promise in adjacent domains like 3D point tracking. Finally, in Chapter 5 I philosophize about what's next for scene flow and its potential future broader impacts.

From Chaos to Order: The Atomic Reasoner Framework for Fine-grained Reasoning in Large Language Models

Jinyi Liu,Yan Zheng,Rong Cheng,Qiyu Wu,Wei Guo,Fei Ni,Hebin Liang,Yifu Yuan,Hangyu Mao,Fuzheng Zhang,Jianye Hao

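Task: Propose Atomic Reasoner (AR), a cognitive inference strategy, to address the problems of large language models in logical reasoning.

Motivation: Current inference scaling paradigms have two fundamental limitations: fragmented thought flows that compromise logical coherence, and computational complexity that escalates with the dimensionality of the search space.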

Details

Method: AR decomposes the reasoning process into atomic cognitive units and uses a cognitive routing mechanism to dynamically construct reasoning representations and orchestrate inference pathways. Result: Experiments show that AR achieves superior reasoning capabilities without exhaustive solution search, excelling in particular on linguistic logic puzzles. Conclusion: AR effectively enhances LLMs' capacity for long-sequence logical reasoning and deliberation. Abstract: Recent advances in large language models (LLMs) have shown remarkable progress, yet their capacity for logical "slow-thinking" reasoning persists as a critical research frontier. Current inference scaling paradigms suffer from two fundamental constraints: fragmented thought flows compromising logical coherence, and intensive computational complexity that escalates with search space dimensions. To overcome these limitations, we present Atomic Reasoner (AR), a cognitive inference strategy that enables fine-grained reasoning through systematic atomic-level operations. AR decomposes the reasoning process into atomic cognitive units, employing a cognitive routing mechanism to dynamically construct reasoning representations and orchestrate inference pathways. This systematic methodology implements stepwise, structured cognition, which ensures logical coherence while significantly reducing cognitive load, effectively simulating the cognitive patterns observed in human deep thinking processes. Extensive experimental results demonstrate AR's superior reasoning capabilities without the computational burden of exhaustive solution searches, particularly excelling in linguistic logic puzzles. These findings substantiate AR's effectiveness in enhancing LLMs' capacity for robust, long-sequence logical reasoning and deliberation.

DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis

Yuming Gu,Phong Tran,Yujian Zheng,Hongyi Xu,Heyuan Li,Adilbek Karmanov,Hao Li

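Task: Generate high-quality 360-degree views of human heads from single-view images.

Motivation: Enable accessible immersive telepresence applications and scalable personalized content creation.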

Details

Method: Builds on the DiffPortrait3D framework, adding a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure global front-back consistency. Result: The method can produce high-quality neural radiance fields (NeRFs) for real-time free-viewpoint rendering and outperforms existing methods in object synthesis and 360-degree head generation. Conclusion: The method performs well at generating 360-degree head views and handles human, stylized, and anthropomorphic forms, including accessories such as glasses and hats. Abstract: Generating high-quality 360-degree views of human heads from single-view images is essential for enabling accessible immersive telepresence applications and scalable personalized content creation. While cutting-edge methods for full head generation are limited to modeling realistic human heads, the latest diffusion-based approaches for style-omniscient head synthesis can produce only frontal views and struggle with view consistency, preventing their conversion into true 3D models for rendering from arbitrary angles. We introduce a novel approach that generates fully consistent 360-degree head views, accommodating human, stylized, and anthropomorphic forms, including accessories like glasses and hats. Our method builds on the DiffPortrait3D framework, incorporating a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure global front-back consistency. By training on continuous view sequences and integrating a back reference image, our approach achieves robust, locally continuous view synthesis. Our model can be used to produce high-quality neural radiance fields (NeRFs) for real-time, free-viewpoint rendering, outperforming state-of-the-art methods in object synthesis and 360-degree head generation for very challenging input portraits.

Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning

Chen Li,Nazhou Liu,Kai Yang

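Task: Propose Adaptive Group Policy Optimization (AGPO) to improve the training stability and inference efficiency of reasoning LLMs.

Motivation: Group Relative Policy Optimization (GRPO) is found to have deficiencies in RL stability and inference efficiency.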

Details

Method: Proposes two simple but effective modifications: a revised advantage estimation method that mitigates zero-variance situations, and a length-based reward that incentivizes the model to avoid overthinking. Result: Experiments show more stable training and comparable or superior performance with significantly fewer tokens in reasoning steps. Conclusion: AGPO offers clear advantages in training stability and inference efficiency. Abstract: Since DeepSeek-R1 popularized it, Group Relative Policy Optimization (GRPO) has become a core part of training reasoning LLMs. However, we find some deficiencies that affect RL stability and inference efficiency. Thus, we propose Adaptive Group Policy Optimization (AGPO), which contains two simple but effective modifications: a revised advantage estimation method to mitigate zero-variance situations, and a length-based reward that incentivizes the model to avoid overthinking. Experiments demonstrate that our methods achieve more stable training and comparable or superior performance with significantly fewer tokens in reasoning steps.
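
To make the zero-variance problem concrete: in group-relative advantage estimation, each sampled response is scored against its group's mean and standard deviation, so a group in which every response gets the same reward has zero standard deviation and provides no usable signal. The sketch below shows the standard normalization plus one possible fallback. The fallback rule and the length-shaped reward are assumptions for illustration; the abstract does not give AGPO's exact formulas.

```python
import statistics

def group_advantages(rewards, r_min=0.0, r_max=1.0, eps=1e-8):
    """Group-relative advantages with a hypothetical zero-variance fallback.

    Standard case: (r - mean) / std within the sampled group.
    Fallback (assumed): if all rewards are identical, use +1 when the
    whole group hit the maximum reward and -1 when it hit the minimum,
    instead of an undefined or all-zero advantage.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std < eps:
        if mean >= r_max:
            return [1.0] * len(rewards)
        if mean <= r_min:
            return [-1.0] * len(rewards)
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def length_shaped_reward(correct, n_tokens, max_tokens, beta=0.2):
    """Hypothetical length-based reward: correct answers earn less as
    they grow longer; incorrect answers earn nothing. Shown separately
    from group_advantages; combining them changes the reward range, so
    r_min and r_max would need to be set to match."""
    if not correct:
        return 0.0
    fraction = min(n_tokens, max_tokens) / max_tokens
    return 1.0 - beta * fraction
```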

CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image

Arindam Dutta,Meng Zheng,Zhongpai Gao,Benjamin Planche,Anwesha Choudhuri,Terrence Chen,Amit K. Roy-Chowdhury,Ziyan Wu

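Task: Reconstruct occluded clothed humans from a single image.

Motivation: Existing monocular clothed human reconstruction methods perform well in occlusion-free settings, but in real-world use occlusions lead to multiview-inconsistent and fragmented reconstructions. In addition, most methods rely on geometric priors that are hard to obtain (such as SMPL annotations). CHROME is proposed to address these problems.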

Details

Method: CHROME uses a multiview diffusion model to synthesize occlusion-free human images from the occluded input, combined with off-the-shelf pose control to explicitly enforce cross-view consistency. A 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on the occluded input and the synthesized views, aligning cross-view details to produce a consistent and accurate 3D representation. Result: Under challenging conditions, CHROME achieves significant improvements in novel view synthesis (up to 3 dB PSNR) and geometric reconstruction. Conclusion: CHROME reconstructs occlusion-resilient, multiview-consistent 3D humans from a single occluded image without requiring ground-truth geometric prior annotations or 3D supervision. Abstract: Reconstructing clothed humans from a single image is a fundamental task in computer vision with wide-ranging applications. Although existing monocular clothed human reconstruction solutions have shown promising results, they often rely on the assumption that the human subject is in an occlusion-free environment. Thus, when encountering in-the-wild occluded images, these algorithms produce multiview inconsistent and fragmented reconstructions. Additionally, most algorithms for monocular 3D human reconstruction leverage geometric priors such as SMPL annotations for training and inference, which are extremely challenging to acquire in real-world applications. To address these limitations, we propose CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-ConsistEncy from a Single Image, a novel pipeline designed to reconstruct occlusion-resilient 3D humans with multiview consistency from a single occluded image, without requiring either ground-truth geometric prior annotations or 3D supervision. Specifically, CHROME leverages a multiview diffusion model to first synthesize occlusion-free human images from the occluded input, compatible with off-the-shelf pose control to explicitly enforce cross-view consistency during synthesis. A 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on both the occluded input and synthesized views, aligning cross-view details to produce a cohesive and accurate 3D representation. CHROME achieves significant improvements in terms of both novel view synthesis (up to 3 dB PSNR) and geometric reconstruction under challenging conditions.

Exploratory Study into Relations between Cognitive Distortions and Emotional Appraisals

Navneet Agarwal,Kairit Sirts

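Task: Explore the relationship between cognitive distortions and emotional appraisal dimensions, and analyze the impact of cognitive restructuring on appraisal dimensions.

Motivation: Although emotional reappraisal and cognitive reframing are similar as emotion regulation techniques, these concepts have largely been studied in isolation. This study aims to fill that gap by exploring their relationship and its potential relevance for future interdisciplinary research.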

Details

Method: Conducts an exploratory computational study investigating the relationship between cognitive distortions and emotional appraisal dimensions. Result: The patterns of statistically significant relationships between cognitive distortions and appraisal dimensions vary across distortion categories, yielding distinct appraisal profiles for individual distortion classes. The study also analyzes the impact of cognitive restructuring on appraisal dimensions, illustrating its role in emotion regulation. Conclusion: There are significant relationships between cognitive distortions and emotional appraisal dimensions, and cognitive restructuring has an important effect on appraisal dimensions, which suggests new directions for interdisciplinary research. Abstract: In recent years, there has been growing interest in studying cognitive distortions and emotional appraisals from both computational and psychological perspectives. Despite considerable similarities between emotional reappraisal and cognitive reframing as emotion regulation techniques, these concepts have largely been examined in isolation. This research explores the relationship between cognitive distortions and emotional appraisal dimensions, examining their potential connections and relevance for future interdisciplinary studies. To this end, we conduct an exploratory computational study aimed at investigating the relationship between cognitive distortions and emotional appraisals. We show that the patterns of statistically significant relationships between cognitive distortions and appraisal dimensions vary across different distortion categories, giving rise to distinct appraisal profiles for individual distortion classes. Additionally, we analyze the impact of cognitive restructuring on appraisal dimensions, exemplifying the emotion regulation aspect of cognitive restructuring.

GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving

William Ljungbergh,Adam Lilja,Adam Tonderski,Arvid Laveno Ling,Carl Lindström,Willem Verbeke,Junsheng Fu,Christoffer Petersson,Lars Hammarstrand,Michael Felsberg

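Task: Propose GASP, a geometric and semantic self-supervised pre-training method for 4D geometric and semantic occupancy prediction in autonomous driving.

Motivation: Autonomous driving generates vast amounts of spatiotemporal data, which can be used to learn the geometric and semantic structure of the environment and its evolution over time.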

Details

Method: GASP learns a unified representation by predicting, at future points in spacetime, (1) general occupancy, (2) ego occupancy, and (3) high-level features distilled from a vision foundation model. Result: GASP is validated on multiple autonomous driving benchmarks, showing significant improvements in semantic occupancy forecasting, online mapping, and ego trajectory prediction. Conclusion: Continuous 4D geometric and semantic occupancy prediction provides a scalable and effective pre-training paradigm for autonomous driving. Abstract: Self-supervised pre-training based on next-token prediction has enabled large language models to capture the underlying structure of text, and has led to unprecedented performance on a large array of tasks when applied at scale. Similarly, autonomous driving generates vast amounts of spatiotemporal data, suggesting the possibility of harnessing scale to learn the underlying geometric and semantic structure of the environment and its evolution over time. In this direction, we propose a geometric and semantic self-supervised pre-training method, GASP, that learns a unified representation by predicting, at any queried future point in spacetime, (1) general occupancy, capturing the evolving structure of the 3D scene; (2) ego occupancy, modeling the ego vehicle path through the environment; and (3) distilled high-level features from a vision foundation model. By modeling geometric and semantic 4D occupancy fields instead of raw sensor measurements, the model learns a structured, generalizable representation of the environment and its evolution through time. We validate GASP on multiple autonomous driving benchmarks, demonstrating significant improvements in semantic occupancy forecasting, online mapping, and ego trajectory prediction. Our results demonstrate that continuous 4D geometric and semantic occupancy prediction provides a scalable and effective pre-training paradigm for autonomous driving. For code and additional visualizations, see https://research.zenseact.com/publications/gasp/.

InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer

Tony Zhang,Rickard Brännvall

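Task: Optimize transformer-based language models by combining model compression techniques with inhibitor attention, a novel attention mechanism.

Motivation: Explore an alternative to conventional scaled dot-product attention that saves computation and energy while maintaining model effectiveness.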

Details

Method: Replaces matrix multiplication and softmax activation with Manhattan distances and ReLU activations, proposes further adjustments to improve the training efficiency of the inhibitor mechanism, and evaluates it on the DistilBERT architecture. Result: The modified inhibitor transformer achieves competitive performance on standard NLP benchmarks, including GLUE and sentiment analysis tasks. Conclusion: Inhibitor attention offers potential computational and energy savings while maintaining model performance. Abstract: This work explores optimizing transformer-based language models by integrating model compression techniques with inhibitor attention, a novel alternative attention mechanism. Inhibitor attention employs Manhattan distances and ReLU activations instead of the matrix multiplications and softmax activation of the conventional scaled dot-product attention. This shift offers potential computational and energy savings while maintaining model effectiveness. We propose further adjustments to improve the inhibitor mechanism's training efficiency and evaluate its performance on the DistilBERT architecture. Our knowledge distillation experiments indicate that the modified inhibitor transformer model can achieve competitive performance on standard NLP benchmarks, including General Language Understanding Evaluation (GLUE) and sentiment analysis tasks.
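
The following sketch shows one way attention can be built from Manhattan distances and ReLU alone, with no dot products or softmax. The specific formulation (a distance score `z` that is subtracted from each value before a ReLU) and the scale `gamma` are assumptions for illustration; the abstract does not give the exact equations or the adjustments proposed in the paper.

```python
def inhibitor_attention(Q, K, V, gamma=1.0):
    """Attention from Manhattan distances and ReLU (illustrative form).

    Q, K: lists of query/key vectors of equal dimension.
    V: list of value vectors, one per key.
    For each query i and key j, the score z_ij is the scaled Manhattan
    distance between them. Each value is "inhibited" by subtracting
    z_ij, and ReLU zeroes whatever falls below zero, so distant keys
    contribute little or nothing to the output.
    """
    n_channels = len(V[0])
    out = []
    for q in Q:
        # Scaled Manhattan distance from this query to every key.
        z = [sum(abs(a - b) for a, b in zip(q, k)) / gamma for k in K]
        row = [
            sum(max(0.0, v[c] - z_j) for v, z_j in zip(V, z))
            for c in range(n_channels)
        ]
        out.append(row)
    return out
```

Only additions, subtractions, absolute values, and comparisons appear in the inner loops, which is the source of the potential computational and energy savings mentioned in the abstract.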

High Temporal Consistency through Semantic Similarity Propagation in Semi-Supervised Video Semantic Segmentation for Autonomous Flight

Cédric Vincent,Taehyoung Kim,Henri Meeß

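Task: Propose a lightweight video semantic segmentation method, suited to onboard real-time inference, that achieves high temporal consistency through semantic similarity propagation.

Motivation: Semantic segmentation from RGB cameras is essential to the perception of autonomous flying vehicles, and the stability of predictions directly affects their reliability and trustworthiness.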

Details

Method: Proposes Semantic Similarity Propagation (SSP), which compensates for camera motion with global registration alignment and combines the current estimate with the prior prediction by linear interpolation. Also proposes a consistency-aware knowledge distillation training procedure that uses a large image segmentation model as a teacher to train the efficient SSP. Result: KD-SSP improves temporal consistency by 12.5% and 6.7% on UAVid and RuralScapes, respectively, with higher accuracy and comparable inference speed. Conclusion: On aerial datasets, KD-SSP offers a better trade-off between segmentation quality and inference speed than other video methods and shows higher temporal consistency. Abstract: Semantic segmentation from RGB cameras is essential to the perception of autonomous flying vehicles. The stability of predictions through the captured videos is paramount to their reliability and, by extension, to the trustworthiness of the agents. In this paper, we propose a lightweight video semantic segmentation approach, suited to onboard real-time inference, that achieves high temporal consistency on aerial data through Semantic Similarity Propagation (SSP) across frames. SSP temporally propagates the predictions of an efficient image segmentation model with global registration alignment to compensate for camera movements. It combines the current estimation and the prior prediction with linear interpolation using weights computed from the feature similarities of the two frames. Because data availability is a challenge in this domain, we propose a consistency-aware Knowledge Distillation training procedure for sparsely labeled datasets with few annotations. Using a large image segmentation model as a teacher to train the efficient SSP, we leverage the strong correlations between labeled and unlabeled frames in the same training videos to obtain high-quality supervision on all frames. KD-SSP obtains a significant temporal consistency increase over the base image segmentation model of 12.5% and 6.7% TC on UAVid and RuralScapes respectively, with higher accuracy and comparable inference speed. On these aerial datasets, KD-SSP provides a better segmentation quality and inference speed trade-off than other video methods proposed for general applications and shows considerably higher consistency. The code will be made publicly available upon acceptance.
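
The interpolation step described in the abstract can be sketched per pixel as follows. This is a simplified illustration, not the authors' code: it assumes the prior frame has already been registered to the current one, and it uses clamped cosine similarity as the weight, which is an assumption since the abstract only says the weights are computed from feature similarities.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def propagate_pixel(current_probs, prior_probs, current_feat, prior_feat):
    """Blend class probabilities for one pixel across two frames.

    The weight w is high when the features of the two (aligned) frames
    agree, so the stable prior prediction dominates; when the features
    differ, the current estimate takes over.
    """
    w = min(1.0, max(0.0, cosine_similarity(current_feat, prior_feat)))
    return [w * p + (1.0 - w) * c for c, p in zip(current_probs, prior_probs)]
```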

ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph

Langming Liu,Haibin Chen,Yuhao Wang,Yujin Yuan,Shilei Liu,Wenbo Su,Xiangyu Zhao,Bo Zheng

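Task: Evaluate the capabilities of large language models (LLMs) in e-commerce knowledge.

Motivation: Because LLMs are widely used in e-commerce and strongly affect user experience and revenue, evaluating their factuality (e.g., hallucination) is especially important. Existing evaluation methods suffer from insufficient reliability, high consumption, and a lack of domain expertise, leaving a gap in effective assessment for e-commerce.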

Details

Method: Proposes the ECKGBench dataset, using a standardized workflow to automatically generate questions from a large-scale knowledge graph, which ensures sufficient reliability. A simple question-answering paradigm substantially improves evaluation efficiency by using minimal input and output tokens. Abundant e-commerce expertise is injected at each evaluation stage, including human annotation, prompt design, negative sampling, and verification. Result: Comprehensive evaluation of several advanced LLMs on ECKGBench yields detailed analysis and insights into leveraging LLMs for e-commerce. Conclusion: ECKGBench fills the gap in LLM evaluation for e-commerce, provides an efficient and reliable evaluation method, and offers a new perspective on applying LLMs in e-commerce. Abstract: Large language models (LLMs) have demonstrated their capabilities across various NLP tasks. Their potential in e-commerce is also substantial, evidenced by practical implementations such as platform search, personalized recommendations, and customer service. One primary concern associated with LLMs is their factuality (e.g., hallucination), which is urgent in e-commerce due to its significant impact on user experience and revenue. Although some methods have been proposed to evaluate LLMs' factuality, issues such as lack of reliability, high consumption, and lack of domain expertise leave a gap in effective assessment for e-commerce. To bridge the evaluation gap, we propose ECKGBench, a dataset specifically designed to evaluate the capacities of LLMs in e-commerce knowledge. Specifically, we adopt a standardized workflow to automatically generate questions based on a large-scale knowledge graph, guaranteeing sufficient reliability. We employ the simple question-answering paradigm, substantially improving the evaluation efficiency by using minimal input and output tokens. Furthermore, we inject abundant e-commerce expertise in each evaluation stage, including human annotation, prompt design, negative sampling, and verification. Besides, we explore the LLMs' knowledge boundaries in e-commerce from a novel perspective. Through comprehensive evaluations of several advanced LLMs on ECKGBench, we provide meticulous analysis and insights into leveraging LLMs for e-commerce.

The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation

Benidir Yanis,Gonthier Nicolas,Mallet Clement

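Task: Propose a method for generating large-scale hybrid semantic change detection datasets.

Motivation: Address the problem that existing bi-temporal change detection methods either require large volumes of annotated data or are limited to restricted datasets.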

Details

Method: Proposes HySCDG, a generative pipeline that creates a hybrid semantic change detection dataset containing both real VHR images and inpainted ones. Result: The generated FSC-180k dataset performs well across five change detection cases and yields a significant performance boost. Conclusion: Pretraining on the hybrid dataset significantly improves change detection performance and outperforms a fully synthetic dataset. Abstract: Bi-temporal change detection at scale based on Very High Resolution (VHR) images is crucial for Earth monitoring. This remains poorly addressed so far: methods either require large volumes of annotated data (semantic case), or are limited to restricted datasets (binary set-ups). Most approaches do not exhibit the versatility required for temporal and spatial adaptation: simplicity in architecture design and pretraining on realistic and comprehensive datasets. Synthetic datasets are the key solution but still fail to handle complex and diverse scenes. In this paper, we present HySCDG, a generative pipeline for creating a large hybrid semantic change detection dataset that contains both real VHR images and inpainted ones, along with land cover semantic maps at both dates and the change map. Being semantically and spatially guided, HySCDG generates realistic images, leading to a comprehensive and hybrid transfer-proof dataset FSC-180k. We evaluate FSC-180k on five change detection cases (binary and semantic), from zero-shot to mixed and sequential training, and also under low data regime training. Experiments demonstrate that pretraining on our hybrid dataset leads to a significant performance boost, outperforming SyntheWorld, a fully synthetic dataset, in every configuration. All code, models, and data are available here: https://yb23.github.io/projects/cywd/.

Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models

Mario Sanz-Guerrero,Katharina von der Wense

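Task: Propose and evaluate corrective in-context learning (CICL) for improving the performance of large language models (LLMs) on text classification tasks.

Motivation: Although in-context learning (ICL) performs well on NLP tasks, it is prone to errors on challenging examples. CICL is introduced in the hope of improving classification accuracy.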

Details

Method: Proposes CICL, which incorporates a model's incorrect predictions alongside ground-truth corrections into the prompt, aiming to improve classification accuracy through self-correction. Result: Experiments show that CICL underperforms standard ICL on text classification tasks, with performance degrading as the proportion of corrections in the prompt increases. CICL introduces confusion by disrupting the model's task understanding rather than refining its predictions. In addition, presenting harder examples in standard ICL does not improve performance. Conclusion: CICL did not achieve the expected effect and instead introduced confusion. The results provide important insights into the limitations of self-corrective mechanisms in LLMs and suggest directions for future research. Abstract: In-context learning (ICL) has transformed the use of large language models (LLMs) for NLP tasks, enabling few-shot learning by conditioning on labeled examples without finetuning. Despite its effectiveness, ICL is prone to errors, especially for challenging examples. With the goal of improving the performance of ICL, we propose corrective in-context learning (CICL), an approach that incorporates a model's incorrect predictions alongside ground truth corrections into the prompt, aiming to enhance classification accuracy through self-correction. However, contrary to our hypothesis, extensive experiments on text classification tasks demonstrate that CICL consistently underperforms standard ICL, with performance degrading as the proportion of corrections in the prompt increases. Our findings indicate that CICL introduces confusion by disrupting the model's task understanding, rather than refining its predictions. Additionally, we observe that presenting harder examples in standard ICL does not improve performance, suggesting that example difficulty alone may not be a reliable criterion for effective selection. By presenting these negative results, we provide important insights into the limitations of self-corrective mechanisms in LLMs and offer directions for future research.
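
To show what distinguishes a corrective prompt from a standard few-shot prompt, the sketch below builds one from a list of demonstrations. The field names and the prompt wording are hypothetical; the abstract does not specify the template used in the paper.

```python
def build_cicl_prompt(demos, query):
    """Build a few-shot classification prompt with corrections.

    Each demo is a dict with keys "text", "predicted", and "gold"
    (hypothetical field names). When the model's earlier prediction
    was wrong, the demo shows both the wrong prediction and the
    correct label; otherwise it is a plain ICL demonstration.
    """
    blocks = []
    for d in demos:
        lines = [f"Text: {d['text']}"]
        if d["predicted"] != d["gold"]:
            lines.append(f"Model prediction: {d['predicted']}")
            lines.append(f"Correct label: {d['gold']}")
        else:
            lines.append(f"Label: {d['gold']}")
        blocks.append("\n".join(lines))
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)
```

Varying how many demonstrations carry a wrong prediction corresponds to the "proportion of corrections in the prompt" that the abstract reports as hurting performance when increased.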

Multi-focal Conditioned Latent Diffusion for Person Image Synthesis

Jiaqi Liu,Jichao Zhang,Paolo Rota,Nicu Sebe

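Task: Propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address the weakness of latent diffusion models (LDMs) in preserving detail.

Motivation: LDMs excel at high-resolution image generation, but their compression process loses detail, particularly in sensitive regions such as facial features and clothing textures.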

Details

Method: A multi-focal condition aggregation module uses disentangled, pose-invariant features to enhance the model's ability to generate appearance-realistic and identity-consistent images. Result: Demonstrates consistent identity and appearance generation on the DeepFashion dataset and enables flexible person image editing. Conclusion: MCLD effectively addresses detail loss in LDMs and generates appearance-realistic, identity-consistent images. Abstract: The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, the compression process of LDM often results in the deterioration of details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations by conditioning the model on disentangled, pose-invariant features from these sensitive regions. Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information, enhancing the model's ability to produce appearance-realistic and identity-consistent images. Our method demonstrates consistent identity and appearance generation on the DeepFashion dataset and enables flexible person image editing due to its generation consistency. The code is available at https://github.com/jqliu09/mcld.

The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement

Ruihan Yang,Fanghua Ye,Jian Li,Siyu Yuan,Yikai Zhang,Zhaopeng Tu,Xiaolong Li,Deqing Yang

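Task: Propose Critique-Guided Improvement (CGI), a new two-player framework for enhancing the decision-making of agents based on large language models (LLMs).

Motivation: Natural language feedback offers richer, more actionable guidance than numerical reward signals and verifiers, but parsing and implementing such feedback effectively is challenging for LLM-based agents.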

Details

Method: Introduces a two-player framework comprising an actor model that explores an environment and a critic model that generates detailed natural language feedback. Training the critic to produce fine-grained assessments and actionable revisions, and the actor to use this feedback, promotes more robust strategy exploration. Result: Experiments in three interactive environments show that CGI substantially outperforms existing baselines. Even a small critic model surpasses GPT-4 in feedback quality. The resulting actor achieves state-of-the-art performance. Conclusion: CGI significantly enhances decision-making in LLM-based agents through explicit iterative guidance. Abstract: Large language models (LLMs) have recently transformed from text-based assistants to autonomous agents capable of planning, reasoning, and iteratively improving their actions. While numerical reward signals and verifiers can effectively rank candidate actions, they often provide limited contextual guidance. In contrast, natural language feedback better aligns with the generative capabilities of LLMs, providing richer and more actionable suggestions. However, parsing and implementing this feedback effectively can be challenging for LLM-based agents. In this work, we introduce Critique-Guided Improvement (CGI), a novel two-player framework, comprising an actor model that explores an environment and a critic model that generates detailed natural language feedback. By training the critic to produce fine-grained assessments and actionable revisions, and the actor to utilize these critiques, our approach promotes more robust exploration of alternative strategies while avoiding local optima. Experiments in three interactive environments show that CGI outperforms existing baselines by a substantial margin. Notably, even a small critic model surpasses GPT-4 in feedback quality. The resulting actor achieves state-of-the-art performance, demonstrating the power of explicit iterative guidance to enhance decision-making in LLM-based agents.

Technical Report for the 5th CLVision Challenge at CVPR: Addressing the Class-Incremental with Repetition using Unlabeled Data -- 4th Place Solution

Panagiota Moraiti,Efstathios Karypidis

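Task: Address the Class-Incremental with Repetition (CIR) scenario of the 5th CLVision challenge at CVPR.

Motivation: The CIR scenario introduces unique challenges and research opportunities, particularly through the integration of unlabeled data into training.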

Details

Method: Uses knowledge distillation and pseudo-labeling to retain previously learned knowledge and exploits unlabeled data during training. Result: Average accuracy is 16.68% in the pre-selection phase and 21.19% in the final evaluation phase, outperforming the baseline accuracy of 9.39%. Conclusion: By exploiting unlabeled data, the method maintains performance on instances of previously encountered categories and reduces the detrimental effects of catastrophic forgetting. Abstract: This paper outlines our approach to the 5th CLVision challenge at CVPR, which addresses the Class-Incremental with Repetition (CIR) scenario. In contrast to traditional class incremental learning, this novel setting introduces unique challenges and research opportunities, particularly through the integration of unlabeled data into the training process. In the CIR scenario, encountered classes may reappear in later learning experiences, and each experience may involve only a subset of the overall class distribution. Additionally, the unlabeled data provided during training may include instances of unseen classes, or irrelevant classes which should be ignored. Our approach focuses on retaining previously learned knowledge by utilizing knowledge distillation and pseudo-labeling techniques. The key characteristic of our method is the exploitation of unlabeled data during training, in order to maintain optimal performance on instances of previously encountered categories and reduce the detrimental effects of catastrophic forgetting. Our method achieves an average accuracy of 16.68% during the pre-selection phase and 21.19% during the final evaluation phase, outperforming the baseline accuracy of 9.39%. We provide the implementation code at https://github.com/panagiotamoraiti/continual-learning-challenge-2024 .
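
The two ingredients named in the abstract, knowledge distillation and pseudo-labeling, are commonly combined as sketched below. This is a generic illustration rather than the challenge submission: the temperature, the confidence threshold, and the rule of skipping low-confidence samples are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened outputs.

    The teacher is the model from the previous experience; matching
    its outputs on unlabeled data discourages forgetting old classes.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2

def pseudo_label(logits, threshold=0.9):
    """Return the predicted class index if the model is confident,
    else None. Skipping low-confidence samples is one simple way to
    ignore unlabeled instances from unseen or irrelevant classes."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None
```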

Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content

Sai Kartheek Reddy Kasu,Shankar Biradar,Sunil Saumya

Task: Introduce and analyze the Deceptive Humor Dataset (DHD), a new resource for studying how humor intertwines with misinformation.

Motivation: In an era of rampant misinformation, understanding how humor intertwines with deception is essential.

Details

Method: Uses the ChatGPT-4o model to generate humorous comments built on false narratives, each labeled with a satire level and a humor category. Result: Creates a multilingual humor dataset covering several languages and their code-mixed variants, and establishes strong baselines. Conclusion: DHD provides a structured foundation for analyzing deceptive humor and opens a new research direction on how humor influences the perception and spread of misinformation. Abstract: This paper presents the Deceptive Humor Dataset (DHD), a novel resource for studying humor derived from fabricated claims and misinformation. In an era of rampant misinformation, understanding how humor intertwines with deception is essential. DHD consists of humor-infused comments generated from false narratives, incorporating fabricated claims and manipulated information using the ChatGPT-4o model. Each instance is labeled with a Satire Level, ranging from 1 for subtle satire to 3 for high-level satire, and classified into five distinct Humor Categories: Dark Humor, Irony, Social Commentary, Wordplay, and Absurdity. The dataset spans multiple languages including English, Telugu, Hindi, Kannada, Tamil, and their code-mixed variants (Te-En, Hi-En, Ka-En, Ta-En), making it a valuable multilingual benchmark. By introducing DHD, we establish a structured foundation for analyzing humor in deceptive contexts, paving the way for a new research direction that explores how humor not only interacts with misinformation but also influences its perception and spread. We establish strong baselines for the proposed dataset, providing a foundation for future research to benchmark and advance deceptive humor detection models.

Representational Similarity via Interpretable Visual Concepts

Neehar Kondapaneni,Oisin Mac Aodha,Pietro Perona

Task: Compare how two deep neural networks differ in the way they arrive at a decision.

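Motivation: Measuring the similarity of deep networks is a long-standing open question. Existing methods usually give a single number for the similarity of two networks at a given layer, but do not explain why the networks are similar or different.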

Details

Method: Introduces an interpretable representational similarity method (RSVC) to compare two networks, and uses it to discover the visual concepts that two models share and those unique to each. Result: Some aspects of model differences can be attributed to unique concepts discovered by one model that are not well represented in the other. Conclusion: Extensive evaluation across different vision model architectures and training protocols demonstrates the effectiveness of RSVC. Abstract: How do two deep neural networks differ in how they arrive at a decision? Measuring the similarity of deep networks has been a long-standing open question. Most existing methods provide a single number to measure the similarity of two networks at a given layer, but give no insight into what makes them similar or dissimilar. We introduce an interpretable representational similarity method (RSVC) to compare two networks. We use RSVC to discover shared and unique visual concepts between two models. We show that some aspects of model differences can be attributed to unique concepts discovered by one model that are not well represented in the other. Finally, we conduct extensive evaluation across different vision model architectures and training protocols to demonstrate its effectiveness.

Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond

Yaoyao Yu,Leilei Gan,Yinghao Hu,Bin Wei,Kun Kuang,Fei Wu

Task: Evaluate large language models (LLMs) in legal scenarios, covering both Chinese and English legal tasks.

Motivation: Although LLMs perform well on general language tasks, their effectiveness in specialized fields such as law remains unclear.

Details

Method: A preliminary evaluation of 9 LLMs on 17 legal tasks, focusing on newly published and more complex challenges such as multi-defendant legal judgments and legal argument reasoning. Result: Although DeepSeek-R1 and OpenAI o1 are among the most powerful models, their legal reasoning is still lacking: they score below 80% on seven Chinese and on two English legal reasoning tasks. Conclusion: Even among the most advanced reasoning models, legal reasoning abilities remain underdeveloped. Abstract: Recently, Test-Time Scaling Large Language Models (LLMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated exceptional capabilities across various domains and tasks, particularly in reasoning. While these models have shown impressive performance on general language tasks, their effectiveness in specialized fields like law remains unclear. To address this, we present a preliminary evaluation of LLMs in various legal scenarios, covering both Chinese and English legal tasks. Our analysis includes 9 LLMs and 17 legal tasks, with a focus on newly published and more complex challenges such as multi-defendant legal judgments and legal argument reasoning. Our findings indicate that, despite DeepSeek-R1 and OpenAI o1 being among the most powerful models, their legal reasoning capabilities are still lacking. Specifically, these models score below 80% on seven Chinese legal reasoning tasks and below 80% on two English legal reasoning tasks. This suggests that, even among the most advanced reasoning models, legal reasoning abilities remain underdeveloped.

Sustainable Deep Learning-Based Breast Lesion Segmentation: Impact of Breast Region Segmentation on Performance

Sam Narimani,Solveig Roth Hoff,Kathinka Dahli Kurz,Kjell-Inge Gjesdal,Jurgen Geisler,Endre Grovik

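Task: Study the impact of breast region segmentation (BRS) on deep learning-based breast lesion segmentation (BLS) in breast dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI).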

Motivation: Accurate segmentation of breast lesions is a key step in diagnosis, treatment planning, and progress monitoring.

Details

Method: Using the Stavanger Dataset of 59 DCE-MRI scans and a UNet++ model, four different processes were run to compare the effect of BRS on BLS. Preprocessing included augmentation and oversampling to improve performance on the small dataset. The model was evaluated with a hybrid loss function and 5-fold cross-validation. Result: Using BRS considerably improved model performance, especially for the optimal-volume-with-BRS approach, which improved by around 50 percent. Energy consumption is also reported to decrease substantially, by up to 450 percent. Conclusion: BRS is highly effective for BLS, particularly in the optimal-volume-with-BRS approach, which both improves model performance and markedly reduces energy consumption, offering a greener solution for future work on large datasets. Abstract: Purpose: Segmentation of the breast lesion in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is an essential step to accurately diagnose, plan treatment, and monitor progress. This study aims to highlight the impact of breast region segmentation (BRS) on deep learning-based breast lesion segmentation (BLS) in breast DCE-MRI. Methods: Using the Stavanger Dataset, containing primarily 59 DCE-MRI scans, and UNet++ as the deep learning model, four different processes were conducted to compare the effect of BRS on BLS. These four approaches included the whole volume without BRS and with BRS, BRS with the selected lesion slices, and lastly optimal volume with BRS. Preprocessing methods like augmentation and oversampling were used to enhance the small dataset, make data shapes uniform, and improve model performance. The optimal volume size was investigated by a precise process to ensure that all lesions existed in the slices. To evaluate the model, a hybrid loss function including dice, focal, and cross entropy was used along with 5-fold cross-validation, and lastly a randomly split test dataset was used to evaluate model performance on unseen data for each of the four approaches. Results: Using BRS considerably improved model performance and validation. The last approach, optimal volume with BRS, improved by around 50 percent compared to the approach without BRS, demonstrating how effective BRS has been in BLS. Moreover, a large improvement in energy consumption, decreasing by up to 450 percent, introduces a green solution toward a more environmentally sustainable approach for future work on large datasets.
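
The abstract describes the hybrid loss only as "dice, focal and cross entropy". The sketch below shows one common way to combine the three terms for binary segmentation, assuming flattened lists of predicted foreground probabilities and 0/1 targets. The equal weights, `gamma=2.0`, and the smoothing constants are assumptions, not values from the paper.

```python
import math

def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def cross_entropy_loss(probs, targets, eps=1e-7):
    """Mean binary cross entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss: cross entropy down-weighted on easy pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        pt = p if t == 1 else 1.0 - p  # probability assigned to the true class
        total += -((1.0 - pt) ** gamma) * math.log(pt)
    return total / len(probs)

def hybrid_loss(probs, targets, w_dice=1.0, w_focal=1.0, w_ce=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_dice * dice_loss(probs, targets)
            + w_focal * focal_loss(probs, targets)
            + w_ce * cross_entropy_loss(probs, targets))
```

The Dice term targets region overlap, which matters when lesions occupy few voxels, while the focal term reduces the contribution of the many easy background pixels.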

Incomplete Utterance Rewriting with Editing Operation Guidance and Utterance Augmentation

Zhiyu Cao,Peifeng Li,Yaxin Fan,Qiaoming Zhu

Task: Propose EO-IUR, a multi-task learning framework for improving Incomplete Utterance Rewriting (IUR).

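Motivation: Existing methods generate coherent utterances but often include irrelevant and redundant tokens, and the limited size of the training datasets leads to insufficiently trained models.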

Details

Method: Proposes the EO-IUR framework, which introduces editing operation labels and a token-level heterogeneous graph to represent dialogues, together with a two-dimensional utterance augmentation strategy. Result: Experiments on three datasets show that EO-IUR outperforms existing state-of-the-art baselines in both open-domain and task-oriented dialogue. Conclusion: By introducing editing operation labels and a two-dimensional utterance augmentation strategy, EO-IUR effectively improves incomplete utterance rewriting. Abstract: Although existing popular generation methods for Incomplete Utterance Rewriting (IUR) can generate coherent utterances, they often result in the inclusion of irrelevant and redundant tokens in rewritten utterances due to their inability to focus on critical tokens in the dialogue context. Furthermore, the limited size of the training datasets also contributes to the insufficient training of the IUR model. To address the first issue, we propose a multi-task learning framework EO-IUR (Editing Operation-guided Incomplete Utterance Rewriting) that introduces the editing operation labels generated by a sequence labeling module to guide the generation model to focus on critical tokens. Furthermore, we introduce a token-level heterogeneous graph to represent dialogues. To address the second issue, we propose a two-dimensional utterance augmentation strategy, namely editing operation-based incomplete utterance augmentation and LLM-based historical utterance augmentation. The experimental results on three datasets demonstrate that our EO-IUR outperforms previous state-of-the-art (SOTA) baselines in both open-domain and task-oriented dialogue. The code will be available at https://github.com/Dewset/EO-IUR.

SPNeRF: Open Vocabulary 3D Neural Scene Segmentation with Superpoints

Weiwen Hu,Niccolò Parodi,Marcus Zepp,Ingo Feldmann,Oliver Schreer,Peter Eisert

Task: Propose SPNeRF, a NeRF-based zero-shot 3D segmentation method that uses geometric priors for 3D scene segmentation.

Motivation: Extend open-vocabulary segmentation to 3D scenes and address the lack of geometric detail in CLIP image embeddings.

Details

Method: Integrates geometric primitives into NeRF training to produce primitive-wise CLIP features, and proposes a primitive-based merging mechanism. Result: Achieves notable improvements over the original LERF without relying on additional segmentation models. Conclusion: SPNeRF makes effective use of CLIP for 3D segmentation while avoiding redundancy and the loss of CLIP's general language capabilities. Abstract: Open-vocabulary segmentation, powered by large visual-language models like CLIP, has expanded 2D segmentation capabilities beyond fixed classes predefined by the dataset, enabling zero-shot understanding across diverse scenes. Extending these capabilities to 3D segmentation introduces challenges, as CLIP's image-based embeddings often lack the geometric detail necessary for 3D scene segmentation. Recent methods tend to address this by introducing additional segmentation models or replacing CLIP with variations trained on segmentation data, which leads to redundancy or loss of CLIP's general language capabilities. To overcome this limitation, we introduce SPNeRF, a NeRF-based zero-shot 3D segmentation approach that leverages geometric priors. We integrate geometric primitives derived from the 3D scene into NeRF training to produce primitive-wise CLIP features, avoiding the ambiguity of point-wise features. Additionally, we propose a primitive-based merging mechanism enhanced with affinity scores. Without relying on additional segmentation models, our method further explores CLIP's capability for 3D segmentation and achieves notable improvements over the original LERF.

Meta-Learning Neural Mechanisms rather than Bayesian Priors

Michael Goodale,Salvador Mascarenhas,Yair Lakretz

Task: Study the role of meta-learning in formal language learning and what it imparts to the model.

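Motivation: Examine how meta-learning integrates human-like learning biases into neural-network architectures, combining the structured generalizations of symbolic models with the scalability of neural-network models.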

Details

Method: Trains models via meta-learning and analyzes their behavior on formal language learning. Result: Meta-trained models do not learn simplicity-based priors. Instead, meta-training imprints neural mechanisms (such as counters) into the model, which act as cognitive primitives on downstream tasks. Conclusion: Meta-training on a single formal language can match meta-training on 5000 different formal languages, provided that the language incentivizes learning useful neural mechanisms. The findings have practical implications for efficient meta-learning paradigms and offer new theoretical insights into linking symbolic theories and neural mechanisms. Abstract: Children acquire language despite being exposed to several orders of magnitude less data than large language models require. Meta-learning has been proposed as a way to integrate human-like learning biases into neural-network architectures, combining the structured generalizations of symbolic models with the scalability of neural-network models. But what exactly does meta-learning imbue the model with? We investigate the meta-learning of formal languages and find that, contrary to previous claims, meta-trained models are not learning simplicity-based priors when meta-trained on datasets organised around simplicity. Rather, we find evidence that meta-training imprints neural mechanisms (such as counters) into the model, which function like cognitive primitives for the network on downstream tasks. Most surprisingly, we find that meta-training on a single formal language can provide as much improvement to a model as meta-training on 5000 different formal languages, provided that the formal language incentivizes the learning of useful neural mechanisms. Taken together, our findings provide practical implications for efficient meta-learning paradigms and new theoretical insights into linking symbolic theories and neural mechanisms.

Graph-Weighted Contrastive Learning for Semi-Supervised Hyperspectral Image Classification

Yuqing Zhang,Qi Han,Ligeng Wang,Kai Cheng,Bo Wang,Kun Zhan

Task: Propose a new graph-weighted contrastive learning method for hyperspectral image classification.

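Motivation: Existing graph-based semi-supervised hyperspectral image classification methods rely on superpixel partitioning, but inaccurate superpixel boundaries cause some pixels to be misclassified and limit overall classification performance.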

Details

Method: Proposes an approach that avoids superpixel partitioning, learns hyperspectral image representations directly with neural networks, and supports mini-batch training by processing a subset of nodes at a time. Result: Experiments on three widely used datasets show that the proposed approach is more effective than baselines relying on superpixel partitioning. Conclusion: The proposed graph-weighted contrastive learning approach avoids the limitations of superpixel partitioning, improves classification performance, and reduces computational complexity. Abstract: Most existing graph-based semi-supervised hyperspectral image classification methods rely on superpixel partitioning techniques. However, they suffer from misclassification of certain pixels due to inaccuracies in superpixel boundaries, i.e., the initial inaccuracies in superpixel partitioning limit overall classification performance. In this paper, we propose a novel graph-weighted contrastive learning approach that avoids the use of superpixel partitioning and directly employs neural networks to learn hyperspectral image representation. Furthermore, while many approaches require all graph nodes to be available during training, our approach supports mini-batch training by processing only a subset of nodes at a time, reducing computational complexity and improving generalization to unseen nodes. Experimental results on three widely-used datasets demonstrate the effectiveness of the proposed approach compared to baselines relying on superpixel partitioning.

Two-stage Incomplete Utterance Rewriting on Editing Operation

Zhiyu Cao,Peifeng Li,Qiaoming Zhu,Yaxin Fan

Task: Propose TEO (Two-stage approach on Editing Operation), a new framework for Incomplete Utterance Rewriting (IUR).

Motivation: Address the fact that existing methods ignore coreference and ellipsis in dialogues.

Details

Method: TEO has two stages: the first generates editing operations, and the second rewrites the incomplete utterance using the generated editing operations and the dialogue context. An adversarial perturbation strategy is also proposed to reduce cascading errors and exposure bias caused by the inconsistency between training and inference. Result: Experiments on three IUR datasets show that TEO significantly outperforms existing state-of-the-art models. Conclusion: TEO performs well on incomplete utterance rewriting and handles coreference and ellipsis in dialogues effectively. Abstract: Previous work on Incomplete Utterance Rewriting (IUR) has primarily focused on generating rewritten utterances based solely on dialogue context, ignoring the widespread phenomenon of coreference and ellipsis in dialogues. To address this issue, we propose a novel framework called TEO (Two-stage approach on Editing Operation) for IUR, in which the first stage generates editing operations and the second stage rewrites incomplete utterances utilizing the generated editing operations and the dialogue context. Furthermore, an adversarial perturbation strategy is proposed to mitigate cascading errors and exposure bias caused by the inconsistency between training and inference in the second stage. Experimental results on three IUR datasets show that our TEO outperforms the SOTA models significantly.

Sarosij Bose,Arindam Dutta,Sayak Nag,Junge Zhang,Jiachen Li,Konstantinos Karydis,Amit K. Roy Chowdhury

Task: Reconstruct a 3D scene from a single image.

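Motivation: When rendering a scene from novel views, existing single image-to-3D reconstruction methods tend to produce incoherent and blurry views, especially in regions outside the input camera's view.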

Details

Method: Uses a pre-trained latent video diffusion model as a generative prior to iteratively refine a coarse scene represented by optimizable Gaussian parameters. On-the-fly Fourier-style transfer keeps the style and texture of generated images consistent with the input image. A semantic uncertainty quantification module computes per-pixel entropy and produces uncertainty maps that guide refinement starting from the most confident pixels. Result: Extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTI-v2, show more realistic and higher-fidelity novel view synthesis than existing state-of-the-art methods. Conclusion: By introducing a generative prior and an uncertainty quantification module, the method markedly improves the quality of single image-to-3D scene reconstruction, especially for novel view synthesis. Abstract: Reconstructing 3D scenes from a single image is a fundamentally ill-posed task due to the severely under-constrained nature of the problem. Consequently, when the scene is rendered from novel camera views, existing single image to 3D reconstruction methods render incoherent and blurry views. This problem is exacerbated when the unseen regions are far away from the input camera. In this work, we address these inherent limitations in existing single image-to-3D scene feedforward networks. To alleviate the poor performance due to insufficient information beyond the input image's view, we leverage a strong generative prior in the form of a pre-trained latent video diffusion model, for iterative refinement of a coarse scene represented by optimizable Gaussian parameters. To ensure that the style and texture of the generated images align with that of the input image, we incorporate on-the-fly Fourier-style transfer between the generated images and the input image. Additionally, we design a semantic uncertainty quantification module that calculates the per-pixel entropy and yields uncertainty maps used to guide the refinement process from the most confident pixels while discarding the remaining highly uncertain ones. We conduct extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTI-v2, showing that our approach can provide more realistic and high-fidelity novel view synthesis results compared to existing state-of-the-art methods.
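
The per-pixel entropy step can be illustrated with a small sketch. It assumes each pixel already carries a probability distribution over semantic classes. How the paper obtains those distributions is not stated in the abstract, and the `max_entropy` cutoff and function names below are hypothetical.

```python
import math

def pixel_entropy(class_probs):
    """Shannon entropy (in nats) of one pixel's class distribution."""
    return -sum(p * math.log(p) for p in class_probs if p > 0)

def uncertainty_map(prob_image):
    """Map an H x W grid of class distributions to an H x W grid of entropies."""
    return [[pixel_entropy(px) for px in row] for row in prob_image]

def confident_mask(prob_image, max_entropy):
    """True where a pixel is confident enough to guide refinement.

    max_entropy is an illustrative threshold, not a value from the paper.
    """
    return [[h <= max_entropy for h in row] for row in uncertainty_map(prob_image)]
```

A mask of this kind would let a refinement loss use only low-entropy pixels and discard the highly uncertain ones, which matches the behavior the abstract describes.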

Tuning LLMs by RAG Principles: Towards LLM-native Memory

Jiale Wei,Shuchi Wu,Ruochen Liu,Xiang Ying,Jingbo Shang,Fangbo Tao

Task: Compare long-context LLMs and retrieval-augmented generation (RAG) for incorporating memory, and propose a new method, RAG-Tuned-LLM.

Motivation: Memory is crucial to real-world applications of large language models (LLMs), such as personal assistants.

Details

Method: Proposes RAG-Tuned-LLM, which fine-tunes a relatively small LLM on data generated following RAG principles. Result: Extensive experiments on three datasets show that RAG-Tuned-LLM outperforms long-context LLMs and RAG methods across a wide range of query types. Conclusion: RAG-Tuned-LLM combines the advantages of long-context LLMs and RAG methods and performs better than either. Abstract: Memory, additional information beyond the training of large language models (LLMs), is crucial to various real-world applications, such as personal assistants. The two mainstream solutions to incorporate memory into the generation process are long-context LLMs and retrieval-augmented generation (RAG). In this paper, we first systematically compare these two types of solutions on three renovated/new datasets and show that (1) long-context solutions, although more expensive, more easily capture the big picture and better answer queries which require considering the memory as a whole; and (2) when the queries concern specific information, RAG solutions are more competitive, especially when the keywords can be explicitly matched. Therefore, we propose a novel method, RAG-Tuned-LLM, which fine-tunes a relatively small (e.g., 7B) LLM using the data generated following the RAG principles, so it can combine the advantages of both solutions. Extensive experiments on three datasets demonstrate that RAG-Tuned-LLM can beat long-context LLMs and RAG methods across a wide range of query types.

GraPLUS: Graph-based Placement Using Semantics for Image Composition

Mir Mohammad Khaleghi,Mehran Safayani,Abdolreza Mirzaei

Task: Propose GraPLUS, a new framework based on graph structure and semantic understanding for plausible object placement in images.

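Motivation: Determine contextually appropriate object positions by combining graph-structured scene representation with semantic understanding.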

Details

Method: Uses GPT-2 to transform categorical node and edge labels into rich semantic embeddings that capture definitional characteristics and typical spatial contexts, processes scene semantics with edge-aware graph neural networks, and aligns categorical embeddings with enhanced scene features through a cross-modal attention mechanism. Result: Achieves a placement accuracy of 92.1% and an FID score of 28.83 on the OPA dataset, outperforming existing methods by 8.1%. In a human evaluation study, the method was preferred in 52.1% of cases. Conclusion: By combining pre-trained scene graph models, edge-aware graph neural networks, a cross-modal attention mechanism, and a multi-objective training strategy, GraPLUS markedly improves the accuracy and visual quality of object placement. Abstract: We present GraPLUS (Graph-based Placement Using Semantics), a novel framework for plausible object placement in images that leverages scene graphs and large language models. Our approach uniquely combines graph-structured scene representation with semantic understanding to determine contextually appropriate object positions. The framework employs GPT-2 to transform categorical node and edge labels into rich semantic embeddings that capture both definitional characteristics and typical spatial contexts, enabling nuanced understanding of object relationships and placement patterns. GraPLUS achieves placement accuracy of 92.1% and an FID score of 28.83 on the OPA dataset, outperforming state-of-the-art methods by 8.1% while maintaining competitive visual quality. In human evaluation studies involving 964 samples assessed by 19 participants, our method was preferred in 52.1% of cases, significantly outperforming previous approaches. The framework's key innovations include: (i) leveraging pre-trained scene graph models that transfer knowledge from other domains, (ii) edge-aware graph neural networks that process scene semantics through structured relationships, (iii) a cross-modal attention mechanism that aligns categorical embeddings with enhanced scene features, and (iv) a multi-objective training strategy incorporating semantic consistency constraints.

Cultural Alignment in Large Language Models Using Soft Prompt Tuning

Reem I. Masoud,Martin Ferianc,Philip Treleaven,Miguel Rodrigues

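Task: Propose a parameter-efficient strategy that combines soft prompt tuning with Differential Evolution to align large language models (LLMs) with cultural dimensions.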

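Motivation: Conventional LLM alignment relies on supervised fine-tuning or reinforcement learning based frameworks, which typically require labeled or preference datasets and involve updating model weights. In social sciences such as cross-cultural studies, however, factor analysis is widely used to uncover latent dimensions or variables in survey data, and the non-differentiable nature of these measurements makes conventional alignment methods infeasible.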

Details

Method: Proposes a parameter-efficient strategy that combines soft prompt tuning (freezing the model parameters while modifying the input prompt embeddings) with Differential Evolution (DE), a black-box optimization method for cases where a differentiable objective is unattainable. Result: The method yields significant improvements in Llama-3-8B-Instruct's cultural dimensions across multiple regions, outperforming both the naive LLM and the in-context learning (ICL) baseline. Conclusion: The method ensures alignment consistency without preference data or model parameter updates, significantly improving efficiency and mitigating overfitting, and effectively bridges computational models with human cultural nuances. Abstract: Large Language Model (LLM) alignment conventionally relies on supervised fine-tuning or reinforcement learning based alignment frameworks. These methods typically require labeled or preference datasets and involve updating model weights to align the LLM with the training objective or reward model. Meanwhile, in social sciences such as cross-cultural studies, factor analysis is widely used to uncover underlying dimensions or latent variables that explain observed patterns in survey data. The non-differentiable nature of these measurements deriving from survey data renders the former alignment methods infeasible for alignment with cultural dimensions. To overcome this, we propose a parameter efficient strategy that combines soft prompt tuning, which freezes the model parameters while modifying the input prompt embeddings, with Differential Evolution (DE), a black-box optimization method for cases where a differentiable objective is unattainable. This strategy ensures alignment consistency without the need for preference data or model parameter updates, significantly enhancing efficiency and mitigating overfitting. Our method demonstrates significant improvements in Llama-3-8B-Instruct's cultural dimensions across multiple regions, outperforming both the Naive LLM and the In-context Learning (ICL) baseline, and effectively bridges computational models with human cultural nuances.
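
To make the black-box optimization step concrete, here is a minimal sketch of the classic DE/rand/1/bin variant of Differential Evolution applied to a flat parameter vector, which stands in for flattened soft prompt embeddings. The paper's actual objective (a cultural-dimension score computed from the frozen LLM's survey answers) is replaced by an arbitrary `objective` callable, and the hyperparameters `F`, `CR`, `pop_size`, and the value bounds are assumptions rather than settings from the paper.

```python
import random

def differential_evolution(objective, dim, pop_size=20, generations=100,
                           F=0.5, CR=0.9, bounds=(-1.0, 1.0), seed=0):
    """Minimize a black-box objective with DE/rand/1/bin.

    `objective` only needs to return a number; no gradients are used,
    which is why DE suits non-differentiable targets.
    """
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < CR or j == j_rand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    v = min(max(v, lo), hi)  # clamp to the assumed bounds
                else:
                    v = pop[i][j]
                trial.append(v)
            s = objective(trial)
            if s <= scores[i]:  # greedy selection keeps the better vector
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]
```

In the setting the abstract describes, each call to `objective` would prepend the candidate embeddings to the frozen model's input, collect its survey responses, and score the distance to the target cultural dimensions. That makes each evaluation expensive, so population size and generation count become the main cost knobs.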

OffsetOPT: Explicit Surface Reconstruction without Normals

Huan Lei

Task: Reconstruct explicit surfaces directly from 3D point clouds, eliminating the need for point normals.

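Motivation: Existing implicit-representation methods typically need high-quality normals for accurate reconstruction, which limits their applicability.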

Details

Method: Proposes OffsetOPT, which has two stages: first, a neural network is trained to predict surface triangles from local point geometry; then the frozen network is applied to unseen point clouds, optimizing a per-point offset to maximize the accuracy of triangle predictions. Result: Compared to existing methods, OffsetOPT excels at reconstructing overall surfaces and also preserves sharp surface features significantly better. Conclusion: OffsetOPT demonstrates its accuracy on several benchmarks and applies to both small-scale shapes and large-scale open surfaces. Abstract: Neural surface reconstruction has been dominated by implicit representations with marching cubes for explicit surface extraction. However, those methods typically require high-quality normals for accurate reconstruction. We propose OffsetOPT, a method that reconstructs explicit surfaces directly from 3D point clouds and eliminates the need for point normals. The approach comprises two stages: first, we train a neural network to predict surface triangles based on local point geometry, given uniformly distributed training point clouds. Next, we apply the frozen network to reconstruct surfaces from unseen point clouds by optimizing a per-point offset to maximize the accuracy of triangle predictions. Compared to state-of-the-art methods, OffsetOPT not only excels at reconstructing overall surfaces but also significantly preserves sharp surface features. We demonstrate its accuracy on popular benchmarks, including small-scale shapes and large-scale open surfaces.

MKG-Rank: Enhancing Large Language Models with Knowledge Graph for Multilingual Medical Question Answering

Feiyang Li,Yingjian Chen,Haoran Liu,Rui Yang,Han Yuan,Yuang Jiang,Tianxiao Li,Edison Marrese Taylor,Hossein Rouhizadeh,Yusuke Iwasawa,Douglas Teodoro,Yutaka Matsuo,Irene Li

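Task: Propose MKG-Rank, a knowledge graph-based framework for multilingual medical question answering that addresses the limitations of English-centric large language models.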

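Motivation: Because of imbalanced multilingual training data and scarce medical resources for low-resource languages, the effectiveness of large language models in medical question answering remains largely limited to English.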

Details

Method: Uses a word-level translation mechanism to efficiently integrate English-centric medical knowledge graphs into LLM reasoning, and introduces caching and multi-angle ranking strategies to optimize retrieval. Result: On multilingual medical QA benchmarks in Chinese, Japanese, Korean, and Swahili, MKG-Rank consistently outperforms zero-shot LLMs, with accuracy gains of up to 33.89% and an average retrieval time of only 0.0009 seconds. Conclusion: MKG-Rank effectively mitigates cross-lingual semantic distortion in medical QA and markedly improves the accuracy and efficiency of multilingual medical question answering. Abstract: Large Language Models (LLMs) have shown remarkable progress in medical question answering (QA), yet their effectiveness remains predominantly limited to English due to imbalanced multilingual training data and scarce medical resources for low-resource languages. To address this critical language gap in medical QA, we propose Multilingual Knowledge Graph-based Retrieval Ranking (MKG-Rank), a knowledge graph-enhanced framework that enables English-centric LLMs to perform multilingual medical QA. Through a word-level translation mechanism, our framework efficiently integrates comprehensive English-centric medical knowledge graphs into LLM reasoning at a low cost, mitigating cross-lingual semantic distortion and achieving precise medical QA across language barriers. To enhance efficiency, we introduce caching and multi-angle ranking strategies to optimize the retrieval process, significantly reducing response times and prioritizing relevant medical knowledge. Extensive evaluations on multilingual medical QA benchmarks across Chinese, Japanese, Korean, and Swahili demonstrate that MKG-Rank consistently outperforms zero-shot LLMs, achieving a maximum 33.89% increase in accuracy, while maintaining an average retrieval time of only 0.0009 seconds.

AutoDrive-QA: Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models

Boshra Khalili,Andrew W. Smyth

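Task: Convert existing driving QA datasets into a structured multiple-choice format to provide a standardized and objective evaluation framework.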

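Motivation: Address the unreliable evaluation of open-ended question answering in autonomous driving, where free-form responses require either complex metrics or subjective human judgment.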

Details

Method: Introduces AutoDrive-QA, an automated pipeline that uses large language models (LLMs) to generate high-quality, contextually relevant distractors based on domain-specific error patterns common in autonomous driving scenarios. Result: The benchmark was tested on three public datasets, with zero-shot experiments on an unseen dataset. GPT-4V leads the zero-shot evaluation with 69.57% accuracy, reaching 74.94% in Perception, 65.33% in Prediction, and 68.45% in Planning. Conclusion: AutoDrive-QA provides a rigorous, unbiased standard for integrating and evaluating different vision-language models, improving generalization in the field. All code is released in the AutoDrive-QA GitHub repository. Abstract: In autonomous driving, open-ended question answering often suffers from unreliable evaluations because freeform responses require either complex metrics or subjective human judgment. To address this challenge, we introduce AutoDrive-QA, an automatic pipeline that converts existing driving QA datasets (including DriveLM, NuScenes-QA, and LingoQA) into a structured multiple-choice question (MCQ) format. This benchmark systematically assesses perception, prediction, and planning tasks, providing a standardized and objective evaluation framework. AutoDrive-QA employs an automated pipeline that leverages large language models (LLMs) to generate high-quality, contextually relevant distractors based on domain-specific error patterns commonly found in autonomous driving scenarios. To evaluate both general capabilities and generalization performance, we test the benchmark on three public datasets and conduct zero-shot experiments on an unseen dataset. The zero-shot evaluations reveal that GPT-4V leads with 69.57% accuracy (74.94% in Perception, 65.33% in Prediction, and 68.45% in Planning), demonstrating that while all models excel in Perception, they struggle in Prediction. Consequently, AutoDrive-QA establishes a rigorous, unbiased standard for integrating and evaluating different vision-language models across various autonomous driving datasets, thereby improving generalization in this field. We release all the code in the AutoDrive-QA GitHub Repository.
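
The conversion from an open-ended QA pair to a multiple-choice item can be sketched as follows. In the paper the distractors come from an LLM prompted with domain-specific error patterns. Here they are simply passed in as a list, and the record layout (`question`, `options`, `correct`) and function names are hypothetical rather than the benchmark's schema.

```python
import random
import string

def build_mcq(question, answer, distractors, seed=0):
    """Shuffle the correct answer among distractors and assign option letters."""
    options = [answer] + list(distractors)
    random.Random(seed).shuffle(options)  # seeded so the item is reproducible
    letters = string.ascii_uppercase[:len(options)]
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "correct": letters[options.index(answer)],
    }

def accuracy(mcqs, predictions):
    """Fraction of items where the predicted letter matches the correct one."""
    hits = sum(1 for q, p in zip(mcqs, predictions) if q["correct"] == p)
    return hits / len(mcqs)
```

Scoring by exact letter match is what removes the need for free-form text metrics or human judgment. Shuffling the answer position avoids a model exploiting a fixed slot.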

Automatically Generating Chinese Homophone Words to Probe Machine Translation Estimation Systems

Shenbin Qian,Constantin Orăsan,Diptesh Kanojia,Félix do Carmo

Task: Evaluate the quality of machine translation (MT) of user-generated content (UGC), in particular whether emotional nuance is preserved.

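Motivation: Whether existing automatic evaluation methods are robust with respect to preserving emotional nuance has been left largely unexplored.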

Details

Method: Proposes a new method inspired by information theory that uses the concept of self-information to generate emotion-related Chinese homophone words. Result: The generated homophones expose vulnerabilities in MT systems and their evaluation methods when handling emotional UGC, and larger large language models (LLMs) show higher stability and robustness. Conclusion: The method achieves higher correlation with human judgments than an existing one, and the data and code are released for further research. Abstract: Evaluating machine translation (MT) of user-generated content (UGC) involves unique challenges such as checking whether the nuance of emotions from the source is preserved in the target text. Recent studies have proposed emotion-related datasets, frameworks and models to automatically evaluate MT quality of Chinese UGC, without relying on reference translations. However, whether these models are robust to the challenge of preserving emotional nuances has been left largely unexplored. To address this gap, we introduce a novel method inspired by information theory which generates challenging Chinese homophone words related to emotions, by leveraging the concept of self-information. Our approach generates homophones that were observed to cause translation errors in emotion preservation, and exposes vulnerabilities in MT systems and their evaluation methods when tackling emotional UGC. We evaluate the efficacy of our method using human evaluation for the quality of these generated homophones, and compare it with an existing one, showing that our method achieves higher correlation with human judgments. The generated Chinese homophones, along with their manual translations, are utilized to generate perturbations and to probe the robustness of existing quality evaluation models, including models trained using multi-task learning, fine-tuned variants of multilingual language models, as well as large language models (LLMs). Our results indicate that LLMs with larger size exhibit higher stability and robustness to such perturbations. We release our data and code for reproducibility and further research.
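
Self-information is the standard quantity I(x) = -log2 p(x): the rarer an event, the more bits it carries. The abstract does not say exactly how the authors apply it to homophone generation, so the sketch below only shows the quantity itself, estimated from frequency counts, plus a ranking helper. The helper, the placeholder candidate names, and the choice to rank by descending self-information are assumptions.

```python
import math

def self_information(token, counts):
    """I(token) = -log2 p(token), with p estimated from raw frequency counts."""
    total = sum(counts.values())
    return -math.log2(counts[token] / total)

def rank_candidates(candidates, counts):
    """Order candidate substitutions from most to least surprising (hypothetical use)."""
    return sorted(candidates, key=lambda c: self_information(c, counts), reverse=True)

# Toy counts for illustration only; not data from the paper.
toy_counts = {"cand_a": 8, "cand_b": 4, "cand_c": 2, "cand_d": 2}
ranking = rank_candidates(["cand_a", "cand_b", "cand_c"], toy_counts)
```

With counts estimated from a real corpus, a ranking of this kind would give one way to prefer homophone substitutions that readers and models are less likely to have seen.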

RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models

Parham Saremi,Amar Kumar,Mohammed Mohammed,Zahra TehraniNasab,Tal Arbel

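Task: Propose a multi-stage architecture that uses a pre-trained vision-language foundation model and a reinforcement learning algorithm to improve fine-grained alignment between image regions and textual descriptions.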

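Motivation: Address the limitations of vision-language foundation models in precise alignment and localization of clinical features in medical imaging.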

Details

Method: A pre-trained vision-language foundation model provides an initial semantic understanding, and a reinforcement learning algorithm iteratively refines the alignment with the semantic context. Result: On a medical imaging skin dataset, the generated images show improved quality and alignment with the prompt, and the synthesized samples can improve disease classifier performance through augmentation. Conclusion: The proposed method effectively improves image generation quality and semantic alignment in medical imaging and helps improve disease classifier performance. Abstract: Vision-Language Foundation Models (VLFM) have shown a tremendous increase in performance in terms of generating high-resolution, photorealistic natural images. While VLFMs show a rich understanding of semantic content across modalities, they often struggle with fine-grained alignment tasks that require precise correspondence between image regions and textual descriptions, a limitation in medical imaging, where accurate localization and detection of clinical features are essential for diagnosis and analysis. To address this issue, we propose a multi-stage architecture where a pre-trained VLFM provides a cursory semantic understanding, while a reinforcement learning (RL) algorithm refines the alignment through an iterative process that optimizes for understanding semantic context. The reward signal is designed to align the semantic information of the text with synthesized images. We demonstrate the effectiveness of our method on a medical imaging skin dataset where the generated images exhibit improved generation quality and alignment with the prompt over the fine-tuned Stable Diffusion. We also show that the synthesized samples could be used to improve disease classifier performance for underrepresented subgroups through augmentation.

Towards Lighter and Robust Evaluation for Retrieval Augmented Generation

Alex-Razvan Ispas,Charles-Elie Simon,Fabien Caspani,Vincent Guigue

Task: Evaluate hallucination in answers generated within the RAG framework.

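Motivation: Using commercial LLMs such as GPT4 to evaluate these algorithms is expensive and not transparent, so a cheaper and more transparent evaluation method is needed.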

Details

Method: Develops a lightweight approach using smaller, quantized LLMs to provide an accessible and interpretable metric for the correctness and faithfulness of generated answers. Result: Proposes a new AUC metric as an alternative to correlation with human judgment. Conclusion: Open-weight models show promise for evaluating RAG hallucination, offering an economical and transparent evaluation method. Abstract: Large Language Models are prompting us to view more NLP tasks from a generative perspective. At the same time, they offer a new way of accessing information, mainly through the RAG framework. While there have been notable improvements for the autoregressive models, overcoming hallucination in the generated answers remains a continuous problem. A standard solution is to use commercial LLMs, such as GPT4, to evaluate these algorithms. However, such frameworks are expensive and not very transparent. Therefore, we propose a study which demonstrates the interest of open-weight models for evaluating RAG hallucination. We develop a lightweight approach using smaller, quantized LLMs to provide an accessible and interpretable metric that gives continuous scores for the generated answers with respect to their correctness and faithfulness. This score allows us to question decisions' reliability and explore thresholds to develop a new AUC metric as an alternative to correlation with human judgment.
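
The abstract does not define its new AUC metric, so the following is only a sketch of the standard ROC AUC that threshold-sweeping over a continuous judge score naturally leads to. It is computed here through the pairwise (Mann-Whitney) formulation. The assumption is that each generated answer has a continuous score from the judge and a binary human label (1 for correct or faithful, 0 otherwise).

```python
def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This equals the area under the ROC curve
    obtained by sweeping a decision threshold over the scores.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Unlike a correlation coefficient, this summary does not depend on the scale of the judge's scores, only on how well they rank good answers above bad ones.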

Frequency Enhancement for Image Demosaicking

Jingyun Liu,Daiqin Yang,Zhenzhong Chen

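Task: Propose a frequency-enhancement approach to image demosaicking that recovers high-frequency textures.

Motivation: Existing spatial learning methods show limited ability to recover high-frequency textures, so a new approach is needed to improve performance.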

Details

Method: Propose the Dual-path Frequency Enhancement Network (DFENet), which reconstructs RGB images in a divide-and-conquer manner through Fourier-domain frequency selection. Result: DFENet outperforms other state-of-the-art algorithms on different datasets and shows significant advantages on hard cases. Conclusion: Through frequency enhancement and multi-level frequency supervision, DFENet significantly improves demosaicking performance, and a new dataset, LineSet37, is contributed for evaluating algorithms on challenging cases. Abstract: Recovering high-frequency textures in image demosaicking remains a challenging issue. While existing methods have introduced elaborate spatial learning methods, they still exhibit limited performance. To address this issue, a frequency enhancement approach is proposed. Based on the frequency analysis of color filter array (CFA)/demosaicked/ground truth images, we propose Dual-path Frequency Enhancement Network (DFENet), which reconstructs RGB images in a divide-and-conquer manner through Fourier-domain frequency selection. In DFENet, two frequency selectors are employed, each selecting a set of frequency components for processing along separate paths. One path focuses on generating missing information through detail refinement in the spatial domain, while the other aims at suppressing undesirable frequencies with the guidance of CFA images in the frequency domain. Multi-level frequency supervision with a stagewise training strategy is employed to further improve the reconstruction performance. With these designs, the proposed DFENet outperforms other state-of-the-art algorithms on different datasets and demonstrates significant advantages on hard cases. Moreover, to better assess algorithms' ability to reconstruct high-frequency textures, a new dataset, LineSet37, is contributed, which consists of 37 artificially designed and generated images. These images feature complex line patterns and are prone to severe visual artifacts like color moiré after demosaicking. Experiments on LineSet37 offer a more targeted evaluation of performance on challenging cases. The code and dataset are available at https://github.com/VelvetReverie/DFENet-demosaicking.
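
To illustrate what Fourier-domain frequency selection means in general, here is a minimal sketch that splits a 1D signal into two complementary frequency bands that could then be processed along separate paths. It uses a 1D signal as a stand-in for an image and a fixed cutoff mask; both are assumptions for illustration, and DFENet's actual selectors are not specified in the abstract. The helper names are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def split_by_frequency(x, cutoff):
    """Split a real signal into low- and high-frequency parts.

    Bin k has frequency index min(k, n - k); bins with index <= cutoff
    go to the low path, all other bins go to the high path.
    """
    n = len(x)
    spectrum = dft(x)
    low_mask = [1.0 if min(k, n - k) <= cutoff else 0.0 for k in range(n)]
    low = idft([spectrum[k] * low_mask[k] for k in range(n)])
    high = idft([spectrum[k] * (1.0 - low_mask[k]) for k in range(n)])
    # The masks are symmetric, so both parts are real up to rounding error.
    return [v.real for v in low], [v.real for v in high]


signal = [1.0, 3.0, 1.0, 3.0, 2.0, 4.0, 2.0, 4.0]
low, high = split_by_frequency(signal, cutoff=1)
```

The two masks are complementary, so adding the two parts recovers the original signal; a two-path design can therefore treat each band differently without losing information at the split.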

SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs

Shibo Jie,Yehui Tang,Kai Han,Zhi-Hong Deng,Jing Han

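Task: Propose SpeCache, which offloads the complete KV cache to CPU memory and dynamically fetches important KV pairs at each decoding step to reduce GPU memory usage.

Motivation: Existing KV cache compression methods lose information, which can hurt the accuracy of subsequent decoding; SpeCache is proposed to address this.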

Details

Method: Offload the complete KV cache to CPU memory and dynamically fetch important KV pairs at each decoding step. By speculatively predicting the KV pairs the next token might attend to, prefetching and computation run in parallel. Result: On the LongBench and Needle-in-a-Haystack benchmarks, SpeCache effectively reduces VRAM usage while avoiding information loss on long sequences, maintaining accuracy even at high KV cache compression ratios. Conclusion: Without re-training, SpeCache effectively reduces GPU memory usage while avoiding information loss, making it suitable for long-sequence tasks. Abstract: Transformer-based large language models (LLMs) have already achieved remarkable results on long-text tasks, but the limited GPU memory (VRAM) resources struggle to accommodate the linearly growing demand for key-value (KV) cache as the sequence length increases, which has become a bottleneck for the application of LLMs on long sequences. Existing KV cache compression methods include eviction, merging, or quantization of the KV cache to reduce its size. However, compression results in irreversible information forgetting, potentially affecting the accuracy of subsequent decoding. In this paper, we propose SpeCache, which takes full advantage of the large and easily expandable CPU memory to offload the complete KV cache, and dynamically fetches KV pairs back in each decoding step based on their importance measured by a low-bit KV cache copy in VRAM. To avoid inference latency caused by CPU-GPU communication, SpeCache speculatively predicts the KV pairs that the next token might attend to, allowing us to prefetch them before the next decoding step, which enables parallelization of prefetching and computation. Experiments on LongBench and Needle-in-a-Haystack benchmarks verify that SpeCache effectively reduces VRAM usage while avoiding information forgetting for long sequences without re-training, even with a high 10x KV cache compression ratio.
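
The fetch step described above can be sketched as follows: a low-bit copy of the keys ranks cached positions by approximate attention score, and only the top-k full-precision pairs are fetched from the offloaded store. This is a toy illustration under stated assumptions: plain Python lists stand in for CPU and GPU memory, the per-vector uniform quantizer is a placeholder rather than the paper's scheme, and the speculative prefetching of the next step is not modeled. All names are hypothetical.

```python
def quantize(vec, levels=4):
    """Uniformly quantize a vector to `levels` values (a crude low-bit copy)."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return [lo] * len(vec)
    step = (hi - lo) / (levels - 1)
    return [lo + round((v - lo) / step) * step for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_top_k(query, low_bit_keys, k):
    """Rank cached positions by approximate attention logits; return top-k indices."""
    scores = [dot(query, key) for key in low_bit_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Full-precision KV pairs: stand-in for the copy offloaded to CPU memory.
cpu_kv = [
    ([1.0, 0.1, 0.0, 0.2], "v0"),
    ([0.0, 1.0, 0.3, 0.1], "v1"),
    ([0.9, -0.2, 0.1, 0.0], "v2"),
    ([-1.0, 0.0, 0.5, 0.2], "v3"),
]
# Low-bit key copy: stand-in for what stays in VRAM.
vram_low_bit_keys = [quantize(key) for key, _ in cpu_kv]

query = [1.0, 0.0, 0.0, 0.0]
top = select_top_k(query, vram_low_bit_keys, k=2)
fetched = [cpu_kv[i] for i in top]  # only these pairs travel back to the GPU
```

Nothing is discarded in this arrangement: the full-precision pairs stay available in the offloaded store, and the low-bit copy is used only to decide which ones to fetch.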

A Vision Centric Remote Sensing Benchmark

Abduljaleel Adejumo,Faegheh Yeganli,Clifford Broni-bediako,Aoran Xiao,Naoto Yokoya,Mennatullah Siam

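Task: Study the limitations of multimodal large language models (MLLMs) on remote sensing (RS) tasks and propose a remote sensing multimodal visual patterns (RSMMVP) benchmark to evaluate them.

Motivation: MLLMs have achieved remarkable success on natural-image tasks but remain relatively underexplored for RS imagery. RS imagery poses unique challenges, particularly in visual grounding and spatial reasoning, that existing MLLMs struggle to handle.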

Details

Method: Introduce the RSMMVP benchmark to evaluate CLIP-based MLLMs on RS tasks, in particular by identifying CLIP-blind pairs, where CLIP-based models incorrectly assign high similarity scores to visually distinct RS images. Result: A visual question answering (VQA) evaluation reveals significant limitations of existing MLLMs in RS-specific representation learning. Conclusion: The results offer valuable insights into the weaknesses of CLIP-based visual encoding and lay a foundation for developing more effective MLLMs suited to RS applications. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, but their remote sensing (RS) counterparts are relatively underexplored. Unlike natural images, RS imagery presents unique challenges that current MLLMs struggle to handle, particularly in visual grounding and spatial reasoning. This study investigates the limitations of CLIP-based MLLMs in RS, highlighting their failure to differentiate visually distinct yet semantically similar RS images. To address this, we introduce a remote sensing multimodal visual patterns (RSMMVP) benchmark. It is designed to evaluate MLLMs in RS tasks by identifying the CLIP-blind pairs, where CLIP-based models incorrectly assign high similarity scores to visually distinct RS images. Through a visual question answering (VQA) evaluation, we analyze the performance of state-of-the-art MLLMs, revealing significant limitations in RS-specific representation learning. The results provide valuable insights into the weaknesses of CLIP-based visual encoding and offer a foundation for future research to develop more effective MLLMs tailored for remote sensing applications.
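
One way to make the notion of a CLIP-blind pair concrete is sketched below: a pair is flagged when CLIP embeddings are nearly identical while a second, reference embedding says the images differ. The abstract does not state how "visually distinct" is decided, so the reference encoder, both thresholds, and the toy embeddings are assumptions made purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def clip_blind_pairs(clip_emb, ref_emb, clip_min=0.95, ref_max=0.6):
    """Return index pairs that look alike to CLIP but distinct to the reference."""
    pairs = []
    n = len(clip_emb)
    for i in range(n):
        for j in range(i + 1, n):
            alike_for_clip = cosine(clip_emb[i], clip_emb[j]) >= clip_min
            distinct_for_ref = cosine(ref_emb[i], ref_emb[j]) <= ref_max
            if alike_for_clip and distinct_for_ref:
                pairs.append((i, j))
    return pairs


# Toy 2D embeddings for three images (hypothetical values).
clip_emb = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
ref_emb = [[1.0, 0.0], [0.2, 1.0], [0.0, 1.0]]
blind = clip_blind_pairs(clip_emb, ref_emb)
```

Images 0 and 1 are flagged because CLIP treats them as near-duplicates while the reference embedding separates them; pairs like this are the ones a benchmark would turn into questions.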

MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion

Qizhi Pei,Lijun Wu,Zhuoshi Pan,Yu Li,Honglin Lin,Chenlin Ming,Xin Gao,Conghui He,Rui Yan

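Task: Enhance mathematical reasoning through cross-problem instruction synthesis.

Motivation: Current data augmentation methods are mostly limited to instance-level modifications and cannot capture or exploit the intrinsic relational structures in mathematical knowledge.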

Details

Method: Propose the MathFusion framework, which uses three fusion strategies (sequential, parallel, and conditional fusion) to generate a new dataset, MathFusionQA, on which models are fine-tuned. Result: MathFusion achieves substantial improvements in mathematical reasoning, raising accuracy by 18.0 points while maintaining high data efficiency. Conclusion: Through cross-problem instruction synthesis, MathFusion significantly improves mathematical reasoning and outperforms traditional single-instruction approaches. Abstract: Large Language Models (LLMs) have shown impressive progress in mathematical reasoning. While data augmentation is promising to enhance mathematical problem-solving ability, current approaches are predominantly limited to instance-level modifications, such as rephrasing or generating syntactic variations, which fail to capture and leverage the intrinsic relational structures inherent in mathematical knowledge. Inspired by human learning processes, where mathematical proficiency develops through systematic exposure to interconnected concepts, we introduce MathFusion, a novel framework that enhances mathematical reasoning through cross-problem instruction synthesis. MathFusion implements this through three fusion strategies: (1) sequential fusion, which chains related problems to model solution dependencies; (2) parallel fusion, which combines analogous problems to reinforce conceptual understanding; and (3) conditional fusion, which creates context-aware selective problems to enhance reasoning flexibility. By applying these strategies, we generate a new dataset, MathFusionQA, followed by fine-tuning models (DeepSeekMath-7B, Mistral-7B, Llama3-8B) on it. Experimental results demonstrate that MathFusion achieves substantial improvements in mathematical reasoning while maintaining high data efficiency, boosting performance by 18.0 points in accuracy across diverse benchmarks while requiring only 45K additional synthetic instructions, representing a substantial improvement over traditional single-instruction approaches. Our datasets, models, and code are publicly available at https://github.com/QizhiPei/mathfusion.

Computation-Efficient and Recognition-Friendly 3D Point Cloud Privacy Protection

Haotian Ma,Lin Gu,Siyi Wu,Yingying Zhu

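Task: Define the 3D point cloud privacy problem and propose an efficient privacy-preserving framework, PointFlowGMM, that supports downstream classification and segmentation tasks.

Motivation: 3D point clouds are widely used in applications such as self-driving cars, robotics, and CAD models, but their privacy leakage problem has not been studied well.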

Details

Method: Use a flow-based generative model to project the point cloud into a latent subspace with a Gaussian mixture distribution, and design a novel angular similarity loss to obfuscate the original geometric structure. Result: Model size is reduced from 767MB to 120MB, and recognition results on encrypted point clouds are comparable to those on the original point clouds. Conclusion: The proposed PointFlowGMM framework protects 3D point cloud privacy while supporting downstream recognition tasks without hurting recognition performance. Abstract: 3D point clouds have been widely used in applications such as self-driving cars, robotics, and CAD models. To the best of our knowledge, these applications raise the issue of privacy leakage in 3D point clouds, which has not been studied well. Different from 2D image privacy, which is related to texture and 2D geometric structure, the 3D point cloud is texture-less and only relevant to 3D geometric structure. In this work, we define the 3D point cloud privacy problem and propose an efficient privacy-preserving framework named PointFlowGMM that can support downstream classification and segmentation tasks without seeing the original data. Using a flow-based generative model, the point cloud is projected into a latent Gaussian mixture distributed subspace. We further design a novel angular similarity loss to obfuscate the original geometric structure and reduce the model size from 767MB to 120MB without a decrease in recognition performance. The projected point cloud in the latent space is orthogonally rotated randomly to further protect the original geometric structure; the class-to-class relationship is preserved after rotation, so the protected point cloud can support the recognition task. We evaluated our model on multiple datasets and achieved comparable recognition results on encrypted point clouds compared to the original point clouds.
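
The claim that relationships survive a random orthogonal rotation follows from a basic property: orthogonal maps preserve distances and inner products. The sketch below checks this numerically on toy latent vectors. It illustrates only that property, not PointFlowGMM itself; the Gram-Schmidt construction, the seed, and the sample vectors are assumptions, and the resulting matrix may include a reflection.

```python
import math
import random


def random_orthogonal(dim, rng):
    """Build a random orthogonal matrix by Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def transform(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)
Q = random_orthogonal(3, rng)

# Toy latent codes (hypothetical values).
latents = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0], [4.0, 0.0, -2.0]]
rotated = [transform(Q, z) for z in latents]
```

Individual coordinates change under `Q`, but every pairwise distance between latent codes is unchanged, which is why a classifier that relies on relative geometry can still operate on the rotated data.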

Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning

Zhaowei Liu,Xin Guo,Fangqi Lou,Lingfeng Zeng,Jinyi Niu,Zixuan Wang,Jiajie Xu,Weige Cai,Ziwei Yang,Xueqian Zhao,Chao Li,Sheng Xu,Dezhi Chen,Yun Chen,Zuo Bai,Liwen Zhang

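Task: Design and evaluate Fin-R1, a reasoning large language model built specifically for the financial domain.

Motivation: Explore the capabilities of large language models in handling complex financial tasks.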

Details

Method: Adopt a two-stage architecture: build a financial reasoning dataset based on DeepSeek-R1, then train through supervised fine-tuning (SFT) and reinforcement learning (RL). Result: At 7B parameters, Fin-R1 achieves state-of-the-art performance on FinQA and ConvFinQA and surpasses larger models on other tasks. Conclusion: Fin-R1 shows strong reasoning and decision-making capabilities and can address a variety of problems in the financial domain. Abstract: Reasoning large language models are rapidly evolving across various domains. However, their capabilities in handling complex financial tasks still require in-depth exploration. In this paper, we introduce Fin-R1, a reasoning large language model specifically designed for the financial sector. Fin-R1 is built using a two-stage architecture, leveraging a financial reasoning dataset distilled and processed based on DeepSeek-R1. Through supervised fine-tuning (SFT) and reinforcement learning (RL) training, it demonstrates performance close to DeepSeek-R1 with a parameter size of 7 billion across a range of financial reasoning tasks. It achieves the state-of-the-art (SOTA) in the FinQA and ConvFinQA tasks among the LLMs in our evaluation, surpassing larger models in other tasks as well. Fin-R1 showcases strong reasoning and decision-making capabilities, providing solutions to various problems encountered in the financial domain. Our code is available at https://github.com/SUFE-AIFLM-Lab/Fin-R1.

EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation

Zihao Zhang,Haoran Chen,Haoyu Zhao,Guansong Lu,Yanwei Fu,Hang Xu,Zuxuan Wu

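Task: Propose EDEN, an enhanced diffusion method for high-quality large-motion video frame interpolation.

Motivation: Handling complex or nonlinear motion patterns has long been a challenge for video frame interpolation, and existing diffusion-based methods still struggle to generate sharp, temporally consistent frames in large-motion scenarios.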

Details

Method: Use a transformer-based tokenizer to produce refined latent representations of the intermediate frames, and enhance the diffusion transformer with temporal attention and a start-end frame difference embedding. Result: State-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM and an 8% improvement on DAIN-HD. Conclusion: EDEN performs well on large-motion video frame interpolation, significantly improving the quality and temporal consistency of generated frames. Abstract: Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first utilizes a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention across the process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD.

LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates

Ying Shen,Lifu Huang

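Task: Propose LLMBRACES, a new method that enhances and controls the performance and behavior of Transformer-based large language models by dynamically adjusting the contributions of sub-updates in FFN layers.

Motivation: Findings show that much of the knowledge in Transformer-based large language models is encoded in FFN layers, and adjusting the contributions of these sub-updates can further improve model performance.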

Details

Method: LLMBRACES computes relevance scores for value vectors in FFN layers and uses them to dynamically adjust the contribution of sub-updates. Result: Experiments show that LLMBRACES outperforms baselines in both fine-tuning and zero-shot settings while requiring significantly fewer tunable parameters. Conclusion: LLMBRACES excels in sentiment-controlled generation and toxicity reduction, showing its potential for flexible control of text generation. Abstract: Recent findings reveal that much of the knowledge in a Transformer-based Large Language Model (LLM) is encoded in its feed-forward (FFN) layers, where each FFN layer can be interpreted as the summation of sub-updates, each corresponding to a weighted column vector from the FFN's value parameter matrix that often encodes human-interpretable concepts. In light of this, we hypothesize that model performance and behaviors can be further enhanced and controlled by modulating the contributions of these sub-updates based on their relevance to the input or target output style, and propose LLMBRACES, a novel and efficient method that computes relevance scores associated with value vectors in FFN layers and leverages these scores to dynamically adjust the contribution of sub-updates. By optimizing sub-update contributions, LLMBRACES refines the prediction process, leading to more accurate and reliable outputs, much like a 'brace' providing support and stability. Moreover, LLMBRACES can be extended to support conditional control over generation characteristics, such as sentiment, thereby offering fine-grained steering of LLM outputs. Extensive experiments on various LLMs (including Qwen2.5-1.5B, Llama2-7B, and Llama3-8B) demonstrate that LLMBRACES outperforms baseline approaches in both fine-tuning and zero-shot settings while requiring significantly fewer tunable parameters, up to 75% fewer compared to LoRA. Furthermore, LLMBRACES excels in sentiment-controlled generation and toxicity reduction, highlighting its potential for flexible, controlled text generation across applications.
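
The view of an FFN layer as a sum of sub-updates can be written out directly: each hidden unit contributes its activation times its value vector, and a relevance score can rescale each contribution. The sketch below is a toy illustration under stated assumptions: a ReLU activation, hand-picked weights, and relevance scores supplied by hand. How LLMBRACES actually computes relevance is not given in the abstract, so that part is not modeled.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ffn_sub_updates(x, keys, values):
    """Return the sub-updates a_i * v_i, where a_i = relu(k_i . x)."""
    return [[relu(dot(k, x)) * c for c in v] for k, v in zip(keys, values)]


def combine(sub_updates, relevance=None):
    """Sum the sub-updates, optionally scaling each by a relevance score."""
    if relevance is None:
        relevance = [1.0] * len(sub_updates)
    dim = len(sub_updates[0])
    return [sum(r * u[d] for r, u in zip(relevance, sub_updates))
            for d in range(dim)]


x = [1.0, 2.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # one key row per hidden unit
values = [[1.0, 1.0], [0.5, -1.0], [3.0, 3.0]]  # one value vector per hidden unit

subs = ffn_sub_updates(x, keys, values)
plain = combine(subs)                                # ordinary FFN output
steered = combine(subs, relevance=[1.0, 0.0, 1.0])   # second sub-update muted
```

With all relevance scores equal to one the result is the ordinary FFN output, so modulation of this kind is a strict generalization of the unmodified layer.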

BARD-GS: Blur-Aware Reconstruction of Dynamic Scenes via Gaussian Splatting

Yiren Lu,Yunlai Zhou,Disheng Liu,Tuo Liang,Yu Yin

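Task: Propose BARD-GS, a new method that effectively handles blurry inputs and imprecise camera poses in dynamic scene reconstruction.

Motivation: Existing methods cannot adequately handle image blur in dynamic scenes captured with handheld monocular cameras, which degrades reconstruction quality.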

Details

Method: BARD-GS comprises two main components: 1) camera motion deblurring and 2) object motion deblurring. Decomposing motion blur into camera motion blur and object motion blur and modeling them separately significantly improves rendering in dynamic regions. Result: Experiments show that BARD-GS effectively reconstructs high-quality dynamic scenes under realistic conditions, significantly outperforming existing methods. Conclusion: BARD-GS performs well on dynamic scene reconstruction with blurry inputs and imprecise camera poses and has significant application potential. Abstract: 3D Gaussian Splatting (3DGS) has shown remarkable potential for static scene reconstruction, and recent advancements have extended its application to dynamic scenes. However, the quality of reconstructions depends heavily on high-quality input images and precise camera poses, which are not trivial to obtain in real-world scenarios. Capturing dynamic scenes with handheld monocular cameras, for instance, typically involves simultaneous movement of both the camera and objects within a single exposure. This combined motion frequently results in image blur that existing methods cannot adequately handle. To address these challenges, we introduce BARD-GS, a novel approach for robust dynamic scene reconstruction that effectively handles blurry inputs and imprecise camera poses. Our method comprises two main components: 1) camera motion deblurring and 2) object motion deblurring. By explicitly decomposing motion blur into camera motion blur and object motion blur and modeling them separately, we achieve significantly improved rendering results in dynamic regions. In addition, we collect a real-world motion blur dataset of dynamic scenes to evaluate our approach. Extensive experiments demonstrate that BARD-GS effectively reconstructs high-quality dynamic scenes under realistic conditions, significantly outperforming existing methods.

CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners

Yunzhi Yao,Jizhan Fang,Jia-Chen Gu,Ningyu Zhang,Shumin Deng,Huajun Chen,Nanyun Peng

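Task: Propose CaKE, a new knowledge editing method that integrates updated knowledge into large language models more effectively.

Motivation: Existing knowledge editing methods update isolated facts well but struggle to generalize these updates to tasks that require multi-hop reasoning.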

Details

Method: Based on an analysis of reasoning circuits, propose CaKE, which uses strategically curated data to guide the model to use the modified knowledge and to develop appropriate reasoning circuits. Result: Experiments show that CaKE uses updated knowledge more accurately and consistently in multi-hop reasoning tasks, improving accuracy on the MQuAKE dataset by an average of 20% over existing methods. Conclusion: CaKE integrates updated knowledge more effectively and improves accuracy on multi-hop reasoning tasks. Abstract: Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they struggle to generalize these updates to multi-hop reasoning tasks that depend on the modified knowledge. Through an analysis of reasoning circuits, the neural pathways LLMs use for knowledge-based inference, we observe that current layer-localized KE approaches, such as MEMIT and WISE, which edit only a single or a few model layers, struggle to effectively incorporate updated information into these reasoning pathways. To address this limitation, we propose CaKE (Circuit-aware Knowledge Editing), a novel method that enables more effective integration of updated knowledge in LLMs. CaKE leverages strategically curated data, guided by our circuits-based analysis, that forces the model to utilize the modified knowledge, stimulating the model to develop appropriate reasoning circuits for newly integrated knowledge. Experimental results show that CaKE enables more accurate and consistent use of updated knowledge across related reasoning tasks, leading to an average of 20% improvement in multi-hop reasoning accuracy on the MQuAKE dataset compared to existing KE methods. We release the code and data at https://github.com/zjunlp/CaKE.

Xuanming Cui,Jaiminkumar Ashokbhai Bhoi,Chionh Wei Peng,Adriel Kuek,Ser Nam Lim

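Task: Dynamic scene graph generation (DSGG), a fine-grained, frame-wise understanding task for videos.

Motivation: Existing methods rely on overly complex architectural design and evaluate with recall only, which leads to a severe precision-recall trade-off, a lack of awareness of triplet importance, and inappropriate evaluation protocols.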

Details

Method: Use large multimodal models (LMMs) with a simple decoder-only structure for dynamic scene graph generation. Result: With little fine-tuning (5-10% of the training data), LMMs can become state-of-the-art scene graph generators that effectively overcome the existing issues. Conclusion: LMMs perform well on dynamic scene graph generation and achieve strong results without complex architectural design. Abstract: Dynamic Scene Graph Generation (DSGG) for videos is a challenging task in computer vision. While existing approaches often focus on sophisticated architectural design and solely use recall during evaluation, we take a closer look at their predicted scene graphs and discover three critical issues with existing DSGG methods: severe precision-recall trade-off, lack of awareness of triplet importance, and inappropriate evaluation protocols. On the other hand, recent advances in Large Multimodal Models (LMMs) have shown great capabilities in video understanding, yet they have not been tested on fine-grained, frame-wise understanding tasks like DSGG. In this work, we conduct the first systematic analysis of Video LMMs for performing DSGG. Without relying on sophisticated architectural design, we show that LMMs with a simple decoder-only structure can be turned into State-of-the-Art scene graph generators that effectively overcome the aforementioned issues, while requiring little finetuning (5-10% training data).
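
The precision-recall issue raised above is easy to see on a toy frame: a model that predicts many extra triplets can reach perfect recall while half of its output is wrong. The sketch below computes both metrics over (subject, predicate, object) triplets. The triplets and the exact-match criterion are illustrative assumptions, not the paper's evaluation protocol.

```python
def precision_recall(predicted, ground_truth):
    """Exact-match precision and recall over sets of triplets."""
    pred, gt = set(predicted), set(ground_truth)
    hits = len(pred & gt)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gt) if gt else 0.0
    return precision, recall


# One video frame: two annotated relations (hypothetical).
ground_truth = [
    ("person", "holding", "cup"),
    ("person", "sitting_on", "chair"),
]
# An over-predicting model: both correct triplets plus two spurious ones.
predicted = [
    ("person", "holding", "cup"),
    ("person", "sitting_on", "chair"),
    ("person", "looking_at", "cup"),
    ("person", "touching", "chair"),
]
precision, recall = precision_recall(predicted, ground_truth)
```

Recall alone would rate this output as perfect, whereas precision exposes that half of the predicted triplets are spurious.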

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Yang Sui,Yu-Neng Chuang,Guanchu Wang,Jiamu Zhang,Tianyi Zhang,Jiayi Yuan,Hongyi Liu,Andrew Wen,Shaochen Zhong,Hanjie Chen,Xia Hu

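Task: Systematically survey current progress toward efficient reasoning in large language models (LLMs).

Motivation: Although longer chain-of-thought (CoT) reasoning sequences improve performance, they also introduce significant computational overhead, known as the "overthinking phenomenon".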

Details

Method: Categorize existing work into several key directions: model-based efficient reasoning, reasoning output-based efficient reasoning, and input prompts-based efficient reasoning. The survey also introduces efficient data for training reasoning models, explores the reasoning capabilities of small language models, and discusses evaluation methods and benchmarking. Result: Provides the first structured survey that systematically explores current progress toward efficient reasoning in LLMs. Conclusion: Optimizing models, reasoning outputs, and input prompts can effectively improve LLM reasoning efficiency; the survey also explores the potential of small language models and evaluation methods. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance the Chain-of-Thought (CoT) reasoning. However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to verbose and redundant outputs, known as the "overthinking phenomenon". In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. Overall, relying on the inherent mechanism of LLMs, we categorize existing works into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input prompts-based efficient reasoning, which seeks to enhance reasoning efficiency based on input prompt properties such as difficulty or length control. Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small language models, and discuss evaluation methods and benchmarking.

Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion

Zhou Zhenglin,Ma Fan,Fan Hehe,Chua Tat-Seng

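Task: Propose Zero-1-to-A, a method that synthesizes a spatially and temporally consistent dataset for reconstructing 4D animatable head avatars.

Motivation: Reduce the data needed for animatable avatar generation and resolve the spatial and temporal inconsistencies that arise when distilling 4D avatars directly from video diffusion models.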

Details

Method: Iteratively construct video datasets and progressively optimize the animatable avatar so that avatar quality increases smoothly and consistently during learning. The process has two stages: spatial consistency learning and temporal consistency learning. Result: Zero-1-to-A outperforms existing diffusion-based methods in fidelity, animation quality, and rendering speed. Conclusion: Zero-1-to-A provides a solution for lifelike avatar creation, and the code is publicly available. Abstract: Animatable head avatar generation typically requires extensive data for training. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to over-smooth results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that synthesizes a spatial and temporal consistency dataset for 4D avatar reconstruction using the video diffusion model. Specifically, Zero-1-to-A iteratively constructs video datasets and optimizes animatable avatars in a progressive manner, ensuring that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from front-to-side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that Zero-1-to-A improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. Code is publicly available at: https://github.com/ZhenglinZhou/Zero-1-to-A.

XAttention: Block Sparse Attention with Antidiagonal Scoring

Ruyi Xu,Guangxuan Xiao,Haofeng Huang,Junxian Guo,Song Han

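Task: Propose XAttention, a framework for accelerating inference in long-context Transformer models.

Motivation: Long-context Transformer models matter for real-world applications but are computationally expensive because of attention's quadratic complexity. Existing block-sparse attention methods struggle to balance accuracy and efficiency.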

Details

Method: XAttention uses the sum of antidiagonal values in the attention matrix as a proxy for block importance, precisely identifying and pruning non-essential blocks to achieve high sparsity and faster inference. Result: Across multiple long-context benchmarks, XAttention achieves accuracy comparable to full attention while substantially accelerating computation, with up to 13.5x speedup. Conclusion: XAttention demonstrates the practical potential of block-sparse attention and paves the way for scalable, efficient deployment of long-context Transformer models. Abstract: Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle with balancing accuracy and efficiency due to costly block importance measurements. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks, including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation, XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications. Code is available at https://github.com/mit-han-lab/x-attention.
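
The antidiagonal proxy can be illustrated on a small attention matrix: each block is scored by the sum of its lower-left to upper-right diagonal, and only the highest-scoring blocks per block row are kept. This is a simplified sketch under stated assumptions: it uses a single antidiagonal per block and a fixed top-k selection, and it reads scores from a dense matrix that a real implementation would avoid materializing. The function names and values are hypothetical.

```python
def antidiagonal_sum(matrix, row0, col0, size):
    """Sum the antidiagonal (lower-left to upper-right) of one size x size block."""
    return sum(matrix[row0 + size - 1 - i][col0 + i] for i in range(size))


def block_scores(matrix, size):
    """Score every size x size block of the matrix by its antidiagonal sum."""
    n_rows, n_cols = len(matrix) // size, len(matrix[0]) // size
    return [[antidiagonal_sum(matrix, r * size, c * size, size)
             for c in range(n_cols)]
            for r in range(n_rows)]


def keep_top_blocks(scores, keep):
    """For each block row, keep the column indices of the `keep` best blocks."""
    kept = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        kept.append(sorted(order[:keep]))
    return kept


# Toy 4x4 attention matrix (rows sum to 1), split into 2x2 blocks.
attn = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.6],
    [0.1, 0.1, 0.5, 0.3],
]
scores = block_scores(attn, size=2)
kept = keep_top_blocks(scores, keep=1)
```

An antidiagonal touches every row and every column of its block exactly once, so one cheap sum samples the whole block instead of reading all of its entries.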

VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling

Hyojun Go,Byeongjun Park,Hyelin Nam,Byung-Hoon Kim,Hyungjin Chung,Changick Kim

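Task: Propose VideoRFSplat, a direct text-to-3D model that leverages a video generation model to generate 3D Gaussian Splatting (3DGS) for unbounded real-world scenes.

Motivation: Existing methods suffer from instability when extending 2D generative models to joint modeling and need additional models to stabilize training and inference.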

Details

Method: Propose a dual-stream architecture and an asynchronous sampling strategy: a dedicated pose generation model is attached alongside a pre-trained video generation model, and multi-view images and camera poses are generated through separate streams. Result: Trained on multiple large-scale real-world datasets, VideoRFSplat outperforms existing direct text-to-3D generation methods without relying on post-hoc refinement. Conclusion: By reducing interference between the pose and image modalities, VideoRFSplat enhances cross-modal consistency and achieves better text-to-3D generation. Abstract: We propose VideoRFSplat, a direct text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. To generate diverse camera poses and unbounded spatial extent of real-world scenes, while ensuring generalization to arbitrary text prompts, previous methods fine-tune 2D generative models to jointly model camera poses and multi-view images. However, these methods suffer from instability when extending 2D generative models to joint modeling due to the modality gap, which necessitates additional models to stabilize training and inference. In this work, we propose an architecture and a sampling strategy to jointly model multi-view images and camera poses when fine-tuning a video generation model. Our core idea is a dual-stream architecture that attaches a dedicated pose generation model alongside a pre-trained video generation model via communication blocks, generating multi-view images and camera poses through separate streams. This design reduces interference between the pose and image modalities. Additionally, we propose an asynchronous sampling strategy that denoises camera poses faster than multi-view images, allowing rapidly denoised poses to condition multi-view generation, reducing mutual ambiguity and enhancing cross-modal consistency. Trained on multiple large-scale real-world datasets (RealEstate10K, MVImgNet, DL3DV-10K, ACID), VideoRFSplat outperforms existing text-to-3D direct generation methods that heavily depend on post-hoc refinement via score distillation sampling, achieving superior results without such refinement.

Agreeing to Interact in Human-Robot Interaction using Large Language Models and Vision Language Models

Kazuhiro Sasabuchi,Naoki Wake,Atsushi Kanehira,Jun Takamatsu,Katsushi Ikeuchi

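Task: Test whether large language models (LLMs) and vision language models (VLMs) can resolve the complexity of beginning an interaction in human-robot interaction (HRI).

Motivation: The beginning of a human-robot interaction is often complex; whether the robot should interact with the human depends on several situational factors (e.g., the human's current activity, the urgency of the interaction).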

Details

Method: Compare four system-design patterns using LLMs and VLMs, tested on a test set of 84 human-robot situations. The test set mixes several publicly available datasets and includes some open-ended situations. Result: Results with the GPT-4o and Phi-3 Vision models indicate that LLMs and VLMs can handle interaction beginnings when the desired actions are clear, but challenges remain in open-ended situations, where the model must balance between the human and robot situation. Conclusion: LLMs and VLMs perform well in clear-cut situations, but open-ended situations need further research and improvement. Abstract: In human-robot interaction (HRI), the beginning of an interaction is often complex. Whether the robot should communicate with the human is dependent on several situational factors (e.g., the human's current activity, urgency of the interaction, etc.). We test whether large language models (LLM) and vision language models (VLM) can provide solutions to this problem. We compare four different system-design patterns using LLMs and VLMs, and test on a test set containing 84 human-robot situations. The test set mixes several publicly available datasets and also includes situations where the appropriate action to take is open-ended. Our results using the GPT-4o and Phi-3 Vision models indicate that LLMs and VLMs are capable of handling interaction beginnings when the desired actions are clear; however, challenges remain in the open-ended situations where the model must balance between the human and robot situation.

TruthLens: Explainable DeepFake Detection for Face Manipulated and Fully Synthetic Data

Rohit Kundu,Athula Balachandran,Amit K. Roy-Chowdhury

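Task: Propose TruthLens, a new DeepFake detection framework that not only determines whether an image is real or fake but also provides detailed explanations for its predictions.

Motivation: Existing DeepFake detection methods are often limited to binary classification (real vs. fake) and lack interpretability.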

Details

Method: Combine the global contextual understanding of multimodal large language models (such as PaliGemma2) with the localized feature extraction of vision models (such as DINOv2). Result: Experiments on multiple datasets show that TruthLens outperforms existing methods in detection accuracy and explainability and performs well in cross-data and cross-domain settings. Conclusion: TruthLens performs well at DeepFake detection, with strong generalizability and explainability. Abstract: Detecting DeepFakes has become a crucial research area as the widespread use of AI image generators enables the effortless creation of face-manipulated and fully synthetic content, yet existing methods are often limited to binary classification (real vs. fake) and lack interpretability. To address these challenges, we propose TruthLens, a novel and highly generalizable framework for DeepFake detection that not only determines whether an image is real or fake but also provides detailed textual reasoning for its predictions. Unlike traditional methods, TruthLens effectively handles both face-manipulated DeepFakes and fully AI-generated content while addressing fine-grained queries such as "Does the eyes/nose/mouth look real or fake?" The architecture of TruthLens combines the global contextual understanding of multimodal large language models like PaliGemma2 with the localized feature extraction capabilities of vision-only models like DINOv2. This hybrid design leverages the complementary strengths of both models, enabling robust detection of subtle manipulations while maintaining interpretability. Extensive experiments on diverse datasets demonstrate that TruthLens outperforms state-of-the-art methods in detection accuracy (by 2-14%) and explainability, in both in-domain and cross-data settings, generalizing effectively across traditional and emerging manipulation techniques.

Representing data in words

Amandine M. Caut,Amy Rouillard,Beimnet Zenebe,Matthias Green,Ágúst Pálmason Morthens,David J. T. Sumpter

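Task: Introduce wordalisations, a new concept for describing data in words.

Motivation: Visualisation is an important way to present data in data science, but word descriptions can also convey data concisely and clearly.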

Details

Method: Use large language models to generate wordalisations through prompt templates that follow a task-agnostic structure. Result: Reliable and engaging texts were produced in three application areas: scouting football players, personality tests, and international survey data. Conclusion: A model cards framework is proposed that emphasises stating the model, how numerical values are translated, the background information, and the limitations when generating wordalisations; the authors argue that it is a more appropriate framework for setting best practices than performance tests on benchmark datasets. Abstract: An important part of data science is the use of visualisations to display data in a way that is easy to digest. Visualisations often rely on underlying statistical or machine learning models -- ranging from basic calculations like category means to advanced methods such as principal component analysis of multidimensional datasets -- to convey insights. We introduce an analogous concept for word descriptions of data, which we call wordalisations. Wordalisations describe data in easy to digest words, without necessarily reporting numerical values from the data. We show how to create wordalisations using large language models, through prompt templates engineered according to a task-agnostic structure which can be used to automatically generate prompts from data. We show how to produce reliable and engaging texts on three application areas: scouting football players, personality tests, and international survey data. Using the model cards framework, we emphasise the importance of clearly stating the model we are imposing on the data when creating the wordalisation, detailing how numerical values are translated into words, incorporating background information into prompts for the large language model, and documenting the limitations of the wordalisations. We argue that our model cards approach is a more appropriate framework for setting best practices in wordalisation of data than performance tests on benchmark datasets.
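
The step of translating numerical values into words before prompting a language model can be sketched as below: each metric is mapped to a qualitative phrase by its z-score, and the phrases fill a prompt template. The z-score bins, the phrases, the template wording, and the sample data are all assumptions made for illustration; the paper's own templates and translation rules are not given in the abstract.

```python
from statistics import mean, pstdev


def describe(value, population):
    """Map a value to a qualitative phrase based on its z-score."""
    mu, sigma = mean(population), pstdev(population)
    z = 0.0 if sigma == 0 else (value - mu) / sigma
    if z >= 1.0:
        return "well above average"
    if z >= 0.5:
        return "above average"
    if z > -0.5:
        return "about average"
    if z > -1.0:
        return "below average"
    return "well below average"


def build_prompt(name, metrics, populations):
    """Fill a prompt template with phrases instead of raw numbers."""
    phrases = [f"{metric} is {describe(value, populations[metric])}"
               for metric, value in metrics.items()]
    return (f"Write a short, engaging description of {name}. "
            "Use only these facts: " + "; ".join(phrases) + ".")


# Hypothetical reference populations and one player's values.
populations = {"goals": [2, 4, 6, 8, 10], "passes": [10, 20, 30, 40, 50]}
prompt = build_prompt("Player A", {"goals": 10, "passes": 30}, populations)
```

Keeping the numeric-to-word mapping in explicit code, outside the language model, makes that translation inspectable and documentable.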

UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations

Debabrata Mandal,Soumitri Chattopadhyay,Guansen Tong,Praneeth Chakravarthula

Task: Propose UniCoRN, a unified image restoration method that handles multiple degradation types simultaneously.

Motivation: Existing image restoration methods usually handle only a single type of degradation, which limits their real-world applicability.

Details

Method: Uses a multi-head diffusion model in which low-level visual cues guide a controllable diffusion model, with a multi-head control network designed around a mixture-of-experts strategy. Result: Extensive evaluation on several challenging datasets shows significant performance gains in restoring severely degraded images. Conclusion: UniCoRN robustly restores images with multiple degradations; the work also introduces the MetaRestore benchmark. Abstract: Image restoration is essential for enhancing degraded images across computer vision tasks. However, most existing methods address only a single type of degradation (e.g., blur, noise, or haze) at a time, limiting their real-world applicability where multiple degradations often occur simultaneously. In this paper, we propose UniCoRN, a unified image restoration approach capable of handling multiple degradation types simultaneously using a multi-head diffusion model. Specifically, we uncover the potential of low-level visual cues extracted from images in guiding a controllable diffusion model for real-world image restoration and we design a multi-head control network adaptable via a mixture-of-experts strategy. We train our model without any prior assumption of specific degradations, through a smartly designed curriculum learning recipe. Additionally, we introduce MetaRestore, a metalens imaging benchmark containing images with multiple degradations and artifacts. Extensive evaluations on several challenging datasets, including our benchmark, demonstrate that our method achieves significant performance gains and can robustly restore images with severe degradations. Project page: https://codejaeger.github.io/unicorn-gh

Superhuman AI Disclosure: Impacts on Toxicity, Fairness, and Trust Vary by Expertise and Persona Attributes

Jaymari Chua,Chen Wang,Lina Yao

Task: Study how transparency affects attitudes toward and perceptions of AI.

Motivation: As AI surpasses human performance on real-world tasks, disclosing its superhuman capabilities poses challenges for fairness, accountability, and trust.

Details

Method: Introduces a validated set of synthetic personas reflecting diverse fairness concerns and technology acceptance levels, and evaluates responses in two contrasting domains (a competitive player in StarCraft II and a cooperative personal assistant providing information). Result: In StarCraft II, explicitly labelling the AI as superhuman reduced toxicity and increased perceived fairness for novice personas, while expert personas found the disclosure statements irksome but still less deceptive than non-disclosure. In the LLM-as-personal-assistant setting, disclosing superhuman capabilities improved perceived trustworthiness but risked overreliance on AI among some personas. Conclusion: Transparency is not a cure-all: it reduces suspicion and enhances trust in cooperative contexts but may provoke resistance or disappointment in competitive domains. Abstract: As artificial intelligence demonstrates performance surpassing that of humans across real-world tasks, disclosing superhuman capabilities poses challenges for fairness, accountability, and trust. To investigate how transparency impacts attitudes and perceptions, we introduce a grounded and validated set of synthetic personas reflecting diverse fairness concerns and technology acceptance levels. Then we evaluate responses in two contrasting domains: (1) a competitive player in StarCraft II, where strategy and high-skill gameplay often elicit toxic interactions, and (2) a cooperative personal assistant providing information. Across numerous interactions spanning persona profiles, we test non-disclosure versus explicit superhuman labelling under controlled game outcomes and usage contexts. Our findings reveal sharp domain-specific effects: in StarCraft II, when the AI was explicitly labelled as superhuman, novice personas who learned of it reported lower toxicity and higher fairness -- attributing defeat to advanced skill rather than hidden cheating -- whereas expert personas found the disclosure statements irksome but still less deceptive than non-disclosure. Conversely, in the LLM as personal-assistant setting, disclosure of superhuman capabilities improved perceived trustworthiness, though it risked AI overreliance among certain persona segments. We release Dataset X, containing persona cards (including profile attributes, disclosure prompts, and detailed interaction logs), accompanied by reproducible protocols and disclaimers for adapting them to diverse tasks. Our results demonstrate that transparency is not a cure-all: while it reduces suspicion and enhances trust in cooperative contexts, it may inflame resistance or disappointment in competitive domains.

MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations

Kyungho Bae,Jinhyung Kim,Sihaeng Lee,Soonyoung Lee,Gunhee Lee,Jinwoo Choi

Task: Address action-scene hallucination in Video Large Language Models (Video-LLMs).

Motivation: Existing Video-LLMs often suffer from action-scene hallucination because they intermingle spatial and temporal features and because of problems with the standard Rotary Position Embedding (RoPE).

Details

Method: Proposes MASH-VLM, which mitigates action-scene hallucination through disentangled spatial-temporal representations, specifically via the DST-attention mechanism and Harmonic-RoPE. Result: MASH-VLM achieves state-of-the-art results on the UNSCENE benchmark and on existing video understanding benchmarks. Conclusion: By disentangling spatial-temporal representations and introducing a new attention mechanism, MASH-VLM effectively mitigates action-scene hallucination in Video-LLMs. Abstract: In this work, we tackle action-scene hallucination in Video Large Language Models (Video-LLMs), where models incorrectly predict actions based on the scene context or scenes based on observed actions. We observe that existing Video-LLMs often suffer from action-scene hallucination due to two main factors. First, existing Video-LLMs intermingle spatial and temporal features by applying an attention operation across all tokens. Second, they use the standard Rotary Position Embedding (RoPE), which causes the text tokens to overemphasize certain types of tokens depending on their sequential orders. To address these issues, we introduce MASH-VLM, Mitigating Action-Scene Hallucination in Video-LLMs through disentangled spatial-temporal representations. Our approach includes two key innovations: (1) DST-attention, a novel attention mechanism that disentangles the spatial and temporal tokens within the LLM by using masked attention to restrict direct interactions between the spatial and temporal tokens; (2) Harmonic-RoPE, which extends the dimensionality of the positional IDs, allowing the spatial and temporal tokens to maintain balanced positions relative to the text tokens. To evaluate the action-scene hallucination in Video-LLMs, we introduce the UNSCENE benchmark with 1,320 videos and 4,078 QA pairs. Extensive experiments demonstrate that MASH-VLM achieves state-of-the-art results on the UNSCENE benchmark, as well as on existing video understanding benchmarks.

From Divergence to Consensus: Evaluating the Role of Large Language Models in Facilitating Agreement through Adaptive Strategies

Loukas Triantafyllopoulos,Dimitris Kalles

Task: Propose a framework that uses large language models (LLMs) as automated facilitators to improve consensus building in group decision-making.

Motivation: Traditional group decision-making methods rely on human facilitators and are constrained in scalability and efficiency, especially in large-scale, fast-paced discussions.

Details

Method: Uses cosine similarity as the core metric to evaluate how well three state-of-the-art LLMs (ChatGPT 4.0, Mistral Large 2, and AI21 Jamba Instruct) generate consensus proposals that align with participants' viewpoints. The system integrates adaptive facilitation strategies, including clarifying misunderstandings, summarizing discussions, and proposing compromises. Result: Experiments show that ChatGPT 4.0 performs best at reaching consensus, aligning more closely with participants' opinions and requiring fewer iterations. Conclusion: LLM-driven facilitation has the potential to improve collective decision-making; future research should further advance evaluation metrics and cross-cultural adaptability. Abstract: Achieving consensus in group decision-making often involves overcoming significant challenges, particularly in reconciling diverse perspectives and mitigating biases that hinder agreement. Traditional methods relying on human facilitators are often constrained by scalability and efficiency, especially in large-scale, fast-paced discussions. To address these challenges, this study proposes a novel framework employing large language models (LLMs) as automated facilitators within a custom-built multi-user chat system. Leveraging cosine similarity as a core metric, this approach evaluates the ability of three state-of-the-art LLMs -- ChatGPT 4.0, Mistral Large 2, and AI21 Jamba Instruct -- to synthesize consensus proposals that align with participants' viewpoints. Unlike conventional techniques, the system integrates adaptive facilitation strategies, including clarifying misunderstandings, summarizing discussions, and proposing compromises, enabling the LLMs to iteratively refine consensus proposals based on user feedback. Experimental results demonstrate the superiority of ChatGPT 4.0, which achieves higher alignment with participant opinions, requiring fewer iterations to reach consensus compared to its counterparts. Moreover, analysis reveals the nuanced performance of the models across various sustainability-focused discussion topics, such as climate action, quality education, good health and well-being, and access to clean water and sanitation. These findings highlight the transformative potential of LLM-driven facilitation for improving collective decision-making processes and underscore the importance of advancing evaluation metrics and cross-cultural adaptability in future research.
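
To make the alignment metric concrete, here is a minimal sketch of scoring a consensus proposal against participant opinions with cosine similarity. It assumes the proposal and each opinion have already been turned into numeric vectors; the abstract does not say how texts are embedded, so the vectors and the helper `mean_alignment` are hypothetical illustrations rather than the authors' pipeline.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def mean_alignment(proposal_vec, participant_vecs):
    """Hypothetical summary score: average similarity of one proposal
    to every participant's opinion vector."""
    scores = [cosine_similarity(proposal_vec, p) for p in participant_vecs]
    return sum(scores) / len(scores)


# Toy vectors standing in for text embeddings (assumed, not from the paper).
proposal = [1.0, 0.0]
opinions = [[1.0, 0.0], [0.0, 1.0]]
alignment = mean_alignment(proposal, opinions)
```

A higher average would indicate a proposal closer to the group's stated views; how the paper aggregates similarities across participants and iterations is not specified in the abstract.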

MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving

Haiguang Wang,Daqi Liu,Hongwei Xie,Haisong Liu,Enhui Ma,Kaicheng Yu,Limin Wang,Bing Wang

Task: Propose MiLA, a framework for generating high-fidelity, long-duration videos that addresses the error accumulation existing methods suffer when generating long videos.

Motivation: Data-driven techniques have greatly advanced autonomous driving systems, but the need for rare and diverse training data remains a challenge that requires significant investment in equipment and labor. World models offer a solution by synthesizing annotated video data, but existing methods struggle to stay consistent and avoid accumulating errors when generating long videos.

Details

Method: MiLA uses a coarse-to-fine approach to stabilize video generation and correct distortion of dynamic objects, and introduces a Temporal Progressive Denoising Scheduler and Joint Denoising and Correcting Flow modules to improve the quality of generated videos. Result: Extensive experiments on the nuScenes dataset show that MiLA achieves state-of-the-art video generation quality. Conclusion: MiLA markedly improves the quality and consistency of long-duration video generation, offering an effective way to generate training data for autonomous driving systems. Abstract: In recent years, data-driven techniques have greatly advanced autonomous driving systems, but the need for rare and diverse training data remains a challenge, requiring significant investment in equipment and labor. World models, which predict and generate future environmental states, offer a promising solution by synthesizing annotated video data for training. However, existing methods struggle to generate long, consistent videos without accumulating errors, especially in dynamic scenes. To address this, we propose MiLA, a novel framework for generating high-fidelity, long-duration videos up to one minute. MiLA utilizes a Coarse-to-Re(fine) approach to both stabilize video generation and correct distortion of dynamic objects. Additionally, we introduce a Temporal Progressive Denoising Scheduler and Joint Denoising and Correcting Flow modules to improve the quality of generated videos. Extensive experiments on the nuScenes dataset show that MiLA achieves state-of-the-art performance in video generation quality. For more information, visit the project website: https://github.com/xiaomi-mlab/mila.github.io.

Tharindu Kumarage,Cameron Johnson,Jadie Adams,Lin Ai,Matthias Kirchner,Anthony Hoogs,Joshua Garland,Julia Hirschberg,Arslan Basharat,Huan Liu

Task: Propose SE-VSim, an LLM-based framework that simulates social engineering attack mechanisms and generates multi-turn conversations to assess how victims' personality traits affect susceptibility to attack.

Motivation: With the rapid development of chatbots powered by large language models, the risk of social engineering attacks on social media platforms has grown significantly. Understanding the attack mechanisms and how victims' personality traits affect susceptibility is key to mitigating this threat.

Details

Method: Proposes the LLM-agentic framework SE-VSim, which simulates social engineering attack mechanisms, generates multi-turn conversations, and models victim agents with varying personality traits. Attack scenarios are analysed using a dataset of over 1000 simulated conversations. Result: Based on the analysis, the authors present a proof of concept, SE-OmniGuard, which offers users personalized protection by leveraging prior knowledge of the victim's personality, evaluating attack strategies, and monitoring information exchanges in conversations. Conclusion: The SE-VSim framework simulates social engineering attack mechanisms, and SE-OmniGuard provides personalized protection by identifying potential social engineering attempts. Abstract: The rapid advancement of conversational agents, particularly chatbots powered by Large Language Models (LLMs), poses a significant risk of social engineering (SE) attacks on social media platforms. SE detection in multi-turn, chat-based interactions is considerably more complex than single-instance detection due to the dynamic nature of these conversations. A critical factor in mitigating this threat is understanding the mechanisms through which SE attacks operate, specifically how attackers exploit vulnerabilities and how victims' personality traits contribute to their susceptibility. In this work, we propose an LLM-agentic framework, SE-VSim, to simulate SE attack mechanisms by generating multi-turn conversations. We model victim agents with varying personality traits to assess how psychological profiles influence susceptibility to manipulation. Using a dataset of over 1000 simulated conversations, we examine attack scenarios in which adversaries, posing as recruiters, funding agencies, and journalists, attempt to extract sensitive information. Based on this analysis, we present a proof of concept, SE-OmniGuard, to offer personalized protection to users by leveraging prior knowledge of the victim's personality, evaluating attack strategies, and monitoring information exchanges in conversations to identify potential SE attempts.

Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation

Tiange Xiang,Kai Li,Chengjiang Long,Christian Häne,Peihong Guo,Scott Delp,Ehsan Adeli,Li Fei-Fei

Task: Repurpose pre-trained 2D diffusion models for 3D object generation.

Motivation: The scarcity of high-quality 3D data has hindered the development of 3D diffusion models, leaving them less competitive than their 2D counterparts.

Details

Method: Proposes Gaussian Atlas, a new representation that uses dense 2D grids so that 2D diffusion models can be fine-tuned to generate 3D Gaussians. Result: Experiments show that text-to-image diffusion models can be effectively adapted for 3D content generation, narrowing the gap between 2D and 3D modeling. Conclusion: With Gaussian Atlas and the GaussianVerse dataset, pre-trained 2D diffusion models are successfully applied to 3D object generation. Abstract: Recent advances in text-to-image diffusion models have been driven by the increasing availability of paired 2D data. However, the development of 3D diffusion models has been hindered by the scarcity of high-quality 3D data, resulting in less competitive performance compared to their 2D counterparts. To address this challenge, we propose repurposing pre-trained 2D diffusion models for 3D object generation. We introduce Gaussian Atlas, a novel representation that utilizes dense 2D grids, enabling the fine-tuning of 2D diffusion models to generate 3D Gaussians. Our approach demonstrates successful transfer learning from a pre-trained 2D diffusion model to a 2D manifold flattened from 3D structures. To support model training, we compile GaussianVerse, a large-scale dataset comprising 205K high-quality 3D Gaussian fittings of various 3D objects. Our experimental results show that text-to-image diffusion models can be effectively adapted for 3D content generation, bridging the gap between 2D and 3D modeling.

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

Federico Cocchi,Nicholas Moratelli,Davide Caffagni,Sara Sarto,Lorenzo Baraldi,Marcella Cornia,Rita Cucchiara

Task: Examine the critical roles of the visual backbone and the language model in Multimodal Large Language Models (MLLMs), and introduce the LLaVA-MORE model family for systematic analysis.

Motivation: Prior work has mainly focused on scaling visual backbones and language models to billions of parameters, while the trade-offs between model size, architecture, and performance remain underexplored. In addition, inconsistencies in training data and evaluation protocols hinder direct comparisons and make it hard to derive optimal design choices.

Details

Method: Introduces the LLaVA-MORE model family with a unified training protocol, systematically analyses small- and medium-scale language models (such as Phi-4, LLaMA-3.1, and Gemma-2) on multimodal reasoning, generation, and instruction following, and studies the relationship between model size and performance. It also comprehensively studies the impact of various visual encoders (such as CLIP, DINOv2, SigLIP, and SigLIP2). Result: The findings offer insights into designing more effective MLLMs and provide a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Conclusion: The LLaVA-MORE family and its unified training protocol are a useful reference for designing and evaluating MLLMs and for developing and optimizing future models. Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs -- including Phi-4, LLaMA-3.1, and Gemma-2 -- to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: https://github.com/aimagelab/LLaVA-MORE.

Enhancing Zero-Shot Image Recognition in Vision-Language Models through Human-like Concept Guidance

Hui Liu,Wenya Wang,Kecheng Chen,Jie Liu,Yibing Liu,Tiexin Qin,Peisong He,Xinghao Jiang,Haoliang Li

Task: Propose a Concept-guided Human-like Bayesian Reasoning (CHBR) framework for zero-shot image recognition.

Motivation: Existing vision-language models (VLMs) underperform in real-world applications, mainly because of sub-optimal prompt engineering and an inability to adapt effectively to target classes.

Details

Method: Grounded in Bayes' theorem, CHBR models the concepts used in human image recognition as latent variables and generates discriminative concepts with an importance sampling algorithm. Result: Extensive evaluation on fifteen datasets shows that CHBR outperforms existing zero-shot generalization methods. Conclusion: By dynamically refining the combination of concepts, CHBR significantly improves zero-shot image recognition performance. Abstract: In zero-shot image recognition tasks, humans demonstrate remarkable flexibility in classifying unseen categories by composing known simpler concepts. However, existing vision-language models (VLMs), despite achieving significant progress through large-scale natural language supervision, often underperform in real-world applications because of sub-optimal prompt engineering and the inability to adapt effectively to target classes. To address these issues, we propose a Concept-guided Human-like Bayesian Reasoning (CHBR) framework. Grounded in Bayes' theorem, CHBR models the concepts used in human image recognition as latent variables and formulates this task by summing across potential concepts, weighted by a prior distribution and a likelihood function. To tackle the intractable computation over an infinite concept space, we introduce an importance sampling algorithm that iteratively prompts large language models (LLMs) to generate discriminative concepts, emphasizing inter-class differences. We further propose three heuristic approaches involving Average Likelihood, Confidence Likelihood, and Test Time Augmentation (TTA) Likelihood, which dynamically refine the combination of concepts based on the test image. Extensive evaluations across fifteen datasets demonstrate that CHBR consistently outperforms existing state-of-the-art zero-shot generalization methods.
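
The "sum across concepts, weighted by a prior and a likelihood" idea can be illustrated with a small marginalization over a finite set of sampled concepts. This is a simplified sketch under stated assumptions: the concept weights and per-class likelihoods are made-up numbers, `class_posterior` is a hypothetical helper, and the paper's importance-sampling estimator and its three likelihood heuristics are not reproduced here.

```python
def class_posterior(concept_weights, likelihoods):
    """Marginalize class scores over K sampled concepts.

    concept_weights: K non-negative weights (a stand-in for the prior,
        renormalized below so they sum to one).
    likelihoods: K rows, each holding one score per class, where
        likelihoods[k][y] is how well class y fits the image under concept k.
    Returns a normalized distribution over classes.
    """
    total_weight = sum(concept_weights)
    num_classes = len(likelihoods[0])
    scores = []
    for y in range(num_classes):
        # Weighted sum over concepts for class y.
        s = sum(
            (w / total_weight) * row[y]
            for w, row in zip(concept_weights, likelihoods)
        )
        scores.append(s)
    z = sum(scores)
    return [s / z for s in scores]


# Two sampled concepts, two candidate classes (illustrative numbers only).
posterior = class_posterior([0.5, 0.5], [[0.8, 0.2], [0.4, 0.6]])
predicted_class = max(range(len(posterior)), key=posterior.__getitem__)
```

In this toy case the first class wins because it scores higher on average across the two equally weighted concepts.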

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Shravan Nayak,Xiangru Jian,Kevin Qinghong Lin,Juan A. Rodriguez,Montek Kalsi,Rabiul Awal,Nicolas Chapados,M. Tamer Özsu,Aishwarya Agrawal,David Vazquez,Christopher Pal,Perouz Taslakian,Spandana Gella,Sai Rajeswar

Task: Introduce UI-Vision, a comprehensive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments.

Motivation: Existing research focuses on online settings, while desktop environments, which are critical for many professional and everyday tasks, remain underexplored because of data collection challenges and licensing issues.

Details

Method: Introduces UI-Vision, which provides dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs), and defines three fine-to-coarse grained tasks: Element Grounding, Layout Grounding, and Action Prediction. Result: The evaluation reveals critical limitations in state-of-the-art models such as UI-TARS-72B, including problems with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. Conclusion: UI-Vision is released to advance the development of more capable agents for real-world desktop tasks. Abstract: Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse grained tasks -- Element Grounding, Layout Grounding, and Action Prediction -- with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer use agents. By releasing UI-Vision as open-source, we aim to advance the development of more capable agents for real-world desktop tasks.

DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering

Haochen Wang,Kai Hu,Liangcai Gao

Task: Introduce and evaluate the DocVideoQA task and dataset, aimed at improving multimodal understanding of document-centric videos.

Motivation: The spread of remote work and online courses has produced a large number of document-based instructional videos, which carry rich-text images and audio and require advanced multimodal understanding. The area remains underexplored because of dataset availability and its inherent complexity.

Details

Method: Proposes the DocVideoQA task and dataset, with 1454 videos and 154k question-answer pairs. Baselines are established with open-source MLLMs, and the DV-LLaMA model is proposed, which strengthens modality integration through diverse instruction-tuning data and contrastive learning. Result: DV-LLaMA significantly outperforms existing models on the DocVideoQA dataset. Conclusion: By enhancing unimodal feature extraction and modality integration, DV-LLaMA significantly improves understanding of document-centric videos; the code and dataset will be released to facilitate future research. Abstract: Remote work and online courses have become important methods of knowledge dissemination, leading to a large number of document-based instructional videos. Unlike traditional video datasets, these videos mainly feature rich-text images and audio that are densely packed with information closely tied to the visual content, requiring advanced multimodal understanding capabilities. However, this domain remains underexplored due to dataset availability and its inherent complexity. In this paper, we introduce the DocVideoQA task and dataset for the first time, comprising 1454 videos across 23 categories with a total duration of about 828 hours. The dataset is annotated with 154k question-answer pairs generated manually and via GPT, assessing models' comprehension, temporal awareness, and modality integration capabilities. Initially, we establish a baseline using open-source MLLMs. Recognizing the challenges in modality comprehension for document-centric videos, we present DV-LLaMA, a robust video MLLM baseline. Our method enhances unimodal feature extraction with diverse instruction-tuning data and employs contrastive learning to strengthen modality integration. Through fine-tuning, the LLM is equipped with audio-visual capabilities, leading to significant improvements in document-centric video understanding. Extensive testing on the DocVideoQA dataset shows that DV-LLaMA significantly outperforms existing models. We will release the code and dataset to facilitate future research.

Mixture of Lookup Experts

Shibo Jie,Yehui Tang,Kai Han,Yitong Li,Duyu Tang,Zhi-Hong Deng,Yunhe Wang

Task: Propose Mixture of Lookup Experts (MoLE), a new architecture that addresses the high VRAM requirements and communication overhead of Mixture-of-Experts (MoE) at inference time.

Motivation: Although MoE activates only a subset of experts during inference, all experts must still be loaded into VRAM, which leads to high VRAM requirements and communication overhead.

Details

Method: In MoLE, the experts are feed-forward networks (FFNs) during training; before inference they can be re-parameterized as lookup tables (LUTs) and offloaded to storage devices, so that expert outputs are retrieved directly from input ids at inference time. Result: Experiments show that, with the same FLOPs and VRAM usage, MoLE reaches inference speeds comparable to dense models and significantly faster than MoE with expert offloading, while maintaining performance on par with MoE. Conclusion: MoLE is more efficient in both communication and VRAM usage, and significantly speeds up inference while preserving performance. Abstract: Mixture-of-Experts (MoE) activates only a subset of experts during inference, allowing the model to maintain low inference FLOPs and latency even as the parameter count scales up. However, since MoE dynamically selects the experts, all the experts need to be loaded into VRAM. Their large parameter size still limits deployment, and offloading, which loads experts into VRAM only when needed, significantly increases inference latency. To address this, we propose Mixture of Lookup Experts (MoLE), a new MoE architecture that is efficient in both communication and VRAM usage. In MoLE, the experts are Feed-Forward Networks (FFNs) during training, taking the output of the embedding layer as input. Before inference, these experts can be re-parameterized as lookup tables (LUTs) that retrieve expert outputs based on input ids, and offloaded to storage devices. Therefore, we do not need to perform expert computations during inference. Instead, we directly retrieve the expert's computation results based on input ids and load them into VRAM, and thus the resulting communication overhead is negligible. Experiments show that, with the same FLOPs and VRAM usage, MoLE achieves inference speeds comparable to dense models and significantly faster than MoE with experts offloading, while maintaining performance on par with MoE.
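
The re-parameterization step works because an expert whose only input is the embedding of a token id is a pure function of that id, so its output can be precomputed for every vocabulary entry. The sketch below illustrates this for a single toy expert; the two-layer ReLU FFN, the tiny weights, and the names `expert_ffn` and `build_lut` are assumptions for illustration, and routing between experts is omitted because the abstract does not describe it.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def expert_ffn(embedding, w_in, w_out):
    """Toy expert used during training: two linear layers with a ReLU."""
    return matvec(w_out, relu(matvec(w_in, embedding)))


def build_lut(embedding_table, w_in, w_out):
    """Precompute the expert output for every token id (done once,
    before inference). Row i of the result belongs to token id i."""
    return [expert_ffn(e, w_in, w_out) for e in embedding_table]


# Toy vocabulary of three tokens with 2-dimensional embeddings.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
w_in = [[1.0, 2.0], [-1.0, 0.5]]
w_out = [[1.0, 1.0]]

lut = build_lut(embedding_table, w_in, w_out)

# At inference, a table lookup replaces the FFN computation.
token_id = 1
looked_up = lut[token_id]
recomputed = expert_ffn(embedding_table[token_id], w_in, w_out)
```

The table has one row per vocabulary entry, which is why it can live on a storage device and only the rows for the current input ids need to be fetched.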

UMIT: Unifying Medical Imaging Tasks via Vision-Language Models

Haiyang Yu,Siyang Yi,Ke Niu,Minghan Zhuo,Bin Li

Task: Propose UMIT, a unified multi-modal, multi-task vision-language model for medical imaging tasks.

Motivation: Existing research has mainly focused on specific tasks or single modalities, which limits applicability and generalization across diverse medical scenarios.

Details

Method: Designs a unique two-stage training strategy and fine-tunes UMIT with designed instruction templates. Result: UMIT outperforms previous methods on five tasks across multiple datasets, significantly improving diagnostic accuracy and workflow efficiency. Conclusion: UMIT provides effective solutions for medical imaging applications, with broad applicability and global accessibility. Abstract: With the rapid advancement of deep learning, particularly in the field of medical image analysis, an increasing number of Vision-Language Models (VLMs) are being widely applied to solve complex health and biomedical challenges. However, existing research has primarily focused on specific tasks or single modalities, which limits their applicability and generalization across diverse medical scenarios. To address this challenge, we propose UMIT, a unified multi-modal, multi-task VLM designed specifically for medical imaging tasks. UMIT is able to solve various tasks, including visual question answering, disease detection, and medical report generation. In addition, it is applicable to multiple imaging modalities (e.g., X-ray, CT and PET), covering a wide range of applications from basic diagnostics to complex lesion analysis. Moreover, UMIT supports both English and Chinese, expanding its applicability globally and ensuring accessibility to healthcare services in different linguistic contexts. To enhance the model's adaptability and task-handling capability, we design a unique two-stage training strategy and fine-tune UMIT with designed instruction templates. Through extensive empirical evaluation, UMIT outperforms previous methods in five tasks across multiple datasets. The performance of UMIT indicates that it can significantly enhance diagnostic accuracy and workflow efficiency, thus providing effective solutions for medical imaging applications.

ChatGPT and U(X): A Rapid Review on Measuring the User Experience

Katie Seaborn

Task: Explore how the user experience (UX) of ChatGPT has been evaluated quantitatively.

Motivation: ChatGPT has revolutionized human-computer interaction (HCI) since its 2022 release, but a coherent pathway for evaluating the user experience it offers is still missing.

Details

Method: A rapid review (N = 58) of quantitative approaches to ChatGPT UX, focusing on the independent variables (IVs), the dependent variables (DVs), and the measurement methods. Result: The review identifies trends, gaps, and emerging consensus in UX assessments, and proposes two preliminary frameworks to guide future research and tool development. Conclusion: The study offers a first synthesis of approaches to measuring ChatGPT UX and sets out urgent directions for standardization and breadth, with the aim of optimizing user interactions with ChatGPT and similar LLM-based systems. Abstract: ChatGPT, powered by a large language model (LLM), has revolutionized everyday human-computer interaction (HCI) since its 2022 release. While now used by millions around the world, a coherent pathway for evaluating the user experience (UX) ChatGPT offers remains missing. In this rapid review (N = 58), I explored how ChatGPT UX has been approached quantitatively so far. I focused on the independent variables (IVs) manipulated, the dependent variables (DVs) measured, and the methods used for measurement. Findings reveal trends, gaps, and emerging consensus in UX assessments. This work offers a first step towards synthesizing existing approaches to measuring ChatGPT UX, urgent trajectories to advance standardization and breadth, and two preliminary frameworks aimed at guiding future research and tool development. I seek to elevate the field of ChatGPT UX by empowering researchers and practitioners in optimizing user interactions with ChatGPT and similar LLM-based systems.

UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis

Jiawei Wang,Kai Hu,Qiang Huo

Task: Propose UniHDSA, a unified approach to hierarchical document structure analysis that treats the various sub-tasks as relation prediction problems.

Motivation: Document structure analysis is crucial for understanding both the physical layout and the logical structure of documents, and serves information retrieval, document summarization, knowledge extraction, and similar applications.

Details

Method: Proposes a unified relation prediction approach, UniHDSA, which treats the various hierarchical document structure analysis sub-tasks as relation prediction problems and consolidates the relation prediction labels into a unified label space. Result: Experiments show that UniHDSA achieves state-of-the-art performance on the hierarchical document structure analysis benchmark Comp-HRDoc and competitive results on the large-scale document layout analysis dataset DocLayNet. Conclusion: UniHDSA shows superior results across all sub-tasks, demonstrating its effectiveness. Abstract: Document structure analysis, aka document layout analysis, is crucial for understanding both the physical layout and logical structure of documents, serving information retrieval, document summarization, knowledge extraction, etc. Hierarchical Document Structure Analysis (HDSA) specifically aims to restore the hierarchical structure of documents created using authoring software with hierarchical schemas. Previous research has primarily followed two approaches: one focuses on tackling specific subtasks of HDSA in isolation, such as table detection or reading order prediction, while the other adopts a unified framework that uses multiple branches or modules, each designed to address a distinct task. In this work, we propose a unified relation prediction approach for HDSA, called UniHDSA, which treats various HDSA sub-tasks as relation prediction problems and consolidates relation prediction labels into a unified label space. This allows a single relation prediction module to handle multiple tasks simultaneously, whether at a page-level or document-level structure analysis. To validate the effectiveness of UniHDSA, we develop a multimodal end-to-end system based on Transformer architectures. Extensive experimental results demonstrate that our approach achieves state-of-the-art performance on a hierarchical document structure analysis benchmark, Comp-HRDoc, and competitive results on a large-scale document layout analysis dataset, DocLayNet, effectively illustrating the superiority of our method across all sub-tasks.

Entropy-based Exploration Conduction for Multi-step Reasoning

Jinghan Zhang,Xiting Wang,Fengran Mo,Yeyang Zhou,Wanfu Gao,Kunpeng Liu

Task: Propose Entropy-based Exploration Depth Conduction (Entro-duction), a method that dynamically adjusts exploration depth in multi-step reasoning.

Motivation: Existing methods for automatically deciding exploration depth are costly and inflexible, which undermines the model's reasoning accuracy.

Details

Method: Monitors the LLM's output entropy and variance entropy to adjust exploration depth dynamically, choosing probabilistically whether to deepen, expand, or stop exploration. Result: Experiments on four benchmark datasets demonstrate the efficacy of Entro-duction. Conclusion: Entro-duction balances reasoning accuracy and exploration effectiveness; further experiments and analysis examine how each component contributes to reasoning performance. Abstract: In large language model (LLM) reasoning, multi-step processes have proven effective for solving complex tasks. However, the depth of exploration can significantly affect the reasoning performance. Existing methods to automatically decide the depth often bring high costs and lack flexibility, and thus undermine the model's reasoning accuracy. To address these issues, we propose Entropy-based Exploration Depth Conduction (Entro-duction), a novel method that dynamically adjusts the exploration depth during multi-step reasoning by monitoring the LLM's output entropy and variance entropy. We employ these two metrics to capture the model's current uncertainty and the fluctuation of uncertainty across consecutive reasoning steps. Based on the observed changes, the LLM selects whether to deepen, expand or stop exploration according to the probability. In this way, we balance the reasoning accuracy and exploration effectiveness. Experimental results across four benchmark datasets demonstrate the efficacy of Entro-duction. We further conduct experiments and analysis on the components of Entro-duction to discuss their contributions to reasoning performance.
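
The two monitored signals can be sketched as follows. Output entropy is taken here as the Shannon entropy of a step's output distribution, which is a standard definition. The abstract does not define "variance entropy" precisely, so the second function assumes it means the variance of the entropy values over recent reasoning steps; that reading, the function names, and the toy distributions are all assumptions, and the paper's probabilistic deepen/expand/stop rule is not reproduced.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_variance(entropies):
    """Population variance of entropy values across reasoning steps
    (an assumed reading of 'variance entropy')."""
    mean = sum(entropies) / len(entropies)
    return sum((h - mean) ** 2 for h in entropies) / len(entropies)


# Toy output distributions for three consecutive reasoning steps.
step_distributions = [
    [0.5, 0.5],   # maximally uncertain between two options
    [0.9, 0.1],   # fairly confident
    [1.0, 0.0],   # fully confident
]
step_entropies = [shannon_entropy(d) for d in step_distributions]
fluctuation = entropy_variance(step_entropies)
```

Here the per-step entropy falls as the distributions sharpen, and the variance summarizes how much the uncertainty moved over the window.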

Learning 3D Scene Analogies with Neural Contextual Scene Maps

Junho Kim,Gwangtak Bae,Eun Sun Lee,Young Min Kim

Task: Propose a new method for identifying relational commonalities in 3D scenes and producing 3D scene analogies.

Motivation: Understanding scene context is crucial for machines to perform tasks and adapt prior knowledge in unseen or noisy 3D environments. Data-driven learning cannot comprehensively cover the variety of layouts and open spaces, so a new way of identifying relational commonalities in 3D space is needed.

Details

Method: Proposes neural contextual scene maps, which extract descriptor fields summarizing semantic and geometric context and align them holistically in a coarse-to-fine manner for map estimation. Result: Experiments show the method is effective at identifying scene analogies and transferring trajectories or object placements across diverse indoor scenes. Conclusion: The method shows potential for robotics and AR/VR applications; it reduces reliance on individual feature points, making it robust to input noise and shape variations. Abstract: Understanding scene contexts is crucial for machines to perform tasks and adapt prior knowledge in unseen or noisy 3D environments. As it is intractable for data-driven learning to comprehensively cover the diverse range of layouts and open spaces, we propose teaching machines to identify relational commonalities in 3D spaces. Instead of focusing on point-wise or object-wise representations, we introduce 3D scene analogies, which are smooth maps between 3D scene regions that align spatial relationships. Unlike well-studied single instance-level maps, these scene-level maps smoothly link large scene regions, potentially enabling unique applications in trajectory transfer in AR/VR, long demonstration transfer for imitation learning, and context-aware object rearrangement. To find 3D scene analogies, we propose neural contextual scene maps, which extract descriptor fields summarizing semantic and geometric contexts, and holistically align them in a coarse-to-fine manner for map estimation. This approach reduces reliance on individual feature points, making it robust to input noise or shape variations. Experiments demonstrate the effectiveness of our approach in identifying scene analogies and transferring trajectories or object placements in diverse indoor scenes, indicating its potential for robotics and AR/VR applications.

InCo-DPO: Balancing Distribution Shift and Data Quality for Enhanced Preference Optimization

Yunan Wang,Jijie Li,Bo-Wen Zhang,Liangdong Wang,Guang Liu

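Task: Optimize language models to align with human preferences.

Motivation: Current research relies mostly on on-policy data and, because of the challenge posed by distribution shift, neglects the value that off-policy data offers in terms of data quality.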

Details

Method: Proposes InCo-DPO, which integrates on-policy and off-policy data and adjusts dynamically to balance distribution shift against data quality, finding an optimal trade-off. Result: On the Alpaca-Eval 2.0 and Arena-Hard benchmarks, InCo-DPO not only outperforms using on-policy or off-policy data alone but also achieves a state-of-the-art win rate of 60.8 on Arena-Hard. Conclusion: InCo-DPO overcomes the distribution-shift limitation of off-policy data and the quality limitation of on-policy data, yielding better performance. Abstract: Direct Preference Optimization (DPO) optimizes language models to align with human preferences. Utilizing on-policy samples, generated directly by the policy model, typically results in better performance due to its distribution consistency with the model compared to off-policy samples. This paper identifies the quality of candidate preference samples as another critical factor. While the quality of on-policy data is inherently constrained by the capabilities of the policy model, off-policy data, which can be derived from diverse sources, offers greater potential for quality despite experiencing distribution shifts. However, current research mostly relies on on-policy data and neglects the value of off-policy data in terms of data quality, due to the challenge posed by distribution shift. In this paper, we propose InCo-DPO, an efficient method for synthesizing preference data by integrating on-policy and off-policy data, allowing dynamic adjustments to balance distribution shifts and data quality, thus finding an optimal trade-off. Consequently, InCo-DPO overcomes the limitations of distribution shifts in off-policy data and the quality constraints of on-policy data. We evaluated InCo-DPO with the Alpaca-Eval 2.0 and Arena-Hard benchmarks. Experimental results demonstrate that our approach not only outperforms both on-policy and off-policy data but also achieves a state-of-the-art win rate of 60.8 on Arena-Hard with the vanilla DPO using Gemma-2 model.
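
For readers unfamiliar with the objective the abstract calls "vanilla DPO", the per-pair loss can be sketched as below. This shows only the standard DPO loss on one preference pair; it does not show how InCo-DPO synthesizes or mixes on-policy and off-policy data, which the abstract does not detail. The function name, the argument names, and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the frozen reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree on both responses the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen response's log-probability relative to the rejected one.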

Reconstructing In-the-Wild Open-Vocabulary Human-Object Interactions

Boran Wen,Dingbang Huang,Zichen Zhang,Jiahong Zhou,Jianbin Deng,Jingyu Gong,Yulong Chen,Lizhuang Ma,Yong-Lu Li

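Task: Reconstruct human-object interactions (HOI) from a single image.

Motivation: Existing methods are mainly trained and tested on indoor scenes. Because 3D data is scarce, and object variety in particular is limited, they generalize poorly to real-world scenes with a wide range of objects.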

Details

Method: Proposes a pipeline for annotating fine-grained 3D humans, objects, and their interactions from single images, and builds Open3DHOI, the first open-vocabulary in-the-wild 3D HOI dataset. Designs a novel Gaussian-HOI optimizer that efficiently reconstructs the spatial interactions between humans and objects while learning the contact regions. Result: More than 2,500 3D HOI assets were annotated to build the Open3DHOI dataset; the Gaussian-HOI optimizer reconstructs spatial interactions and learns contact regions. Conclusion: Several new 3D HOI understanding tasks are proposed to pave the way for future work. Data and code will be made publicly available. Abstract: Reconstructing human-object interactions (HOI) from single images is fundamental in computer vision. Existing methods are primarily trained and tested on indoor scenes due to the lack of 3D data, particularly constrained by the object variety, making it challenging to generalize to real-world scenes with a wide range of objects. The limitations of previous 3D HOI datasets were primarily due to the difficulty in acquiring 3D object assets. However, with the development of 3D reconstruction from single images, recently it has become possible to reconstruct various objects from 2D HOI images. We therefore propose a pipeline for annotating fine-grained 3D humans, objects, and their interactions from single images. We annotated 2.5k+ 3D HOI assets from existing 2D HOI datasets and built the first open-vocabulary in-the-wild 3D HOI dataset Open3DHOI, to serve as a future test set. Moreover, we design a novel Gaussian-HOI optimizer, which efficiently reconstructs the spatial interactions between humans and objects while learning the contact regions. Besides the 3D HOI reconstruction, we also propose several new tasks for 3D HOI understanding to pave the way for future work. Data and code will be publicly available at https://wenboran2002.github.io/3dhoi.

Don't Fight Hallucinations, Use Them: Estimating Image Realism using NLI over Atomic Facts

Elisei Rykov,Kseniia Petrushina,Kseniia Titova,Alexander Panchenko,Vasily Konovalov

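Task: Quantify the realism of images.

Motivation: Assessing image realism is a challenging problem in artificial intelligence. For example, an image of Einstein holding a smartphone violates common sense, because modern smartphones were invented after Einstein's death.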

Details

Method: Uses Large Vision-Language Models (LVLMs) and Natural Language Inference (NLI) to assess image realism: an LVLM extracts atomic facts from the image, pairwise entailment scores are computed among those facts, and the scores are aggregated into a single reality score. Result: Achieves state-of-the-art zero-shot performance on the WHOOPS! dataset. Conclusion: By identifying contradictions between genuine facts and hallucinated elements, the method effectively detects images that violate common sense. Abstract: Quantifying the realism of images remains a challenging problem in the field of artificial intelligence. For example, an image of Albert Einstein holding a smartphone violates common sense because modern smartphones were invented after Einstein's death. We introduce a novel method for assessing image realism using Large Vision-Language Models (LVLMs) and Natural Language Inference (NLI). Our approach is based on the premise that LVLMs may generate hallucinations when confronted with images that defy common sense. Using an LVLM to extract atomic facts from these images, we obtain a mix of accurate facts and erroneous hallucinations. We proceed by calculating pairwise entailment scores among these facts, subsequently aggregating these values to yield a singular reality score. This process serves to identify contradictions between genuine facts and hallucinatory elements, signaling the presence of images that violate common sense. Our approach has achieved a new state-of-the-art performance in zero-shot mode on the WHOOPS! dataset.
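
The "pairwise scores, then aggregate" step can be sketched as follows. The abstract names neither the NLI model nor the aggregation function, so `nli_score` is a caller-supplied stand-in, `min` is an assumed default aggregator, and `toy_nli` is a hand-written placeholder rather than a real NLI model.

```python
from itertools import permutations

def reality_score(facts, nli_score, aggregate=min):
    """Aggregate pairwise NLI scores over all ordered pairs of atomic facts."""
    pair_scores = [nli_score(premise, hypothesis)
                   for premise, hypothesis in permutations(facts, 2)]
    # With fewer than two facts there is nothing to compare; treat as neutral.
    return aggregate(pair_scores) if pair_scores else 0.0

def toy_nli(premise, hypothesis):
    """Stand-in scorer: -1.0 for a hand-listed contradictory pair, else 0.0."""
    contradictory = {frozenset({"The man is Albert Einstein.",
                                "The man is holding a smartphone."})}
    return -1.0 if frozenset({premise, hypothesis}) in contradictory else 0.0

facts = ["The man is Albert Einstein.",
         "The man is holding a smartphone.",
         "The man has white hair."]
print(reality_score(facts, toy_nli))  # -1.0
```

With `min` as the aggregator, a single strongly contradictory pair is enough to pull the score down; passing a mean instead would dilute it across all pairs.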

Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation

Jiyuan Wang,Chunyu Lin,Cheng Guan,Lang Nie,Jing He,Haodong Li,Kang Liao,Yao Zhao

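Task: Propose Jasmine, a Stable Diffusion-based self-supervised framework for monocular depth estimation.

Motivation: Existing Stable Diffusion-based methods are all supervised, while self-supervised reprojection suffers from inherent challenges (e.g., occlusions, texture-less regions, illumination variance) that produce blurry predictions with artifacts and severely compromise Stable Diffusion's latent priors.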

Details

Method: Constructs a novel surrogate task of hybrid image reconstruction, which preserves the detail priors of Stable Diffusion models by reconstructing the images themselves while preventing depth estimation from degrading. In addition, builds the Scale-Shift GRU to address the inherent misalignment between Stable Diffusion's scale- and shift-invariant estimation and self-supervised scale-invariant depth estimation. Result: Achieves state-of-the-art performance on the KITTI benchmark and shows superior zero-shot generalization across multiple datasets. Conclusion: Jasmine effectively harnesses Stable Diffusion's visual priors to enhance the sharpness and generalization of unsupervised prediction. Abstract: In this paper, we propose Jasmine, the first Stable Diffusion (SD)-based self-supervised framework for monocular depth estimation, which effectively harnesses SD's visual priors to enhance the sharpness and generalization of unsupervised prediction. Previous SD-based methods are all supervised since adapting diffusion models for dense prediction requires high-precision supervision. In contrast, self-supervised reprojection suffers from inherent challenges (e.g., occlusions, texture-less regions, illumination variance), and the predictions exhibit blurs and artifacts that severely compromise SD's latent priors. To resolve this, we construct a novel surrogate task of hybrid image reconstruction. Without any additional supervision, it preserves the detail priors of SD models by reconstructing the images themselves while preventing depth estimation from degradation. Furthermore, to address the inherent misalignment between SD's scale and shift invariant estimation and self-supervised scale-invariant depth estimation, we build the Scale-Shift GRU. It not only bridges this distribution gap but also isolates the fine-grained texture of SD output against the interference of reprojection loss. Extensive experiments demonstrate that Jasmine achieves SoTA performance on the KITTI benchmark and exhibits superior zero-shot generalization across multiple datasets.

Autonomous AI imitators increase diversity in homogeneous information ecosystems

Emil Bakkensen Johansen,Oliver Baumann

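Task: Study how AI-generated news articles affect the diversity and democratic value of information ecosystems.

Motivation: Examine the potential impact of AI's ability to imitate human-generated content on information diversity and democratic value.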

Details

Method: Introduces a large-scale simulation framework that tests two distinct imitation strategies across information environments with varying initial diversity. Result: AI-generated articles do not uniformly homogenize content; their effect depends strongly on the initial diversity of the information environment. In initially homogeneous news environments AI can introduce valuable diversity, whereas in initially highly heterogeneous environments it may reduce diversity. Conclusion: The baseline diversity of an information space critically shapes AI's impact. AI-driven imitation does not uniformly threaten information diversity; in initially homogeneous environments it can expand perspectives, styles, and topics, which matters especially for information diversity and democratic value in news contexts. Abstract: Recent breakthroughs in large language models (LLMs) have facilitated autonomous AI agents capable of imitating human-generated content. This technological advancement raises fundamental questions about AI's potential impact on the diversity and democratic value of information ecosystems. Here, we introduce a large-scale simulation framework to examine AI-based imitation in news, a context critically influential for public discourse. By systematically testing two distinct imitation strategies across a range of information environments varying in initial diversity, we demonstrate that AI-generated articles do not uniformly homogenize content. Instead, AI's influence is strongly context-dependent: AI-generated articles can introduce valuable diversity in originally homogeneous news environments, while potentially diminishing diversity in contexts that initially display high heterogeneity. These results illustrate that the baseline diversity of an information space critically shapes AI's impact, challenging assumptions that AI-driven imitation uniformly threatens information diversity. Instead, when information is initially homogeneous, AI-driven imitation can expand perspectives, styles, and topics. This is especially important in news contexts, where information diversity fosters richer public debate by exposing citizens to alternative viewpoints, challenging biases, and preventing narrative monopolies, which is essential for a resilient democracy.

Enhancing Close-up Novel View Synthesis via Pseudo-labeling

Jiatong Xia,Libo Sun,Lingqiao Liu

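Task: Propose a pseudo-label-based learning strategy to address the inability of existing methods to render accurately from viewpoints that deviate significantly from the training set, particularly close-up views.

Motivation: Existing methods such as NeRF and 3DGS perform well for viewpoints similar to those in the training set but poorly for viewpoints that deviate significantly from it, particularly close-up views, mainly because training data specific to close-up views is lacking.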

Details

Method: Introduces a pseudo-label-based learning strategy that uses pseudo-labels derived from existing training data to provide targeted supervision across a wide range of close-up viewpoints. Result: Experiments demonstrate the efficacy of the method for generating close-up views. Conclusion: The strategy addresses the shortcomings of existing methods on close-up views, and a new dataset is provided for evaluating current and future methods in this area. Abstract: Recent methods, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have demonstrated remarkable capabilities in novel view synthesis. However, despite their success in producing high-quality images for viewpoints similar to those seen during training, they struggle when generating detailed images from viewpoints that significantly deviate from the training set, particularly in close-up views. The primary challenge stems from the lack of specific training data for close-up views, leading to the inability of current methods to render these views accurately. To address this issue, we introduce a novel pseudo-label-based learning strategy. This approach leverages pseudo-labels derived from existing training data to provide targeted supervision across a wide range of close-up viewpoints. Recognizing the absence of benchmarks for this specific challenge, we also present a new dataset designed to assess the effectiveness of both current and future methods in this area. Our extensive experiments demonstrate the efficacy of our approach.

Zhihang Liu,Chen-Wei Xie,Pandeng Li,Liming Zhao,Longxiang Tang,Yun Zheng,Chuanbin Liu,Hongtao Xie

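Task: Propose a conditional token compression strategy for Multi-modal Large Language Models (MLLMs) to reduce the computational overhead caused by video frames.

Motivation: Existing compression strategies (e.g., average pooling) lose potentially useful information and cannot retain visual content according to the user's instruction.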

Details

Method: Proposes the Hybrid-level Instruction Injection Strategy (HICom), which uses the instruction as a condition to guide compression at both the local and global levels, retaining user-focused information while reducing visual tokens. Result: Experiments show that HICom improves performance by 2.43% on average across three multiple-choice QA benchmarks and saves 78.8% of tokens. Conclusion: HICom improves video understanding while reducing the number of tokens, and preserves temporal-spatial structure for easier understanding by the LLM. Abstract: Recent Multi-modal Large Language Models (MLLMs) have been challenged by the computational overhead resulting from massive video frames, often alleviated through compression strategies. However, because visual content does not contribute equally to user instructions, existing strategies (e.g., average pooling) inevitably lead to the loss of potentially useful information. To tackle this, we propose the Hybrid-level Instruction Injection Strategy for Conditional Token Compression in MLLMs (HICom), utilizing the instruction as a condition to guide the compression from both local and global levels. This encourages the compression to retain the maximum amount of user-focused information while reducing visual tokens to minimize computational burden. Specifically, the instruction condition is injected into the grouped visual tokens at the local level and the learnable tokens at the global level, and we conduct the attention mechanism to complete the conditional compression. From the hybrid-level compression, the instruction-relevant visual parts are highlighted while the temporal-spatial structure is also preserved for easier understanding of LLMs. To further unleash the potential of HICom, we introduce a new conditional pre-training stage with our proposed dataset HICom-248K. Experiments show that our HICom can obtain distinguished video understanding ability with fewer tokens, increasing the performance by 2.43% on average on three multiple-choice QA benchmarks and saving 78.8% of tokens compared with the SOTA method. The code is available at https://github.com/lntzm/HICom.

No Thing, Nothing: Highlighting Safety-Critical Classes for Robust LiDAR Semantic Segmentation in Adverse Weather

Junsung Park,Hwijeong Lee,Inha Kang,Hyunjung Shim

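Task: Improve prediction accuracy for "things" categories in LiDAR semantic segmentation under adverse weather.

Motivation: In typical driving scenes, "things" categories are often dynamic and associated with higher collision risk, making them crucial for safe navigation and planning. Existing methods predict "things" categories poorly under adverse weather, which is a serious bottleneck.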

Details

Method: Proposes NTN, which binds each point feature to its superclass to prevent degradation of semantic-level features, and defines each LiDAR beam as a local region with a regularization term to improve robustness against local feature corruption. Result: NTN achieves a +2.6 mIoU gain on the SemanticKITTI-to-SemanticSTF benchmark and +7.9 mIoU on the SemanticPOSS-to-SemanticSTF benchmark; on "things" classes in particular it gains +4.8 and +7.9 mIoU, respectively. Conclusion: NTN substantially improves prediction accuracy for "things" categories in LiDAR semantic segmentation under adverse weather, demonstrating its effectiveness. Abstract: Existing domain generalization methods for LiDAR semantic segmentation under adverse weather struggle to accurately predict "things" categories compared to "stuff" categories. In typical driving scenes, "things" categories can be dynamic and associated with higher collision risks, making them crucial for safe navigation and planning. Recognizing the importance of "things" categories, we identify their performance drop as a serious bottleneck in existing approaches. We observed that adverse weather induces both degradation of semantic-level features and corruption of local features, leading to a misprediction of "things" as "stuff". To mitigate these corruptions, we suggest our method, NTN - segmeNt Things for No-accident. To address semantic-level feature corruption, we bind each point feature to its superclass, preventing the misprediction of things classes into visually dissimilar categories. Additionally, to enhance robustness against local corruption caused by adverse weather, we define each LiDAR beam as a local region and propose a regularization term that aligns the clean data with its corrupted counterpart in feature space. NTN achieves state-of-the-art performance with a +2.6 mIoU gain on the SemanticKITTI-to-SemanticSTF benchmark and +7.9 mIoU on the SemanticPOSS-to-SemanticSTF benchmark. Notably, NTN achieves a +4.8 and +7.9 mIoU improvement on "things" classes, respectively, highlighting its effectiveness.

Redefining Toxicity: An Objective and Context-Aware Approach for Stress-Level-Based Detection

Sergey Berezin,Reza Farahbakhsh,Noel Crespi

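Task: Propose a novel, objective, and context-aware framework for toxicity detection.

Motivation: The fundamental problem of toxicity detection is that the term "toxicity" is ill-defined, so researchers rely on subjective and vague data during model training, which produces non-robust and inaccurate results.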

Details

Method: Uses stress levels as a key determinant of toxicity and proposes a new definition, metric, and training approach. Result: The framework's effectiveness is demonstrated on a dataset the authors collected. Conclusion: The study offers a more objective and accurate framework for toxicity detection. Abstract: The fundamental problem of toxicity detection lies in the fact that the term "toxicity" is ill-defined. Such uncertainty causes researchers to rely on subjective and vague data during model training, which leads to non-robust and inaccurate results, following the 'garbage in - garbage out' paradigm. This study introduces a novel, objective, and context-aware framework for toxicity detection, leveraging stress levels as a key determinant of toxicity. We propose a new definition, metric, and training approach as parts of our framework and demonstrate its effectiveness using a dataset we collected.

Text-Driven Diffusion Model for Sign Language Production

Jiayi He,Xu Wang,Ruobei Zhang,Shengeng Tang,Yaxiong Wang,Lechao Cheng

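Task: Generate semantically aligned sign language pose sequences.

Motivation: Address the challenge of generating semantically aligned sign language pose sequences from text inputs.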

Details

Method: Proposes a Text-driven Diffusion Model (TDM) framework that uses an encoder to encode text sequences and feeds them into a diffusion model as conditional input to generate sign pose sequences. Result: Achieves a BLEU-1 score of 20.17, placing second in the challenge. Conclusion: The carefully designed framework performs well on the sign language production task. Abstract: We introduce the hfut-lmc team's solution to the SLRTP Sign Production Challenge. The challenge aims to generate semantically aligned sign language pose sequences from text inputs. To this end, we propose a Text-driven Diffusion Model (TDM) framework. During the training phase, TDM utilizes an encoder to encode text sequences and incorporates them into the diffusion model as conditional input to generate sign pose sequences. To guarantee the high quality and accuracy of the generated pose sequences, we utilize two key loss functions. The joint loss function L_{joint} is used to precisely measure and minimize the differences between the joint positions of the generated pose sequences and those of the ground truth. Similarly, the bone orientation loss function L_{bone} is instrumental in ensuring that the orientation of the bones in the generated poses aligns with the actual, correct orientations. In the inference stage, the TDM framework takes on a different yet equally important task. It starts with noisy sequences and, under the strict constraints of the text conditions, gradually refines and generates semantically consistent sign language pose sequences. Our carefully designed framework performs well on the sign language production task, and our solution achieves a BLEU-1 score of 20.17, placing second in the challenge.
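
The abstract describes L_{joint} and L_{bone} only in words, so the sketch below picks one plausible form for each: mean squared error over joint positions, and one minus cosine similarity between predicted and ground-truth bone directions. Both forms, the function names, and the (parent, child) bone indexing are assumptions for illustration, not the team's published definitions. Zero-length bones are assumed not to occur.

```python
import math

def joint_loss(pred, target):
    """Mean squared distance between predicted and ground-truth joint positions."""
    total = 0.0
    for p, t in zip(pred, target):
        total += sum((a - b) ** 2 for a, b in zip(p, t))
    return total / len(pred)

def _unit(v):
    """Normalize a non-zero vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def bone_orientation_loss(pred, target, bones):
    """Mean (1 - cosine similarity) between predicted and true bone directions."""
    total = 0.0
    for parent, child in bones:
        d_pred = _unit(tuple(c - p for p, c in zip(pred[parent], pred[child])))
        d_true = _unit(tuple(c - p for p, c in zip(target[parent], target[child])))
        total += 1.0 - sum(a * b for a, b in zip(d_pred, d_true))
    return total / len(bones)

# Hypothetical two-joint pose: one bone rotated by 90 degrees.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(joint_loss(pred, target), bone_orientation_loss(pred, target, [(0, 1)]))
```

The orientation term ignores bone length, so it penalizes a limb pointing the wrong way even when the joint-position error is small.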

Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models

Mats Faulborn,Indira Sen,Max Pellert,Andreas Spitz,David Garcia

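Task: Measure and analyze political bias in language models.

Motivation: Existing measures of political bias such as the Political Compass Test (PCT) lack scientific validity, and differing prompting techniques lead to inconsistent results.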

Details

Method: Drawing on political science theory and survey design principles, develops a new measure of political bias, tests 11 different open and commercial models, and automatically classifies political stances from 88,110 responses. Result: PCT exaggerates bias in certain models such as GPT3.5; measures of political bias are often unstable but generally more left-leaning for instruction-tuned models. Conclusion: Presents a more scientifically grounded measure of political bias and reveals the political leaning of instruction-tuned models. Abstract: Prompt-based language models like GPT4 and LLaMa have been used for a wide variety of use cases such as simulating agents, searching for information, or for content analysis. For all of these applications and others, political biases in these models can affect their performance. Several researchers have attempted to study political bias in language models using evaluation suites based on surveys, such as the Political Compass Test (PCT), often finding a particular leaning favored by these models. However, there is some variation in the exact prompting techniques, leading to diverging findings, and most research relies on constrained-answer settings to extract model responses. Moreover, the Political Compass Test is not a scientifically valid survey instrument. In this work, we contribute a political bias measure informed by political science theory, building on survey design principles to test a wide variety of input prompts, while taking into account prompt sensitivity. We then prompt 11 different open and commercial models, differentiating between instruction-tuned and non-instruction-tuned models, and automatically classify their political stances from 88,110 responses. Leveraging this dataset, we compute political bias profiles across different prompt variations and find that while PCT exaggerates bias in certain models like GPT3.5, measures of political bias are often unstable, but generally more left-leaning for instruction-tuned models.

Learning to Efficiently Adapt Foundation Models for Self-Supervised Endoscopic 3D Scene Reconstruction from Any Cameras

Beilei Cui,Long Bai,Mobarakol Islam,An Wang,Zhiqi Ma,Yiming Huang,Feng Li,Zhen Chen,Zhongliang Jiang,Nassir Navab,Hongliang Ren

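Task: Propose Endo3DAC, a unified framework for endoscopic scene reconstruction that simultaneously estimates depth maps, relative poses, and camera intrinsic parameters.

Motivation: Because ground truth data is hard to obtain, self-supervised learning is increasingly important for endoscopic depth estimation. Applying foundation models directly to the medical domain often gives suboptimal results, so efficient adaptation strategies are needed.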

Details

Method: Designs an integrated network that freezes the foundation-model backbone and trains only the specially designed GDV-LoRA with separate decoder heads, achieving efficient depth and pose estimation. Result: Extensive experiments on four endoscopic datasets show that Endo3DAC significantly outperforms other state-of-the-art methods while requiring fewer trainable parameters. Conclusion: To the authors' knowledge, Endo3DAC is the first single network that requires only surgical videos to perform both self-supervised depth estimation and scene reconstruction. Abstract: Accurate 3D scene reconstruction is essential for numerous medical tasks. Given the challenges in obtaining ground truth data, there has been an increasing focus on self-supervised learning (SSL) for endoscopic depth estimation as a basis for scene reconstruction. While foundation models have shown remarkable progress in visual tasks, their direct application to the medical domain often leads to suboptimal results. However, the visual features from these models can still enhance endoscopic tasks, emphasizing the need for efficient adaptation strategies, which still lack exploration currently. In this paper, we introduce Endo3DAC, a unified framework for endoscopic scene reconstruction that efficiently adapts foundation models. We design an integrated network capable of simultaneously estimating depth maps, relative poses, and camera intrinsic parameters. By freezing the backbone foundation model and training only the specially designed Gated Dynamic Vector-Based Low-Rank Adaptation (GDV-LoRA) with separate decoder heads, Endo3DAC achieves superior depth and pose estimation while maintaining training efficiency. Additionally, we propose a 3D scene reconstruction pipeline that optimizes depth maps' scales, shifts, and a few parameters based on our integrated network. Extensive experiments across four endoscopic datasets demonstrate that Endo3DAC significantly outperforms other state-of-the-art methods while requiring fewer trainable parameters. To our knowledge, we are the first to utilize a single network that only requires surgical videos to perform both SSL depth estimation and scene reconstruction tasks. The code will be released upon acceptance.

CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models

Hong Yi Lin,Chunhua Liu,Haoyu Gao,Patanamon Thongtanunam,Christoph Treude

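Task: Evaluate large language models on code review tasks, in particular code revision.

Motivation: Existing code generation models perform poorly on real-world software engineering tasks, especially when handling code review comments, which are often implicit, ambiguous, and colloquial and require the model to understand both the code and human intent.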

Details

Method: Introduces a new evaluation benchmark, CodeReviewQA, which decomposes the generation task of code refinement into three essential reasoning steps: change type recognition (CTR), change localisation (CL), and solution identification (SI), each reformulated as multiple-choice questions. Result: A comprehensive evaluation of 72 recently released large language models shows that CodeReviewQA exposes specific model weaknesses in code review comprehension. Conclusion: CodeReviewQA enables fine-grained assessment of model capabilities and mitigates data contamination risk, giving a clearer picture of model performance on code review tasks. Abstract: State-of-the-art large language models (LLMs) have demonstrated impressive code generation capabilities but struggle with real-world software engineering tasks, such as revising source code to address code reviews, hindering their practical use. Code review comments are often implicit, ambiguous, and colloquial, requiring models to grasp both code and human intent. This challenge calls for evaluating large language models' ability to bridge both technical and conversational contexts. While existing work has employed the automated code refinement (ACR) task to resolve these comments, current evaluation methods fall short, relying on text matching metrics that provide limited insight into model failures and remain susceptible to training data contamination. To address these limitations, we introduce a novel evaluation benchmark, CodeReviewQA, that enables us to conduct fine-grained assessment of model capabilities and mitigate data contamination risks. In CodeReviewQA, we decompose the generation task of code refinement into three essential reasoning steps: change type recognition (CTR), change localisation (CL), and solution identification (SI). Each step is reformulated as multiple-choice questions with varied difficulty levels, enabling precise assessment of model capabilities, while mitigating data contamination risks. Our comprehensive evaluation spans 72 recently released large language models on 900 manually curated, high-quality examples across nine programming languages. Our results show that CodeReviewQA is able to expose specific model weaknesses in code review comprehension, disentangled from their generative automated code refinement results.

BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers

Hui Zhang,Tingwei Gao,Jie Shao,Zuxuan Wu

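Task: Propose BlockDance, a training-free method that accelerates Diffusion Transformers (DiTs) by exploiting feature similarity between adjacent time steps.

Motivation: Diffusion Transformers (DiTs) have excellent generative capabilities, but inference is slow because of the iterative denoising process. BlockDance is proposed to address this.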

Details

Method: BlockDance identifies the most structurally similar features (STSS features) and caches and reuses them during the later stages of denoising to reduce redundant computation. It also introduces BlockDance-Ada, a lightweight decision-making network for instance-specific acceleration. Result: BlockDance and BlockDance-Ada are effective across various generation tasks and models, achieving accelerations of 25% to 50% while maintaining generation quality. Conclusion: By reusing structurally similar features and allocating resources dynamically, BlockDance and BlockDance-Ada accelerate Diffusion Transformers while maintaining generation quality. Abstract: Diffusion models have demonstrated impressive generation capabilities, particularly with recent advancements leveraging transformer architectures to improve both visual and artistic quality. However, Diffusion Transformers (DiTs) continue to encounter challenges related to low inference speed, primarily due to the iterative denoising process. To address this issue, we propose BlockDance, a training-free approach that explores feature similarities at adjacent time steps to accelerate DiTs. Unlike previous feature-reuse methods that lack tailored reuse strategies for features at different scales, BlockDance prioritizes the identification of the most structurally similar features, referred to as Structurally Similar Spatio-Temporal (STSS) features. These features are primarily located within the structure-focused blocks of the transformer during the later stages of denoising. BlockDance caches and reuses these highly similar features to mitigate redundant computation, thereby accelerating DiTs while maximizing consistency with the generated results of the original model. Furthermore, considering the diversity of generated content and the varying distributions of redundant features, we introduce BlockDance-Ada, a lightweight decision-making network tailored for instance-specific acceleration. BlockDance-Ada dynamically allocates resources and provides superior content quality. Both BlockDance and BlockDance-Ada have proven effective across various generation tasks and models, achieving accelerations between 25% and 50% while maintaining generation quality.
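
The cache-and-reuse idea can be sketched as a toy denoising loop in which the output of the structure-focused blocks is recomputed only periodically once the later denoising stage begins. The split into `structure_blocks` and `detail_blocks`, the `reuse_start` step, and the fixed `reuse_every` schedule are hypothetical choices for illustration; the abstract does not give the actual block split or reuse schedule.

```python
def denoise_with_reuse(x, structure_blocks, detail_blocks, num_steps,
                       reuse_start, reuse_every=2):
    """Toy denoising loop that reuses cached structure features in late steps."""
    cached = None
    structure_calls = 0
    for step in range(num_steps):
        in_reuse_phase = step >= reuse_start
        refresh = (step - reuse_start) % reuse_every == 0
        # Recompute in the early phase, on refresh steps, or if nothing is cached.
        if cached is None or not in_reuse_phase or refresh:
            cached = structure_blocks(x, step)
            structure_calls += 1
        # The remaining blocks always run, consuming fresh or cached features.
        x = detail_blocks(cached, x, step)
    return x, structure_calls

# Stand-in "blocks": simple arithmetic in place of transformer layers.
out, calls = denoise_with_reuse(
    1.0,
    structure_blocks=lambda x, t: 0.5 * x,
    detail_blocks=lambda feat, x, t: x - 0.1 * feat,
    num_steps=10,
    reuse_start=4,
)
print(calls)  # 7
```

In this toy schedule the structure blocks run 7 times instead of 10; the saving in a real model depends on how much of the network those blocks account for.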

Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation

Andrea Maracani,Savas Ozkan,Sijun Cho,Hyowon Kim,Eunchung Noh,Jeongwon Min,Cho Jung Min,Dookun Park,Mete Ozay

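Task: Analyze the effect of scaling the vision encoder and the text decoder in Scene Text Recognition (STR), and propose a new method to mitigate label noise.

Motivation: Although scaling architectures has proven effective for improving STR, the individual contributions of vision encoder and text decoder scaling remain under-explored. Label noise is also a key challenge in STR, particularly in real-world data.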

Details

Method: Proposes Cloze Self-Distillation (CSD), which mitigates label noise by distilling a student model from context-aware soft predictions and pseudolabels generated by a teacher model. Also introduces differential cross-attention to enhance the decoder architecture. Result: Achieves state-of-the-art performance on 10 of 11 benchmarks while significantly reducing parameter size and computational cost. Conclusion: Decoder scaling yields significant performance gains in STR, CSD effectively mitigates label noise, and differential cross-attention further enhances the decoder architecture. Abstract: Scaling architectures has been proven effective for improving Scene Text Recognition (STR), but the individual contributions of vision encoder and text decoder scaling remain under-explored. In this work, we present an in-depth empirical analysis and demonstrate that, contrary to previous observations, scaling the decoder yields significant performance gains, always exceeding those achieved by encoder scaling alone. We also identify label noise as a key challenge in STR, particularly in real-world data, which can limit the effectiveness of STR models. To address this, we propose Cloze Self-Distillation (CSD), a method that mitigates label noise by distilling a student model from context-aware soft predictions and pseudolabels generated by a teacher model. Additionally, we enhance the decoder architecture by introducing differential cross-attention for STR. Our methodology achieves state-of-the-art performance on 10 out of 11 benchmarks using only real data, while significantly reducing the parameter size and computational costs.

DnLUT: Ultra-Efficient Color Image Denoising via Channel-Aware Lookup Tables

Sidi Yang,Binxiao Huang,Yulun Zhang,Dahai Yu,Yujiu Yang,Ngai Wong

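Task: Propose DnLUT, an ultra-efficient lookup table-based framework for high-quality color image denoising.

Motivation: Deep neural networks have revolutionized image denoising, but deploying them on edge devices is challenging because of their substantial computational and memory requirements.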

Details

Method: DnLUT comprises two complementary components, a Pairwise Channel Mixer (PCM) and an L-shaped convolution design, which are converted into optimized lookup tables for efficient denoising. Result: Compared with DnCNN, DnLUT requires only 500KB of storage and 0.1% of the energy consumption, runs inference 20x faster, and outperforms existing LUT-based methods by over 1 dB in PSNR. Conclusion: DnLUT establishes a new state of the art in resource-efficient color image denoising. Abstract: While deep neural networks have revolutionized image denoising capabilities, their deployment on edge devices remains challenging due to substantial computational and memory requirements. To this end, we present DnLUT, an ultra-efficient lookup table-based framework that achieves high-quality color image denoising with minimal resource consumption. Our key innovation lies in two complementary components: a Pairwise Channel Mixer (PCM) that effectively captures inter-channel correlations and spatial dependencies in parallel, and a novel L-shaped convolution design that maximizes receptive field coverage while minimizing storage overhead. By converting these components into optimized lookup tables post-training, DnLUT achieves remarkable efficiency - requiring only 500KB storage and 0.1% energy consumption compared to its CNN contestant DnCNN, while delivering 20X faster inference. Extensive experiments demonstrate that DnLUT outperforms all existing LUT-based methods by over 1dB in PSNR, establishing a new state-of-the-art in resource-efficient color image denoising. The project is available at https://github.com/Stephen0808/DnLUT.
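
The general idea of converting a trained function into a lookup table after training can be sketched as follows. This is a generic two-input LUT with 4-bit quantization, not DnLUT's Pairwise Channel Mixer or L-shaped convolution; `toy_network`, `build_lut`, and `lut_infer` are hypothetical names, and the averaging function merely stands in for a trained network.

```python
def build_lut(fn, levels=16):
    """Precompute fn for every pair of quantized inputs (levels x levels entries)."""
    return [[fn(a, b) for b in range(levels)] for a in range(levels)]

def lut_infer(lut, a8, b8, bits=4):
    """Look up the output for two 8-bit inputs by keeping their top `bits` bits."""
    shift = 8 - bits
    return lut[a8 >> shift][b8 >> shift]

def toy_network(a, b):
    """Stand-in for a trained network on 4-bit inputs: average, rescaled to 8-bit."""
    return ((a + b) // 2) << 4

lut = build_lut(toy_network)
print(len(lut) * len(lut[0]))    # 256
print(lut_infer(lut, 255, 255))  # 240
```

Table size grows exponentially with the number of inputs (16 levels to the power of the input count), which is why LUT-based designs constrain how many pixels or channels each table sees.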

Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't

Quy-Anh Dang,Chris Ngo

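Task: Study how reinforcement learning (RL) can improve reasoning in small large language models (LLMs).

Motivation: Improving LLM reasoning typically relies on massive computational resources and extensive datasets, which are out of reach in resource-constrained settings.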

Details

Method: Uses the Group Relative Policy Optimization (GRPO) algorithm to train the 1.5-billion-parameter DeepSeek-R1-Distill-Qwen-1.5B model on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Result: With only 7,000 samples and a $42 training cost, AMC23 accuracy rises from 63% to 80% and AIME24 reaches 46.7%, surpassing o1-preview. Conclusion: RL-based fine-tuning offers small LLMs a cost-effective alternative to large-scale approaches and suits resource-constrained environments. Abstract: Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains - e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview - using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs.
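
The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, which removes the need for a learned value function. A minimal sketch of that group-relative advantage follows; the function name, the `eps` stabilizer, and the sample rewards are illustrative assumptions, and the authors' adapted variant may differ in details.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled answers (1 = correct, 0 = incorrect).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

If every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.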

SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer

Hongda Liu,Longguang Wang,Ye Zhang,Ziru Yu,Yulan Guo

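Task: Develop SaMam, a Mamba-based image style transfer framework.

Motivation: Existing style transfer backbones (e.g., CNNs and Transformers) incur high computational complexity to achieve a global receptive field, whereas State Space Models (SSMs), especially Mamba, model long-range dependencies well with linear complexity.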

Details

Method: Designs a Mamba encoder to extract content and style information efficiently and a style-aware Mamba decoder to adapt flexibly to various styles. Also introduces local enhancement and zigzag scan to address local pixel forgetting, channel redundancy, and spatial discontinuity in existing SSMs. Result: Qualitative and quantitative results show that SaMam outperforms state-of-the-art methods in both accuracy and efficiency. Conclusion: SaMam uses Mamba to address the global receptive field problem in style transfer and achieves clear gains in both performance and efficiency. Abstract: Global effective receptive field plays a crucial role for image style transfer (ST) to obtain high-quality stylized results. However, existing ST backbones (e.g., CNNs and Transformers) suffer huge computational complexity to achieve global receptive fields. Recently, the State Space Model (SSM), especially the improved variant Mamba, has shown great potential for long-range dependency modeling with linear complexity, which offers an approach to resolve the above dilemma. In this paper, we develop a Mamba-based style transfer framework, termed SaMam. Specifically, a mamba encoder is designed to efficiently extract content and style information. In addition, a style-aware mamba decoder is developed to flexibly adapt to various styles. Moreover, to address the problems of local pixel forgetting, channel redundancy and spatial discontinuity of existing SSMs, we introduce both local enhancement and zigzag scan. Qualitative and quantitative results demonstrate that our SaMam outperforms state-of-the-art methods in terms of both accuracy and efficiency.

Akhil Perincherry,Jacob Krantz,Stefan Lee

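Task: Study whether visual representations of sub-goals can serve as navigational cues and improve navigation performance.

Motivation: Explore ways to improve Vision-and-Language Navigation (VLN) agents, which navigate unseen environments by following natural language instructions.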

Details

Method: A text-to-image diffusion model generates visual representations of the landmark references contained in the instructions. These representations are provided to VLN agents as an added modality, and an auxiliary loss explicitly encourages relating them to their corresponding referring expressions. Result: Success rate (SR) increases by around 1 point, and success scaled by inverse path length (SPL) by up to 0.5 points. Conclusion: The proposed approach reinforces visual understanding and improves navigation performance compared to relying on language instructions alone. Abstract: Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study if visual representations of sub-goals implied by the instructions can serve as navigational cues and lead to increased navigation performance. To synthesize these visual representations or imaginations, we leverage a text-to-image diffusion model on landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an added modality to act as landmark cues and an auxiliary loss is added to explicitly encourage relating these with their corresponding referring expressions. Our findings reveal an increase in success rate (SR) of around 1 point and up to 0.5 points in success scaled by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding compared to relying on language instructions alone. Code and data for our work can be found at https://www.akhilperincherry.com/VLN-Imagine-website/.

UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation

Yaxiong Chen,Chuang Du,Chunlei Li,Jingliang Hu,Yilei Shi,Shengwu Xiong,Xiao Xiang Zhu,Lichao Mou

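Task: Automated radiology report generation, which aims to expedite the tedious and error-prone reporting process for radiologists.

Motivation: Learning to align medical images and textual findings remains challenging due to the relative scarcity of labeled medical data.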

Details

Method: Proposes transferring representations from CLIP, a large-scale pre-trained vision-language model, to better capture cross-modal semantics between images and texts. Introduces UniCrossAdapter, lightweight adapter modules that are incorporated into CLIP and fine-tuned on the target task while the base parameters stay fixed. Result: Experiments on two public datasets demonstrate the effectiveness of the approach, advancing the state of the art in radiology report generation. Conclusion: The proposed transfer learning framework provides a way to harness semantic knowledge from large-scale pre-trained models for data-scarce medical vision-language tasks. Abstract: Automated radiology report generation aims to expedite the tedious and error-prone reporting process for radiologists. While recent works have made progress, learning to align medical images and textual findings remains challenging due to the relative scarcity of labeled medical data. For example, datasets for this task are much smaller than those used for image captioning in computer vision. In this work, we propose to transfer representations from CLIP, a large-scale pre-trained vision-language model, to better capture cross-modal semantics between images and texts. However, directly applying CLIP is suboptimal due to the domain gap between natural images and radiology. To enable efficient adaptation, we introduce UniCrossAdapter, lightweight adapter modules that are incorporated into CLIP and fine-tuned on the target task while keeping base parameters fixed. The adapters are distributed across modalities and their interaction to enhance vision-language alignment. Experiments on two public datasets demonstrate the effectiveness of our approach, advancing the state of the art in radiology report generation. The proposed transfer learning framework provides a means of harnessing semantic knowledge from large-scale pre-trained models to tackle data-scarce medical vision-language tasks. Code is available at https://github.com/chauncey-tow/MRG-CLIP.

The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination

Yifan Sun,Han Wang,Dongbai Li,Gang Wang,Huan Zhang

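Task: Evaluate the effectiveness of existing benchmark data contamination (BDC) mitigation strategies.

Motivation: Benchmark data contamination (BDC) has raised increasing concerns in large language model (LLM) evaluation, because it falsely inflates performance estimates and undermines evaluation reliability.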

Details

Method: Designs a systematic and controlled pipeline and proposes two new metrics, fidelity and contamination resistance, to provide a fine-grained and comprehensive assessment of existing BDC mitigation strategies. Result: Experiments show that no existing mitigation strategy significantly improves resistance across all benchmarks, and none effectively balances fidelity and contamination resistance. Conclusion: These findings underscore the urgent need for more effective BDC mitigation strategies. Abstract: Benchmark Data Contamination (BDC), the inclusion of benchmark testing samples in the training set, has raised increasing concerns in Large Language Model (LLM) evaluation, leading to falsely inflated performance estimates and undermining evaluation reliability. To address this, researchers have proposed various mitigation strategies to update existing benchmarks, including modifying original questions or generating new ones based on them. However, a rigorous examination of the effectiveness of these mitigation strategies remains lacking. In this paper, we design a systematic and controlled pipeline along with two novel metrics, fidelity and contamination resistance, to provide a fine-grained and comprehensive assessment of existing BDC mitigation strategies. Previous assessment methods, such as accuracy drop and accuracy matching, focus solely on aggregate accuracy, often leading to incomplete or misleading conclusions. Our metrics address this limitation by emphasizing question-level evaluation result matching. Extensive experiments with 10 LLMs, 5 benchmarks, 20 BDC mitigation strategies, and 2 contamination scenarios reveal that no existing strategy significantly improves resistance over the vanilla case (i.e., no benchmark update) across all benchmarks, and none effectively balances fidelity and contamination resistance. These findings underscore the urgent need for designing more effective BDC mitigation strategies. Our code repository is available at https://github.com/ASTRAL-Group/BDC_mitigation_assessment.
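
The abstract's point that aggregate accuracy can mislead while question-level matching does not is easy to show with a toy case. The sketch below is only an illustration of that contrast; it is not the paper's definition of fidelity or contamination resistance, and the function names and data are hypothetical.

```python
def accuracy(results):
    """Fraction of questions answered correctly."""
    return sum(results) / len(results)


def question_level_match(results_a, results_b):
    """Fraction of questions on which two evaluations agree (both right or both wrong)."""
    return sum(a == b for a, b in zip(results_a, results_b)) / len(results_a)


# Per-question correctness of one model on an original benchmark and on a
# hypothetical updated version of the same four questions.
original = [True, True, False, False]
updated = [False, False, True, True]

acc_gap = abs(accuracy(original) - accuracy(updated))
match = question_level_match(original, updated)
```

Here the aggregate accuracies are identical, so an accuracy-matching check would report a perfect update, yet the two runs disagree on every single question.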

Don't Fight Hallucinations, Use Them: Estimating Image Realism using NLI over Atomic Facts

Elisei Rykov,Kseniia Petrushina,Kseniia Titova,Alexander Panchenko,Vasily Konovalov

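Task: Propose a new method for assessing image realism using Large Vision-Language Models (LVLMs) and Natural Language Inference (NLI).

Motivation: Quantifying image realism remains a challenging problem in artificial intelligence; for example, an image of Einstein holding a smartphone violates common sense.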

Details

Method: An LVLM extracts atomic facts from the image, pairwise entailment scores are computed among these facts, and the values are aggregated into a single reality score. Result: Achieves new state-of-the-art zero-shot performance on the WHOOPS! dataset. Conclusion: The method identifies images that violate common sense and advances the assessment of image realism. Abstract: Quantifying the realism of images remains a challenging problem in the field of artificial intelligence. For example, an image of Albert Einstein holding a smartphone violates common sense because modern smartphones were invented after Einstein's death. We introduce a novel method for assessing image realism using Large Vision-Language Models (LVLMs) and Natural Language Inference (NLI). Our approach is based on the premise that LVLMs may generate hallucinations when confronted with images that defy common sense. Using an LVLM to extract atomic facts from these images, we obtain a mix of accurate facts and erroneous hallucinations. We proceed by calculating pairwise entailment scores among these facts, subsequently aggregating these values to yield a singular reality score. This process serves to identify contradictions between genuine facts and hallucinatory elements, signaling the presence of images that violate common sense. Our approach has achieved a new state-of-the-art performance in zero-shot mode on the WHOOPS! dataset.
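
The pipeline of scoring fact pairs and aggregating them can be sketched as follows. The abstract does not say which NLI model or aggregation rule is used, so both are assumptions here: the scorer is passed in as a callable, `toy_nli` is a hand-written stand-in rather than a real model, and taking the minimum over pairs is just one plausible aggregation.

```python
from itertools import permutations


def reality_score(facts, nli_score, aggregate=min):
    """Aggregate pairwise NLI scores over all ordered pairs of atomic facts.

    `nli_score(premise, hypothesis)` is assumed to return a float where
    negative values indicate contradiction and positive values entailment.
    """
    scores = [nli_score(p, h) for p, h in permutations(facts, 2)]
    if not scores:
        raise ValueError("need at least two facts")
    return aggregate(scores)


def toy_nli(premise, hypothesis):
    # Hypothetical stand-in for an NLI model: one hand-picked contradictory pair.
    clash = {"the photo is from the 1940s", "the man holds a smartphone"}
    return -1.0 if {premise, hypothesis} == clash else 0.5


weird = reality_score(
    ["the photo is from the 1940s", "the man holds a smartphone", "the man has white hair"],
    toy_nli,
)
normal = reality_score(["the man has white hair", "the man wears a suit"], toy_nli)
```

With a minimum-based aggregation, a single strong contradiction between two extracted facts is enough to pull the score down, which matches the intuition that one impossible detail makes an image unrealistic.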

Survey on Evaluation of LLM-based Agents

Asaf Yehudai,Lilach Eden,Alan Li,Guy Uziel,Yilun Zhao,Roy Bar-Haim,Arman Cohan,Michal Shmueli-Scheuer

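Task: Provide a comprehensive survey and analysis of evaluation methodologies for LLM-based agents.

Motivation: With the rapid development of LLM-based agents, systematic evaluation methods are needed to measure their capabilities.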

Details

Method: Systematically analyzes evaluation benchmarks and frameworks across four key dimensions: fundamental agent capabilities, application-specific benchmarks, benchmarks for generalist agents, and evaluation frameworks. Result: Reveals emerging trends, including a shift toward more realistic and challenging evaluations, and identifies critical gaps that future research must address. Conclusion: The paper offers a comprehensive survey of the evaluation of LLM-based agents, reveals emerging trends, identifies current limitations, and proposes directions for future research. Abstract: The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, reveals the emerging trends in the field, identifies current limitations, and proposes directions for future research.

CausalCLIPSeg: Unlocking CLIP's Potential in Referring Medical Image Segmentation with Causal Intervention

Yaxiong Chen,Minghong Wei,Zixuan Zheng,Jingliang Hu,Yilei Shi,Shengwu Xiong,Xiao Xiang Zhu,Lichao Mou

Task: Referring medical image segmentation targets delineating lesions indicated by textual descriptions.

Motivation: Aligning visual and textual cues is challenging due to their distinct data properties.

Details

Method: Propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation that leverages CLIP, with a tailored cross-modal decoding method and a causal intervention module. Result: Extensive experiments demonstrate the state-of-the-art performance of the proposed method. Conclusion: CausalCLIPSeg effectively aligns text-to-pixel and mitigates confounding bias, achieving superior segmentation performance. Abstract: Referring medical image segmentation targets delineating lesions indicated by textual descriptions. Aligning visual and textual cues is challenging due to their distinct data properties. Inspired by large-scale pre-trained vision-language models, we propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation that leverages CLIP. Despite not being trained on medical data, we enforce CLIP's rich semantic space onto the medical domain by a tailored cross-modal decoding method to achieve text-to-pixel alignment. Furthermore, to mitigate confounding bias that may cause the model to learn spurious correlations instead of meaningful causal relationships, CausalCLIPSeg introduces a causal intervention module which self-annotates confounders and excavates causal features from inputs for segmentation judgments. We also devise an adversarial min-max game to optimize causal features while penalizing confounding ones. Extensive experiments demonstrate the state-of-the-art performance of our proposed method. Code is available at https://github.com/WUTCM-Lab/CausalCLIPSeg.

Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation

Clive Tinashe Marimo,Benedikt Blumenstiel,Maximilian Nitsche,Johannes Jakubik,Thomas Brunschwiler

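Task: Develop and evaluate Llama3-MS-CLIP, a multispectral vision-language model for Earth observation trained with contrastive learning.

Motivation: Existing vision-language models typically rely only on visual-spectrum data and fail to exploit the multispectral information recorded by satellites.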

Details

Method: Introduces Llama3-MS-CLIP, pre-trained with contrastive learning on a large-scale multispectral dataset, and develops a scalable captioning pipeline. Result: Llama3-MS-CLIP significantly outperforms other RGB-based approaches on multispectral zero-shot image classification and retrieval, improving classification accuracy by 6.77% on average and retrieval performance by 4.63% mAP. Conclusion: Multispectral vision-language learning is valuable; the image-caption dataset, code, and model weights are released. Abstract: Vision-language models for Earth observation (EO) typically rely on the visual spectrum of data as the only model input, thus failing to leverage the rich spectral information available in the multispectral channels recorded by satellites. Therefore, in this paper, we introduce Llama3-MS-CLIP, the first vision-language model pre-trained with contrastive learning on a large-scale multispectral dataset and report on the performance gains due to the extended spectral range. Furthermore, we present the largest-to-date image-caption dataset for multispectral data, consisting of one million Sentinel-2 samples and corresponding textual descriptions generated with Llama3-LLaVA-Next and Overture Maps data. We develop a scalable captioning pipeline, which is validated by domain experts. We evaluate Llama3-MS-CLIP on multispectral zero-shot image classification and retrieval using three datasets of varying complexity. Our results demonstrate that Llama3-MS-CLIP significantly outperforms other RGB-based approaches, improving classification accuracy by 6.77% on average and retrieval performance by 4.63% mAP compared to the second-best model. Our results emphasize the relevance of multispectral vision-language learning. We release the image-caption dataset, code, and model weights under an open-source license.

V-NAW: Video-based Noise-aware Adaptive Weighting for Facial Expression Recognition

JunGyu Lee,Kunyoung Lee,Haesol Park,Ig-Jae Kim,Gi Pyo Nam

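Task: Address label ambiguity and class imbalance in video-based facial expression recognition.

Motivation: Label ambiguity and class imbalance degrade performance, and addressing them can yield meaningful performance improvements.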

Details

Method: Proposes Video-based Noise-aware Adaptive Weighting (V-NAW), which adaptively assigns importance to each frame, and introduces a simple and effective augmentation strategy to reduce redundancy between consecutive frames. Result: Extensive experiments validate the method and show significant improvements in video-based facial expression recognition performance. Conclusion: The proposed method effectively addresses label ambiguity and class imbalance and significantly improves video-based facial expression recognition. Abstract: Facial Expression Recognition (FER) plays a crucial role in human affective analysis and has been widely applied in computer vision tasks such as human-computer interaction and psychological assessment. The 8th Affective Behavior Analysis in-the-Wild (ABAW) Challenge aims to assess human emotions using the video-based Aff-Wild2 dataset. This challenge includes various tasks, including the video-based EXPR recognition track, which is our primary focus. In this paper, we demonstrate that addressing label ambiguity and class imbalance, which are known to cause performance degradation, can lead to meaningful performance improvements. Specifically, we propose Video-based Noise-aware Adaptive Weighting (V-NAW), which adaptively assigns importance to each frame in a clip to address label ambiguity and effectively capture temporal variations in facial expressions. Furthermore, we introduce a simple and effective augmentation strategy to reduce redundancy between consecutive frames, which is a primary cause of overfitting. Through extensive experiments, we validate the effectiveness of our approach, demonstrating significant improvements in video-based FER performance.

STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding

Zichen Liu,Kunlun Xu,Bing Su,Xu Zou,Yuxin Peng,Jiahuan Zhou

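Task: Propose an integrated Spatial-TempOral dynamic Prompting (STOP) model for zero-shot generalization in video tasks.

Motivation: Existing video prompting methods typically rely on a single static prompt and overlook the temporal dynamics and spatial variations within video sequences, which limits the model's ability to capture essential temporal information.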

Details

Method: STOP consists of two complementary modules: intra-frame spatial prompting and inter-frame temporal prompting. Intra-frame spatial prompts adaptively highlight discriminative regions within each frame by leveraging intra-frame attention and temporal variation. Inter-frame temporal prompts are dynamically inserted between frames with high temporal variance. Result: STOP consistently outperforms state-of-the-art methods on various video benchmarks. Conclusion: By integrating spatial and temporal dynamic prompting, STOP significantly improves video understanding. Abstract: Pre-trained on massive numbers of image-text pairs, vision-language models like CLIP have demonstrated promising zero-shot generalization across numerous image-based tasks. However, extending these capabilities to video tasks remains challenging due to limited labeled video data and high training costs. Recent video prompting methods attempt to adapt CLIP for video tasks by introducing learnable prompts, but they typically rely on a single static prompt for all video sequences, overlooking the diverse temporal dynamics and spatial variations that exist across frames. This limitation significantly hinders the model's ability to capture essential temporal information for effective video understanding. To address this, we propose an integrated Spatial-TempOral dynamic Prompting (STOP) model which consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting. Our intra-frame spatial prompts are designed to adaptively highlight discriminative regions within each frame by leveraging intra-frame attention and temporal variation, allowing the model to focus on areas with substantial temporal dynamics and capture fine-grained spatial details. Additionally, to highlight the varying importance of frames for video understanding, we further introduce inter-frame temporal prompts, dynamically inserting prompts between frames with high temporal variance as measured by frame similarity. This enables the model to prioritize key frames and enhances its capacity to understand temporal dependencies across sequences. Extensive experiments on various video benchmarks demonstrate that STOP consistently achieves superior performance against state-of-the-art methods. The code is available at https://github.com/zhoujiahuan1991/CVPR2025-STOP.
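
Choosing where to insert temporal prompts "as measured by frame similarity" can be illustrated with a small selection routine. This is a sketch under assumptions, not the STOP implementation: it treats low cosine similarity between consecutive frame features as high temporal variance, picks the `k` least similar gaps, and assumes every feature vector is non-zero. The function names and toy features are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def high_variance_gaps(frame_feats, k):
    """Return indices i of the k gaps (between frame i and i + 1) with the lowest similarity."""
    sims = [cosine(frame_feats[i], frame_feats[i + 1]) for i in range(len(frame_feats) - 1)]
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    return sorted(order[:k])


# Toy per-frame features: the scene changes abruptly between frames 1 and 2.
feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 1.0]]
gaps = high_variance_gaps(feats, k=1)
```

In this toy clip the only selected gap is the one at the scene change, which is where an inserted prompt would be most informative.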

Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation

Kendong Liu,Zhiyu Zhu,Hui Liu,Junhui Hou

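Task: Accelerate the diffusion process for generating 3D models from single images.

Motivation: Address a key obstacle to high-quality 3D reconstruction with few inference steps, namely regularizing the learning of the score function in states of random noise.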

Details

Method: Proposes edge consistency, i.e., consistent predictions across the high signal-to-noise ratio region, to enhance a pre-trained diffusion model, and refines the endpoint score function through distillation. An adversarial augmentation strategy is also proposed to enrich generation detail and boost overall generation quality. Result: Acc3D achieves over a 20-fold increase in computational efficiency together with notable improvements in generation quality. Conclusion: Through edge consistency and adversarial augmentation, Acc3D significantly improves both the efficiency and the quality of 3D model generation. Abstract: We present Acc3D to tackle the challenge of accelerating the diffusion process to generate 3D models from single images. To derive high-quality reconstructions through few-step inferences, we emphasize the critical issue of regularizing the learning of the score function in states of random noise. To this end, we propose edge consistency, i.e., consistent predictions across the high signal-to-noise ratio region, to enhance a pre-trained diffusion model, enabling a distillation-based refinement of the endpoint score function. Building on those distilled diffusion models, we propose an adversarial augmentation strategy to further enrich the generation detail and boost overall generation quality. The two modules complement each other, mutually reinforcing to elevate generative performance. Extensive experiments demonstrate that our Acc3D not only achieves over a $20\times$ increase in computational efficiency but also yields notable quality improvements, compared to state-of-the-art methods.

A Survey on fMRI-based Brain Decoding for Reconstructing Multimodal Stimuli

Pengyu Liu,Guohua Dong,Dan Guo,Kun Li,Fengling Li,Xun Yang,Meng Wang,Xiaomin Ying

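Task: Survey fMRI-based brain decoding, with a focus on research progress in reconstructing stimuli from passive brain signals.

Motivation: Decoding brain signals to reconstruct stimuli reveals intricate neural mechanisms and drives progress in AI, disease treatment, and brain-computer interfaces.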

Details

Method: Systematically reviews recent progress in fMRI-based brain decoding, summarizes datasets and relevant brain regions, and categorizes existing methods by model structure. Result: Evaluates model performance, discusses effectiveness, and proposes future research directions. Conclusion: The survey offers valuable insights for the field of fMRI-based brain decoding and identifies future research challenges and directions. Abstract: In daily life, we encounter diverse external stimuli, such as images, sounds, and videos. As research in multimodal stimuli and neuroscience advances, fMRI-based brain decoding has become a key tool for understanding brain perception and its complex cognitive processes. Decoding brain signals to reconstruct stimuli not only reveals intricate neural mechanisms but also drives progress in AI, disease treatment, and brain-computer interfaces. Recent advancements in neuroimaging and image generation models have significantly improved fMRI-based decoding. While fMRI offers high spatial resolution for precise brain activity mapping, its low temporal resolution and signal noise pose challenges. Meanwhile, techniques like GANs, VAEs, and Diffusion Models have enhanced reconstructed image quality, and multimodal pre-trained models have boosted cross-modal decoding tasks. This survey systematically reviews recent progress in fMRI-based brain decoding, focusing on stimulus reconstruction from passive brain signals. It summarizes datasets and relevant brain regions, and categorizes existing methods by model structure. Additionally, it evaluates model performance and discusses their effectiveness. Finally, it identifies key challenges and proposes future research directions, offering valuable insights for the field. For more information and resources related to this survey, visit https://github.com/LpyNow/BrainDecodingImage.

Suraj Singh,Anastasia Batsheva,Oleg Y. Rogov,Ahmed Bouridane

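Task: Explore and improve the Deep Image Prior (DIP) model for astrophotography image restoration and super-resolution.

Motivation: Image restoration and super-resolution in astrophotography suffer from limited training data, and existing deep learning methods face overfitting, artifact generation, and instability.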

Details

Method: Improves the DIP model through multi-frame processing, the Back Projection method, the TVNet model, a Markov approach, Monte Carlo estimation, Langevin dynamics, and a variational input technique. Result: Validated on multiple image sets of astronomical and celestial objects, the improved algorithm outperforms the traditional Lucky Imaging technique, the original DIP model, and transformer- and diffusion-based models. Conclusion: The proposed improvements reduce noise learning and loss function fluctuations during training and markedly improve result stability, demonstrating their significance for astrophotography image restoration and super-resolution. Abstract: Contemporary image restoration and super-resolution techniques effectively harness deep neural networks, markedly outperforming traditional methods. However, astrophotography presents unique challenges for deep learning due to limited training data. This work explores hybrid strategies, such as the Deep Image Prior (DIP) model, which facilitates blind training but is susceptible to overfitting, artifact generation, and instability when handling noisy images. We propose enhancements to the DIP model's baseline performance through several advanced techniques. First, we refine the model to process multiple frames concurrently, employing the Back Projection method and the TVNet model. Next, we adopt a Markov approach incorporating Monte Carlo estimation, Langevin dynamics, and a variational input technique to achieve unbiased estimates with minimal variance and counteract overfitting effectively. Collectively, these modifications reduce the likelihood of noise learning and mitigate loss function fluctuations during training, enhancing result stability. We validated our algorithm across multiple image sets of astronomical and celestial objects, achieving performance that not only mitigates limitations of Lucky Imaging, a classical computer vision technique that remains a standard in astronomical image reconstruction, but also surpasses the original DIP model and state-of-the-art transformer- and diffusion-based models, underscoring the significance of our improvements.

Automating 3D Dataset Generation with Neural Radiance Fields

P. Schulz,T. Hempel,A. Al-Hamadi

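Task: Propose a pipeline for automatically generating 3D datasets for arbitrary objects.

Motivation: Training performant detection models requires diverse, precisely annotated, large-scale datasets, which are complex and expensive to create; the few public 3D datasets that exist also cover a limited range of classes.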

Details

Method: Uses the universal 3D representation and rendering capabilities of Radiance Fields to generate high-quality 3D models of arbitrary objects, which then serve as input for a synthetic dataset generator. Result: Experiments show that 3D pose estimation networks trained on the generated datasets perform strongly in typical application scenarios. Conclusion: The proposed pipeline is fast, easy to use, and highly automated, and it generates high-quality 3D datasets effectively. Abstract: 3D detection is a critical task to understand spatial characteristics of the environment and is used in a variety of applications including robotics, augmented reality, and image retrieval. Training performant detection models requires diverse, precisely annotated, and large-scale datasets that involve complex and expensive creation processes. Hence, there are only a few public 3D datasets, and these are additionally limited in their range of classes. In this work, we propose a pipeline for automatic generation of 3D datasets for arbitrary objects. By utilizing the universal 3D representation and rendering capabilities of Radiance Fields, our pipeline generates high-quality 3D models for arbitrary objects. These 3D models serve as input for a synthetic dataset generator. Our pipeline is fast, easy to use and has a high degree of automation. Our experiments demonstrate that 3D pose estimation networks, trained with our generated datasets, achieve strong performance in typical application scenarios.

SenseExpo: Efficient Autonomous Exploration with Prediction Information from Lightweight Neural Networks

Haojia Gao,Haohua Que,Hoiian Au,Weihao Shan,Mingkai Liu,Yusen Qin,Lei Mu,Rong Zhao,Xinghua Yang,Qi Wei,Fei Qiao

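Task: Propose SenseExpo, an efficient autonomous exploration framework based on a lightweight prediction network, to address the limitations of traditional methods in computational overhead and environmental generalization.

Motivation: Traditional methods are limited in computational overhead and environmental generalization, so a more efficient solution is needed.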

Details

Method: Integrates Generative Adversarial Networks (GANs), Transformer, and Fast Fourier Convolution (FFC) into a lightweight prediction model with only 709k parameters. Result: On the KTH dataset, the smallest model outperforms U-net (24.5M) and LaMa (51M), with a PSNR of 9.026 and an SSIM of 0.718, a 38.7% PSNR improvement over the 51M-parameter LaMa model. Cross-domain testing on the HouseExpo dataset shows strong generalization, with an FID score of 161.55 that significantly outperforms comparable methods. Compared to MapEx, SenseExpo reduces exploration time by about 67.9% on the KTH dataset and by about 77.1% on the MRPB 1.0 dataset. Conclusion: Deployed as a plug-and-play ROS node, SenseExpo integrates seamlessly with existing navigation systems and provides an efficient solution for resource-constrained devices. Abstract: This paper proposes SenseExpo, an efficient autonomous exploration framework based on a lightweight prediction network, which addresses the limitations of traditional methods in computational overhead and environmental generalization. By integrating Generative Adversarial Networks (GANs), Transformer, and Fast Fourier Convolution (FFC), we designed a lightweight prediction model with merely 709k parameters. Our smallest model achieves better performance on the KTH dataset than U-net (24.5M) and LaMa (51M), delivering a PSNR of 9.026 and an SSIM of 0.718, particularly representing a 38.7% PSNR improvement over the 51M-parameter LaMa model. Cross-domain testing demonstrates its strong generalization capability, with an FID score of 161.55 on the HouseExpo dataset, significantly outperforming comparable methods. Regarding exploration efficiency, on the KTH dataset, SenseExpo demonstrates approximately a 67.9% reduction in exploration time compared to MapEx. On the MRPB 1.0 dataset, SenseExpo achieves roughly a 77.1% time reduction compared to MapEx. Deployed as a plug-and-play ROS node, the framework seamlessly integrates with existing navigation systems, providing an efficient solution for resource-constrained devices.

GazeSCRNN: Event-based Near-eye Gaze Tracking using a Spiking Neural Network

Stijn Groenen,Marzieh Hassanshahi Varposhti,Mahyar Shahsavari

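Task: Design and evaluate GazeSCRNN, a novel spiking convolutional recurrent neural network for event-based gaze tracking.

Motivation: Exploit the high temporal resolution, energy efficiency, and event-based compatibility of Dynamic Vision Sensor (DVS) cameras to overcome the limitations of traditional gaze-tracking systems in capturing dynamic movements.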

Details

Method: Processes event streams from DVS cameras using Adaptive Leaky-Integrate-and-Fire (ALIF) neurons and an optimized hybrid architecture. Result: Extensive evaluations on the EV-Eye dataset show that the model predicts gaze vectors accurately; the best model achieves a Mean Angle Error (MAE) of 6.034° and a Mean Pupil Error (MPE) of 2.094 mm. Conclusion: The study is the first to demonstrate the feasibility of using spiking neural networks (SNNs) for event-based gaze tracking, and it highlights key challenges and opportunities for further improvement. Abstract: This work introduces GazeSCRNN, a novel spiking convolutional recurrent neural network designed for event-based near-eye gaze tracking. Leveraging the high temporal resolution, energy efficiency, and compatibility of Dynamic Vision Sensor (DVS) cameras with event-based systems, GazeSCRNN uses a spiking neural network (SNN) to address the limitations of traditional gaze-tracking systems in capturing dynamic movements. The proposed model processes event streams from DVS cameras using Adaptive Leaky-Integrate-and-Fire (ALIF) neurons and a hybrid architecture optimized for spatio-temporal data. Extensive evaluations on the EV-Eye dataset demonstrate the model's accuracy in predicting gaze vectors. In addition, we conducted ablation studies to reveal the importance of the ALIF neurons, dynamic event framing, and training techniques, such as Forward-Propagation-Through-Time, in enhancing overall system performance. The most accurate model achieved a Mean Angle Error (MAE) of 6.034° and a Mean Pupil Error (MPE) of 2.094 mm. Consequently, this work is pioneering in demonstrating the feasibility of using SNNs for event-based gaze tracking, while shedding light on critical challenges and opportunities for further improvement.
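
To make the reported Mean Angle Error concrete, the sketch below computes the average angle between predicted and ground-truth gaze vectors. It assumes the usual definition (arccosine of the normalized dot product, averaged over samples); the paper's exact evaluation protocol may differ, and the function names and sample vectors are hypothetical.

```python
import math


def angle_deg(u, v):
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos))


def mean_angle_error(preds, targets):
    """Average angular error over paired predicted and true gaze vectors."""
    return sum(angle_deg(p, t) for p, t in zip(preds, targets)) / len(preds)


preds = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
targets = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
mae = mean_angle_error(preds, targets)  # one perfect prediction, one off by 90 degrees
```

Because the vectors are normalized inside `angle_deg`, the metric depends only on gaze direction and not on vector length.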

Single Image Iterative Subject-driven Generation and Editing

Yair Shpitzer,Gal Chechik,Idan Schwartz

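Task: Propose SISO, a training-free method for personalized image generation and editing.

Motivation: Existing personalization methods for image generation and editing perform poorly when only a few images, or a single image, of the subject are available, and their training is time consuming.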

Details

Method: SISO optimizes a similarity score with the input subject image, iteratively generating images and optimizing the model until a satisfactory level of similarity is reached. Result: On image editing and image generation, SISO significantly outperforms existing methods in image quality, subject fidelity, and background preservation. Conclusion: SISO provides an efficient, training-free method for personalized image generation and editing. Abstract: Personalizing image generation and editing is particularly challenging when we only have a few images of the subject, or even a single image. A common approach to personalization is concept learning, which can integrate the subject into existing models relatively quickly, but produces images whose quality tends to deteriorate quickly when the number of subject images is small. Quality can be improved by pre-training an encoder, but training restricts generation to the training distribution, and is time consuming. It remains a hard open challenge to personalize image generation and editing from a single image without training. Here, we present SISO, a novel, training-free approach based on optimizing a similarity score with an input subject image. More specifically, SISO iteratively generates images and optimizes the model based on loss of similarity with the given subject image until a satisfactory level of similarity is achieved, allowing plug-and-play optimization to any image generator. We evaluated SISO in two tasks, image editing and image generation, using a diverse data set of personal subjects, and demonstrate significant improvements over existing methods in image quality, subject fidelity, and background preservation.

Agentic Keyframe Search for Video Question Answering

Sunqi Fan,Meng-Hao Guo,Shuojin Yang

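Task: Propose Agentic Keyframe Search (AKeyS), an algorithm for identifying keyframes in the video question answering task.

Motivation: Video question answering (VideoQA) demands a thorough understanding of video content and is computationally expensive, which limits its widespread application.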

Details

Method: The video is first segmented and organized as a tree structure. A language agent then estimates heuristics and movement costs while dynamically expanding nodes, and finally decides, based on termination conditions, whether enough keyframes have been collected and provides an answer. Result: Experiments on the EgoSchema and NExT-QA datasets show that AKeyS surpasses all previous methods in keyframe searching efficiency, accurately identifying key information and performing effective visual reasoning with minimal computational overhead. Conclusion: AKeyS represents a significant step toward building intelligent agents for video understanding, and the code is publicly available. Abstract: Video question answering (VideoQA) enables machines to extract and comprehend key information from videos through natural language interaction, which is a critical step towards achieving intelligence. However, the demand for a thorough understanding of videos and high computational costs still limit the widespread applications of VideoQA. To address this, we propose Agentic Keyframe Search (AKeyS), a simple yet powerful algorithm for identifying keyframes in the VideoQA task. It can effectively distinguish key information from redundant, irrelevant content by leveraging modern language agents to direct classical search algorithms. Specifically, we first segment the video and organize it as a tree structure. Then, AKeyS uses a language agent to estimate heuristics and movement costs while dynamically expanding nodes. Finally, the agent determines if sufficient keyframes have been collected based on termination conditions and provides answers. Extensive experiments on the EgoSchema and NExT-QA datasets show that AKeyS outperforms all previous methods with the highest keyframe searching efficiency, which means it can accurately identify key information and conduct effective visual reasoning with minimal computational overhead. For example, on the EgoSchema subset, it achieves 1.8% higher accuracy while processing only 43.5% of the frames compared to VideoTree. We believe that AKeyS represents a significant step towards building intelligent agents for video understanding. The code is publicly available at https://github.com/fansunqi/AKeyS.
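
The combination of a segment tree with heuristic and movement-cost estimates resembles an A*-style best-first search, which can be sketched as below. This is an illustrative sketch under assumptions, not the AKeyS algorithm: segments are halved rather than split by any learned rule, and the `estimate` and `enough` callables are hypothetical stand-ins for the language agent's cost estimates and termination check.

```python
import heapq


def keyframe_search(num_frames, estimate, enough, max_visits=100):
    """Best-first search over a binary tree of frame ranges.

    Each node is a half-open frame range (start, end). `estimate(node)`
    returns (movement_cost, heuristic); `enough(keyframes)` decides when to stop.
    """
    frontier = [(0.0, 0.0, (0, num_frames))]  # entries are (g + h, g, node)
    keyframes, visits = [], 0
    while frontier and visits < max_visits:
        _, g, (start, end) = heapq.heappop(frontier)
        visits += 1
        if end - start == 1:  # a leaf is a single frame: collect it
            keyframes.append(start)
            if enough(keyframes):
                break
            continue
        mid = (start + end) // 2
        for child in ((start, mid), (mid, end)):
            cost, heuristic = estimate(child)
            heapq.heappush(frontier, (g + cost + heuristic, g + cost, child))
    return sorted(keyframes), visits


# Toy stand-in for the agent: segments containing frame 5 look promising.
def toy_estimate(node):
    start, end = node
    return 1.0, (0.0 if start <= 5 < end else 10.0)


found, visits = keyframe_search(8, toy_estimate, enough=lambda ks: len(ks) >= 1)
```

In this toy run the search reaches the relevant frame after visiting four nodes instead of scanning all eight frames, which is the kind of saving a well-informed heuristic is meant to deliver.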

Zhihang Liu,Chen-Wei Xie,Pandeng Li,Liming Zhao,Longxiang Tang,Yun Zheng,Chuanbin Liu,Hongtao Xie

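Task: Propose a conditional token compression strategy for Multi-modal Large Language Models (MLLMs) to reduce the computational overhead caused by video frames.

Motivation: Existing compression strategies (e.g., average pooling) inevitably lose potentially useful information, because visual content does not contribute equally to user instructions.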

Details

Method: Proposes the Hybrid-level Instruction Injection Strategy (HICom), which uses the instruction as a condition to guide compression at both the local and global levels, retaining user-focused information while reducing visual tokens. Result: Experiments show that HICom improves performance by 2.43% on average across three multiple-choice QA benchmarks and saves 78.8% of tokens. Conclusion: HICom significantly improves video understanding while reducing the computational burden. Abstract: Recent Multi-modal Large Language Models (MLLMs) have been challenged by the computational overhead resulting from massive video frames, often alleviated through compression strategies. However, because visual content does not contribute equally to user instructions, existing strategies (e.g., average pooling) inevitably lead to the loss of potentially useful information. To tackle this, we propose the Hybrid-level Instruction Injection Strategy for Conditional Token Compression in MLLMs (HICom), utilizing the instruction as a condition to guide the compression from both local and global levels. This encourages the compression to retain the maximum amount of user-focused information while reducing visual tokens to minimize computational burden. Specifically, the instruction condition is injected into the grouped visual tokens at the local level and the learnable tokens at the global level, and we apply an attention mechanism to complete the conditional compression. From the hybrid-level compression, the instruction-relevant visual parts are highlighted while the temporal-spatial structure is also preserved for easier understanding by LLMs. To further unleash the potential of HICom, we introduce a new conditional pre-training stage with our proposed dataset HICom-248K. Experiments show that our HICom can obtain distinguished video understanding ability with fewer tokens, increasing the performance by 2.43% on average on three multiple-choice QA benchmarks and saving 78.8% of tokens compared with the SOTA method. The code is available at https://github.com/lntzm/HICom.

Closer to Ground Truth: Realistic Shape and Appearance Labeled Data Generation for Unsupervised Underwater Image Segmentation

Andrei Jelea,Ahmed Nabil Belbachir,Marius Leordeanu

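Task: Address fish segmentation in underwater videos, a real-world problem of great practical value in the marine and aquaculture industries.

Motivation: Fish segmentation is challenging because of the difficult filming environment, poor visibility, and limited annotated underwater fish data.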

Details

Method: Proposes a novel two-stage unsupervised segmentation approach that requires no human annotations and combines artificially created and real images. Challenging synthetic training data is generated by placing virtual fish in real underwater habitats after applying shape warping and color histogram matching to the fish. Result: The unsupervised method is validated on the DeepFish dataset, reaching performance close to a fully-supervised SoTA model, and is shown to be effective on the specific case of salmon segmentation, for which the DeepSalmon dataset is introduced. Conclusion: The method not only comes close to the fully-supervised SoTA model but can also boost that model's performance. Abstract: Solving fish segmentation in underwater videos, a real-world problem of great practical value in the marine and aquaculture industries, is a challenging task due to the difficulty of the filming environment, poor visibility and limited existing annotated underwater fish data. In order to overcome these obstacles, we introduce a novel two-stage unsupervised segmentation approach that requires no human annotations and combines artificially created and real images. Our method generates challenging synthetic training data by placing virtual fish in real-world underwater habitats, after performing fish transformations such as Thin Plate Spline shape warping and color Histogram Matching, which realistically integrate synthetic fish into the backgrounds, making the generated images increasingly closer to the real-world data with every stage of our approach. While we validate our unsupervised method on the popular DeepFish dataset, obtaining a performance close to a fully-supervised SoTA model, we further show its effectiveness on the specific case of salmon segmentation in underwater videos, for which we introduce DeepSalmon, the largest dataset of its kind in the literature (30 GB). Moreover, on both datasets we prove the capability of our approach to boost the performance of the fully-supervised SoTA model.
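
For readers unfamiliar with the histogram matching step named in the abstract, the following is a minimal sketch of the classic CDF-based technique on a single 8-bit channel. It is not the authors' pipeline: the function names and the toy pixel values are assumptions, and a color image would presumably be processed one channel at a time.

```python
from bisect import bisect_left

def cdf(values, levels=256):
    """Cumulative distribution of integer intensities in [0, levels)."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    out, running = [], 0
    for count in hist:
        running += count
        out.append(running / len(values))
    return out

def match_histogram(source, reference, levels=256):
    """Remap source intensities so their distribution follows the reference.

    Each source level is sent to the smallest reference level whose
    cumulative probability is at least as large as the source's."""
    src_cdf = cdf(source, levels)
    ref_cdf = cdf(reference, levels)
    mapping = [
        min(bisect_left(ref_cdf, src_cdf[level]), levels - 1)
        for level in range(levels)
    ]
    return [mapping[v] for v in source]

# Toy data: dark synthetic "fish" pixels, brighter "background" pixels.
fish = [10, 10, 20, 30]
background = [100, 110, 120, 130]
matched = match_histogram(fish, background)
```

After matching, the fish pixels occupy the same intensity range as the background, which is the effect that lets a pasted object blend into a real scene.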

Semantic-Guided Global-Local Collaborative Networks for Lightweight Image Super-Resolution

Wanshu Fan,Yue Wang,Cong Wang,Yunzhe Zhang,Wei Wang,Dongsheng Zhou

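Task: Propose a Semantic-Guided Global-Local Collaborative Network (SGGLC-Net) for lightweight single-image super-resolution (SISR).

Motivation: Images captured by visual measurement tools often suffer from blurring and loss of detail, which impairs measurement accuracy.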

Details

Method: Proposes a Semantic Guidance Module and a Global-Local Collaborative Module that use semantic priors and a hybrid attention mechanism to enhance image detail. Result: SGGLC-Net achieves competitive PSNR and SSIM values on multiple benchmark datasets, outperforming existing lightweight super-resolution methods. Conclusion: SGGLC-Net can effectively improve the precision and effectiveness of visual measurement systems. Abstract: Single-Image Super-Resolution (SISR) plays a pivotal role in enhancing the accuracy and reliability of measurement systems, which are integral to various vision-based instrumentation and measurement applications. These systems often require clear and detailed images for precise object detection and recognition. However, images captured by visual measurement tools frequently suffer from degradation, including blurring and loss of detail, which can impede measurement accuracy. As a potential remedy, in this paper we propose a Semantic-Guided Global-Local Collaborative Network (SGGLC-Net) for lightweight SISR. Our SGGLC-Net leverages semantic priors extracted from a pre-trained model to guide the super-resolution process, enhancing image detail quality effectively. Specifically, we propose a Semantic Guidance Module that seamlessly integrates the semantic priors into the super-resolution network, enabling the network to more adeptly capture and utilize semantic priors, thereby enhancing image details. To further explore both local and non-local interactions for improved detail rendition, we propose a Global-Local Collaborative Module, which features three Global and Local Detail Enhancement Modules, as well as a Hybrid Attention Mechanism, working together to efficiently learn more useful features. Our extensive experiments show that SGGLC-Net achieves competitive PSNR and SSIM values across multiple benchmark datasets, demonstrating higher performance with a multi-adds reduction of 12.81G compared to state-of-the-art lightweight super-resolution approaches. These improvements underscore the potential of our approach to enhance the precision and effectiveness of visual measurement systems. Codes are at https://github.com/fanamber831/SGGLC-Net.
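
Since the results are reported in PSNR, a short sketch of how that metric is usually computed may help. The function name and the flat-list image representation are assumptions for illustration; the evaluation protocol of the paper (color space, border cropping) is not stated in the abstract and is not reproduced here.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized
    images given as flat lists of pixel intensities."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

reference = [50.0, 100.0, 150.0, 200.0]
off_by_one = [51.0, 101.0, 151.0, 201.0]  # every pixel differs by 1

score = psnr(reference, off_by_one)
```

Higher values indicate a restored image that is closer to the reference; identical images give an infinite score under this definition.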

Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts

Yike Yuan,Ziyu Wang,Zihao Huang,Defa Zhu,Xun Zhou,Jingyi Yu,Qiyang Min

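Task: Propose Race-DiT, a new Mixture of Experts model for diffusion transformers that assigns experts dynamically through a flexible Expert Race routing strategy.

Motivation: Diffusion models have succeeded in visual generation, and integrating Mixture of Experts methods can further improve model scalability and performance.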

Details

Method: Introduces Race-DiT, which uses the Expert Race routing strategy together with per-layer regularization and a router similarity loss to prevent mode collapse. Result: Extensive experiments on ImageNet validate the method, showing significant performance gains and promising scaling properties. Conclusion: Through dynamic expert assignment and effective regularization, Race-DiT significantly improves the performance and scalability of diffusion transformers. Abstract: Diffusion models have emerged as a mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow layer learning, and router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains and promising scaling properties.
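
One way to read "tokens and experts compete together and select the top candidates" is a single top-k over all token-expert scores, instead of a separate top-k per token. The sketch below contrasts the two under that assumption; the function names and the score matrix are hypothetical, and the actual Expert Race routing may differ in details the abstract does not give.

```python
def race_top_k(scores, k):
    """Pick the k best (token, expert) pairs across the whole score matrix."""
    flat = [(s, t, e) for t, row in enumerate(scores) for e, s in enumerate(row)]
    flat.sort(key=lambda item: item[0], reverse=True)
    return sorted((t, e) for _, t, e in flat[:k])

def per_token_top_k(scores, k):
    """Conventional routing: every token keeps its own k best experts."""
    pairs = []
    for t, row in enumerate(scores):
        best = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
        pairs.extend((t, e) for e in best)
    return sorted(pairs)

# Rows are tokens, columns are experts (toy router scores).
scores = [
    [0.90, 0.80, 0.70],  # a "critical" token with high scores everywhere
    [0.10, 0.20, 0.05],  # a token no expert scores highly
    [0.60, 0.30, 0.10],
]

global_pairs = race_top_k(scores, k=4)
local_pairs = per_token_top_k(scores, k=1)
```

With a global budget the first token receives three experts and the second none, whereas per-token routing gives every token exactly one. That uneven allocation is what "dynamically assign experts to critical tokens" suggests.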

Landmarks Are Alike Yet Distinct: Harnessing Similarity and Individuality for One-Shot Medical Landmark Detection

Xu He,Zhen Huang,Qingsong Yao,Xiaoqian Zhou,S. Kevin Zhou

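Task: Propose single-landmark detection models trained with pseudo-labels and template data, combined with an adapter-based fusion model, to address the "seesaw phenomenon" in multi-landmark detection.

Motivation: Address the "seesaw phenomenon" in multi-landmark detection while reducing memory and computational resource requirements.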

Details

Method: Trains single-landmark detection models using pseudo-labels and template data, and introduces an adapter-based fusion model that combines shared weights with landmark-specific weights. Result: The single-landmark models significantly outperform traditional multi-point joint training models at detecting individual landmarks, and the adapter-based fusion model notably improves resource efficiency. Conclusion: The proposed approach effectively mitigates the seesaw phenomenon in multi-landmark training and achieves a notable improvement in resource efficiency. Abstract: Landmark detection plays a crucial role in medical imaging applications such as disease diagnosis, bone age estimation, and therapy planning. However, training models for detecting multiple landmarks simultaneously often encounters the "seesaw phenomenon", where improvements in detecting certain landmarks lead to declines in detecting others. Yet, training a separate model for each landmark increases memory usage and computational overhead. To address these challenges, we propose a novel approach based on the belief that "landmarks are distinct" by training models with pseudo-labels and template data updated continuously during the training process, where each model is dedicated to detecting a single landmark to achieve high accuracy. Furthermore, grounded in the belief that "landmarks are also alike", we introduce an adapter-based fusion model, combining shared weights with landmark-specific weights, to efficiently share model parameters while allowing flexible adaptation to individual landmarks. This approach not only significantly reduces memory and computational resource requirements but also effectively mitigates the seesaw phenomenon in multi-landmark training. Experimental results on publicly available medical image datasets demonstrate that the single-landmark models significantly outperform traditional multi-point joint training models in detecting individual landmarks. Although our adapter-based fusion model shows slightly lower performance compared to the combined results of all single-landmark models, it still surpasses the current state-of-the-art methods while achieving a notable improvement in resource efficiency.

Qiang Zou,Shuli Cheng,Jiayi Chen

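Task: Propose PromptHash, an adaptive cross-modal hashing framework based on affinity prompt-aware collaborative learning, to address the limitations of existing methods in semantic preservation, contextual integrity, and information redundancy.

Motivation: Existing cross-modal hashing methods have significant limitations in semantic preservation, contextual integrity, and information redundancy, which constrain retrieval efficacy.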

Details

Method: Proposes an end-to-end framework comprising a text affinity prompt learning mechanism, an adaptive gated selection fusion architecture, and a prompt affinity alignment strategy. Result: Comprehensive evaluation on three benchmark multi-label datasets shows substantial improvements over existing approaches; on the NUS-WIDE dataset in particular, image-to-text and text-to-image retrieval improve by 18.22% and 18.65%, respectively. Conclusion: Through affinity prompt-aware collaborative learning, PromptHash establishes a new paradigm for cross-modal semantic consistency and significantly improves cross-modal retrieval. Abstract: Cross-modal hashing is a promising approach for efficient data retrieval and storage optimization. However, contemporary methods exhibit significant limitations in semantic preservation, contextual integrity, and information redundancy, which constrain retrieval efficacy. We present PromptHash, an innovative framework leveraging affinity prompt-aware collaborative learning for adaptive cross-modal hashing. We propose an end-to-end framework for affinity-prompted collaborative hashing, with the following fundamental technical contributions: (i) a text affinity prompt learning mechanism that preserves contextual information while maintaining parameter efficiency, (ii) an adaptive gated selection fusion architecture that synthesizes a State Space Model with a Transformer network for precise cross-modal feature integration, and (iii) a prompt affinity alignment strategy that bridges modal heterogeneity through hierarchical contrastive learning. To the best of our knowledge, this study presents the first investigation into affinity prompt awareness within collaborative cross-modal adaptive hash learning, establishing a paradigm for enhanced semantic consistency across modalities. Through comprehensive evaluation on three benchmark multi-label datasets, PromptHash demonstrates substantial performance improvements over existing approaches. Notably, on the NUS-WIDE dataset, our method achieves significant gains of 18.22% and 18.65% in image-to-text and text-to-image retrieval tasks, respectively. The code is publicly available at https://github.com/ShiShuMo/PromptHash.

Shining Yourself: High-Fidelity Ornaments Virtual Try-on with Diffusion Model

Yingmao Miao,Zhanpeng Huang,Rui Han,Zibin Wang,Chenhao Lin,Chao Shen

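Task: Propose a virtual try-on method for ornaments that improves the preservation of geometry and appearance.

Motivation: Because of the intricate tiny patterns and repeated geometric sub-structures in ornaments, it is very difficult to guarantee identity and appearance consistency under large pose and scale variations.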

Details

Method: Improves the alignment between ornaments and models by estimating an accurate wearing mask, refined iteratively alongside the denoising process. In addition, attention layers are regularized to implicitly map the reference ornament mask to the wearing mask. Result: Experiments show that the method successfully transfers ornaments from reference images onto target models, handling substantial differences in scale and pose while preserving identity and achieving realistic visual effects. Conclusion: The method effectively improves geometric and appearance preservation in ornament virtual try-on and produces realistic results despite substantial differences in scale and pose. Abstract: While virtual try-on for clothes and shoes with diffusion models has gained traction, virtual try-on for ornaments, such as bracelets, rings, earrings, and necklaces, remains largely unexplored. Due to the intricate tiny patterns and repeated geometric sub-structures in most ornaments, it is much more difficult to guarantee identity and appearance consistency under large pose and scale variances between ornaments and models. This paper proposes the task of virtual try-on for ornaments and presents a method to improve the geometric and appearance preservation of ornament virtual try-ons. Specifically, we estimate an accurate wearing mask to improve the alignments between ornaments and models in an iterative scheme alongside the denoising process. To preserve structure details, we further regularize attention layers to map the reference ornament mask to the wearing mask in an implicit way. Experimental results demonstrate that our method successfully transfers ornaments from reference images onto target models, handling substantial differences in scale and pose while preserving identity and achieving realistic visual effects.

Bokehlicious: Photorealistic Bokeh Rendering with Controllable Apertures

Tim Seizinger,Florin-Alexandru Vasluianu,Marcos V. Conde,Radu Timofte

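Task: Propose Bokehlicious, an efficient network that provides intuitive control over Bokeh strength through an Aperture-Aware Attention mechanism, and introduce the RealBokeh dataset to address the lack of high-quality real-world data.

Motivation: Existing Bokeh rendering methods require additional inputs and reproduce Bokeh unrealistically because they rely on synthetic data.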

Details

Method: Proposes the Bokehlicious network, which uses an Aperture-Aware Attention mechanism to mimic the physical lens aperture, and introduces the RealBokeh dataset. Result: Bokehlicious performs strongly on RealBokeh and on established Bokeh rendering benchmarks, significantly reducing computational cost and showing strong zero-shot generalization. Conclusion: Bokehlicious and the RealBokeh dataset perform well on both Bokeh rendering and defocus deblurring, suggesting broad applicability. Abstract: Bokeh rendering methods play a key role in creating the visually appealing, softly blurred backgrounds seen in professional photography. While recent learning-based approaches show promising results, generating realistic Bokeh with variable strength remains challenging. Existing methods require additional inputs and suffer from unrealistic Bokeh reproduction due to reliance on synthetic data. In this work, we propose Bokehlicious, a highly efficient network that provides intuitive control over Bokeh strength through an Aperture-Aware Attention mechanism, mimicking the physical lens aperture. To further address the lack of high-quality real-world data, we present RealBokeh, a novel dataset featuring 23,000 high-resolution (24-MP) images captured by professional photographers, covering diverse scenes with varied aperture and focal length settings. Evaluations on both our new RealBokeh and established Bokeh rendering benchmarks show that Bokehlicious consistently outperforms SOTA methods while significantly reducing computational cost and exhibiting strong zero-shot generalization. Our method and dataset further extend to defocus deblurring, achieving competitive results on the RealDOF benchmark. Our code and data can be found at https://github.com/TimSeizinger/Bokehlicious

PoseTraj: Pose-Aware Trajectory Control in Video Diffusion

Longbin Ji,Lei Zhong,Pengfei Wei,Changjian Li

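Task: Generate 3D-aligned motion videos with 6D pose changes.

Motivation: Existing models struggle to generate object motion under wide-range rotations, mainly because of limited 3D understanding.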

Details

Method: Proposes PoseTraj, which adopts a two-stage pose-aware pretraining framework and uses the large-scale synthetic dataset PoseTraj-10K, with 3D bounding boxes as intermediate supervision signals, to strengthen the model's perception of object pose changes. Result: Experiments on various benchmark datasets show that the method outperforms existing baselines in 3D pose-aligned dragging, trajectory accuracy, and video quality. Conclusion: PoseTraj performs strongly at generating 3D-aligned motion videos, particularly for dragging along rotational trajectories. Abstract: Trajectory-guided video generation has made notable progress recently. However, existing models still face challenges in generating object motions with potentially changing 6D poses under wide-range rotations, due to limited 3D understanding. To address this problem, we introduce PoseTraj, a pose-aware video dragging model for generating 3D-aligned motion from 2D trajectories. Our method adopts a novel two-stage pose-aware pretraining framework, improving 3D understanding across diverse trajectories. Specifically, we propose a large-scale synthetic dataset PoseTraj-10K, containing 10k videos of objects following rotational trajectories, and enhance the model's perception of object pose changes by incorporating 3D bounding boxes as intermediate supervision signals. Following this, we fine-tune the trajectory-controlling module on real-world videos, applying an additional camera-disentanglement module to further refine motion accuracy. Experiments on various benchmark datasets demonstrate that our method not only excels in 3D pose-aligned dragging for rotational trajectories but also outperforms existing baselines in trajectory accuracy and video quality.

Disentangled and Interpretable Multimodal Attention Fusion for Cancer Survival Prediction

Aniek Eijpe,Soufyan Lakbir,Melis Erdal Cesur,Sara P. Oliveira,Sanne Abeln,Wilson Silva

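Task: Improve cancer survival prediction using whole-slide images and transcriptomics data.

Motivation: Multimodal frameworks often entangle modality-shared and modality-specific information, limiting interpretability and potentially suppressing discriminative features.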

Details

Method: Proposes the Disentangled and Interpretable Multimodal Attention Fusion (DIMAF) framework, which separates intra- and inter-modal interactions within an attention-based fusion mechanism to learn distinct modality-specific and modality-shared representations. Result: Evaluated on four public cancer survival datasets, DIMAF achieves a relative average improvement of 1.85% in performance and 23.7% in disentanglement. Conclusion: Beyond improved performance, DIMAF enables deeper exploration of the interactions between and within modalities in cancer biology. Abstract: To improve the prediction of cancer survival using whole-slide images and transcriptomics data, it is crucial to capture both modality-shared and modality-specific information. However, multimodal frameworks often entangle these representations, limiting interpretability and potentially suppressing discriminative features. To address this, we propose Disentangled and Interpretable Multimodal Attention Fusion (DIMAF), a multimodal framework that separates the intra- and inter-modal interactions within an attention-based fusion mechanism to learn distinct modality-specific and modality-shared representations. We introduce a loss based on Distance Correlation to promote disentanglement between these representations and integrate Shapley additive explanations to assess their relative contributions to survival prediction. We evaluate DIMAF on four public cancer survival datasets, achieving a relative average improvement of 1.85% in performance and 23.7% in disentanglement compared to current state-of-the-art multimodal models. Beyond improved performance, our interpretable framework enables a deeper exploration of the underlying interactions between and within modalities in cancer biology.
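
The abstract mentions a loss based on Distance Correlation. As background, here is a minimal sketch of the standard sample distance correlation for two one-dimensional samples. How DIMAF turns this quantity into a training loss is not described in the abstract, so the function names and the toy data below are illustrative assumptions only.

```python
import math

def centered_distances(xs):
    """Pairwise |x_i - x_j| matrix, double-centered by row, column,
    and grand means (row and column means coincide by symmetry)."""
    n = len(xs)
    d = [[abs(a - b) for b in xs] for a in xs]
    row = [sum(r) / n for r in d]
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def mean_product(p, q):
    """Mean of the element-wise product of two square matrices."""
    n = len(p)
    return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

def distance_correlation(xs, ys):
    """Sample distance correlation in [0, 1]; 0.0 if either sample is constant."""
    a, b = centered_distances(xs), centered_distances(ys)
    dcov2 = mean_product(a, b)
    dvar_x, dvar_y = mean_product(a, a), mean_product(b, b)
    if dvar_x * dvar_y == 0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))

linear = distance_correlation([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
# y = x^2 on symmetric x: Pearson correlation is 0, distance correlation is not.
nonlinear = distance_correlation([-2.0, -1.0, 0.0, 1.0, 2.0], [4.0, 1.0, 0.0, 1.0, 4.0])
```

Unlike Pearson correlation, this measure also reacts to nonlinear dependence, which is presumably why it suits a disentanglement objective: driving it toward zero discourages any statistical dependence between two representations, not just linear dependence.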

Hyperspectral Imaging for Identifying Foreign Objects on Pork Belly

Gabriela Ghimpeteanu,Hayat Rajani,Josep Quintana,Rafael Garcia

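Task: Detect foreign objects on pork belly using hyperspectral imaging.

Motivation: Ensure food safety and quality by detecting contaminants that traditional visual inspection methods miss.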

Details

Method: Combines pre-processing techniques with a segmentation approach based on a lightweight Vision Transformer (ViT) to distinguish contaminants from meat, fat, and conveyor belt materials. Result: Experimental results show that hyperspectral imaging is effective at enhancing food safety and has potential for broad real-time applications. Conclusion: Hyperspectral imaging has broad prospects in automated quality control, effectively detecting contaminants and improving food safety. Abstract: Ensuring food safety and quality is critical in the food processing industry, where the detection of contaminants remains a persistent challenge. This study presents an automated solution for detecting foreign objects on pork belly meat using hyperspectral imaging (HSI). A hyperspectral camera was used to capture data across various bands in the near-infrared (NIR) spectrum (900-1700 nm), enabling accurate identification of contaminants that are often undetectable through traditional visual inspection methods. The proposed solution combines pre-processing techniques with a segmentation approach based on a lightweight Vision Transformer (ViT) to distinguish contaminants from meat, fat, and conveyor belt materials. The adopted strategy demonstrates high detection accuracy and training efficiency, while also addressing key industrial challenges such as inherent noise, temperature variations, and spectral similarity between contaminants and pork belly. Experimental results validate the effectiveness of hyperspectral imaging in enhancing food safety, highlighting its potential for broad real-time applications in automated quality control processes.

MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures

Lucas Morin,Valéry Weber,Ahmed Nassar,Gerhard Ingmar Meijer,Luc Van Gool,Yawei Li,Peter Staar

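Task: Automatically recognize Markush structures in patent documents.

Motivation: Automated analysis of chemical literature can accelerate discovery in fields such as materials science and drug development; in particular, the ability to search for chemical structures and Markush structures in patent documents is valuable.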

Details

Method: Proposes MarkushGrapher, a multi-modal approach that jointly encodes text, image, and layout information through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder, and generates a sequential graph representation of the Markush structure along with a table defining its variable groups. Result: The method outperforms state-of-the-art chemistry-specific and general-purpose vision-language models in most evaluation settings. Conclusion: MarkushGrapher performs strongly at recognizing Markush structures; code, models, and datasets will be made public. Abstract: The automated analysis of chemical literature holds promise to accelerate discovery in fields such as material science and drug development. In particular, search capabilities for chemical structures and Markush structures (chemical structure templates) within patent documents are valuable, e.g., for prior-art search. Advancements have been made in the automatic extraction of chemical structures from text and images, yet the Markush structures remain largely unexplored due to their complex multi-modal nature. In this work, we present MarkushGrapher, a multi-modal approach for recognizing Markush structures in documents. Our method jointly encodes text, image, and layout information through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. These representations are merged and used to auto-regressively generate a sequential graph representation of the Markush structure along with a table defining its variable groups. To overcome the lack of real-world training data, we propose a synthetic data generation pipeline that produces a wide range of realistic Markush structures. Additionally, we present M2S, the first annotated benchmark of real-world Markush structures, to advance research on this challenging task. Extensive experiments demonstrate that our approach outperforms state-of-the-art chemistry-specific and general-purpose vision-language models in most evaluation settings. Code, models, and datasets will be available.

OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP

Mohamad Hassan N C,Divyam Gupta,Mainak Singha,Sai Bhargav Rongali,Ankit Jha,Muhammad Haris Khan,Biplab Banerjee

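Task: Propose Low-Shot Open-Set Domain Generalization (LSOSDG), a new paradigm that combines low-shot learning with open-set domain generalization (ODG).

Motivation: Existing prompt-based methods perform poorly in low-data regimes and lack precision in detecting open-set samples with fine-grained semantics related to the training classes.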

Details

Method: Proposes OSLOPROMPT, an advanced prompt-learning framework with two core innovations: a domain-agnostic prompt-learning mechanism, and specialized prompts trained with systematically synthesized pseudo-open samples. Result: Across five benchmarks, OSLOPROMPT significantly outperforms existing methods and establishes a new state of the art. Conclusion: OSLOPROMPT performs strongly in low-shot open-set domain generalization and significantly improves the precision of open-sample detection. Abstract: We introduce Low-Shot Open-Set Domain Generalization (LSOSDG), a novel paradigm unifying low-shot learning with open-set domain generalization (ODG). While prompt-based methods using models like CLIP have advanced DG, they falter in low-data regimes (e.g., 1-shot) and lack precision in detecting open-set samples with fine-grained semantics related to training classes. To address these challenges, we propose OSLOPROMPT, an advanced prompt-learning framework for CLIP with two core innovations. First, to manage limited supervision across source domains and improve DG, we introduce a domain-agnostic prompt-learning mechanism that integrates adaptable domain-specific cues and visually guided semantic attributes through a novel cross-attention module, supported by learnable domain- and class-generic visual prompts to enhance cross-modal adaptability. Second, to improve outlier rejection during inference, we classify unfamiliar samples as "unknown" and train specialized prompts with systematically synthesized pseudo-open samples that maintain fine-grained relationships to known classes, generated through a targeted query strategy with off-the-shelf foundation models. This strategy enhances feature learning, enabling our model to detect open samples with varied granularity more effectively. Extensive evaluations across five benchmarks demonstrate that OSLOPROMPT establishes a new state-of-the-art in LSOSDG, significantly outperforming existing methods.

Probabilistic Prompt Distribution Learning for Animal Pose Estimation

Jiyong Rao,Brian Nlong Zhao,Yu Wang

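Task: Address cross-species generalization in multi-species animal pose estimation through efficient prompt learning.

Motivation: Multi-species animal pose estimation is hindered by visual diversity and uncertainty, and requires solving the cross-species generalization problem.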

Details

Method: Proposes a novel probabilistic prompting approach that overcomes large data variance under unbalanced data distributions through prompt design, probabilistic prompt modeling, and cross-modal adaptation. Result: On multi-species animal pose benchmarks, the method achieves state-of-the-art performance under both supervised and zero-shot settings. Conclusion: Through probabilistic prompting and cross-modal fusion strategies, the method effectively addresses cross-species generalization in multi-species animal pose estimation. Abstract: Multi-species animal pose estimation has emerged as a challenging yet critical task, hindered by substantial visual diversity and uncertainty. This paper tackles the problem through efficient prompt learning for Vision-Language Pretrained (VLP) models, e.g., CLIP, aiming to resolve the cross-species generalization problem. The core of the solution lies in prompt design, probabilistic prompt modeling, and cross-modal adaptation, which enable prompts to compensate for cross-modal information and effectively overcome large data variances under unbalanced data distribution. To this end, we propose a novel probabilistic prompting approach to fully explore textual descriptions, which could alleviate the diversity issues caused by the long-tail property and increase the adaptability of prompts to unseen category instances. Specifically, we first introduce a set of learnable prompts and propose a diversity loss to maintain distinctiveness among prompts, thus representing diverse image attributes. Diverse textual probabilistic representations are sampled and used as the guidance for the pose estimation. Subsequently, we explore three different cross-modal fusion strategies at the spatial level to alleviate the adverse impacts of visual uncertainty. Extensive experiments on multi-species animal pose benchmarks show that our method achieves state-of-the-art performance under both supervised and zero-shot settings. The code is available at https://github.com/Raojiyong/PPAP.

Uncertainty Meets Diversity: A Comprehensive Active Learning Framework for Indoor 3D Object Detection

Jiangyi Wang,Na Zhao

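Task: Study active learning for indoor 3D object detection.

Motivation: Indoor 3D datasets face challenges including few training samples, many classes, class imbalance, diverse scene types, and large intra-class variance, and active learning in indoor environments remains unexplored.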

Details

Method: Proposes a new framework that combines uncertainty and diversity criteria, using a Class-aware Adaptive Prototype (CAP) bank to dynamically allocate representative prototypes and select the most ambiguous and informative unlabeled samples for annotation. Result: On SUN RGB-D and ScanNetV2, the method significantly outperforms baselines, reaching over 85% of fully-supervised performance with only 10% of the annotation budget. Conclusion: This is the first study to apply active learning to indoor 3D object detection; the proposed method maintains high performance while reducing the annotation burden. Abstract: Active learning has emerged as a promising approach to reduce the substantial annotation burden in 3D object detection tasks, spurring several initiatives in outdoor environments. However, its application in indoor environments remains unexplored. Compared to outdoor 3D datasets, indoor datasets face significant challenges, including fewer training samples per class, a greater number of classes, more severe class imbalance, and more diverse scene types and intra-class variances. This paper presents the first study on active learning for indoor 3D object detection, where we propose a novel framework tailored for this task. Our method incorporates two key criteria - uncertainty and diversity - to actively select the most ambiguous and informative unlabeled samples for annotation. The uncertainty criterion accounts for both inaccurate detections and undetected objects, ensuring that the most ambiguous samples are prioritized. Meanwhile, the diversity criterion is formulated as a joint optimization problem that maximizes the diversity of both object class distributions and scene types, using a new Class-aware Adaptive Prototype (CAP) bank. The CAP bank dynamically allocates representative prototypes to each class, helping to capture varying intra-class diversity across different categories. We evaluate our method on SUN RGB-D and ScanNetV2, where it outperforms baselines by a significant margin, achieving over 85% of fully-supervised performance with just 10% of the annotation budget.

Coupling deep and handcrafted features to assess smile genuineness

Benedykt Pawlus,Bogdan Smolka,Jolanta Kawulok,Michal Kawulok

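Task: Assess smile genuineness in video sequences.

Motivation: Recognizing facial expressions and linking them to underlying emotional states is an important topic.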

Details

Method: Combines features learned by a long short-term memory network with handcrafted features designed to capture the dynamics of facial action units. Result: Experiments show that the proposed solution is more effective than the baseline techniques and can assess smile genuineness from video sequences in real time. Conclusion: Combining deep-learned and handcrafted features is more effective for assessing smile genuineness. Abstract: Assessing smile genuineness from video sequences is a vital topic concerned with recognizing facial expressions and linking them with the underlying emotional states. A number of techniques underpinned by handcrafted features have been proposed, as well as techniques that rely on deep learning to learn useful features. As both of these approaches have certain benefits and limitations, in this work we propose to combine the features learned by a long short-term memory network with the features handcrafted to capture the dynamics of facial action units. The results of our experiments indicate that the proposed solution is more effective than the baseline techniques and that it allows for assessing smile genuineness from video sequences in real time.

Binarized Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing

Shiyang Zhou,Haijin Zeng,Yunfan Lu,Tong Shao,Ke Tang,Yongyong Chen,Jie Liu,Jingyong Su

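Task: Propose a lightweight Mamba-based binary neural network for efficient, high-performing demosaicing of HybridEVS RAW images.

Motivation: Existing learning-based methods achieve good results, but their complexity severely limits practical deployment on mobile devices.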

Details

Method: Proposes a hybrid Binarized Mamba-Transformer architecture (BMTNet) that combines the strengths of the Mamba and Swin Transformer architectures, and introduces a binarized Mamba (Bi-Mamba) to significantly reduce computational complexity. Result: Quantitative and qualitative experiments demonstrate BMTNet's effectiveness in both performance and computational efficiency, providing a lightweight demosaicing solution suited to real-world edge devices. Conclusion: BMTNet significantly reduces computational complexity while maintaining high performance, making it suitable for real-world edge devices. Abstract: Quad Bayer demosaicing is the central challenge for enabling the widespread application of Hybrid Event-based Vision Sensors (HybridEVS). Although existing learning-based methods that leverage long-range dependency modeling have achieved promising results, their complexity severely limits deployment on mobile devices for real-world applications. To address these limitations, we propose a lightweight Mamba-based binary neural network designed for efficient and high-performing demosaicing of HybridEVS RAW images. First, to effectively capture both global and local dependencies, we introduce a hybrid Binarized Mamba-Transformer architecture that combines the strengths of the Mamba and Swin Transformer architectures. Next, to significantly reduce computational complexity, we propose a binarized Mamba (Bi-Mamba), which binarizes all projections while retaining the core Selective Scan in full precision. Bi-Mamba also incorporates additional global visual information to enhance global context and mitigate precision loss. We conduct quantitative and qualitative experiments to demonstrate the effectiveness of BMTNet in both performance and computational efficiency, providing a lightweight demosaicing solution suited for real-world edge devices. Our codes and models are available at https://github.com/Clausy9/BMTNet.
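
To make "binarizes all projections" concrete, the sketch below shows one common weight binarization scheme: each weight is replaced by its sign, and a single scaling factor equal to the mean absolute weight is kept. This particular scheme is an assumption borrowed from the general binary-network literature; the abstract does not say which binarization BMTNet uses, and the function names are hypothetical.

```python
def binarize(weights):
    """Return (alpha, signs): signs are +1/-1 per weight and alpha is the
    mean absolute weight, so alpha * signs approximates the original."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1.0 if w >= 0 else -1.0 for w in weights]
    return alpha, signs

def binary_projection(x, weights):
    """Dot product of x with the binarized weights. The inner sum needs
    only additions and subtractions; alpha is applied once at the end."""
    alpha, signs = binarize(weights)
    return alpha * sum(s * v for s, v in zip(signs, x))

weights = [0.5, -1.5, 1.0]
x = [2.0, 1.0, 3.0]

full_precision = sum(w * v for w, v in zip(weights, x))
binary = binary_projection(x, weights)
```

The binarized result only approximates the full-precision one. That trade of accuracy for cheaper arithmetic is why the abstract stresses keeping the core Selective Scan in full precision and adding global context to mitigate precision loss.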

FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing

Tianyi Wei,Yifan Zhou,Dongdong Chen,Xingang Pan

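Task: Analyze the role of RoPE in MMDiT models and propose a training-free, task-specific image editing framework.

Motivation: Investigate how self-attention layers rely on positional embedding versus query-key similarity during generation, and reveal the specific role of RoPE in MMDiT.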

Details

Method: Introduces an automated probing strategy that disentangles positional information from content dependencies by strategically manipulating RoPE, and designs a training-free, task-specific image editing framework. Result: Reveals distinct dependency patterns of RoPE in MMDiT, identifies three types of editing tasks, and designs a corresponding key-value injection strategy for each. Conclusion: The proposed method outperforms existing approaches in preserving original semantic content and achieving seamless modifications. Abstract: The integration of Rotary Position Embedding (RoPE) in Multimodal Diffusion Transformer (MMDiT) has significantly enhanced text-to-image generation quality. However, the fundamental reliance of self-attention layers on positional embedding versus query-key similarity during generation remains an intriguing question. We present the first mechanistic analysis of RoPE-based MMDiT models (e.g., FLUX), introducing an automated probing strategy that disentangles positional information from content dependencies by strategically manipulating RoPE during generation. Our analysis reveals distinct dependency patterns that do not straightforwardly correlate with depth, offering new insights into the layer-specific roles in RoPE-based MMDiT. Based on these findings, we propose a training-free, task-specific image editing framework that categorizes editing tasks into three types: position-dependent editing (e.g., object addition), content similarity-dependent editing (e.g., non-rigid editing), and region-preserved editing (e.g., background replacement). For each type, we design tailored key-value injection strategies based on the characteristics of the editing task. Extensive qualitative and quantitative evaluations demonstrate that our method outperforms state-of-the-art approaches, particularly in preserving original semantic content and achieving seamless modifications.
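
The distinction between positional and content dependence is easier to follow with the basic RoPE mechanics in view. The sketch below applies standard one-dimensional rotary embedding to a query and a key and shows two properties: the attention score depends only on the relative offset between positions, and placing both vectors at position 0 reduces the score to plain content similarity. This is background on RoPE in general, not the probing strategy of the paper; the function names and vectors are illustrative assumptions, and image models such as FLUX use multi-axis variants that are not shown.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles
    (standard 1-D rotary position embedding; len(vec) must be even)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same relative offset (3) at different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 13), rope(k, 10))

# Removing positional information leaves pure query-key similarity.
content_only = dot(rope(q, 0), rope(k, 0))
```

Comparing a layer's behavior with and without the rotation applied is one simple way to ask whether it leans on position or on content, which is the question the abstract poses.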

Iterative Optimal Attention and Local Model for Single Image Rain Streak Removal

Xiangyu Li,Wanshu Fan,Yue Shen,Cong Wang,Wei Wang,Xin Yang,Qiang Zhang,Dongsheng Zhou

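Task: Propose an Expectation Maximization Reconstruction Transformer (EMResformer) for single image rain streak removal.

Motivation: High-fidelity imaging is crucial for the safety supervision and intelligent deployment of vision-based measurement systems (VBMS), but adverse weather, particularly rain, can significantly degrade imaging quality, causing blurred images and reduced contrast and thereby increasing the risk of inaccurate evaluations and misinterpretations in VBMS.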

Details

Method: Proposes an Expectation Maximization Reconstruction Transformer (EMResformer) that retains the key self-attention values for feature aggregation and enhances local features to produce better image reconstruction. It integrates an Expectation Maximization Block and a Local Model Residual Block. Result: Extensive experiments on synthetic and real-world datasets show that EMResformer surpasses current state-of-the-art methods for single image rain streak removal and achieves a better balance between model complexity and deraining performance. Conclusion: EMResformer significantly improves the accuracy and reliability of VBMS tasks that depend on high-fidelity imaging. Abstract: High-fidelity imaging is crucial for the successful safety supervision and intelligent deployment of vision-based measurement systems (VBMS). It ensures high-quality imaging in VBMS, which is fundamental for reliable visual measurement and analysis. However, imaging quality can be significantly impaired by adverse weather conditions, particularly rain, leading to blurred images and reduced contrast. Such impairments increase the risk of inaccurate evaluations and misinterpretations in VBMS. To address these limitations, we propose an Expectation Maximization Reconstruction Transformer (EMResformer) for single image rain streak removal. The EMResformer retains the key self-attention values for feature aggregation, enhancing local features to produce superior image reconstruction. Specifically, we propose an Expectation Maximization Block seamlessly integrated into the single image rain streak removal network, enhancing its ability to eliminate superfluous information and restore a cleaner background image. Additionally, to further enhance local information for improved detail rendition, we introduce a Local Model Residual Block, which integrates two local model blocks along with a sequence of convolutions and activation functions. This integration synergistically facilitates the extraction of more pertinent features for enhanced single image rain streak removal. Extensive experiments validate that our proposed EMResformer surpasses current state-of-the-art single image rain streak removal methods on both synthetic and real-world datasets, achieving an improved balance between model complexity and single image deraining performance. Furthermore, we evaluate the effectiveness of our method in VBMS scenarios, demonstrating that high-quality imaging significantly improves the accuracy and reliability of VBMS tasks.

Guardians of Generation: Dynamic Inference-Time Copyright Shielding with Adaptive Guidance for AI Image Generation

Soham Roy,Abhishek Mishra,Shirish Karande,Murari Mandal

Task: Propose a model-agnostic inference-time framework that dynamically shields copyrighted content in AI image generation.

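Motivation: Modern text-to-image generative models can inadvertently reproduce copyrighted content memorized from their training data, raising potential copyright infringement concerns.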

Details

Method: Introduces the Guardians of Generation framework, comprising a detection module, a prompt rewriting module, and a guidance adjustment module, with no retraining or modification of the generative model's weights. Result: Validated on several generative models (such as Stable Diffusion, SDXL, and Flux), the method substantially reduces the generation of copyrighted content with negligible impact on output fidelity or alignment with user intent. Conclusion: The work provides a practical plug-and-play safeguard for generative image models, enabling more responsible deployment under real-world copyright constraints. Abstract: Modern text-to-image generative models can inadvertently reproduce copyrighted content memorized in their training data, raising serious concerns about potential copyright infringement. We introduce Guardians of Generation, a model-agnostic inference-time framework for dynamic copyright shielding in AI image generation. Our approach requires no retraining or modification of the generative model weights, instead integrating seamlessly with existing diffusion pipelines. It augments the generation process with an adaptive guidance mechanism comprising three components: a detection module, a prompt rewriting module, and a guidance adjustment module. The detection module monitors user prompts and intermediate generation steps to identify features indicative of copyrighted content before they manifest in the final output. If such content is detected, the prompt rewriting mechanism dynamically transforms the user's prompt by sanitizing or replacing references that could trigger copyrighted material while preserving the prompt's intended semantics. The adaptive guidance module steers the diffusion process away from flagged content by modulating the model's sampling trajectory. Together, these components form a robust shield that enables a tunable balance between preserving creative fidelity and ensuring copyright compliance. We validate our method on a variety of generative models such as Stable Diffusion, SDXL, and Flux, demonstrating substantial reductions in copyrighted content generation with negligible impact on output fidelity or alignment with user intent. This work provides a practical, plug-and-play safeguard for generative image models, enabling more responsible deployment under real-world copyright constraints. Source code is available at: https://respailab.github.io/gog

Narrowing Class-Wise Robustness Gaps in Adversarial Training

Fatemeh Amerehi,Patrick Healy

Task: Investigate the impact of adversarial training on overall and class-specific performance, along with its spill-over effects.

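Motivation: Address the accuracy decline caused by data shifts; in particular, adversarial training improves robustness but may hinder generalization to clean examples and exacerbate performance imbalances across classes.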

Details

Method: Uses enhanced labeling during training to improve adversarial robustness and mitigate class imbalance. Result: Enhanced labeling boosts adversarial robustness by 53.50% and mitigates class imbalance by 5.73%, improving accuracy in both clean and adversarial settings. Conclusion: Enhanced labeling in adversarial training substantially improves robustness and mitigates class imbalance, yielding higher accuracy in both clean and adversarial settings. Abstract: Efforts to address declining accuracy as a result of data shifts often involve various data-augmentation strategies. Adversarial training is one such method, designed to improve robustness to worst-case distribution shifts caused by adversarial examples. While this method can improve robustness, it may also hinder generalization to clean examples and exacerbate performance imbalances across different classes. This paper explores the impact of adversarial training on both overall and class-specific performance, as well as its spill-over effects. We observe that enhanced labeling during training boosts adversarial robustness by 53.50% and mitigates class imbalances by 5.73%, leading to improved accuracy in both clean and adversarial settings compared to standard adversarial training.

Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation

Andrea Maracani,Savas Ozkan,Sijun Cho,Hyowon Kim,Eunchung Noh,Jeongwon Min,Cho Jung Min,Dookun Park,Mete Ozay

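Task: Analyze the contributions of vision encoder and text decoder scaling to scene text recognition (STR), and propose a new method to mitigate the effect of label noise.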

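Motivation: Although scaling architectures has proven effective for improving scene text recognition (STR), the individual contributions of vision encoder and text decoder scaling remain under-explored. In addition, label noise is a key challenge in real-world data and limits the effectiveness of STR models.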

Details

Method: Proposes Cloze Self-Distillation (CSD), which mitigates label noise by distilling a student model from context-aware soft predictions and pseudolabels generated by a teacher model. The decoder architecture is further enhanced with differential cross-attention. Result: Achieves state-of-the-art performance on 10 of 11 benchmarks while significantly reducing parameter size and computational cost. Conclusion: Decoder scaling yields larger performance gains than encoder scaling, CSD effectively mitigates the effect of label noise, and differential cross-attention strengthens the decoder architecture, together producing state-of-the-art results on multiple benchmarks. Abstract: Scaling architectures has been proven effective for improving Scene Text Recognition (STR), but the individual contributions of vision encoder and text decoder scaling remain under-explored. In this work, we present an in-depth empirical analysis and demonstrate that, contrary to previous observations, scaling the decoder yields significant performance gains, always exceeding those achieved by encoder scaling alone. We also identify label noise as a key challenge in STR, particularly in real-world data, which can limit the effectiveness of STR models. To address this, we propose Cloze Self-Distillation (CSD), a method that mitigates label noise by distilling a student model from context-aware soft predictions and pseudolabels generated by a teacher model. Additionally, we enhance the decoder architecture by introducing differential cross-attention for STR. Our methodology achieves state-of-the-art performance on 10 out of 11 benchmarks using only real data, while significantly reducing the parameter size and computational costs.

MapGlue: Multimodal Remote Sensing Image Matching

Peihao Wu,Yongxiang Yao,Wenfei Zhang,Dong Wei,Yi Wan,Yansheng Li,Yongjun Zhang

Task: Propose MapGlue, a universal multimodal remote sensing image matching framework, and construct MapData, a large-scale multimodal dataset.

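Motivation: Existing unimodal datasets lack scale and diversity, which limits the development of deep learning solutions. Multimodal remote sensing image matching is pivotal for cross-modal fusion, localization, and object detection, but faces severe challenges from geometric, radiometric, and viewpoint discrepancies across imaging modalities.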

Details

Method: MapGlue integrates semantic context with a dual graph-guided mechanism to extract cross-modal invariant features, enabling global-to-local interaction and making descriptors more robust to modality-specific distortions. MapData provides 121,781 aligned electronic map-visible image pairs, addressing the scarcity of scalable multimodal benchmarks. Result: Extensive evaluations on MapData and five public datasets show that MapGlue achieves higher matching accuracy than existing methods under complex conditions and generalizes effectively to unseen modalities without retraining. Conclusion: By combining scalable dataset construction with a robust, semantics-driven framework, MapGlue addresses longstanding challenges in multimodal remote sensing image matching and generalizes strongly to other modality matching tasks. Abstract: Multimodal remote sensing image (MRSI) matching is pivotal for cross-modal fusion, localization, and object detection, but it faces severe challenges due to geometric, radiometric, and viewpoint discrepancies across imaging modalities. Existing unimodal datasets lack scale and diversity, limiting deep learning solutions. This paper proposes MapGlue, a universal MRSI matching framework, and MapData, a large-scale multimodal dataset addressing these gaps. Our contributions are twofold. MapData, a globally diverse dataset spanning 233 sampling points, offers original images (7,000x5,000 to 20,000x15,000 pixels). After rigorous cleaning, it provides 121,781 aligned electronic map-visible image pairs (512x512 pixels) with hybrid manual-automated ground truth, addressing the scarcity of scalable multimodal benchmarks. MapGlue integrates semantic context with a dual graph-guided mechanism to extract cross-modal invariant features. This structure enables global-to-local interaction, enhancing descriptor robustness against modality-specific distortions. Extensive evaluations on MapData and five public datasets demonstrate MapGlue's superiority in matching accuracy under complex conditions, outperforming state-of-the-art methods. Notably, MapGlue generalizes effectively to unseen modalities without retraining, highlighting its adaptability. This work addresses longstanding challenges in MRSI matching by combining scalable dataset construction with a robust, semantics-driven framework. Furthermore, MapGlue shows strong generalization capabilities on other modality matching tasks for which it was not specifically trained. The dataset and code are available at https://github.com/PeihaoWu/MapGlue.

CLS-RL: Image Classification with Rule-Based Reinforcement Learning

Ming Li,Shitian Zhao,Jike Zhong,Yuxiang Lai,Kaipeng Zhang

Task: Explore few-shot classification fine-tuning of Multimodal Large Language Models (MLLMs).

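Motivation: Acquiring large-scale labeled data is expensive, and conventional supervised fine-tuning (SFT) tends to overfit in few-shot settings and may even perform worse than the zero-shot approach.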

Details

Method: Proposes CLS-RL, which fine-tunes MLLMs using verifiable signals as rewards, and further proposes No-Thinking-CLS-RL, which minimizes the thinking process during training by setting an equality accuracy reward. Result: CLS-RL outperforms SFT on most datasets and has a higher average accuracy in both base-to-new and few-shot learning settings. No-Thinking-CLS-RL achieves better in-domain performance and generalization than CLS-RL with much less fine-tuning time. Conclusion: RL-based methods effectively teach models the fundamentals of classification, and reducing the thinking process during training can further improve performance. Abstract: Classification is a core task in machine learning. Recent research has shown that although Multimodal Large Language Models (MLLMs) are initially poor at image classification, fine-tuning them with an adequate amount of data can significantly enhance their performance, making them comparable to SOTA classification models. However, acquiring large-scale labeled data is expensive. In this paper, we explore few-shot MLLM classification fine-tuning. We found that SFT can cause severe overfitting issues and may even degrade performance over the zero-shot approach. To address this challenge, inspired by the recent successes in rule-based reinforcement learning, we propose CLS-RL, which uses verifiable signals as reward to fine-tune MLLMs. We discovered that CLS-RL outperforms SFT in most datasets and has a much higher average accuracy in both base-to-new and few-shot learning settings. Moreover, we observed a free-lunch phenomenon for CLS-RL; when models are fine-tuned on a particular dataset, their performance on other distinct datasets may also improve over zero-shot models, even if those datasets differ in distribution and class names. This suggests that RL-based methods effectively teach models the fundamentals of classification. Lastly, inspired by recent works in inference time thinking, we re-examine the "thinking process" during fine-tuning, a critical aspect of RL-based methods, in the context of visual classification. We question whether such tasks require an extensive thinking process during fine-tuning, proposing that this may actually detract from performance. Based on this premise, we introduce the No-Thinking-CLS-RL method, which minimizes thinking processes during training by setting an equality accuracy reward. Our findings indicate that, with much less fine-tuning time, the No-Thinking-CLS-RL method achieves better in-domain performance and generalization capabilities than CLS-RL.
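
To make the idea of a verifiable, rule-based reward concrete, here is a minimal sketch in Python. The `<think>`/`<answer>` tag template and the function names are assumptions made for illustration; the abstract only states that CLS-RL uses verifiable signals as reward and that the no-thinking variant uses an equality accuracy reward, so the actual reward definitions in the paper may differ.

```python
import re


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed
    <think>...</think><answer>...</answer> template, else 0.0."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0


def accuracy_reward(response: str, label: str) -> float:
    """Return 1.0 if the text inside the <answer> tags matches the class label."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == label.strip().lower() else 0.0


def equality_reward(response: str, label: str) -> float:
    """No-thinking variant (assumed form): the whole response must equal the label,
    so any extra reasoning text earns no reward."""
    return 1.0 if response.strip().lower() == label.strip().lower() else 0.0


with_thinking = "<think>Striped coat, short hair.</think><answer>Tabby</answer>"
label_only = " Tabby "
```

Under this sketch, `with_thinking` earns both the format and accuracy rewards but no equality reward, while `label_only` earns the equality reward. That is one way an equality-style reward could discourage a thinking trace during training.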

Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction

Ziyao Guo,Kaipeng Zhang,Michael Qizhe Shieh

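Task: Propose a coarse-to-fine (CTF) prediction method that lets image generation benefit from large codebooks without making autoregressive modeling harder.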

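Motivation: Existing autoregressive models for image generation must discretize continuous pixel data, typically with methods such as VQ-VAE. Larger codebooks, however, enlarge the vocabulary and complicate the autoregressive modeling task. The paper seeks a way to enjoy the benefits of large codebooks without making autoregressive modeling more difficult.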

Details

Method: Proposes a coarse-to-fine (CTF) prediction method with two stages: (1) an autoregressive model that sequentially predicts a coarse label for each token in the sequence, and (2) an auxiliary model that simultaneously predicts fine-grained labels for all tokens conditioned on their coarse labels. Result: Experiments on ImageNet show an average improvement of 59 points in Inception Score over baselines, with faster sampling despite the added inference step. Conclusion: Coarse-to-fine prediction delivers the benefits of large codebooks without making autoregressive modeling harder and achieves better image generation performance. Abstract: Autoregressive models have shown remarkable success in image generation by adapting sequential prediction techniques from language modeling. However, applying these approaches to images requires discretizing continuous pixel data through vector quantization methods like VQ-VAE. To alleviate the quantization errors that exist in VQ-VAE, recent works tend to use larger codebooks. However, this will accordingly expand vocabulary size, complicating the autoregressive modeling task. This paper aims to find a way to enjoy the benefits of large codebooks without making autoregressive modeling more difficult. Through empirical investigation, we discover that tokens with similar codeword representations produce similar effects on the final generated image, revealing significant redundancy in large codebooks. Based on this insight, we propose to predict tokens from coarse to fine (CTF), realized by assigning the same coarse label for similar tokens. Our framework consists of two stages: (1) an autoregressive model that sequentially predicts coarse labels for each token in the sequence, and (2) an auxiliary model that simultaneously predicts fine-grained labels for all tokens conditioned on their coarse labels. Experiments on ImageNet demonstrate our method's superior performance, achieving an average improvement of 59 points in Inception Score compared to baselines. Notably, despite adding an inference step, our approach achieves faster sampling speeds.
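
The step of "assigning the same coarse label for similar tokens" can be illustrated with a small clustering sketch. The use of k-means, the toy 2-D codebook, and all names below are assumptions for illustration; the abstract does not say how similarity between codewords is measured or how coarse labels are assigned.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: squared_distance(vec, centroids[i]))


def kmeans_coarse_labels(codebook, k, iterations=10):
    """Cluster codewords and return one coarse label per codeword.
    Initialization is deterministic: the first k codewords seed the centroids."""
    centroids = [list(codebook[i]) for i in range(k)]
    labels = [0] * len(codebook)
    for _ in range(iterations):
        labels = [nearest(vec, centroids) for vec in codebook]
        for c in range(k):
            members = [codebook[i] for i, lab in enumerate(labels) if lab == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Toy codebook with six 2-D codewords forming two obvious groups.
codebook = [(0.0, 0.1), (5.0, 5.1), (0.1, 0.0), (5.1, 5.0), (0.05, 0.05), (4.9, 5.0)]
coarse = kmeans_coarse_labels(codebook, k=2)

# Fine tokens that share each coarse label; a second-stage model would choose among these.
fine_candidates = {c: [i for i, lab in enumerate(coarse) if lab == c] for c in set(coarse)}
```

In this toy setup the autoregressive stage would predict over 2 coarse labels instead of 6 codewords, and the auxiliary stage would resolve each coarse label to one of its fine candidates.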

VP-NTK: Exploring the Benefits of Visual Prompting in Differentially Private Data Synthesis

Chia-Yi Hsu,Jia-You Chen,Yu-Lin Tsai,Chih-Hsun Lin,Pin-Yu Chen,Chia-Mu Yu,Chun-Ying Huang

Task: Explore how to build effective generative models under differential privacy (DP) constraints.

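Motivation: Differentially private synthetic data has become the standard for releasing sensitive data, but many DP generative models produce synthetic data of low utility, especially for high-resolution images.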

Details

Method: Combines visual prompting (VP) with DP-NTK, a DP generator that uses the neural tangent kernel (NTK) to train DP generative models, to improve performance. Result: VP combined with DP-NTK significantly improves performance on high-resolution image datasets, raising accuracy from 0.644±0.044 to 0.769. Conclusion: The study is a promising step toward improving the utility of DP synthetic data, particularly for high-resolution images. Abstract: Differentially private (DP) synthetic data has become the de facto standard for releasing sensitive data. However, many DP generative models suffer from the low utility of synthetic data, especially for high-resolution images. On the other hand, one of the emerging techniques in parameter efficient fine-tuning (PEFT) is visual prompting (VP), which allows well-trained existing models to be reused for the purpose of adapting to subsequent downstream tasks. In this work, we explore such a phenomenon in constructing captivating generative models with DP constraints. We show that VP in conjunction with DP-NTK, a DP generator that exploits the power of the neural tangent kernel (NTK) in training DP generative models, achieves a significant performance boost, particularly for high-resolution image datasets, with accuracy improving from 0.644±0.044 to 0.769. Lastly, we perform ablation studies on the effect of different parameters that influence the overall performance of VP-NTK. Our work demonstrates a promising step forward in improving the utility of DP synthetic data, particularly for high-resolution images.

Temporal Score Analysis for Understanding and Correcting Diffusion Artifacts

Yu Cao,Zengqun Zhao,Ioannis Patras,Shaogang Gong

Task: Propose ASCED, a method that detects and reduces visual artifacts by monitoring abnormal score dynamics during the diffusion process.

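Motivation: Existing methods rely mainly on supervised detectors, lack an understanding of why artifacts arise, and consider only the spatial uncertainty of the final output, so they cannot localize artifacts effectively.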

Details

Method: Proposes ASCED, which monitors abnormal score dynamics during the diffusion process and applies a trajectory-aware on-the-fly mitigation strategy that generates appropriate noise in the detected regions. Result: Experiments show that ASCED effectively reduces artifacts across multiple domains, matching or surpassing existing supervised methods without additional training. Conclusion: By monitoring and mitigating abnormal score dynamics during diffusion, ASCED effectively reduces visual artifacts without additional training. Abstract: Visual artifacts remain a persistent challenge in diffusion models, even with training on massive datasets. Current solutions primarily rely on supervised detectors, yet lack understanding of why these artifacts occur in the first place. In our analysis, we identify three distinct phases in the diffusion generative process: Profiling, Mutation, and Refinement. Artifacts typically emerge during the Mutation phase, where certain regions exhibit anomalous score dynamics over time, causing abrupt disruptions in the normal evolution pattern. This temporal nature explains why existing methods focusing only on spatial uncertainty of the final output fail at effective artifact localization. Based on these insights, we propose ASCED (Abnormal Score Correction for Enhancing Diffusion), which detects artifacts by monitoring abnormal score dynamics during the diffusion process, with a trajectory-aware on-the-fly mitigation strategy that appropriately generates noise in the detected areas. Unlike most existing methods that apply post hoc corrections, e.g., by applying a noising-denoising scheme after generation, our mitigation strategy operates seamlessly within the existing diffusion process. Extensive experiments demonstrate that our proposed approach effectively reduces artifacts across diverse domains, matching or surpassing existing supervised methods without additional training.

OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection

Max Gutbrod,David Rauber,Danilo Weber Nunes,Christoph Palm

Task: Evaluate out-of-distribution (OOD) detection methods in medical imaging.

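Motivation: Ensure the trustworthiness of AI systems in critical domains such as healthcare, especially when they face unexpected or anomalous inputs.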

Details

Method: Proposes the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD) framework, comprising three benchmarks that cover 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories. Result: The results show that findings from broad-scale OOD benchmarks in natural image domains do not carry over to medical applications, underscoring the need for such benchmarks in the medical field. Conclusion: By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD aims to support the advancement of reliable and trustworthy AI systems in healthcare. Abstract: The growing reliance on Artificial Intelligence (AI) in critical domains such as healthcare demands robust mechanisms to ensure the trustworthiness of these systems, especially when faced with unexpected or anomalous inputs. This paper introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods specifically in medical imaging contexts. OpenMIBOOD includes three benchmarks from diverse medical domains, encompassing 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories. We evaluate 24 post-hoc methods across these benchmarks, providing a standardized reference to advance the development and fair comparison of OOD detection methods. Results reveal that findings from broad-scale OOD benchmarks in natural image domains do not translate to medical applications, underscoring the critical need for such benchmarks in the medical field. By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD aims to support the advancement of reliable and trustworthy AI systems in healthcare. The repository is available at https://github.com/remic-othr/OpenMIBOOD.
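
For readers unfamiliar with post-hoc OOD detection, the following sketch shows one widely known baseline, the maximum softmax probability (MSP) score, which needs only the logits of an already trained classifier. It is offered as a generic illustration: the abstract does not list the 24 evaluated methods, and the threshold value below is an arbitrary assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_score(logits):
    """Maximum softmax probability: higher means the input looks more in-distribution."""
    return max(softmax(logits))


def is_ood(logits, threshold):
    """Flag an input as out-of-distribution when its MSP falls below the threshold."""
    return msp_score(logits) < threshold


confident_logits = [10.0, 0.0, 0.0]  # classifier strongly prefers one class
uncertain_logits = [1.0, 1.0, 1.0]   # classifier has no preference
```

With a threshold of 0.5, `confident_logits` would be treated as in-distribution and `uncertain_logits` as OOD. A benchmark of this kind would typically compare such scores across in-distribution, near-OOD, and far-OOD data rather than fix a single threshold.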

Markus Karmann,Peng-Tao Jiang,Bo Li,Onay Urfalioglu

Task: Propose M2N2V2, a new unsupervised, training-free, point-prompt-based interactive segmentation method.

Motivation: Address the segment size fluctuations that M2N2 exhibits during interaction, and improve segmentation accuracy and efficiency.

Details

Method: Leverages depth guidance and attention maps, integrating depth as an additional modality to create depth-guided Markov-maps, and proposes a new adaptive score function that prevents unreasonable segment size changes. Result: M2N2V2 significantly improves Number of Clicks (NoC) and mIoU over M2N2 on all datasets except those from the medical domain, and achieves results competitive with supervised methods on the DAVIS and HQSeg44K datasets. Conclusion: M2N2V2 performs strongly among unsupervised methods and narrows the gap to supervised methods. Abstract: We present Markov Map Nearest Neighbor V2 (M2N2V2), a novel and simple, yet effective approach which leverages depth guidance and attention maps for unsupervised and training-free point-prompt-based interactive segmentation. Following recent trends in supervised multimodal approaches, we carefully integrate depth as an additional modality to create novel depth-guided Markov-maps. Furthermore, we observe occasional segment size fluctuations in M2N2 during the interactive process, which can decrease the overall mIoU. To mitigate this problem, we model the prompting as a sequential process and propose a novel adaptive score function which considers the previous segmentation and the current prompt point in order to prevent unreasonable segment size changes. Using Stable Diffusion 2 and Depth Anything V2 as backbones, we empirically show that our proposed M2N2V2 significantly improves the Number of Clicks (NoC) and mIoU compared to M2N2 in all datasets except those from the medical domain. Interestingly, our unsupervised approach achieves competitive results compared to supervised methods like SAM and SimpleClick in the more challenging DAVIS and HQSeg44K datasets in the NoC metric, reducing the gap between supervised and unsupervised methods.

Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models

Keda Tao,Haoxuan You,Yang Sui,Can Qin,Huan Wang

Task: Propose VidKV, a new KV cache quantization method that compresses the KV cache of video large language models (VideoLLMs) to fewer than 2 bits.

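Motivation: Because video frames produce thousands of visual tokens, the KV cache significantly increases memory requirements and becomes a bottleneck for inference speed and memory usage.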

Details

Method: Proposes VidKV, which applies mixed-precision quantization along the channel dimension for the key cache, and 1.58-bit quantization for the value cache while selectively preserving semantically salient visual tokens. Result: With LLaVA-OV-7B and Qwen2.5-VL-7B, VidKV compresses the KV cache to 1.5-bit and 1.58-bit precision with almost no performance drop. Conclusion: VidKV effectively compresses the KV cache with almost no loss in model performance, and the findings suggest that the value cache of VideoLLMs should be quantized per channel rather than per token. Abstract: Video large language models (VideoLLMs) have demonstrated the capability to process longer video inputs and enable complex reasoning and analysis. However, due to the thousands of visual tokens from the video frames, key-value (KV) cache can significantly increase memory requirements, becoming a bottleneck for inference speed and memory usage. KV cache quantization is a widely used approach to address this problem. In this paper, we find that 2-bit KV quantization of VideoLLMs hardly hurts the model performance, while the limit of KV cache quantization in even lower bits has not been investigated. To bridge this gap, we introduce VidKV, a plug-and-play KV cache quantization method to compress the KV cache to lower than 2 bits. Specifically, (1) for key, we propose a mixed-precision quantization strategy in the channel dimension, where we perform 2-bit quantization for anomalous channels and 1-bit quantization combined with FFT for normal channels; (2) for value, we implement 1.58-bit quantization while selectively filtering semantically salient visual tokens for targeted preservation, for a better trade-off between precision and model performance. Importantly, our findings suggest that the value cache of VideoLLMs should be quantized in a per-channel fashion instead of the per-token fashion proposed by prior KV cache quantization works for LLMs. Empirically, extensive results with LLaVA-OV-7B and Qwen2.5-VL-7B on six benchmarks show that VidKV effectively compresses the KV cache to 1.5-bit and 1.58-bit precision with almost no performance drop compared to the FP16 counterparts.
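
The "1.58-bit" figure corresponds to three quantization levels, since log2(3) ≈ 1.58. The sketch below shows what per-channel ternary quantization of a value cache could look like. The mean-absolute-value scale and the function names are assumptions chosen for illustration, and the selective preservation of salient tokens described in the abstract is omitted.

```python
def ternary_quantize_per_channel(values):
    """Quantize a value cache to codes in {-1, 0, 1} with one scale per channel.

    `values` is a list of tokens, each a list of channel values.
    The scale for a channel is the mean absolute value over all tokens
    (an assumed choice; other scale definitions are possible).
    """
    num_channels = len(values[0])
    scales = []
    for c in range(num_channels):
        column = [abs(tok[c]) for tok in values]
        mean_abs = sum(column) / len(column)
        scales.append(mean_abs if mean_abs > 0 else 1.0)  # avoid division by zero
    codes = [
        [max(-1, min(1, round(tok[c] / scales[c]))) for c in range(num_channels)]
        for tok in values
    ]
    return codes, scales


def dequantize(codes, scales):
    """Reconstruct approximate values from ternary codes and per-channel scales."""
    return [[code * scales[c] for c, code in enumerate(tok)] for tok in codes]


# Three tokens, two channels with very different magnitudes.
value_cache = [[0.9, -4.0], [-1.1, 0.2], [0.1, 4.2]]
codes, scales = ternary_quantize_per_channel(value_cache)
approx = dequantize(codes, scales)
```

Because each channel gets its own scale, the large-magnitude second channel does not force the small-magnitude first channel to collapse to zero. This is one intuition for preferring per-channel over per-token grouping when channel magnitudes differ widely.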

Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data

Zijian Li,Jingjing Fu,Lei Song,Jiang Bian,Jun Zhang,Rui Wang

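Task: Propose Chain of Functions (CoF), a new programmatic reasoning data generation pipeline that produces the high-quality visual reasoning data needed by multimodal large language models (MLLMs).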

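Motivation: Existing methods use (M)LLMs to generate data, but direct prompting often yields limited precision and diversity. High-quality rationale data is essential for answering complex chart queries.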

Details

Method: CoF uses freely explored reasoning paths as supervision, generating diverse function chains and translating them into linguistic rationales and questions; the translation needs only a moderately sized open-source LLM. Result: Builds the ChartCoF dataset, with 1.4k complex reasoning Q&A for fine-grained analysis and 50k Q&A for reasoning enhancement. Experiments show that fine-tuning with ChartCoF achieves state-of-the-art performance among same-scale MLLMs on widely used benchmarks. Conclusion: CoF offers a novel function-governed paradigm for rationale generation that improves data precision and diversity as well as explainability and practicality, with potential for applications beyond charts. Abstract: Visual reasoning is crucial for multimodal large language models (MLLMs) to address complex chart queries, yet high-quality rationale data remains scarce. Existing methods leveraged (M)LLMs for data generation, but direct prompting often yields limited precision and diversity. In this paper, we propose Chain of Functions (CoF), a novel programmatic reasoning data generation pipeline that utilizes freely-explored reasoning paths as supervision to ensure data precision and diversity. Specifically, it starts with human-free exploration among the atomic functions (e.g., maximum data and arithmetic operations) to generate diverse function chains, which are then translated into linguistic rationales and questions with only a moderate open-sourced LLM. CoF provides multiple benefits: 1) Precision: function-governed generation reduces hallucinations compared to freeform generation; 2) Diversity: enumerating function chains enables varied question taxonomies; 3) Explainability: function chains serve as built-in rationales, allowing fine-grained evaluation beyond overall accuracy; 4) Practicality: eliminating reliance on extremely large models. Employing CoF, we construct the ChartCoF dataset, with 1.4k complex reasoning Q&A for fine-grained analysis and 50k Q&A for reasoning enhancement. The fine-grained evaluation on ChartCoF reveals varying performance across question taxonomies for each MLLM, and the experiments also show that finetuning with ChartCoF achieves state-of-the-art performance among same-scale MLLMs on widely used benchmarks. Furthermore, the novel paradigm of function-governed rationale generation in CoF could inspire broader applications beyond charts.
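
As a rough illustration of how a chain of atomic functions can yield both an answer and a built-in rationale, consider the sketch below. The chart data, the three atomic functions, and the hard-coded question are all hypothetical; in the pipeline described above, chains are explored automatically and an LLM translates them into questions and rationales.

```python
# Hypothetical chart data: category label -> value.
chart = {"2021": 12.0, "2022": 18.0, "2023": 15.0}


def maximum(data):
    """Atomic function: largest value, plus a textual trace of the step."""
    label = max(data, key=data.get)
    return data[label], f"the maximum is {data[label]} ({label})"


def minimum(data):
    """Atomic function: smallest value, plus a textual trace of the step."""
    label = min(data, key=data.get)
    return data[label], f"the minimum is {data[label]} ({label})"


def difference(a, b):
    """Atomic function: arithmetic difference, plus a textual trace."""
    return a - b, f"{a} - {b} = {a - b}"


def run_chain(data):
    """Execute the chain maximum -> minimum -> difference and record every step.
    The recorded steps play the role of the rationale; the answer is computed,
    not generated, so it cannot be hallucinated."""
    steps = []
    high, trace = maximum(data)
    steps.append(trace)
    low, trace = minimum(data)
    steps.append(trace)
    answer, trace = difference(high, low)
    steps.append(trace)
    question = "What is the gap between the highest and lowest values in the chart?"
    return question, steps, answer


question, steps, answer = run_chain(chart)
```

Because each step is an explicit function call, an evaluation can check not only the final answer but also which intermediate step a model gets wrong, which matches the fine-grained evaluation the abstract mentions.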

From Monocular Vision to Autonomous Action: Guiding Tumor Resection via 3D Reconstruction

Ayberk Acar,Mariana Smith,Lidia Al-Zogbi,Tanner Watts,Fangjie Li,Hao Li,Nural Yilmaz,Paul Maria Scheikl,Jesse F. d'Almeida,Susheela Sharma,Lauren Branscombe,Tayfun Efe Ertop,Robert J. Webster III,Ipek Oguz,Alan Kuntz,Axel Krieger,Jie Ying Wu

Task: Propose a 3D mapping pipeline that uses only RGB images to generate segmented point clouds of the target anatomy.

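Motivation: Current methods rely on bulky depth cameras to map the anatomy, which works poorly in space-limited clinical applications. Monocular cameras are small and suited to minimally invasive surgery in tight spaces, but they require additional processing to produce 3D scene understanding.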

Details

Method: Compares the performance of different structure-from-motion algorithms on mapping central airway obstructions, and tests the pipeline on a downstream tumor resection task. Result: On several metrics, including post-procedure tissue model evaluation, the pipeline performs comparably to RGB-D cameras and in some cases surpasses them. Conclusion: These promising results show that automation guidance in minimally invasive procedures can be achieved with monocular cameras. The study is a step toward full autonomy of surgical robots. Abstract: Surgical automation requires precise guidance and understanding of the scene. Current methods in the literature rely on bulky depth cameras to create maps of the anatomy; however, this does not translate well to space-limited clinical applications. Monocular cameras are small and allow minimally invasive surgeries in tight spaces, but additional processing is required to generate 3D scene understanding. We propose a 3D mapping pipeline that uses only RGB images to create segmented point clouds of the target anatomy. To ensure the most precise reconstruction, we compare different structure from motion algorithms' performance on mapping the central airway obstructions, and test the pipeline on a downstream task of tumor resection. In several metrics, including post-procedure tissue model evaluation, our pipeline performs comparably to RGB-D cameras and, in some cases, even surpasses their performance. These promising results demonstrate that automation guidance can be achieved in minimally invasive procedures with monocular cameras. This study is a step toward the complete autonomy of surgical robots.

Generalized Few-shot 3D Point Cloud Segmentation with Vision-Language Model

Zhaochong An,Guolei Sun,Yun Liu,Runjia Li,Junlin Han,Ender Konukoglu,Serge Belongie

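Task: Propose GFS-VL, a generalized few-shot 3D point cloud segmentation framework that combines dense but noisy pseudo-labels from 3D vision-language models with precise yet sparse few-shot samples to maximize the strengths of both.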

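Motivation: Existing generalized few-shot 3D point cloud segmentation methods enhance prototypes by interacting with support or query features, but remain limited by the sparse knowledge in few-shot samples. Meanwhile, 3D vision-language models, which generalize across open-world novel classes, contain rich but noisy novel class knowledge.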

Details

Method: Presents a prototype-guided pseudo-label selection that filters low-quality regions, followed by an adaptive infilling strategy that combines knowledge from pseudo-label contexts and few-shot samples to label the filtered, unlabeled regions. A novel-base mix strategy additionally embeds few-shot samples into training scenes, preserving essential context for better novel class learning. Result: Experiments validate the framework's effectiveness across models and datasets. Conclusion: The approach and benchmarks provide a solid foundation for advancing generalized few-shot 3D point cloud segmentation in the real world. Abstract: Generalized few-shot 3D point cloud segmentation (GFS-PCS) adapts models to new classes with few support samples while retaining base class segmentation. Existing GFS-PCS methods enhance prototypes via interacting with support or query features but remain limited by sparse knowledge from few-shot samples. Meanwhile, 3D vision-language models (3D VLMs), generalizing across open-world novel classes, contain rich but noisy novel class knowledge. In this work, we introduce a GFS-PCS framework that synergizes dense but noisy pseudo-labels from 3D VLMs with precise yet sparse few-shot samples to maximize the strengths of both, named GFS-VL. Specifically, we present a prototype-guided pseudo-label selection to filter low-quality regions, followed by an adaptive infilling strategy that combines knowledge from pseudo-label contexts and few-shot samples to adaptively label the filtered, unlabeled areas. Additionally, we design a novel-base mix strategy to embed few-shot samples into training scenes, preserving essential context for improved novel class learning. Moreover, recognizing the limited diversity in current GFS-PCS benchmarks, we introduce two challenging benchmarks with diverse novel classes for comprehensive generalization evaluation. Experiments validate the effectiveness of our framework across models and datasets. Our approach and benchmarks provide a solid foundation for advancing GFS-PCS in the real world. The code is at https://github.com/ZhaochongAn/GFS-VL

PSA-MIL: A Probabilistic Spatial Attention-Based Multiple Instance Learning for Whole Slide Image Classification

Sharon Peled,Yosef E. Maruvka,Moti Freiman

Task: Propose PSA-MIL, a new attention-based multiple instance learning framework for whole slide image (WSI) classification.

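Motivation: Existing attention-based multiple instance learning methods do not fully exploit the spatial relationships among tiles and may overlook intricate tissue structures that are crucial for accurate diagnosis.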

Details

Method: Proposes Probabilistic Spatial Attention MIL (PSA-MIL), which integrates spatial context into the attention mechanism through learnable distance-decayed priors and infers spatial relationships dynamically during training. It also proposes a spatial pruning strategy that reduces the quadratic complexity of self-attention, and introduces a diversity loss to strengthen spatial modeling. Result: PSA-MIL achieves state-of-the-art performance across both contextual and non-contextual baselines while significantly reducing computational cost. Conclusion: By integrating spatial context in a data-driven and adaptive way that moves beyond predefined constraints, PSA-MIL improves the accuracy and efficiency of whole slide image classification. Abstract: Whole Slide Images (WSIs) are high-resolution digital scans widely used in medical diagnostics. WSI classification is typically approached using Multiple Instance Learning (MIL), where the slide is partitioned into tiles treated as interconnected instances. While attention-based MIL methods aim to identify the most informative tiles, they often fail to fully exploit the spatial relationships among them, potentially overlooking intricate tissue structures crucial for accurate diagnosis. To address this limitation, we propose Probabilistic Spatial Attention MIL (PSA-MIL), a novel attention-based MIL framework that integrates spatial context into the attention mechanism through learnable distance-decayed priors, formulated within a probabilistic interpretation of self-attention as a posterior distribution. This formulation enables a dynamic inference of spatial relationships during training, eliminating the need for predefined assumptions often imposed by previous approaches. Additionally, we suggest a spatial pruning strategy for the posterior, effectively reducing self-attention's quadratic complexity. To further enhance spatial modeling, we introduce a diversity loss that encourages variation among attention heads, ensuring each captures distinct spatial representations. Together, PSA-MIL enables a more data-driven and adaptive integration of spatial context, moving beyond predefined constraints. We achieve state-of-the-art performance across both contextual and non-contextual baselines, while significantly reducing computational costs.

SceneMI: Motion In-betweening for Modeling Human-Scene Interactions

Inwoo Hwang,Bing Zhou,Young Min Kim,Jian Wang,Chuan Guo

Task: Reformulate human-scene interaction (HSI) modeling as a scene-aware motion in-betweening task.

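Motivation: Existing generative modeling approaches have made progress on human-scene interaction modeling, but their limited controllability and flexibility make them hard to apply in real-world scenarios.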

Details

Method: Proposes the SceneMI framework, which uses dual scene descriptors to comprehensively encode global and local scene context and exploits the denoising nature of diffusion models to generalize to noisy keyframes. Result: Experiments show that SceneMI is effective at scene-aware keyframe in-betweening and generalizes to the real-world GIMO dataset; its applicability to HSI reconstruction from monocular videos is also demonstrated. Conclusion: SceneMI offers clear advantages in scene-aware motion in-betweening and HSI reconstruction, improving motion quality and application flexibility. Abstract: Modeling human-scene interactions (HSI) is essential for understanding and simulating everyday human behaviors. Recent approaches utilizing generative modeling have made progress in this domain; however, they are limited in controllability and flexibility for real-world applications. To address these challenges, we propose reformulating the HSI modeling problem as Scene-aware Motion In-betweening, a more tractable and practical task. We introduce SceneMI, a framework that supports several practical applications, including keyframe-guided character animation in 3D scenes and enhancing the motion quality of imperfect HSI data. SceneMI employs dual scene descriptors to comprehensively encode global and local scene context. Furthermore, our framework leverages the inherent denoising nature of diffusion models to generalize on noisy keyframes. Experimental results demonstrate SceneMI's effectiveness in scene-aware keyframe in-betweening and generalization to the real-world GIMO dataset, where motions and scenes are acquired by noisy IMU sensors and smartphones. We further showcase SceneMI's applicability in HSI reconstruction from monocular videos.

Unleashing Vecset Diffusion Model for Fast Shape Generation

Zeqiang Lai,Yunfei Zhao,Zibo Zhao,Haolin Liu,Fuyun Wang,Huiwen Shi,Xianghui Yang,Qinxiang Lin,Jinwei Huang,Yuhong Liu,Jie Jiang,Chunchao Guo,Xiangyu Yue

Task: Accelerate both the VAE and the DiT in 3D shape generation.

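Motivation: The existing Vecset Diffusion Model (VDM) has speed bottlenecks when generating high-resolution 3D shapes, particularly in accelerating diffusion sampling and VAE decoding.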

Details

Method: Proposes the FlashVDM framework, which stabilizes consistency distillation with Progressive Flow Distillation to enable flexible diffusion sampling, and introduces Adaptive KV Selection, Hierarchical Volume Decoding, and Efficient Network Design to speed up VAE decoding. Result: Applying FlashVDM to Hunyuan3D-2 yields Hunyuan3D-2 Turbo, which significantly outperforms existing fast 3D generation methods and reduces inference time by over 45x for reconstruction and 32x for generation. Conclusion: Through systematic optimization, FlashVDM substantially speeds up 3D shape generation while maintaining performance comparable to the state of the art. Abstract: 3D shape generation has greatly flourished through the development of so-called "native" 3D diffusion, particularly through the Vecset Diffusion Model (VDM). While recent advancements have shown promising results in generating high-resolution 3D shapes, VDM still struggles with high-speed generation. Challenges exist because of difficulties not only in accelerating diffusion sampling but also VAE decoding in VDM, areas under-explored in previous works. To address these challenges, we present FlashVDM, a systematic framework for accelerating both VAE and DiT in VDM. For DiT, FlashVDM enables flexible diffusion sampling with as few as 5 inference steps and comparable quality, which is made possible by stabilizing consistency distillation with our newly introduced Progressive Flow Distillation. For VAE, we introduce a lightning vecset decoder equipped with Adaptive KV Selection, Hierarchical Volume Decoding, and Efficient Network Design. By exploiting the locality of the vecset and the sparsity of shape surface in the volume, our decoder drastically lowers FLOPs, minimizing the overall decoding overhead. We apply FlashVDM to Hunyuan3D-2 to obtain Hunyuan3D-2 Turbo. Through systematic evaluation, we show that our model significantly outperforms existing fast 3D generation methods, achieving comparable performance to the state-of-the-art while reducing inference time by over 45x for reconstruction and 32x for generation. Code and models are available at https://github.com/Tencent/FlashVDM.

Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction

Edgar Sucar,Zihang Lai,Eldar Insafutdinov,Andrea Vedaldi

Task: Address multi-view geometry tasks in dynamic scenes by introducing Dynamic Point Maps (DPM).

Motivation: DUSt3R cannot handle dynamic scenes, so a new formulation is needed to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence.

Details

Method: Introduces the concept of Dynamic Point Maps (DPM), which extends standard point maps to support 4D tasks, and identifies a minimal subset of spatial and time reference combinations that a network can regress to solve these sub-tasks. Result: Achieves state-of-the-art performance across benchmarks for video depth prediction, dynamic point cloud reconstruction, 3D scene flow, and object pose tracking. Conclusion: DPM is an effective representation for multi-view geometry in dynamic scenes and performs well across multiple tasks. Abstract: DUSt3R has recently shown that one can reduce many tasks in multi-view geometry, including estimating camera intrinsics and extrinsics, reconstructing the scene in 3D, and establishing image correspondences, to the prediction of a pair of viewpoint-invariant point maps, i.e., pixel-aligned point clouds defined in a common reference frame. This formulation is elegant and powerful, but unable to tackle dynamic scenes. To address this challenge, we introduce the concept of Dynamic Point Maps (DPM), extending standard point maps to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence. Our key intuition is that, when time is introduced, there are several possible spatial and time references that can be used to define the point maps. We identify a minimal subset of such combinations that can be regressed by a network to solve the sub-tasks mentioned above. We train a DPM predictor on a mixture of synthetic and real data and evaluate it across diverse benchmarks for video depth prediction, dynamic point cloud reconstruction, 3D scene flow and object pose tracking, achieving state-of-the-art performance. Code, models and additional results are available at https://www.robots.ox.ac.uk/~vgg/research/dynamic-point-maps/.

Ultra-Resolution Adaptation with Ease

Ruonan Yu,Songhua Liu,Zhenxiong Tan,Xinchao Wang

Task: Explore data efficiency and parameter efficiency in high-resolution image generation, and propose URAE, a set of key guidelines for ultra-resolution adaptation.

Motivation: Training high-resolution image generation models is challenging when training data and computational resources are limited, which calls for data-efficient and parameter-efficient solutions.

Details

Method: Proposes URAE, which uses synthetic data generated by teacher models to promote training convergence, and tunes minor components of the weight matrices for parameter efficiency when synthetic data are unavailable. Result: With only 3K samples and 2K iterations, URAE achieves 2K-generation performance comparable to FLUX1.1 [Pro] Ultra and sets new benchmarks for 4K-resolution generation. Conclusion: URAE performs well in both data and parameter efficiency and offers an effective solution for high-resolution image generation. Abstract: Text-to-image diffusion models have achieved remarkable progress in recent years. However, training models for high-resolution image generation remains challenging, particularly when training data and computational resources are limited. In this paper, we explore this practical problem from two key perspectives: data and parameter efficiency, and propose a set of key guidelines for ultra-resolution adaptation termed URAE. For data efficiency, we theoretically and empirically demonstrate that synthetic data generated by some teacher models can significantly promote training convergence. For parameter efficiency, we find that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable, offering substantial performance gains while maintaining efficiency. Additionally, for models leveraging guidance distillation, such as FLUX, we show that disabling classifier-free guidance, i.e., setting the guidance scale to 1 during adaptation, is crucial for satisfactory performance. Extensive experiments validate that URAE achieves comparable 2K-generation performance to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations, while setting new benchmarks for 4K-resolution generation. Codes are available at https://github.com/Huage001/URAE.
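The abstract contrasts tuning "minor components of the weight matrices" with low-rank adapters but does not say how those components are obtained. A minimal sketch, assuming (this is an interpretation, not something the abstract confirms) that the minor components are the directions with the smallest singular values: the weight is split by SVD into a frozen major part and a small trainable minor part. The function name `split_major_minor` and the parameter `num_minor` are hypothetical.

```python
import numpy as np

def split_major_minor(weight, num_minor):
    """Split a weight matrix into a major and a minor part via SVD.

    Assumption: "minor components" means the num_minor directions with the
    smallest singular values. In an adaptation run the major part would stay
    frozen and only the minor part would be updated.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    k = len(s) - num_minor  # number of leading (major) components
    major = (u[:, :k] * s[:k]) @ vt[:k]
    minor = (u[:, k:] * s[k:]) @ vt[k:]
    return major, minor

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 6))
major, minor = split_major_minor(w, num_minor=2)

# The two parts add back up to the original weight.
reconstruction_ok = np.allclose(major + minor, w)
```

A low-rank adapter adds a new low-rank term on top of the full weight, whereas this split reuses directions already present in the weight; which of the two the paper actually implements should be checked against its code.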

Gaussian Graph Network: Learning Efficient and Generalizable Gaussian Representations from Multi-view Images

Shengjun Zhang,Xin Fei,Fangfu Liu,Haixu Song,Yueqi Duan

Task: Propose Gaussian Graph Network (GGN) to generate efficient and generalizable Gaussian representations.

Motivation: Existing feed-forward methods represent a scene by simply combining pixel-aligned Gaussians from multiple views, which causes artifacts and extra memory cost and fails to fully capture the relations among Gaussians from different images.

Details

Method: Constructs Gaussian Graphs to model the relations between Gaussian groups from different views, reformulates the basic graph operations over Gaussian representations to support Gaussian-level message passing, and designs a Gaussian pooling layer that aggregates Gaussian groups into efficient representations. Result: Experiments on the large-scale RealEstate10K and ACID datasets show that the method uses fewer Gaussians while achieving better image quality and higher rendering speed. Conclusion: GGN produces efficient and generalizable Gaussian representations and outperforms existing methods. Abstract: 3D Gaussian Splatting (3DGS) has demonstrated impressive novel view synthesis performance. While conventional methods require per-scene optimization, more recently several feed-forward methods have been proposed to generate pixel-aligned Gaussian representations with a learnable network, which are generalizable to different scenes. However, these methods simply combine pixel-aligned Gaussians from multiple views as scene representations, thereby leading to artifacts and extra memory cost without fully capturing the relations of Gaussians from different images. In this paper, we propose Gaussian Graph Network (GGN) to generate efficient and generalizable Gaussian representations. Specifically, we construct Gaussian Graphs to model the relations of Gaussian groups from different views. To support message passing at Gaussian level, we reformulate the basic graph operations over Gaussian representations, enabling each Gaussian to benefit from its connected Gaussian groups with Gaussian feature fusion. Furthermore, we design a Gaussian pooling layer to aggregate various Gaussian groups for efficient representations. We conduct experiments on the large-scale RealEstate10K and ACID datasets to demonstrate the efficiency and generalization of our method. Compared to the state-of-the-art methods, our model uses fewer Gaussians and achieves better image quality with higher rendering speed.

UniSync: A Unified Framework for Audio-Visual Synchronization

Tao Feng,Yifan Xie,Xun Guan,Jiyuan Song,Zhou Liu,Fei Ma,Fei Yu

Task: Evaluate audio-visual synchronization in speech videos.

Motivation: Existing methods rely on limited audio-visual representations and suboptimal learning strategies, which constrains their effectiveness in complex scenarios.

Details

Method: Proposes UniSync, a method that evaluates audio-visual synchronization using embedding similarities, is compatible with a range of audio and visual representations, and sharpens discrimination through a contrastive learning framework with a margin-based loss and cross-speaker unsynchronized pairs. Result: UniSync outperforms existing methods on standard datasets and performs well across diverse audio-visual representations. Conclusion: Integrated into talking face generation frameworks, UniSync improves synchronization quality in both natural and AI-generated content, showing its versatility across audio-visual representations. Abstract: Precise audio-visual synchronization in speech videos is crucial for content quality and viewer comprehension. Existing methods have made significant strides in addressing this challenge through rule-based approaches and end-to-end learning techniques. However, these methods often rely on limited audio-visual representations and suboptimal learning strategies, potentially constraining their effectiveness in more complex scenarios. To address these limitations, we present UniSync, a novel approach for evaluating audio-visual synchronization using embedding similarities. UniSync offers broad compatibility with various audio representations (e.g., Mel spectrograms, HuBERT) and visual representations (e.g., RGB images, face parsing maps, facial landmarks, 3DMM), effectively handling their significant dimensional differences. We enhance the contrastive learning framework with a margin-based loss component and cross-speaker unsynchronized pairs, improving discriminative capabilities. UniSync outperforms existing methods on standard datasets and demonstrates versatility across diverse audio-visual representations. Its integration into talking face generation frameworks enhances synchronization quality in both natural and AI-generated content.
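The abstract describes scoring synchronization with embedding similarities and training with a margin-based contrastive loss, without giving the exact formula. The sketch below assumes a triplet-style hinge on cosine similarity, which is one common form of margin loss and not necessarily the one UniSync uses. The function names and the `margin` value are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def margin_sync_loss(audio, video_pos, video_neg, margin=0.5):
    """Hinge loss: a synchronized pair should score higher than an
    unsynchronized pair by at least `margin` (assumed formulation)."""
    pos = cosine_similarity(audio, video_pos)
    neg = cosine_similarity(audio, video_neg)
    return float(np.mean(np.maximum(0.0, margin - pos + neg)))

audio = np.array([[1.0, 0.0], [0.0, 1.0]])
in_sync = audio.copy()                          # embeddings that match the audio
out_of_sync = np.array([[0.0, 1.0], [1.0, 0.0]])  # e.g. clips from another speaker

loss_good = margin_sync_loss(audio, in_sync, out_of_sync)  # pairs well separated
loss_bad = margin_sync_loss(audio, out_of_sync, in_sync)   # pairs swapped
```

In this toy case the well-separated pairs incur no loss, while swapping positives and negatives is penalized. At inference time, the cosine similarity alone would serve as the synchronization score.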

JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Muyao Li,Zihao Wang,Kaichen He,Xiaojian Ma,Yitao Liang

Task: Improve Visual Language Models (VLMs) in a self-supervised manner through visual and linguistic guidance, strengthening their world knowledge, visual recognition, and spatial grounding in open-world environments.

Motivation: Existing Visual Language Action (VLA) work focuses mainly on action post-training and neglects improvements to the foundation model itself.

Details

Method: Proposes Act from Visual Language Post-Training, which refines VLMs through visual and linguistic guidance in a self-supervised manner. Result: Yields the first VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks; post-training on non-trajectory tasks gives a 40% improvement over the best agent baseline. Conclusion: The approach surpasses traditional imitation learning-based policies in Minecraft and achieves state-of-the-art performance; the code, models, and datasets are open-sourced to foster further research. Abstract: Recently, action-based decision-making in open-world environments has gained significant attention. Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has primarily focused on action post-training, often neglecting enhancements to the foundational model itself. In response, we introduce a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This enhancement improves the models' capabilities in world knowledge, visual recognition, and spatial grounding in open-world environments. Following the above post-training paradigms, we obtain the first VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, we demonstrate that our approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance. We have open-sourced the code, models, and datasets to foster further research. The project page can be found at https://craftjarvis.github.io/JarvisVLA.

NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes

Han-Hung Lee,Qinghong Han,Angel X. Chang

Task: Generate expansive outdoor scenes, ranging from castles to high-rises.

Motivation: Unlike indoor scene generation, outdoor scene generation poses unique challenges, including wide variations in scene height and the need for a method that can rapidly produce large landscapes.

Details

Method: Proposes an efficient approach that encodes scene chunks as uniform vector sets, offering better compression and performance than the spatially structured latents used in prior methods. It also trains an explicit outpainting model for unbounded generation, which improves coherence over prior resampling-based inpainting schemes and speeds up generation by eliminating extra diffusion steps. Result: When trained on scenes of varying styles, the model can blend different environments, such as rural houses and city skyscrapers, within the same scene, showing the potential of leveraging heterogeneous scenes for joint training. Conclusion: The proposed method generates large outdoor landscapes efficiently and coherently, and demonstrates the potential of joint training on heterogeneous scenes. Abstract: In this paper, we explore the task of generating expansive outdoor scenes, ranging from castles to high-rises. Unlike indoor scene generation, which has been a primary focus of prior work, outdoor scene generation presents unique challenges, including wide variations in scene heights and the need for a method capable of rapidly producing large landscapes. To address this, we propose an efficient approach that encodes scene chunks as uniform vector sets, offering better compression and performance than the spatially structured latents used in prior methods. Furthermore, we train an explicit outpainting model for unbounded generation, which improves coherence compared to prior resampling-based inpainting schemes while also speeding up generation by eliminating extra diffusion steps. To facilitate this task, we curate NuiScene43, a small but high-quality set of scenes, preprocessed for joint training. Notably, when trained on scenes of varying styles, our model can blend different environments, such as rural houses and city skyscrapers, within the same scene, highlighting the potential of our curation process to leverage heterogeneous scenes for joint training.

Leyang Wang,Joice Lin

Task: Propose LLM-assisted Paired Image Generation (LaPIG), a framework for generating high-quality paired visible and thermal images.

Motivation: The success of modern machine learning, particularly facial translation networks, depends heavily on high-quality, paired, large-scale datasets, yet acquiring sufficient data is often challenging and costly.

Details

Method: The method has three parts: visible image synthesis with ArcFace embedding, thermal image translation using Latent Diffusion Models (LDMs), and caption generation with Large Language Models (LLMs). Result: The approach generates multi-view paired visible and thermal images to increase data diversity and produces high-quality paired data while preserving identity information. Conclusion: Comparisons with existing methods on public datasets demonstrate the superiority of LaPIG. Abstract: The success of modern machine learning, particularly in facial translation networks, is highly dependent on the availability of high-quality, paired, large-scale datasets. However, acquiring sufficient data is often challenging and costly. Inspired by the recent success of diffusion models in high-quality image synthesis and advancements in Large Language Models (LLMs), we propose a novel framework called LLM-assisted Paired Image Generation (LaPIG). This framework enables the construction of comprehensive, high-quality paired visible and thermal images using captions generated by LLMs. Our method encompasses three parts: visible image synthesis with ArcFace embedding, thermal image translation using Latent Diffusion Models (LDMs), and caption generation with LLMs. Our approach not only generates multi-view paired visible and thermal images to increase data diversity but also produces high-quality paired data while maintaining their identity information. We evaluate our method on public datasets by comparing it with existing methods, demonstrating the superiority of LaPIG.

Panoptic-CUDAL Technical Report: Rural Australia Point Cloud Dataset in Rainy Conditions

Tzu-Yun Tseng,Alexey Nekrasov,Malcolm Burdorf,Bastian Leibe,Julie Stephany Berrio,Mao Shan,Stewart Worrall

Task: Introduce and analyze the Panoptic-CUDAL dataset, purpose-built for panoptic segmentation in rural areas under rainy conditions.

Motivation: Existing autonomous driving datasets are oriented mainly toward well-structured urban settings and favorable weather, leaving the complexities of rural environments and adverse weather largely unaddressed.

Details

Method: By recording high-resolution LiDAR, camera, and pose data, Panoptic-CUDAL provides a diverse, information-rich dataset. Result: Presents an analysis of the recorded data and baseline results for panoptic and semantic segmentation methods on LiDAR point clouds. Conclusion: The Panoptic-CUDAL dataset provides important data support for autonomous driving research in rural areas under rainy conditions. Abstract: Existing autonomous driving datasets are predominantly oriented towards well-structured urban settings and favorable weather conditions, leaving the complexities of rural environments and adverse weather conditions largely unaddressed. Although some datasets encompass variations in weather and lighting, bad weather scenarios do not appear often. Rainfall can significantly impair sensor functionality, introducing noise and reflections in LiDAR and camera data and reducing the system's capabilities for reliable environmental perception and safe navigation. We introduce the Panoptic-CUDAL dataset, a novel dataset purpose-built for panoptic segmentation in rural areas subject to rain. By recording high-resolution LiDAR, camera, and pose data, Panoptic-CUDAL offers a diverse, information-rich dataset in a challenging scenario. We present analysis of the recorded data and provide baseline results for panoptic and semantic segmentation methods on LiDAR point clouds. The dataset can be found here: https://robotics.sydney.edu.au/our-research/intelligent-transportation-systems/

Akhil Perincherry,Jacob Krantz,Stefan Lee

Task: Study whether visual representations of sub-goals can serve as navigational cues and improve navigation performance.

Motivation: Improve the performance of Vision-and-Language Navigation (VLN) agents, which navigate unseen environments using natural language instructions.

Details

Method: Uses a text-to-image diffusion model to generate visual representations of the landmark references contained in the instructions, provides them to VLN agents as an added modality, and adds an auxiliary loss that explicitly encourages relating them to their corresponding referring expressions. Result: Success rate (SR) increases by around 1 point, and success scaled by inverse path length (SPL) by up to 0.5 points. Conclusion: The proposed approach reinforces visual understanding and improves navigation performance compared with relying on language instructions alone. Abstract: Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study if visual representations of sub-goals implied by the instructions can serve as navigational cues and lead to increased navigation performance. To synthesize these visual representations or imaginations, we leverage a text-to-image diffusion model on landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an added modality to act as landmark cues and an auxiliary loss is added to explicitly encourage relating these with their corresponding referring expressions. Our findings reveal an increase in success rate (SR) of around 1 point and up to 0.5 points in success scaled by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding compared to relying on language instructions alone. Code and data for our work can be found at https://www.akhilperincherry.com/VLN-Imagine-website/.
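The reported gains are in success rate (SR) and success scaled by inverse path length (SPL). The abstract does not spell the metrics out, so the sketch below uses the definition of SPL commonly adopted in embodied navigation work: each successful episode is weighted by the ratio of the shortest-path length to the length of the path the agent actually took. The episode values are made up for illustration.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success (each entry is 0 or 1)."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by (normalized inverse) path length.

    Each episode contributes S_i * l_i / max(p_i, l_i), where l_i is the
    shortest-path length and p_i the length of the agent's path.
    Assumes every shortest-path length is positive.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# Three illustrative episodes: optimal success, detour success, failure.
successes = [1, 1, 0]
shortest = [10.0, 10.0, 10.0]
taken = [10.0, 20.0, 15.0]

sr_value = success_rate(successes)
spl_value = spl(successes, shortest, taken)
```

Here two of three episodes succeed, but the detour halves the second episode's contribution, so SPL comes out lower than SR. Both metrics are usually reported as percentages, which is why improvements are quoted in points.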

SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation

Chun-Han Yao,Yiming Xie,Vikram Voleti,Huaizu Jiang,Varun Jampani

Task: A multi-view video diffusion model for dynamic 3D asset generation.

Motivation: Improve robustness to occlusions and large motion, generalize better to real-world videos, and produce higher-quality outputs in terms of detail sharpness and spatio-temporal consistency.

Details

Method: 1) Network architecture: removes the dependency on reference multi-views and designs a blending mechanism for 3D and frame attention; 2) Data: improves the quality and quantity of training data; 3) Training strategy: adopts progressive 3D-4D training for better generalization; 4) 4D optimization: handles 3D inconsistency and large motion via two-stage refinement and progressive frame sampling. Result: Shows significant gains both visually and quantitatively over SV4D, with better detail (-14% LPIPS) and 4D consistency (-44% FV4D) in novel-view video synthesis, and -12% LPIPS and -24% FV4D in 4D optimization. Conclusion: SV4D 2.0 performs well in dynamic 3D asset generation, with markedly better detail and consistency. Abstract: We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared to its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency. We achieve this by introducing key improvements in multiple aspects: 1) network architecture: eliminating the dependency on reference multi-views and designing a blending mechanism for 3D and frame attention, 2) data: enhancing quality and quantity of training data, 3) training strategy: adopting progressive 3D-4D training for better generalization, and 4) 4D optimization: handling 3D inconsistency and large motion via 2-stage refinement and progressive frame sampling. Extensive experiments demonstrate significant performance gain by SV4D 2.0 both visually and quantitatively, achieving better detail (-14% LPIPS) and 4D consistency (-44% FV4D) in novel-view video synthesis and 4D optimization (-12% LPIPS and -24% FV4D) compared to SV4D. Project page: https://sv4d2.0.github.io.

Scale-wise Distillation of Diffusion Models

Nikita Starodubcev,Denis Kuznedelev,Artem Babenko,Dmitry Baranchuk

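Task: Propose SwD, a scale-wise distillation framework for diffusion models that applies next-scale prediction ideas to few-step generators.

Motivation: Building on recent insights relating diffusion processes to implicit spectral autoregression, the authors hypothesize that diffusion models can start generation at lower data resolutions and gradually upscale the samples at each denoising step, significantly reducing computational cost without loss in performance.

Details

Method: SwD integrates this idea into existing diffusion distillation methods based on distribution matching and introduces a novel patch loss that enforces finer-grained similarity to the target distribution. Result: Applied to state-of-the-art text-to-image diffusion models, SwD approaches the inference time of two full-resolution steps and significantly outperforms counterparts under the same computation budget, as evidenced by automated metrics and human preference studies. Conclusion: SwD significantly reduces computational cost while maintaining generation quality, offering a new direction for optimizing diffusion models. Abstract: We present SwD, a scale-wise distillation framework for diffusion models (DMs), which effectively employs next-scale prediction ideas for diffusion-based few-step generators. In more detail, SwD is inspired by the recent insights relating diffusion processes to the implicit spectral autoregression. We suppose that DMs can initiate generation at lower data resolutions and gradually upscale the samples at each denoising step without loss in performance while significantly reducing computational costs. SwD naturally integrates this idea into existing diffusion distillation methods based on distribution matching. Also, we enrich the family of distribution matching approaches by introducing a novel patch loss enforcing finer-grained similarity to the target distribution. When applied to state-of-the-art text-to-image diffusion models, SwD approaches the inference times of two full resolution steps and significantly outperforms the counterparts under the same computation budget, as evidenced by automated metrics and human preference studies.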
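To see why starting at low resolution and upscaling at each denoising step can bring a multi-step sampler close to the cost of two full-resolution steps, a back-of-the-envelope estimate helps. The sketch below assumes that one step costs an amount proportional to the number of pixels, which ignores the superlinear cost of attention, and it uses a made-up six-step scale schedule. Neither assumption comes from the abstract.

```python
def relative_cost(scales):
    """Total sampling cost in units of one full-resolution step.

    Assumption: a denoising step at side-length fraction `scale` costs
    scale ** 2 of a full-resolution step (cost proportional to pixel count).
    """
    return sum(scale ** 2 for scale in scales)

# Hypothetical schedule: side length as a fraction of the target resolution.
schedule = [0.25, 0.375, 0.5, 0.625, 0.75, 1.0]
total = relative_cost(schedule)

print(f"{len(schedule)} scale-wise steps cost about {total:.2f} full-resolution steps")
```

Under these assumptions, six scale-wise steps cost roughly 2.4 full-resolution steps, against 6.0 for the same number of steps run entirely at the target resolution.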

SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World

Chen Chen,Zhirui Wang,Taowei Sheng,Yi Jiang,Yundu Li,Peirui Cheng,Luning Zhang,Kaiqiang Chen,Yanfeng Hu,Xue Yang,Xian Sun

Task: Propose SA-Occ, a satellite-assisted 3D occupancy prediction model that addresses the limitation of existing methods relying only on street-view imagery.

Motivation: Existing vision-based 3D occupancy prediction methods are limited in accuracy because they rely exclusively on street-view imagery and neglect the potential benefits of satellite views.

Details

Method: Proposes SA-Occ, which uses GPS and IMU to integrate historical satellite imagery into real-time applications, and addresses inconsistencies in dynamic regions, 3D feature extraction from 2D satellite imagery, and alignment of sampling density between street and satellite views. Result: On Occ3D-nuScenes, SA-Occ achieves state-of-the-art performance, especially among single-frame methods, with a 39.05% mIoU (a 6.97% improvement) while adding only 6.93 ms of latency per frame. Conclusion: By incorporating satellite views, SA-Occ mitigates the limitations of ego-vehicle perception and significantly improves 3D occupancy prediction accuracy. Abstract: Existing vision-based 3D occupancy prediction methods are inherently limited in accuracy due to their exclusive reliance on street-view imagery, neglecting the potential benefits of incorporating satellite views. We propose SA-Occ, the first Satellite-Assisted 3D occupancy prediction model, which leverages GPS & IMU to integrate historical yet readily available satellite imagery into real-time applications, effectively mitigating limitations of ego-vehicle perceptions, involving occlusions and degraded performance in distant regions. To address the core challenges of cross-view perception, we propose: 1) Dynamic-Decoupling Fusion, which resolves inconsistencies in dynamic regions caused by the temporal asynchrony between satellite and street views; 2) 3D-Proj Guidance, a module that enhances 3D feature extraction from inherently 2D satellite imagery; and 3) Uniform Sampling Alignment, which aligns the sampling density between street and satellite views. Evaluated on Occ3D-nuScenes, SA-Occ achieves state-of-the-art performance, especially among single-frame methods, with a 39.05% mIoU (a 6.97% improvement), while incurring only 6.93 ms of additional latency per frame. Our code and newly curated dataset are available at https://github.com/chenchen235/SA-Occ.

DreamTexture: Shape from Virtual Texture with Analysis by Augmentation

Ananta R. Bhattarai,Xingzhe He,Alla Sheffer,Helge Rhodin

Task: Propose a new approach that reconstructs 3D objects from monocular depth cues.

Motivation: DreamFusion's multi-view rendering, together with supervision from large-scale generative models, is computationally expensive and under-constrained.

Details

Method: Textures the input image by aligning a virtual texture with the real depth cues in the input, then reconstructs depth from the deformation of the virtual texture using a new conformal map optimization. Result: Experiments show that generative models possess an understanding of monocular shape cues, which can be extracted by augmenting and aligning texture cues. Conclusion: Introduces a new monocular reconstruction paradigm called Analysis by Augmentation. Abstract: DreamFusion established a new paradigm for unsupervised 3D reconstruction from virtual views by combining advances in generative models and differentiable rendering. However, the underlying multi-view rendering, along with supervision from large-scale generative models, is computationally expensive and under-constrained. We propose DreamTexture, a novel Shape-from-Virtual-Texture approach that leverages monocular depth cues to reconstruct 3D objects. Our method textures an input image by aligning a virtual texture with the real depth cues in the input, exploiting the inherent understanding of monocular geometry encoded in modern diffusion models. We then reconstruct depth from the virtual texture deformation with a new conformal map optimization, which alleviates memory-intensive volumetric representations. Our experiments reveal that generative models possess an understanding of monocular shape cues, which can be extracted by augmenting and aligning texture cues -- a novel monocular reconstruction paradigm that we call Analysis by Augmentation.

M3: 3D-Spatial MultiModal Memory

Xueyan Zou,Yuchen Song,Ri-Zhao Qiu,Xuanbin Peng,Jianglong Ye,Sifei Liu,Xiaolong Wang

Task: Design and validate 3D Spatial MultiModal Memory (M3), a system that retains information about medium-sized static scenes from video sources.

Motivation: Address the computational constraints and feature misalignment encountered in prior work on feature splatting.

Details

Method: Integrates 3D Gaussian Splatting techniques with foundation models to build a multimodal memory system, with principal scene components and Gaussian memory attention as its key components. Result: Comprehensive quantitative evaluations and qualitative visualizations validate M3, and a deployment on a quadruped robot demonstrates its use in real-world scenes. Conclusion: The authors claim M3 is the first work to address the core compression challenges in 3D feature distillation, and it shows promise as a multimodal memory system. Abstract: We present 3D Spatial MultiModal Memory (M3), a multimodal memory system designed to retain information about medium-sized static scenes through video sources for visual perception. By integrating 3D Gaussian Splatting techniques with foundation models, M3 builds a multimodal memory capable of rendering feature representations across granularities, encompassing a wide range of knowledge. In our exploration, we identify two key challenges in previous works on feature splatting: (1) computational constraints in storing high-dimensional features for each Gaussian primitive, and (2) misalignment or information loss between distilled features and foundation model features. To address these challenges, we propose M3 with key components of principal scene components and Gaussian memory attention, enabling efficient training and inference. To validate M3, we conduct comprehensive quantitative evaluations of feature similarity and downstream tasks, as well as qualitative visualizations to highlight the pixel trace of Gaussian memory attention. Our approach encompasses a diverse range of foundation models, including vision-language models (VLMs), perception models, and large multimodal and language models (LMMs/LLMs). Furthermore, to demonstrate real-world applicability, we deploy M3's feature field in indoor scenes on a quadruped robot. Notably, we claim that M3 is the first work to address the core compression challenges in 3D feature distillation.

InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity

Liming Jiang,Qing Yan,Yumin Jia,Zichuan Liu,Hao Kang,Xin Lu

Task: Achieve flexible, high-fidelity identity-preserved image generation.

Motivation: Existing methods fall short in identity similarity, text-image alignment, and generation quality and aesthetics.

Details

Method: Introduces InfuseNet, a component that injects identity features into the DiT base model via residual connections, and adopts a multi-stage training strategy that includes pretraining and supervised fine-tuning. Result: In experiments, InfU surpasses existing baselines and achieves state-of-the-art performance. Conclusion: InfU's plug-and-play design ensures compatibility with various existing methods, offering a valuable contribution to the broader community. Abstract: Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX. We introduce InfiniteYou (InfU), one of the earliest robust frameworks leveraging DiTs for this task. InfU addresses significant issues of existing methods, such as insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. Central to InfU is InfuseNet, a component that injects identity features into the DiT base model via residual connections, enhancing identity similarity while maintaining generation capabilities. A multi-stage training strategy, including pretraining and supervised fine-tuning (SFT) with synthetic single-person-multiple-sample (SPMS) data, further improves text-image alignment, ameliorates image quality, and alleviates face copy-pasting. Extensive experiments demonstrate that InfU achieves state-of-the-art performance, surpassing existing baselines. In addition, the plug-and-play design of InfU ensures compatibility with various existing methods, offering a valuable contribution to the broader community.

SynCity: Training-Free Generation of 3D Worlds

Paul Engstler,Aleksandar Shtedritski,Iro Laina,Christian Rupprecht,Andrea Vedaldi

Task: Generate 3D worlds from textual descriptions.

Motivation: Address the inability of existing 3D generative models to produce large-scale worlds.

Details

Method: Proposes SynCity, which combines the geometric precision of pre-trained 3D generative models with the artistic versatility of 2D image generators and uses a tile-based approach to generate large, high-quality 3D scenes. Result: Generates immersive scenes that are rich in detail and diversity. Conclusion: SynCity produces high-quality, expandable 3D scenes and overcomes the limitations of existing 3D generative models. Abstract: We address the challenge of generating 3D worlds from textual descriptions. We propose SynCity, a training- and optimization-free approach, which leverages the geometric precision of pre-trained 3D generative models and the artistic versatility of 2D image generators to create large, high-quality 3D spaces. While most 3D generative models are object-centric and cannot generate large-scale worlds, we show how 3D and 2D generators can be combined to generate ever-expanding scenes. Through a tile-based approach, we allow fine-grained control over the layout and the appearance of scenes. The world is generated tile-by-tile, and each new tile is generated within its world-context and then fused with the scene. SynCity generates compelling and immersive scenes that are rich in detail and diversity.

MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

Quanhao Li,Zhen Xing,Rui Wang,Hui Zhang,Qi Dai,Zuxuan Wu

Task: Propose MagicMotion, an image-to-video generation framework that enables trajectory control through conditions at three levels of density: masks, bounding boxes, and sparse boxes.

Motivation: Existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and degraded visual quality. They also support trajectory control in only a single format, and no dedicated dataset or benchmark exists.

Details

Method: Introduces the MagicMotion framework, which controls trajectories through masks, bounding boxes, and sparse boxes, and presents the MagicData dataset and the MagicBench benchmark. Result: Experiments show that MagicMotion outperforms previous methods across various metrics. Conclusion: MagicMotion performs well in trajectory-controlled video generation, delivering high-quality video with precise trajectory control. Abstract: Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present MagicData, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce MagicBench, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics. Our project page is publicly available at https://quanhaol.github.io/magicmotion-site.
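The three condition levels named in the abstract (masks, bounding boxes, sparse boxes) carry progressively less information about the same trajectory. A minimal sketch of how the sparser conditions could be derived from the denser ones, assuming per-frame binary masks as the starting point: the helper names, the `(x_min, y_min, x_max, y_max)` box convention, and the keep-every-`stride`-th-frame rule are illustrative choices, not the paper's pipeline.

```python
import numpy as np

def mask_to_box(mask):
    """Tightest (x_min, y_min, x_max, y_max) box around a non-empty binary mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def sparsify(boxes, stride):
    """Keep a box only on every `stride`-th frame; other frames get None."""
    return [box if i % stride == 0 else None for i, box in enumerate(boxes)]

# Toy trajectory: a 2x2 object moving one pixel to the right per frame.
masks = []
for t in range(4):
    frame = np.zeros((6, 8), dtype=bool)
    frame[2:4, t:t + 2] = True
    masks.append(frame)

boxes = [mask_to_box(m) for m in masks]   # dense masks -> per-frame boxes
sparse_boxes = sparsify(boxes, stride=3)  # boxes on key frames only
```

Each step discards detail: the box forgets the object's shape, and the sparse version forgets where the object was between key frames, so the generator has to fill in more of the motion itself.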

1000+ FPS 4D Gaussian Splatting for Dynamic Scene Rendering

Yuheng Yuan,Qiuhong Shen,Xingyi Yang,Xinchao Wang

Task: Propose 4DGS-1K to address the storage and rendering-speed problems of 4D Gaussian Splatting (4DGS) in dynamic scene reconstruction.

Motivation: Although 4DGS achieves superior quality in dynamic scene reconstruction, it requires substantial storage and renders slowly.

Details

Method: Introduces the Spatial-Temporal Variation Score to prune short-lifespan Gaussians, and stores a mask of active Gaussians across consecutive frames to reduce redundant computation. Result: On complex dynamic scenes, 4DGS-1K achieves a 41x reduction in storage and 9x faster rasterization while maintaining comparable visual quality. Conclusion: 4DGS-1K effectively addresses the storage and rendering-speed problems of 4DGS and substantially improves performance. Abstract: 4D Gaussian Splatting (4DGS) has recently gained considerable attention as a method for reconstructing dynamic scenes. Despite achieving superior quality, 4DGS typically requires substantial storage and suffers from slow rendering speed. In this work, we delve into these issues and identify two key sources of temporal redundancy. (Q1) Short-Lifespan Gaussians: 4DGS uses a large portion of Gaussians with short temporal span to represent scene dynamics, leading to an excessive number of Gaussians. (Q2) Inactive Gaussians: When rendering, only a small subset of Gaussians contributes to each frame. Despite this, all Gaussians are processed during rasterization, resulting in redundant computation overhead. To address these redundancies, we present 4DGS-1K, which runs at over 1000 FPS on modern GPUs. For Q1, we introduce the Spatial-Temporal Variation Score, a new pruning criterion that effectively removes short-lifespan Gaussians while encouraging 4DGS to capture scene dynamics using Gaussians with longer temporal spans. For Q2, we store a mask for active Gaussians across consecutive frames, significantly reducing redundant computations in rendering. Compared to vanilla 4DGS, our method achieves a 41x reduction in storage and 9x faster rasterization speed on complex dynamic scenes, while maintaining comparable visual quality. Please see our project page at https://4DGS-1K.github.io.
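The Q2 remedy, storing a mask of active Gaussians across consecutive frames, can be pictured as a per-window filter applied before rasterization. The sketch below assumes a precomputed boolean table recording which Gaussians contribute to which frame. How the paper computes and stores that information is not stated in the abstract, and `active_mask` is a hypothetical helper.

```python
import numpy as np

def active_mask(visibility, start, window):
    """Union of per-frame visibility over `window` consecutive frames.

    visibility: bool array of shape (num_frames, num_gaussians), where
    visibility[t, g] is True if Gaussian g contributes to frame t.
    """
    return visibility[start:start + window].any(axis=0)

# Toy table: 3 frames, 4 Gaussians.
visibility = np.array([
    [True,  False, False, False],
    [True,  True,  False, False],
    [False, True,  False, True],
])

mask = active_mask(visibility, start=0, window=2)
active_ids = np.flatnonzero(mask)  # only these would be sent to the rasterizer
```

For frames 0 and 1 only Gaussians 0 and 1 are kept, so half of the primitives are skipped. The saving grows with the share of Gaussians that are inactive in a given window.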

GAEA: A Geolocation Aware Conversational Model

Ron Campos,Ashmal Vayani,Parth Parag Kulkarni,Rohit Gupta,Aritra Dutta,Mubarak Shah

Task: Propose GAEA, a conversational model that can provide information about the location of an image, addressing the shortcomings of existing large models on image geolocalization.

Motivation: In image geolocalization, existing AI models can only predict GPS coordinates; they lack an understanding of the location and the conversational ability to communicate with the user.

Details

Method: Introduces the conversational model GAEA and builds a dataset of 800K images and around 1.6M question-answer pairs by leveraging OpenStreetMap (OSM) attributes and geographical context clues. Result: In the evaluation of conversational capabilities, GAEA significantly outperforms existing open-source and proprietary models, beating LLaVA-OneVision by 25.69% and GPT-4o by 8.28%. Conclusion: GAEA performs well on image geolocalization and provides richer location information together with conversational ability. Abstract: Image geolocalization, in which, traditionally, an AI model predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot utilize the model to further their knowledge other than the GPS coordinate; the model lacks an understanding of the location and the conversational ability to communicate with the user. In recent days, with tremendous progress of large multimodal models (LMMs), both proprietary and open-source, researchers have attempted to geolocalize images via LMMs. However, the issues remain unaddressed; beyond general tasks, for more specialized downstream tasks, one of which is geolocalization, LMMs struggle. In this work, we propose to solve this problem by introducing a conversational model GAEA that can provide information regarding the location of an image, as required by a user. No large-scale dataset enabling the training of such a model exists. Thus we propose a comprehensive dataset GAEA with 800K images and around 1.6M question answer pairs constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose a diverse benchmark comprising 4K image-text pairs to evaluate conversational capabilities equipped with diverse question types. We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision by 25.69% and the best proprietary model, GPT-4o by 8.28%. Our dataset, model and codes are available

Tokenize Image as a Set

Zigang Geng,Mengde Xu,Han Hu,Shuyang Gu

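Task: Propose a new paradigm for image generation based on set-based tokenization and distribution modeling.

Motivation: Conventional methods serialize images into fixed-position latent codes with a uniform compression ratio and cannot allocate coding capacity dynamically.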

Details

Method: Introduce an unordered token set representation that allocates coding capacity dynamically, together with a dual transformation mechanism and the Fixed-Sum Discrete Diffusion framework. Result: Experiments show the method's superiority in semantic-aware representation and generation quality. Conclusion: Novel representation and modeling strategies advance visual generation beyond the traditional sequential token paradigm. Abstract: This paper proposes a fundamentally new paradigm for image generation through set-based tokenization and distribution modeling. Unlike conventional methods that serialize images into fixed-position latent codes with a uniform compression ratio, we introduce an unordered token set representation to dynamically allocate coding capacity based on regional semantic complexity. This TokenSet enhances global context aggregation and improves robustness against local perturbations. To address the critical challenge of modeling discrete sets, we devise a dual transformation mechanism that bijectively converts sets into fixed-length integer sequences with summation constraints. Further, we propose Fixed-Sum Discrete Diffusion--the first framework to simultaneously handle discrete values, fixed sequence length, and summation invariance--enabling effective set distribution modeling. Experiments demonstrate our method's superiority in semantic-aware representation and generation quality. Our innovations, spanning novel representation and modeling strategies, advance visual generation beyond traditional sequential token paradigms. Our code and models are publicly available at https://github.com/Gengzigang/TokenSet.
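The abstract describes a bijection between an unordered token set and a fixed-length integer sequence with a summation constraint. One simple mapping with these properties, offered here only as an illustrative sketch and not as the authors' implementation, is a count vector over the codebook: its length equals the codebook size and its entries sum to the number of tokens. The function names and the toy codebook size below are hypothetical.

```python
def set_to_counts(tokens, codebook_size):
    """Map an unordered multiset of token ids to a count vector.

    The result always has length `codebook_size`, and its entries
    sum to len(tokens), so the order of `tokens` does not matter.
    """
    counts = [0] * codebook_size
    for token in tokens:
        counts[token] += 1
    return counts


def counts_to_set(counts):
    """Inverse map: rebuild the multiset (as a sorted list) from counts."""
    tokens = []
    for token_id, count in enumerate(counts):
        tokens.extend([token_id] * count)
    return tokens


# Toy example (assumed values): 6 tokens drawn from a codebook of size 5.
tokens = [3, 1, 3, 0, 1, 3]
counts = set_to_counts(tokens, codebook_size=5)
print(counts, sum(counts))
```

Because any permutation of `tokens` yields the same count vector, a generative model over such vectors would model the set rather than a sequence; the fixed sum is the constraint a fixed-sum diffusion process would need to preserve.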

DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding

Keyan Chen,Chenyang Liu,Bowen Chen,Wenyuan Li,Zhengxia Zou,Zhenwei Shi

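Task: Propose DynamicVis, a dynamic visual perception foundation model for remote sensing image understanding.

Motivation: Existing methods generalize poorly across tasks and mainly process low-resolution imagery, failing to fully exploit high-resolution data and large-scene semantics.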

Details

Method: Propose a dynamic region perception backbone based on the selective state space model, combined with a multi-instance learning paradigm that uses meta-embedding representations for cross-task knowledge transfer. Result: DynamicVis performs well across nine downstream tasks, processing 2048x2048-pixel images with 97 ms latency and 833 MB of GPU memory. Conclusion: DynamicVis models multi-level features efficiently, with excellent computational efficiency and architectural scalability. Abstract: The advancement of remote sensing technology has improved the spatial resolution of satellite imagery, facilitating more detailed visual representations for diverse interpretations. However, existing methods exhibit limited generalization capabilities across varied applications. While some contemporary foundation models demonstrate potential, they are hindered by insufficient cross-task adaptability and primarily process low-resolution imagery of restricted sizes, thus failing to fully exploit high-resolution data or leverage comprehensive large-scene semantics. Crucially, remote sensing imagery differs fundamentally from natural images, as key foreground targets (e.g., maritime objects, artificial structures) often occupy minimal spatial proportions (~1%) and exhibit sparse distributions. Efficiently modeling cross-task generalizable knowledge from lengthy 2D tokens (~100,000) poses a significant challenge yet remains critical for remote sensing image understanding. Motivated by the selective attention mechanisms inherent to the human visual system, we propose DynamicVis, a dynamic visual perception foundation model for remote sensing imagery. The framework integrates a novel dynamic region perception backbone based on the selective state space model, which strategically balances localized detail extraction with global contextual integration, enabling computationally efficient encoding of large-scale data while maintaining architectural scalability. To enhance cross-task knowledge transfer, we introduce a multi-instance learning paradigm utilizing meta-embedding representations, trained on million-scale region-level annotations. Evaluations across nine downstream tasks demonstrate the model's versatility. DynamicVis achieves multi-level feature modeling with exceptional efficiency, processing (2048x2048) pixels with 97 ms latency (6% of ViT's) and 833 MB GPU memory (3% of ViT's).

Sonata: Self-Supervised Learning of Reliable Point Representations

Xiaoyang Wu,Daniel DeTone,Duncan Frost,Tianwei Shen,Chris Xie,Nan Yang,Jakob Engel,Richard Newcombe,Hengshuang Zhao,Julian Straub

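Task: Develop a reliable self-supervised point cloud model that handles diverse 3D tasks via simple linear probing.

Motivation: Existing 3D self-supervised learning approaches fall short when representation quality is evaluated through linear probing, mainly because a geometric shortcut causes representations to collapse to low-level spatial features.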

Details

Method: Address the problem with two key strategies, obscuring spatial information and enhancing reliance on input features, and compose Sonata from 140k point clouds through self-distillation. Result: Sonata raises linear probing accuracy on ScanNet from 21.8% to 72.5% and nearly doubles performance when only 1% of the data is used. Conclusion: Sonata demonstrates exceptional parameter and data efficiency, and full fine-tuning further advances the state of the art on 3D indoor and outdoor perception tasks. Abstract: In this paper, we question whether we have a reliable self-supervised point cloud model that can be used for diverse 3D tasks via simple linear probing, even with limited data and minimal computation. We find that existing 3D self-supervised learning approaches fall short when evaluated on representation quality through linear probing. We hypothesize that this is due to what we term the "geometric shortcut", which causes representations to collapse to low-level spatial features. This challenge is unique to 3D and arises from the sparse nature of point cloud data. We address it through two key strategies: obscuring spatial information and enhancing the reliance on input features, ultimately composing a Sonata of 140k point clouds through self-distillation. Sonata is simple and intuitive, yet its learned representations are strong and reliable: zero-shot visualizations demonstrate semantic grouping, alongside strong spatial reasoning through nearest-neighbor relationships. Sonata demonstrates exceptional parameter and data efficiency, tripling linear probing accuracy (from 21.8% to 72.5%) on ScanNet and nearly doubling performance with only 1% of the data compared to previous approaches. Full fine-tuning further advances SOTA across both 3D indoor and outdoor perception tasks.

Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

Yuqing Wang,Zhijie Lin,Yao Teng,Yuanzhi Zhu,Shuhuai Ren,Jiashi Feng,Xihui Liu

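Task: Propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens.

Motivation: Address the trade-off between discrete and continuous tokens in autoregressive visual generation: discrete tokens are simple to model but suffer from information loss and training instability, while continuous tokens better preserve visual details but require complex distribution modeling.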

Details

Method: Obtain discrete tokens directly from continuous representations through post-training quantization, introduce a dimension-wise quantization strategy that discretizes each feature dimension independently, and pair it with a lightweight autoregressive prediction mechanism that efficiently models the large token space. Result: Experiments show that the method achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. Conclusion: Bridging the discrete and continuous paradigms effectively combines the strengths of both and offers a promising direction for high-quality visual generation. Abstract: Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from the tokenizer training process through post-training quantization that directly obtains discrete tokens from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently models the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging discrete and continuous paradigms can effectively harness the strengths of both approaches, providing a promising direction for high-quality visual generation with simple autoregressive modeling. Project page: https://yuqingwang1029.github.io/TokenBridge.
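To make the idea of dimension-wise post-training quantization concrete, the sketch below discretizes each feature dimension of a continuous token independently. It assumes uniform bins on a fixed range, which is a simplification chosen for illustration; the abstract does not specify the binning scheme, and the function names, range, and number of levels are hypothetical.

```python
def quantize_dim(value, lo, hi, levels):
    """Map one scalar to a bin index in [0, levels - 1] using uniform bins."""
    clipped = min(max(value, lo), hi)      # clamp out-of-range values
    step = (hi - lo) / levels
    return min(int((clipped - lo) / step), levels - 1)


def dequantize_dim(index, lo, hi, levels):
    """Map a bin index back to the center of its bin."""
    step = (hi - lo) / levels
    return lo + (index + 0.5) * step


def quantize_vector(vec, lo=-1.0, hi=1.0, levels=8):
    """Quantize every dimension independently (no tokenizer retraining)."""
    return [quantize_dim(v, lo, hi, levels) for v in vec]


# Toy continuous token (assumed values); the last entry is out of range.
continuous_token = [-1.0, 0.0, 0.99, 2.0]
discrete_token = quantize_vector(continuous_token)
print(discrete_token)
```

With 8 levels per dimension, a 16-dimensional token already has 8^16 possible joint codes, which illustrates why a single softmax over the joint space is impractical and why predicting the dimensions one at a time with a lightweight autoregressive head is attractive.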

KHAIT: K-9 Handler Artificial Intelligence Teaming for Collaborative Sensemaking

Matthew Wilchek,Linhan Wang,Sally Dickinson,Erica Feuerbacher,Kurt Luther,Feras A. Batarseh

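Task: Propose KHAIT, a new approach that closes the sensemaking gap in urban search and rescue (USAR) operations by integrating object detection-based artificial intelligence (AI) and augmented reality (AR).

Motivation: In USAR operations, challenging environments and canine-specific behaviors complicate communication between handlers and canines, leaving the handler unaware of the canine's location and situation, the so-called 'sensemaking gap'.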

Details

Method: Equipped with AI-powered cameras, edge computing, and AR headsets, KHAIT performs precise and rapid object detection from the canine's perspective to improve survivor localization. Result: Evaluation in a real-world USAR environment shows a 22% decrease in average survival allocation time, improving the speed and accuracy of operations. Conclusion: KHAIT effectively closes the sensemaking gap and improves the efficiency and accuracy of USAR operations. Abstract: In urban search and rescue (USAR) operations, communication between handlers and specially trained canines is crucial but often complicated by challenging environments and the specific behaviors canines are trained to exhibit when detecting a person. Since a USAR canine often works out of sight of the handler, the handler lacks awareness of the canine's location and situation, known as the 'sensemaking gap.' In this paper, we propose KHAIT, a novel approach to close the sensemaking gap and enhance USAR effectiveness by integrating object detection-based Artificial Intelligence (AI) and Augmented Reality (AR). Equipped with AI-powered cameras, edge computing, and AR headsets, KHAIT enables precise and rapid object detection from a canine's perspective, improving survivor localization. We evaluate this approach in a real-world USAR environment, demonstrating an average survival allocation time decrease of 22%, enhancing the speed and accuracy of operations.

Motion Synthesis with Sparse and Flexible Keyjoint Control

Inwoo Hwang,Jinseok Bae,Donggeun Lim,Young Min Kim

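Task: Propose a practical controllable motion synthesis framework that handles sparse and flexible keyjoint signals.

Motivation: Creating expressive character animation requires extensive manual adjustment, and existing controllable motion generation methods usually rely on predefined dense spatio-temporal specifications, which limits their practicality for animators.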

Details

Method: Use a decomposed diffusion-based motion synthesis framework that first synthesizes keyjoint movements from sparse input control signals and then synthesizes full-body motion from the completed keyjoint trajectories. Result: Comprehensive experiments on diverse datasets and scenarios demonstrate the effectiveness of sparse and flexible keyjoint control. Conclusion: The proposed framework handles high-level intent and intuitive control, increases control flexibility, and precisely satisfies task requirements. Abstract: Creating expressive character animations is labor-intensive, requiring intricate manual adjustment by animators across space and time. Previous works on controllable motion generation often rely on a predefined set of dense spatio-temporal specifications (e.g., dense pelvis trajectories with exact per-frame timing), limiting practicality for animators. To process high-level intent and intuitive control in diverse scenarios, we propose a practical controllable motion synthesis framework that respects sparse and flexible keyjoint signals. Our approach employs a decomposed diffusion-based motion synthesis framework that first synthesizes keyjoint movements from sparse input control signals and then synthesizes full-body motion based on the completed keyjoint trajectories. The low-dimensional keyjoint movements can easily adapt to various control signal types, such as end-effector position for diverse goal-driven motion synthesis, or incorporate functional constraints on a subset of keyjoints. Additionally, we introduce a time-agnostic control formulation, eliminating the need for frame-specific timing annotations and enhancing control flexibility. Then, the shared second stage can synthesize a natural whole-body motion that precisely satisfies the task requirement from dense keyjoint movements. We demonstrate the effectiveness of sparse and flexible keyjoint control through comprehensive experiments on diverse datasets and scenarios.

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

NVIDIA,Alisson Azzolini,Hannah Brandon,Prithvijit Chattopadhyay,Huayu Chen,Jinju Chu,Yin Cui,Jenna Diamond,Yifan Ding,Francesco Ferroni,Rama Govindaraju,Jinwei Gu,Siddharth Gururani,Imad El Hanafi,Zekun Hao,Jacob Huffman,Jingyi Jin,Brendan Johnson,Rizwan Khan,George Kurian,Elena Lantz,Nayeon Lee,Zhaoshuo Li,Xuan Li,Tsung-Yi Lin,Yen-Chen Lin,Ming-Yu Liu,Andrew Mathau,Yun Ni,Lindsey Pavao,Wei Ping,David W. Romero,Misha Smelyanskiy,Shuran Song,Lyne Tchapmi,Andrew Z. Wang,Boxin Wang,Haoxiang Wang,Fangyin Wei,Jiashu Xu,Yao Xu,Xiaodong Yang,Zhuolin Yang,Xiaohui Zeng,Zhe Zhang

Task: Develop multimodal large language models that can understand the physical world and generate embodied decisions.

Motivation: Physical AI systems need to perceive, understand, and perform complex actions in the physical world.

Details

Method: Represent physical common sense with a hierarchical ontology, develop two multimodal large language models (Cosmos-Reason1-8B and Cosmos-Reason1-56B), and curate data and train the models in four stages: vision pre-training, general supervised fine-tuning, Physical AI supervised fine-tuning, and Physical AI reinforcement learning. Result: Evaluation results show that Physical AI supervised fine-tuning and reinforcement learning bring significant improvements. Conclusion: To facilitate the development of Physical AI, the code and pre-trained models will be made available under the NVIDIA Open Model License. Abstract: Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-8B and Cosmos-Reason1-56B. We curate data and train our models in four stages: vision pre-training, general supervised fine-tuning (SFT), Physical AI SFT, and Physical AI reinforcement learning (RL) as the post-training. To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and reinforcement learning bring significant improvements. To facilitate the development of Physical AI, we will make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.

Shap-MeD

Nicolás Laverde,Melissa Robles,Johan Rodríguez

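Task: Develop Shap-MeD, a text-to-3D object generative model specialized in the biomedical domain, to assist the 3D modeling of medical objects.

Motivation: 3D modeling in medicine has many applications, including surgical simulation and planning, the design of personalized prosthetic implants, medical education, the creation of anatomical models, and the development of research prototypes. By reducing development time, Shap-MeD aims to make these applications more efficient.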

Details

Method: Take Shap-e, an open-source text-to-3D generative model developed by OpenAI, and fine-tune it on a dataset of biomedical objects. Result: Shap-MeD achieves a mean squared error (MSE) of 0.089 in latent generation on the evaluation set, compared with 0.147 for Shap-e. Qualitative evaluation indicates that Shap-MeD produces biomedical objects with higher structural accuracy. Conclusion: Shap-MeD performs well at 3D object generation in the biomedical domain, with higher structural accuracy and lower latent generation error. Abstract: We present Shap-MeD, a text-to-3D object generative model specialized in the biomedical domain. The objective of this study is to develop an assistant that facilitates the 3D modeling of medical objects, thereby reducing development time. 3D modeling in medicine has various applications, including surgical procedure simulation and planning, the design of personalized prosthetic implants, medical education, the creation of anatomical models, and the development of research prototypes. To achieve this, we leverage Shap-e, an open-source text-to-3D generative model developed by OpenAI, and fine-tune it using a dataset of biomedical objects. Our model achieved a mean squared error (MSE) of 0.089 in latent generation on the evaluation set, compared to Shap-e's MSE of 0.147. Additionally, we conducted a qualitative evaluation, comparing our model with others in the generation of biomedical objects. Our results indicate that Shap-MeD demonstrates higher structural accuracy in biomedical object generation.

A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana

Alba Márquez-Rodríguez,Miguel Ángel Mohedano-Munoz,Manuel J. Marín-Jiménez,Eduardo Santamaría-García,Giulia Bastianelli,Pedro Jordano,Irene Mendoza

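Task: Develop a multi-stage pipeline for automatic identification of bird vocalizations in Doñana National Park.

Motivation: Passive acoustic monitoring generates vast amounts of unsupervised audio data from which meaningful information is hard to extract; deep learning techniques offer a promising solution.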

Details

Method: Develop a multi-stage pipeline consisting of a Bird Song Detector and custom classifiers trained on BirdNET embeddings. Result: Combining the Bird Song Detector with a fine-tuned BirdNET model significantly improves species identification accuracy over the baseline model. Conclusion: Integrating a Bird Song Detector with fine-tuned classification models identifies birds effectively in local soundscapes, underscoring the need to adapt general-purpose tools to specific ecological challenges. Abstract: Passive Acoustic Monitoring with automatic recorders is essential for ecosystem conservation but generates vast unsupervised audio data, posing challenges for extracting meaningful information. Deep Learning techniques offer a promising solution. BirdNET, a widely used model for bird identification, has shown success in many study systems but is limited in some regions due to biases in its training data. A key challenge in bird species detection is that many recordings either lack target species or contain overlapping vocalizations. To overcome these problems, we developed a multi-stage pipeline for automatic bird vocalization identification in Doñana National Park (SW Spain), a region facing significant conservation threats. Our approach included a Bird Song Detector to isolate vocalizations and custom classifiers trained with BirdNET embeddings. We manually annotated 461 minutes of audio from three habitats across nine locations, yielding 3,749 annotations for 34 classes. Spectrograms facilitated the use of image processing techniques. Applying the Bird Song Detector before classification improved species identification, as all classification models performed better when analyzing only the segments where birds were detected. Specifically, the combination of the Bird Song Detector and fine-tuned BirdNET outperformed the baseline without the Bird Song Detector. Our approach demonstrated the effectiveness of integrating a Bird Song Detector with fine-tuned classification models for bird identification at local soundscapes. These findings highlight the need to adapt general-purpose tools for specific ecological challenges, as demonstrated in Doñana. Automatically detecting bird species helps track the health status of this threatened ecosystem, given the sensitivity of birds to environmental changes, and helps in the design of conservation measures for reducing biodiversity loss.

How to Train Your Dragon: Automatic Diffusion-Based Rigging for Characters with Diverse Topologies

Zeqi Gu,Difan Liu,Timothy Langlois,Matthew Fisher,Abe Davis

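Task: Extend diffusion-based methods to animate images of characters with diverse skeletal topologies.

Motivation: Existing diffusion-based methods achieve impressive results on animating images of humans, but they depend on human-specific body pose representations and extensive training on labeled real videos.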

Details

Method: Propose a procedural data generation pipeline that samples training data with diverse topologies on the fly, and train the model with a novel skeleton representation. During fine-tuning, the model rapidly adapts to unseen target characters and generates images in new poses. Result: Extensive experiments demonstrate the superior quality of the generated new-pose images. Conclusion: The method effectively animates images of characters with diverse skeletal topologies and performs well at generating new-pose images. Abstract: Recent diffusion-based methods have achieved impressive results on animating images of human subjects. However, most of that success has built on human-specific body pose representations and extensive training with labeled real videos. In this work, we extend the ability of such models to animate images of characters with more diverse skeletal topologies. Given a small number (3-5) of example frames showing the character in different poses with corresponding skeletal information, our model quickly infers a rig for that character that can generate images corresponding to new skeleton poses. We propose a procedural data generation pipeline that efficiently samples training data with diverse topologies on the fly. We use it, along with a novel skeleton representation, to train our model on articulated shapes spanning a large space of textures and topologies. Then during fine-tuning, our model rapidly adapts to unseen target characters and generalizes well to rendering new poses, both for realistic and more stylized cartoon appearances. To better evaluate performance on this novel and challenging task, we create the first 2D video dataset that contains both humanoid and non-humanoid subjects with per-frame keypoint annotations. With extensive experiments, we demonstrate the superior quality of our results. Project page: https://traindragondiffusion.github.io/

Cancelable Biometric Template Generation Using Random Feature Vector Transformations

Ragendhu Sp,Tony Thomas,Sabu Emmanuel

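Task: Propose a novel, modality-independent cancelable biometric scheme that overcomes the limitations of existing schemes.

Motivation: Existing cancelable biometric schemes have performance concerns, are modality-specific, and are prone to reconstruction attacks.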

Details

Method: Generate the cancelable template (pseudo-identifier) as a distance vector between multiple random transformations of the biometric feature vector. Result: Recognition performance is evaluated on face and fingerprint modalities, giving worst-case Equal Error Rates (EER) of 1.5 and 1.7, respectively. Conclusion: The proposed scheme eliminates the possibility of template reconstruction, stores no details of the biometric template, and achieves good recognition performance. Abstract: Cancelable biometric schemes are designed to extract an identity-preserving, non-invertible as well as revocable pseudo-identifier from biometric data. Recognition systems need to store only this pseudo-identifier, to avoid tampering and/or stealing of original biometric data during the recognition process. State-of-the-art cancelable schemes generate pseudo-identifiers by transforming the original template using either user-specific salting or many-to-one transformations. In addition to the performance concerns, most of such schemes are modality-specific and prone to reconstruction attacks as there are chances for unauthorized access to security-critical transformation keys. A novel, modality-independent cancelable biometric scheme is proposed to overcome these limitations. In this scheme, a cancelable template (pseudo-identifier) is generated as a distance vector between multiple random transformations of the biometric feature vector. These transformations are performed by grouping feature vector components based on a set of user-specific random vectors. The proposed scheme nullifies the possibility of template reconstruction, as the generated cancelable template contains only the distance values between the different random transformations of the feature vector and does not store any details of the biometric template. The recognition performance of the proposed scheme is evaluated for face and fingerprint modalities. In the worst case, an Equal Error Rate (EER) of 1.5 is obtained for face and 1.7 for fingerprint.
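The template construction described above (random grouping transformations followed by pairwise distances) can be sketched as follows. This is a loose illustration under explicit assumptions, not the authors' algorithm: here each "random transformation" assigns every feature component to a random group and sums within groups, the randomness comes from a hypothetical per-user seed, and Euclidean distance is used. The abstract does not specify any of these choices.

```python
import math
import random
from itertools import combinations


def random_grouping_transform(features, rng, n_groups):
    """Assumed transform: put each component in a random group, sum per group."""
    sums = [0.0] * n_groups
    for value in features:
        sums[rng.randrange(n_groups)] += value
    return sums


def cancelable_template(features, user_seed, n_transforms=4, n_groups=3):
    """Return pairwise distances between several random transforms.

    Only the distances are kept; the transforms and the feature
    vector itself are discarded. Changing `user_seed` revokes the
    old template and yields a new one from the same biometric.
    """
    rng = random.Random(user_seed)  # user-specific randomness (assumption)
    transformed = [
        random_grouping_transform(features, rng, n_groups)
        for _ in range(n_transforms)
    ]
    return [math.dist(a, b) for a, b in combinations(transformed, 2)]


# Toy feature vector (assumed values).
features = [0.2, 0.5, 0.1, 0.9, 0.4, 0.7]
template = cancelable_template(features, user_seed=42)
print(len(template))
```

With 4 transforms the template holds 6 pairwise distances. Matching would compare two such templates generated with the same seed, and an Equal Error Rate would then be read off the resulting genuine and impostor score distributions.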

Nano-3D: Metasurface-Based Neural Depth Imaging

Bingxuan Li,Jiahao Wu,Yuan Xu,Yunxiang Zhang,Zezheng Zhu,Nanfang Yu,Qi Sun

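Task: Propose Nano-3D, a metasurface-based neural depth imaging solution that extracts precise metric depth information from monocular metasurface-polarized imagery.

Motivation: Traditional depth cameras face a trade-off between form factor and precision, which limits their suitability for spatially constrained scenarios.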

Details

Method: Integrate a custom-fabricated 700 nm thick TiO2 metasurface with a multi-module deep neural network to extract depth information from monocular metasurface-polarized imagery. Result: Simulated and physical experiments demonstrate the effectiveness of Nano-3D. Conclusion: The success of Nano-3D paves the way for bridging future graphics systems with emerging nanomaterial technologies through novel computational approaches. Abstract: Depth imaging is a foundational building block for broad applications, such as autonomous driving and virtual/augmented reality. Traditionally, depth cameras have relied on time-of-flight sensors or multi-lens systems to achieve physical depth measurements. However, these systems often face a trade-off between a bulky form factor and imprecise approximations, limiting their suitability for spatially constrained scenarios. Inspired by the emerging advancements of nano-optics, we present Nano-3D, a metasurface-based neural depth imaging solution with an ultra-compact footprint. Nano-3D integrates our custom-fabricated 700 nm thick TiO2 metasurface with a multi-module deep neural network to extract precise metric depth information from monocular metasurface-polarized imagery. We demonstrate the effectiveness of Nano-3D with both simulated and physical experiments. We hope the exhibited success paves the way for the community to bridge future graphics systems with emerging nanomaterial technologies through novel computational approaches.

Yuci Han,Charles Toth,Alper Yilmaz

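Task: Develop an approach that enables an Unmanned Aerial System (UAS) to efficiently learn to navigate in large-scale urban environments and to transfer its acquired expertise to novel environments.

Motivation: Traditional reinforcement learning (RL) focuses on acquiring a policy for a specific task, whereas meta-reinforcement learning (MRL) aims to learn a policy that transfers quickly to novel tasks. However, MRL training is time-consuming, so a more efficient algorithm is needed.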

Details

Method: Propose a meta-curriculum training scheme consisting of meta-training followed by fine-tuning on downstream tasks. In addition, introduce the Incremental Self-Adaptive Reinforcement learning (ISAR) algorithm, which combines ideas from incremental learning and meta-reinforcement learning. Result: Evaluation in simulated environments shows that this training philosophy combined with the ISAR algorithm significantly improves convergence speed for large-scale urban navigation and adaptation to novel environments. Conclusion: The proposed meta-curriculum training scheme and ISAR algorithm significantly improve a UAS's navigation efficiency in large-scale urban environments and its adaptation to novel environments. Abstract: The aim of this work is to develop an approach that enables an Unmanned Aerial System (UAS) to efficiently learn to navigate in large-scale urban environments and transfer its acquired expertise to novel environments. To achieve this, we propose a meta-curriculum training scheme. First, meta-training allows the agent to learn a master policy to generalize across tasks. The resulting model is then fine-tuned on the downstream tasks. We organize the training curriculum in a hierarchical manner such that the agent is guided from coarse to fine towards the target task. In addition, we introduce Incremental Self-Adaptive Reinforcement learning (ISAR), an algorithm that combines the ideas of incremental learning and meta-reinforcement learning (MRL). In contrast to traditional reinforcement learning (RL), which focuses on acquiring a policy for a specific task, MRL aims to learn a policy with fast transfer ability to novel tasks. However, the MRL training process is time-consuming, whereas our proposed ISAR algorithm achieves faster convergence than the conventional MRL algorithm. We evaluate the proposed methodologies in simulated environments and demonstrate that using this training philosophy in conjunction with the ISAR algorithm significantly improves the convergence speed for navigation in large-scale cities and the adaptation proficiency in novel environments.

Controlling Avatar Diffusion with Learnable Gaussian Embedding

Xuan Gao,Jingtao Zhou,Dongyu Liu,Yuqi Zhou,Juyong Zhang

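Task: Propose a new control signal representation that improves the consistency and expressiveness of diffusion-based head generation.

Motivation: Existing digital human generation models fall short in 3D consistency, temporal coherence, and motion accuracy, mainly because commonly used control signals (such as landmarks and depth maps) have limited representation ability and public datasets lack diversity in identity and pose.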

Details

Method: Embed a learnable neural Gaussian onto a parametric head surface, synthesize a large-scale dataset with multiple poses and identities, and use real/synthetic labels to distinguish real data from synthetic data. Result: Experiments show that the method outperforms existing methods in realism, expressiveness, and 3D consistency. Conclusion: The proposed method significantly improves the consistency and expressiveness of diffusion-based head generation; the code, synthetic datasets, and pre-trained models will be released on the project page. Abstract: Recent advances in diffusion models have made significant progress in digital human generation. However, most existing models still struggle to maintain 3D consistency, temporal coherence, and motion accuracy. A key reason for these shortcomings is the limited representation ability of commonly used control signals (e.g., landmarks and depth maps). In addition, the lack of diversity in identity and pose variations in public datasets further hinders progress in this area. In this paper, we analyze the shortcomings of current control signals and introduce a novel control signal representation that is optimizable, dense, expressive, and 3D consistent. Our method embeds a learnable neural Gaussian onto a parametric head surface, which greatly enhances the consistency and expressiveness of diffusion-based head models. Regarding the dataset, we synthesize a large-scale dataset with multiple poses and identities. In addition, we use real/synthetic labels to effectively distinguish real and synthetic data, minimizing the impact of imperfections in synthetic data on the generated head images. Extensive experiments show that our model outperforms existing methods in terms of realism, expressiveness, and 3D consistency. Our code, synthetic datasets, and pre-trained models will be released on our project page: https://ustc3dv.github.io/Learn2Control/

Sequential Spatial-Temporal Network for Interpretable Automatic Ultrasonic Assessment of Fetal Head during labor

Jie Gan,Zhuonan Liang,Jianan Fan,Lisa Mcguire,Caterina Watson,Jacqueline Spurway,Jillian Clarke,Weidong Cai

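Task: Accurately measure the Angle of Progression (AoP) and Head Symphysis Distance (HSD), the key metrics for assessing fetal head descent and predicting delivery outcomes.

Motivation: Meet the clinical demands and standard operating process set out in the ISUOG guideline.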

Details

Method: Introduce the Sequential Spatial-Temporal Network (SSTN), which first identifies ultrasound planes, then segments anatomical structures such as the pubic symphysis and fetal head, and finally detects key landmarks for precise measurement of HSD and AoP. Result: Experimental evaluations on clinical datasets show that SSTN significantly surpasses existing models, reducing the mean absolute error by 18% for AoP and 22% for HSD. Conclusion: SSTN is the first interpretable model designed specifically for intrapartum ultrasound video analysis, and it significantly improves the accuracy and reliability of AoP and HSD measurement. Abstract: The intrapartum ultrasound guideline established by ISUOG highlights the Angle of Progression (AoP) and Head Symphysis Distance (HSD) as pivotal metrics for assessing fetal head descent and predicting delivery outcomes. Accurate measurement of the AoP and HSD requires a structured process. This begins with identifying standardized ultrasound planes, followed by the detection of specific anatomical landmarks within the regions of the pubic symphysis and fetal head that correlate with the delivery parameters AoP and HSD. Finally, these measurements are derived based on the identified anatomical landmarks. Addressing the clinical demands and standard operation process outlined in the ISUOG guideline, we introduce the Sequential Spatial-Temporal Network (SSTN), the first interpretable model specifically designed for intrapartum ultrasound video analysis. The SSTN operates by first identifying ultrasound planes, then segmenting anatomical structures such as the pubic symphysis and fetal head, and finally detecting key landmarks for precise measurement of HSD and AoP. Furthermore, the cohesive framework leverages task-related information to improve accuracy and reliability. Experimental evaluations on clinical datasets demonstrate that SSTN significantly surpasses existing models, reducing the mean absolute error by 18% for AoP and 22% for HSD.
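The final step of the pipeline, deriving AoP and HSD from detected landmarks, reduces to plane geometry. The sketch below is a simplified illustration and not the SSTN code: it assumes that AoP is the angle at the inferior end of the pubic symphysis between the symphysis long axis and the line to a tangent point on the fetal skull, and that HSD is the straight-line distance from that inferior end to a given skull landmark. The landmark names, pixel coordinates, and pixel spacing are hypothetical.

```python
import math


def angle_of_progression(sym_superior, sym_inferior, skull_tangent):
    """Angle in degrees at `sym_inferior` between the symphysis axis
    (towards `sym_superior`) and the line to `skull_tangent`."""
    ax, ay = sym_superior[0] - sym_inferior[0], sym_superior[1] - sym_inferior[1]
    bx, by = skull_tangent[0] - sym_inferior[0], skull_tangent[1] - sym_inferior[1]
    cos_angle = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def head_symphysis_distance(sym_inferior, skull_point, mm_per_pixel):
    """Distance in millimetres between two landmarks given in pixels."""
    return math.dist(sym_inferior, skull_point) * mm_per_pixel


# Toy landmarks in pixel coordinates (assumed values).
aop = angle_of_progression((0, 0), (10, 0), (20, 10))
hsd = head_symphysis_distance((10, 0), (13, 4), mm_per_pixel=0.5)
print(round(aop, 1), hsd)
```

Because both quantities depend only on a handful of points, a landmark localization error translates directly into a measurement error, which is why the abstract reports mean absolute error on AoP and HSD themselves.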

SpiLiFormer: Enhancing Spiking Transformers with Lateral Inhibition

Zeqi Zheng,Yanchen Huang,Yingchao Yu,Zizheng Zhu,Junfeng Tang,Zhaofei Yu,Yaochu Jin

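Task: Propose SpiLiFormer, a spiking Transformer based on a lateral inhibition mechanism, to address the attention allocation problem in existing Transformer-based SNNs.

Motivation: The spiking attention modules of most existing Transformer-based SNNs are adapted from analog Transformers and do not fully address the over-allocation of attention to irrelevant contexts.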

Details

Method: Propose a Lateral Inhibition-inspired Spiking Transformer (SpiLiFormer) that emulates the brain's lateral inhibition mechanism, enhancing attention to relevant tokens while suppressing attention to irrelevant ones. Result: The model achieves state-of-the-art performance on multiple datasets, including CIFAR-10 (+0.45%), CIFAR-100 (+0.48%), CIFAR10-DVS (+2.70%), N-Caltech101 (+1.94%), and ImageNet-1K (+1.6%). On ImageNet-1K, SpiLiFormer (69.9M parameters, 4 time steps, 384 resolution) outperforms E-SpikeFormer (173.0M parameters, 8 time steps, 384 resolution) by 0.46% using only 39% of the parameters and half the time steps. Conclusion: By emulating the brain's lateral inhibition mechanism, SpiLiFormer effectively addresses the attention allocation problem of existing Transformer-based SNNs and achieves state-of-the-art performance on multiple datasets. Abstract: Spiking Neural Networks (SNNs) based on Transformers have garnered significant attention due to their superior performance and high energy efficiency. However, the spiking attention modules of most existing Transformer-based SNNs are adapted from those of analog Transformers, failing to fully address the issue of over-allocating attention to irrelevant contexts. To fix this fundamental yet overlooked issue, we propose a Lateral Inhibition-inspired Spiking Transformer (SpiLiFormer). It emulates the brain's lateral inhibition mechanism, guiding the model to enhance attention to relevant tokens while suppressing attention to irrelevant ones. Our model achieves state-of-the-art (SOTA) performance across multiple datasets, including CIFAR-10 (+0.45%), CIFAR-100 (+0.48%), CIFAR10-DVS (+2.70%), N-Caltech101 (+1.94%), and ImageNet-1K (+1.6%). Notably, on the ImageNet-1K dataset, SpiLiFormer (69.9M parameters, 4 time steps, 384 resolution) outperforms E-SpikeFormer (173.0M parameters, 8 time steps, 384 resolution), a SOTA spiking Transformer, by 0.46% using only 39% of the parameters and half the time steps. Our code and training checkpoints will be released upon acceptance.

Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models

Marc Benedí San Millán,Angela Dai,Matthias Nießner

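Task: Propose an approach that synthesizes 4D animated sequences from input static 3D humanoid meshes.

Motivation: Creating realistic humanoid character animation takes significant time and cost, so a more efficient and lower-cost approach is needed.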

Details

Method: Leverage the strong generalized motion priors in generative video models: from an input static 3D humanoid mesh and a text prompt describing the desired animation, synthesize a corresponding video, then use an SMPL representation to animate the 3D mesh according to the video-generated motion. Result: The approach synthesizes diverse and realistic 4D animations, offering a cost-effective and accessible solution. Conclusion: The approach provides an efficient, low-cost way to synthesize diverse and realistic 4D animations. Abstract: Animation of humanoid characters is essential in various graphics applications, but requires significant time and cost to create realistic animations. We propose an approach to synthesize 4D animated sequences of input static 3D humanoid meshes, leveraging strong generalized motion priors from generative video models -- as such video models contain powerful motion information covering a wide variety of human motions. From an input static 3D humanoid mesh and a text prompt describing the desired animation, we synthesize a corresponding video conditioned on a rendered image of the 3D mesh. We then employ an underlying SMPL representation to animate the corresponding 3D mesh according to the video-generated motion, based on our motion optimization. This provides a cost-effective and accessible solution for synthesizing diverse and realistic 4D animations.

GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions

Xiaomeng Chu,Jiajun Deng,Guoliang You,Wei Liu,Xingchen Li,Jianmin Ji,Yanyong Zhang

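Task: Develop a flexible instruction-guided 6-DoF grasping framework for real-world robotic systems.

Motivation: Existing methods use the contextual understanding of large language models (LLMs) to map expressions to targets, but the LLMs' knowledge of objects' physical properties remains underexplored even though it is closely tied to grasping.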

Details

Method: Propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties and guided by auxiliary question-answering (QA) tasks. QA templates are designed to enable hierarchical reasoning covering target parsing, physical property analysis, and grasp action selection. Result: Extensive experiments on the IntentGrasp benchmark demonstrate the superiority of the method, and real-world robotic applications confirm its practicality. Conclusion: By combining physical property reasoning with a multimodal LLM architecture, GraspCoT significantly improves the accuracy and practicality of 6-DoF grasping. Abstract: Flexible instruction-guided 6-DoF grasping is a significant yet challenging task for real-world robotic systems. Existing methods utilize the contextual understanding capabilities of large language models (LLMs) to establish mappings between expressions and targets, allowing robots to comprehend users' intentions in the instructions. However, the LLM's knowledge about objects' physical properties remains underexplored despite its tight relevance to grasping. In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties, guided by auxiliary question-answering (QA) tasks. Particularly, we design a set of QA templates to enable hierarchical reasoning that includes three stages: target parsing, physical property analysis, and grasp action selection. Moreover, GraspCoT presents a unified multimodal LLM architecture, which encodes multi-view observations of 3D scenes into 3D-aware visual tokens, and then jointly embeds these visual tokens with CoT-derived textual tokens within LLMs to generate grasp pose predictions. Furthermore, we present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Extensive experiments on IntentGrasp demonstrate the superiority of our method, with additional validation in real-world robotic applications confirming its practicality. Codes and data will be released.

SALT: Singular Value Adaptation with Low-Rank Transformation

Abdelrahman Elsayed,Sarim Hashmi,Mohammed Elseiagy,Hu Wang,Mohammad Yaqub,Ibrahim Almakky

Task: Propose SALT, a new parameter-efficient fine-tuning method for medical image segmentation.

Motivation: Existing parameter-efficient fine-tuning approaches, such as LoRA and SVD-based methods, fall short in capturing domain-specific detail, so a more effective method is needed.

Details

Method: SALT selectively adapts the most influential singular values and combines this with a low-rank update.

Result: On 5 challenging medical datasets, SALT improves Dice by 2% to 5% over existing PEFT methods (LoRA and SVD) while using only 3.9% trainable parameters.

Conclusion: SALT performs well in low-resource settings and adapts effectively to medical image segmentation tasks.

Abstract: The complex nature of medical image segmentation calls for models that are specifically designed to capture detailed, domain-specific features. Large foundation models offer considerable flexibility, yet the cost of fine-tuning these models remains a significant barrier. Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), efficiently update model weights with low-rank matrices but may suffer from underfitting when the chosen rank is insufficient to capture domain-specific nuances. Conversely, full-rank Singular Value Decomposition (SVD) based methods provide comprehensive updates by modifying all singular values, yet they often lack flexibility and exhibit variable performance across datasets. We propose SALT (Singular Value Adaptation with Low-Rank Transformation), a method that selectively adapts the most influential singular values using trainable scale and shift parameters while complementing this with a low-rank update for the remaining subspace. This hybrid approach harnesses the advantages of both LoRA and SVD, enabling effective adaptation without relying on increasing model size or depth. Evaluated on 5 challenging medical datasets, ranging from as few as 20 samples to 1000, SALT outperforms state-of-the-art PEFT (LoRA and SVD) by 2% to 5% in Dice with only 3.9% trainable parameters, demonstrating robust adaptation even in low-resource settings. The code for SALT is available at: https://github.com/BioMedIA-MBZUAI/SALT
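The hybrid update described in the SALT abstract can be illustrated with a small sketch. It is one possible reading, not the authors' implementation: it assumes an SVD `W = U diag(s) Vt` is already available, applies a scale and shift to the top `k` singular values, and adds a low-rank product `A @ B` to the trailing singular block. The function name `salt_style_update` and the exact placement of the low-rank term are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def salt_style_update(u, s, vt, k, scale, shift, a, b):
    """Rebuild a weight matrix from a precomputed SVD W = U diag(s) Vt.

    The top-k singular values are rescaled and shifted. The remaining
    (n - k) x (n - k) singular block receives an additive low-rank term
    A @ B. This layout is an assumption, not taken from the paper.
    """
    n = len(s)
    sigma = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Scale-and-shift only the k most influential singular values.
        sigma[i][i] = scale[i] * s[i] + shift[i] if i < k else s[i]
    if k < n:
        delta = matmul(a, b)  # low-rank update for the trailing block
        for i in range(k, n):
            for j in range(k, n):
                sigma[i][j] += delta[i - k][j - k]
    return matmul(matmul(u, sigma), vt)


# Toy usage with identity singular vectors, so the result equals sigma.
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w_new = salt_style_update(
    eye, [3.0, 2.0, 1.0], eye, k=1,
    scale=[2.0], shift=[0.5],
    a=[[1.0], [0.0]], b=[[0.0, 1.0]],
)
```

With `scale = 1`, `shift = 0`, and a zero low-rank term, the function returns the original matrix, which is a convenient sanity check for this kind of reparameterization.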

3-D Image-to-Image Fusion in Lightsheet Microscopy by Two-Step Adversarial Network: Contribution to the FuseMyCells Challenge

Marek Wodzinski,Henning Müller

Task: Propose a two-step deep learning solution for fusing a high-quality 3-D volume from a single 3-D view.

Motivation: Address the low penetration depth and degraded image quality of lightsheet microscopy at greater depths, while simplifying procedures and conserving the photon budget.

Details

Method: A two-step procedure: the first step processes a downsampled image to capture the entire region of interest, and the second step uses a patch-based approach for high-resolution inference, with an adversarial loss to enhance visual outcomes.

Result: Experiments show that the method is effective, improving 3-D image fusion quality and extending the capabilities of lightsheet microscopy. The average SSIM for the nucleus and membranes is greater than 0.85 and 0.91, respectively.

Conclusion: The method shows potential for improving 3-D image fusion quality and extending the capabilities of lightsheet microscopy.

Abstract: Lightsheet microscopy is a powerful 3-D imaging technique that addresses limitations of traditional optical and confocal microscopy but suffers from a low penetration depth and reduced image quality at greater depths. Multiview lightsheet microscopy improves 3-D resolution by combining multiple views but simultaneously increases the complexity and the photon budget, leading to potential photobleaching and phototoxicity. The FuseMyCells challenge, organized in conjunction with the IEEE ISBI 2025 conference, aims to benchmark deep learning-based solutions for fusing high-quality 3-D volumes from single 3-D views, potentially simplifying procedures and conserving the photon budget. In this work, we propose a contribution to the FuseMyCells challenge based on a two-step procedure. The first step processes a downsampled version of the image to capture the entire region of interest, while the second step uses a patch-based approach for high-resolution inference, incorporating adversarial loss to enhance visual outcomes. This method addresses challenges related to high data resolution, the necessity of global context, and the preservation of high-frequency details. Experimental results demonstrate the effectiveness of our approach, highlighting its potential to improve 3-D image fusion quality and extend the capabilities of lightsheet microscopy. The average SSIM for the nucleus and membranes is greater than 0.85 and 0.91, respectively.

Dong Chen,Boyue Zhao,Yi Zhang,Meng Zhao

Task: Propose a novel complementary feature compression interaction network (CFCI-Net) for accurate brain glioma segmentation.

Motivation: Because each MRI modality has its own characteristics, cross-modal fusion is difficult when modal features differ widely, and models end up ignoring rich feature information. In addition, the growth in feature dimensions in parallel networks causes redundant multi-modal feature interaction, which further complicates multi-modal feature fusion at the bottom end.

Details

Method: Proposes a selective complementary feature fusion (SCFF) module that adaptively fuses rich cross-modal feature information through complementary soft selection weights, and a modal feature compression interaction (MFCI) transformer that handles multi-modal fusion redundancy when the feature dimension surges. The MFCI transformer consists of modal feature compression (MFC) and modal feature interaction (MFI), which compress redundant features and learn multi-modal feature interactions.

Result: Evaluations on the BraTS2019 and BraTS2020 datasets show that CFCI-Net achieves superior results compared to state-of-the-art models.

Conclusion: With an efficient modal fusion strategy, CFCI-Net achieves complementary fusion and compression interaction of multi-modal feature information, effectively addressing multi-modal feature fusion in brain glioma segmentation.

Abstract: An efficient modal feature fusion strategy is the key to achieving accurate segmentation of brain glioma. However, due to the specificity of different MRI modes, it is difficult to carry out cross-modal fusion with large differences in modal features, resulting in the model ignoring rich feature information. On the other hand, the problem of multi-modal feature redundancy interaction occurs in parallel networks due to the proliferation of feature dimensions, further increasing the difficulty of multi-modal feature fusion at the bottom end. In order to solve the above problems, we propose a novel complementary feature compression interaction network (CFCI-Net), which realizes the complementary fusion and compression interaction of multi-modal feature information with an efficient mode fusion strategy. Firstly, we propose a selective complementary feature fusion (SCFF) module, which adaptively fuses rich cross-modal feature information by complementary soft selection weights. Secondly, a modal feature compression interaction (MFCI) transformer is proposed to deal with the multi-mode fusion redundancy problem when the feature dimension surges. The MFCI transformer is composed of modal feature compression (MFC) and modal feature interaction (MFI) to realize redundancy feature compression and multi-mode feature interactive learning. Evaluations on the BraTS2019 and BraTS2020 datasets demonstrate that CFCI-Net achieves superior results compared to state-of-the-art models. Code: https://github.com/CDmm0/CFCI-Net

OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering

Shiyong Liu,Xiao Tang,Zhihao Li,Yingfan He,Chongjie Ye,Jianzhuang Liu,Binxiao Huang,Shunbo Zhou,Xiaofei Wu

Task: Propose an occlusion-aware scene division strategy that improves reconstruction quality and rendering speed for large-scale scenes.

Motivation: Existing scene division methods are occlusion-agnostic, so a region may contain severely occluded areas, which lowers the correlation among its cameras and their average contribution to the overall reconstruction.

Details

Method: Proposes an occlusion-aware scene division strategy based on camera positions and co-visibility, together with a region-based rendering technique that accelerates rendering of large scenes.

Result: Experiments on multiple large scenes show that the method outperforms existing state-of-the-art approaches in both reconstruction quality and rendering speed.

Conclusion: The proposed occlusion-aware scene division strategy and region-based rendering technique significantly improve reconstruction quality and rendering speed for large-scale scenes.

Abstract: In large-scale scene reconstruction using 3D Gaussian splatting, it is common to partition the scene into multiple smaller regions and reconstruct them individually. However, existing division methods are occlusion-agnostic, meaning that each region may contain areas with severe occlusions. As a result, the cameras within those regions are less correlated, leading to a low average contribution to the overall reconstruction. In this paper, we propose an occlusion-aware scene division strategy that clusters training cameras based on their positions and co-visibilities to acquire multiple regions. Cameras in such regions exhibit stronger correlations and a higher average contribution, facilitating high-quality scene reconstruction. We further propose a region-based rendering technique to accelerate large scene rendering, which culls Gaussians invisible to the region where the viewpoint is located. Such a technique significantly speeds up the rendering without compromising quality. Extensive experiments on multiple large scenes show that our method achieves superior reconstruction results with faster rendering speed compared to existing state-of-the-art approaches. Project page: https://occlugaussian.github.io.
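To make the idea of clustering cameras by position and co-visibility concrete, here is a minimal sketch. The abstract does not specify the clustering algorithm, so this uses a simple stand-in: cameras are linked when they are both within a distance `max_dist` and share enough observed scene points (Jaccard overlap at least `min_covis`), and regions are the connected components of that graph. All names and thresholds are hypothetical.

```python
import math


def covisibility(points_a, points_b):
    """Jaccard overlap between the sets of scene points two cameras observe."""
    union = points_a | points_b
    return len(points_a & points_b) / len(union) if union else 0.0


def cluster_cameras(positions, visible, max_dist, min_covis):
    """Group cameras that are both close together and strongly co-visible.

    Connected components over a thresholded graph (union-find); a stand-in
    for whatever clustering the paper actually uses.
    """
    n = len(positions)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            close = math.dist(positions[i], positions[j]) <= max_dist
            if close and covisibility(visible[i], visible[j]) >= min_covis:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Cameras 1 and 2 are close but see disjoint points (e.g. a wall between
# them), so a purely position-based split would wrongly group them.
positions = [(0.0, 0.0), (1.0, 0.0), (1.5, 0.0), (10.0, 0.0)]
visible = [{1, 2, 3}, {2, 3, 4}, {7, 8, 9}, {7, 8, 9}]
regions = cluster_cameras(positions, visible, max_dist=2.0, min_covis=0.4)
```

In this toy case cameras 2 and 3 share all their points but are too far apart to be linked, which shows why both criteria are applied together.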

Efficient Bayesian Computation Using Plug-and-Play Priors for Poisson Inverse Problems

Teresa Klatzer,Savvas Melidonis,Marcelo Pereyra,Konstantinos C. Zygalakis

Task: Propose a new plug-and-play (PnP) Langevin sampling methodology for Bayesian inference in low-photon Poisson imaging problems.

Motivation: Low-photon Poisson imaging has important applications in astronomy, medicine, and biology, but existing PnP Langevin algorithms are poorly suited to it because of high solution uncertainty and poor regularity properties, such as exploding gradients and non-negativity constraints.

Details

Method: Proposes two strategies for extending Langevin PnP sampling to Poisson imaging models: (i) an accelerated PnP Langevin method that incorporates boundary reflections and a Poisson likelihood approximation; (ii) a mirror sampling algorithm that uses a Riemannian geometry to handle the constraints and the poor regularity of the likelihood.

Result: Extensive numerical experiments and comparisons with state-of-the-art methods demonstrate the effectiveness of both approaches.

Conclusion: The proposed methods perform well on low-photon Poisson imaging problems, supporting accurate point estimation as well as advanced inference tasks such as uncertainty quantification and visualization analyses.

Abstract: This paper introduces a novel plug-and-play (PnP) Langevin sampling methodology for Bayesian inference in low-photon Poisson imaging problems, a challenging class of problems with significant applications in astronomy, medicine, and biology. PnP Langevin sampling algorithms offer a powerful framework for Bayesian image restoration, enabling accurate point estimation as well as advanced inference tasks, including uncertainty quantification and visualization analyses, and empirical Bayesian inference for automatic model parameter tuning. However, existing PnP Langevin algorithms are not well-suited for low-photon Poisson imaging due to high solution uncertainty and poor regularity properties, such as exploding gradients and non-negativity constraints. To address these challenges, we propose two strategies for extending Langevin PnP sampling to Poisson imaging models: (i) an accelerated PnP Langevin method that incorporates boundary reflections and a Poisson likelihood approximation and (ii) a mirror sampling algorithm that leverages a Riemannian geometry to handle the constraints and the poor regularity of the likelihood without approximations. The effectiveness of these approaches is demonstrated through extensive numerical experiments and comparisons with state-of-the-art methods.

RESFL: An Uncertainty-Aware Framework for Responsible Federated Learning by Balancing Privacy, Fairness and Utility in Autonomous Vehicles

Dawood Wasif,Terrence J. Moore,Jin-Hee Cho

Task: Explore the trade-off between privacy and fairness in federated-learning-based object detection for autonomous vehicles, and introduce RESFL as a solution.

Motivation: Existing federated learning frameworks struggle to balance privacy, fairness, and robustness, leading to performance disparities across demographic groups.

Details

Method: RESFL combines adversarial privacy disentanglement with uncertainty-guided, fairness-aware aggregation. The adversarial component uses a gradient reversal layer to remove sensitive attributes, reducing privacy risk while maintaining fairness. The uncertainty-aware aggregation uses an evidential neural network to weight client updates adaptively, prioritizing contributions with lower fairness disparities and higher confidence.

Result: Evaluated on the FACET dataset and the CARLA simulator, RESFL improves detection accuracy, reduces fairness disparities, and lowers privacy attack success rates, while showing stronger robustness to adversarial conditions than other approaches.

Conclusion: RESFL achieves a better balance among privacy, fairness, and robustness, providing an effective solution for object detection in autonomous vehicles.

Abstract: Autonomous vehicles (AVs) increasingly rely on Federated Learning (FL) to enhance perception models while preserving privacy. However, existing FL frameworks struggle to balance privacy, fairness, and robustness, leading to performance disparities across demographic groups. Privacy-preserving techniques like differential privacy mitigate data leakage risks but worsen fairness by restricting access to sensitive attributes needed for bias correction. This work explores the trade-off between privacy and fairness in FL-based object detection for AVs and introduces RESFL, an integrated solution optimizing both. RESFL incorporates adversarial privacy disentanglement and uncertainty-guided fairness-aware aggregation. The adversarial component uses a gradient reversal layer to remove sensitive attributes, reducing privacy risks while maintaining fairness. The uncertainty-aware aggregation employs an evidential neural network to weight client updates adaptively, prioritizing contributions with lower fairness disparities and higher confidence. This ensures robust and equitable FL model updates. We evaluate RESFL on the FACET dataset and CARLA simulator, assessing accuracy, fairness, privacy resilience, and robustness under varying conditions. RESFL improves detection accuracy, reduces fairness disparities, and lowers privacy attack success rates while demonstrating superior robustness to adversarial conditions compared to other approaches.

Do image and video quality metrics model low-level human vision?

Dounia Hammou,Yancheng Cai,Pavan Madhusudanarao,Christos G. Bampis,Rafał K. Mantiuk

Task: Propose a set of tests for full-reference quality metrics that examine whether they model several aspects of low-level human vision.

Motivation: Existing image and video quality metrics (such as SSIM, LPIPS, and VMAF) are often claimed to be "perceptual", yet few directly model human visual perception; most rely on hand-crafted formulas or training datasets to align with perceptual data.

Details

Method: Proposes a set of tests that examine how quality metrics handle contrast sensitivity, contrast masking, and contrast matching.

Result: An analysis of 33 existing image and video quality metrics finds that LPIPS and MS-SSIM predict contrast masking well, whereas VMAF performs poorly on this task. SSIM overemphasizes differences at high spatial frequencies, a shortcoming that its multi-scale counterpart, MS-SSIM, addresses.

Conclusion: Such findings cannot easily be made with existing evaluation protocols; the new tests provide additional scrutiny for newly proposed quality metrics.

Abstract: Image and video quality metrics, such as SSIM, LPIPS, and VMAF, aim to predict the perceived quality of the evaluated content and are often claimed to be "perceptual". Yet, few metrics directly model human visual perception, and most rely on hand-crafted formulas or training datasets to achieve alignment with perceptual data. In this paper, we propose a set of tests for full-reference quality metrics that examine their ability to model several aspects of low-level human vision: contrast sensitivity, contrast masking, and contrast matching. The tests are meant to provide additional scrutiny for newly proposed metrics. We use our tests to analyze 33 existing image and video quality metrics and find their strengths and weaknesses, such as the ability of LPIPS and MS-SSIM to predict contrast masking and poor performance of VMAF in this task. We further find that the popular SSIM metric overemphasizes differences in high spatial frequencies, but its multi-scale counterpart, MS-SSIM, addresses this shortcoming. Such findings cannot be easily made using existing evaluation protocols.

Rapid patient-specific neural networks for intraoperative X-ray to volume registration

Vivek Gopalakrishnan,Neel Dey,David-Dimitris Chlorogiannis,Andrew Abumoussa,Anna M. Larson,Darren B. Orbach,Sarah Frisken,Polina Golland

Task: Present xvr, a fully automated framework for training patient-specific neural networks for 2D/3D registration.

Motivation: Current 2D/3D registration methods fail across the broad range of procedures that depend on X-ray guidance: traditional optimization techniques require custom parameter tuning for each subject, while neural networks trained on small datasets do not generalize to new patients or require labor-intensive manual annotation.

Details

Method: xvr uses physics-based simulation to generate abundant, high-quality training data from a patient's own preoperative volumetric imaging, overcoming the inherently limited ability of supervised models to generalize to new patients and procedures.

Result: In the largest evaluation of a 2D/3D registration algorithm on real X-ray data to date, xvr generalizes robustly across a dataset comprising multiple anatomical structures, imaging modalities, and hospitals.

Conclusion: Across surgical tasks, xvr achieves submillimeter-accurate registration at intraoperative speeds, improving on existing methods by an order of magnitude, and is released as open-source software.

Abstract: The integration of artificial intelligence in image-guided interventions holds transformative potential, promising to extract 3D geometric and quantitative information from conventional 2D imaging modalities during complex procedures. Achieving this requires the rapid and precise alignment of 2D intraoperative images (e.g., X-ray) with 3D preoperative volumes (e.g., CT, MRI). However, current 2D/3D registration methods fail across the broad spectrum of procedures dependent on X-ray guidance: traditional optimization techniques require custom parameter tuning for each subject, whereas neural networks trained on small datasets do not generalize to new patients or require labor-intensive manual annotations, increasing clinical burden and precluding application to new anatomical targets. To address these challenges, we present xvr, a fully automated framework for training patient-specific neural networks for 2D/3D registration. xvr uses physics-based simulation to generate abundant high-quality training data from a patient's own preoperative volumetric imaging, thereby overcoming the inherently limited ability of supervised models to generalize to new patients and procedures. Furthermore, xvr requires only 5 minutes of training per patient, making it suitable for emergency interventions as well as planned procedures. We perform the largest evaluation of a 2D/3D registration algorithm on real X-ray data to date and find that xvr robustly generalizes across a diverse dataset comprising multiple anatomical structures, imaging modalities, and hospitals. Across surgical tasks, xvr achieves submillimeter-accurate registration at intraoperative speeds, improving upon existing methods by an order of magnitude. xvr is released as open-source software freely available at https://github.com/eigenvivek/xvr.

CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners

Yunzhi Yao,Jizhan Fang,Jia-Chen Gu,Ningyu Zhang,Shumin Deng,Huajun Chen,Nanyun Peng

Task: Propose CaKE, a new knowledge editing method that integrates updated knowledge into large language models more effectively.

Motivation: Existing knowledge editing methods handle isolated fact updates well but struggle to generalize those updates to tasks that require multi-hop reasoning.

Details

Method: Based on an analysis of reasoning circuits, CaKE uses strategically curated data to make the model use the updated knowledge and to stimulate it to develop appropriate reasoning circuits for newly integrated knowledge.

Result: Experiments show that CaKE uses updated knowledge more accurately and consistently in multi-hop reasoning tasks, improving multi-hop reasoning accuracy on the MQuAKE dataset by 20% on average.

Conclusion: CaKE integrates updated knowledge more effectively and significantly improves accuracy on multi-hop reasoning tasks.

Abstract: Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they struggle to generalize these updates to multi-hop reasoning tasks that depend on the modified knowledge. Through an analysis of reasoning circuits -- the neural pathways LLMs use for knowledge-based inference -- we observe that current layer-localized KE approaches, such as MEMIT and WISE, which edit only a single or a few model layers, struggle to effectively incorporate updated information into these reasoning pathways. To address this limitation, we propose CaKE (Circuit-aware Knowledge Editing), a novel method that enables more effective integration of updated knowledge in LLMs. CaKE leverages strategically curated data, guided by our circuits-based analysis, that forces the model to utilize the modified knowledge, stimulating the model to develop appropriate reasoning circuits for newly integrated knowledge. Experimental results show that CaKE enables more accurate and consistent use of updated knowledge across related reasoning tasks, leading to an average of 20% improvement in multi-hop reasoning accuracy on the MQuAKE dataset compared to existing KE methods. We release the code and data at https://github.com/zjunlp/CaKE.

Attentional Triple-Encoder Network in Spatiospectral Domains for Medical Image Segmentation

Kristin Qi,Xinhan Di

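Task: Propose a triple-encoder network for retinal Optical Coherence Tomography (OCT) segmentation that integrates spatial and spectral features.

Motivation: Traditional methods focus on either the spatial or the spectral domain and overlook their joint dependencies.

Details

Method: Proposes a triple-encoder network that combines convolutional neural networks (CNNs) for spatial features, Fast Fourier Convolution (FFC) for spectral features, and attention mechanisms that capture global relationships across both domains.

Result: The method raises the average Dice score from 0.855 to 0.864, outperforming prior work.

Conclusion: The proposed triple-encoder network performs well on retinal OCT segmentation and effectively integrates spatial and spectral features.

Abstract: Retinal Optical Coherence Tomography (OCT) segmentation is essential for diagnosing pathology. Traditional methods focus on either spatial or spectral domains, overlooking their combined dependencies. We propose a triple-encoder network that integrates CNNs for spatial features, Fast Fourier Convolution (FFC) for spectral features, and attention mechanisms to capture global relationships across both domains. Attention fusion modules integrate convolution and cross-attention to further enhance features. Our method achieves an average Dice score improvement from 0.855 to 0.864, outperforming prior work.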
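For readers unfamiliar with the metric behind the reported 0.855 to 0.864 improvement, the Dice score for binary masks is 2|P ∩ T| / (|P| + |T|). The following is a minimal sketch on flat 0/1 masks; the helper name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions, and the paper's exact averaging over classes and images is not given in the abstract.

```python
def dice_score(pred, target):
    """Dice coefficient for two flat binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# One overlapping pixel, 2 predicted and 1 target pixel: 2*1 / (2+1).
example = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```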

VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness

SeungJu Cha,Kwanyoung Lee,Ye-Chan Kim,Hyunwoo Oh,Dong-Jin Kim

Task: Propose VerbDiff to address the challenge of capturing human-object interactions in text-to-image diffusion models.

Motivation: Existing text-to-image diffusion models generate photorealistic images but often fail to depict human-object interactions accurately, because their ability to distinguish between different interaction words is limited.

Details

Method: VerbDiff weakens the bias between interaction words and objects to improve the understanding of interactions. Specifically, interaction words are disentangled from frequency-based anchor words, and localized interaction regions from generated images help the model capture the semantics of distinctive words without extra conditions.

Result: Extensive experiments on the HICO-DET dataset show that the method is more effective than previous approaches.

Conclusion: VerbDiff accurately understands the intended interaction between humans and objects and produces high-quality images consistent with the specified verbs.

Abstract: Recent large-scale text-to-image diffusion models generate photorealistic images but often struggle to accurately depict interactions between humans and objects due to their limited ability to differentiate various interaction words. In this work, we propose VerbDiff to address the challenge of capturing nuanced interactions within text-to-image diffusion models. VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects, enhancing the understanding of interactions. Specifically, we disentangle various interaction words from frequency-based anchor words and leverage localized interaction regions from generated images to help the model better capture semantics in distinctive words without extra conditions. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images with accurate interactions aligned with specified verbs. Extensive experiments on the HICO-DET dataset demonstrate the effectiveness of our method compared to previous approaches.

RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints

Yiran Qin,Li Kang,Xiufeng Song,Zhenfei Yin,Xiaohong Liu,Xihui Liu,Ruimao Zhang,Lei Bai

Task: Design effective embodied multi-agent systems for solving complex real-world tasks.

Motivation: Because of the complexity of embodied multi-agent systems, existing methods cannot automatically generate safe and efficient training data for them.

Details

Method: Introduces the concept of compositional constraints, designs interfaces tailored to different types of constraints, develops an automated data collection framework, and introduces the RoboFactory benchmark.

Result: On the RoboFactory benchmark, imitation learning methods are evaluated on tasks of varying difficulty, and architectures and training strategies for multi-agent imitation learning are explored.

Conclusion: Compositional constraints and purpose-built interfaces enable an automated data collection framework and the RoboFactory benchmark, laying a foundation for building safe and efficient multi-agent systems.

Abstract: Designing effective embodied multi-agent systems is critical for solving complex real-world tasks across domains. Due to the complexity of multi-agent embodied systems, existing methods fail to automatically generate safe and efficient training data for such systems. To this end, we propose the concept of compositional constraints for embodied multi-agent systems, addressing the challenges arising from collaboration among embodied agents. We design various interfaces tailored to different types of constraints, enabling seamless interaction with the physical world. Leveraging compositional constraints and specifically designed interfaces, we develop an automated data collection framework for embodied multi-agent systems and introduce the first benchmark for embodied multi-agent manipulation, RoboFactory. Based on the RoboFactory benchmark, we adapt and evaluate imitation learning methods and analyze their performance on agent tasks of varying difficulty. Furthermore, we explore the architectures and training strategies for multi-agent imitation learning, aiming to build safe and efficient embodied multi-agent systems.

Bézier Splatting for Fast and Differentiable Vector Graphics

Xi Liu,Chaoyi Zhou,Nanxuan Zhao,Siyu Huang

Task: Propose Bézier splatting, a new differentiable vector graphics representation that enables fast, high-fidelity vector graphics rasterization.

Motivation: Existing differentiable vector graphics representations are costly to optimize and struggle to achieve high-quality rendering for high-resolution images.

Details

Method: Bézier splatting samples 2D Gaussians along Bézier curves, which naturally provides positional gradients at object boundaries, and pairs this with an efficient splatting-based differentiable rasterizer. An adaptive pruning and densification strategy dynamically adjusts the spatial distribution of curves to escape local minima.

Result: Experiments show that Bézier splatting significantly outperforms existing methods in visual fidelity and optimizes 10x faster.

Conclusion: Bézier splatting offers an efficient, high-quality differentiable vector graphics representation that significantly improves vector graphics rasterization performance.

Abstract: Differentiable vector graphics (VGs) are widely used in image vectorization and vector synthesis, while existing representations are costly to optimize and struggle to achieve high-quality rendering results for high-resolution images. This work introduces a new differentiable VG representation, dubbed Bézier splatting, that enables fast yet high-fidelity VG rasterization. Bézier splatting samples 2D Gaussians along Bézier curves, which naturally provide positional gradients at object boundaries. Thanks to the efficient splatting-based differentiable rasterizer, Bézier splatting is over 20x and 150x faster per forward and backward rasterization step, respectively, for open curves compared to DiffVG. Additionally, we introduce an adaptive pruning and densification strategy that dynamically adjusts the spatial distribution of curves to escape local minima, further improving VG quality. Experimental results show that Bézier splatting significantly outperforms existing methods with better visual fidelity and 10x faster optimization speed.
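The core idea of sampling 2D Gaussians along a Bézier curve can be sketched in a few lines. This is a forward-only toy rasterizer under several assumptions not stated in the abstract: cubic curves, uniform sampling in the curve parameter, and isotropic Gaussians with a shared `sigma`. It does not implement the paper's differentiable rasterizer or its pruning and densification strategy; the function names are hypothetical.

```python
import math


def bezier_point(p0, p1, p2, p3, t):
    """Evaluate a cubic Bézier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    weights = (u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3)  # Bernstein basis
    points = (p0, p1, p2, p3)
    return tuple(sum(w * p[d] for w, p in zip(weights, points)) for d in range(2))


def splat_curve(ctrl, n_samples, sigma, width, height):
    """Render a curve by summing isotropic 2D Gaussians centred on curve samples.

    Requires n_samples >= 2. Returns a height x width grid of intensities.
    """
    centres = [bezier_point(*ctrl, t=i / (n_samples - 1)) for i in range(n_samples)]
    image = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            image[y][x] = sum(
                math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
                for cx, cy in centres
            )
    return image


# A horizontal stroke along y = 2 on a 7 x 5 canvas.
stroke = ((0.0, 2.0), (2.0, 2.0), (4.0, 2.0), (6.0, 2.0))
canvas = splat_curve(stroke, n_samples=7, sigma=1.0, width=7, height=5)
```

Because each pixel value is a smooth function of the Gaussian centres, and the centres are smooth functions of the control points, gradients with respect to curve positions exist everywhere, which is the property the abstract highlights at object boundaries.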

XAttention: Block Sparse Attention with Antidiagonal Scoring

Ruyi Xu,Guangxuan Xiao,Haofeng Huang,Junxian Guo,Song Han

Task: Accelerate inference for long-context Transformer models.

Motivation: Long-context Transformer models matter in real-world applications, but the quadratic complexity of attention makes them computationally expensive, and existing block-sparse attention methods struggle to balance accuracy and efficiency.

Details

Method: Proposes XAttention, a framework that uses sparse attention to accelerate long-context Transformer inference. XAttention sums the antidiagonal values in the attention matrix to identify and prune non-essential blocks precisely.

Result: Across multiple long-context benchmarks, XAttention matches the accuracy of full attention while substantially speeding up computation, with up to 13.5x acceleration in attention computation.

Conclusion: XAttention unlocks the practical potential of block-sparse attention, paving the way for scalable and efficient deployment of long-context Transformer models.

Abstract: Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle with balancing accuracy and efficiency due to costly block importance measurements. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks, including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation, XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications. Code is available at https://github.com/mit-han-lab/x-attention.
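The antidiagonal scoring idea can be illustrated on a small dense matrix. This sketch is not the released implementation: it assumes a square block size, scores each block by the sum along its single main antidiagonal (lower-left to upper-right), and then keeps, per block row, the highest-scoring blocks until they cover a fraction `keep_mass` of that row's total score. The selection rule and both function names are hypothetical.

```python
def antidiagonal_block_scores(attn, block):
    """Score each block by the sum of the entries on its antidiagonal."""
    n_rows, n_cols = len(attn) // block, len(attn[0]) // block
    scores = [[0.0] * n_cols for _ in range(n_rows)]
    for br in range(n_rows):
        for bc in range(n_cols):
            # Within a block, (i, block - 1 - i) runs from the upper-right
            # corner (i = 0) down to the lower-left corner (i = block - 1).
            scores[br][bc] = sum(
                attn[br * block + i][bc * block + (block - 1 - i)]
                for i in range(block)
            )
    return scores


def select_blocks(scores, keep_mass):
    """Per block row, keep top-scoring blocks until their share of the row's
    total score reaches keep_mass (hypothetical selection rule)."""
    mask = []
    for row in scores:
        total = sum(row)
        order = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        kept, acc = set(), 0.0
        for c in order:
            if total > 0 and acc / total >= keep_mass:
                break
            kept.add(c)
            acc += row[c]
        mask.append([c in kept for c in range(len(row))])
    return mask


attn = [
    [0.1, 0.4, 0.0, 0.0],
    [0.4, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.3],
    [0.1, 0.0, 0.3, 0.1],
]
scores = antidiagonal_block_scores(attn, block=2)
mask = select_blocks(scores, keep_mass=0.7)
```

Only `block` entries per block are read to compute a score, rather than all `block * block`, which is why this kind of proxy is cheap compared with pooling the full block. Attention would then be computed only for blocks where `mask` is true.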