2025 03 19

Context-aware Multimodal AI Reveals Hidden Pathways in Five Centuries of Art Evolution

Jin Kim, Byunghwee Lee, Taekho You, Jinhyuk Yun

Task: Use Stable Diffusion to analyze the latent information in 500 years of Western paintings, covering formal aspects (e.g., colors) and contextual aspects (e.g., subject).

Motivation: Explore the potential of multimodal generative AI in the art domain, in particular its ability to represent artworks in latent space.

Details

Method: Extract formal and contextual information from Western paintings with the Stable Diffusion model, and use contextual keywords to show how artistic expression evolves alongside societal change.

Result: Contextual information distinguishes artistic periods, styles, and individual artists more successfully than formal elements, and the generative experiment reproduces the evolutionary trajectory of artworks.

Conclusion: By integrating temporal, cultural, and historical contexts, multimodal AI expands traditional formal analysis and highlights the mutual influence between society and art.

Abstract: The rise of multimodal generative AI is transforming the intersection of technology and art, offering deeper insights into large-scale artwork. Although its creative capabilities have been widely explored, its potential to represent artwork in latent spaces remains underexamined. We use cutting-edge generative AI, specifically Stable Diffusion, to analyze 500 years of Western paintings by extracting two types of latent information with the model: formal aspects (e.g., colors) and contextual aspects (e.g., subject). Our findings reveal that contextual information differentiates between artistic periods, styles, and individual artists more successfully than formal elements. Additionally, using contextual keywords extracted from paintings, we show how artistic expression evolves alongside societal changes. Our generative experiment, infusing prospective contexts into historical artworks, successfully reproduces the evolutionary trajectory of artworks, highlighting the significance of mutual interaction between society and art. This study demonstrates how multimodal AI expands traditional formal analysis by integrating temporal, cultural, and historical contexts.

A Comprehensive Survey on Visual Concept Mining in Text-to-image Diffusion Models

Ziqiang Li, Jun Li, Lizhi Xiong, Zhangjie Fu, Zechao Li

Task: Systematically categorize and explore Visual Concept Mining (VCM) techniques for improving the controllability of text-to-image diffusion models.

Motivation: The inherent limitations of textual signals often prevent text-to-image diffusion models from fully capturing specific concepts, which reduces their controllability.

Details

Method: Categorize existing research into four key areas: concept learning, concept erasing, concept decomposition, and concept combination.

Result: The classification offers insight into the foundational principles of VCM techniques, and the survey proposes future research directions.

Conclusion: VCM techniques have substantial potential for improving the controllability of text-to-image diffusion models, and future work should explore the area further.

Abstract: Text-to-image diffusion models have made significant advancements in generating high-quality, diverse images from text prompts. However, the inherent limitations of textual signals often prevent these models from fully capturing specific concepts, thereby reducing their controllability. To address this issue, several approaches have incorporated personalization techniques, utilizing reference images to mine visual concept representations that complement textual inputs and enhance the controllability of text-to-image diffusion models. Despite these advances, a comprehensive, systematic exploration of visual concept mining remains limited. In this paper, we categorize existing research into four key areas: Concept Learning, Concept Erasing, Concept Decomposition, and Concept Combination. This classification provides valuable insights into the foundational principles of Visual Concept Mining (VCM) techniques. Additionally, we identify key challenges and propose future research directions to propel this important and interesting field forward.

Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception

Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, Xiang Bai

Task: Present UniFuture, a driving world model that seamlessly integrates future scene generation and perception.

Motivation: Existing models usually focus on either pixel-level future prediction or geometric reasoning, whereas UniFuture jointly models future appearance (RGB images) and geometry (depth) to ensure coherent predictions.

Details

Method: Introduce a Dual-Latent Sharing scheme and a Multi-scale Latent Interaction mechanism during training to enhance geometric consistency and perceptual alignment.

Result: Extensive experiments on the nuScenes dataset show that UniFuture outperforms specialized models on both generation and perception tasks.

Conclusion: UniFuture demonstrates the advantages of a unified, structure-aware world model.

Abstract: We present UniFuture, a simple yet effective driving world model that seamlessly integrates future scene generation and perception within a single framework. Unlike existing models focusing solely on pixel-level future prediction or geometric reasoning, our approach jointly models future appearance (i.e., RGB image) and geometry (i.e., depth), ensuring coherent predictions. Specifically, during training, we first introduce a Dual-Latent Sharing scheme, which transfers image and depth sequences into a shared latent space, allowing both modalities to benefit from shared feature learning. Additionally, we propose a Multi-scale Latent Interaction mechanism, which facilitates bidirectional refinement between image and depth features at multiple spatial scales, effectively enhancing geometry consistency and perceptual alignment. During testing, our UniFuture can easily predict high-consistency future image-depth pairs by only using the current image as input. Extensive experiments on the nuScenes dataset demonstrate that UniFuture outperforms specialized models on future generation and perception tasks, highlighting the advantages of a unified, structurally-aware world model. The project page is at https://github.com/dk-liang/UniFuture.

Let Synthetic Data Shine: Domain Reassembly and Soft-Fusion for Single Domain Generalization

Hao Li, Yubin Xiao, Ke Liang, Mengzhu Wang, Long Lan, Kenli Li, Xinwang Liu

Task: Single Domain Generalization (SDG) aims to train a model on data from a single source so that it performs consistently across diverse scenarios.

Motivation: Using synthetic data directly can degrade performance because of significant feature distribution discrepancies between synthetic data and real target domains.

Details

Method: Propose the Discriminative Domain Reassembly and Soft-Fusion (DRSF) training framework, which leverages synthetic data to improve model generalization. It comprises a Discriminative Feature Decoupling and Reassembly (DFDR) module and a Multi-pseudo-domain Soft Fusion (MDSF) module.

Result: Extensive experiments on object detection and semantic segmentation show that DRSF achieves substantial performance gains with only marginal computational overhead.

Conclusion: DRSF's plug-and-play architecture integrates seamlessly with unsupervised domain adaptation paradigms, demonstrating broad applicability to diverse, real-world domain challenges.

Abstract: Single Domain Generalization (SDG) aims to train models with consistent performance across diverse scenarios using data from a single source. While latent diffusion models (LDMs) show promise in augmenting limited source data, we demonstrate that directly using synthetic data can be detrimental due to significant feature distribution discrepancies between synthetic and real target domains, leading to performance degradation. To address this issue, we propose Discriminative Domain Reassembly and Soft-Fusion (DRSF), a training framework leveraging synthetic data to improve model generalization. We employ LDMs to produce diverse pseudo-target domain samples and introduce two key modules to handle distribution bias. First, the Discriminative Feature Decoupling and Reassembly (DFDR) module uses entropy-guided attention to recalibrate channel-level features, suppressing synthetic noise while preserving semantic consistency. Second, the Multi-pseudo-domain Soft Fusion (MDSF) module uses adversarial training with latent-space feature interpolation, creating continuous feature transitions between domains. Extensive SDG experiments on object detection and semantic segmentation tasks demonstrate that DRSF achieves substantial performance gains with only marginal computational overhead. Notably, DRSF's plug-and-play architecture enables seamless integration with unsupervised domain adaptation paradigms, underscoring its broad applicability in addressing diverse and real-world domain challenges.
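The latent-space feature interpolation mentioned for the MDSF module can be illustrated with a minimal sketch. It shows only the interpolation step, under the assumption that features are plain vectors and that the mixing is linear; the adversarial training and everything else in DRSF is omitted, and the function names are hypothetical rather than taken from the paper.

```python
from typing import List, Sequence


def interpolate_features(
    source: Sequence[float], pseudo: Sequence[float], alpha: float
) -> List[float]:
    """Linearly mix a source-domain feature with a pseudo-domain feature.

    alpha = 0 returns the source feature, alpha = 1 the pseudo-domain one.
    Linear mixing is an assumption made for illustration.
    """
    if len(source) != len(pseudo):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * s + alpha * p for s, p in zip(source, pseudo)]


def feature_path(
    source: Sequence[float], pseudo: Sequence[float], steps: int
) -> List[List[float]]:
    """Return `steps` evenly spaced features from source to pseudo domain."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        interpolate_features(source, pseudo, i / (steps - 1))
        for i in range(steps)
    ]
```

Calling `feature_path(src, tgt, 5)` yields a short sequence that starts at the source feature and ends at the pseudo-domain feature, which is one simple way to picture a "continuous feature transition between domains".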

Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, Federico Tombari

Task: Evaluate temporal understanding in egocentric videos.

Motivation: Questions in current egocentric video question-answering datasets can often be answered from a few frames or from commonsense reasoning, without necessarily being grounded in the actual video content.

Details

Method: Introduce EgoTempo, a dataset designed specifically to evaluate temporal understanding in the egocentric domain, emphasizing tasks that require integrating information across the entire video.

Result: Experiments show that current Multi-Modal Large Language Models still fall short in temporal reasoning on egocentric videos.

Conclusion: EgoTempo is expected to catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics.

Abstract: Understanding fine-grained temporal dynamics is crucial in egocentric videos, where continuous streams capture frequent, close-up interactions with objects. In this work, we bring to light that current egocentric video question-answering datasets often include questions that can be answered using only a few frames or commonsense reasoning, without being necessarily grounded in the actual video. Our analysis shows that state-of-the-art Multi-Modal Large Language Models (MLLMs) on these benchmarks achieve remarkably high performance using just text or a single frame as input. To address these limitations, we introduce EgoTempo, a dataset specifically designed to evaluate temporal understanding in the egocentric domain. EgoTempo emphasizes tasks that require integrating information across the entire video, ensuring that models would need to rely on temporal patterns rather than static cues or pre-existing knowledge. Extensive experiments on EgoTempo show that current MLLMs still fall short in temporal reasoning on egocentric videos, and thus we hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics. Dataset and code are available at https://github.com/google-research-datasets/egotempo.git.

Web Artifact Attacks Disrupt Vision Language Models

Maan Qraitem, Piotr Teterwak, Kate Saenko, Bryan A. Plummer

Task: Introduce artifact-based attacks that mislead vision-language models.

Motivation: Vision-language models learn unintended correlations between semantic concepts and unrelated visual signals during training, so their predictions rely on incidental patterns rather than genuine visual understanding.

Details

Method: Propose a new class of attacks that mislead models with non-matching text and graphical elements, and frame finding such attacks as a search problem.

Result: The attacks are effective across five datasets; some artifacts reinforce each other to reach 100% attack success rates, and the attacks transfer across models with up to 90% effectiveness.

Conclusion: Extending prior artifact-aware prompting to the graphical setting moderately reduces attack success rates, which suggests a promising direction for improving model robustness.

Abstract: Vision-language models (VLMs) (e.g., CLIP, LLaVA) are trained on large-scale, lightly curated web datasets, leading them to learn unintended correlations between semantic concepts and unrelated visual signals. These associations degrade model accuracy by causing predictions to rely on incidental patterns rather than genuine visual understanding. Prior work has weaponized these correlations as an attack vector to manipulate model predictions, such as inserting a deceiving class text onto the image in a typographic attack. These attacks succeed due to VLMs' text-heavy bias, a result of captions that echo visible words rather than describing content. However, this attack has focused solely on text that matches the target class exactly, overlooking a broader range of correlations, including non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To address this gap, we introduce artifact-based attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements. Unlike typographic attacks, these artifacts are not predefined, making them harder to defend against but also more challenging to find. We address this by framing artifact attacks as a search problem and demonstrate their effectiveness across five datasets, with some artifacts reinforcing each other to reach 100% attack success rates. These attacks transfer across models with up to 90% effectiveness, making it possible to attack unseen models. To defend against these attacks, we extend prior work's artifact-aware prompting to the graphical setting. We see a moderate reduction of success rates of up to 15% relative to standard prompts, suggesting a promising direction for enhancing model robustness.
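To make "framing artifact attacks as a search problem" concrete, here is a minimal sketch of a generic greedy search over candidate artifacts. This is not the paper's search procedure, which the abstract does not specify; the greedy strategy, the function names, and the toy scoring table are all assumptions made for illustration. In a real setting the `score` callable would evaluate a model on attacked images, whereas here it is a hard-coded stand-in that returns a success rate in percent.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_artifact_search(
    candidates: Sequence[str],
    score: Callable[[List[str]], int],
    max_artifacts: int,
) -> Tuple[List[str], int]:
    """Greedily add the artifact that raises the score most, until no gain."""
    chosen: List[str] = []
    best = score(chosen)
    remaining = list(candidates)
    while remaining and len(chosen) < max_artifacts:
        # Score every one-artifact extension of the current set.
        top_score, top = max(
            ((score(chosen + [c]), c) for c in remaining),
            key=lambda pair: pair[0],
        )
        if top_score <= best:
            break  # no candidate improves the attack any further
        chosen.append(top)
        remaining.remove(top)
        best = top_score
    return chosen, best


def toy_success_rate(artifacts: List[str]) -> int:
    """Hypothetical attack success rate (percent) for a set of artifacts."""
    base = {"logo": 40, "text": 30, "symbol": 10}
    total = sum(base[a] for a in artifacts)
    # Toy synergy term: two artifacts that reinforce each other.
    if "logo" in artifacts and "text" in artifacts:
        total += 30
    return min(100, total)
```

With the toy scores, `greedy_artifact_search(["logo", "text", "symbol"], toy_success_rate, 3)` picks the logo first and then the text, whose combined bonus mimics artifacts reinforcing each other.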

FiVE: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models

Minghan Li, Chenxi Xie, Yichen Wu, Lei Zhang, Mengyu Wang

Task: Propose FiVE, a fine-grained video editing benchmark for evaluating emerging diffusion and rectified flow models.

Motivation: The lack of a standardized benchmark for fairly evaluating text-to-video (T2V) editing methods has led to inconsistent claims and an inability to assess model sensitivity to hyperparameters.

Details

Method: Introduce the FiVE benchmark, which includes 74 real-world videos and 26 generated videos, covering 6 fine-grained editing types and 420 object-level editing prompt pairs with their corresponding masks. In addition, adapt the latest rectified flow (RF) T2V generation models, Pyramid-Flow and Wan2.1, by introducing FlowEdit, yielding the training-free and inversion-free video editing models Pyramid-Edit and Wan-Edit.

Result: RF-based editing significantly outperforms diffusion-based methods, with Wan-Edit achieving the best overall performance and the least sensitivity to hyperparameters.

Conclusion: The FiVE benchmark and the FlowEdit adaptation provide an effective evaluation toolkit for fine-grained video editing, and RF-based methods perform strongly on video editing tasks.

Abstract: Numerous text-to-video (T2V) editing methods have emerged recently, but the lack of a standardized benchmark for fair evaluation has led to inconsistent claims and an inability to assess model sensitivity to hyperparameters. Fine-grained video editing is crucial for enabling precise, object-level modifications while maintaining context and temporal consistency. To address this, we introduce FiVE, a Fine-grained Video Editing Benchmark for evaluating emerging diffusion and rectified flow models. Our benchmark includes 74 real-world videos and 26 generated videos, featuring 6 fine-grained editing types, 420 object-level editing prompt pairs, and their corresponding masks. Additionally, we adapt the latest rectified flow (RF) T2V generation models, Pyramid-Flow and Wan2.1, by introducing FlowEdit, resulting in training-free and inversion-free video editing models Pyramid-Edit and Wan-Edit. We evaluate five diffusion-based and two RF-based editing methods on our FiVE benchmark using 15 metrics, covering background preservation, text-video similarity, temporal consistency, video quality, and runtime. To further enhance object-level evaluation, we introduce FiVE-Acc, a novel metric leveraging Vision-Language Models (VLMs) to assess the success of fine-grained video editing. Experimental results demonstrate that RF-based editing significantly outperforms diffusion-based methods, with Wan-Edit achieving the best overall performance and exhibiting the least sensitivity to hyperparameters. More video demos are available on the anonymous website: https://sites.google.com/view/five-benchmark

Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds

Eitan Shaar, Ariel Shaulov, Gal Chechik, Lior Wolf

Task: Propose a model-agnostic approach to audio-visual event perception that requires no further training and uses score-level fusion to retain richer multimodal interactions.

Motivation: Existing methods are constrained by the vocabulary in their training data and generalize poorly to unseen categories, and the labor-intensive annotation process limits their scalability. Existing models also ignore shifts in event distributions over time, so they cannot adapt to changing video dynamics.

Details

Method: Propose Audio-Visual Adaptive Video Analysis (AV^2A), which combines a score-level fusion technique with a within-video label shift algorithm that dynamically adjusts event distributions.

Result: Applied to both zero-shot and weakly-supervised state-of-the-art methods, AV^2A yields notable performance improvements.

Conclusion: Without any training, AV^2A substantially improves audio-visual event perception and shows promise in open-vocabulary scenarios.

Abstract: In the domain of audio-visual event perception, which focuses on the temporal localization and classification of events across distinct modalities (audio and visual), existing approaches are constrained by the vocabulary available in their training data. This limitation significantly impedes their capacity to generalize to novel, unseen event categories. Furthermore, the annotation process for this task is labor-intensive, requiring extensive manual labeling across modalities and temporal segments, limiting the scalability of current methods. Current state-of-the-art models ignore the shifts in event distributions over time, reducing their ability to adjust to changing video dynamics. Additionally, previous methods rely on late fusion to combine audio and visual information. While straightforward, this approach results in a significant loss of multimodal interactions. To address these challenges, we propose Audio-Visual Adaptive Video Analysis ($\text{AV}^2\text{A}$), a model-agnostic approach that requires no further training and integrates a score-level fusion technique to retain richer multimodal interactions. $\text{AV}^2\text{A}$ also includes a within-video label shift algorithm, leveraging input video data and predictions from prior frames to dynamically adjust event distributions for subsequent frames. Moreover, we present the first training-free, open-vocabulary baseline for audio-visual event perception, demonstrating that $\text{AV}^2\text{A}$ achieves substantial improvements over naive training-free baselines. We demonstrate the effectiveness of $\text{AV}^2\text{A}$ on both zero-shot and weakly-supervised state-of-the-art methods, achieving notable improvements in performance metrics over existing approaches.

Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory

Saket Gurukar, Asim Kadav

Task: Present Long-VMNet, a new video understanding method for retrieval, summarization, and question answering on long-form videos.

Motivation: Traditional approaches demand substantial computing power and are often bottlenecked by GPU memory; Long-VMNet is proposed to address this.

Details

Method: Use a fixed-size memory representation to store discriminative patches sampled from the input video, with a neural sampler that identifies discriminative tokens.

Result: On the Rest-ADL dataset, inference time for long-form video retrieval and question answering improves by 18x to 75x, with competitive predictive performance.

Conclusion: With a single scan through the video and a fixed-size memory representation, Long-VMNet markedly improves the efficiency of long-form video understanding.

Abstract: Long-form video understanding is essential for various applications such as video retrieval, summarizing, and question answering. Yet, traditional approaches demand substantial computing power and are often bottlenecked by GPU memory. To tackle this issue, we present Long-Video Memory Network, Long-VMNet, a novel video understanding method that employs a fixed-size memory representation to store discriminative patches sampled from the input video. Long-VMNet achieves improved efficiency by leveraging a neural sampler that identifies discriminative tokens. Additionally, Long-VMNet only needs one scan through the video, greatly boosting efficiency. Our results on the Rest-ADL dataset demonstrate an 18x to 75x improvement in inference times for long-form video retrieval and answering questions, with a competitive predictive performance.
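The fixed-size memory idea can be illustrated with a minimal sketch. It assumes each patch token already carries a scalar discriminativeness score (Long-VMNet uses a learned neural sampler for this; the plain `score` values here are a hypothetical stand-in) and keeps the `capacity` highest-scoring tokens seen in a single pass. The class and function names are illustrative and not taken from the Long-VMNet code.

```python
import heapq
from typing import Any, Iterable, List, Tuple


class FixedMemory:
    """Keep the `capacity` highest-scoring tokens seen so far."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Min-heap of (score, insertion index, token); the index breaks
        # ties so that token payloads are never compared directly.
        self._heap: List[Tuple[float, int, Any]] = []
        self._counter = 0

    def offer(self, score: float, token: Any) -> None:
        """Insert a token, evicting the lowest-scoring one when full."""
        entry = (score, self._counter, token)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)

    def contents(self) -> List[Any]:
        """Return stored tokens, highest score first."""
        return [token for _, _, token in sorted(self._heap, reverse=True)]


def scan_video(scored_tokens: Iterable[Tuple[float, Any]], capacity: int) -> List[Any]:
    """Single pass over (score, token) pairs with bounded memory."""
    memory = FixedMemory(capacity)
    for score, token in scored_tokens:
        memory.offer(score, token)
    return memory.contents()
```

Memory use stays at `capacity` entries regardless of video length, and each token is inspected once, which mirrors the one-scan, fixed-memory property the abstract describes.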

Improving Geometric Consistency for 360-Degree Neural Radiance Fields in Indoor Scenarios

Iryna Repinetska, Anna Hilsmann, Peter Eisert

Task: Propose an efficient and robust method for computing dense depth priors, tailored to large low-textured architectural surfaces in indoor environments.

Motivation: NeRFs often produce cloudy artifacts in large low-textured regions such as walls, ceilings, and floors, which reduces scene realism. Existing methods struggle to estimate depth in textureless regions, especially in 360-degree "inside-out" views.

Details

Method: Introduce a new depth loss function to improve rendering quality in low-feature regions, complemented by depth-patch regularization that further refines depth consistency in other areas.

Result: Experiments with Instant-NGP show improved visual fidelity on synthetic 360-degree indoor scenes compared to standard photometric loss and Mean Squared Error depth supervision.

Conclusion: The method effectively addresses NeRF rendering problems in large low-textured regions and improves the realism of indoor scenes.

Abstract: Photo-realistic rendering and novel view synthesis play a crucial role in human-computer interaction tasks, from gaming to path planning. Neural Radiance Fields (NeRFs) model scenes as continuous volumetric functions and achieve remarkable rendering quality. However, NeRFs often struggle in large, low-textured areas, producing cloudy artifacts known as "floaters" that reduce scene realism, especially in indoor environments with featureless architectural surfaces like walls, ceilings, and floors. To overcome this limitation, prior work has integrated geometric constraints into the NeRF pipeline, typically leveraging depth information derived from Structure from Motion or Multi-View Stereo. Yet, conventional RGB-feature correspondence methods face challenges in accurately estimating depth in textureless regions, leading to unreliable constraints. This challenge is further complicated in 360-degree "inside-out" views, where sparse visual overlap between adjacent images further hinders depth estimation. In order to address these issues, we propose an efficient and robust method for computing dense depth priors, specifically tailored for large low-textured architectural surfaces in indoor environments. We introduce a novel depth loss function to enhance rendering quality in these challenging, low-feature regions, while complementary depth-patch regularization further refines depth consistency across other areas. Experiments with Instant-NGP on two synthetic 360-degree indoor scenes demonstrate improved visual fidelity with our method compared to standard photometric loss and Mean Squared Error depth supervision.
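For reference, the baseline that the experiments compare against (photometric loss plus Mean Squared Error depth supervision) is commonly written as follows. The notation is an assumption chosen for illustration rather than taken from the paper: $\mathcal{R}$ is a batch of rays, $\mathcal{R}_d$ the subset of rays with a depth prior $D(\mathbf{r})$, $w_i$ and $t_i$ the volume-rendering weights and sample distances along a ray, and $\lambda$ a weighting hyperparameter. The paper's own depth loss and depth-patch regularization are not reproduced here.

```latex
\begin{aligned}
\hat{D}(\mathbf{r}) &= \sum_{i} w_i \, t_i, \\
\mathcal{L}_{\text{photo}} &= \frac{1}{|\mathcal{R}|} \sum_{\mathbf{r} \in \mathcal{R}}
  \bigl\lVert \hat{C}(\mathbf{r}) - C(\mathbf{r}) \bigr\rVert_2^2, \\
\mathcal{L}_{\text{depth}} &= \frac{1}{|\mathcal{R}_d|} \sum_{\mathbf{r} \in \mathcal{R}_d}
  \bigl( \hat{D}(\mathbf{r}) - D(\mathbf{r}) \bigr)^2, \\
\mathcal{L} &= \mathcal{L}_{\text{photo}} + \lambda \, \mathcal{L}_{\text{depth}}.
\end{aligned}
```

Here $\hat{C}(\mathbf{r})$ and $C(\mathbf{r})$ denote the rendered and ground-truth colors of ray $\mathbf{r}$, and $\hat{D}(\mathbf{r})$ is the expected ray termination depth.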

SED-MVS: Segmentation-Driven and Edge-Aligned Deformation Multi-View Stereo with Depth Restoration and Occlusion Constraint

Zhenlong Yuan, Zhidong Yang, Yujun Cai, Kuangxin Wu, Mufan Liu, Dapeng Zhang, Hao Jiang, Zhaoxin Li, Zhaoqi Wang

Task: Propose SED-MVS, a new multi-view stereo method that uses panoptic segmentation and a multi-trajectory diffusion strategy to address deformation instability in textureless areas.

Motivation: Existing patch-deformation methods reconstruct textureless areas well but neglect the deformation instability caused by edge-skipping, which can lead to matching distortions.

Details

Method: Adopt panoptic segmentation and a multi-trajectory diffusion strategy, using SAM2 for depth-edge guidance; initialize with sparse points from LoFTR and a monocular depth map from DepthAnything V2; and combine the segmentation image with the monocular depth map to achieve occlusion-aware patch deformation.

Result: Extensive experiments on the ETH3D, Tanks & Temples, BlendedMVS, and Strecha datasets validate the method's state-of-the-art performance and strong generalization capability.

Conclusion: SED-MVS performs strongly in multi-view stereo, particularly in handling textureless areas and edge alignment, with notable gains in performance and robustness.

Abstract: Recently, patch-deformation methods have exhibited significant effectiveness in multi-view stereo owing to the deformable and expandable patches in reconstructing textureless areas. However, such methods primarily emphasize broadening the receptive field in textureless areas, while neglecting deformation instability caused by easily overlooked edge-skipping, potentially leading to matching distortions. To address this, we propose SED-MVS, which adopts panoptic segmentation and a multi-trajectory diffusion strategy for segmentation-driven and edge-aligned patch deformation. Specifically, to prevent unanticipated edge-skipping, we first employ SAM2 for panoptic segmentation as depth-edge guidance to guide patch deformation, followed by a multi-trajectory diffusion strategy to ensure patches are comprehensively aligned with depth edges. Moreover, to avoid the potential inaccuracy of random initialization, we combine both sparse points from LoFTR and a monocular depth map from DepthAnything V2 to restore a reliable and realistic depth map for initialization and supervised guidance. Finally, we integrate the segmentation image with the monocular depth map to exploit inter-instance occlusion relationships, then further regard them as an occlusion map to implement two distinct edge constraints, thereby facilitating occlusion-aware patch deformation. Extensive results on ETH3D, Tanks & Temples, BlendedMVS and Strecha datasets validate the state-of-the-art performance and robust generalization capability of our proposed method.

Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition

Shristi Das Biswas, Efstathia Soufleri, Arani Roy, Kaushik Roy

Task: Propose a compute-efficient video representation learning method that reduces inference cost and increases inference speed.

Motivation: Existing video representation learning is computationally challenging, mainly because of substantial decoding overheads, the enormous size of raw video streams, and high temporal redundancy. Exploiting all modalities available in the compressed video domain (I-frames and P-frames) offers a compute-efficient alternative.

Details

Method: Propose a hybrid end-to-end framework built on three key concepts: (1) a specially designed dual-encoder scheme with efficient Spiking Temporal Modulators that minimizes latency while retaining cross-domain feature aggregation; (2) a unified transformer model that captures inter-modal dependencies with global self-attention to enhance contextual interactions between I-frames and P-frames; (3) a Multi-Modal Mixer Block that models rich representations from the joint spatiotemporal token embeddings.

Result: The method achieves state-of-the-art video recognition performance on the UCF-101, HMDB-51, K-400, K-600, and SS-v2 datasets, with a 56-fold increase in inference speed and a 330-fold reduction in inference cost.

Conclusion: The method offers practical design choices for efficient next-generation spatiotemporal learners and delivers substantial gains on video recognition tasks. Code is publicly available.

Abstract: Training robust deep video representations has proven to be computationally challenging due to substantial decoding overheads, the enormous size of raw video streams, and their inherent high temporal redundancy. Different from existing schemes, operating exclusively in the compressed video domain and exploiting all freely available modalities, i.e., I-frames, and P-frames (motion vectors and residuals) offers a compute-efficient alternative. Existing methods approach this task as a naive multi-modality problem, ignoring the temporal correlation and implicit sparsity across P-frames for modeling stronger shared representations for videos of the same action, making training and generalization easier. By revisiting the high-level design of dominant video understanding backbones, we increase inference speed by a factor of $56$ while retaining similar performance. For this, we propose a hybrid end-to-end framework that factorizes learning across three key concepts to reduce inference cost by $330\times$ versus prior art: First, a specially designed dual-encoder scheme with efficient Spiking Temporal Modulators to minimize latency while retaining cross-domain feature aggregation. Second, a unified transformer model to capture inter-modal dependencies using global self-attention to enhance I-frame and P-frame contextual interactions. Third, a Multi-Modal Mixer Block to model rich representations from the joint spatiotemporal token embeddings. Experiments show that our method results in a lightweight architecture achieving state-of-the-art video recognition performance on UCF-101, HMDB-51, K-400, K-600 and SS-v2 datasets with favorable costs ($0.73$J/V) and fast inference ($16$V/s). Our observations bring new insights into practical design choices for efficient next-generation spatiotemporal learners. Code is available.

TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark

Forouzan Fallah, Maitreya Patel, Agneet Chatterjee, Vlad I. Morariu, Chitta Baral, Yezhou Yang

Task: Evaluate the ability of diffusion models to embed text in images.

Motivation: Existing diffusion-based text-to-image models struggle to embed text accurately in images, facing challenges in spelling accuracy, contextual relevance, and visual coherence, and comprehensive benchmarks are lacking.

Details

Method: Introduce TextInVision, a large-scale benchmark driven by text and prompt complexity, for evaluating how effectively diffusion models integrate visual text into images.

Result: An extensive analysis of multiple models identifies common errors such as spelling inaccuracies and contextual mismatches.

Conclusion: The study lays a foundation for future advances in AI-generated multimodal content.

Abstract: Generating images with embedded text is crucial for the automatic production of visual and multimodal documents, such as educational materials and advertisements. However, existing diffusion-based text-to-image models often struggle to accurately embed text within images, facing challenges in spelling accuracy, contextual relevance, and visual coherence. Evaluating the ability of such models to embed text within a generated image is complicated due to the lack of comprehensive benchmarks. In this work, we introduce TextInVision, a large-scale, text and prompt complexity driven benchmark designed to evaluate the ability of diffusion models to effectively integrate visual text into images. We crafted a diverse set of prompts and texts that consider various attributes and text characteristics. Additionally, we prepared an image dataset to test Variational Autoencoder (VAE) models across different character representations, highlighting that VAE architectures can also pose challenges in text generation within diffusion frameworks. Through extensive analysis of multiple models, we identify common errors and highlight issues such as spelling inaccuracies and contextual mismatches. By pinpointing the failure points across different prompts and texts, our research lays the foundation for future advancements in AI-generated multimodal content.

Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes

Keqi Chen, Vinkle Srivastav, Didier Mutter, Nicolas Padoy

Task: Propose Self-MVA, a self-supervised, uncalibrated multi-view person association approach that uses no annotations.

Motivation: In multi-view person association, person re-identification features become unreliable when people share similar appearances, so cross-view geometric constraints are needed for robustness. However, most existing methods require ground-truth labels or calibrated camera parameters, both of which are hard to obtain.

Details

Method: Propose a self-supervised learning framework consisting of an encoder-decoder model and a self-supervised pretext task (cross-view image synchronization), using Hungarian matching to bridge the gap between instance-wise and image-wise distances. Two further self-supervised linear constraints, multi-view re-projection and pairwise edge association, reduce the solution space.

Result: Extensive experiments on three challenging public benchmark datasets (WILDTRACK, MVOR, and SOLDIERS) show that the method achieves state-of-the-art results, surpassing existing unsupervised and fully-supervised approaches.

Conclusion: Through self-supervised learning, Self-MVA achieves robust and accurate multi-view person association without annotations or calibrated camera parameters.

Abstract: Multi-view person association is a fundamental step towards multi-view analysis of human activities. Although the person re-identification features have been proven effective, they become unreliable in challenging scenes where persons share similar appearances. Therefore, cross-view geometric constraints are required for a more robust association. However, most existing approaches are either fully-supervised using ground-truth identity labels or require calibrated camera parameters that are hard to obtain. In this work, we investigate the potential of learning from synchronization, and propose a self-supervised uncalibrated multi-view person association approach, Self-MVA, without using any annotations. Specifically, we propose a self-supervised learning framework, consisting of an encoder-decoder model and a self-supervised pretext task, cross-view image synchronization, which aims to distinguish whether two images from different views are captured at the same time. The model encodes each person's unified geometric and appearance features, and we train it by utilizing synchronization labels for supervision after applying Hungarian matching to bridge the gap between instance-wise and image-wise distances. To further reduce the solution space, we propose two types of self-supervised linear constraints: multi-view re-projection and pairwise edge association. Extensive experiments on three challenging public benchmark datasets (WILDTRACK, MVOR, and SOLDIERS) show that our approach achieves state-of-the-art results, surpassing existing unsupervised and fully-supervised approaches. Code is available at https://github.com/CAMMA-public/Self-MVA.
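The step of turning instance-wise distances into one image-wise distance via matching can be sketched as follows. This is a toy illustration, not the Self-MVA implementation: it finds the minimum-cost one-to-one matching by exhaustive search (the same optimum the Hungarian algorithm finds efficiently, but only practical for a handful of people), and it assumes that the image-wise distance is the mean cost of the matched pairs. The function names and that aggregation rule are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def min_cost_matching(
    cost: Sequence[Sequence[float]],
) -> Tuple[List[Tuple[int, int]], float]:
    """Exhaustive minimum-cost matching of every row to a distinct column.

    cost[i][j] is the distance between person i in view A and person j in
    view B. Requires at least one row and no more rows than columns.
    """
    n_rows, n_cols = len(cost), len(cost[0])
    if n_rows > n_cols:
        raise ValueError("need at least as many columns as rows")
    best_pairs: List[Tuple[int, int]] = []
    best_total = float("inf")
    for cols in permutations(range(n_cols), n_rows):
        total = sum(cost[r][c] for r, c in enumerate(cols))
        if total < best_total:
            best_total = total
            best_pairs = list(enumerate(cols))
    return best_pairs, best_total


def image_distance(cost: Sequence[Sequence[float]]) -> float:
    """Assumed aggregation: mean instance distance over matched pairs."""
    pairs, total = min_cost_matching(cost)
    return total / len(pairs)
```

Under this toy definition, two images captured at the same moment should contain the same people and therefore match cheaply, giving a small image-wise distance that a synchronization label can supervise.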

C2D-ISR: Optimizing Attention-based Image Super-resolution from Continuous to Discrete Scales

Yuxuan Jiang, Chengxi Zeng, Siyue Teng, Fan Zhang, Xiaoqing Zhu, Joel Sole, David Bull

Task: Optimize attention-based image super-resolution models from both performance and complexity perspectives.

Motivation: Existing single image super-resolution (SISR) methods rely on simple training strategies and on network architectures designed for discrete up-sampling scales, which limits the model's ability to capture information across multiple scales.

Details

Method: Propose a new framework, C2D-ISR, that uses a two-stage training methodology and a hierarchical encoding mechanism to enable continuous-scale training and cross-scale information aggregation.

Result: Evaluated on three efficient attention-based backbones (SwinIR-L, SRFormer-L, and MambaIRv2-L), C2D-ISR significantly outperforms the existing optimization framework HiT in super-resolution performance (up to 0.2 dB) and computational complexity reduction (up to 11%).

Conclusion: Through continuous-scale training and hierarchical encoding, C2D-ISR markedly improves both the performance and the efficiency of image super-resolution.

Abstract: In recent years, attention mechanisms have been exploited in single image super-resolution (SISR), achieving impressive reconstruction results. However, these advancements are still limited by the reliance on simple training strategies and network architectures designed for discrete up-sampling scales, which hinder the model's ability to effectively capture information across multiple scales. To address these limitations, we propose a novel framework, C2D-ISR, for optimizing attention-based image super-resolution models from both performance and complexity perspectives. Our approach is based on a two-stage training methodology and a hierarchical encoding mechanism. The new training methodology involves continuous-scale training for discrete scale models, enabling the learning of inter-scale correlations and multi-scale feature representation. In addition, we generalize the hierarchical encoding mechanism with existing attention-based network structures, which can achieve improved spatial feature fusion, cross-scale information aggregation, and more importantly, much faster inference. We have evaluated the C2D-ISR framework based on three efficient attention-based backbones, SwinIR-L, SRFormer-L and MambaIRv2-L, and demonstrated significant improvements over the other existing optimization framework, HiT, in terms of super-resolution performance (up to 0.2dB) and computational complexity reduction (up to 11%). The source code will be made publicly available at www.github.com.

MonoCT: Overcoming Monocular 3D Detection Domain Shift with Consistent Teacher Models

Johannes Meier, Louis Inchingolo, Oussema Dhaouadi, Yan Xia, Jacques Kaiser, Daniel Cremers

Task: Address monocular 3D object detection across different sensors, environments, and camera setups.

Motivation: Accurate depth estimation is critical to mitigating domain shift.

Details

Method: Propose MonoCT, a new unsupervised domain adaptation approach that includes a Generalized Depth Enhancement (GDE) module and a Pseudo Label Scoring (PLS) module.

Result: Across six benchmarks, MonoCT outperforms existing SOTA domain adaptation methods by large margins (at least about 21% on AP Mod.) and generalizes well to car, traffic camera, and drone views.

Conclusion: By improving depth estimation and generating high-quality pseudo labels, MonoCT substantially improves cross-domain adaptation for monocular 3D object detection.

Abstract: We tackle the problem of monocular 3D object detection across different sensors, environments, and camera setups. In this paper, we introduce a novel unsupervised domain adaptation approach, MonoCT, that generates highly accurate pseudo labels for self-supervision. Inspired by our observation that accurate depth estimation is critical to mitigating domain shifts, MonoCT introduces a novel Generalized Depth Enhancement (GDE) module with an ensemble concept to improve depth estimation accuracy. Moreover, we introduce a novel Pseudo Label Scoring (PLS) module by exploring inner-model consistency measurement and a Diversity Maximization (DM) strategy to further generate high-quality pseudo labels for self-training. Extensive experiments on six benchmarks show that MonoCT outperforms existing SOTA domain adaptation methods by large margins (~21% minimum for AP Mod.) and generalizes well to car, traffic camera and drone views.

FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution

Ali Mollaahmadi Dehaghi, Hossein KhademSohi, Reza Razavi, Steve Drew, Mohammad Moshirpour

Task: Propose FedVSR, a new architecture-independent and stateless federated learning framework for Video Super-Resolution (VSR).

Motivation: Existing deep learning VSR methods raise serious privacy concerns, while existing federated learning methods perform poorly on low-level vision tasks.

Details

Method: Propose the FedVSR framework, which introduces a lightweight loss term that improves local optimization and guides global aggregation with minimal computational overhead.

Result: Experiments show that FedVSR outperforms general federated learning methods by an average of 0.85 dB in PSNR.

Conclusion: FedVSR is the first attempt at federated VSR and demonstrates its effectiveness on the video super-resolution task.

Abstract: Video Super-Resolution (VSR) reconstructs high-resolution videos from low-resolution inputs to restore fine details and improve visual clarity. While deep learning-based VSR methods achieve impressive results, their centralized nature raises serious privacy concerns, particularly in applications with strict privacy requirements. Federated Learning (FL) offers an alternative approach, but existing FL methods struggle with low-level vision tasks, leading to suboptimal reconstructions. To address this, we propose FedVSR, a novel, architecture-independent, and stateless FL framework for VSR. Our approach introduces a lightweight loss term that improves local optimization and guides global aggregation with minimal computational overhead. To the best of our knowledge, this is the first attempt at federated VSR. Extensive experiments show that FedVSR outperforms general FL methods by an average of 0.85 dB in PSNR, highlighting its effectiveness. The code is available at: https://github.com/alimd94/FedVSR
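To put the reported 0.85 dB PSNR gain in perspective, the sketch below uses the standard PSNR definition, 10 * log10(MAX^2 / MSE), and converts a dB gain into the implied change in mean squared error. The helper names are hypothetical, pixel values are assumed to lie in [0, 255], and the conversion holds exactly only for a single image; an average over many videos, as reported in the abstract, is only approximated by it.

```python
import math
from typing import Sequence


def psnr(
    reference: Sequence[float],
    estimate: Sequence[float],
    max_value: float = 255.0,
) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """MSE after the gain divided by MSE before it, for a PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)
```

For a single image, `mse_ratio_for_gain(0.85)` is roughly 0.82, meaning a 0.85 dB improvement corresponds to about an 18% lower mean squared error.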

Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception

Dingkang Liang,Dingyuan Zhang,Xin Zhou,Sifan Tu,Tianrui Feng,Xiaofan Li,Yumeng Zhang,Mingyang Du,Xiao Tan,Xiang Bai

Task: Present UniFuture, a driving world model that seamlessly integrates future scene generation and perception within a single framework.

Motivation: Existing models typically focus solely on pixel-level future prediction or on geometric reasoning, whereas UniFuture jointly models future appearance (i.e., RGB image) and geometry (i.e., depth) to ensure coherent predictions.

Details

Method: During training, a Dual-Latent Sharing scheme transfers image and depth sequences into a shared latent space so that both modalities benefit from shared feature learning. A Multi-scale Latent Interaction mechanism additionally enables bidirectional refinement between image and depth features at multiple spatial scales, enhancing geometric consistency and perceptual alignment.

Result: Extensive experiments on the nuScenes dataset show that UniFuture outperforms specialized models on future generation and perception tasks, highlighting the advantages of a unified, structurally-aware world model.

Conclusion: By jointly modeling future appearance and geometry, UniFuture can predict high-consistency future image-depth pairs using only the current image as input, and it performs strongly on both future scene generation and perception.

Abstract: We present UniFuture, a simple yet effective driving world model that seamlessly integrates future scene generation and perception within a single framework. Unlike existing models focusing solely on pixel-level future prediction or geometric reasoning, our approach jointly models future appearance (i.e., RGB image) and geometry (i.e., depth), ensuring coherent predictions. Specifically, during the training, we first introduce a Dual-Latent Sharing scheme, which transfers image and depth sequence in a shared latent space, allowing both modalities to benefit from shared feature learning. Additionally, we propose a Multi-scale Latent Interaction mechanism, which facilitates bidirectional refinement between image and depth features at multiple spatial scales, effectively enhancing geometry consistency and perceptual alignment. During testing, our UniFuture can easily predict high-consistency future image-depth pairs by only using the current image as input. Extensive experiments on the nuScenes dataset demonstrate that UniFuture outperforms specialized models on future generation and perception tasks, highlighting the advantages of a unified, structurally-aware world model. The project page is at https://github.com/dk-liang/UniFuture.

Let Synthetic Data Shine: Domain Reassembly and Soft-Fusion for Single Domain Generalization

Hao Li,Yubin Xiao,Ke Liang,Mengzhu Wang,Long Lan,Kenli Li,Xinwang Liu

Task: Propose Discriminative Domain Reassembly and Soft-Fusion (DRSF), a training framework that leverages synthetic data to improve model performance in Single Domain Generalization (SDG).

Motivation: Directly using synthetic data introduces feature distribution discrepancies; resolving them improves the consistency of model performance across diverse scenarios.

Details

Method: Uses latent diffusion models (LDMs) to generate diverse pseudo-target domain samples and introduces two key modules: Discriminative Feature Decoupling and Reassembly (DFDR) and Multi-pseudo-domain Soft Fusion (MDSF). DFDR recalibrates channel-level features with entropy-guided attention, suppressing synthetic noise while preserving semantic consistency. MDSF performs adversarial training with latent-space feature interpolation, creating continuous feature transitions between domains.

Result: Extensive experiments on object detection and semantic segmentation show that DRSF achieves substantial performance gains with only marginal computational overhead.

Conclusion: DRSF's plug-and-play architecture integrates seamlessly with unsupervised domain adaptation paradigms, underscoring its broad applicability to diverse, real-world domain challenges.

Abstract: Single Domain Generalization (SDG) aims to train models with consistent performance across diverse scenarios using data from a single source. While using latent diffusion models (LDMs) shows promise in augmenting limited source data, we demonstrate that directly using synthetic data can be detrimental due to significant feature distribution discrepancies between synthetic and real target domains, leading to performance degradation. To address this issue, we propose Discriminative Domain Reassembly and Soft-Fusion (DRSF), a training framework leveraging synthetic data to improve model generalization. We employ LDMs to produce diverse pseudo-target domain samples and introduce two key modules to handle distribution bias. First, Discriminative Feature Decoupling and Reassembly (DFDR) module uses entropy-guided attention to recalibrate channel-level features, suppressing synthetic noise while preserving semantic consistency. Second, Multi-pseudo-domain Soft Fusion (MDSF) module uses adversarial training with latent-space feature interpolation, creating continuous feature transitions between domains. Extensive SDG experiments on object detection and semantic segmentation tasks demonstrate that DRSF achieves substantial performance gains with only marginal computational overhead. Notably, DRSF's plug-and-play architecture enables seamless integration with unsupervised domain adaptation paradigms, underscoring its broad applicability in addressing diverse and real-world domain challenges.
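
The MDSF description mentions latent-space feature interpolation to create continuous transitions between domains. A minimal sketch of that interpolation step alone follows, assuming plain linear interpolation between two feature vectors. The adversarial training and the actual DRSF formulation are not reproduced, and all names and values are hypothetical.

```python
def interpolate_features(source_feat, pseudo_feat, alpha):
    """Linearly blend two equal-length feature vectors.

    alpha = 0 returns the source features, alpha = 1 the pseudo-domain features.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(source_feat) != len(pseudo_feat):
        raise ValueError("feature vectors must have the same length")
    return [(1.0 - alpha) * s + alpha * p for s, p in zip(source_feat, pseudo_feat)]


def feature_path(source_feat, pseudo_feat, steps):
    """Evenly spaced blends from source to pseudo-domain features (steps >= 2)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        interpolate_features(source_feat, pseudo_feat, i / (steps - 1))
        for i in range(steps)
    ]


source = [1.0, 0.0, 2.0]  # hypothetical source-domain features
pseudo = [3.0, 4.0, 0.0]  # hypothetical synthetic pseudo-domain features
for point in feature_path(source, pseudo, steps=3):
    print(point)
```

Sampling intermediate values of `alpha` yields features that lie between the two domains, which is one simple way to obtain the kind of continuous transition the abstract describes.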

Chiara Plizzari,Alessio Tonioni,Yongqin Xian,Achin Kulshrestha,Federico Tombari

Task: Evaluate temporal understanding in egocentric videos.

Motivation: Questions in current egocentric video question-answering datasets can often be answered from a few frames or from commonsense reasoning, without necessarily being grounded in the actual video content.

Details

Method: Introduces EgoTempo, a dataset specifically designed to evaluate temporal understanding in the egocentric domain, emphasizing tasks that require integrating information across the entire video.

Result: Experiments show that current multimodal large language models still fall short in temporal reasoning on egocentric videos.

Conclusion: EgoTempo is expected to catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics.

Abstract: Understanding fine-grained temporal dynamics is crucial in egocentric videos, where continuous streams capture frequent, close-up interactions with objects. In this work, we bring to light that current egocentric video question-answering datasets often include questions that can be answered using only a few frames or commonsense reasoning, without being necessarily grounded in the actual video. Our analysis shows that state-of-the-art Multi-Modal Large Language Models (MLLMs) on these benchmarks achieve remarkably high performance using just text or a single frame as input. To address these limitations, we introduce EgoTempo, a dataset specifically designed to evaluate temporal understanding in the egocentric domain. EgoTempo emphasizes tasks that require integrating information across the entire video, ensuring that models would need to rely on temporal patterns rather than static cues or pre-existing knowledge. Extensive experiments on EgoTempo show that current MLLMs still fall short in temporal reasoning on egocentric videos, and thus we hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics. Dataset and code are available at https://github.com/google-research-datasets/egotempo.git.

Web Artifact Attacks Disrupt Vision Language Models

Maan Qraitem,Piotr Teterwak,Kate Saenko,Bryan A. Plummer

Task: Propose and validate a new class of attacks on vision-language models based on non-matching text and graphical elements.

Motivation: During training, vision-language models learn unintended correlations between semantic concepts and unrelated visual signals, so their predictions rely on incidental patterns rather than genuine visual understanding.

Details

Method: Introduces artifact-based attacks, which mislead models with non-matching text and graphical elements, and frames finding such artifacts as a search problem.

Result: The attacks are validated on five datasets; some artifacts reinforce each other to reach attack success rates of 100%, and the attacks transfer across models with up to 90% effectiveness.

Conclusion: Extending artifact-aware prompting to the graphical setting moderately reduces attack success rates, suggesting a promising direction for enhancing model robustness.

Abstract: Vision-language models (VLMs) (e.g., CLIP, LLaVA) are trained on large-scale, lightly curated web datasets, leading them to learn unintended correlations between semantic concepts and unrelated visual signals. These associations degrade model accuracy by causing predictions to rely on incidental patterns rather than genuine visual understanding. Prior work has weaponized these correlations as an attack vector to manipulate model predictions, such as inserting a deceiving class text onto the image in a typographic attack. These attacks succeed due to VLMs' text-heavy bias, a result of captions that echo visible words rather than describing content. However, this attack has focused solely on text that matches the target class exactly, overlooking a broader range of correlations, including non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To address this gap, we introduce artifact-based attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements. Unlike typographic attacks, these artifacts are not predefined, making them harder to defend against but also more challenging to find. We address this by framing artifact attacks as a search problem and demonstrate their effectiveness across five datasets, with some artifacts reinforcing each other to reach 100% attack success rates. These attacks transfer across models with up to 90% effectiveness, making it possible to attack unseen models. To defend against these attacks, we extend prior work's artifact aware prompting to the graphical setting. We see a moderate reduction of success rates of up to 15% relative to standard prompts, suggesting a promising direction for enhancing model robustness.

FiVE: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models

Minghan Li,Chenxi Xie,Yichen Wu,Lei Zhang,Mengyu Wang

Task: Propose FiVE, a fine-grained video editing benchmark for evaluating emerging diffusion and rectified flow models.

Motivation: The lack of a standardized benchmark for fairly evaluating text-to-video (T2V) editing methods has led to inconsistent claims and an inability to assess model sensitivity to hyperparameters.

Details

Method: Introduces the FiVE benchmark, which includes 74 real-world videos and 26 generated videos, covering 6 fine-grained editing types and 420 object-level editing prompt pairs with their corresponding masks. In addition, the latest rectified flow (RF) T2V generation models, Pyramid-Flow and Wan2.1, are adapted by introducing FlowEdit, yielding the training-free and inversion-free video editing models Pyramid-Edit and Wan-Edit.

Result: Experiments show that RF-based editing significantly outperforms diffusion-based methods, with Wan-Edit achieving the best overall performance and the least sensitivity to hyperparameters.

Conclusion: The FiVE benchmark and the FlowEdit-based adaptations provide an effective evaluation framework for fine-grained video editing and demonstrate the advantages of RF-based methods.

Abstract: Numerous text-to-video (T2V) editing methods have emerged recently, but the lack of a standardized benchmark for fair evaluation has led to inconsistent claims and an inability to assess model sensitivity to hyperparameters. Fine-grained video editing is crucial for enabling precise, object-level modifications while maintaining context and temporal consistency. To address this, we introduce FiVE, a Fine-grained Video Editing Benchmark for evaluating emerging diffusion and rectified flow models. Our benchmark includes 74 real-world videos and 26 generated videos, featuring 6 fine-grained editing types, 420 object-level editing prompt pairs, and their corresponding masks. Additionally, we adapt the latest rectified flow (RF) T2V generation models, Pyramid-Flow and Wan2.1, by introducing FlowEdit, resulting in training-free and inversion-free video editing models Pyramid-Edit and Wan-Edit. We evaluate five diffusion-based and two RF-based editing methods on our FiVE benchmark using 15 metrics, covering background preservation, text-video similarity, temporal consistency, video quality, and runtime. To further enhance object-level evaluation, we introduce FiVE-Acc, a novel metric leveraging Vision-Language Models (VLMs) to assess the success of fine-grained video editing. Experimental results demonstrate that RF-based editing significantly outperforms diffusion-based methods, with Wan-Edit achieving the best overall performance and exhibiting the least sensitivity to hyperparameters. More video demo available on the anonymous website: https://sites.google.com/view/five-benchmark

Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds

Eitan Shaar,Ariel Shaulov,Gal Chechik,Lior Wolf

Task: Propose Audio-Visual Adaptive Video Analysis (AV²A), a model that requires no additional training, dynamically adjusts event distributions, and retains multimodal interactions.

Motivation: Existing approaches to audio-visual event perception suffer from a limited vocabulary, a labor-intensive annotation process, neglect of shifts in event distributions, and loss of multimodal interactions.

Details

Method: Proposes AV²A, which uses a score-level fusion technique and a within-video label shift algorithm to dynamically adjust event distributions while retaining multimodal interactions.

Result: Applied to both zero-shot and weakly-supervised existing methods, AV²A yields notable performance improvements.

Conclusion: Without any additional training, AV²A effectively improves audio-visual event perception and addresses the limitations of existing methods.

Abstract: In the domain of audio-visual event perception, which focuses on the temporal localization and classification of events across distinct modalities (audio and visual), existing approaches are constrained by the vocabulary available in their training data. This limitation significantly impedes their capacity to generalize to novel, unseen event categories. Furthermore, the annotation process for this task is labor-intensive, requiring extensive manual labeling across modalities and temporal segments, limiting the scalability of current methods. Current state-of-the-art models ignore the shifts in event distributions over time, reducing their ability to adjust to changing video dynamics. Additionally, previous methods rely on late fusion to combine audio and visual information. While straightforward, this approach results in a significant loss of multimodal interactions. To address these challenges, we propose Audio-Visual Adaptive Video Analysis (AV²A), a model-agnostic approach that requires no further training and integrates a score-level fusion technique to retain richer multimodal interactions. AV²A also includes a within-video label shift algorithm, leveraging input video data and predictions from prior frames to dynamically adjust event distributions for subsequent frames. Moreover, we present the first training-free, open-vocabulary baseline for audio-visual event perception, demonstrating that AV²A achieves substantial improvements over naive training-free baselines. We demonstrate the effectiveness of AV²A on both zero-shot and weakly-supervised state-of-the-art methods, achieving notable improvements in performance metrics over existing approaches.
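
The within-video label shift idea, using predictions from prior frames to adjust event distributions for subsequent frames, can be illustrated with a small sketch. It assumes a running prior updated by an exponential moving average and a multiplicative re-weighting of per-frame scores. This is an illustrative stand-in rather than the algorithm from the paper, and every name and constant is hypothetical.

```python
def normalize(weights):
    """Scale a dict of positive weights so the values sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {key: value / total for key, value in weights.items()}


def adapt_scores(frame_scores, momentum=0.8):
    """Re-weight per-frame event scores with a prior built from earlier frames.

    frame_scores: list of dicts mapping event name -> positive score.
    Returns one normalized distribution per frame.
    """
    events = list(frame_scores[0])
    prior = {event: 1.0 / len(events) for event in events}  # start uniform
    adjusted = []
    for scores in frame_scores:
        posterior = normalize({e: scores[e] * prior[e] for e in events})
        adjusted.append(posterior)
        # Move the prior toward what has been observed in this video so far.
        prior = normalize(
            {e: momentum * prior[e] + (1.0 - momentum) * posterior[e] for e in events}
        )
    return adjusted


# Hypothetical scores: frame 2 is ambiguous, frame 1 favored "dog".
frames = [{"dog": 0.6, "speech": 0.4}, {"dog": 0.5, "speech": 0.5}]
for distribution in adapt_scores(frames):
    print(distribution)
```

In this toy run the ambiguous second frame is nudged toward "dog" because earlier frames in the same video favored it.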

Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory

Saket Gurukar,Asim Kadav

Task: Propose Long-VMNet, a new long-form video understanding method for video retrieval, summarization, and question answering.

Motivation: Traditional approaches demand substantial computing resources and are bottlenecked by GPU memory.

Details

Method: Uses a fixed-size memory representation to store discriminative patches sampled from the input video, and relies on a neural sampler to identify discriminative tokens.

Result: On the Rest-ADL dataset, inference is 18x to 75x faster, with competitive predictive performance.

Conclusion: Long-VMNet substantially improves the efficiency of long-form video understanding while maintaining good predictive performance.

Abstract: Long-form video understanding is essential for various applications such as video retrieval, summarizing, and question answering. Yet, traditional approaches demand substantial computing power and are often bottlenecked by GPU memory. To tackle this issue, we present Long-Video Memory Network, Long-VMNet, a novel video understanding method that employs a fixed-size memory representation to store discriminative patches sampled from the input video. Long-VMNet achieves improved efficiency by leveraging a neural sampler that identifies discriminative tokens. Additionally, Long-VMNet only needs one scan through the video, greatly boosting efficiency. Our results on the Rest-ADL dataset demonstrate an 18x to 75x improvement in inference times for long-form video retrieval and answering questions, with a competitive predictive performance.
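
The fixed-memory idea can be sketched as a single pass that keeps only the highest-scoring tokens in a bounded buffer. In the sketch below the scores are supplied directly and stand in for the output of the neural sampler, which is not modeled here. The class and parameter names are hypothetical.

```python
import heapq


class FixedMemory:
    """Bounded store that keeps the top-scoring tokens seen in one scan."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._heap = []  # min-heap of (score, arrival_index, token)
        self._count = 0

    def add(self, token, score):
        # arrival_index is unique, so ties never fall through to comparing tokens.
        entry = (score, self._count, token)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict the weakest token

    def tokens(self):
        """Stored tokens in their original temporal order."""
        return [token for _, _, token in sorted(self._heap, key=lambda e: e[1])]


# Hypothetical stream of (token, sampler score) pairs.
stream = [("f0", 0.1), ("f1", 0.9), ("f2", 0.4), ("f3", 0.8), ("f4", 0.2)]
memory = FixedMemory(capacity=3)
for token, score in stream:
    memory.add(token, score)
print(memory.tokens())
```

Memory use stays constant regardless of video length, and each token is examined once, which matches the single-scan property described in the abstract.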

Improving Geometric Consistency for 360-Degree Neural Radiance Fields in Indoor Scenarios

Iryna Repinetska,Anna Hilsmann,Peter Eisert

Task: Propose an efficient and robust method for computing dense depth priors, tailored to large low-textured architectural surfaces in indoor environments.

Motivation: NeRFs often produce cloudy artifacts in large low-textured areas (such as walls, ceilings, and floors), which reduces scene realism. Existing methods struggle to estimate depth in textureless regions, especially in 360-degree "inside-out" views.

Details

Method: Introduces a novel depth loss function to enhance rendering quality in low-feature regions, and uses depth-patch regularization to further refine depth consistency in other areas.

Result: Experiments with Instant-NGP show improved visual fidelity on synthetic 360-degree indoor scenes compared with standard photometric loss and Mean Squared Error depth supervision.

Conclusion: The method effectively addresses NeRF rendering problems in large low-textured areas and improves the realism of indoor scenes.

Abstract: Photo-realistic rendering and novel view synthesis play a crucial role in human-computer interaction tasks, from gaming to path planning. Neural Radiance Fields (NeRFs) model scenes as continuous volumetric functions and achieve remarkable rendering quality. However, NeRFs often struggle in large, low-textured areas, producing cloudy artifacts known as "floaters" that reduce scene realism, especially in indoor environments with featureless architectural surfaces like walls, ceilings, and floors. To overcome this limitation, prior work has integrated geometric constraints into the NeRF pipeline, typically leveraging depth information derived from Structure from Motion or Multi-View Stereo. Yet, conventional RGB-feature correspondence methods face challenges in accurately estimating depth in textureless regions, leading to unreliable constraints. This challenge is further complicated in 360-degree "inside-out" views, where sparse visual overlap between adjacent images further hinders depth estimation. In order to address these issues, we propose an efficient and robust method for computing dense depth priors, specifically tailored for large low-textured architectural surfaces in indoor environments. We introduce a novel depth loss function to enhance rendering quality in these challenging, low-feature regions, while complementary depth-patch regularization further refines depth consistency across other areas. Experiments with Instant-NGP on two synthetic 360-degree indoor scenes demonstrate improved visual fidelity with our method compared to standard photometric loss and Mean Squared Error depth supervision.

SED-MVS: Segmentation-Driven and Edge-Aligned Deformation Multi-View Stereo with Depth Restoration and Occlusion Constraint

Zhenlong Yuan,Zhidong Yang,Yujun Cai,Kuangxin Wu,Mufan Liu,Dapeng Zhang,Hao Jiang,Zhaoxin Li,Zhaoqi Wang

Task: Propose SED-MVS, a new multi-view stereo method that uses panoptic segmentation and a multi-trajectory diffusion strategy to address deformation instability in textureless areas.

Motivation: Existing patch-deformation methods are effective at reconstructing textureless areas but neglect the deformation instability caused by edge-skipping, which can lead to matching distortions.

Details

Method: Uses SAM2 panoptic segmentation as depth-edge guidance, combined with a multi-trajectory diffusion strategy that keeps patches comprehensively aligned with depth edges, and initializes with sparse points from LoFTR and monocular depth maps from DepthAnything V2.

Result: Extensive experiments on the ETH3D, Tanks & Temples, BlendedMVS, and Strecha datasets validate the state-of-the-art performance and robust generalization of the proposed method.

Conclusion: Through panoptic segmentation and the multi-trajectory diffusion strategy, SED-MVS effectively resolves deformation instability in textureless areas and delivers strong multi-view stereo performance.

Abstract: Recently, patch-deformation methods have exhibited significant effectiveness in multi-view stereo owing to the deformable and expandable patches in reconstructing textureless areas. However, such methods primarily emphasize broadening the receptive field in textureless areas, while neglecting deformation instability caused by easily overlooked edge-skipping, potentially leading to matching distortions. To address this, we propose SED-MVS, which adopts panoptic segmentation and multi-trajectory diffusion strategy for segmentation-driven and edge-aligned patch deformation. Specifically, to prevent unanticipated edge-skipping, we first employ SAM2 for panoptic segmentation as depth-edge guidance to guide patch deformation, followed by multi-trajectory diffusion strategy to ensure patches are comprehensively aligned with depth edges. Moreover, to avoid potential inaccuracy of random initialization, we combine both sparse points from LoFTR and monocular depth map from DepthAnything V2 to restore reliable and realistic depth map for initialization and supervised guidance. Finally, we integrate segmentation image with monocular depth map to exploit inter-instance occlusion relationship, then further regard them as occlusion map to implement two distinct edge constraint, thereby facilitating occlusion-aware patch deformation. Extensive results on ETH3D, Tanks & Temples, BlendedMVS and Strecha datasets validate the state-of-the-art performance and robust generalization capability of our proposed method.

Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition

Shristi Das Biswas,Efstathia Soufleri,Arani Roy,Kaushik Roy

Task: Propose a compute-efficient approach to deep video representation learning that reduces inference cost by exploiting all modalities available in the compressed video domain (I-frames and P-frames).

Motivation: Existing methods ignore the temporal correlation and implicit sparsity across P-frames when processing video representations, which makes training and generalization harder. Revisiting the design of video understanding backbones can greatly increase inference speed while retaining performance.

Details

Method: Proposes a hybrid end-to-end framework comprising a dual-encoder scheme, a unified transformer model, and a Multi-Modal Mixer Block, to reduce inference cost and enhance contextual interactions between I-frames and P-frames.

Result: Achieves state-of-the-art video recognition performance on the UCF-101, HMDB-51, K-400, K-600, and SS-v2 datasets, with a 56x increase in inference speed and a 330x reduction in inference cost.

Conclusion: The approach offers new insights into practical design choices for efficient next-generation spatiotemporal learners; the code is available.

Abstract: Training robust deep video representations has proven to be computationally challenging due to substantial decoding overheads, the enormous size of raw video streams, and their inherent high temporal redundancy. Different from existing schemes, operating exclusively in the compressed video domain and exploiting all freely available modalities, i.e., I-frames, and P-frames (motion vectors and residuals) offers a compute-efficient alternative. Existing methods approach this task as a naive multi-modality problem, ignoring the temporal correlation and implicit sparsity across P-frames for modeling stronger shared representations for videos of the same action, making training and generalization easier. By revisiting the high-level design of dominant video understanding backbones, we increase inference speed by a factor of 56 while retaining similar performance. For this, we propose a hybrid end-to-end framework that factorizes learning across three key concepts to reduce inference cost by 330x versus prior art: First, a specially designed dual-encoder scheme with efficient Spiking Temporal Modulators to minimize latency while retaining cross-domain feature aggregation. Second, a unified transformer model to capture inter-modal dependencies using global self-attention to enhance I-frame/P-frame contextual interactions. Third, a Multi-Modal Mixer Block to model rich representations from the joint spatiotemporal token embeddings. Experiments show that our method results in a lightweight architecture achieving state-of-the-art video recognition performance on UCF-101, HMDB-51, K-400, K-600 and SS-v2 datasets with favorable costs (0.73 J/V) and fast inference (16 V/s). Our observations bring new insights into practical design choices for efficient next-generation spatiotemporal learners. Code is available.

TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark

Forouzan Fallah,Maitreya Patel,Agneet Chatterjee,Vlad I. Morariu,Chitta Baral,Yezhou Yang

Task: Evaluate the ability of diffusion models to embed text within images.

Motivation: Existing diffusion-based text-to-image models struggle to embed text accurately, and comprehensive benchmarks for evaluating this ability are lacking.

Details

Method: Introduces TextInVision, a large-scale benchmark driven by text and prompt complexity, for evaluating how effectively diffusion models integrate visual text into images.

Result: Analysis of multiple models identifies common errors, such as spelling inaccuracies and contextual mismatches.

Conclusion: The study lays the foundation for future advances in AI-generated multimodal content.

Abstract: Generating images with embedded text is crucial for the automatic production of visual and multimodal documents, such as educational materials and advertisements. However, existing diffusion-based text-to-image models often struggle to accurately embed text within images, facing challenges in spelling accuracy, contextual relevance, and visual coherence. Evaluating the ability of such models to embed text within a generated image is complicated due to the lack of comprehensive benchmarks. In this work, we introduce TextInVision, a large-scale, text and prompt complexity driven benchmark designed to evaluate the ability of diffusion models to effectively integrate visual text into images. We crafted a diverse set of prompts and texts that consider various attributes and text characteristics. Additionally, we prepared an image dataset to test Variational Autoencoder (VAE) models across different character representations, highlighting that VAE architectures can also pose challenges in text generation within diffusion frameworks. Through extensive analysis of multiple models, we identify common errors and highlight issues such as spelling inaccuracies and contextual mismatches. By pinpointing the failure points across different prompts and texts, our research lays the foundation for future advancements in AI-generated multimodal content.
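
The abstract names spelling accuracy as one failure mode but does not state how it is scored. One common way to quantify spelling errors, shown here purely as an assumption and not as the benchmark's metric, is the Levenshtein edit distance between the requested text and the text read back from the generated image. The function names and the normalization are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def spelling_score(target, rendered):
    """1.0 for an exact match, decreasing toward 0.0 as more edits are needed."""
    longest = max(len(target), len(rendered))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(target, rendered) / longest


# Hypothetical example: the prompt asked for "OPEN", the image shows "OPFN".
print(edit_distance("OPEN", "OPFN"))
print(spelling_score("OPEN", "OPFN"))
```

Normalizing by the longer string keeps the score comparable between short and long target texts.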

Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes

Keqi Chen,Vinkle Srivastav,Didier Mutter,Nicolas Padoy

Task: Propose Self-MVA, a self-supervised, uncalibrated multi-view person association approach for multi-view analysis of human activities.

Motivation: In challenging scenes where persons share similar appearances, person re-identification features become unreliable, so cross-view geometric constraints are needed for more robust association.

Details

Method: Proposes a self-supervised learning framework consisting of an encoder-decoder model and a self-supervised pretext task (cross-view image synchronization), trained with synchronization labels as supervision.

Result: Extensive experiments on three challenging public benchmark datasets (WILDTRACK, MVOR, and SOLDIERS) show that the approach achieves state-of-the-art results, surpassing existing unsupervised and fully-supervised approaches.

Conclusion: Without using any annotations, Self-MVA achieves effective multi-view person association through self-supervised learning and cross-view geometric constraints.

Abstract: Multi-view person association is a fundamental step towards multi-view analysis of human activities. Although the person re-identification features have been proven effective, they become unreliable in challenging scenes where persons share similar appearances. Therefore, cross-view geometric constraints are required for a more robust association. However, most existing approaches are either fully-supervised using ground-truth identity labels or require calibrated camera parameters that are hard to obtain. In this work, we investigate the potential of learning from synchronization, and propose a self-supervised uncalibrated multi-view person association approach, Self-MVA, without using any annotations. Specifically, we propose a self-supervised learning framework, consisting of an encoder-decoder model and a self-supervised pretext task, cross-view image synchronization, which aims to distinguish whether two images from different views are captured at the same time. The model encodes each person's unified geometric and appearance features, and we train it by utilizing synchronization labels for supervision after applying Hungarian matching to bridge the gap between instance-wise and image-wise distances. To further reduce the solution space, we propose two types of self-supervised linear constraints: multi-view re-projection and pairwise edge association. Extensive experiments on three challenging public benchmark datasets (WILDTRACK, MVOR, and SOLDIERS) show that our approach achieves state-of-the-art results, surpassing existing unsupervised and fully-supervised approaches. Code is available at https://github.com/CAMMA-public/Self-MVA.
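
The abstract applies Hungarian matching to bridge instance-wise and image-wise distances. The sketch below illustrates that step on a tiny example by solving the same minimum-cost assignment problem with brute-force enumeration instead of the Hungarian algorithm, which is adequate only for a handful of persons. Taking the mean matched distance as the image-wise distance is an assumption, and the distance values and function names are hypothetical.

```python
from itertools import permutations


def min_cost_assignment(cost):
    """Return (total_cost, assignment) for a non-empty square cost matrix.

    assignment[i] is the column (person in view B) matched to row i (person in view A).
    Brute force over all permutations: O(n!), for illustration only.
    """
    n = len(cost)
    if n == 0 or any(len(row) != n for row in cost):
        raise ValueError("cost must be a non-empty square matrix")
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


def image_distance(cost):
    """Mean matched instance distance, used here as an image-wise distance."""
    total, _ = min_cost_assignment(cost)
    return total / len(cost)


# Hypothetical instance-wise distances between 3 persons in two views.
dist = [[0.2, 0.9, 0.8],
        [0.7, 0.1, 0.9],
        [0.8, 0.6, 0.3]]
total, assignment = min_cost_assignment(dist)
print(assignment, round(image_distance(dist), 3))
```

A low image-wise distance suggests that the two views were captured at the same time, which is the signal a synchronization label can supervise.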

C2D-ISR: Optimizing Attention-based Image Super-resolution from Continuous to Discrete Scales

Yuxuan Jiang,Chengxi Zeng,Siyue Teng,Fan Zhang,Xiaoqing Zhu,Joel Sole,David Bull

Task: Optimize attention-based image super-resolution models in terms of both performance and complexity.

Motivation: Existing single image super-resolution methods rely on simple training strategies and on network architectures designed for discrete up-sampling scales, which limits their ability to capture information across multiple scales.

Details

Method: Proposes a new framework, C2D-ISR, that uses a two-stage training methodology and a hierarchical encoding mechanism to achieve continuous-scale training and cross-scale information aggregation.

Result: Evaluated on three efficient attention-based backbones (SwinIR-L, SRFormer-L, and MambaIRv2-L), C2D-ISR significantly outperforms the existing optimization framework HiT in super-resolution performance (up to 0.2 dB) and computational complexity reduction (up to 11%).

Conclusion: Through continuous-scale training and hierarchical encoding, C2D-ISR significantly improves both the performance and the efficiency of image super-resolution.

Abstract: In recent years, attention mechanisms have been exploited in single image super-resolution (SISR), achieving impressive reconstruction results. However, these advancements are still limited by the reliance on simple training strategies and network architectures designed for discrete up-sampling scales, which hinder the model's ability to effectively capture information across multiple scales. To address these limitations, we propose a novel framework, C2D-ISR, for optimizing attention-based image super-resolution models from both performance and complexity perspectives. Our approach is based on a two-stage training methodology and a hierarchical encoding mechanism. The new training methodology involves continuous-scale training for discrete scale models, enabling the learning of inter-scale correlations and multi-scale feature representation. In addition, we generalize the hierarchical encoding mechanism with existing attention-based network structures, which can achieve improved spatial feature fusion, cross-scale information aggregation, and more importantly, much faster inference. We have evaluated the C2D-ISR framework based on three efficient attention-based backbones, SwinIR-L, SRFormer-L and MambaIRv2-L, and demonstrated significant improvements over the other existing optimization framework, HiT, in terms of super-resolution performance (up to 0.2 dB) and computational complexity reduction (up to 11%). The source code will be made publicly available at www.github.com.

MonoCT: Overcoming Monocular 3D Detection Domain Shift with Consistent Teacher Models

Johannes Meier,Louis Inchingolo,Oussema Dhaouadi,Yan Xia,Jacques Kaiser,Daniel Cremers

Task: 解决单目3D物体检测在不同传感器、环境和相机设置下的问题。

Motivation: 准确的深度估计对于缓解领域偏移至关重要。

Details

Method: 提出了一种新颖的无监督领域适应方法MonoCT，包括广义深度增强（GDE）模块和伪标签评分（PLS）模块。 Result: 在六个基准测试中，MonoCT显著优于现有的最先进领域适应方法（AP Mod.至少提高21%），并且在汽车、交通摄像头和无人机视图中表现良好。 Conclusion: MonoCT通过改进深度估计和生成高质量伪标签，显著提升了单目3D物体检测的性能和泛化能力。 Abstract: We tackle the problem of monocular 3D object detection across different sensors, environments, and camera setups. In this paper, we introduce a novel unsupervised domain adaptation approach, MonoCT, that generates highly accurate pseudo labels for self-supervision. Inspired by our observation that accurate depth estimation is critical to mitigating domain shifts, MonoCT introduces a novel Generalized Depth Enhancement (GDE) module with an ensemble concept to improve depth estimation accuracy. Moreover, we introduce a novel Pseudo Label Scoring (PLS) module by exploring inner-model consistency measurement and a Diversity Maximization (DM) strategy to further generate high-quality pseudo labels for self-training. Extensive experiments on six benchmarks show that MonoCT outperforms existing SOTA domain adaptation methods by large margins (~21% minimum for AP Mod.) and generalizes well to car, traffic camera and drone views.

FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution

Ali Mollaahmadi Dehaghi,Hossein KhademSohi,Reza Razavi,Steve Drew,Mohammad Moshirpour

Task: 提出一种新的、独立于架构且无状态的联邦学习框架FedVSR，用于视频超分辨率。

Motivation: 现有的深度学习视频超分辨率方法存在隐私问题，而现有的联邦学习方法在低层次视觉任务上表现不佳。

Details

Method: 提出FedVSR框架，引入轻量级损失项以改进局部优化并指导全局聚合，计算开销最小。 Result: 实验表明，FedVSR在PSNR上平均比一般联邦学习方法高出0.85 dB。 Conclusion: FedVSR是第一个联邦视频超分辨率的尝试，展示了其有效性。 Abstract: Video Super-Resolution (VSR) reconstructs high-resolution videos from low-resolution inputs to restore fine details and improve visual clarity. While deep learning-based VSR methods achieve impressive results, their centralized nature raises serious privacy concerns, particularly in applications with strict privacy requirements. Federated Learning (FL) offers an alternative approach, but existing FL methods struggle with low-level vision tasks, leading to suboptimal reconstructions. To address this, we propose FedVSR1, a novel, architecture-independent, and stateless FL framework for VSR. Our approach introduces a lightweight loss term that improves local optimization and guides global aggregation with minimal computational overhead. To the best of our knowledge, this is the first attempt at federated VSR. Extensive experiments show that FedVSR outperforms general FL methods by an average of 0.85 dB in PSNR, highlighting its effectiveness. The code is available at: https://github.com/alimd94/FedVSR

Context-aware Multimodal AI Reveals Hidden Pathways in Five Centuries of Art Evolution

Jin Kim,Byunghwee Lee,Taekho You,Jinhyuk Yun

Task: Use generative AI to analyze latent-space information in Western paintings and reveal how artistic expression interacts with societal change.

Motivation: Explore the potential of generative AI for art analysis, in particular its ability to represent artworks in latent spaces.

Details

Method: Uses Stable Diffusion to analyze 500 years of Western paintings, extracting two types of latent information: formal aspects (e.g., colors) and contextual aspects (e.g., subject).

Result: Contextual information distinguishes artistic periods, styles, and individual artists more successfully than formal elements. Contextual keywords extracted from the paintings show how artistic expression evolves alongside societal changes.

Conclusion: By integrating temporal, cultural, and historical contexts, multimodal AI expands traditional formal analysis and highlights the mutual interaction between society and art.

Abstract: The rise of multimodal generative AI is transforming the intersection of technology and art, offering deeper insights into large-scale artwork. Although its creative capabilities have been widely explored, its potential to represent artwork in latent spaces remains underexamined. We use cutting-edge generative AI, specifically Stable Diffusion, to analyze 500 years of Western paintings by extracting two types of latent information with the model: formal aspects (e.g., colors) and contextual aspects (e.g., subject). Our findings reveal that contextual information differentiates between artistic periods, styles, and individual artists more successfully than formal elements. Additionally, using contextual keywords extracted from paintings, we show how artistic expression evolves alongside societal changes. Our generative experiment, infusing prospective contexts into historical artworks, successfully reproduces the evolutionary trajectory of artworks, highlighting the significance of mutual interaction between society and art. This study demonstrates how multimodal AI expands traditional formal analysis by integrating temporal, cultural, and historical contexts.

A Comprehensive Survey on Visual Concept Mining in Text-to-image Diffusion Models

Ziqiang Li,Jun Li,Lizhi Xiong,Zhangjie Fu,Zechao Li

Task: Categorize and systematically explore Visual Concept Mining (VCM) techniques.

Motivation: Text-to-image diffusion models have made significant progress in generating high-quality, diverse images, but the inherent limitations of textual signals often prevent them from fully capturing specific concepts, which reduces their controllability.

Details

Method: Categorizes existing research into four key areas: Concept Learning, Concept Erasing, Concept Decomposition, and Concept Combination.

Result: Provides insights into the foundational principles of VCM techniques and proposes future research directions.

Conclusion: VCM is an important and interesting field; future research should focus on addressing the identified challenges and advancing the field.

Abstract: Text-to-image diffusion models have made significant advancements in generating high-quality, diverse images from text prompts. However, the inherent limitations of textual signals often prevent these models from fully capturing specific concepts, thereby reducing their controllability. To address this issue, several approaches have incorporated personalization techniques, utilizing reference images to mine visual concept representations that complement textual inputs and enhance the controllability of text-to-image diffusion models. Despite these advances, a comprehensive, systematic exploration of visual concept mining remains limited. In this paper, we categorize existing research into four key areas: Concept Learning, Concept Erasing, Concept Decomposition, and Concept Combination. This classification provides valuable insights into the foundational principles of Visual Concept Mining (VCM) techniques. Additionally, we identify key challenges and propose future research directions to propel this important and interesting field forward.

Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception

Dingkang Liang,Dingyuan Zhang,Xin Zhou,Sifan Tu,Tianrui Feng,Xiaofan Li,Yumeng Zhang,Mingyang Du,Xiao Tan,Xiang Bai

Task: Propose UniFuture, a simple yet effective driving world model that seamlessly integrates future scene generation and perception within a single framework.

Motivation: Existing models focus solely on either pixel-level future prediction or geometric reasoning, whereas UniFuture jointly models future appearance (RGB images) and geometry (depth) to ensure coherent predictions.

Details

Method: During training, introduces a Dual-Latent Sharing scheme that maps image and depth sequences into a shared latent space, and a Multi-scale Latent Interaction mechanism that enables bidirectional refinement between image and depth features at multiple spatial scales.

Result: Extensive experiments on the nuScenes dataset show that UniFuture outperforms specialized models on future generation and perception tasks.

Conclusion: UniFuture demonstrates the advantages of a unified, structure-aware world model.

Abstract: We present UniFuture, a simple yet effective driving world model that seamlessly integrates future scene generation and perception within a single framework. Unlike existing models focusing solely on pixel-level future prediction or geometric reasoning, our approach jointly models future appearance (i.e., RGB image) and geometry (i.e., depth), ensuring coherent predictions. Specifically, during the training, we first introduce a Dual-Latent Sharing scheme, which transfers image and depth sequence in a shared latent space, allowing both modalities to benefit from shared feature learning. Additionally, we propose a Multi-scale Latent Interaction mechanism, which facilitates bidirectional refinement between image and depth features at multiple spatial scales, effectively enhancing geometry consistency and perceptual alignment. During testing, our UniFuture can easily predict high-consistency future image-depth pairs by only using the current image as input. Extensive experiments on the nuScenes dataset demonstrate that UniFuture outperforms specialized models on future generation and perception tasks, highlighting the advantages of a unified, structurally-aware world model. The project page is at https://github.com/dk-liang/UniFuture.

Let Synthetic Data Shine: Domain Reassembly and Soft-Fusion for Single Domain Generalization

Hao Li,Yubin Xiao,Ke Liang,Mengzhu Wang,Long Lan,Kenli Li,Xinwang Liu

Task: Propose Discriminative Domain Reassembly and Soft-Fusion (DRSF), a training framework that leverages synthetic data to improve model generalization in Single Domain Generalization (SDG).

Motivation: Using synthetic data directly can introduce feature distribution discrepancies that degrade model performance; DRSF is proposed to address this.

Details

Method: DRSF comprises two key modules: (1) a Discriminative Feature Decoupling and Reassembly (DFDR) module, which uses entropy-guided attention to recalibrate channel-level features; and (2) a Multi-pseudo-domain Soft Fusion (MDSF) module, which uses adversarial training with latent-space feature interpolation to create continuous feature transitions.

Result: Extensive experiments on object detection and semantic segmentation show that DRSF achieves substantial performance gains with only marginal computational overhead.

Conclusion: DRSF's plug-and-play architecture integrates seamlessly with unsupervised domain adaptation paradigms, demonstrating broad applicability to diverse, real-world domain challenges.

Abstract: Single Domain Generalization (SDG) aims to train models with consistent performance across diverse scenarios using data from a single source. While using latent diffusion models (LDMs) show promise in augmenting limited source data, we demonstrate that directly using synthetic data can be detrimental due to significant feature distribution discrepancies between synthetic and real target domains, leading to performance degradation. To address this issue, we propose Discriminative Domain Reassembly and Soft-Fusion (DRSF), a training framework leveraging synthetic data to improve model generalization. We employ LDMs to produce diverse pseudo-target domain samples and introduce two key modules to handle distribution bias. First, Discriminative Feature Decoupling and Reassembly (DFDR) module uses entropy-guided attention to recalibrate channel-level features, suppressing synthetic noise while preserving semantic consistency. Second, Multi-pseudo-domain Soft Fusion (MDSF) module uses adversarial training with latent-space feature interpolation, creating continuous feature transitions between domains. Extensive SDG experiments on object detection and semantic segmentation tasks demonstrate that DRSF achieves substantial performance gains with only marginal computational overhead. Notably, DRSF's plug-and-play architecture enables seamless integration with unsupervised domain adaptation paradigms, underscoring its broad applicability in addressing diverse and real-world domain challenges.

Chiara Plizzari,Alessio Tonioni,Yongqin Xian,Achin Kulshrestha,Federico Tombari

Task: Design EgoTempo, a dataset for evaluating temporal understanding in egocentric videos.

Motivation: Questions in current egocentric video question-answering datasets can often be answered from a few frames or commonsense reasoning, without necessarily being grounded in the actual video.

Details

Method: Introduces EgoTempo, which emphasizes tasks requiring integration of information across the entire video, so that models must rely on temporal patterns rather than static cues or pre-existing knowledge.

Result: Experiments show that current multi-modal large language models still fall short in temporal reasoning on egocentric videos.

Conclusion: EgoTempo is expected to catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics.

Abstract: Understanding fine-grained temporal dynamics is crucial in egocentric videos, where continuous streams capture frequent, close-up interactions with objects. In this work, we bring to light that current egocentric video question-answering datasets often include questions that can be answered using only few frames or commonsense reasoning, without being necessarily grounded in the actual video. Our analysis shows that state-of-the-art Multi-Modal Large Language Models (MLLMs) on these benchmarks achieve remarkably high performance using just text or a single frame as input. To address these limitations, we introduce EgoTempo, a dataset specifically designed to evaluate temporal understanding in the egocentric domain. EgoTempo emphasizes tasks that require integrating information across the entire video, ensuring that models would need to rely on temporal patterns rather than static cues or pre-existing knowledge. Extensive experiments on EgoTempo show that current MLLMs still fall short in temporal reasoning on egocentric videos, and thus we hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics. Dataset and code are available at https://github.com/google-research-datasets/egotempo.git.

Web Artifact Attacks Disrupt Vision Language Models

Maan Qraitem,Piotr Teterwak,Kate Saenko,Bryan A. Plummer

Task: Study unintended correlations in vision-language models (VLMs) and their effect on model accuracy, and propose a new class of attacks: artifact-based attacks.

Motivation: VLMs are trained on large-scale, lightly curated web datasets, so they learn unintended correlations between semantic concepts and unrelated visual signals, and these correlations degrade model accuracy.

Details

Method: Introduces artifact-based attacks, which mislead models using non-matching text and graphical elements, and frames the discovery of such artifacts as a search problem.

Result: The attacks are effective across five datasets, with some artifacts reinforcing each other to reach 100% attack success rates. They transfer across models with up to 90% effectiveness.

Conclusion: Extending prior artifact-aware prompting to the graphical setting moderately reduces attack success rates, suggesting a promising direction for improving model robustness.

Abstract: Vision-language models (VLMs) (e.g., CLIP, LLaVA) are trained on large-scale, lightly curated web datasets, leading them to learn unintended correlations between semantic concepts and unrelated visual signals. These associations degrade model accuracy by causing predictions to rely on incidental patterns rather than genuine visual understanding. Prior work has weaponized these correlations as an attack vector to manipulate model predictions, such as inserting a deceiving class text onto the image in a typographic attack. These attacks succeed due to VLMs' text-heavy bias, a result of captions that echo visible words rather than describing content. However, this attack has focused solely on text that matches the target class exactly, overlooking a broader range of correlations, including non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To address this gap, we introduce artifact-based attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements. Unlike typographic attacks, these artifacts are not predefined, making them harder to defend against but also more challenging to find. We address this by framing artifact attacks as a search problem and demonstrate their effectiveness across five datasets, with some artifacts reinforcing each other to reach 100% attack success rates. These attacks transfer across models with up to 90% effectiveness, making it possible to attack unseen models. To defend against these attacks, we extend prior work's artifact aware prompting to the graphical setting. We see a moderate reduction of success rates of up to 15% relative to standard prompts, suggesting a promising direction for enhancing model robustness.

FiVE: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models

Minghan Li,Chenxi Xie,Yichen Wu,Lei Zhang,Mengyu Wang

Task: Propose FiVE, a benchmark for evaluating fine-grained video editing methods.

Motivation: The lack of a standardized benchmark for fairly evaluating text-to-video (T2V) editing methods has led to inconsistent claims and an inability to assess model sensitivity to hyperparameters.

Details

Method: Introduces the FiVE benchmark, comprising 74 real-world videos and 26 generated videos, 6 fine-grained editing types, and 420 object-level editing prompt pairs. Also adapts the latest rectified flow (RF) T2V generation models, Pyramid-Flow and Wan2.1, by introducing FlowEdit, yielding the training-free and inversion-free video editing models Pyramid-Edit and Wan-Edit.

Result: RF-based editing significantly outperforms diffusion-based methods; Wan-Edit achieves the best overall performance and shows the least sensitivity to hyperparameters.

Conclusion: The FiVE benchmark and FlowEdit provide effective tools for evaluating fine-grained video editing and demonstrate the advantage of RF-based methods in this area.

Abstract: Numerous text-to-video (T2V) editing methods have emerged recently, but the lack of a standardized benchmark for fair evaluation has led to inconsistent claims and an inability to assess model sensitivity to hyperparameters. Fine-grained video editing is crucial for enabling precise, object-level modifications while maintaining context and temporal consistency. To address this, we introduce FiVE, a Fine-grained Video Editing Benchmark for evaluating emerging diffusion and rectified flow models. Our benchmark includes 74 real-world videos and 26 generated videos, featuring 6 fine-grained editing types, 420 object-level editing prompt pairs, and their corresponding masks. Additionally, we adapt the latest rectified flow (RF) T2V generation models, Pyramid-Flow and Wan2.1, by introducing FlowEdit, resulting in training-free and inversion-free video editing models Pyramid-Edit and Wan-Edit. We evaluate five diffusion-based and two RF-based editing methods on our FiVE benchmark using 15 metrics, covering background preservation, text-video similarity, temporal consistency, video quality, and runtime. To further enhance object-level evaluation, we introduce FiVE-Acc, a novel metric leveraging Vision-Language Models (VLMs) to assess the success of fine-grained video editing. Experimental results demonstrate that RF-based editing significantly outperforms diffusion-based methods, with Wan-Edit achieving the best overall performance and exhibiting the least sensitivity to hyperparameters. More video demo available on the anonymous website: https://sites.google.com/view/five-benchmark

Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds

Eitan Shaar,Ariel Shaulov,Gal Chechik,Lior Wolf

Task: Propose AV²A, a model-agnostic, training-free approach to audio-visual event perception that addresses the limitations of existing methods in generalizing to new event categories and in cross-modal interaction.

Motivation: Existing methods are constrained by a fixed vocabulary, require labor-intensive annotation, ignore shifts in event distributions over time, and lose cross-modal interactions, all of which limit their generalization and scalability.

Details

Method: AV²A uses a score-level fusion technique to retain multimodal interactions and a within-video label shift algorithm to dynamically adjust event distributions.

Result: Applied to existing zero-shot and weakly-supervised methods, AV²A yields notable performance improvements.

Conclusion: Without any further training, AV²A significantly improves audio-visual event perception, particularly in generalization and cross-modal interaction.

Abstract: In the domain of audio-visual event perception, which focuses on the temporal localization and classification of events across distinct modalities (audio and visual), existing approaches are constrained by the vocabulary available in their training data. This limitation significantly impedes their capacity to generalize to novel, unseen event categories. Furthermore, the annotation process for this task is labor-intensive, requiring extensive manual labeling across modalities and temporal segments, limiting the scalability of current methods. Current state-of-the-art models ignore the shifts in event distributions over time, reducing their ability to adjust to changing video dynamics. Additionally, previous methods rely on late fusion to combine audio and visual information. While straightforward, this approach results in a significant loss of multimodal interactions. To address these challenges, we propose Audio-Visual Adaptive Video Analysis ($\text{AV}^2\text{A}$), a model-agnostic approach that requires no further training and integrates a score-level fusion technique to retain richer multimodal interactions. $\text{AV}^2\text{A}$ also includes a within-video label shift algorithm, leveraging input video data and predictions from prior frames to dynamically adjust event distributions for subsequent frames. Moreover, we present the first training-free, open-vocabulary baseline for audio-visual event perception, demonstrating that $\text{AV}^2\text{A}$ achieves substantial improvements over naive training-free baselines. We demonstrate the effectiveness of $\text{AV}^2\text{A}$ on both zero-shot and weakly-supervised state-of-the-art methods, achieving notable improvements in performance metrics over existing approaches.

Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory

Saket Gurukar,Asim Kadav

Task: Propose Long-VMNet, a new long-form video understanding method for video retrieval, summarization, and question answering.

Motivation: Traditional approaches demand substantial computing resources and are bottlenecked by GPU memory.

Details

Method: Uses a fixed-size memory representation to store discriminative patches sampled from the input video, with a neural sampler that identifies discriminative tokens.

Result: On the Rest-ADL dataset, inference is 18x to 75x faster, with competitive predictive performance.

Conclusion: Long-VMNet substantially improves the efficiency of long-form video understanding while maintaining good predictive performance.

Abstract: Long-form video understanding is essential for various applications such as video retrieval, summarizing, and question answering. Yet, traditional approaches demand substantial computing power and are often bottlenecked by GPU memory. To tackle this issue, we present Long-Video Memory Network, Long-VMNet, a novel video understanding method that employs a fixed-size memory representation to store discriminative patches sampled from the input video. Long-VMNet achieves improved efficiency by leveraging a neural sampler that identifies discriminative tokens. Additionally, Long-VMNet only needs one scan through the video, greatly boosting efficiency. Our results on the Rest-ADL dataset demonstrate an 18x -- 75x improvement in inference times for long-form video retrieval and answering questions, with a competitive predictive performance.
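The idea of a fixed-size memory filled in a single scan can be illustrated with a small sketch. This is not the Long-VMNet implementation: the abstract does not describe how the neural sampler scores tokens, so the scores below are assumed to come from some hypothetical scorer, and `fill_fixed_memory` is an illustrative name. The sketch simply keeps the highest-scoring patches in a bounded min-heap, so memory use stays constant regardless of video length.

```python
import heapq


def fill_fixed_memory(scored_patches, capacity):
    """Keep the `capacity` highest-scoring patches from a single pass.

    `scored_patches` is an iterable of (score, patch_id) pairs; the scores
    are assumed to come from a hypothetical scorer standing in for a
    learned sampler.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    heap = []  # min-heap: the weakest retained patch sits at heap[0]
    for score, patch_id in scored_patches:
        if len(heap) < capacity:
            heapq.heappush(heap, (score, patch_id))
        elif score > heap[0][0]:
            # Evict the weakest patch so the memory never grows.
            heapq.heapreplace(heap, (score, patch_id))
    return sorted(heap, reverse=True)


stream = [(0.1, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d")]
memory = fill_fixed_memory(stream, capacity=2)
print(memory)
```

Each incoming patch costs O(log capacity) time, and the video is read exactly once, which mirrors the single-scan property the abstract highlights.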

Improving Geometric Consistency for 360-Degree Neural Radiance Fields in Indoor Scenarios

Iryna Repinetska,Anna Hilsmann,Peter Eisert

Task: Propose an efficient and robust method for computing dense depth priors, tailored to large low-textured architectural surfaces in indoor environments.

Motivation: NeRFs often produce cloudy artifacts known as "floaters" in large, low-textured areas such as walls, ceilings, and floors, which reduces scene realism. Existing depth estimation methods struggle to estimate depth accurately in textureless regions, leading to unreliable constraints.

Details

Method: Introduces a novel depth loss function to improve rendering quality in these challenging low-feature regions, with complementary depth-patch regularization to refine depth consistency in other areas.

Result: Experiments with Instant-NGP show improved visual fidelity on synthetic 360-degree indoor scenes compared with standard photometric loss and Mean Squared Error depth supervision.

Conclusion: The method effectively addresses NeRF rendering problems in large low-textured regions and improves the visual quality of indoor scenes.

Abstract: Photo-realistic rendering and novel view synthesis play a crucial role in human-computer interaction tasks, from gaming to path planning. Neural Radiance Fields (NeRFs) model scenes as continuous volumetric functions and achieve remarkable rendering quality. However, NeRFs often struggle in large, low-textured areas, producing cloudy artifacts known as "floaters" that reduce scene realism, especially in indoor environments with featureless architectural surfaces like walls, ceilings, and floors. To overcome this limitation, prior work has integrated geometric constraints into the NeRF pipeline, typically leveraging depth information derived from Structure from Motion or Multi-View Stereo. Yet, conventional RGB-feature correspondence methods face challenges in accurately estimating depth in textureless regions, leading to unreliable constraints. This challenge is further complicated in 360-degree "inside-out" views, where sparse visual overlap between adjacent images further hinders depth estimation. In order to address these issues, we propose an efficient and robust method for computing dense depth priors, specifically tailored for large low-textured architectural surfaces in indoor environments. We introduce a novel depth loss function to enhance rendering quality in these challenging, low-feature regions, while complementary depth-patch regularization further refines depth consistency across other areas. Experiments with Instant-NGP on two synthetic 360-degree indoor scenes demonstrate improved visual fidelity with our method compared to standard photometric loss and Mean Squared Error depth supervision.

SED-MVS: Segmentation-Driven and Edge-Aligned Deformation Multi-View Stereo with Depth Restoration and Occlusion Constraint

Zhenlong Yuan,Zhidong Yang,Yujun Cai,Kuangxin Wu,Mufan Liu,Dapeng Zhang,Hao Jiang,Zhaoxin Li,Zhaoqi Wang

Task: Propose SED-MVS, a new multi-view stereo method that uses panoptic segmentation and a multi-trajectory diffusion strategy to address deformation instability in textureless areas.

Motivation: Existing patch-deformation methods perform well at reconstructing textureless areas but neglect the deformation instability caused by edge-skipping, which can lead to matching distortions.

Details

Method: Adopts panoptic segmentation and a multi-trajectory diffusion strategy, using SAM2 for depth-edge-guided patch deformation, and combines sparse points from LoFTR with monocular depth maps from DepthAnything V2 for initialization and supervised guidance.

Result: Extensive experiments on the ETH3D, Tanks & Temples, BlendedMVS, and Strecha datasets validate the method's state-of-the-art performance and robust generalization.

Conclusion: Through panoptic segmentation and multi-trajectory diffusion, SED-MVS effectively addresses deformation instability in textureless areas and achieves strong multi-view stereo performance.

Abstract: Recently, patch-deformation methods have exhibited significant effectiveness in multi-view stereo owing to the deformable and expandable patches in reconstructing textureless areas. However, such methods primarily emphasize broadening the receptive field in textureless areas, while neglecting deformation instability caused by easily overlooked edge-skipping, potentially leading to matching distortions. To address this, we propose SED-MVS, which adopts panoptic segmentation and multi-trajectory diffusion strategy for segmentation-driven and edge-aligned patch deformation. Specifically, to prevent unanticipated edge-skipping, we first employ SAM2 for panoptic segmentation as depth-edge guidance to guide patch deformation, followed by multi-trajectory diffusion strategy to ensure patches are comprehensively aligned with depth edges. Moreover, to avoid potential inaccuracy of random initialization, we combine both sparse points from LoFTR and monocular depth map from DepthAnything V2 to restore reliable and realistic depth map for initialization and supervised guidance. Finally, we integrate segmentation image with monocular depth map to exploit inter-instance occlusion relationship, then further regard them as occlusion map to implement two distinct edge constraint, thereby facilitating occlusion-aware patch deformation. Extensive results on ETH3D, Tanks & Temples, BlendedMVS and Strecha datasets validate the state-of-the-art performance and robust generalization capability of our proposed method.

Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition

Shristi Das Biswas,Efstathia Soufleri,Arani Roy,Kaushik Roy

Task: Propose a compute-efficient video representation learning method that reduces inference cost and increases inference speed.

Motivation: Existing video representation learning methods suffer from high computational overhead, very large video streams, and high temporal redundancy, so a more efficient approach is needed.

Details

Method: Proposes a hybrid end-to-end framework comprising a dual-encoder scheme, a unified transformer model, and a Multi-Modal Mixer Block to reduce inference cost and increase inference speed.

Result: Achieves state-of-the-art video recognition performance on the UCF-101, HMDB-51, K-400, K-600, and SS-v2 datasets, with a 56x increase in inference speed and a 330x reduction in inference cost.

Conclusion: The work offers new insights for designing efficient next-generation spatiotemporal learners; the code is available.

Abstract: Training robust deep video representations has proven to be computationally challenging due to substantial decoding overheads, the enormous size of raw video streams, and their inherent high temporal redundancy. Different from existing schemes, operating exclusively in the compressed video domain and exploiting all freely available modalities, i.e., I-frames, and P-frames (motion vectors and residuals) offers a compute-efficient alternative. Existing methods approach this task as a naive multi-modality problem, ignoring the temporal correlation and implicit sparsity across P-frames for modeling stronger shared representations for videos of the same action, making training and generalization easier. By revisiting the high-level design of dominant video understanding backbones, we increase inference speed by a factor of $56$ while retaining similar performance. For this, we propose a hybrid end-to-end framework that factorizes learning across three key concepts to reduce inference cost by $330\times$ versus prior art: First, a specially designed dual-encoder scheme with efficient Spiking Temporal Modulators to minimize latency while retaining cross-domain feature aggregation. Second, a unified transformer model to capture inter-modal dependencies using global self-attention to enhance I-frame -- P-frame contextual interactions. Third, a Multi-Modal Mixer Block to model rich representations from the joint spatiotemporal token embeddings. Experiments show that our method results in a lightweight architecture achieving state-of-the-art video recognition performance on UCF-101, HMDB-51, K-400, K-600 and SS-v2 datasets with favorable costs ($0.73$J/V) and fast inference ($16$V/s). Our observations bring new insights into practical design choices for efficient next-generation spatiotemporal learners. Code is available.

TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark

Forouzan Fallah,Maitreya Patel,Agneet Chatterjee,Vlad I. Morariu,Chitta Baral,Yezhou Yang

Task: Evaluate the ability of diffusion models to embed text within images.

Motivation: Existing diffusion-based text-to-image models struggle to embed text accurately, and comprehensive benchmarks are lacking.

Details

Method: Introduces the TextInVision benchmark, crafts a diverse set of prompts and texts, and prepares an image dataset to test VAE models.

Result: Analysis of multiple models identifies common errors such as spelling inaccuracies and contextual mismatches.

Conclusion: The research lays the foundation for future advances in AI-generated multimodal content.

Abstract: Generating images with embedded text is crucial for the automatic production of visual and multimodal documents, such as educational materials and advertisements. However, existing diffusion-based text-to-image models often struggle to accurately embed text within images, facing challenges in spelling accuracy, contextual relevance, and visual coherence. Evaluating the ability of such models to embed text within a generated image is complicated due to the lack of comprehensive benchmarks. In this work, we introduce TextInVision, a large-scale, text and prompt complexity driven benchmark designed to evaluate the ability of diffusion models to effectively integrate visual text into images. We crafted a diverse set of prompts and texts that consider various attributes and text characteristics. Additionally, we prepared an image dataset to test Variational Autoencoder (VAE) models across different character representations, highlighting that VAE architectures can also pose challenges in text generation within diffusion frameworks. Through extensive analysis of multiple models, we identify common errors and highlight issues such as spelling inaccuracies and contextual mismatches. By pinpointing the failure points across different prompts and texts, our research lays the foundation for future advancements in AI-generated multimodal content.

Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes

Keqi Chen,Vinkle Srivastav,Didier Mutter,Nicolas Padoy

Task: Propose Self-MVA, a self-supervised, uncalibrated multi-view person association approach that uses no annotations.

Motivation: In multi-view person association, re-identification features become unreliable when persons share similar appearances, so cross-view geometric constraints are needed for robustness. Most existing approaches require ground-truth identity labels or calibrated camera parameters, which are hard to obtain.

Details

Method: Proposes a self-supervised learning framework consisting of an encoder-decoder model and a self-supervised pretext task (cross-view image synchronization), applying Hungarian matching to bridge the gap between instance-wise and image-wise distances. Two types of self-supervised linear constraints (multi-view re-projection and pairwise edge association) further reduce the solution space.

Result: Experiments on three challenging public benchmark datasets (WILDTRACK, MVOR, and SOLDIERS) show state-of-the-art performance, surpassing existing unsupervised and fully-supervised approaches.

Conclusion: Self-MVA achieves robust and accurate multi-view person association through self-supervised learning, without annotations or calibrated camera parameters.

Abstract: Multi-view person association is a fundamental step towards multi-view analysis of human activities. Although the person re-identification features have been proven effective, they become unreliable in challenging scenes where persons share similar appearances. Therefore, cross-view geometric constraints are required for a more robust association. However, most existing approaches are either fully-supervised using ground-truth identity labels or require calibrated camera parameters that are hard to obtain. In this work, we investigate the potential of learning from synchronization, and propose a self-supervised uncalibrated multi-view person association approach, Self-MVA, without using any annotations. Specifically, we propose a self-supervised learning framework, consisting of an encoder-decoder model and a self-supervised pretext task, cross-view image synchronization, which aims to distinguish whether two images from different views are captured at the same time. The model encodes each person's unified geometric and appearance features, and we train it by utilizing synchronization labels for supervision after applying Hungarian matching to bridge the gap between instance-wise and image-wise distances. To further reduce the solution space, we propose two types of self-supervised linear constraints: multi-view re-projection and pairwise edge association. Extensive experiments on three challenging public benchmark datasets (WILDTRACK, MVOR, and SOLDIERS) show that our approach achieves state-of-the-art results, surpassing existing unsupervised and fully-supervised approaches. Code is available at https://github.com/CAMMA-public/Self-MVA.
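Hungarian matching solves the minimum-cost assignment problem: given pairwise distances between persons seen in two views, pick the one-to-one pairing with the smallest total distance. The sketch below solves that same problem by brute force over permutations, which is only practical for a handful of persons and is not the Hungarian algorithm itself. The cost matrix and the function name `min_cost_assignment` are made up for illustration; how Self-MVA turns the matched instance distances into an image-wise distance is not specified in the abstract and is not modeled here.

```python
from itertools import permutations


def min_cost_assignment(cost):
    """Return (total_cost, assignment) for a square cost matrix.

    assignment[i] is the column (person in view B) matched to row i
    (person in view A). Brute force: O(n!) time, for tiny n only.
    """
    n = len(cost)
    if any(len(row) != n for row in cost):
        raise ValueError("cost matrix must be square")
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


# Hypothetical instance-wise distances between 3 persons in two views.
distances = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
print(min_cost_assignment(distances))
```

For realistic numbers of persons, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the usual choice.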

C2D-ISR: Optimizing Attention-based Image Super-resolution from Continuous to Discrete Scales

Yuxuan Jiang,Chengxi Zeng,Siyue Teng,Fan Zhang,Xiaoqing Zhu,Joel Sole,David Bull

Task: Propose C2D-ISR, a new framework for optimizing attention-based image super-resolution models from both performance and complexity perspectives.

Motivation: Existing single image super-resolution (SISR) methods are limited by their reliance on simple training strategies and on network architectures designed for discrete up-sampling scales, which hinders their ability to capture information across multiple scales.

Details

Method: Proposes C2D-ISR, a framework based on a two-stage training methodology and a hierarchical encoding mechanism. The training methodology applies continuous-scale training to discrete-scale models so that they learn inter-scale correlations and multi-scale feature representations. The hierarchical encoding mechanism is combined with existing attention-based network structures to improve spatial feature fusion and cross-scale information aggregation and to speed up inference.

Result: Evaluated on three efficient attention-based backbones (SwinIR-L, SRFormer-L, and MambaIRv2-L), C2D-ISR shows significant improvements over the existing optimization framework HiT in super-resolution performance (up to 0.2dB) and computational complexity reduction (up to 11%).

Conclusion: Through its two-stage training methodology and hierarchical encoding mechanism, C2D-ISR significantly improves the performance and efficiency of attention-based image super-resolution models.

Abstract: In recent years, attention mechanisms have been exploited in single image super-resolution (SISR), achieving impressive reconstruction results. However, these advancements are still limited by the reliance on simple training strategies and network architectures designed for discrete up-sampling scales, which hinder the model's ability to effectively capture information across multiple scales. To address these limitations, we propose a novel framework, C2D-ISR, for optimizing attention-based image super-resolution models from both performance and complexity perspectives. Our approach is based on a two-stage training methodology and a hierarchical encoding mechanism. The new training methodology involves continuous-scale training for discrete scale models, enabling the learning of inter-scale correlations and multi-scale feature representation. In addition, we generalize the hierarchical encoding mechanism with existing attention-based network structures, which can achieve improved spatial feature fusion, cross-scale information aggregation, and more importantly, much faster inference. We have evaluated the C2D-ISR framework based on three efficient attention-based backbones, SwinIR-L, SRFormer-L and MambaIRv2-L, and demonstrated significant improvements over the other existing optimization framework, HiT, in terms of super-resolution performance (up to 0.2dB) and computational complexity reduction (up to 11%). The source code will be made publicly available at www.github.com.

MonoCT: Overcoming Monocular 3D Detection Domain Shift with Consistent Teacher Models

Johannes Meier,Louis Inchingolo,Oussema Dhaouadi,Yan Xia,Jacques Kaiser,Daniel Cremers

Task: Address monocular 3D object detection across different sensors, environments, and camera setups.

Motivation: Accurate depth estimation is critical to mitigating domain shifts.

Details

Method: Proposes MonoCT, a novel unsupervised domain adaptation method comprising a Generalized Depth Enhancement (GDE) module and a Pseudo Label Scoring (PLS) module.

Result: On six benchmarks, MonoCT outperforms existing SOTA domain adaptation methods by large margins (at least ~21% on AP Mod.) and generalizes well to car, traffic camera, and drone views.

Conclusion: By improving depth estimation and generating high-quality pseudo labels, MonoCT substantially improves the cross-domain performance of monocular 3D object detection.

Abstract: We tackle the problem of monocular 3D object detection across different sensors, environments, and camera setups. In this paper, we introduce a novel unsupervised domain adaptation approach, MonoCT, that generates highly accurate pseudo labels for self-supervision. Inspired by our observation that accurate depth estimation is critical to mitigating domain shifts, MonoCT introduces a novel Generalized Depth Enhancement (GDE) module with an ensemble concept to improve depth estimation accuracy. Moreover, we introduce a novel Pseudo Label Scoring (PLS) module by exploring inner-model consistency measurement and a Diversity Maximization (DM) strategy to further generate high-quality pseudo labels for self-training. Extensive experiments on six benchmarks show that MonoCT outperforms existing SOTA domain adaptation methods by large margins (~21% minimum for AP Mod.) and generalizes well to car, traffic camera and drone views.

FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution

Ali Mollaahmadi Dehaghi,Hossein KhademSohi,Reza Razavi,Steve Drew,Mohammad Moshirpour

Task: Propose FedVSR, a novel, architecture-independent, and stateless federated learning framework for video super-resolution.

Motivation: Existing deep learning-based video super-resolution methods raise privacy concerns, and existing federated learning methods perform poorly on low-level vision tasks.

Details

Method: FedVSR introduces a lightweight loss term that improves local optimization and guides global aggregation.

Result: FedVSR outperforms general federated learning methods by an average of 0.85 dB in PSNR.

Conclusion: FedVSR is the first federated video super-resolution method, and the experiments demonstrate its effectiveness.

Abstract: Video Super-Resolution (VSR) reconstructs high-resolution videos from low-resolution inputs to restore fine details and improve visual clarity. While deep learning-based VSR methods achieve impressive results, their centralized nature raises serious privacy concerns, particularly in applications with strict privacy requirements. Federated Learning (FL) offers an alternative approach, but existing FL methods struggle with low-level vision tasks, leading to suboptimal reconstructions. To address this, we propose FedVSR, a novel, architecture-independent, and stateless FL framework for VSR. Our approach introduces a lightweight loss term that improves local optimization and guides global aggregation with minimal computational overhead. To the best of our knowledge, this is the first attempt at federated VSR. Extensive experiments show that FedVSR outperforms general FL methods by an average of 0.85 dB in PSNR, highlighting its effectiveness. The code is available at: https://github.com/alimd94/FedVSR
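To put the reported 0.85 dB PSNR gain in perspective, the sketch below computes PSNR from its standard definition, 10 * log10(MAX^2 / MSE), and converts a dB gain into the equivalent mean-squared-error ratio. The function names and the flat-list image representation are illustrative assumptions; nothing here comes from the FedVSR code base.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized images,
    given as flat sequences of pixel intensities."""
    if len(reference) != len(reconstructed) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


print(psnr([10, 20, 30, 40], [12, 18, 33, 40]))
print(mse_ratio_for_gain(0.85))
```

Because PSNR is logarithmic in MSE, a 0.85 dB gain corresponds to the MSE shrinking by a factor of about 1.22, that is, roughly 18% lower mean-squared error, assuming the same peak value on both sides of the comparison.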

Fast alignment of heterogeneous images in sliced Wasserstein distance

Yunpeng Shi,Amit Singer,Eric J. Verbeke

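Task: Quickly align heterogeneous (similar but non-identical) images using the sliced Wasserstein distance.

Motivation: Many applications of computer vision rely on the alignment of similar but non-identical images.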

Details

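Method: Combines the speed of fast Fourier methods with the robustness of sliced probability metrics from optimal transport, computing the alignment between two $L \times L$ images under the sliced 2-Wasserstein distance in $O(L^2 \log L)$ operations. Result: The method is shown to be robust to translations, rotations, and deformations in the images. Conclusion: A fast optimal-transport-based algorithm can align heterogeneous images efficiently and robustly. Abstract: Many applications of computer vision rely on the alignment of similar but non-identical images. We present a fast algorithm for aligning heterogeneous images based on optimal transport. Our approach combines the speed of fast Fourier methods with the robustness of sliced probability metrics and allows us to efficiently compute the alignment between two $L \times L$ images using the sliced 2-Wasserstein distance in $O(L^2 \log L)$ operations. We show that our method is robust to translations, rotations and deformations in the images.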
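For readers unfamiliar with the metric, here is a small sketch of the sliced 2-Wasserstein distance itself, computed between two equal-size 2D point sets by projecting onto evenly spaced directions and sorting. It illustrates the definition only. It is not the authors' Fourier-based $O(L^2 \log L)$ alignment algorithm, and the point-set formulation and function name are assumptions.

```python
import math


def sliced_w2_squared(points_a, points_b, n_directions=64):
    """Approximate squared sliced 2-Wasserstein distance between two 2D point sets.

    Both sets must have the same size and are treated as uniform distributions.
    In 1D, optimal transport between equal-size samples matches sorted values.
    """
    if len(points_a) != len(points_b):
        raise ValueError("point sets must have the same size")
    total = 0.0
    for k in range(n_directions):
        theta = math.pi * k / n_directions  # directions cover a half circle
        c, s = math.cos(theta), math.sin(theta)
        proj_a = sorted(x * c + y * s for x, y in points_a)
        proj_b = sorted(x * c + y * s for x, y in points_b)
        total += sum((a - b) ** 2 for a, b in zip(proj_a, proj_b)) / len(proj_a)
    return total / n_directions
```

As a sanity check, translating a point set by a vector of length t gives a value of t²/2, since the squared projection of the shift averages to one half over all directions.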

Continual Unlearning for Foundational Text-to-Image Models without Generalization Erosion

Kartik Thakral,Tamar Glaser,Tal Hassner,Mayank Vatsa,Richa Singh

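Task: Effectively remove specific concepts from pre-trained generative foundation models without extensive retraining.

Motivation: Address the need to remove specific concepts from generative models, avoiding copyright infringement, misuse of personal or licensed material, and replication of distinctive artistic styles.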

Details

Method: Proposes the Decremental Unlearning without Generalization Erosion (DUGE) algorithm, which selectively removes undesired concepts using three losses (a cross-attention loss, a prior-preservation loss, and a regularization loss). Result: Experiments show that the method can remove specific concepts without compromising the overall integrity and performance of the model. Conclusion: DUGE offers a practical solution that handles the intricacies of model training and concept management while preserving the model's core capabilities and effectiveness. Abstract: How can we effectively unlearn selected concepts from pre-trained generative foundation models without resorting to extensive retraining? This research introduces 'continual unlearning', a novel paradigm that enables the targeted removal of multiple specific concepts from foundational generative models, incrementally. We propose the Decremental Unlearning without Generalization Erosion (DUGE) algorithm, which selectively unlearns the generation of undesired concepts while preserving the generation of related, non-targeted concepts and alleviating generalization erosion. For this, DUGE targets three losses: a cross-attention loss that steers the focus towards images devoid of the target concept; a prior-preservation loss that safeguards knowledge related to non-target concepts; and a regularization loss that prevents the model from suffering from generalization erosion. Experimental results demonstrate the ability of the proposed approach to exclude certain concepts without compromising the overall integrity and performance of the model. This offers a pragmatic solution for refining generative models, adeptly handling the intricacies of model training and concept management, and lowering the risks of copyright infringement, personal or licensed material misuse, and replication of distinctive artistic styles. Importantly, it maintains the non-targeted concepts, thereby safeguarding the model's core capabilities and effectiveness.

8-Calves Image dataset

Xuyang Fang,Sion Hannuna,Neill Campbell

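Task: Evaluate object detection and identity classification in occlusion-rich, temporally consistent environments.

Motivation: Provide a practical benchmark dataset for testing occlusion handling, temporal consistency, and efficiency.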

Details

Method: Uses the 8-Calves dataset, comprising a 1-hour video and 900 static frames, to fine-tune 28 detection models and to evaluate identity classification with 23 pretrained vision models. Result: Smaller YOLO models perform better at object detection; modern architectures such as ConvNextV2 excel at identity classification, while larger models tend to overfit. Conclusion: Minimal, targeted augmentation strategies work better on simpler datasets, pretraining strategies significantly improve identity recognition, and temporal continuity and natural motion patterns pose unique challenges. Abstract: We introduce the 8-Calves dataset, a benchmark for evaluating object detection and identity classification in occlusion-rich, temporally consistent environments. The dataset comprises a 1-hour video (67,760 frames) of eight Holstein Friesian calves in a barn, with ground truth bounding boxes and identities, alongside 900 static frames for detection tasks. Each calf exhibits a unique coat pattern, enabling precise identity distinction. For cow detection, we fine-tuned 28 models (25 YOLO variants, 3 transformers) on 600 frames, testing on the full video. Results reveal smaller YOLO models (e.g., YOLOv9c) outperform larger counterparts despite potential bias from a YOLOv8m-based labeling pipeline. For identity classification, embeddings from 23 pretrained vision models (ResNet, ConvNextV2, ViTs) were evaluated via linear classifiers and KNN. Modern architectures like ConvNextV2 excelled, while larger models frequently overfit, highlighting inefficiencies in scaling. Key findings include: (1) Minimal, targeted augmentations (e.g., rotation) outperform complex strategies on simpler datasets; (2) Pretraining strategies (e.g., BEiT, DinoV2) significantly boost identity recognition; (3) Temporal continuity and natural motion patterns offer unique challenges absent in synthetic or domain-specific benchmarks. The dataset's controlled design and extended sequences (1 hour vs. prior 10-minute benchmarks) make it a pragmatic tool for stress-testing occlusion handling, temporal consistency, and efficiency. The link to the dataset is https://github.com/tonyFang04/8-calves.

Using 3D reconstruction from image motion to predict total leaf area in dwarf tomato plants

Dmitrii Usenko,David Helman,Chen Giladi

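Task: Evaluate a method that combines sequential 3D reconstruction from RGB images with machine learning to estimate total leaf area (TLA) in dwarf tomato cultivars.

Motivation: Accurate TLA estimation is crucial for assessing plant growth, photosynthesis, and transpiration, but for bushy plants such as dwarf tomatoes, whose canopies are complex, traditional methods are often labor-intensive, damaging to plants, or unable to capture canopy complexity.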

Details

Method: Uses a non-destructive approach that combines sequential 3D reconstruction from RGB images with machine learning to estimate TLA for three dwarf tomato cultivars (Mohamed, Hahms Gelbe Topftomate, and Red Robin) grown under controlled greenhouse conditions. From high-resolution videos, 500 frames per plant are used for 3D reconstruction; point clouds are processed with four algorithms (Alpha Shape, Marching Cubes, Poisson's, Ball Pivoting), and seven regression models are evaluated. Result: Alpha Shape reconstruction (α = 3) with Extreme Gradient Boosting performed best (R² = 0.80, MAE = 489 cm²). Cross-experiment validation gave robust results (R² = 0.56, MAE = 579 cm²). Feature importance analysis identified height, width, and surface area as key predictors. Conclusion: This scalable, automated TLA estimation method is suited to urban farming and precision agriculture, with applications in automated pruning, resource efficiency, and sustainable food production. The approach was robust across variable environmental conditions and canopy structures. Abstract: Accurate estimation of total leaf area (TLA) is crucial for evaluating plant growth, photosynthetic activity, and transpiration. However, it remains challenging for bushy plants like dwarf tomatoes due to their complex canopies. Traditional methods are often labor-intensive, damaging to plants, or limited in capturing canopy complexity. This study evaluated a non-destructive method combining sequential 3D reconstructions from RGB images and machine learning to estimate TLA for three dwarf tomato cultivars (Mohamed, Hahms Gelbe Topftomate, and Red Robin) grown under controlled greenhouse conditions. Two experiments (spring-summer and autumn-winter) included 73 plants, yielding 418 TLA measurements via an "onion" approach. High-resolution videos were recorded, and 500 frames per plant were used for 3D reconstruction. Point clouds were processed using four algorithms (Alpha Shape, Marching Cubes, Poisson's, Ball Pivoting), and meshes were evaluated with seven regression models: Multivariable Linear Regression, Lasso Regression, Ridge Regression, Elastic Net Regression, Random Forest, Extreme Gradient Boosting, and Multilayer Perceptron. The Alpha Shape reconstruction ($\alpha = 3$) with Extreme Gradient Boosting achieved the best performance ($R^2 = 0.80$, $\mathrm{MAE} = 489\ \mathrm{cm}^2$). Cross-experiment validation showed robust results ($R^2 = 0.56$, $\mathrm{MAE} = 579\ \mathrm{cm}^2$). Feature importance analysis identified height, width, and surface area as key predictors. This scalable, automated TLA estimation method is suited for urban farming and precision agriculture, offering applications in automated pruning, resource efficiency, and sustainable food production. The approach demonstrated robustness across variable environmental conditions and canopy structures.
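The two reported metrics, the coefficient of determination (R²) and the mean absolute error (MAE), can be computed as in the sketch below. The leaf-area values in the usage note are made-up numbers for illustration, not data from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

For hypothetical measured areas of 1000, 2000, and 3000 cm² and predictions of 1100, 1900, and 3200 cm², the MAE is about 133 cm² and R² is 0.97.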

Identifying and Mitigating Position Bias of Multi-image Vision-Language Models

Xinyu Tian,Shu Zou,Zhaoyuan Yang,Jing Zhang

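Task: Quantify the position bias of large vision-language models in multi-image reasoning.

Motivation: Despite progress in multi-image reasoning, the reasoning ability of large vision-language models varies markedly with image position, which undermines the robustness of their predictions.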

Details

Method: Proposes the Position-wise Question Answering (PQA) task to quantify reasoning ability at each position, and SoFt Attention (SoFA), which mitigates position bias through linear interpolation. Result: Experiments show that SoFA reduces position bias and improves the reasoning performance of existing large vision-language models. Conclusion: SoFA effectively mitigates the position bias of large vision-language models in multi-image reasoning and improves their reasoning ability. Abstract: The evolution of Large Vision-Language Models (LVLMs) has progressed from single to multi-image reasoning. Despite this advancement, our findings indicate that LVLMs struggle to robustly utilize information across multiple images, with predictions significantly affected by the alteration of image positions. To further explore this issue, we introduce Position-wise Question Answering (PQA), a meticulously designed task to quantify reasoning capabilities at each position. Our analysis reveals a pronounced position bias in LVLMs: open-source models excel in reasoning with images positioned later but underperform with those in the middle or at the beginning, while proprietary models show improved comprehension for images at the beginning and end but struggle with those in the middle. Motivated by this, we propose SoFt Attention (SoFA), a simple, training-free approach that mitigates this bias by employing linear interpolation between inter-image causal attention and bidirectional counterparts. Experimental results demonstrate that SoFA reduces position bias and enhances the reasoning performance of existing LVLMs.
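The idea of interpolating between causal and bidirectional attention can be sketched as follows. This is an illustrative reading of the abstract, not the SoFA implementation: it treats each row and column as one image, blends the two attention-weight matrices with a hypothetical coefficient `sigma`, and assumes NumPy is available.

```python
import numpy as np


def masked_softmax(scores, mask):
    """Row-wise softmax restricted to positions where mask is True."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


def interpolated_attention(scores, sigma):
    """Blend causal and bidirectional attention weights.

    sigma = 0 gives purely causal attention; sigma = 1 gives fully bidirectional.
    """
    n = scores.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))   # image i sees images 0..i
    bidirectional = np.ones((n, n), dtype=bool)     # every image sees all images
    return ((1.0 - sigma) * masked_softmax(scores, causal)
            + sigma * masked_softmax(scores, bidirectional))
```

Each row of the result still sums to one because it is a convex combination of two probability distributions, so with `sigma > 0` the first image gains some access to later images without any retraining.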

LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation

Yang Zhou,Shiyu Zhao,Yuxiao Chen,Zhenting Wang,Dimitris N. Metaxas

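Task: Enhance visual grounding for open-vocabulary object detection (OVD) by leveraging the hidden states of large language models (LLMs).

Motivation: Large foundation models trained on large-scale visual-text data can significantly enhance OVD through data generation, but this may yield biased synthetic data and overfitting to specific configurations. Directly leveraging LLM hidden states sidesteps the biases of manually curated data generation.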

Details

Method: Proposes a systematic method that enhances visual grounding by using the decoder layers of the LLM inside a multimodal large language model (MLLM). A zero-initialized cross-attention adapter enables efficient knowledge transfer from the LLM to the object detector; the approach is called LED (LLM Enhanced Open-Vocabulary Object Detection). Result: Experiments show that intermediate hidden states from early LLM layers retain strong spatial-semantic correlations that benefit grounding tasks. The adaptation strategy significantly improves performance on complex free-form text queries while leaving performance on plain categories unchanged. Qwen2-0.5B with Swin-T as the vision encoder improves GroundingDINO by 2.33% on Omnilabel at the cost of 8.7% more GFLOPs; Qwen2-0.5B with a larger vision encoder further improves performance by 6.22%. Conclusion: The design is validated by ablations over adapter architectures, LLM sizes, and the layers at which adaptation is added. Abstract: Large foundation models trained on large-scale visual-text data can significantly enhance Open Vocabulary Object Detection (OVD) through data generation. However, this may lead to biased synthetic data and overfitting to specific configurations. Directly leveraging hidden states of Large Language Models (LLMs), which is surprisingly rarely explored, can sidestep the biases of manually curated data generation. This paper presents a systematic method to enhance visual grounding by utilizing decoder layers of the LLM of an MLLM. We introduce a zero-initialized cross-attention adapter to enable efficient knowledge transfer from LLMs to object detectors, a new approach called LED (LLM Enhanced Open-Vocabulary Object Detection). We demonstrate that intermediate hidden states from early LLM layers retain strong spatial-semantic correlations that are beneficial to grounding tasks. Experiments show that our adaptation strategy significantly enhances the performance on complex free-form text queries while remaining the same on plain categories. With our adaptation, Qwen2-0.5B with Swin-T as the vision encoder improves GroundingDINO by 2.33% on Omnilabel, at the overhead of 8.7% more GFLOPs. Qwen2-0.5B with a larger vision encoder can further boost the performance by 6.22%. We further validate our design by ablating on varied adapter architectures, sizes of LLMs, and which layers to add adaptation.
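A zero-initialized cross-attention adapter can be sketched as below. The point of zero initialization is that the adapter starts as an identity mapping, so the pre-trained detector's behavior is untouched until training moves the output projection away from zero. The class name, dimensions, and single-head layout are assumptions for illustration and do not reproduce the LED architecture; NumPy is assumed to be available.

```python
import numpy as np


class ZeroInitCrossAttentionAdapter:
    """Single-head cross-attention whose output projection starts at zero."""

    def __init__(self, det_dim, llm_dim, rng):
        self.w_q = rng.normal(scale=0.02, size=(det_dim, det_dim))
        self.w_k = rng.normal(scale=0.02, size=(llm_dim, det_dim))
        self.w_v = rng.normal(scale=0.02, size=(llm_dim, det_dim))
        # Zero init: the residual branch contributes nothing at the start.
        self.w_out = np.zeros((det_dim, det_dim))

    def __call__(self, det_feats, llm_hidden):
        q = det_feats @ self.w_q                      # (n_det, det_dim)
        k = llm_hidden @ self.w_k                     # (n_llm, det_dim)
        v = llm_hidden @ self.w_v                     # (n_llm, det_dim)
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores = scores - scores.max(axis=-1, keepdims=True)
        attn = np.exp(scores)
        attn = attn / attn.sum(axis=-1, keepdims=True)
        # Residual connection: detector features plus adapted LLM information.
        return det_feats + (attn @ v) @ self.w_out
```

At initialization the adapter returns the detector features unchanged, which is one plausible reason such adapters can be added without disturbing performance on plain categories.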

SMILE: a Scale-aware Multiple Instance Learning Method for Multicenter STAS Lung Cancer Histopathology Diagnosis

Liangrui Pan,Xiaoyu Li,Yutao Dou,Qiya Song,Jiadi Luo,Qingchun Liang,Shaoliang Peng

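Task: Develop an automated and precise diagnostic method for detecting spread through air spaces (STAS) in lung cancer.

Motivation: Pathologists currently rely on time-consuming, highly subjective manual assessment that is prone to variation, creating an urgent need for automated and precise diagnostic solutions.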

Details

Method: Proposes a scale-aware multiple instance learning (SMILE) method whose scale-adaptive attention mechanism adaptively adjusts high-attention instances, reducing over-reliance on local regions and promoting consistent detection of STAS lesions. Result: SMILE achieved competitive diagnostic results on the STAS CSU dataset and diagnosed 251 and 319 STAS samples in the CPTAC and TCGA datasets, respectively, surpassing the clinical average AUC. Conclusion: SMILE establishes the first open baseline results for STAS research, laying the foundation for the future expansion, interpretability, and clinical integration of computational pathology technologies. Abstract: Spread through air spaces (STAS) represents a newly identified aggressive pattern in lung cancer, which is known to be associated with adverse prognostic factors and complex pathological features. Pathologists currently rely on time-consuming manual assessments, which are highly subjective and prone to variation. This highlights the urgent need for automated and precise diagnostic solutions. We collected 2,970 lung cancer tissue slides from multiple centers, re-diagnosed them, and constructed and publicly released three lung cancer STAS datasets: STAS CSU (hospital), STAS TCGA, and STAS CPTAC. All STAS datasets provide corresponding pathological feature diagnoses and related clinical data. To address the biased, sparse, and heterogeneous nature of STAS, we propose a scale-aware multiple instance learning (SMILE) method for STAS diagnosis of lung cancer. By introducing a scale-adaptive attention mechanism, SMILE can adaptively adjust high-attention instances, reducing over-reliance on local regions and promoting consistent detection of STAS lesions. Extensive experiments show that SMILE achieved competitive diagnostic results on STAS CSU, diagnosing 251 and 319 STAS samples in CPTAC and TCGA, respectively, surpassing clinical average AUC. The 11 open baseline results are the first to be established for STAS research, laying the foundation for the future expansion, interpretability, and clinical integration of computational pathology technologies. The datasets and code are available at https://anonymous.4open.science/r/IJCAI25-1DA1.

Text-Guided Image Invariant Feature Learning for Robust Image Watermarking

Muhammad Ahtesham,Xin Zhong

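Task: Propose a text-guided invariant feature learning framework for robust image watermarking.

Motivation: Robust image watermarking is crucial for maintaining content integrity under diverse transformations.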

Details

Method: Leverages CLIP's multimodal capabilities, using text embeddings as stable semantic anchors to enforce feature invariance under distortions. Result: Evaluated on multiple datasets, the method shows superior robustness against various image transformations. Compared with state-of-the-art self-supervised learning methods, the model achieves higher cosine similarity in feature consistency tests and outperforms existing watermarking schemes in extraction accuracy under severe distortions. Conclusion: The results show that the method is effective at learning invariant representations tailored to robust deep learning-based watermarking. Abstract: Ensuring robustness in image watermarking is crucial for maintaining content integrity under diverse transformations. Recent self-supervised learning (SSL) approaches, such as DINO, have been leveraged for watermarking but primarily focus on general feature representation rather than explicitly learning invariant features. In this work, we propose a novel text-guided invariant feature learning framework for robust image watermarking. Our approach leverages CLIP's multimodal capabilities, using text embeddings as stable semantic anchors to enforce feature invariance under distortions. We evaluate the proposed method across multiple datasets, demonstrating superior robustness against various image transformations. Compared to state-of-the-art SSL methods, our model achieves higher cosine similarity in feature consistency tests and outperforms existing watermarking schemes in extraction accuracy under severe distortions. These results highlight the efficacy of our method in learning invariant representations tailored for robust deep learning-based watermarking.

Organ-aware Multi-scale Medical Image Segmentation Using Text Prompt Engineering

Wenjie Zhang,Ziyang Zhang,Mengnan He,Jiancheng Ye

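Task: Develop the Organ-aware Multi-scale Text-guided Medical Image Segmentation Model (OMT-SAM) for multi-organ segmentation.

Motivation: Existing medical image segmentation methods rely mainly on uni-modal visual inputs and require extensive manual annotation, and medical imaging captures multiple intertwined organs within a single scan, which further complicates segmentation.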

Details

Method: Introduces CLIP encoders as a novel image-text prompt encoder that operates alongside the geometric prompt encoder to provide informative contextual guidance. Fused image-text embeddings are generated with pre-trained CLIP encoders and a cross-attention mechanism, and multi-scale visual features are extracted from MedSAM. Result: Evaluated on the FLARE 2021 dataset, OMT-SAM achieves a mean Dice Similarity Coefficient of 0.937, outperforming MedSAM (0.893) and other segmentation models. Conclusion: OMT-SAM handles complex medical image segmentation tasks well and shows superior capability in multi-organ segmentation. Abstract: Accurate segmentation is essential for effective treatment planning and disease monitoring. Existing medical image segmentation methods predominantly rely on uni-modal visual inputs, such as images or videos, requiring labor-intensive manual annotations. Additionally, medical imaging techniques capture multiple intertwined organs within a single scan, further complicating segmentation accuracy. To address these challenges, MedSAM, a large-scale medical segmentation model based on the Segment Anything Model (SAM), was developed to enhance segmentation accuracy by integrating image features with user-provided prompts. While MedSAM has demonstrated strong performance across various medical segmentation tasks, it primarily relies on geometric prompts (e.g., points and bounding boxes) and lacks support for text-based prompts, which could help specify subtle or ambiguous anatomical structures. To overcome these limitations, we propose the Organ-aware Multi-scale Text-guided Medical Image Segmentation Model (OMT-SAM) for multi-organ segmentation. Our approach introduces CLIP encoders as a novel image-text prompt encoder, operating with the geometric prompt encoder to provide informative contextual guidance. We pair descriptive textual prompts with corresponding images, processing them through pre-trained CLIP encoders and a cross-attention mechanism to generate fused image-text embeddings. Additionally, we extract multi-scale visual features from MedSAM, capturing fine-grained anatomical details at different levels of granularity. We evaluate OMT-SAM on the FLARE 2021 dataset, benchmarking its performance against existing segmentation methods. Empirical results demonstrate that OMT-SAM achieves a mean Dice Similarity Coefficient of 0.937, outperforming MedSAM (0.893) and other segmentation models, highlighting its superior capability in handling complex medical image segmentation tasks.
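The Dice Similarity Coefficient used to report these results measures the overlap between a predicted mask and the ground truth: twice the intersection divided by the sum of the two mask sizes. A minimal sketch on flat binary masks (the flat-list layout is an assumption for illustration):

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice similarity coefficient between two flat binary masks of 0/1 values."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    if total == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * intersection / total
```

A mean Dice score, as reported above, would then be the average of this value over organs or cases.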

FusDreamer: Label-efficient Remote Sensing World Model for Multimodal Data Classification

Jinping Wang,Weiwei Song,Hao Chen,Jinchang Ren,Huimin Zhao

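Task: Propose FusDreamer, a label-efficient remote sensing world model for multimodal data fusion.

Motivation: Explore the potential of world models in remote sensing to improve data integration and learning efficiency.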

Details

Method: Uses the world model as a unified representation container, adopts a new latent diffusion fusion and multimodal generation paradigm (LaMG), and combines it with an open-world knowledge-guided consistency projection (OK-CP) module and a multitask combinatorial optimization (MuCO) strategy. Result: Experiments on four typical datasets demonstrate the effectiveness and advantages of FusDreamer. Conclusion: By combining multimodal data fusion with a world model, FusDreamer significantly improves the understanding of remote sensing data and learning efficiency. Abstract: World models significantly enhance hierarchical understanding, improving data integration and learning efficiency. To explore the potential of the world model in the remote sensing (RS) field, this paper proposes a label-efficient remote sensing world model for multimodal data fusion (FusDreamer). The FusDreamer uses the world model as a unified representation container to abstract common and high-level knowledge, promoting interactions across different types of data, i.e., hyperspectral (HSI), light detection and ranging (LiDAR), and text data. Initially, a new latent diffusion fusion and multimodal generation paradigm (LaMG) is utilized for its exceptional information integration and detail retention capabilities. Subsequently, an open-world knowledge-guided consistency projection (OK-CP) module incorporates prompt representations for visually described objects and aligns language-visual features through contrastive learning. In this way, the domain gap can be bridged by fine-tuning the pre-trained world models with limited samples. Finally, an end-to-end multitask combinatorial optimization (MuCO) strategy can capture slight feature bias and constrain the diffusion process in a collaboratively learnable direction. Experiments conducted on four typical datasets indicate the effectiveness and advantages of the proposed FusDreamer. The corresponding code will be released at https://github.com/Cimy-wang/FusDreamer.

MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments

Zhixuan Liu,Haokun Zhu,Rui Chen,Jonathan Francis,Soonmin Hwang,Ji Zhang,Jean Oh

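Task: Generate privacy-preserving digital twins of multi-room indoor environments.

Motivation: Existing panorama-based methods accumulate errors under sequential or single-room constraints and do not account for cross-view dependencies within the same scene.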

Details

Method: Proposes a novel Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model that avoids error accumulation through inference-time optimization and reduces variance during denoising. Result: MOSAIC outperforms state-of-the-art baselines on image fidelity metrics when reconstructing complex multi-room environments. Conclusion: MOSAIC scales to complex scenes without extra training, and adding overlapping views improves generation quality. Abstract: We introduce a novel diffusion-based approach for generating privacy-preserving digital twins of multi-room indoor environments from depth images only. Central to our approach is a novel Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model that explicitly considers cross-view dependencies within the same scene in the probabilistic sense. MOSAIC operates through a novel inference-time optimization that avoids error accumulation common in sequential or single-room constraint in panorama-based approaches. MOSAIC scales to complex scenes with zero extra training and provably reduces the variance during denoising processes when more overlapping views are added, leading to improved generation quality. Experiments show that MOSAIC outperforms state-of-the-art baselines on image fidelity metrics in reconstructing complex multi-room environments. Project page is available at: https://mosaic-cmubig.github.io

Stitch-a-Recipe: Video Demonstration from Multistep Descriptions

Chi Hsuan Wu,Kumar Ashutosh,Kristen Grauman

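Task: Generate visually coherent video demonstrations from multistep text descriptions.

Motivation: Existing methods handle only descriptions with a single text context and cannot handle multistep descriptions such as cooking recipes; processing each step in isolation produces visually incoherent results.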

Details

Method: Proposes Stitch-a-Recipe, a retrieval-based method that assembles a video demonstration from a multistep description. A training pipeline creates large-scale weakly supervised data and injects hard negatives to promote correctness and coherence. Result: Validated on in-the-wild instructional videos, Stitch-a-Recipe achieves state-of-the-art performance, with quantitative gains of up to 24% and dramatic wins in a human preference study. Conclusion: Stitch-a-Recipe generates visually coherent video demonstrations from multistep descriptions and significantly outperforms existing methods. Abstract: When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context (a caption or an action description) and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g. a cooking recipe composed of multiple steps. Furthermore, simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Recipe, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions, while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse and novel recipes and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Recipe achieves state-of-the-art performance, with quantitative gains up to 24% as well as dramatic wins in a human preference study.

Scale-Aware Contrastive Reverse Distillation for Unsupervised Medical Anomaly Detection

Chunlei Li,Yilei Shi,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou

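Task: Propose a new scale-aware contrastive reverse distillation model for unsupervised anomaly detection.

Motivation: Because labeled anomalous data are scarce in fields such as medical imaging, unsupervised anomaly detection is broadly applicable. Existing generative approaches suffer from overgeneralization, while reverse distillation methods fall short in feature discriminability and in handling anomaly scale variation.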

Details

Method: Proposes a contrastive student-teacher learning approach that obtains more discriminative representations by generating and exploring out-of-normal distributions, together with a scale adaptation mechanism that softly weights the contrastive distillation losses at different scales. Result: Extensive experiments on benchmark datasets show state-of-the-art performance. Conclusion: The proposed method performs well in unsupervised anomaly detection, with particular advantages in feature discriminability and in handling anomaly scale variation. Abstract: Unsupervised anomaly detection using deep learning has garnered significant research attention due to its broad applicability, particularly in medical imaging where labeled anomalous data are scarce. While earlier approaches leverage generative models like autoencoders and generative adversarial networks (GANs), they often fall short due to overgeneralization. Recent methods explore various strategies, including memory banks, normalizing flows, self-supervised learning, and knowledge distillation, to enhance discrimination. Among these, knowledge distillation, particularly reverse distillation, has shown promise. Following this paradigm, we propose a novel scale-aware contrastive reverse distillation model that addresses two key limitations of existing reverse distillation methods: insufficient feature discriminability and inability to handle anomaly scale variations. Specifically, we introduce a contrastive student-teacher learning approach to derive more discriminative representations by generating and exploring out-of-normal distributions. Further, we design a scale adaptation mechanism to softly weight contrastive distillation losses at different scales to account for the scale variation issue. Extensive experiments on benchmark datasets demonstrate state-of-the-art performance, validating the efficacy of the proposed method. Code is available at https://github.com/MedAITech/SCRD4AD.
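One simple way to "softly weight" losses across scales is a softmax over per-scale logits, so that the weights are positive and sum to one. The sketch below shows that generic idea only; how the paper actually derives its weights is not stated in the abstract, and the logits here are hypothetical inputs.

```python
import math


def soft_scale_weights(logits):
    """Softmax over per-scale logits: positive weights that sum to one."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - m) for value in logits]
    z = sum(exps)
    return [e / z for e in exps]


def scale_weighted_loss(per_scale_losses, logits):
    """Weighted sum of distillation losses computed at different feature scales."""
    weights = soft_scale_weights(logits)
    return sum(w * loss for w, loss in zip(weights, per_scale_losses))
```

With equal logits, this reduces to the plain average of the per-scale losses; raising one logit shifts emphasis toward that scale without ever switching the others off completely.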

See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language Balance to Mitigate Dominant Modality Bias

JuneHyoung Kwon,MiHyeon Kim,Eunju Lee,Juhwan Choi,YoungBin Kim

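Task: Propose BalGrad, a new framework for mitigating dominant modality bias in vision-language models.

Motivation: Vision-language models perform well across many tasks but often rely on one modality for predictions, producing a dominant modality bias that significantly hurts performance, especially when one modality is impaired.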

Details

Method: Proposes the BalGrad framework, which comprises inter-modality gradient reweighting, adjustment of the KL divergence gradient based on each modality's contribution, and inter-task gradient projection that aligns task directions in a non-conflicting manner. Result: Experiments on the UPMC Food-101, Hateful Memes, and MM-IMDb datasets confirm that BalGrad effectively alleviates over-reliance on a specific modality when making predictions. Conclusion: BalGrad effectively mitigates dominant modality bias in vision-language models and improves model performance. Abstract: Vision-language (VL) models have demonstrated strong performance across various tasks. However, these models often rely on a specific modality for predictions, leading to "dominant modality bias." This bias significantly hurts performance, especially when one modality is impaired. In this study, we analyze model behavior under dominant modality bias and theoretically show that unaligned gradients or differences in gradient magnitudes prevent balanced convergence of the loss. Based on these findings, we propose a novel framework, BalGrad, to mitigate dominant modality bias. Our approach includes inter-modality gradient reweighting, adjusting the gradient of KL divergence based on each modality's contribution, and inter-task gradient projection to align task directions in a non-conflicting manner. Experiments on UPMC Food-101, Hateful Memes, and MM-IMDb datasets confirm that BalGrad effectively alleviates over-reliance on specific modalities when making predictions.
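Gradient projection for non-conflicting updates is commonly done in the style of PCGrad: when two gradients point in opposing directions (negative dot product), the conflicting component of one is removed. The sketch below shows that generic rule as an assumption; the abstract does not give BalGrad's exact projection or reweighting formulas.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_if_conflicting(grad_a, grad_b):
    """Remove from grad_a its component along grad_b when the two conflict.

    PCGrad-style rule: if grad_a . grad_b < 0, project grad_a onto the plane
    orthogonal to grad_b; otherwise return grad_a unchanged.
    """
    overlap = dot(grad_a, grad_b)
    if overlap >= 0:
        return list(grad_a)
    scale = overlap / dot(grad_b, grad_b)
    return [a - scale * b for a, b in zip(grad_a, grad_b)]
```

After projection, the adjusted gradient is orthogonal to the other one, so a step along it no longer increases the other objective to first order.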

SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing

Seokhyeon Hong,Chaelin Kim,Serin Yoon,Junghyun Nam,Sihun Cha,Junyong Noh

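Task: Propose a skeleton-aware latent diffusion model (SALAD) for text-driven motion generation and editing.

Motivation: Existing methods oversimplify the representations of skeletal joints, temporal frames, and textual words, limiting their ability to capture the information within each modality and the interactions among them. In addition, using pre-trained models for downstream tasks such as editing usually requires extra effort, including manual intervention, optimization, or fine-tuning.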

Details

Method: Introduces the skeleton-aware latent diffusion model (SALAD), which explicitly captures the intricate inter-relationships between joints, frames, and words. Cross-attention maps produced during generation enable attention-based zero-shot text-driven motion editing. Result: SALAD significantly outperforms previous methods in text-motion alignment without compromising generation quality, and it offers diverse editing capabilities beyond generation. Conclusion: SALAD performs well in text-driven motion generation and editing and is versatile in practical use. Abstract: Text-driven motion generation has advanced significantly with the rise of denoising diffusion models. However, previous methods often oversimplify representations for the skeletal joints, temporal frames, and textual words, limiting their ability to fully capture the information within each modality and their interactions. Moreover, when using pre-trained models for downstream tasks, such as editing, they typically require additional efforts, including manual interventions, optimization, or fine-tuning. In this paper, we introduce skeleton-aware latent diffusion (SALAD), a model that explicitly captures the intricate inter-relationships between joints, frames, and words. Furthermore, by leveraging cross-attention maps produced during the generation process, we enable attention-based zero-shot text-driven motion editing using a pre-trained SALAD model, requiring no additional user input beyond text prompts. Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality, and demonstrates practical versatility by providing diverse editing capabilities beyond generation. Code is available at the project page.

Disentangling Fine-Tuning from Pre-Training in Visual Captioning with Hybrid Markov Logic

Monika Shah,Somdeb Sarkhel,Deepak Venugopal

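Task: Study how fine-tuning affects what a multimodal system has learned.

Motivation: Because multimodal systems are pretrained at large scale before fine-tuning, it is hard to separate what a model learns during fine-tuning from what it already knows.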

Details

Method: Learns a probabilistic model over the training examples with Hybrid Markov Logic Networks (HMLNs), relating symbolic knowledge (extracted from captions) to visual features (extracted from images). Result: Two types of inference procedures are evaluated on the MSCOCO dataset; the results suggest that for BLIP2, which uses an LLM, fine-tuning may have a smaller influence on the knowledge the model has acquired. Conclusion: Models that use an LLM may have more general knowledge for visual captioning, so fine-tuning has less influence on what they know. Abstract: Multimodal systems have highly complex processing pipelines and are pretrained over large datasets before being fine-tuned for specific tasks such as visual captioning. However, it becomes hard to disentangle what the model learns during the fine-tuning process from what it already knows due to its pretraining. In this work, we learn a probabilistic model using Hybrid Markov Logic Networks (HMLNs) over the training examples by relating symbolic knowledge (extracted from the caption) with visual features (extracted from the image). For a generated caption, we quantify the influence of training examples based on the HMLN distribution using probabilistic inference. We evaluate two types of inference procedures on the MSCOCO dataset for different types of captioning models. Our results show that for BLIP2 (a model that uses an LLM), the fine-tuning may have smaller influence on the knowledge the model has acquired since it may have more general knowledge to perform visual captioning as compared to models that do not use an LLM.

MamBEV: Enabling State Space Models to Learn Birds-Eye-View Representations

Hongyu Ke,Jack Morris,Kentaro Oguchi,Xiaofei Cao,Yongkang Liu,Haoxin Wang,Yi Ding

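Task: Propose MamBEV, a Mamba-based framework for 3D detection from multi-camera images.

Motivation: Computationally efficient methods are essential for autonomous driving and assistance systems.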

Details

Method: Learns unified Bird's Eye View (BEV) representations using linear spatio-temporal SSM-based attention, and introduces SSM-based cross-attention. Result: MamBEV performs well across diverse visual perception metrics, with significantly improved computational and memory efficiency. Conclusion: MamBEV scales more efficiently with input size than existing benchmark models. Abstract: 3D visual perception tasks, such as 3D detection from multi-camera images, are essential components of autonomous driving and assistance systems. However, designing computationally efficient methods remains a significant challenge. In this paper, we propose a Mamba-based framework called MamBEV, which learns unified Bird's Eye View (BEV) representations using linear spatio-temporal SSM-based attention. This approach supports multiple 3D perception tasks with significantly improved computational and memory efficiency. Furthermore, we introduce SSM-based cross-attention, analogous to standard cross-attention, where BEV query representations can interact with relevant image features. Extensive experiments demonstrate MamBEV's promising performance across diverse visual perception metrics, highlighting its advantages in input scaling efficiency compared to existing benchmark models.

Less is More: Improving Motion Diffusion Models with Sparse Keyframes

Jinseok Bae,Inwoo Hwang,Young Yoon Lee,Ziyu Guo,Joseph Liu,Yizhak Ben-Shabat,Young Min Kim,Mubbasir Kapadia

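Task: Propose a diffusion framework based on sparse keyframes for efficiently generating diverse motion sequences.

Motivation: Existing methods represent motion as dense frame sequences, forcing the model to process redundant or less informative frames. This increases training complexity and limits the performance of generative models on downstream tasks.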

Details

Method: Proposes a novel diffusion framework designed around sparse, geometrically meaningful keyframes. It reduces computation by masking non-keyframes and efficiently interpolating missing frames, and it dynamically refines the keyframe mask during inference. Result: Experiments show that the method consistently outperforms existing methods in text alignment and motion realism while maintaining high performance with significantly fewer diffusion steps. Conclusion: Using sparse keyframes markedly improves the efficiency and performance of motion generation, and the framework is robust across different downstream tasks. Abstract: Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training complexity, especially when learning intricate distributions of large motion datasets even with modern neural architectures. This severely limits the performance of generative motion models for downstream tasks. Inspired by professional animators who mainly focus on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and geometrically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while also effectively maintaining high performance at significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks. Source code and pre-trained models will be released upon acceptance.
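The notion of keeping only keyframes and filling in the rest can be illustrated with plain linear interpolation over a single motion channel. This is a stand-in for illustration: the abstract does not say which interpolation scheme the authors use, and the dictionary-based keyframe format is an assumption.

```python
def interpolate_from_keyframes(keyframes, n_frames):
    """Rebuild a dense 1D motion channel from sparse keyframes.

    keyframes maps frame index -> value and must include frames 0 and n_frames - 1.
    Non-keyframes are filled by linear interpolation between neighbouring keyframes.
    """
    indices = sorted(keyframes)
    if indices[0] != 0 or indices[-1] != n_frames - 1:
        raise ValueError("first and last frames must be keyframes")
    motion = [0.0] * n_frames
    for i in indices:
        motion[i] = float(keyframes[i])
    for left, right in zip(indices, indices[1:]):
        span = right - left
        for t in range(left + 1, right):
            alpha = (t - left) / span
            motion[t] = (1.0 - alpha) * keyframes[left] + alpha * keyframes[right]
    return motion
```

For example, keyframes `{0: 0.0, 4: 4.0, 6: 0.0}` over 7 frames expand to `[0, 1, 2, 3, 4, 2, 0]`, so only 3 of the 7 frames need to be modelled explicitly.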

RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving

Yujin Wang,Quanfeng Liu,Zhengxin Jiang,Tianyi Wang,Junfeng Jiao,Hongqing Chu,Bingzhao Gao,Hong Chen

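Task: Propose a retrieval-augmented decision-making (RAD) framework that enhances the ability of vision-language models (VLMs) to generate meta-actions in autonomous driving scenes.

Motivation: VLMs show potential in autonomous driving tasks, but inadequate spatial perception and hallucination reduce their effectiveness in complex scenarios.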

Details

Method: Proposes a retrieval-augmented generation (RAG) pipeline that dynamically improves decision accuracy through a three-stage process (embedding flow, retrieving flow, and generating flow), and fine-tunes VLMs on a dataset derived from NuScenes to strengthen their spatial perception and bird's-eye view comprehension. Result: Experiments on the curated NuScenes-based dataset show that RAD outperforms baseline methods on key evaluation metrics, including match accuracy, F1 score, and a self-defined overall score. Conclusion: RAD effectively improves meta-action decision-making in autonomous driving tasks. Abstract: Accurately understanding and deciding high-level meta-actions is essential for ensuring reliable and safe autonomous driving systems. While vision-language models (VLMs) have shown significant potential in various autonomous driving tasks, they often suffer from limitations such as inadequate spatial perception and hallucination, reducing their effectiveness in complex autonomous driving scenarios. To address these challenges, we propose a retrieval-augmented decision-making (RAD) framework, a novel architecture designed to enhance VLMs' capabilities to reliably generate meta-actions in autonomous driving scenes. RAD leverages a retrieval-augmented generation (RAG) pipeline to dynamically improve decision accuracy through a three-stage process consisting of the embedding flow, retrieving flow, and generating flow. Additionally, we fine-tune VLMs on a specifically curated dataset derived from the NuScenes dataset to enhance their spatial perception and bird's-eye view image comprehension capabilities. Extensive experimental evaluations on the curated NuScenes-based dataset demonstrate that RAD outperforms baseline methods across key evaluation metrics, including match accuracy, F1 score, and a self-defined overall score, highlighting its effectiveness in improving meta-action decision-making for autonomous driving tasks.
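The retrieving stage of a RAG pipeline typically ranks stored examples by embedding similarity to the current query. The sketch below shows top-k retrieval by cosine similarity; the embeddings, the meta-action labels, and the function names are hypothetical and are not taken from the RAD implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, database, k=1):
    """Return the meta-actions of the k stored scenes most similar to the query.

    database is a list of (scene_embedding, meta_action) pairs.
    """
    ranked = sorted(
        database,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    return [action for _, action in ranked[:k]]
```

In a full pipeline, the retrieved scenes and their meta-actions would be passed to the VLM as additional context for the generating stage.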

HySurvPred: Multimodal Hyperbolic Embedding with Angle-Aware Hierarchical Contrastive Learning and Uncertainty Constraints for Survival Prediction

Jiaqi Yang,Wenting Chen,Xiaohan Xing,Sean He,Xiaoling Luo,Xinheng Lyu,Linlin Shen,Guoping Qiu

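Task: Integrate histopathology images and genomic data for cancer survival prediction.

Motivation: Existing methods rely on multimodal mapping and metrics in Euclidean space, which cannot fully capture the hierarchical structure of histopathology and genomic data. They also discretize survival time into independent risk intervals, ignoring its continuous and ordinal nature, and fail to make full use of censored data.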

Details

Method: Proposes the HySurvPred framework, built from three key modules: Multimodal Hyperbolic Mapping (MHM), Angle-aware Ranking-based Contrastive Loss (ARCL), and Censor-Conditioned Uncertainty Constraint (CUC). Result: Experiments on five benchmark datasets show that the method outperforms existing state-of-the-art methods. Conclusion: By exploring hierarchical structure in hyperbolic space and integrating multimodal features, HySurvPred effectively improves the accuracy of cancer survival prediction. Abstract: Multimodal learning that integrates histopathology images and genomic data holds great promise for cancer survival prediction. However, existing methods face key limitations: 1) They rely on multimodal mapping and metrics in Euclidean space, which cannot fully capture the hierarchical structures in histopathology (among patches from different resolutions) and genomics data (from genes to pathways). 2) They discretize survival time into independent risk intervals, which ignores its continuous and ordinal nature and fails to achieve effective optimization. 3) They treat censorship as a binary indicator, excluding censored samples from model optimization and not making full use of them. To address these challenges, we propose HySurvPred, a novel framework for survival prediction that integrates three key modules: Multimodal Hyperbolic Mapping (MHM), Angle-aware Ranking-based Contrastive Loss (ARCL) and Censor-Conditioned Uncertainty Constraint (CUC). Instead of relying on Euclidean space, we design the MHM module to explore the inherent hierarchical structures within each modality in hyperbolic space. To better integrate multimodal features in hyperbolic space, we introduce the ARCL module, which uses ranking-based contrastive learning to preserve the ordinal nature of survival time, along with the CUC module to fully explore the censored data. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on five benchmark datasets. The source code is to be released.
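
To make "mapping features into hyperbolic space" concrete, the sketch below uses the Poincaré ball model: the exponential map at the origin sends a Euclidean feature vector into the ball, and the geodesic distance grows rapidly near the boundary, which is what makes the space suited to tree-like hierarchies. The choice of the Poincaré ball and the curvature parameter `c` are assumptions; the summary does not say which hyperbolic model the MHM module uses.

```python
import math
from typing import List, Sequence


def exp_map_origin(v: Sequence[float], c: float = 1.0) -> List[float]:
    """Map a Euclidean (tangent) vector at the origin into the Poincare ball."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0] * len(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def poincare_distance(x: Sequence[float], y: Sequence[float], c: float = 1.0) -> float:
    """Geodesic distance between two points inside the ball of curvature -c."""
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1.0 - c * sq_x) * (1.0 - c * sq_y)
    return math.acosh(1.0 + 2.0 * c * sq_diff / denom) / math.sqrt(c)


# A feature with Euclidean norm 0.5 lands at hyperbolic distance 2 * 0.5 from the origin.
point = exp_map_origin([0.3, 0.4])
dist_to_origin = poincare_distance([0.0, 0.0], point)
```

With this convention, coarse concepts (for example low-resolution patches or pathways) can sit near the origin while fine-grained ones (high-resolution patches or genes) sit closer to the boundary.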

Robust3D-CIL: Robust Class-Incremental Learning for 3D Perception

Jinge Ma,Jiangpeng He,Fengqing Zhu

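Task: Study how class-incremental learning (CIL) can continually adapt to new 3D point cloud data in the presence of data corruption.

Motivation: In practice, 3D perception models must continuously adapt to new data and emerging object categories, but retraining from scratch is prohibitively expensive. Real-world 3D point cloud data also often contain corrupted samples, which poses a significant challenge for existing CIL methods.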

Details

Method: Proposes a novel exemplar selection strategy based on Farthest Point Sampling that preserves intra-class diversity when choosing replay exemplars, and introduces a replay method based on point cloud downsampling to use the limited replay buffer memory more efficiently. Result: Experiments show that the method improves the performance of replay-based CIL baselines by 2% to 11%. Conclusion: The method significantly improves class-incremental learning on 3D point cloud data under data corruption, showing its potential and effectiveness for real-world 3D applications. Abstract: 3D perception plays a crucial role in real-world applications such as autonomous driving, robotics, and AR/VR. In practical scenarios, 3D perception models must continuously adapt to new data and emerging object categories, but retraining from scratch incurs prohibitive costs. Therefore, adopting class-incremental learning (CIL) becomes particularly essential. However, real-world 3D point cloud data often include corrupted samples, which poses significant challenges for existing CIL methods and leads to more severe forgetting on corrupted data. To address these challenges, we consider the scenario in which a CIL model can be updated using point clouds with unknown corruption to better simulate real-world conditions. Inspired by Farthest Point Sampling, we propose a novel exemplar selection strategy that effectively preserves intra-class diversity when selecting replay exemplars, mitigating forgetting induced by data corruption. Furthermore, we introduce a point cloud downsampling-based replay method to utilize the limited replay buffer memory more efficiently, thereby further enhancing the model's continual learning ability. Extensive experiments demonstrate that our method improves the performance of replay-based CIL baselines by 2% to 11%, proving its effectiveness and promising potential for real-world 3D applications.
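
Farthest Point Sampling itself is a standard greedy procedure, and a minimal version applied to exemplar selection might look like the sketch below. It assumes each candidate sample of a class is represented by a feature vector and that Euclidean distance is used; how the paper adapts the idea to replay exemplars is not specified in this summary, so `farthest_point_selection` is only an illustration.

```python
import math
from typing import List, Sequence


def farthest_point_selection(
    features: Sequence[Sequence[float]], m: int, start: int = 0
) -> List[int]:
    """Greedily pick m indices so that each new pick is farthest from all picks so far."""
    if not 0 < m <= len(features):
        raise ValueError("m must be between 1 and the number of samples")
    selected = [start]
    # Distance from every sample to its nearest already-selected exemplar.
    min_dist = [math.dist(f, features[start]) for f in features]
    while len(selected) < m:
        nxt = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], math.dist(f, features[nxt]))
    return selected


# Two tight clusters and one sample in between: the picks spread across all three.
class_features = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [5.0, 0.0], [10.1, 0.0]]
exemplars = farthest_point_selection(class_features, m=3)
```

Because each pick maximises the distance to the current exemplar set, near-duplicate samples are unlikely to occupy the replay buffer together, which matches the stated goal of preserving intra-class diversity.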

MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation

Donggon Jang,Yucheol Cho,Suin Lee,Taehyeon Kim,Dae-Shik Kim

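Task: Build a large-scale dataset for Multi-target and Multi-granularity Reasoning (MMR) segmentation and propose an effective framework for the task.

Motivation: Current reasoning segmentation datasets focus mainly on single-target, object-level reasoning, which limits detailed recognition of object parts in multi-target contexts. The new dataset and framework are introduced to fill this gap.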

Details

Method: Builds MMR, a large-scale dataset of 194K complex and implicit instructions, and proposes a simple yet effective framework for multi-target, object-level, and part-level reasoning segmentation. Result: Experiments show that the proposed method reasons effectively in multi-target and multi-granularity scenarios, while existing reasoning segmentation models still have room for improvement. Conclusion: The MMR dataset and the proposed framework open new possibilities for multi-target and multi-granularity reasoning segmentation and help enable more flexible and optimized scene interaction. Abstract: The fusion of Large Language Models with vision models is pioneering new possibilities in user-interactive vision-language tasks. A notable application is reasoning segmentation, where models generate pixel-level segmentation masks by comprehending implicit meanings in human instructions. However, seamless human-AI interaction demands more than just object-level recognition; it requires understanding both objects and the functions of their detailed parts, particularly in multi-target scenarios. For example, when instructing a robot to "turn on the TV", there could be various ways to accomplish this command. Recognizing multiple objects capable of turning on the TV, such as the TV itself or a remote control (multi-target), provides more flexible options and aids in finding the optimized scenario. Furthermore, understanding specific parts of these objects, like the TV's button or the remote's button (part-level), is important for completing the action. Unfortunately, current reasoning segmentation datasets predominantly focus on a single target object-level reasoning, which limits the detailed recognition of an object's parts in multi-target contexts. To address this gap, we construct a large-scale dataset called Multi-target and Multi-granularity Reasoning (MMR). MMR comprises 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects, based on pre-existing image-mask sets. This dataset supports diverse and context-aware interactions by hierarchically providing object and part information. Moreover, we propose a straightforward yet effective framework for multi-target, object-level, and part-level reasoning segmentation. Experimental results on MMR show that the proposed method can reason effectively in multi-target and multi-granularity scenarios, while the existing reasoning segmentation model still has room for improvement.

YOLO-LLTS: Real-Time Low-Light Traffic Sign Detection via Prior-Guided Enhancement and Multi-Branch Feature Interaction

Ziyu Lin,Yunfan Wu,Yuhang Ma,Junzhou Chen,Ronghui Zhang,Jiaming Wu,Guodong Yin,Liang Lin

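Task: Propose YOLO-LLTS, an algorithm for real-time traffic sign detection in low-light environments.

Motivation: Detecting traffic signs under low-light conditions remains a significant challenge and calls for an effective solution.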

Details

Method: Introduces the HRFM-TOD module to address indistinct small-object features, develops the MFIA module to strengthen feature interaction, and proposes the PGFE module to improve image quality. Result: YOLO-LLTS achieves state-of-the-art performance on multiple datasets, and deployment on edge devices confirms its real-time applicability. Conclusion: YOLO-LLTS performs well on low-light traffic sign detection and has practical value. Abstract: Detecting traffic signs effectively under low-light conditions remains a significant challenge. To address this issue, we propose YOLO-LLTS, an end-to-end real-time traffic sign detection algorithm specifically designed for low-light environments. Firstly, we introduce the High-Resolution Feature Map for Small Object Detection (HRFM-TOD) module to address indistinct small-object features in low-light scenarios. By leveraging high-resolution feature maps, HRFM-TOD effectively mitigates the feature dilution problem encountered in conventional PANet frameworks, thereby enhancing both detection accuracy and inference speed. Secondly, we develop the Multi-branch Feature Interaction Attention (MFIA) module, which facilitates deep feature interaction across multiple receptive fields in both channel and spatial dimensions, significantly improving the model's information extraction capabilities. Finally, we propose the Prior-Guided Enhancement Module (PGFE) to tackle common image quality challenges in low-light environments, such as noise, low contrast, and blurriness. This module employs prior knowledge to enrich image details and enhance visibility, substantially boosting detection performance. To support this research, we construct a novel dataset, the Chinese Nighttime Traffic Sign Sample Set (CNTSSS), covering diverse nighttime scenarios, including urban, highway, and rural environments under varying weather conditions. Experimental evaluations demonstrate that YOLO-LLTS achieves state-of-the-art performance, outperforming the previous best methods by 2.7% mAP50 and 1.6% mAP50:95 on TT100K-night, 1.3% mAP50 and 1.9% mAP50:95 on CNTSSS, and achieving superior results on the CCTSDB2021 dataset. Moreover, deployment experiments on edge devices confirm the real-time applicability and effectiveness of our proposed approach.

Where do Large Vision-Language Models Look at when Answering Questions?

Xiaoying Xing,Chia-Wen Kuo,Li Fuxin,Yulei Niu,Fan Chen,Ming Li,Ying Wu,Longyin Wen,Sijie Zhu

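Task: Study the visual understanding behavior of Large Vision-Language Models (LVLMs) in vision-language understanding and reasoning tasks.

Motivation: Explore to what extent LVLMs rely on visual input and which image regions contribute to their responses.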

Details

Method: Extends existing heatmap visualization methods (e.g., iGOS++) to support LVLMs on open-ended visual question answering, and proposes a method for selecting visually relevant tokens. Result: A comprehensive analysis of state-of-the-art LVLMs reveals the relationship between focus region and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. Conclusion: The study offers several insights into LVLM behavior, covering the link between focus region and answer correctness, architectural differences in visual attention, and the effect of LLM scale on visual understanding. Abstract: Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? It is non-trivial to interpret the free-form generation of LVLMs due to their complicated visual architecture (e.g., multiple encoders and multi-resolution) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between generated answers and the input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus region and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at https://github.com/bytedance/LVLM_Interpretation.

Exploiting Inherent Class Label: Towards Robust Scribble Supervised Semantic Segmentation

Xinliang Zhang,Lei Zhu,Shuang Zeng,Hangzhou He,Ourui Fu,Zhengjian Yao,Zhaoheng Xie,Yanye Lu

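Task: Use scribble annotations for weakly supervised semantic segmentation to reduce manual annotation effort.

Motivation: The sparsity and variability of scribble annotations can lead to inconsistent and unstable predictions, so a more robust approach is needed.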

Details

Method: Proposes a holistic framework, the class-driven scribble promotion network, which uses scribble annotations together with their associated class labels to generate reliable pseudo-labels, and introduces a localization rectification module and a distance perception module to mitigate noisy labels and identify reliable regions. Result: The method shows competitive accuracy and robustness and outperforms existing approaches. Conclusion: The framework performs well on scribble-supervised semantic segmentation; the new benchmark datasets and code will be made publicly available. Abstract: Scribble-based weakly supervised semantic segmentation leverages only a few annotated pixels as labels to train a segmentation model, presenting significant potential for reducing the human labor involved in the annotation process. This approach faces two primary challenges: first, the sparsity of scribble annotations can lead to inconsistent predictions due to limited supervision; second, the variability in scribble annotations, reflecting differing human annotator preferences, can prevent the model from consistently capturing the discriminative regions of objects, potentially leading to unstable predictions. To address these issues, we propose a holistic framework, the class-driven scribble promotion network, for robust scribble-supervised semantic segmentation. This framework not only utilizes the provided scribble annotations but also leverages their associated class labels to generate reliable pseudo-labels. Within the network, we introduce a localization rectification module to mitigate noisy labels and a distance perception module to identify reliable regions surrounding scribble annotations and pseudo-labels. In addition, we introduce new large-scale benchmarks, ScribbleCOCO and ScribbleCityscapes, accompanied by a scribble simulation algorithm that enables evaluation across varying scribble styles. Our method demonstrates competitive performance in both accuracy and robustness, underscoring its superiority over existing approaches. The datasets and the codes will be made publicly available.

TGBFormer: Transformer-GraphFormer Blender Network for Video Object Detection

Qiang Qi,Xiao Wang

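Task: Propose a Transformer-GraphFormer Blender Network (TGBFormer) for video object detection that exploits the strengths of transformers and graph convolutional networks while compensating for their limitations.

Motivation: Existing video object detection methods rely solely on CNNs or ViTs for feature aggregation and cannot leverage global and local information at the same time, which limits detection performance.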

Details

Method: Proposes three key technical improvements: 1) a spatial-temporal transformer module that aggregates global contextual information; 2) a spatial-temporal GraphFormer module that aggregates features using local spatial and temporal relationships; 3) a global-local feature blender module that adaptively couples transformer-based global representations with GraphFormer-based local representations. Result: On the ImageNet VID dataset, TGBFormer achieves 86.5% mAP while running at around 41.0 FPS on a single Tesla A100 GPU. Conclusion: TGBFormer sets new state-of-the-art results in video object detection, demonstrating the effectiveness of leveraging global and local information together. Abstract: Video object detection has made significant progress in recent years thanks to convolutional neural networks (CNNs) and vision transformers (ViTs). Typically, CNNs excel at capturing local features but struggle to model global representations. Conversely, ViTs are adept at capturing long-range global features but face challenges in representing local feature details. Off-the-shelf video object detection methods solely rely on CNNs or ViTs to conduct feature aggregation, which hampers their capability to simultaneously leverage global and local information, thereby resulting in limited detection performance. In this paper, we propose a Transformer-GraphFormer Blender Network (TGBFormer) for video object detection, with three key technical improvements to fully exploit the advantages of transformers and graph convolutional networks while compensating for their limitations. First, we develop a spatial-temporal transformer module to aggregate global contextual information, constituting global representations with long-range feature dependencies. Second, we introduce a spatial-temporal GraphFormer module that utilizes local spatial and temporal relationships to aggregate features, generating new local representations that are complementary to the transformer outputs. Third, we design a global-local feature blender module to adaptively couple transformer-based global representations and GraphFormer-based local representations. Extensive experiments demonstrate that our TGBFormer establishes new state-of-the-art results on the ImageNet VID dataset. Particularly, our TGBFormer achieves 86.5% mAP while running at around 41.0 FPS on a single Tesla A100 GPU.

HSOD-BIT-V2: A New Challenging Benchmark for Hyperspectral Salient Object Detection

Yuhao Qiu,Shuyan Bai,Tingfa Xu,Peifu Liu,Haolin Qin,Jianan Li

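Task: Propose Hyper-HRNet, a new high-resolution hyperspectral salient object detection network, and introduce HSOD-BIT-V2, the largest hyperspectral salient object detection benchmark dataset.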

Motivation: RGB-based methods face limitations in challenging scenes, such as small objects and similar color features. Hyperspectral images provide a promising solution for more accurate Hyperspectral Salient Object Detection (HSOD) by abundant spectral information, while HSOD methods are hindered by the lack of extensive and available datasets.

Details

Method: We propose Hyper-HRNet, a high-resolution HSOD network. Hyper-HRNet effectively extracts, integrates, and preserves effective spectral information while reducing dimensionality by capturing the self-similar spectral features. Additionally, it conveys fine details and precisely locates object contours by incorporating comprehensive global information and detailed object saliency representations. Result: Experimental analysis demonstrates that Hyper-HRNet outperforms existing models, especially in challenging scenarios. Conclusion: Hyper-HRNet and HSOD-BIT-V2 dataset provide significant advancements in hyperspectral salient object detection, particularly in challenging scenarios. Abstract: Salient Object Detection (SOD) is crucial in computer vision, yet RGB-based methods face limitations in challenging scenes, such as small objects and similar color features. Hyperspectral images provide a promising solution for more accurate Hyperspectral Salient Object Detection (HSOD) by abundant spectral information, while HSOD methods are hindered by the lack of extensive and available datasets. In this context, we introduce HSOD-BIT-V2, the largest and most challenging HSOD benchmark dataset to date. Five distinct challenges focusing on small objects and foreground-background similarity are designed to emphasize spectral advantages and real-world complexity. To tackle these challenges, we propose Hyper-HRNet, a high-resolution HSOD network. Hyper-HRNet effectively extracts, integrates, and preserves effective spectral information while reducing dimensionality by capturing the self-similar spectral features. Additionally, it conveys fine details and precisely locates object contours by incorporating comprehensive global information and detailed object saliency representations. Experimental analysis demonstrates that Hyper-HRNet outperforms existing models, especially in challenging scenarios.

PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds

Barza Nisar,Steven L. Waslander

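Task: Propose PSA-SSL, a new self-supervised learning method for 3D point cloud data that learns object pose- and size-aware features.

Motivation: Existing self-supervised learning methods for 3D point clouds fail to retain geometric information such as object pose and scale, which hurts downstream localization and geometry-sensitive 3D scene understanding tasks such as 3D semantic segmentation and 3D object detection.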

Details

Method: Proposes a self-supervised bounding box regression pretext task that retains object pose and size information, combined with LiDAR beam pattern augmentation to encourage sensor-agnostic features. Result: With a single pretrained model, the method significantly improves 3D semantic segmentation with limited labels on popular autonomous driving datasets (Waymo, nuScenes, SemanticKITTI), and it outperforms other state-of-the-art self-supervised methods on both 3D semantic segmentation and 3D object detection. Conclusion: By retaining geometric information and learning sensor-agnostic features, PSA-SSL significantly improves 3D semantic segmentation and 3D object detection, especially when labels are limited. Abstract: Self-supervised learning (SSL) on 3D point clouds has the potential to learn feature representations that can transfer to diverse sensors and multiple downstream perception tasks. However, recent SSL approaches fail to define pretext tasks that retain geometric information such as object pose and scale, which can be detrimental to the performance of downstream localization and geometry-sensitive 3D scene understanding tasks, such as 3D semantic segmentation and 3D object detection. We propose PSA-SSL, a novel extension to point cloud SSL that learns object pose and size-aware (PSA) features. Our approach defines a self-supervised bounding box regression pretext task, which retains object pose and size information. Furthermore, we incorporate LiDAR beam pattern augmentation on input point clouds, which encourages learning sensor-agnostic features. Our experiments demonstrate that with a single pretrained model, our light-weight yet effective extensions achieve significant improvements on 3D semantic segmentation with limited labels across popular autonomous driving datasets (Waymo, nuScenes, SemanticKITTI). Moreover, our approach outperforms other state-of-the-art SSL methods on 3D semantic segmentation (using up to 10 times fewer labels), as well as on 3D object detection. Our code will be released on https://github.com/TRAILab/PSA-SSL.
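
One simple way to picture a beam pattern augmentation is to drop whole LiDAR beams so that a dense scan resembles one from a sensor with fewer beams. The sketch below assumes every point carries a beam (ring) index and keeps every `keep_every`-th beam; the summary does not describe the paper's actual augmentation, so this is only an illustrative stand-in with hypothetical names.

```python
from typing import List, Tuple

# (x, y, z, beam_index); the beam index is assumed to be provided by the sensor driver.
Point = Tuple[float, float, float, int]


def subsample_beams(points: List[Point], keep_every: int, offset: int = 0) -> List[Point]:
    """Keep only points whose beam index matches the chosen stride and offset."""
    if keep_every < 1:
        raise ValueError("keep_every must be at least 1")
    return [p for p in points if (p[3] - offset) % keep_every == 0]


# A toy four-beam scan reduced to a two-beam pattern.
scan: List[Point] = [
    (1.0, 0.0, -0.2, 0),
    (1.0, 0.1, -0.1, 1),
    (1.0, 0.2, 0.0, 2),
    (1.0, 0.3, 0.1, 3),
]
sparse_scan = subsample_beams(scan, keep_every=2)
```

Randomising `keep_every` and `offset` per training sample would expose the network to several beam densities, which is one plausible route to the sensor-agnostic features the abstract mentions.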

Unlocking the Potential of Unlabeled Data in Semi-Supervised Domain Generalization

Dongkwan Lee,Kyomin Hwang,Nojun Kwak

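Task: Address semi-supervised domain generalization (SSDG), where the train and test distributions differ and only a small amount of labeled data plus a larger amount of unlabeled data is available during training.

Motivation: Existing SSDG methods use only the unlabeled samples on which the model's predictions are highly confident (confident-unlabeled samples), which limits full use of the available unlabeled data. According to the authors, this is the first work to explore using the previously disregarded unconfident-unlabeled samples in the SSDG setting.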

Details

Method: Proposes UPCSC, which consists of two modules: 1) an Unlabeled Proxy-based Contrastive learning (UPC) module that treats unconfident-unlabeled samples as additional negative pairs; 2) a Surrogate Class learning (SC) module that generates positive pairs for unconfident-unlabeled samples using their confusing class set. Both modules are plug-and-play, require no domain labels, and can be easily integrated into existing approaches. Result: Experiments on four widely used SSDG benchmarks show that the approach consistently improves performance when attached to baselines and outperforms competing plug-and-play methods. Conclusion: The method enhances class-level discriminability and mitigates domain gaps; the code is available at https://github.com/dongkwani/UPCSC. Abstract: We address the problem of semi-supervised domain generalization (SSDG), where the distributions of train and test data differ, and only a small amount of labeled data along with a larger amount of unlabeled data are available during training. Existing SSDG methods that leverage only the unlabeled samples for which the model's predictions are highly confident (confident-unlabeled samples) limit the full utilization of the available unlabeled data. To the best of our knowledge, we are the first to explore a method for incorporating the unconfident-unlabeled samples that were previously disregarded in the SSDG setting. To this end, we propose UPCSC to utilize these unconfident-unlabeled samples in SSDG that consists of two modules: 1) Unlabeled Proxy-based Contrastive learning (UPC) module, treating unconfident-unlabeled samples as additional negative pairs and 2) Surrogate Class learning (SC) module, generating positive pairs for unconfident-unlabeled samples using their confusing class set. These modules are plug-and-play and do not require any domain labels, which can be easily integrated into existing approaches. Experiments on four widely used SSDG benchmarks demonstrate that our approach consistently improves performance when attached to baselines and outperforms competing plug-and-play methods. We also analyze the role of our method in SSDG, showing that it enhances class-level discriminability and mitigates domain gaps. The code is available at https://github.com/dongkwani/UPCSC.
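
The split between confident and unconfident unlabeled samples can be sketched with a simple threshold on the maximum predicted probability. The threshold value and the definition of the confusing class set as the top-k predicted classes are assumptions made for illustration; the summary does not state how the paper defines either.

```python
from typing import List, Sequence, Tuple


def split_by_confidence(
    probs: Sequence[Sequence[float]], threshold: float = 0.95
) -> Tuple[List[int], List[int]]:
    """Return (confident, unconfident) sample indices based on max class probability."""
    confident: List[int] = []
    unconfident: List[int] = []
    for i, p in enumerate(probs):
        (confident if max(p) >= threshold else unconfident).append(i)
    return confident, unconfident


def confusing_class_set(p: Sequence[float], k: int = 2) -> List[int]:
    """Assumed definition: the k classes the model finds most likely for this sample."""
    return sorted(range(len(p)), key=lambda c: p[c], reverse=True)[:k]


# Softmax outputs for two unlabeled samples over three classes.
predictions = [[0.97, 0.02, 0.01], [0.50, 0.40, 0.10]]
confident_idx, unconfident_idx = split_by_confidence(predictions)
candidates = confusing_class_set(predictions[unconfident_idx[0]])
```

Standard pseudo-labeling would discard the second sample; the approach described here instead keeps it and restricts its supervision to the classes it is confused between.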

Learning Shape-Independent Transformation via Spherical Representations for Category-Level Object Pose Estimation

Huan Ren,Wenfei Yang,Xiang Liu,Shifeng Zhang,Tianzhu Zhang

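Task: Propose SpherePose, a new method for category-level object pose estimation that learns shape-independent transformation via spherical representations.

Motivation: Existing correspondence-based methods suffer from semantic incoherence across objects of different shapes, so a shape-independent representation is needed.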

Details

Method: Proposes the SpherePose architecture, comprising SO(3)-invariant point-wise feature extraction, a spherical attention mechanism, and a hyperbolic correspondence loss function. Result: The method achieves superior performance on the CAMERA25, REAL275, and HouseCat6D benchmarks, verifying the effectiveness of spherical representations and the architectural innovations. Conclusion: Spherical representations and the SpherePose architecture offer clear advantages for category-level object pose estimation and effectively address shape dependence. Abstract: Category-level object pose estimation aims to determine the pose and size of novel objects in specific categories. Existing correspondence-based approaches typically adopt point-based representations to establish the correspondences between primitive observed points and normalized object coordinates. However, due to the inherent shape-dependence of canonical coordinates, these methods suffer from semantic incoherence across diverse object shapes. To resolve this issue, we innovatively leverage the sphere as a shared proxy shape of objects to learn shape-independent transformation via spherical representations. Based on this insight, we introduce a novel architecture called SpherePose, which yields precise correspondence prediction through three core designs. Firstly, we endow the point-wise feature extraction with SO(3)-invariance, which facilitates robust mapping between camera coordinate space and object coordinate space regardless of rotation transformation. Secondly, the spherical attention mechanism is designed to propagate and integrate features among spherical anchors from a comprehensive perspective, thus mitigating the interference of noise and incomplete point cloud. Lastly, a hyperbolic correspondence loss function is designed to distinguish subtle distinctions, which can promote the precision of correspondence prediction. Experimental results on CAMERA25, REAL275 and HouseCat6D benchmarks demonstrate the superior performance of our method, verifying the effectiveness of spherical representations and architectural innovations.

SCORE: Soft Label Compression-Centric Dataset Condensation via Coding Rate Optimization

Bowen Yuan,Yuxia Fu,Zijian Wang,Yadan Luo,Zi Huang

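Task: Propose SCORE, a dataset condensation framework centered on soft label compression, to address the trade-off between performance and storage cost in dataset condensation.

Motivation: Existing dataset condensation methods perform well on ImageNet-scale datasets but incur substantially higher storage cost. This paper aims to ease the performance-storage dilemma through three key properties: informativeness, discriminativeness, and compressibility.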

Details

Method: Proposes the SCORE framework, which formulates dataset condensation as a min-max optimization problem that balances the three key properties from an information-theoretic perspective. The objective function is proven to be submodular, and its optimization naturally enforces low-rank structure in the soft label set. Result: Experiments on large-scale datasets, including ImageNet-1K and Tiny-ImageNet, show that SCORE outperforms existing methods in most cases. Even with 30× compression of soft labels, performance drops by only 5.5% and 2.7% on ImageNet-1K with IPC 10 and 50, respectively. Conclusion: SCORE significantly reduces storage cost while maintaining high performance, offering an effective answer to the performance-storage trade-off in dataset condensation. Abstract: Dataset Condensation (DC) aims to obtain a condensed dataset that allows models trained on the condensed dataset to achieve performance comparable to those trained on the full dataset. Recent DC approaches increasingly focus on encoding knowledge into realistic images with soft labeling, for their scalability to ImageNet-scale datasets and strong capability of cross-domain generalization. However, this strong performance comes at a substantial storage cost which could significantly exceed the storage cost of the original dataset. We argue that the three key properties to alleviate this performance-storage dilemma are informativeness, discriminativeness, and compressibility of the condensed data. Towards this end, this paper proposes a Soft label compression-centric dataset condensation framework using COding RatE (SCORE). SCORE formulates dataset condensation as a min-max optimization problem, which aims to balance the three key properties from an information-theoretic perspective. In particular, we theoretically demonstrate that our coding rate-inspired objective function is submodular, and its optimization naturally enforces low-rank structure in the soft label set corresponding to each condensed data. Extensive experiments on large-scale datasets, including ImageNet-1K and Tiny-ImageNet, demonstrate that SCORE outperforms existing methods in most cases. Even with 30× compression of soft labels, performance decreases by only 5.5% and 2.7% for ImageNet-1K with IPC 10 and 50, respectively. Code will be released upon paper acceptance.

ChatBEV: A Visual Language Model that Understands BEV Maps

Qingyao Xu,Siheng Chen,Guang Chen,Yanfeng Wang,Ya Zhang

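Task: Propose ChatBEV-QA, a new BEV VQA benchmark for traffic scene understanding.

Motivation: Existing methods are limited in task design and data volume, which hinders comprehensive scene understanding.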

Details

Method: Introduces ChatBEV-QA, a BEV VQA benchmark with over 137k questions, and develops a new data collection pipeline that generates scalable and informative VQA data. A specialized vision-language model, ChatBEV, is then fine-tuned to interpret diverse question prompts and extract relevant contextual information from BEV maps. Result: Proposes a language-driven traffic scene generation pipeline that significantly improves the generation of realistic and consistent traffic scenarios. Conclusion: The ChatBEV-QA benchmark and the fine-tuned ChatBEV model support more comprehensive traffic scene understanding and advance intelligent transportation systems and autonomous driving. Abstract: Traffic scene understanding is essential for intelligent transportation systems and autonomous driving, ensuring safe and efficient vehicle operation. While recent advancements in VLMs have shown promise for holistic scene understanding, the application of VLMs to traffic scenarios, particularly using BEV maps, remains underexplored. Existing methods often suffer from limited task design and narrow data amount, hindering comprehensive scene understanding. To address these challenges, we introduce ChatBEV-QA, a novel BEV VQA benchmark containing over 137k questions, designed to encompass a wide range of scene understanding tasks, including global scene understanding, vehicle-lane interactions, and vehicle-vehicle interactions. This benchmark is constructed using a novel data collection pipeline that generates scalable and informative VQA data for BEV maps. We further fine-tune a specialized vision-language model ChatBEV, enabling it to interpret diverse question prompts and extract relevant context-aware information from BEV maps. Additionally, we propose a language-driven traffic scene generation pipeline, where ChatBEV facilitates map understanding and text-aligned navigation guidance, significantly enhancing the generation of realistic and consistent traffic scenarios. The dataset, code and the fine-tuned model will be released.

Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models

Yuxiang Lai,Jike Zhong,Ming Li,Shitian Zhao,Xiaofeng Yang

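Task: Explore reinforcement learning (RL) for enhancing the generalizability and trustworthiness of vision-language models (VLMs) in medical reasoning.

Motivation: Medical reasoning tasks demand robust image analysis and well-justified answers, which is challenging because of the complexity of medical images. Transparency and trustworthiness are essential for clinical adoption and regulatory compliance.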

Details

Method: Introduces the Med-R1 framework which, following the DeepSeek strategy, uses Group Relative Policy Optimization (GRPO) to guide reasoning paths via reward signals. Result: Evaluated across eight medical imaging modalities, Med-R1 improves accuracy by 29.94% over its base model Qwen2-VL-2B and surpasses Qwen2-VL-72B in question-type generalization. Conclusion: RL improves medical reasoning and enables parameter-efficient models to outperform significantly larger ones. Med-R1 is a promising step toward generalizable, trustworthy, and clinically viable medical VLMs. Abstract: Vision-language models (VLMs) have advanced reasoning in natural scenes, but their role in medical imaging remains underexplored. Medical reasoning tasks demand robust image analysis and well-justified answers, posing challenges due to the complexity of medical images. Transparency and trustworthiness are essential for clinical adoption and regulatory compliance. We introduce Med-R1, a framework exploring reinforcement learning (RL) to enhance VLMs' generalizability and trustworthiness in medical reasoning. Leveraging the DeepSeek strategy, we employ Group Relative Policy Optimization (GRPO) to guide reasoning paths via reward signals. Unlike supervised fine-tuning (SFT), which often overfits and lacks generalization, RL fosters robust and diverse reasoning. Med-R1 is evaluated across eight medical imaging modalities: CT, MRI, Ultrasound, Dermoscopy, Fundus Photography, Optical Coherence Tomography (OCT), Microscopy, and X-ray Imaging. Compared to its base model, Qwen2-VL-2B, Med-R1 achieves a 29.94% accuracy improvement and outperforms Qwen2-VL-72B, which has 36 times more parameters. Testing across five question types (modality recognition, anatomy identification, disease diagnosis, lesion grading, and biological attribute analysis), Med-R1 demonstrates superior generalization, exceeding Qwen2-VL-2B by 32.06% and surpassing Qwen2-VL-72B in question-type generalization. These findings show that RL improves medical reasoning and enables parameter-efficient models to outperform significantly larger ones. With interpretable reasoning outputs, Med-R1 represents a promising step toward generalizable, trustworthy, and clinically viable medical VLMs.
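
The "group relative" part of GRPO can be shown in a few lines: several answers are sampled for the same question, each receives a scalar reward, and every reward is normalised against the mean and standard deviation of its own group, so no separate value network is needed. The sketch covers only this advantage computation; the reward design used for Med-R1 is not given in the summary, the use of the population standard deviation is an assumption, and the clipped policy-ratio and KL terms of the full objective are omitted.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalise each reward against the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one question: two judged correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Answers that beat their group's average get a positive advantage and are reinforced, while a group in which every answer earns the same reward contributes no learning signal.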

Hang Zhao,Hongru Li,Dongfang Xu,Shenghui Song,Khaled B. Letaief

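Task: Propose a multi-modal semantic communication system that uses multi-modal self-supervised learning to enhance task-agnostic feature extraction.

Motivation: Current research focuses mainly on reducing semantic communication overhead but often overlooks the training phase, which can incur high communication costs in dynamic wireless environments.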

Details

Method: Uses self-supervised learning during pre-training to extract task-agnostic semantic features, followed by supervised fine-tuning for downstream tasks. Result: Experiments on the NYU Depth V2 dataset show that the proposed method significantly reduces training-related communication overhead while matching or exceeding the performance of existing supervised learning approaches. Conclusion: Multi-modal self-supervised learning is advantageous for semantic communication and paves the way for more efficient and scalable edge inference systems. Abstract: Semantic communication is emerging as a promising paradigm that focuses on the extraction and transmission of semantic meanings using deep learning techniques. While current research primarily addresses the reduction of semantic communication overhead, it often overlooks the training phase, which can incur significant communication costs in dynamic wireless environments. To address this challenge, we propose a multi-modal semantic communication system that leverages multi-modal self-supervised learning to enhance task-agnostic feature extraction. The proposed approach employs self-supervised learning during the pre-training phase to extract task-agnostic semantic features, followed by supervised fine-tuning for downstream tasks. This dual-phase strategy effectively captures both modality-invariant and modality-specific features while minimizing training-related communication overhead. Experimental results on the NYU Depth V2 dataset demonstrate that the proposed method significantly reduces training-related communication overhead while maintaining or exceeding the performance of existing supervised learning approaches. The findings underscore the advantages of multi-modal self-supervised learning in semantic communication, paving the way for more efficient and scalable edge inference systems.

Make the Most of Everything: Further Considerations on Disrupting Diffusion-based Customization

Long Tang,Dengpan Ye,Sirun Chen,Xiuwen Shi,Yunna Lv,Ziyi Liu

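Task: Propose a two-stage adversarial attack on text-to-image diffusion models to improve anti-customization performance.

Motivation: Existing anti-customization methods focus on prompt-level or image-level adversarial attacks and overlook both the correlation between the two levels and the relationship between internal modules and inputs, which limits anti-customization performance in practical threat scenarios.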

Details

Method: Proposes Dual Anti-Diffusion (DADiff), which has two stages: 1) generating prompt-level adversarial vectors to guide the subsequent image-level attack; 2) conducting an end-to-end attack on the UNet model while also disrupting its self-attention and cross-attention modules. A local random timestep gradient ensemble strategy is additionally introduced to update the adversarial perturbations. Result: Experiments on various mainstream facial datasets show that DADiff improves on existing methods by 10%-30% in cross-prompt, keyword mismatch, cross-model, and cross-mechanism anti-customization. Conclusion: By integrating prompt-level and image-level adversarial attacks, DADiff significantly improves anti-customization performance and offers an effective solution for practical threat scenarios. Abstract: The fine-tuning technique for text-to-image diffusion models facilitates image customization but risks privacy breaches and opinion manipulation. Current research focuses on prompt- or image-level adversarial attacks for anti-customization, yet it overlooks the correlation between these two levels and the relationship between internal modules and inputs. This hinders anti-customization performance in practical threat scenarios. We propose Dual Anti-Diffusion (DADiff), a two-stage adversarial attack targeting diffusion customization, which, for the first time, integrates the adversarial prompt-level attack into the generation process of image-level adversarial examples. In stage 1, we generate prompt-level adversarial vectors to guide the subsequent image-level attack. In stage 2, besides conducting the end-to-end attack on the UNet model, we disrupt its self- and cross-attention modules, aiming to break the correlations between image pixels and align the cross-attention results computed using instance prompts and adversarial prompt vectors within the images. Furthermore, we introduce a local random timestep gradient ensemble strategy, which updates adversarial perturbations by integrating random gradients from multiple segmented timesets. Experimental results on various mainstream facial datasets demonstrate 10%-30% improvements in cross-prompt, keyword mismatch, cross-model, and cross-mechanism anti-customization with DADiff compared to existing methods.

Is Discretization Fusion All You Need for Collaborative Perception?

Kang Yang,Tianci Bu,Lantao Li,Chunxu Li,Yongcai Wang,Deying Li

Task: Propose ACCO, a new anchor-centric paradigm for collaborative object detection, to address the limited flexibility of existing collaborative perception methods in feature extraction and transmission.

Motivation: Mainstream collaborative perception methods fuse discretized feature maps, which is inflexible and makes it hard to focus on informative features during fusion.

Details

Method: ACCO has three main components: an anchor featuring block (AFB), an anchor confidence generator (ACG), and a local-global fusion module (LAAF and SACA). Result: Experiments on the OPV2V and Dair-V2X datasets show that ACCO reduces communication volume while improving perception range and detection performance. Conclusion: Anchor-centric communication and fusion resolve the flexibility limitations of existing methods and yield gains on several fronts. Abstract: Collaborative perception in multi-agent systems enhances overall perceptual capabilities by facilitating the exchange of complementary information among agents. Current mainstream collaborative perception methods rely on discretized feature maps to conduct fusion, which, however, lacks flexibility in extracting and transmitting the informative features and can hardly focus on the informative features during fusion. To address these problems, this paper proposes a novel Anchor-Centric paradigm for Collaborative Object detection (ACCO). It avoids grid precision issues and allows more flexible and efficient anchor-centric communication and fusion. ACCO is composed of three main components: (1) An anchor featuring block (AFB) that generates anchor proposals and projects prepared anchor queries to image features. (2) An anchor confidence generator (ACG) that minimizes communication by selecting only the features in the confident anchors to transmit. (3) A local-global fusion module, in which local fusion is anchor alignment-based fusion (LAAF) and global fusion is conducted by spatial-aware cross-attention (SACA). LAAF and SACA run in multiple layers, so agents conduct anchor-centric fusion iteratively to adjust the anchor proposals. Comprehensive experiments are conducted to evaluate ACCO on the OPV2V and Dair-V2X datasets, which demonstrate ACCO's superiority in reducing the communication volume and in improving the perception range and detection performance. Code can be found at: https://github.com/sidiangongyuan/ACCO.

Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation

Sayak Nag,Udita Ghosh,Sarosij Bose,Calvin-Khang Ta,Jiachen Li,Amit K Roy Chowdhury

Task: Represent visual scenes by identifying objects and their pairwise relationships, providing a structured understanding of image content.

Motivation: Inherent challenges such as long-tailed class distributions and prediction variability make uncertainty quantification necessary for scene graph generation (SGG) to be practically viable.

Details

Method: Introduces a new Conformal Prediction (CP) based framework, applicable to any existing SGG method, that quantifies predictive uncertainty by constructing well-calibrated prediction sets. Result: The approach produces diverse possible scene graphs from an image, assesses the reliability of SGG methods, and improves overall SGG performance. Conclusion: The CP framework, combined with an MLLM-based post-processing strategy, yields scene graph prediction sets with statistically rigorous coverage guarantees and selects the most visually and semantically plausible scene graphs. Abstract: Scene Graph Generation (SGG) aims to represent visual scenes by identifying objects and their pairwise relationships, providing a structured understanding of image content. However, inherent challenges like long-tailed class distributions and prediction variability necessitate uncertainty quantification in SGG for its practical viability. In this paper, we introduce a novel Conformal Prediction (CP) based framework, adaptive to any existing SGG method, for quantifying their predictive uncertainty by constructing well-calibrated prediction sets over their generated scene graphs. These scene graph prediction sets are designed to achieve statistically rigorous coverage guarantees. Additionally, to ensure these prediction sets contain the most practically interpretable scene graphs, we design an effective MLLM-based post-processing strategy for selecting the most visually and semantically plausible scene graphs within these prediction sets. We show that our proposed approach can produce diverse possible scene graphs from an image, assess the reliability of SGG methods, and improve overall SGG performance.
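
To make the idea of a "well-calibrated prediction set" concrete, here is a minimal sketch of standard split conformal prediction for a single predicate classification. It is not the paper's framework: the calibration probabilities, the label names, and the nonconformity score (one minus the softmax probability) are all illustrative assumptions.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    # Rank of the order statistic used by split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: keep every label
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, qhat):
    """Keep every label whose nonconformity score (1 - p) is within the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= qhat]

# Hypothetical calibration data: probability the model gave to the TRUE predicate.
cal_true_probs = [0.9, 0.8, 0.75, 0.6, 0.95, 0.4, 0.85, 0.7, 0.65]
cal_scores = [1.0 - p for p in cal_true_probs]
qhat = conformal_quantile(cal_scores, alpha=0.2)

# Hypothetical softmax output for one subject-object pair at test time.
test_probs = {"riding": 0.7, "on": 0.62, "near": 0.1}
print(prediction_set(test_probs, qhat))
```

Under the usual exchangeability assumption, a set built this way contains the true label with probability at least 1 - alpha; ambiguous inputs simply get larger sets.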

Light4GS: Lightweight Compact 4D Gaussian Splatting Generation via Context Model

Mufan Liu,Qi Yang,He Huang,Wenjie Huang,Zhenlong Yuan,Zhu Li,Yiling Xu

Task: Propose Light4GS, a lightweight 4DGS framework that provides a storage-efficient representation for dynamic 3D Gaussian Splatting (3DGS).

Motivation: Address the storage cost caused by high-dimensional embeddings and the large number of primitives in dynamic 3DGS.

Details

Method: Combines significance pruning with a deep context model: a spatio-temporal significance pruning strategy followed by entropy-constrained spherical harmonics compression, plus a deep context model for multiscale latent embedding compression. Result: Achieves over 120x compression and increases rendering FPS by up to 20% relative to the baseline 4DGS, and outperforms frame-wise state-of-the-art 3DGS compression methods. Conclusion: Light4GS markedly improves storage efficiency and rendering speed while preserving rendering quality. Abstract: 3D Gaussian Splatting (3DGS) has emerged as an efficient and high-fidelity paradigm for novel view synthesis. To adapt 3DGS for dynamic content, deformable 3DGS incorporates temporally deformable primitives with learnable latent embeddings to capture complex motions. Despite its impressive performance, the high-dimensional embeddings and vast number of primitives lead to substantial storage requirements. In this paper, we introduce a Lightweight 4DGS framework, called Light4GS, that employs significance pruning with a deep context model to provide a lightweight storage-efficient dynamic 3DGS representation. The proposed Light4GS is based on 4DGS, a typical representation of deformable 3DGS. Specifically, our framework is built upon two core components: (1) a spatio-temporal significance pruning strategy that eliminates over 64% of the deformable primitives, followed by an entropy-constrained spherical harmonics compression applied to the remainder; and (2) a deep context model that integrates intra- and inter-prediction with hyperprior into a coarse-to-fine context structure to enable efficient multiscale latent embedding compression. Our approach achieves over 120x compression and increases rendering FPS by up to 20% compared to the baseline 4DGS, and is also superior to frame-wise state-of-the-art 3DGS compression methods, revealing the effectiveness of our Light4GS in terms of both intra- and inter-prediction methods without sacrificing rendering quality.

FrustumFusionNets: A Three-Dimensional Object Detection Network Based on Tractor Road Scene

Lili Yang,Mengshuai Chang,Xiao Guo,Yuxin Feng,Yiwen Mei,Caicong Wu

Task: Propose FrustumFusionNets (FFNets), a new network for three-dimensional object detection in complex tractor road scenes.

Motivation: Address the underutilization of image information by existing frustum-based methods in road three-dimensional object detection, and fill the gap in research on agricultural scenes.

Details

Method: Image-based two-dimensional detections narrow the search region in the three-dimensional point cloud; a Gaussian mask enhances the point cloud information; a point cloud pipeline and an image pipeline extract features from the frustum point cloud and the cropped image, respectively; the features from both modalities are then concatenated and fused for three-dimensional object detection. Result: On the tractor road test set, FrustumFusionNetv2 reaches 82.28% and 95.68% accuracy in three-dimensional detection of the two main road objects, cars and people, which is 1.83% and 2.33% better than the original model. Conclusion: FrustumFusionNetv2 offers a hybrid fusion-based, multi-object, high-precision, real-time three-dimensional detection technique for unmanned agricultural machines in tractor road scenes, and on the KITTI validation set it shows a significant advantage in detecting road pedestrians. Abstract: To address the issues of the existing frustum-based methods' underutilization of image information in road three-dimensional object detection as well as the lack of research on agricultural scenes, we constructed an object detection dataset using an 80-line Light Detection And Ranging (LiDAR) and a camera in a complex tractor road scene and proposed a new network called FrustumFusionNets (FFNets). Initially, we utilize the results of image-based two-dimensional object detection to narrow down the search region in the three-dimensional space of the point cloud. Next, we introduce a Gaussian mask to enhance the point cloud information. Then, we extract the features from the frustum point cloud and the crop image using the point cloud feature extraction pipeline and the image feature extraction pipeline, respectively. Finally, we concatenate and fuse the data features from both modalities to achieve three-dimensional object detection. Experiments demonstrate that on the constructed test set of tractor road data, the FrustumFusionNetv2 achieves 82.28% and 95.68% accuracy in the three-dimensional object detection of the two main road objects, cars and people, respectively. This performance is 1.83% and 2.33% better than the original model. It offers a hybrid fusion-based multi-object, high-precision, real-time three-dimensional object detection technique for unmanned agricultural machines in tractor road scenarios. On the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) Benchmark Suite validation set, the FrustumFusionNetv2 also demonstrates significant superiority in detecting road pedestrian objects compared with other frustum-based three-dimensional object detection methods.
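
The abstract does not say how the Gaussian mask is defined. One plausible reading, sketched below purely as an assumption, is that each LiDAR point projected into the image receives a weight that decays with its distance from the center of the 2D detection box, so points near the box center count more than background points near its border. The function name and the `sigma_scale` parameter are hypothetical.

```python
import math

def gaussian_mask_weight(u, v, box, sigma_scale=0.5):
    """Weight in (0, 1] for a point projected to pixel (u, v) inside a 2D box.

    box is (x1, y1, x2, y2); sigma_scale (an assumed hyperparameter) sets the
    standard deviation as a fraction of the box half-width and half-height.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx = sigma_scale * (x2 - x1) / 2.0
    sy = sigma_scale * (y2 - y1) / 2.0
    return math.exp(-0.5 * (((u - cx) / sx) ** 2 + ((v - cy) / sy) ** 2))

box = (100, 50, 200, 150)
print(gaussian_mask_weight(150, 100, box))  # point at the box center
print(gaussian_mask_weight(200, 100, box))  # point on the right border
```

Such a weight could be appended to each point as an extra input channel before the point cloud feature extractor; whether the paper does this is not stated in the abstract.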

SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model

Xinqing Li,Ruiqi Song,Qingyu Xie,Ye Wu,Nanxin Zeng,Yunfeng Ai

Task: Propose a simulator-conditioned scene generation engine based on a world model for large-scale generation of challenging scene data.

Motivation: With the rapid progress of autonomous driving, insufficient data has become a major obstacle to improving perception model accuracy, and researchers are exploring controllable data generation with world models to diversify datasets.

Details

Method: A simulation system consistent with real-world scenes is built, and the simulation data and labels it produces serve as conditions for data generation in the world model; the resulting pipeline combines the scene simulation capabilities of the simulation engine with the data generation capabilities of the world model. Result: Quantitative results show that the generated images significantly improve the performance of downstream perception models. Conclusion: The paper demonstrates the generative performance of the world model in urban scenes; all data and code will be made public. Abstract: With the rapid advancement of autonomous driving technology, a lack of data has become a major obstacle to enhancing perception model accuracy. Researchers are now exploring controllable data generation using world models to diversify datasets. However, previous work has been limited to studying image generation quality on specific public datasets. There is still relatively little research on how to build data generation engines for real-world application scenes to achieve large-scale data generation for challenging scenes. In this paper, a simulator-conditioned scene generation engine based on a world model is proposed. By constructing a simulation system consistent with real-world scenes, simulation data and labels, which serve as the conditions for data generation in the world model, can be collected for any scene. The result is a novel data generation pipeline that combines the powerful scene simulation capabilities of the simulation engine with the robust data generation capabilities of the world model. In addition, a benchmark with proportionally constructed virtual and real data is provided for exploring the capabilities of world models in real-world scenes. Quantitative results show that these generated images significantly improve the performance of downstream perception models. Finally, we explored the generative performance of the world model in urban autonomous driving scenarios. All the data and code will be available at https://github.com/Li-Zn-H/SimWorld.

Improving LLM Video Understanding with 16 Frames Per Second

Yixuan Li,Changli Tang,Jimin Zhuang,Yudong Yang,Guangzhi Sun,Wei Li,Zejun Ma,Chao Zhang

Task: Design and implement F-16, a multimodal large language model for high-frame-rate video understanding.

Motivation: Existing methods rely mainly on static features extracted from images sampled at a fixed low frame rate (≤ 2 FPS), which loses critical visual information.

Details

Method: By raising the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 captures dynamic visual features while preserving key semantic information. Result: Higher frame rates considerably improve video understanding across multiple benchmarks, and F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs. Conclusion: F-16 performs well on general and fine-grained video understanding benchmarks as well as on complex spatiotemporal tasks, offering a way to improve video LLMs beyond scaling model size or training data. Abstract: Human vision is dynamic and continuous. However, in video understanding with multimodal large language models (LLMs), existing methods primarily rely on static features extracted from images sampled at a fixed low frame rate (FPS ≤ 2), leading to critical visual information loss. In this paper, we introduce F-16, the first multimodal LLM designed for high-frame-rate video understanding. By increasing the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 efficiently captures dynamic visual features while preserving key semantic information. Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. Furthermore, F-16 excels in complex spatiotemporal tasks, including high-speed sports analysis (e.g., basketball, football, gymnastics, and diving), outperforming SOTA proprietary visual models like GPT-4o and Gemini-1.5-pro. Additionally, we introduce a novel decoding method for F-16 that enables highly efficient low-frame-rate inference without requiring model retraining. Upon acceptance, we will release the source code, model checkpoints, and data.
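
The abstract does not describe the compression module itself. As a rough illustration of what "compressing visual tokens within each 1-second clip" means for sequence length, the sketch below groups per-frame feature vectors into non-overlapping 16-frame clips and mean-pools each clip. Mean pooling is a stand-in chosen here for simplicity, not F-16's actual aligner, and `pool_clips` is a hypothetical name.

```python
def pool_clips(frame_feats, fps=16):
    """Mean-pool per-frame feature vectors inside each non-overlapping 1-second clip."""
    clips = []
    for start in range(0, len(frame_feats), fps):
        clip = frame_feats[start:start + fps]  # the last clip may be shorter
        dim = len(clip[0])
        clips.append([sum(f[d] for f in clip) / len(clip) for d in range(dim)])
    return clips

# Two seconds of video at 16 FPS, with a toy 1-dimensional feature per frame.
frames = [[float(i)] for i in range(32)]
pooled = pool_clips(frames)
print(len(frames), "frames ->", len(pooled), "clip vectors")
```

With this grouping, an 8x increase in frame rate over 2 FPS need not produce an 8x increase in the number of tokens passed to the language model.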

DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation

Mu Chen,Liulei Li,Wenguan Wang,Yi Yang

Task: Propose an online video scene graph generation (VSGG) solution that frames the task as an iterative scene graph update problem.

Motivation: Existing VSGG solutions usually adopt an offline pipeline, cannot handle real-time video streams, consume large amounts of GPU memory, and fall short in temporal reasoning.

Details

Method: Uses Latent Diffusion Models (LDMs) to unify the decoding of object classification, bounding box regression, and graph generation; step-wise denoising produces a clean embedding that feeds the task-specific heads for object classification, scene graph generation, and related tasks. Result: Extensive experiments on three setups of Action Genome demonstrate the superiority of DIFFVSGG. Conclusion: Through online processing and continuous temporal reasoning, DIFFVSGG substantially improves VSGG performance. Abstract: Leading solutions for Video Scene Graph Generation (VSGG) typically adopt an offline pipeline. Though demonstrating promising performance, they remain unable to handle real-time video streams and consume large GPU memory. Moreover, these approaches fall short in temporal reasoning, merely aggregating frame-level predictions over a temporal context. In response, we introduce DIFFVSGG, an online VSGG solution that frames this task as an iterative scene graph update problem. Drawing inspiration from Latent Diffusion Models (LDMs), which generate images via denoising a latent feature embedding, we unify the decoding of three tasks (object classification, bounding box regression, and graph generation) using one shared feature embedding. Then, given an embedding containing unified features of object pairs, we conduct step-wise denoising on it within LDMs, so as to deliver a clean embedding which clearly indicates the relationships between objects. This embedding then serves as the input to task-specific heads for object classification, scene graph generation, etc. DIFFVSGG further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results of past frames as the conditional inputs of LDMs, to guide the reverse diffusion process for current frames. Extensive experiments on three setups of Action Genome demonstrate the superiority of DIFFVSGG.

Survey of Adversarial Robustness in Multimodal Large Language Models

Chengze Jiang,Zhuangzhuang Wang,Minjing Dong,Jie Gui

Task: Survey the adversarial robustness of multimodal large language models (MLLMs), covering attacks on different modalities, datasets, evaluation metrics, and future research directions.

Motivation: MLLMs perform exceptionally well, but their deployment in real-world applications raises concerns about adversarial vulnerabilities that could compromise their safety and reliability.

Details

Method: The paper first gives an overview of MLLMs and a taxonomy of adversarial attacks for each modality, then reviews the key datasets and evaluation metrics used to assess MLLM robustness, and finally analyzes in depth the attacks targeting MLLMs across modalities. Result: The paper systematically reviews the adversarial robustness of MLLMs, identifies critical challenges, and suggests future research directions. Conclusion: MLLMs face unique challenges under adversarial attack, and further research is needed to improve their safety and reliability. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance in artificial intelligence by facilitating integrated understanding across diverse modalities, including text, images, video, audio, and speech. However, their deployment in real-world applications raises significant concerns about adversarial vulnerabilities that could compromise their safety and reliability. Unlike unimodal models, MLLMs face unique challenges due to the interdependencies among modalities, making them susceptible to modality-specific threats and cross-modal adversarial manipulations. This paper reviews the adversarial robustness of MLLMs, covering different modalities. We begin with an overview of MLLMs and a taxonomy of adversarial attacks tailored to each modality. Next, we review key datasets and evaluation metrics used to assess the robustness of MLLMs. After that, we provide an in-depth review of attacks targeting MLLMs across different modalities. Our survey also identifies critical challenges and suggests promising future research directions.

Siqi Zhang,Yanyuan Qiao,Qunbo Wang,Longteng Guo,Zhihua Wei,Jing Liu

Task: Develop a Vision-and-Language Navigation (VLN) agent that can seamlessly transfer its navigation capabilities across different tasks.

Motivation: Despite remarkable recent progress, most methods require dataset-specific training and cannot generalize across datasets and instruction types.

Details

Method: Proposes FlexVLN, a hierarchical approach that combines the fundamental navigation ability of a supervised-learning-based Instruction Follower with the strong generalization ability of an LLM Planner. Result: On the out-of-domain datasets REVERIE, SOON, and CVDN-target, FlexVLN's generalization performance surpasses all previous methods by a large margin. Conclusion: By combining the strengths of the Instruction Follower and the LLM Planner, FlexVLN improves cross-dataset generalization; its verification mechanism and multi-model integration mechanism reduce hallucinations by the LLM Planner and improve the execution accuracy of the Instruction Follower. Abstract: The aspiration of the Vision-and-Language Navigation (VLN) task has long been to develop an embodied agent with robust adaptability, capable of seamlessly transferring its navigation capabilities across various tasks. Despite remarkable advancements in recent years, most methods necessitate dataset-specific training, thereby lacking the capability to generalize across diverse datasets encompassing distinct types of instructions. Large language models (LLMs) have demonstrated exceptional reasoning and generalization abilities, exhibiting immense potential in robot action planning. In this paper, we propose FlexVLN, an innovative hierarchical approach to VLN that integrates the fundamental navigation ability of a supervised-learning-based Instruction Follower with the robust generalization ability of the LLM Planner, enabling effective generalization across diverse VLN datasets. Moreover, a verification mechanism and a multi-model integration mechanism are proposed to mitigate potential hallucinations by the LLM Planner and enhance execution accuracy of the Instruction Follower. We take REVERIE, SOON, and CVDN-target as out-of-domain datasets for assessing generalization ability. The generalization performance of FlexVLN surpasses that of all the previous methods to a large extent.

SoccerSynth Field: enhancing field detection with synthetic data from virtual soccer simulator

HaoBin Qin,Jiale Fang,Keisuke Fujii

Task: Study the effectiveness of a synthetic dataset (SoccerSynth-Field) for soccer field detection.

Motivation: Collecting large-scale, diverse real-world datasets for training detection models is costly and time-consuming; synthetic datasets, which offer controlled variability in lighting, textures, and camera angles, are a promising alternative.

Details

Method: A synthetic soccer field dataset was created to pretrain models, and these models were compared with models trained on real-world datasets. Result: Models pretrained on the synthetic dataset detect soccer fields more accurately. Conclusion: Synthetic data is effective at improving model robustness and accuracy, offering a cost-effective and scalable solution for sports field detection. Abstract: Field detection in team sports is an essential task in sports video analysis. However, collecting large-scale and diverse real-world datasets for training detection models is often costly and time-consuming. Synthetic datasets, which allow controlled variability in lighting, textures, and camera angles, are a promising alternative for addressing these problems. This study addresses the challenges of high costs and difficulties in collecting real-world datasets by investigating the effectiveness of pretraining models using synthetic datasets. In this paper, we evaluate the effectiveness of using a synthetic dataset (SoccerSynth-Field) for soccer field detection. A synthetic soccer field dataset was created to pretrain models, and the performance of these models was compared with models trained on real-world datasets. The results demonstrate that models pretrained on the synthetic dataset exhibit superior performance in detecting soccer fields. This highlights the effectiveness of synthetic data in enhancing model robustness and accuracy, offering a cost-effective and scalable solution for advancing detection tasks in sports field detection.

A-SCoRe: Attention-based Scene Coordinate Regression for wide-ranging scenarios

Huy-Hoang Bui,Bach-Thuan Bui,Quang-Vinh Tran,Yasuyuki Fujii,Joo-Ho Lee

Task: Propose A-SCoRe, a new attention-based scene coordinate regression (SCR) architecture for visual localization.

Motivation: Feature-matching-based visual localization is accurate but demanding in storage and compute. SCR reduces storage by mapping 2D pixels to 3D scene coordinates, but existing SCR methods extract 2D descriptors with convolutional neural networks (CNNs) and ignore the spatial relationships between pixels.

Details

Method: A-SCoRe applies attention at the descriptor-map level to produce meaningful, high-semantic 2D descriptors. The model works with multiple data modalities, whether dense or sparse, from depth maps and SLAM to Structure-from-Motion (SfM). Result: A-SCoRe achieves performance comparable to state-of-the-art methods on multiple benchmarks while being more lightweight and flexible. Conclusion: A-SCoRe performs well in visual localization and offers the flexibility and adaptability needed across varied environments, for example on mobile robots. Abstract: Visual localization is considered to be one of the crucial parts of many robotic and vision systems. While state-of-the-art methods that rely on feature matching have proven to be accurate for visual localization, their storage and compute requirements are a burden. Scene coordinate regression (SCR) is an alternative approach that removes the storage barrier by learning to map 2D pixels to 3D scene coordinates. Most popular SCR methods use a Convolutional Neural Network (CNN) to extract 2D descriptors, which, we argue, misses the spatial relationships between pixels. Inspired by the success of the vision transformer architecture, we present a new SCR architecture, called A-SCoRe, an attention-based model which leverages attention at the descriptor-map level to produce meaningful and high-semantic 2D descriptors. Since the operation is performed on the descriptor map, our model can work with multiple data modalities, whether dense or sparse, from depth maps and SLAM to Structure-from-Motion (SfM). This versatility allows A-SCoRe to operate in different kinds of environments and conditions and to achieve the level of flexibility that is important for mobile robots. Results show that our method achieves performance comparable to state-of-the-art methods on multiple benchmarks while being lightweight and much more flexible. Code and pre-trained models are public in our repository: https://github.com/ais-lab/A-SCoRe.

SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

Jiankang Wang,Zhihan Zhang,Zhihang Liu,Yang Li,Jiannan Ge,Hongtao Xie,Yongdong Zhang

Task: Propose SpaceVLLM, a multimodal large language model with spatio-temporal video grounding capability.

Motivation: Address two major challenges for MLLMs in spatio-temporal video grounding: extracting accurate spatio-temporal information from video frames, and precisely mapping visual tokens to spatial coordinates.

Details

Method: Uses a set of interleaved Spatio-Temporal Aware Queries to capture temporal perception and dynamic spatial information, and a Query-Guided Space Decoder to establish the correspondence between queries and spatial coordinates. Result: SpaceVLLM achieves state-of-the-art performance on 11 benchmarks covering temporal, spatial, spatio-temporal, and video understanding tasks. Conclusion: SpaceVLLM is effective for spatio-temporal video grounding, and the Uni-STG dataset was built to support further research. Abstract: Multimodal large language models (MLLMs) have made remarkable progress in either temporal or spatial localization. However, they struggle to perform spatio-temporal video grounding. This limitation stems from two major challenges. Firstly, it is difficult to extract accurate spatio-temporal information of each frame in the video. Secondly, the substantial number of visual tokens makes it challenging to precisely map visual tokens of each frame to their corresponding spatial coordinates. To address these issues, we introduce SpaceVLLM, an MLLM endowed with spatio-temporal video grounding capability. Specifically, we adopt a set of interleaved Spatio-Temporal Aware Queries to capture temporal perception and dynamic spatial information. Moreover, we propose a Query-Guided Space Decoder to establish a corresponding connection between the queries and spatial coordinates. Additionally, due to the lack of spatio-temporal datasets, we construct the Unified Spatio-Temporal Grounding (Uni-STG) dataset, comprising 480K instances across three tasks. This dataset fully exploits the potential of MLLM to simultaneously facilitate localization in both temporal and spatial dimensions. Extensive experiments demonstrate that SpaceVLLM achieves state-of-the-art performance across 11 benchmarks covering temporal, spatial, spatio-temporal and video understanding tasks, highlighting the effectiveness of our approach. Our code, datasets and model will be released.

DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection

Jaewoo Song,Daemin Park,Kanghyun Baek,Sangyub Lee,Jooyoung Choi,Eunji Kim,Sungroh Yoon

Task: Propose DefectFill, a new defect generation method for producing high-quality defect images.

Motivation: Defect data is scarce, which makes it hard to develop effective visual inspection models, and existing image generation models struggle to produce highly realistic defect images.

Details

Method: DefectFill fine-tunes an inpainting diffusion model with custom loss functions (defect, object, and attention terms) so that it captures localized defect features precisely and integrates them seamlessly into defect-free objects. A Low-Fidelity Selection method further improves the quality of the defect samples. Result: Experiments show that DefectFill generates high-quality defect images, enabling visual inspection models to reach state-of-the-art performance on the MVTec AD dataset. Conclusion: DefectFill is an effective defect generation method that substantially improves the performance of visual inspection models. Abstract: Developing effective visual inspection models remains challenging due to the scarcity of defect data. While image generation models have been used to synthesize defect images, producing highly realistic defects remains difficult. We propose DefectFill, a novel method for realistic defect generation that requires only a few reference defect images. It leverages a fine-tuned inpainting diffusion model, optimized with our custom loss functions incorporating defect, object, and attention terms. It enables precise capture of detailed, localized defect features and their seamless integration into defect-free objects. Additionally, our Low-Fidelity Selection method further enhances the defect sample quality. Experiments show that DefectFill generates high-quality defect images, enabling visual inspection models to achieve state-of-the-art performance on the MVTec AD dataset.

Rethinking Cell Counting Methods: Decoupling Counting and Localization

Zixuan Zheng,Yilei Shi,Chunlei Li,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou

Task: Propose a decoupled learning scheme for automated cell counting, consisting of separate counter and localizer networks.

Motivation: Cell counting in microscopy images is vital in medicine and biology but tedious and time-consuming to do by hand. Automated methods have advanced in recent years, yet state-of-the-art approaches tend to be complex in design.

Details

Method: The scheme uses separate counter and localizer networks. The counter operates on intermediate feature maps to exploit global context, producing a count estimate along with a coarse density map. The localizer then reconstructs a high-resolution density map from the original image and the counter's coarse density map, precisely localizing individual cells. A global message passing module integrates cross-region patterns. Result: Extensive experiments on four datasets show that the approach, despite its simplicity, challenges common practice and achieves state-of-the-art performance by significant margins. Conclusion: Decoupled learning removes the need to learn counting directly on high-resolution density maps, letting the model focus on the global features critical for accurate estimates. Abstract: Cell counting in microscopy images is vital in medicine and biology but extremely tedious and time-consuming to perform manually. While automated methods have advanced in recent years, state-of-the-art approaches tend toward increasingly complex model designs. In this paper, we propose a conceptually simple yet effective decoupled learning scheme for automated cell counting, consisting of separate counter and localizer networks. In contrast to jointly learning counting and density map estimation, we show that decoupling these objectives surprisingly improves results. The counter operates on intermediate feature maps rather than pixel space to leverage global context and produce count estimates, while also generating coarse density maps. The localizer then reconstructs high-resolution density maps that precisely localize individual cells, conditional on the original images and coarse density maps from the counter. Besides, to boost counting accuracy, we further introduce a global message passing module to integrate cross-region patterns. Extensive experiments on four datasets demonstrate that our approach, despite its simplicity, challenges common practice and achieves state-of-the-art performance by significant margins. Our key insight is that decoupled learning alleviates the need to learn counting on high-resolution density maps directly, allowing the model to focus on global features critical for accurate estimates. Code is available at https://github.com/MedAITech/DCL.
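
Density-map counting rests on one fact: the map integrates to the number of cells, so a coarse map can carry the same count as a high-resolution one. The sketch below illustrates that relationship with a toy 4 x 4 map and sum pooling; it is an illustration of the general density-map convention, not the paper's counter or localizer, and both function names are hypothetical.

```python
def count_from_density(density):
    """The predicted count is the sum (discrete integral) of the density map."""
    return sum(sum(row) for row in density)

def sum_pool(density, k):
    """Downsample by summing non-overlapping k x k blocks, which preserves the count.

    Assumes the height and width are both divisible by k.
    """
    h, w = len(density), len(density[0])
    return [
        [sum(density[i + di][j + dj] for di in range(k) for dj in range(k))
         for j in range(0, w, k)]
        for i in range(0, h, k)
    ]

# Toy high-resolution map: two cells, each spread over a few pixels.
fine = [
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.25],
    [0.0, 0.0, 0.25, 0.25],
]
coarse = sum_pool(fine, 2)
print(count_from_density(fine), count_from_density(coarse))
```

Because the coarse map already determines the count, localization detail can be recovered in a separate step without affecting the count estimate.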

GraphTEN: Graph Enhanced Texture Encoding Network

Bo Peng,Jintao Chen,Mufeng Yao,Chenhao Zhang,Jianghui Zhang,Mingmin Chi,Jiang Tao

Task: Propose a graph-enhanced texture encoding network (GraphTEN) that captures both local and global features for texture recognition.

Motivation: Modeling non-local context relations through visual primitives remains challenging because texture primitives vary and are randomly distributed in space.

Details

Method: GraphTEN models global associations through fully connected graphs and captures cross-scale dependencies of texture primitives via bipartite graphs. A codebook-based patch encoding module encodes multi-scale patch features into a unified feature space. Result: GraphTEN outperforms existing methods on five publicly available datasets. Conclusion: GraphTEN effectively captures both local and global features of texture primitives and performs well across multiple datasets. Abstract: Texture recognition is a fundamental problem in computer vision and pattern recognition. Recent progress leverages feature aggregation into discriminative descriptions based on convolutional neural networks (CNNs). However, modeling non-local context relations through visual primitives remains challenging due to the variability and randomness of texture primitives in spatial distributions. In this paper, we propose a graph-enhanced texture encoding network (GraphTEN) designed to capture both local and global features of texture primitives. GraphTEN models global associations through fully connected graphs and captures cross-scale dependencies of texture primitives via bipartite graphs. Additionally, we introduce a patch encoding module that utilizes a codebook to achieve an orderless representation of texture by encoding multi-scale patch features into a unified feature space. The proposed GraphTEN achieves superior performance compared to state-of-the-art methods across five publicly available datasets.

BI-RADS prediction of mammographic masses using uncertainty information extracted from a Bayesian Deep Learning model

Mohaddeseh Chegini,Ali Mahloojifar

Task: Predict the BI-RADS score with a Bayesian deep learning model to support the radiologist's final decision.

Motivation: Radiologists describe masses with significant variability, which leads to BI-RADS misclassification, so a BI-RADS prediction system is needed to support their decisions.

Details

Method: Uncertainty information extracted by a Bayesian deep learning model is used to predict the BI-RADS score. Result: The model's F1-scores on the BI-RADS 2, 3, and 5 samples are 73.33%, 59.60%, and 59.26%, respectively, better than the radiologist's predictions. In the BI-RADS 0 category, the model distinguishes malignant from benign samples with 75.86% accuracy and correctly identifies all malignant samples as BI-RADS 5. Conclusion: An uncertainty-aware Bayesian deep learning model can, like a radiologist, report its uncertainty about the malignancy of a lesion based on morphological features. Abstract: The BI-RADS score is a probabilistic reporting tool used by radiologists to express the level of uncertainty in predicting breast cancer based on some morphological features in mammography images. There is significant variability in describing masses, which sometimes leads to BI-RADS misclassification. A BI-RADS prediction system is needed to support the final radiologist decisions. In this study, the uncertainty information extracted by a Bayesian deep learning model is utilized to predict the BI-RADS score. The investigation results based on the pathology information demonstrate that the F1-scores of the radiologist's predictions are 42.86%, 48.33% and 48.28%, while the F1-scores of the model are 73.33%, 59.60% and 59.26% on the BI-RADS 2, 3 and 5 dataset samples, respectively. Also, the model can distinguish malignant from benign samples in the BI-RADS 0 category of the used dataset with an accuracy of 75.86% and correctly identify all malignant samples as BI-RADS 5. The Grad-CAM visualization shows the model pays attention to the morphological features of the lesions. Therefore, this study shows the uncertainty-aware Bayesian Deep Learning model can report its uncertainty about the malignancy of a lesion based on morphological features, like a radiologist.
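
The per-category F1-scores quoted above are one-vs-rest scores: for each BI-RADS category, precision and recall are computed treating that category as the positive class. A minimal sketch of the computation follows; the label lists are made-up toy data, not the study's results.

```python
def f1_for_class(y_true, y_pred, cls):
    """One-vs-rest F1-score for a single class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no correct positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy pathology-derived labels and predictions for BI-RADS categories 2, 3, 5.
y_true = [2, 2, 3, 5, 5, 3]
y_pred = [2, 3, 3, 5, 2, 3]
for cls in (2, 3, 5):
    print(cls, round(f1_for_class(y_true, y_pred, cls), 4))
```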

Multimodal Feature-Driven Deep Learning for the Prediction of Duck Body Dimensions and Weight

Yi Xiao,Qiannan Han,Guiping Liang,Hongyan Zhang,Song Wang,Zhihao Xu,Weican Wan,Chuang Li,Guitao Jiang,Wenbo Xiao

Task: Estimate duck body dimensions and weight non-invasively from multimodal data (2D RGB images, depth images, and 3D point clouds).

Motivation: Accurate body dimension and weight measurements are critical for optimizing poultry management, health assessment, and economic efficiency.

Details

Method: A deep learning model uses PointNet++ to extract key feature points from point clouds, computes the corresponding 3D geometric features, fuses them with multi-view convolutional 2D features, and applies a Transformer encoder to capture long-range dependencies and refine feature interactions. Result: Across eight morphometric parameters the model achieves a mean absolute percentage error (MAPE) of 6.33% and an R² of 0.953, showing strong predictive capability. Conclusion: The study is the first to apply deep learning to poultry body dimension and weight estimation and provides a useful reference for intelligent, precise management in the livestock industry. Abstract: Accurate body dimension and weight measurements are critical for optimizing poultry management, health assessment, and economic efficiency. This study introduces an innovative deep learning-based model leveraging multimodal data (2D RGB images from different views, depth images, and 3D point clouds) for the non-invasive estimation of duck body dimensions and weight. A dataset of 1,023 Linwu ducks, comprising over 5,000 samples with diverse postures and conditions, was collected to support model training. The proposed method innovatively employs PointNet++ to extract key feature points from point clouds, extracts and computes corresponding 3D geometric features, and fuses them with multi-view convolutional 2D features. A Transformer encoder is then utilized to capture long-range dependencies and refine feature interactions, thereby enhancing prediction robustness. The model achieved a mean absolute percentage error (MAPE) of 6.33% and an R² of 0.953 across eight morphometric parameters, demonstrating strong predictive capability. Unlike conventional manual measurements, the proposed model enables high-precision estimation while eliminating the necessity for physical handling, thereby reducing animal stress and broadening its application scope. This study marks the first application of deep learning techniques to poultry body dimension and weight estimation, providing a valuable reference for the intelligent and precise management of the livestock industry with far-reaching practical significance.
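
For readers unfamiliar with the two reported metrics, the sketch below computes MAPE and the coefficient of determination from their standard definitions. The measurement values are invented for illustration and are unrelated to the duck dataset.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Requires nonzero true values."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured vs. predicted values for one morphometric parameter.
measured = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(round(mape(measured, predicted), 3), round(r_squared(measured, predicted), 4))
```

MAPE is scale-free, which is what makes a single average across eight parameters with different units meaningful; R² instead reports the share of variance explained.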

MeshFleet: Filtered and Annotated 3D Vehicle Dataset for Domain Specific Generative Modeling

Damian Boborzi,Phillip Mueller,Jonas Emrich,Dominik Schmid,Sebastian Mueller,Lars Mikelsons

Task: Propose an automated data filtering method based on a quality classifier for building a high-quality, domain-specific 3D vehicle dataset.

Motivation: Generative models have made remarkable progress on 3D objects, but their practical use in fields such as engineering is still limited by accuracy, quality, and controllability.

Details

Method: An automated data filtering pipeline built on a quality classifier that is trained on a manually labeled subset of Objaverse, incorporates DINOv2 and SigLIP embeddings, and is refined through caption-based analysis and uncertainty estimation. Result: A comparative analysis and fine-tuning experiments with SV3D demonstrate the efficacy of the filtering method and highlight the importance of targeted data selection for domain-specific 3D generative modeling. Conclusion: The MeshFleet dataset and the automated filtering method provide high-quality data for domain-specific 3D generative modeling. Abstract: Generative models have recently made remarkable progress in the field of 3D objects. However, their practical application in fields like engineering remains limited since they fail to deliver the accuracy, quality, and controllability needed for domain-specific tasks. Fine-tuning large generative models is a promising perspective for making these models available in these fields. Creating high-quality, domain-specific 3D datasets is crucial for fine-tuning large generative models, yet the data filtering and annotation process remains a significant bottleneck. We present MeshFleet, a filtered and annotated 3D vehicle dataset extracted from Objaverse-XL, the most extensive publicly available collection of 3D objects. Our approach proposes a pipeline for automated data filtering based on a quality classifier. This classifier is trained on a manually labeled subset of Objaverse, incorporating DINOv2 and SigLIP embeddings, refined through caption-based analysis and uncertainty estimation. We demonstrate the efficacy of our filtering method through a comparative analysis against caption and image aesthetic score-based techniques and fine-tuning experiments with SV3D, highlighting the importance of targeted data selection for domain-specific 3D generative modeling.

LEGNet: Lightweight Edge-Gaussian Driven Network for Low-Quality Remote Sensing Image Object Detection

Wei Lu,Si-Bao Chen,Hui-Dong Li,Qing-Ling Shu,Chris H. Q. Ding,Jin Tang,Bin Luo

Task: Propose LEGNet, a lightweight network for object detection in low-quality remote sensing images.

Motivation: Remote sensing object detection faces many challenges in complex visual environments, such as low spatial resolution, sensor noise, blurred objects, low-light degradation, and partial occlusion, all of which reduce the feature discriminability of detection models.

Details

Method: LEGNet incorporates an edge-Gaussian aggregation (EGA) module that combines Scharr operator-based edge priors with uncertainty-aware Gaussian modeling to enhance feature precision. Result: On four remote sensing object detection benchmarks and one UAV-view dataset, LEGNet shows significant improvements and achieves state-of-the-art performance. Conclusion: LEGNet is computationally efficient and well suited to deployment on resource-constrained edge devices in real-world remote sensing applications. Abstract: Remote sensing object detection (RSOD) faces formidable challenges in complex visual environments. Aerial and satellite images inherently suffer from limitations such as low spatial resolution, sensor noise, blurred objects, low-light degradation, and partial occlusions. These degradation factors collectively compromise the feature discriminability in detection models, resulting in three key issues: (1) reduced contrast that hampers foreground-background separation, (2) structural discontinuities in edge representations, and (3) ambiguous feature responses caused by variations in illumination. These collectively weaken model robustness and deployment feasibility. To address these challenges, we propose LEGNet, a lightweight network that incorporates a novel edge-Gaussian aggregation (EGA) module specifically designed for low-quality remote sensing images. Our key innovation lies in the synergistic integration of Scharr operator-based edge priors with uncertainty-aware Gaussian modeling: (a) the orientation-aware Scharr filters preserve high-frequency edge details with rotational invariance; (b) the uncertainty-aware Gaussian layers probabilistically refine low-confidence features through variance estimation. This design enables precision enhancement while maintaining architectural simplicity. Comprehensive evaluations across four RSOD benchmarks (DOTA-v1.0, v1.5, DIOR-R, FAIR1M-v1.0) and a UAV-view dataset (VisDrone2019) demonstrate significant improvements. LEGNet achieves state-of-the-art performance across five benchmark datasets while ensuring computational efficiency, making it well-suited for deployment on resource-constrained edge devices in real-world remote sensing applications. The code is available at https://github.com/lwCVer/LEGNet.
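
For readers unfamiliar with the Scharr operator mentioned above, the following is a minimal pure-Python sketch of the standard 3x3 Scharr kernels applied to a tiny grayscale image. It illustrates only the classical edge filter; it is not the EGA module, and the helper names and the toy image are assumptions made for this example.

```python
import math

# Standard 3x3 Scharr kernels for horizontal and vertical intensity gradients
SCHARR_X = [[-3, 0, 3], [-10, 0, 10], [-3, 0, 3]]
SCHARR_Y = [[-3, -10, -3], [0, 0, 0], [3, 10, 3]]


def correlate3x3(img, kernel, r, c):
    """Correlate a 3x3 kernel with the neighbourhood centred at (r, c)."""
    return sum(kernel[i][j] * img[r - 1 + i][c - 1 + j]
               for i in range(3) for j in range(3))


def scharr_magnitude(img):
    """Gradient magnitude per pixel; border pixels are left at 0.0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = correlate3x3(img, SCHARR_X, r, c)
            gy = correlate3x3(img, SCHARR_Y, r, c)
            out[r][c] = math.hypot(gx, gy)
    return out


# Toy image with a vertical step edge between columns 1 and 2
image = [[0, 0, 1, 1] for _ in range(4)]
edges = scharr_magnitude(image)
for row in edges:
    print(row)
```

Interior pixels next to the step respond strongly, while a constant image yields zero response everywhere.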

Boosting Semi-Supervised Medical Image Segmentation via Masked Image Consistency and Discrepancy Learning

Pengcheng Zhou,Lantian Zhang,Wei Li

Task: Propose a semi-supervised learning framework for medical image segmentation, named Masked Image Consistency and Discrepancy Learning (MICD).

Motivation: Existing co-training frameworks focus mainly on differences in network initialization and on pseudo-label generation, overlooking the balance between information exchange and the preservation of model diversity.

Details

Method: Proposes three key modules: the Masked Cross Pseudo Consistency (MCPC) module enriches context perception and small-sample learning via pseudo-labeling across masked-input branches; the Cross Feature Consistency (CFC) module strengthens information exchange and model robustness by enforcing decoder feature consistency; the Cross Model Discrepancy (CMD) module uses EMA teacher networks to supervise outputs and preserve branch diversity. Result: Experiments on two public medical image datasets, AMOS and Synapse, show that the method outperforms existing state-of-the-art methods. Conclusion: The MICD framework addresses the limitations of existing methods by focusing on fine-grained local information and maintaining diversity in a heterogeneous framework. Abstract: Semi-supervised learning is of great significance in medical image segmentation by exploiting unlabeled data. Among its strategies, the co-training framework is prominent. However, previous co-training studies predominantly concentrate on network initialization variances and pseudo-label generation, while overlooking the equilibrium between information interchange and model diversity preservation. In this paper, we propose the Masked Image Consistency and Discrepancy Learning (MICD) framework with three key modules. The Masked Cross Pseudo Consistency (MCPC) module enriches context perception and small sample learning via pseudo-labeling across masked-input branches. The Cross Feature Consistency (CFC) module fortifies information exchange and model robustness by ensuring decoder feature consistency. The Cross Model Discrepancy (CMD) module utilizes EMA teacher networks to oversee outputs and preserve branch diversity. Together, these modules address existing limitations by focusing on fine-grained local information and maintaining diversity in a heterogeneous framework. Experiments on two public medical image datasets, AMOS and Synapse, demonstrate that our approach outperforms state-of-the-art methods.
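
The "EMA teacher" mentioned above refers to a teacher network whose weights are an exponential moving average of the student's weights. A minimal sketch of that update follows, with scalar parameters standing in for tensors; the decay value and parameter names are assumptions for illustration and are not taken from the paper.

```python
def ema_update(teacher, student, decay=0.99):
    """Return updated teacher weights: decay * teacher + (1 - decay) * student."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}


# Hypothetical scalar "weights"; real networks would hold tensors here
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}

teacher = ema_update(teacher, student, decay=0.9)
print(teacher)
```

A decay close to 1 makes the teacher change slowly, which is what lets it act as a stable supervisor for the faster-moving student branches.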

MP-GUI: Modality Perception with MLLMs for GUI Understanding

Ziwei Wang,Weizhi Chen,Leyang Yang,Sheng Zhou,Shengchu Zhao,Hanbei Zhan,Jiongchao Jin,Liangcheng Li,Zirui Shao,Jiajun Bu

Task: Design a multimodal large language model (MLLM) specialized for graphical user interface (GUI) understanding.

Motivation: Current MLLMs are already proficient at processing graphical and textual components but struggle with GUI understanding, mainly because they lack explicit spatial structure modeling. In addition, high-quality spatial structure data is hard to obtain because of privacy issues and noisy environments.

Details

Method: Proposes MP-GUI, an MLLM designed specifically for GUI understanding. MP-GUI has three specialized perceivers that extract graphical, textual, and spatial modalities from the screen as GUI-tailored visual clues; these are refined with a spatial structure refinement strategy and combined through an adaptive fusion gate to meet the specific needs of different GUI understanding tasks. Result: MP-GUI achieves impressive results on a variety of GUI understanding tasks with limited data. Conclusion: With its specialized perceivers and data collection pipeline, MP-GUI effectively addresses the challenges of GUI understanding and demonstrates its superiority in experiments. Abstract: Graphical user interfaces (GUIs) have become integral to modern society, making it crucial for human-centric systems to understand them. However, unlike natural images or documents, GUIs comprise artificially designed graphical elements arranged to convey specific semantic meanings. Current multi-modal large language models (MLLMs), although already proficient in processing graphical and textual components, face hurdles in GUI understanding due to the lack of explicit spatial structure modeling. Moreover, obtaining high-quality spatial structure data is challenging due to privacy issues and noisy environments. To address these challenges, we present MP-GUI, a specially designed MLLM for GUI understanding. MP-GUI features three precisely specialized perceivers to extract graphical, textual, and spatial modalities from the screen as GUI-tailored visual clues, which are refined with a spatial structure refinement strategy and adaptively combined via a fusion gate to meet the specific preferences of different GUI understanding tasks. To cope with the scarcity of training data, we also introduce a pipeline for automatic data collection. Extensive experiments demonstrate that MP-GUI achieves impressive results on various GUI understanding tasks with limited data.

Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting

Runsong Zhu,Shi Qiu,Zhengzhe Liu,Ka-Hei Hui,Qianyi Wu,Pheng-Ann Heng,Chi-Wing Fu

Task: Lift multi-view 2D instance segmentation to a radiance field to enhance 3D understanding.

Motivation: Existing methods either rely on direct matching for end-to-end lifting, which yields inferior results, or adopt a two-stage solution constrained by complex pre- or post-processing.

Details

Method: Designs a new end-to-end object-aware lifting approach, Unified-Lift, that provides accurate 3D segmentation based on the 3D Gaussian representation. It learns Gaussian-level features with a contrastive loss and introduces a learnable object-level codebook for explicit object-level understanding. Result: Experiments on the LERF-Masked, Replica, and Messy Rooms datasets show that Unified-Lift clearly outperforms existing methods in segmentation quality and time efficiency. Conclusion: With effective codebook learning and a noisy label filtering module, Unified-Lift significantly improves 3D segmentation performance. The code is publicly available. Abstract: Lifting multi-view 2D instance segmentation to a radiance field has proven to be effective to enhance 3D understanding. Existing methods rely on direct matching for end-to-end lifting, yielding inferior results; or employ a two-stage solution constrained by complex pre- or post-processing. In this work, we design a new end-to-end object-aware lifting approach, named Unified-Lift, that provides accurate 3D segmentation based on the 3D Gaussian representation. To start, we augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information. Importantly, we introduce a learnable object-level codebook to account for individual objects in the scene for an explicit object-level understanding and associate the encoded object-level features with the Gaussian-level point features for segmentation predictions. While promising, achieving effective codebook learning is non-trivial and a naive solution leads to degraded performance. Therefore, we formulate the association learning module and the noisy label filtering module for effective and robust codebook learning. We conduct experiments on three benchmarks: LERF-Masked, Replica, and Messy Rooms datasets. Both qualitative and quantitative results show that our Unified-Lift clearly outperforms existing methods in terms of segmentation quality and time efficiency. The code is publicly available at https://github.com/Runsong123/Unified-Lift.

A Revisit to the Decoder for Camouflaged Object Detection

Seung Woo Ko,Joopyo Hong,Suyoung Kim,Seungjai Bang,Sungzoon Cho,Nojun Kwak,Hyung-Sin Kim,Joonseok Lee

Task: Generate fine-grained segmentation maps of camouflaged objects hidden in their background.

Motivation: Because camouflaged objects are hidden by nature, the decoder must extract their features effectively and take particular care in generating their complex boundaries.

Details

Method: Proposes a novel architecture that augments the decoding strategy in COD with an Enrich Decoder and a Retouch Decoder. The Enrich Decoder uses channel-wise attention to amplify the feature channels that matter for COD, and the Retouch Decoder further refines the segmentation map through spatial attention. Result: ENTO shows superior performance with various encoders, with the two novel components playing distinct, complementary roles. Conclusion: The proposed architecture effectively improves camouflaged object detection performance through the Enrich Decoder and Retouch Decoder. Abstract: Camouflaged object detection (COD) aims to generate a fine-grained segmentation map of camouflaged objects hidden in their background. Due to the hidden nature of camouflaged objects, it is essential for the decoder to be tailored to effectively extract proper features of camouflaged objects and extra-carefully generate their complex boundaries. In this paper, we propose a novel architecture that augments the prevalent decoding strategy in COD with Enrich Decoder and Retouch Decoder, which help to generate a fine-grained segmentation map. Specifically, the Enrich Decoder amplifies the channels of features that are important for COD using channel-wise attention. Retouch Decoder further refines the segmentation maps by spatially attending to important pixels, such as the boundary regions. With extensive experiments, we demonstrate that ENTO shows superior performance using various encoders, with the two novel components playing their unique roles that are mutually complementary.

Intra and Inter Parser-Prompted Transformers for Effective Image Restoration

Cong Wang,Jinshan Pan,Liyan Wang,Wei Wang

Task: Propose Intra and Inter Parser-Prompted Transformers (PPTformer), a model for image restoration.

Motivation: Explore useful features from visual foundation models to improve image restoration.

Details

Method: PPTformer has two parts: an Image Restoration Network (IRNet) and a Parser-Prompted Feature Generation Network (PPFGNet). IRNet restores images from degraded observations, and PPFGNet supplies IRNet with reliable parser information to boost restoration. Intra Parser-Prompted Attention (IntraPPA) and Inter Parser-Prompted Attention (InterPPA) are proposed to learn useful parser features implicitly and explicitly. Result: PPTformer achieves state-of-the-art performance on image deraining, defocus deblurring, desnowing, and low-light enhancement. Conclusion: By combining IntraPPA and InterPPA, PPTformer effectively improves image restoration. Abstract: We propose Intra and Inter Parser-Prompted Transformers (PPTformer) that explore useful features from visual foundation models for image restoration. Specifically, PPTformer contains two parts: an Image Restoration Network (IRNet) for restoring images from degraded observations and a Parser-Prompted Feature Generation Network (PPFGNet) for providing IRNet with reliable parser information to boost restoration. To enhance the integration of the parser within IRNet, we propose Intra Parser-Prompted Attention (IntraPPA) and Inter Parser-Prompted Attention (InterPPA) to implicitly and explicitly learn useful parser features to facilitate restoration. The IntraPPA re-considers cross attention between parser and restoration features, enabling implicit perception of the parser from a long-range and intra-layer perspective. Conversely, the InterPPA initially fuses restoration features with those of the parser, followed by formulating these fused features within an attention mechanism to explicitly perceive parser information. Further, we propose a parser-prompted feed-forward network to guide restoration within pixel-wise gating modulation. Experimental results show that PPTformer achieves state-of-the-art performance on image deraining, defocus deblurring, desnowing, and low-light enhancement.

AIGVE-Tool: AI-Generated Video Evaluation Toolkit with Multifaceted Benchmark

Xinhao Xiang,Xiao Liu,Zizhong Li,Zhuosheng Liu,Jiawei Zhang

Task: Propose AIGVE-Tool, a unified framework for systematically evaluating AI-generated videos.

Motivation: Existing evaluation metrics lack a unified framework, which leaves evaluation methods fragmented and redundant; many are also tied to specific datasets, limiting their applicability across video domains.

Details

Method: Introduces AIGVE-Tool, a structured and extensible evaluation framework that integrates multiple evaluation methodologies and allows flexible customization through a modular configuration system. Also proposes AIGVE-Bench, a large-scale benchmark dataset built with five state-of-the-art video generation models. Result: AIGVE-Tool provides standardized and reliable evaluation results and reveals specific strengths and limitations of current models. Conclusion: AIGVE-Tool facilitates the advancement of next-generation AI-generated video techniques. Abstract: The rapid advancement in AI-generated video synthesis has led to a growing demand for standardized and effective evaluation metrics. Existing metrics lack a unified framework for systematically categorizing methodologies, limiting a holistic understanding of the evaluation landscape. Additionally, fragmented implementations and the absence of standardized interfaces lead to redundant processing overhead. Furthermore, many prior approaches are constrained by dataset-specific dependencies, limiting their applicability across diverse video domains. To address these challenges, we introduce AIGVE-Tool (AI-Generated Video Evaluation Toolkit), a unified framework that provides a structured and extensible evaluation pipeline for comprehensive AI-generated video evaluation. Organized within a novel five-category taxonomy, AIGVE-Tool integrates multiple evaluation methodologies while allowing flexible customization through a modular configuration system. Additionally, we propose AIGVE-Bench, a large-scale benchmark dataset created with five SOTA video generation models based on hand-crafted instructions and prompts. This dataset systematically evaluates various video generation models across nine critical quality dimensions. Extensive experiments demonstrate the effectiveness of AIGVE-Tool in providing standardized and reliable evaluation results, highlighting specific strengths and limitations of current models and facilitating the advancements of next-generation AI-generated video techniques.

Fast Autoregressive Video Generation with Diagonal Decoding

Yang Ye,Junliang Guo,Haoyu Wu,Tianyu He,Tim Pearce,Tabish Rashid,Katja Hofmann,Jiang Bian

Task: Propose Diagonal Decoding (DiagD), a training-free inference acceleration algorithm for autoregressively pre-trained models.

Motivation: Autoregressive Transformer models perform impressively in video generation, but their token-by-token decoding becomes a bottleneck for long videos.

Details

Method: By exploiting spatial and temporal correlations in videos, DiagD generates tokens along diagonal paths in the spatial-temporal token grid, enabling parallel decoding within each frame and partially overlapping decoding across consecutive frames. Result: Experiments on multiple autoregressive video generation models and datasets show that DiagD achieves up to 10× speedup while maintaining comparable visual fidelity. Conclusion: DiagD is a versatile and adaptive algorithm that offers flexible control over the trade-off between inference speed and visual quality, and a fine-tuning strategy further narrows the training-inference gap. Abstract: Autoregressive Transformer models have demonstrated impressive performance in video generation, but their sequential token-by-token decoding process poses a major bottleneck, particularly for long videos represented by tens of thousands of tokens. In this paper, we propose Diagonal Decoding (DiagD), a training-free inference acceleration algorithm for autoregressively pre-trained models that exploits spatial and temporal correlations in videos. Our method generates tokens along diagonal paths in the spatial-temporal token grid, enabling parallel decoding within each frame as well as partially overlapping across consecutive frames. The proposed algorithm is versatile and adaptive to various generative models and tasks, while providing flexible control over the trade-off between inference speed and visual quality. Furthermore, we propose a cost-effective finetuning strategy that aligns the attention patterns of the model with our decoding order, further mitigating the training-inference gap on small-scale models. Experiments on multiple autoregressive video generation models and datasets demonstrate that DiagD achieves up to $10\times$ speedup compared to naive sequential decoding, while maintaining comparable visual fidelity.
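
To make the idea of decoding "along diagonal paths" concrete, here is a rough sketch for a single frame only. It assumes that all tokens with the same row + column index are emitted in one parallel step; the exact schedule used by DiagD, including the overlap across consecutive frames, is defined in the paper and is not reproduced here. The function name is hypothetical.

```python
def diagonal_schedule(height, width):
    """Group the (row, col) token positions of one frame by anti-diagonal.

    Step k holds every position with row + col == k, so a frame needs
    height + width - 1 steps instead of height * width sequential steps.
    """
    steps = [[] for _ in range(height + width - 1)]
    for r in range(height):
        for c in range(width):
            steps[r + c].append((r, c))
    return steps


schedule = diagonal_schedule(3, 4)
for k, positions in enumerate(schedule):
    print(k, positions)
print("parallel steps:", len(schedule), "vs sequential steps:", 3 * 4)
```

Under this assumption the number of decoding steps per frame grows with height + width rather than height × width, which is where the potential speedup comes from.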

Limb-Aware Virtual Try-On Network with Progressive Clothing Warping

Shengping Zhang,Xiaoyu Han,Weigang Zhang,Xiangyuan Lan,Hongxun Yao,Qingming Huang

Task: Transfer an in-shop clothing image onto a person image for image-based virtual try-on.

Motivation: Existing methods usually warp clothing with a single global deformation, which lacks fine-grained modeling of the in-shop clothing and distorts its appearance. They also tend to generate limb details poorly, because the clothing-agnostic person representation they use does not refer to the limb textures of the person image.

Details

Method: Proposes PL-VTON, a limb-aware virtual try-on network that produces high-quality try-on results through progressive fine-grained clothing warping and limb-aware texture fusion. It comprises Progressive Clothing Warping (PCW), a Person Parsing Estimator (PPE), and Limb-aware Texture Fusion (LTF). Result: Experiments show that PL-VTON outperforms existing methods both qualitatively and quantitatively. Conclusion: Through fine-grained modeling and limb-aware texture fusion, PL-VTON significantly improves virtual try-on quality, especially in generating realistic limb details. Abstract: Image-based virtual try-on aims to transfer an in-shop clothing image to a person image. Most existing methods adopt a single global deformation to perform clothing warping directly, which lacks fine-grained modeling of in-shop clothing and leads to distorted clothing appearance. In addition, existing methods usually fail to generate limb details well because they are limited by the used clothing-agnostic person representation without referring to the limb textures of the person image. To address these problems, we propose Limb-aware Virtual Try-on Network named PL-VTON, which performs fine-grained clothing warping progressively and generates high-quality try-on results with realistic limb details. Specifically, we present Progressive Clothing Warping (PCW) that explicitly models the location and size of in-shop clothing and utilizes a two-stage alignment strategy to progressively align the in-shop clothing with the human body. Moreover, a novel gravity-aware loss that considers the fit of the person wearing clothing is adopted to better handle the clothing edges. Then, we design Person Parsing Estimator (PPE) with a non-limb target parsing map to semantically divide the person into various regions, which provides structural constraints on the human body and therefore alleviates texture bleeding between clothing and body regions. Finally, we introduce Limb-aware Texture Fusion (LTF) that focuses on generating realistic details in limb regions, where a coarse try-on result is first generated by fusing the warped clothing image with the person image, then limb textures are further fused with the coarse result under limb-aware guidance to refine limb details. Extensive experiments demonstrate that our PL-VTON outperforms the state-of-the-art methods both qualitatively and quantitatively.

Growing a Twig to Accelerate Large Vision-Language Models

Zhenwei Shao,Mingyang Wang,Zhou Yu,Wenwen Pan,Yan Yang,Tao Wei,Hongyuan Zhang,Ning Mao,Wei Chen,Jun Yu

Task: Propose TwigVLM, an architecture that accelerates a large vision-language model (VLM) by growing a lightweight twig on one of its early layers.

Motivation: Existing visual token pruning methods for accelerating large VLMs have two main problems: an accuracy drop caused by insensitive attention signals in early layers, and limited speedup when generating long responses.

Details

Method: TwigVLM uses a twig-guided token pruning (TTP) strategy and a self-speculative decoding (SSD) strategy. Result: Experiments show that TwigVLM preserves 96% of the original performance after pruning 88.9% of visual tokens and achieves a 154% speedup when generating long responses. Conclusion: TwigVLM outperforms existing state-of-the-art VLM acceleration methods in both accuracy and speed. Abstract: Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overheads pose great challenges for practical deployment. Some recent works have proposed methods to accelerate VLMs by pruning redundant visual tokens guided by the attention maps of VLM's early layers. Despite the success of these token pruning methods, they still suffer from two major shortcomings: (i) considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address the limitations above, we present TwigVLM, a simple and general architecture that grows a lightweight twig upon an early layer of the base VLM. Compared with most existing VLM acceleration methods purely based on visual token pruning, our TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of visual tokens and achieves 154% speedup in generating long responses, delivering significantly better performance in terms of both accuracy and speed over the state-of-the-art VLM acceleration methods. Code will be made publicly available.
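
As a rough illustration of what pruning 88.9% of visual tokens means, the sketch below keeps the highest-scoring tokens by a generic importance score. It is not the TTP strategy itself: in TwigVLM the scores come from the twig, whereas here they are plain numbers. The figure of 576 visual tokens per image for LLaVA-1.5 is background knowledge rather than something stated in the abstract, and the function name is hypothetical.

```python
def prune_tokens(scores, keep_ratio):
    """Return the indices of the top-scoring tokens, in their original order."""
    keep = max(1, round(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    return sorted(top)


# Toy example: keep half of four tokens
kept = prune_tokens([0.1, 0.9, 0.3, 0.8], keep_ratio=0.5)
print(kept)

# Pruning 88.9% of an assumed 576 visual tokens leaves about 64 of them
remaining = round(576 * (1 - 0.889))
print(remaining)
```

Returning the kept indices in their original order preserves the positional layout of the surviving tokens.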

SCJD: Sparse Correlation and Joint Distillation for Efficient 3D Human Pose Estimation

Weihong Chen,Xuemiao Xu,Haoxin Yang,Yi Xie,Peng Xiao,Cheng Xu,Huaidong Zhang,Pheng-Ann Heng

Task: Propose SCJD, a new framework that balances efficiency and accuracy in 3D human pose estimation.

Motivation: Existing 3D human pose estimation methods are accurate but computationally expensive and slow at inference, while knowledge distillation methods fail to address spatial relationships between joints and temporal correlations in multi-frame inputs.

Details

Method: SCJD introduces Sparse Correlation Input Sequence Downsampling to reduce redundancy in the student network's inputs while preserving inter-frame correlations. It proposes Dynamic Joint Spatial Attention Distillation, comprising Dynamic Joint Embedding Distillation and Adjacent Joint Attention Distillation, to strengthen the student's feature representation and spatial understanding. In addition, Temporal Consistency Distillation aligns temporal correlations between the teacher and student networks through upsampling and global supervision. Result: Extensive experiments show that SCJD achieves state-of-the-art performance. Conclusion: The SCJD framework effectively balances efficiency and accuracy in 3D human pose estimation. Abstract: Existing 3D Human Pose Estimation (HPE) methods achieve high accuracy but suffer from computational overhead and slow inference, while knowledge distillation methods fail to address spatial relationships between joints and temporal correlations in multi-frame inputs. In this paper, we propose Sparse Correlation and Joint Distillation (SCJD), a novel framework that balances efficiency and accuracy for 3D HPE. SCJD introduces Sparse Correlation Input Sequence Downsampling to reduce redundancy in student network inputs while preserving inter-frame correlations. For effective knowledge transfer, we propose Dynamic Joint Spatial Attention Distillation, which includes Dynamic Joint Embedding Distillation to enhance the student's feature representation using the teacher's multi-frame context feature, and Adjacent Joint Attention Distillation to improve the student network's focus on adjacent joint relationships for better spatial understanding. Additionally, Temporal Consistency Distillation aligns the temporal correlations between teacher and student networks through upsampling and global supervision. Extensive experiments demonstrate that SCJD achieves state-of-the-art performance. Code is available at https://github.com/wileychan/SCJD.

Reliable uncertainty quantification for 2D/3D anatomical landmark localization using multi-output conformal prediction

Jef Jonkers,Frank Coopman,Luc Duchateau,Glenn Van Wallendael,Sofie Van Hoecke

Task: Automatically localize anatomical landmarks in medical images while providing reliable uncertainty quantification for the predictions.

Motivation: Current uncertainty quantification approaches, particularly when combined with normality assumptions, tend to underestimate total predictive uncertainty and fall short of what clinical decision support requires.

Details

Method: Proposes two new multi-output prediction approaches: Multi-output Regression-as-Classification Conformal Prediction (M-R2CCP) and its variant Multi-output Regression to Classification Conformal Prediction set to Region (M-R2C2R). Both generate flexible, non-convex prediction regions that better capture the uncertainty structure of landmark predictions. Result: Extensive experiments on multiple 2D and 3D datasets show that these methods outperform existing multi-output conformal prediction approaches in both validity and efficiency. Conclusion: The work significantly advances reliable uncertainty estimation for anatomical landmark localization, giving clinicians trustworthy confidence measures for their diagnoses, and the methods show promise for multi-output regression problems more broadly. Abstract: Automatic anatomical landmark localization in medical imaging requires not just accurate predictions but reliable uncertainty quantification for effective clinical decision support. Current uncertainty quantification approaches often fall short, particularly when combined with normality assumptions, systematically underestimating total predictive uncertainty. This paper introduces conformal prediction as a framework for reliable uncertainty quantification in anatomical landmark localization, addressing a critical gap in automatic landmark localization. We present two novel approaches guaranteeing finite-sample validity for multi-output prediction: Multi-output Regression-as-Classification Conformal Prediction (M-R2CCP) and its variant Multi-output Regression to Classification Conformal Prediction set to Region (M-R2C2R). Unlike conventional methods that produce axis-aligned hyperrectangular or ellipsoidal regions, our approaches generate flexible, non-convex prediction regions that better capture the underlying uncertainty structure of landmark predictions. Through extensive empirical evaluation across multiple 2D and 3D datasets, we demonstrate that our methods consistently outperform existing multi-output conformal prediction approaches in both validity and efficiency. This work represents a significant advancement in reliable uncertainty estimation for anatomical landmark localization, providing clinicians with trustworthy confidence measures for their diagnoses. While developed for medical imaging, these methods show promise for broader applications in multi-output regression problems.

Operational Change Detection for Geographical Information: Overview and Challenges

Nicolas Gonthier

Task: Review change detection methods suited to the operational updating of large-scale geographic databases.

Motivation: Territories are evolving rapidly under climate change and human activity, so national mapping agencies need to update geospatial databases promptly and effectively.

Details

Method: Classifies automatic change detection methods into four families (rule-based, statistical, machine learning, and simulation methods) and discusses the strengths, limitations, and applicability of each. Result: Identifies key applications for national mapping agencies, including optimization of geospatial database updating, change-based phenomena, and dynamics monitoring. Conclusion: Highlights the challenges facing change detection and the need for ongoing innovation in change detection techniques to meet the future needs of geographic information systems. Abstract: Rapid evolution of territories due to climate change and human impact requires prompt and effective updates to geospatial databases maintained by the National Mapping Agency. This paper presents a comprehensive overview of change detection methods tailored for the operational updating of large-scale geographic databases. This review first outlines the fundamental definition of change, emphasizing its multifaceted nature, from temporal to semantic characterization. It categorizes automatic change detection methods into four main families: rule-based, statistical, machine learning, and simulation methods. The strengths, limitations, and applicability of every family are discussed in the context of various input data. Then, key applications for National Mapping Agencies are identified, particularly the optimization of geospatial database updating, change-based phenomena, and dynamics monitoring. Finally, the paper highlights the current challenges for leveraging change detection, such as the variability of change definition, the lack of relevant large-scale datasets, the diversity of input data, the unstudied no-change detection, human-in-the-loop integration, and operational constraints. The discussion underscores the necessity for ongoing innovation in change detection techniques to address the future needs of geographic information systems for national mapping agencies.

Towards properties of adversarial image perturbations

Egor Kuznetsov,Kirill Aistov,Maxim Koroteev

Task: Study the effect of adversarial perturbations on the VMAF image quality metric.

Motivation: Investigate whether adversarial perturbations can substantially raise the VMAF score while preserving the subjective quality of the image.

Details

Method: Uses a stochastic gradient approach to study the properties of adversarial perturbations, analyzes the perturbation structure via Fourier power spectrum computations, and performs direct VMAF optimization in PyTorch. Result: A moderate change in image brightness (about 10 pixel units) can raise VMAF by about 60% while leaving the subjective image quality almost unchanged. Perturbation amplitude depends approximately linearly on image brightness. Conclusion: Adversarial perturbations can substantially raise VMAF without noticeably changing subjective quality, and there are significant discrepancies between VMAF values and subjective judgements. Abstract: Using a stochastic gradient approach, we study the properties of adversarial perturbations resulting in noticeable growth of the VMAF image quality metric. The structure of the perturbations is investigated depending on the acceptable PSNR values and based on the Fourier power spectrum computations for the perturbations. It is demonstrated that moderate variation of image brightness ($\sim 10$ pixel units) in a restricted region of an image can result in VMAF growth by $\sim 60\%$. Unlike some other methods demonstrating similar VMAF growth, the subjective quality of an image remains almost unchanged. It is also shown that the adversarial perturbations may demonstrate approximately linear dependence of perturbation amplitudes on the image brightness. The perturbations are studied based on the direct VMAF optimization in PyTorch. The significant discrepancies between the metric values and subjective judgements are also demonstrated when image restoration from noise is carried out using the same direct VMAF optimization.

Condensing Action Segmentation Datasets via Generative Network Inversion

Guodong Ding,Rongyu Chen,Angela Yao

Task: Propose a condensation method for procedural video datasets used in temporal action segmentation.

Motivation: Reduce the storage requirements of temporal action segmentation datasets while maintaining performance.

Details

Method: Uses a generative prior and network inversion to condense data into compact latent codes, and samples diverse, representative action sequences to minimize video-wise redundancy. Result: On standard benchmarks the method condenses TAS datasets effectively; on the Breakfast dataset in particular, it reduces storage by over 500× while retaining 83% of the performance. On a downstream incremental learning task, it outperforms the state of the art. Conclusion: The proposed condensation method greatly reduces storage requirements while retaining most of the temporal action segmentation performance. Abstract: This work presents the first condensation approach for procedural video datasets used in temporal action segmentation. We propose a condensation framework that leverages a generative prior learned from the dataset and network inversion to condense data into compact latent codes, with significant storage reduction across temporal and channel aspects. Orthogonally, we propose sampling diverse and representative action sequences to minimize video-wise redundancy. Our evaluation on standard benchmarks demonstrates consistent effectiveness in condensing TAS datasets and achieving competitive performances. Specifically, on the Breakfast dataset, our approach reduces storage by over 500$\times$ while retaining 83% of the performance compared to training with the full dataset. Furthermore, when applied to a downstream incremental learning task, it yields superior performance compared to the state-of-the-art.

SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models

Subhadeep Koley,Tapas Kumar Dutta,Aneeshan Sain,Pinaki Nath Chowdhury,Ayan Kumar Bhunia,Yi-Zhe Song

Task: Address the limitations of foundation models in sketch understanding by combining Stable Diffusion (SD) with CLIP.

Motivation: Foundation models have revolutionised computer vision, but sketch understanding still poses the challenge of abstract, sparse visual inputs.

Details

Method: A systematic analysis finds that SD struggles to extract features from abstract sketches and exhibits a frequency-domain bias. These issues are addressed by dynamically injecting CLIP features into SD's denoising process and adaptively aggregating features across semantic levels. Result: Achieves state-of-the-art gains in sketch retrieval (+3.35%), recognition (+1.06%), segmentation (+29.42%), and correspondence learning (+21.22%). Conclusion: The method demonstrates the first truly universal sketch feature representation in the era of foundation models. Abstract: While foundation models have revolutionised computer vision, their effectiveness for sketch understanding remains limited by the unique challenges of abstract, sparse visual inputs. Through systematic analysis, we uncover two fundamental limitations: Stable Diffusion (SD) struggles to extract meaningful features from abstract sketches (unlike its success with photos), and exhibits a pronounced frequency-domain bias that suppresses essential low-frequency components needed for sketch understanding. Rather than costly retraining, we address these limitations by strategically combining SD with CLIP, whose strong semantic understanding naturally compensates for SD's spatial-frequency biases. By dynamically injecting CLIP features into SD's denoising process and adaptively aggregating features across semantic levels, our method achieves state-of-the-art performance in sketch retrieval (+3.35%), recognition (+1.06%), segmentation (+29.42%), and correspondence learning (+21.22%), demonstrating the first truly universal sketch feature representation in the era of foundation models.

Exploring Disparity-Accuracy Trade-offs in Face Recognition Systems: The Role of Datasets, Architectures, and Loss Functions

Siddharth D Jaiswal,Sagnik Basu,Sandipan Sikdar,Animesh Mukherjee

Task: Analyze three automated face recognition systems (FRSs) on the task of gender prediction, examining how model architecture, loss function, and dataset affect accuracy and disparity.

Motivation: Although FRSs have surpassed human-level accuracy, they remain disparate against certain demographic groups. Understanding how model architecture, loss function, and dataset affect the accuracy-disparity trade-off is essential for designing better, unbiased platforms.

Details

Method: Performs an in-depth analysis of three FRSs, with architectural modifications yielding ten deep learning models coupled with four loss functions, benchmarked on seven face datasets across 266 evaluation configurations. Result: Model architecture, loss function, and dataset each have an individual as well as a combined impact on accuracy and disparity. Datasets have an inherent property that makes them perform similarly across models, and the choice of dataset determines the model's perceived bias. Conclusion: Models are unable to generalize a uniform definition of a "female face" versus a "male face"; model developers are advised to use this study as a blueprint for model development and deployment. Abstract: Automated Face Recognition Systems (FRSs), developed using deep learning models, are deployed worldwide for identity verification and facial attribute analysis. The performance of these models is determined by a complex interdependence among the model architecture, optimization/loss function and datasets. Although FRSs have surpassed human-level accuracy, they continue to be disparate against certain demographics. Due to the ubiquity of applications, it is extremely important to understand the impact of the three components (model architecture, loss function, and face image dataset) on the accuracy-disparity trade-off to design better, unbiased platforms. In this work, we perform an in-depth analysis of three FRSs for the task of gender prediction, with various architectural modifications resulting in ten deep-learning models coupled with four loss functions and benchmark them on seven face datasets across 266 evaluation configurations. Our results show that all three components have an individual as well as a combined impact on both accuracy and disparity. We identify that datasets have an inherent property that causes them to perform similarly across models, independent of the choice of loss functions. Moreover, the choice of dataset determines the model's perceived bias: the same model reports bias in opposite directions for three gender-balanced datasets of "in-the-wild" face images of popular individuals. Studying the facial embeddings shows that the models are unable to generalize a uniform definition of what constitutes a "female face" as opposed to a "male face", due to dataset diversity. We provide recommendations to model developers on using our study as a blueprint for model development and subsequent deployment.

Zining Wang,Tongkun Guan,Pei Fu,Chen Duan,Qianyi Jiang,Zhentao Guo,Shan Guo,Junfeng Luo,Wei Shen,Xiaokang Yang

Task: Design an image-text pre-training task suited to document-level multimodal large language models that bridges the visual and language modalities.

Motivation: Multimodal large language models (MLLMs) add a new dimension to document understanding, but how to design a suitable image-text pre-training task that bridges the visual and language modalities remains underexplored.

Details

Method: Propose a new visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task, jointly optimizing VQA-based text parsing and mask generation. Result: Introducing the mask generation task yields competitive document-level understanding performance, and Marten achieves significant improvements among 8B MLLMs. Conclusion: The proposed VQAMask task and the Marten model perform well on document-level understanding tasks; code and datasets are publicly available. Abstract: Multi-modal Large Language Models (MLLMs) have introduced a novel dimension to document understanding, i.e., they endow large language models with visual comprehension capabilities; however, how to design a suitable image-text pre-training task for bridging the visual and language modality in document-level MLLMs remains underexplored. In this study, we introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task, optimizing two tasks simultaneously: VQA-based text parsing and mask generation. The former allows the model to implicitly align images and text at the semantic level. The latter introduces an additional mask generator (discarded during inference) to explicitly ensure alignment between visual texts within images and their corresponding image regions at a spatially-aware level. Together, they can prevent model hallucinations when parsing visual text and effectively promote spatially-aware feature representation learning. To support the proposed VQAMask task, we construct a comprehensive image-mask generation pipeline and provide a large-scale dataset with 6M data (MTMask6M). Subsequently, we demonstrate that introducing the proposed mask generation task yields competitive document-level understanding performance. Leveraging the proposed VQAMask, we introduce Marten, a training-efficient MLLM tailored for document-level understanding. Extensive experiments show that our Marten consistently achieves significant improvements among 8B-MLLMs in document-centric tasks. Code and datasets are available at https://github.com/PriNing/Marten.

Comparative and Interpretative Analysis of CNN and Transformer Models in Predicting Wildfire Spread Using Remote Sensing Data

Yihang Zhou,Ruige Kong,Zhengsen Xu,Linlin Xu,Sibo Cheng

Task: Compare the performance, efficiency, and explainability of four deep learning architectures (Autoencoder, ResNet, UNet, and Transformer-based Swin-UNet) for wildfire prediction.

Motivation: The choice of deep learning method for wildfire prediction remains uncertain because quantitative, explainable comparative analyses are lacking, even though such analyses are crucial for improving prevention measures and refining models.

Details

Method: Using a real-world dataset with nearly a decade of remote sensing data from California, U.S., compare the four architectures in performance, efficiency, and explainability, and apply XAI techniques to improve the models' clarity and trustworthiness. Result: Transformer-based Swin-UNet and UNet outperform Autoencoder and ResNet in predictive accuracy and interpretability, owing in particular to the attention mechanisms of Swin-UNet and the skip connections used in both UNet and Swin-UNet. Conclusion: Transformer-based Swin-UNet and UNet perform best for wildfire prediction; the XAI analysis shows their advantage in focusing on critical features, offering guidance for future model design and for model selection in different scenarios. Abstract: Facing the escalating threat of global wildfires, numerous computer vision techniques using remote sensing data have been applied in this area. However, the selection of deep learning methods for wildfire prediction remains uncertain due to the lack of comparative analysis in a quantitative and explainable manner, crucial for improving prevention measures and refining models. This study aims to thoroughly compare the performance, efficiency, and explainability of four prevalent deep learning architectures: Autoencoder, ResNet, UNet, and Transformer-based Swin-UNet. Employing a real-world dataset that includes nearly a decade of remote sensing data from California, U.S., these models predict the spread of wildfires for the following day. Through detailed quantitative comparison analysis, we discovered that Transformer-based Swin-UNet and UNet generally outperform Autoencoder and ResNet, particularly due to the advanced attention mechanisms in Transformer-based Swin-UNet and the efficient use of skip connections in both UNet and Transformer-based Swin-UNet, which contribute to superior predictive accuracy and model interpretability. We then applied XAI techniques to all four models, which not only enhances the clarity and trustworthiness of the models but also promotes focused improvements in wildfire prediction capabilities. The XAI analysis reveals that UNet and Transformer-based Swin-UNet are able to focus on critical features such as 'Previous Fire Mask', 'Drought', and 'Vegetation' more effectively than the other two models, while also maintaining balanced attention to the remaining features, leading to their superior performance. The insights from our thorough comparative analysis offer substantial implications for future model design and also provide guidance for model selection in different scenarios.

Concat-ID: Towards Universal Identity-Preserving Video Synthesis

Yong Zhong,Zhuoyi Yang,Jiayan Teng,Xiaotao Gu,Chongxuan Li

Task: Propose Concat-ID, a framework for identity-preserving video generation.

Motivation: Balance identity consistency against facial editability, a trade-off that existing methods handle poorly, while improving video naturalness.

Details

Method: Use Variational Autoencoders to extract image features and concatenate them with video latents along the sequence dimension, relying solely on 3D self-attention with no additional modules. Introduce a cross-video pairing strategy and a multi-stage training regimen. Result: Concat-ID outperforms existing methods in both single- and multi-identity generation and scales to multi-subject scenarios such as virtual try-on and background-controllable generation. Conclusion: Concat-ID sets a new benchmark for identity-preserving video synthesis and offers a versatile, scalable solution for a wide range of applications. Abstract: We present Concat-ID, a unified framework for identity-preserving video generation. Concat-ID employs Variational Autoencoders to extract image features, which are concatenated with video latents along the sequence dimension, leveraging solely 3D self-attention mechanisms without the need for additional modules. A novel cross-video pairing strategy and a multi-stage training regimen are introduced to balance identity consistency and facial editability while enhancing video naturalness. Extensive experiments demonstrate Concat-ID's superiority over existing methods in both single and multi-identity generation, as well as its seamless scalability to multi-subject scenarios, including virtual try-on and background-controllable generation. Concat-ID establishes a new benchmark for identity-preserving video synthesis, providing a versatile and scalable solution for a wide range of applications.
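
The idea of concatenating identity features with video latents along the sequence dimension can be sketched in a few lines. This is a toy illustration with hypothetical token values; it uses single-head attention with identity projections, and keeping only the video positions after attention is an assumption, not a detail confirmed by the abstract.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (illustration only)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, tokens)) for j in range(d)])
    return out

# Hypothetical latents: 3 video tokens and 1 reference-image token, each of width 2.
video_latents = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.4]]
identity_latents = [[1.0, 0.0]]

# Concatenate along the sequence axis so every video token can attend to the identity token.
sequence = video_latents + identity_latents
attended = self_attention(sequence)

# Assumption: only the video positions are kept for decoding.
video_out = attended[:len(video_latents)]
```

Because the identity tokens simply extend the sequence, no cross-attention layer or adapter module is needed in this formulation.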

RBFIM: Perceptual Quality Assessment for Compressed Point Clouds Using Radial Basis Function Interpolation

Zhang Chen,Shuai Wan,Siyu Ren,Fuzheng Yang,Mengting Yu,Junhui Hou

Task: Propose RBFIM, a new method for assessing perceptual distortion in compressed point clouds.

Motivation: Single-feature metrics currently used in point cloud compression fail to capture human perceptual features accurately, which makes distortion assessment inaccurate.

Details

Method: Use radial basis function (RBF) interpolation to convert discrete point features into a continuous feature function, establishing precise feature correspondences between the original and distorted point clouds. Result: Experiments on multiple subjective quality datasets show that RBFIM excels at human perception tasks. Conclusion: RBFIM significantly improves the accuracy of quality assessment for compressed point clouds and provides robust support for PCC optimization. Abstract: One of the main challenges in point cloud compression (PCC) is how to evaluate the perceived distortion so that the codec can be optimized for perceptual quality. Current standard practices in PCC highlight a primary issue: while single-feature metrics are widely used to assess compression distortion, the classic method of searching point-to-point nearest neighbors frequently fails to adequately build precise correspondences between point clouds, resulting in an ineffective capture of human perceptual features. To overcome the related limitations, we propose a novel assessment method called RBFIM, utilizing radial basis function (RBF) interpolation to convert discrete point features into a continuous feature function for the distorted point cloud. By substituting the geometry coordinates of the original point cloud into the feature function, we obtain the bijective sets of point features. This enables an establishment of precise corresponding features between distorted and original point clouds and significantly improves the accuracy of quality assessments. Moreover, this method avoids the complexity caused by bidirectional searches. Extensive experiments on multiple subjective quality datasets of compressed point clouds demonstrate that our RBFIM excels in addressing human perception tasks, thereby providing robust support for PCC optimization efforts.
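
The core step, fitting a continuous feature function on the distorted cloud and evaluating it at the original cloud's coordinates, can be sketched as follows. This is a minimal pure-Python illustration that assumes a Gaussian kernel and a single scalar feature per point; the kernel choice, the shape parameter `eps`, and the toy coordinates are hypothetical, not taken from the paper.

```python
import math

def gaussian_rbf(r, eps=1.0):
    """Gaussian radial basis function of distance r (eps is a hypothetical shape parameter)."""
    return math.exp(-(eps * r) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_feature_function(points, features, eps=1.0):
    """Return f(x) = sum_i w_i * phi(|x - p_i|) that interpolates the given features."""
    A = [[gaussian_rbf(math.dist(p, q), eps) for q in points] for p in points]
    weights = solve(A, features)
    def f(x):
        return sum(w * gaussian_rbf(math.dist(x, p), eps) for w, p in zip(weights, points))
    return f

# Toy distorted cloud: coordinates plus one scalar feature per point (hypothetical values).
distorted_xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
distorted_feat = [0.2, 0.4, 0.6, 0.8]
feature_fn = fit_feature_function(distorted_xyz, distorted_feat)

# Evaluate at the ORIGINAL cloud's coordinates: one matched feature per original point.
original_xyz = [(0.1, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 0.5, 0.5)]
matched_feat = [feature_fn(p) for p in original_xyz]
```

Each original point now has exactly one counterpart feature, so the two feature sets can be compared element-wise without a nearest-neighbor search in either direction.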

CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models

Yiqi Zhu,Ziyue Wang,Can Zhang,Peng Li,Yang Liu

Task: Evaluate the continuous space perception ability of Vision-Language Models (VLMs).

Motivation: Current benchmarks usually focus on spatially irrelevant or discrete images and underestimate the compositional characteristic of images captured from a static viewpoint, namely continuous space perception.

Details

Method: Present the CoSpace benchmark, with 2,918 images and 1,626 question-answer pairs covering seven task types, and evaluate 19 proprietary and open-source VLMs. Result: Most evaluated models, proprietary ones included, show pitfalls in continuous space perception. The main gap between open-source and proprietary models lies not in accuracy but in the consistency of responses. Conclusion: Enhancing continuous space perception is essential for VLMs to perform effectively in real-world tasks, and further research on this capability is encouraged. Abstract: Vision-Language Models (VLMs) have recently witnessed significant progress in visual comprehension. As the permitted length of image context grows, VLMs can now comprehend a broader range of views and spaces. Current benchmarks provide insightful analysis of VLMs in tasks involving complex visual instruction following, multi-image understanding and spatial reasoning. However, they usually focus on spatially irrelevant images or discrete images captured from varied viewpoints. The compositional characteristic of images captured from a static viewpoint remains underestimated. We term this characteristic Continuous Space Perception. Observing a scene from a static viewpoint while shifting orientations produces a series of spatially continuous images, enabling the reconstruction of the entire space. In this paper, we present CoSpace, a multi-image visual understanding benchmark designed to assess the Continuous Space perception ability for VLMs. CoSpace contains 2,918 images and 1,626 question-answer pairs, covering seven types of tasks. We conduct evaluation across 19 proprietary and open-source VLMs. Results reveal that there exist pitfalls on the continuous space perception ability for most of the evaluated models, including proprietary ones. Interestingly, we find that the main discrepancy between open-source and proprietary models lies not in accuracy but in the consistency of responses. We believe that enhancing the ability of continuous space perception is essential for VLMs to perform effectively in real-world tasks and encourage further research to advance this capability.

Lightweight Gradient-Aware Upscaling of 3D Gaussian Splatting Images

Simon Niedermayr,Christoph Neuhauser,Rüdiger Westermann

Task: Propose an image upscaling technique tailored for 3D Gaussian Splatting (3DGS) on lightweight GPUs.

Motivation: Increase 3DGS rendering speed and reduce artifacts commonly observed in 3DGS reconstructions.

Details

Method: Use the analytical image gradients of Gaussians for gradient-based bicubic spline interpolation, upscaling low-resolution 3DGS renderings at marginal extra cost. Result: Experiments on multiple datasets show novel view synthesis at rates 3x-4x higher than the baseline while maintaining high reconstruction fidelity. Conclusion: Gradient-aware upscaling can be integrated into the gradient-based optimization of a 3DGS model, and the authors analyze its effects on reconstruction quality and performance. Abstract: We introduce an image upscaling technique tailored for 3D Gaussian Splatting (3DGS) on lightweight GPUs. Compared to 3DGS, it achieves significantly higher rendering speeds and reduces artifacts commonly observed in 3DGS reconstructions. Our technique upscales low-resolution 3DGS renderings with a marginal increase in cost by directly leveraging the analytical image gradients of Gaussians for gradient-based bicubic spline interpolation. The technique is agnostic to the specific 3DGS implementation, achieving novel view synthesis at rates 3x-4x higher than the baseline implementation. Through extensive experiments on multiple datasets, we showcase the performance improvements and high reconstruction fidelity attainable with gradient-aware upscaling of 3DGS images. We further demonstrate the integration of gradient-aware upscaling into the gradient-based optimization of a 3DGS model and analyze its effects on reconstruction quality and performance.
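
Why gradients help upscaling can be seen in one dimension. The sketch below uses cubic Hermite interpolation, which takes both sample values and derivatives at the low-resolution positions. It is a 1D simplification of gradient-based bicubic interpolation with unit sample spacing, not the paper's implementation, and the function names are hypothetical.

```python
def hermite(p0, m0, p1, m1, t):
    """Cubic Hermite interpolation on [0, 1] from endpoint values (p) and derivatives (m)."""
    t2, t3 = t * t, t * t * t
    return ((2 * t3 - 3 * t2 + 1) * p0 + (t3 - 2 * t2 + t) * m0
            + (-2 * t3 + 3 * t2) * p1 + (t3 - t2) * m1)

def upscale_row(values, grads, factor):
    """Upscale a 1D row of samples (unit spacing) using known per-sample derivatives."""
    out = []
    for i in range(len(values) - 1):
        for k in range(factor):
            out.append(hermite(values[i], grads[i], values[i + 1], grads[i + 1], k / factor))
    out.append(values[-1])
    return out

# Samples of f(x) = x^2 at x = 0, 1, 2 with exact derivatives f'(x) = 2x.
low_res = [0.0, 1.0, 4.0]
low_res_grad = [0.0, 2.0, 4.0]
high_res = upscale_row(low_res, low_res_grad, 2)
```

With exact derivatives the interpolant reproduces the quadratic at the new half-integer positions, whereas interpolation from values alone would have to estimate those derivatives from neighboring samples.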

RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images

Junjin Xiao,Qing Zhang,Yonewei Nie,Lei Zhu,Wei-Shi Zheng

Task: Synthesize high-fidelity novel views of unseen humans from sparse multi-view images.

Motivation: Address the shortcomings of existing methods when views are sparse and human geometry is complex.

Details

Method: Lift SMPL vertices to dense, reliable 3D prior points that represent accurate human body geometry, and regress human Gaussian parameters from these points. Predict image-aligned 3D prior points from pixel-level and voxel-level features and regress coarse Gaussians from them. Render depth maps from the coarse 3D Gaussians to help regress fine-grained pixel-wise Gaussians that capture high-frequency details. Result: Experiments on several benchmark datasets show that the method outperforms existing methods in novel view synthesis and cross-dataset generalization. Conclusion: RoGSplat handles sparse views and complex human geometry well, with high fidelity and strong generalization. Abstract: This paper presents RoGSplat, a novel approach for synthesizing high-fidelity novel views of unseen humans from sparse multi-view images, while requiring no cumbersome per-subject optimization. Unlike previous methods that typically struggle with sparse views with little overlap and are less effective in reconstructing complex human geometry, the proposed method enables robust reconstruction in such challenging conditions. Our key idea is to lift SMPL vertices to dense and reliable 3D prior points representing accurate human body geometry, and then regress human Gaussian parameters based on the points. To account for possible misalignment between the SMPL model and images, we propose to predict image-aligned 3D prior points by leveraging both pixel-level features and voxel-level features, from which we regress the coarse Gaussians. To enhance the ability to capture high-frequency details, we further render depth maps from the coarse 3D Gaussians to help regress fine-grained pixel-wise Gaussians. Experiments on several benchmark datasets demonstrate that our method outperforms state-of-the-art methods in novel view synthesis and cross-dataset generalization. Our code is available at https://github.com/iSEE-Laboratory/RoGSplat.

AI-Driven Diabetic Retinopathy Diagnosis Enhancement through Image Processing and Salp Swarm Algorithm-Optimized Ensemble Network

Saif Ur Rehman Khan,Muhammad Nabeel Asim,Sebastian Vollmer,Andreas Dengel

Task: Propose an effective ensemble method for diabetic retinopathy (DR) diagnosis.

Motivation: Traditional diagnostic methods are time-consuming and error-prone. Deep learning offers innovative solutions, but single models struggle to extract key features from complex retinal images.

Details

Method: Propose an ensemble method with four main phases: image pre-processing, selection of pre-trained backbone models, feature enhancement, and optimization. Specifically, CLAHE enhances image contrast, Gamma correction adjusts brightness, Discrete Wavelet Transform (DWT) performs image fusion, three pre-trained models (DenseNet169, MobileNetV1, and Xception) extract features with an improved residual block integrated into each, and the Salp Swarm Algorithm (SSA) optimizes the ensemble weights. Result: The model reaches 88.52% accuracy on the multiclass Kaggle APTOS 2019 dataset. Conclusion: The proposed ensemble method performs well for diabetic retinopathy diagnosis, with high accuracy. Abstract: Diabetic retinopathy is a leading cause of blindness in diabetic patients and early detection plays a crucial role in preventing vision loss. Traditional diagnostic methods are often time-consuming and prone to errors. The emergence of deep learning techniques has provided innovative solutions to improve diagnostic efficiency. However, single deep learning models frequently face issues related to extracting key features from complex retinal images. To handle this problem, we present an effective ensemble method for DR diagnosis comprising four main phases: image pre-processing, selection of backbone pre-trained models, feature enhancement, and optimization. Our methodology begins with the pre-processing phase, where we apply CLAHE to enhance image contrast and then use Gamma correction to adjust the brightness for better feature recognition. We then apply Discrete Wavelet Transform (DWT) for image fusion by combining multi-resolution details to create a richer dataset. Next, we select the three best-performing pre-trained models, DenseNet169, MobileNetV1, and Xception, for diverse feature extraction. To further improve feature extraction, an improved residual block is integrated into each model. Finally, the predictions from these base models are aggregated using a weighted ensemble approach, with the weights optimized by the Salp Swarm Algorithm (SSA). SSA intelligently explores the weight space and finds the optimal configuration of base architectures to maximize the performance of the ensemble model. The proposed model is evaluated on the multiclass Kaggle APTOS 2019 dataset and obtained 88.52% accuracy.
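
The final aggregation step, a weighted ensemble over base-model predictions, can be sketched as below. The probabilities and weights are hypothetical placeholders, and the Salp Swarm search itself is not implemented here; the weights would come from whatever optimizer is used.

```python
def weighted_ensemble(prob_lists, weights):
    """Fuse per-model class probabilities with normalized weights; return (fused, argmax)."""
    total = sum(weights)
    norm = [w / total for w in weights]
    n_classes = len(prob_lists[0])
    fused = [sum(w * probs[c] for w, probs in zip(norm, prob_lists))
             for c in range(n_classes)]
    predicted = max(range(n_classes), key=fused.__getitem__)
    return fused, predicted

# Hypothetical softmax outputs of three base models over five DR grades (0-4).
model_probs = [
    [0.6, 0.2, 0.1, 0.05, 0.05],
    [0.3, 0.5, 0.1, 0.05, 0.05],
    [0.2, 0.6, 0.1, 0.05, 0.05],
]

# Placeholder weights standing in for an optimizer's result.
fused_opt, pred_opt = weighted_ensemble(model_probs, [0.5, 0.3, 0.2])
# Equal weights for comparison.
fused_eq, pred_eq = weighted_ensemble(model_probs, [1.0, 1.0, 1.0])
```

In this toy case the two weightings disagree on the predicted grade, which is the reason the weights are worth optimizing on a validation set.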

Segmentation-Guided Neural Radiance Fields for Novel Street View Synthesis

Yizhou Li,Yusuke Monno,Masatoshi Okutomi,Yuuichi Tanaka,Seiichi Kataoka,Teruaki Kosiba

Task: Propose a segmentation-guided enhancement to NeRF for outdoor street scenes, focusing on complex urban environments.

Motivation: Extending NeRF to large-scale outdoor environments faces challenges such as transient objects, sparse cameras and textures, and varying lighting conditions.

Details

Method: Extend ZipNeRF and use Grounded SAM to generate segmentation masks, enabling handling of transient objects, modeling of the sky, and regularization of the ground. Introduce appearance embeddings to adapt to inconsistent lighting across view sequences. Result: Experiments show that the method outperforms the baseline ZipNeRF, improving novel view synthesis quality with fewer artifacts and sharper details. Conclusion: The proposed segmentation-guided NeRF enhancement performs well in complex urban environments and handles the challenges of outdoor street scenes effectively. Abstract: Recent advances in Neural Radiance Fields (NeRF) have shown great potential in 3D reconstruction and novel view synthesis, particularly for indoor and small-scale scenes. However, extending NeRF to large-scale outdoor environments presents challenges such as transient objects, sparse cameras and textures, and varying lighting conditions. In this paper, we propose a segmentation-guided enhancement to NeRF for outdoor street scenes, focusing on complex urban environments. Our approach extends ZipNeRF and utilizes Grounded SAM for segmentation mask generation, enabling effective handling of transient objects, modeling of the sky, and regularization of the ground. We also introduce appearance embeddings to adapt to inconsistent lighting across view sequences. Experimental results demonstrate that our method outperforms the baseline ZipNeRF, improving novel view synthesis quality with fewer artifacts and sharper details.
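
One common way to use a segmentation mask against transient objects is to drop the masked pixels from the photometric loss. The sketch below illustrates that idea only; the abstract does not state how the masks enter the training objective, so treat this as an assumption, and the function name is hypothetical.

```python
def masked_mse(pred, target, transient_mask):
    """Mean squared error over pixels that are NOT flagged as transient."""
    total = 0.0
    count = 0
    for p, t, is_transient in zip(pred, target, transient_mask):
        if not is_transient:
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0

# Three toy pixel intensities; the last pixel lies on a moving car (hypothetical mask).
rendered = [0.5, 0.2, 0.9]
observed = [0.5, 0.4, 0.1]
mask = [False, False, True]

loss = masked_mse(rendered, observed, mask)
loss_unmasked = masked_mse(rendered, observed, [False, False, False])
```

Excluding the transient pixel keeps its large error from pulling the static scene representation toward an object that is absent in other views.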

Panoramic Distortion-Aware Tokenization for Person Detection and Localization Using Transformers in Overhead Fisheye Images

Nobuhiko Wakai,Satoshi Sato,Yasunori Ishii,Takayoshi Yamashita

Task: Propose a method for accurately detecting and localizing persons in overhead fisheye images.

Motivation: Accurate person detection in fisheye images remains challenging because of factors such as person rotation and small-sized persons.

Details

Method: Convert fisheye images into panoramic images and introduce panoramic distortion-aware tokenization, combining panoramic-image remapping with the tokenization procedure for person detection and localization. Result: Experiments show that the method outperforms conventional methods on large-scale datasets. Conclusion: The proposed method achieves higher accuracy in person detection and localization. Abstract: Person detection methods are used widely in applications including visual surveillance, pedestrian detection, and robotics. However, accurate detection of persons from overhead fisheye images remains an open challenge because of factors including person rotation and small-sized persons. To address the person rotation problem, we convert the fisheye images into panoramic images. For smaller people, we focus on the geometry of the panoramas. Conventional detection methods tend to focus on larger people because these larger people yield large significant areas for feature maps. In equirectangular panoramic images, we find that a person's height decreases linearly near the top of the images. Using this finding, we leverage the significance values and aggregate tokens that are sorted based on these values to balance the significant areas. In this leveraging process, we introduce panoramic distortion-aware tokenization. This tokenization procedure divides a panoramic image using self-similarity figures that enable determination of optimal divisions without gaps, and we leverage the maximum significant values in each tile of token groups to preserve the significant areas of smaller people. To achieve higher detection accuracy, we propose a person detection and localization method that combines panoramic-image remapping and the tokenization procedure. Extensive experiments demonstrated that our method outperforms conventional methods when applied to large-scale datasets.
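
The fisheye-to-panorama conversion can be illustrated with a simple polar unwrapping: each panorama column corresponds to an azimuth angle and each row to a radius in the fisheye image. The sketch assumes a linear radius mapping and that the top panorama row maps to the fisheye periphery; both are assumptions for illustration, and the paper's remapping may differ.

```python
import math

def panorama_to_fisheye(u, v, pano_w, pano_h, cx, cy, r_max):
    """Map panorama pixel (u, v) to fisheye coordinates (x, y) by polar unwrapping."""
    theta = 2.0 * math.pi * u / pano_w      # column -> azimuth around the image centre
    r = r_max * (1.0 - v / pano_h)          # assumption: row 0 -> periphery, last row -> centre
    return cx + r * math.cos(theta), cy + r * math.sin(theta)

# Hypothetical sizes: 360x100 panorama, fisheye centre (50, 50), usable radius 40 px.
top_left = panorama_to_fisheye(0, 0, 360, 100, 50.0, 50.0, 40.0)
quarter_turn = panorama_to_fisheye(90, 0, 360, 100, 50.0, 50.0, 40.0)
bottom = panorama_to_fisheye(180, 100, 360, 100, 50.0, 50.0, 40.0)
```

Sampling the fisheye image at these coordinates for every panorama pixel yields an image in which people stand roughly upright, which removes the rotation problem for the detector.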

Multi-task Learning for Identification of Porcelain in Song and Yuan Dynasties

Ziyao Ling,Giovanni Delnevo,Paola Salomoni,Silvia Mirri

Task: Use deep learning and transfer learning to automate the classification of Chinese porcelain artifacts across four key attributes: dynasty, glaze, ware, and type.

Motivation: Traditional classification relies on expert analysis, which is time-consuming, subjective, and hard to scale, so automated methods are needed to improve efficiency and accuracy.

Details

Method: Evaluate four convolutional neural networks (ResNet50, MobileNetV2, VGG16, and InceptionV3), comparing their performance with and without pre-trained weights. Result: Transfer learning significantly improves classification accuracy, especially on complex tasks such as type classification. MobileNetV2 and ResNet50 achieve high accuracy and robustness across all tasks, while VGG16 struggles with more diverse classification tasks. Conclusion: Transfer learning offers clear advantages for porcelain classification; future directions include domain-specific pre-training, integration of attention mechanisms, explainable AI methods, and generalization to other cultural artifacts. Abstract: Chinese porcelain holds immense historical and cultural value, making its accurate classification essential for archaeological research and cultural heritage preservation. Traditional classification methods rely heavily on expert analysis, which is time-consuming, subjective, and difficult to scale. This paper explores the application of deep learning (DL) and transfer learning techniques to automate the classification of porcelain artifacts across four key attributes: dynasty, glaze, ware, and type. We evaluate four Convolutional Neural Networks (CNNs) - ResNet50, MobileNetV2, VGG16, and InceptionV3 - comparing their performance with and without pre-trained weights. Our results demonstrate that transfer learning significantly enhances classification accuracy, particularly for complex tasks like type classification, where models trained from scratch exhibit lower performance. MobileNetV2 and ResNet50 consistently achieve high accuracy and robustness across all tasks, while VGG16 struggles with more diverse classifications. We further discuss the impact of dataset limitations and propose future directions, including domain-specific pre-training, integration of attention mechanisms, explainable AI methods, and generalization to other cultural artifacts.

CRCE: Coreference-Retention Concept Erasure in Text-to-Image Diffusion Models

Yuyang Xue,Edward Moroshko,Feng Chen,Steven McDonagh,Sotirios A. Tsaftaris

Task: Propose CRCE, a new concept erasure framework for precisely erasing target concepts in text-to-image diffusion models.

Motivation: Existing concept erasure methods suffer from under-erasure and over-erasure and cannot remove the target concept precisely.

Details

Method: Use Large Language Models to identify concepts that are semantically related to the target as well as distinct concepts that should be preserved; explicitly modeling coreferential and retained concepts enables more precise erasure. Result: Experiments show that CRCE outperforms existing methods on diverse erasure tasks. Conclusion: CRCE removes target concepts more precisely without causing unintended erasure. Abstract: Text-to-Image diffusion models can produce undesirable content that necessitates concept erasure techniques. However, existing methods struggle with under-erasure, leaving residual traces of targeted concepts, or over-erasure, mistakenly eliminating unrelated but visually similar concepts. To address these limitations, we introduce CRCE, a novel concept erasure framework that leverages Large Language Models to identify both semantically related concepts that should be erased alongside the target and distinct concepts that should be preserved. By explicitly modeling coreferential and retained concepts semantically, CRCE enables more precise concept removal, without unintended erasure. Experiments demonstrate that CRCE outperforms existing methods on diverse erasure tasks.

Make Your Training Flexible: Towards Deployment-Efficient Video Models

Chenting Wang,Kunchang Li,Tianxiang Jiang,Xiangyu Zeng,Yi Wang,Limin Wang

Task: Propose a new test setting, Token Optimization, that maximizes input information across different budgets.

Motivation: Existing video training methods sample a fixed number of tokens from a predetermined spatiotemporal grid, which gives sub-optimal accuracy-computation trade-offs and lacks adaptability to the varying computational budgets of downstream tasks.

Details

Method: Propose Flux, a new augmentation tool that makes the sampling grid flexible and leverages token selection to boost model robustness. Result: With Flux integrated into large-scale video pre-training, FluxViT sets new state-of-the-art results on many tasks at standard cost. With only 1/4 of the tokens, it still matches the performance of previous state-of-the-art models, saving nearly 90% of the computation. Conclusion: Through Token Optimization and Flux, FluxViT significantly improves the efficiency and performance of video training and suits real-world scenarios. Abstract: Popular video training methods mainly operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid, resulting in sub-optimal accuracy-computation trade-offs due to inherent video redundancy. They also lack adaptability to varying computational budgets for downstream tasks, hindering applications of the most competitive model in real-world scenes. We thus propose a new test setting, Token Optimization, for maximized input information across budgets, which optimizes the size-limited set of input tokens through token selection from more suitably sampled videos. To this end, we propose a novel augmentation tool termed Flux. By making the sampling grid flexible and leveraging token selection, it is easily adopted in most popular video training frameworks, boosting model robustness with nearly no additional cost. We integrate Flux in large-scale video pre-training, and the resulting FluxViT establishes new state-of-the-art results across extensive tasks at standard costs. Notably, with only 1/4 of the tokens, it can still match the performance of previous state-of-the-art models with Token Optimization, yielding nearly 90% savings. All models and data are available at https://github.com/OpenGVLab/FluxViT.

Deep Unsupervised Segmentation of Log Point Clouds

Fedor Zolotarev,Tuomas Eerola,Tomi Kauppi

Task: Propose a Point Transformer-based point cloud segmentation technique for analyzing surface point clouds of wooden logs.

Motivation: In sawmills, accurate measurement of logs is essential for optimizing the sawing process. Earlier studies show that the inner structure of a log can be predicted accurately from surface point clouds produced by a laser scanner alone, which offers a cost-efficient and fast alternative to X-ray CT-based measurement devices.

Details

Method: Propose a Point Transformer-based point cloud segmentation technique that learns, in an unsupervised manner, to find the points belonging to the log surface. This is achieved with a loss function that uses the geometrical properties of a cylinder while accounting for the shape variation common in timber logs. Result: The method's accuracy is demonstrated on wooden logs, and the technique could also be used on other cylindrical objects. Conclusion: The proposed method performs well for point cloud segmentation of wooden logs and has broad application potential. Abstract: In sawmills, it is essential to accurately measure the raw material, i.e. wooden logs, to optimise the sawing process. Earlier studies have shown that accurate predictions of the inner structure of the logs can be obtained using just surface point clouds produced by a laser scanner. This provides a cost-efficient and fast alternative to the X-ray CT-based measurement devices. An essential step in analysing log point clouds is segmentation, as it forms the basis for finding the fine surface details that provide the cues about the inner structure of the log. We propose a novel Point Transformer-based point cloud segmentation technique that learns to find the points belonging to the log surface in an unsupervised manner. This is obtained using a loss function that utilises the geometrical properties of a cylinder while taking into account the shape variation common in timber logs. We demonstrate the accuracy of the method on wooden logs, but the approach could be utilised also on other cylindrical objects.
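
A loss built on cylinder geometry can be illustrated with a radial residual: points on an ideal log surface lie at a fixed distance from the axis. The sketch below assumes a z-aligned axis and a fixed radius, and it ignores the shape-variation terms that the abstract mentions; it is not the paper's loss function.

```python
import math

def cylinder_residual(points, is_surface, cx, cy, radius):
    """Mean squared radial deviation of points labelled as surface (axis parallel to z)."""
    residuals = [
        (math.hypot(x - cx, y - cy) - radius) ** 2
        for (x, y, z), keep in zip(points, is_surface)
        if keep
    ]
    return sum(residuals) / len(residuals) if residuals else 0.0

# Three points on a unit cylinder plus one outlier (for example, a stray measurement).
cloud = [(1.0, 0.0, 0.0), (0.0, 1.0, 5.0), (-1.0, 0.0, 2.0), (3.0, 0.0, 1.0)]

good_labels = [True, True, True, False]   # outlier excluded from the surface
bad_labels = [True, True, True, True]     # outlier wrongly labelled as surface

loss_good = cylinder_residual(cloud, good_labels, 0.0, 0.0, 1.0)
loss_bad = cylinder_residual(cloud, bad_labels, 0.0, 0.0, 1.0)
```

A segmentation that includes the outlier is penalized, so minimizing such a residual pushes the labels toward the cylindrical surface without any manual annotation.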

CTSR: Controllable Fidelity-Realness Trade-off Distillation for Real-World Image Super Resolution

Runyi Li,Bin Chen,Jian Zhang,Radu Timofte

Task: Propose a distillation-based method that uses geometric decomposition and the strengths of multiple teacher models to strike a more balanced trade-off between realness and fidelity.

Motivation: Existing diffusion-based methods excel in visual realness but struggle to balance fidelity and realness. Preliminary experiments show that a linear combination of multiple models outperforms individual models.

Details

Method: Propose a distillation-based approach that leverages the geometric decomposition of fidelity and realness, together with the performance advantages of multiple teacher models, to reach a more balanced trade-off. Result: On several real-world image super-resolution benchmarks, the method surpasses existing state-of-the-art approaches on both fidelity and realness metrics. Conclusion: CTSR allows the realness-fidelity trade-off of the super-resolution process to be adjusted flexibly and achieves superior performance. Abstract: Real-world image super-resolution is a critical image processing task, where two key evaluation criteria are the fidelity to the original image and the visual realness of the generated results. Although existing methods based on diffusion models excel in visual realness by leveraging strong priors, they often struggle to achieve an effective balance between fidelity and realness. In our preliminary experiments, we observe that a linear combination of multiple models outperforms individual models, motivating us to harness the strengths of different models for a more effective trade-off. Based on this insight, we propose a distillation-based approach that leverages the geometric decomposition of both fidelity and realness, alongside the performance advantages of multiple teacher models, to strike a more balanced trade-off. Furthermore, we explore the controllability of this trade-off, enabling a flexible and adjustable super-resolution process, which we call CTSR (Controllable Trade-off Super-Resolution). Experiments conducted on several real-world image super-resolution benchmarks demonstrate that our method surpasses existing state-of-the-art approaches, achieving superior performance across both fidelity and realness metrics.
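
The preliminary observation about linear combinations of models suggests a simple picture of a controllable trade-off: blend the outputs of a fidelity-oriented model and a realness-oriented model with one scalar. The sketch shows only this output-space blend, with hypothetical names and values; the paper's distillation procedure is not reproduced.

```python
def blend_outputs(fidelity_out, realness_out, alpha):
    """Linear blend of two model outputs; alpha=0 -> pure fidelity, alpha=1 -> pure realness."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * f + alpha * r for f, r in zip(fidelity_out, realness_out)]

# Hypothetical pixel values from two super-resolution models for the same input.
fidelity_pixels = [0.0, 1.0]
realness_pixels = [1.0, 0.0]

mostly_fidelity = blend_outputs(fidelity_pixels, realness_pixels, 0.25)
balanced = blend_outputs(fidelity_pixels, realness_pixels, 0.5)
```

Sweeping `alpha` traces a path between the two operating points, which is the kind of user-adjustable control that the abstract describes at a higher level.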

Manual Labelling Artificially Inflates Deep Learning-Based Segmentation Performance on Closed Canopy: Validation Using TLS

Matthew J. Allen,Harry J. F. Owen,Stuart W. D. Grieve,Emily R. Lines

Task: Assess deep learning-based individual tree crown (ITC) segmentation on drone-acquired RGB imagery, using validation labels derived from Terrestrial Laser Scanning (TLS).

Motivation: Traditional field-based forest inventories are labor-intensive and limited in spatial coverage, and existing ITC segmentation methods are often validated only against human-annotated images, without rigorous independent ground truth.

Details

Method: Generate high-fidelity validation labels from TLS data and evaluate two deep learning models, DeepForest (RetinaNet) and Detectree2 (Mask R-CNN), on data from mixed unmanaged boreal and Mediterranean forests. Result: Against TLS-derived ground truth from Mediterranean forests, model performance drops significantly compared with hand-labelled data from an ecologically similar site (AP50: 0.094 vs. 0.670). Evaluating on canopy trees only narrows the gap (Canopy AP50: 0.365), but performance remains far below that on similar hand-labelled data. Models also perform poorly on boreal forest data (AP50: 0.142), improving when evaluated on canopy trees only (Canopy AP50: 0.308). Conclusion: Aerial-based segmentation approaches face fundamental limitations in closed canopy forests; even when evaluated on canopy trees only, localisation accuracy remains very low (Max AP75: 0.051). Abstract: Monitoring forest dynamics at an individual tree scale is essential for accurately assessing ecosystem responses to climate change, yet traditional methods relying on field-based forest inventories are labor-intensive and limited in spatial coverage. Advances in remote sensing using drone-acquired RGB imagery combined with deep learning models have promised precise individual tree crown (ITC) segmentation; however, existing methods are frequently validated against human-annotated images, lacking rigorous independent ground truth. In this study, we generate high-fidelity validation labels from co-located Terrestrial Laser Scanning (TLS) data for drone imagery of mixed unmanaged boreal and Mediterranean forests. We evaluate the performance of two widely used deep learning ITC segmentation models - DeepForest (RetinaNet) and Detectree2 (Mask R-CNN) - on these data, and compare to performance on further Mediterranean forest data labelled manually. When validated against TLS-derived ground truth from Mediterranean forests, model performance decreased significantly compared to assessment based on hand-labelled data from an ecologically similar site (AP50: 0.094 vs. 0.670). Restricting evaluation to only canopy trees shrank this gap considerably (Canopy AP50: 0.365), although performance was still far lower than on similar hand-labelled data. Models also performed poorly on boreal forest data (AP50: 0.142), although again increasing when evaluated on canopy trees only (Canopy AP50: 0.308). Both models showed very poor localisation accuracy at stricter IoU thresholds, even when restricted to canopy trees (Max AP75: 0.051). Similar results have been observed in studies using aerial LiDAR data, suggesting fundamental limitations in aerial-based segmentation approaches in closed canopy forests.
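
The difference between AP50 and AP75 comes down to the IoU threshold at which a prediction counts as a match. The sketch below computes box IoU and counts true positives at both thresholds with a greedy one-to-one matching. It uses axis-aligned boxes and toy coordinates for brevity; it does not compute full average precision and is not the evaluation code of the study.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def true_positives(preds, truths, threshold):
    """Greedy one-to-one matching; preds are assumed sorted by descending confidence."""
    unmatched = list(truths)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= threshold:
            unmatched.remove(best)
            tp += 1
    return tp

# One reference crown and one prediction that covers 60% of it (toy coordinates).
reference = [(0, 0, 10, 10)]
predicted = [(0, 0, 10, 6)]

overlap = iou(predicted[0], reference[0])
tp_at_50 = true_positives(predicted, reference, 0.50)
tp_at_75 = true_positives(predicted, reference, 0.75)
```

The same prediction is a hit at IoU 0.5 and a miss at IoU 0.75, which is why loosely localised crowns can produce a moderate AP50 together with a very low AP75.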

Improving Adaptive Density Control for 3D Gaussian Splatting

Glenn Grubert,Florian Barthel,Anna Hilsmann,Peter Eisert

Task: Improve the adaptive density control mechanism of 3D Gaussian Splatting (3DGS) to raise rendering quality and training efficiency.

Motivation: 3DGS struggles to manage the number of Gaussian primitives during scene reconstruction. Existing densification and pruning criteria can degrade rendering quality, especially through under-reconstructed backgrounds or overfitted foregrounds.

Details

Method: Propose three improvements: a corrected scene extent calculation, an exponentially ascending gradient threshold, and a significance-aware pruning strategy. Result: The improved method raises rendering quality with the same number of Gaussian primitives and converges considerably faster, training more than twice as fast. Conclusion: The improvements are compatible with most existing derivative works of 3DGS, which makes them relevant for future work. Abstract: 3D Gaussian Splatting (3DGS) has become one of the most influential works in the past year. Due to its efficient and high-quality novel view synthesis capabilities, it has been widely adopted in many research fields and applications. Nevertheless, 3DGS still faces challenges to properly manage the number of Gaussian primitives that are used during scene reconstruction. Following the adaptive density control (ADC) mechanism of 3D Gaussian Splatting, new Gaussians in under-reconstructed regions are created, while Gaussians that do not contribute to the rendering quality are pruned. We observe that those criteria for densifying and pruning Gaussians can sometimes lead to worse rendering by introducing artifacts. We especially observe under-reconstructed background or overfitted foreground regions. To counter both problems, we propose three new improvements to the adaptive density control mechanism. Those include a correction for the scene extent calculation that does not only rely on camera positions, an exponentially ascending gradient threshold to improve training convergence, and a significance-aware pruning strategy to avoid background artifacts. With these adaptations, we show that the rendering quality improves while using the same number of Gaussian primitives. Furthermore, with our improvements, the training converges considerably faster, allowing for more than twice as fast training times while yielding better quality than 3DGS. Finally, our contributions are easily compatible with most existing derivative works of 3DGS making them relevant for future works.
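
An exponentially ascending gradient threshold can be written as a geometric interpolation between a start and an end value over the densification schedule. The sketch below is one plausible form; the start and end values and the schedule length are hypothetical, and the paper's exact formula may differ.

```python
def gradient_threshold(step, total_steps, start=2e-4, end=1e-3):
    """Densification threshold that rises exponentially from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)   # clamp progress to [0, 1]
    return start * (end / start) ** frac

# Threshold at a few points of a hypothetical 15,000-step densification phase.
schedule = [gradient_threshold(s, 15000) for s in (0, 5000, 10000, 15000)]
```

A low threshold early on lets many Gaussians split while the scene is coarse; raising it later makes densification increasingly selective, which limits the growth of the primitive count.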

Free-Lunch Color-Texture Disentanglement for Stylized Image Generation

Jiang Qin,Senmao Li,Alexandra Gomez-Villa,Shiqi Yang,Yaxing Wang,Kai Wang,Joost van de Weijer

Task: Propose a tuning-free method for color-texture disentanglement in stylized text-to-image generation.

Motivation: Current diffusion-based methods struggle to control multiple style attributes such as color and texture, which prevents fine-grained style customization.

Details

Method: Exploit the Image-Prompt Additivity property of the CLIP image embedding space to separate and extract color-texture embeddings from individual color and texture reference images. Apply a whitening and coloring transformation to improve color consistency, and add a noise term to preserve texture fidelity. Result: Experiments on the WikiArt and StyleDrop datasets show that SADis outperforms existing methods both qualitatively and quantitatively in stylized image generation. Conclusion: SADis offers a more precise and customizable solution for stylized image generation. Abstract: Recent advances in Text-to-Image (T2I) diffusion models have transformed image generation, enabling significant progress in stylized generation using only a few style reference images. However, current diffusion-based methods struggle with fine-grained style customization due to challenges in controlling multiple style attributes, such as color and texture. This paper introduces the first tuning-free approach to achieve free-lunch color-texture disentanglement in stylized T2I generation, addressing the need for independently controlled style elements for the Disentangled Stylized Image Generation (DisIG) problem. Our approach leverages the Image-Prompt Additivity property in the CLIP image embedding space to develop techniques for separating and extracting Color-Texture Embeddings (CTE) from individual color and texture reference images. To ensure that the color palette of the generated image aligns closely with the color reference, we apply a whitening and coloring transformation to enhance color consistency. Additionally, to prevent texture loss due to the signal-leak bias inherent in diffusion training, we introduce a noise term that preserves textural fidelity during the Regularized Whitening and Coloring Transformation (RegWCT). Through these methods, our Style Attributes Disentanglement approach (SADis) delivers a more precise and customizable solution for stylized image generation. Experiments on images from the WikiArt and StyleDrop datasets demonstrate that, both qualitatively and quantitatively, SADis surpasses state-of-the-art stylization methods in the DisIG task.
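
The whitening and coloring idea can be illustrated per channel: first normalize the content statistics (whitening), then impose the reference statistics (coloring). The sketch below is a diagonal-covariance simplification that ignores cross-channel correlation; a full WCT uses covariance matrices, and the regularized variant with a noise term described in the abstract is not shown.

```python
import statistics

def whiten_color_channel(content, reference, eps=1e-8):
    """Per-channel whitening (zero mean, unit std) followed by coloring with reference stats."""
    mean_c = statistics.fmean(content)
    std_c = statistics.pstdev(content)
    mean_r = statistics.fmean(reference)
    std_r = statistics.pstdev(reference)
    whitened = [(x - mean_c) / (std_c + eps) for x in content]
    return [w * std_r + mean_r for w in whitened]

# Toy single-channel values (hypothetical): content is dark, reference is bright.
content_channel = [0.0, 1.0, 2.0, 3.0]
reference_channel = [10.0, 20.0, 30.0, 40.0]

recolored = whiten_color_channel(content_channel, reference_channel)
```

After the transform the channel keeps the spatial ordering of the content values but carries the mean and spread of the reference, which is the sense in which the color statistics are transferred.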

Towards synthetic generation of realistic wooden logs

Fedor Zolotarev,Borek Reich,Tuomas Eerola,Tomi Kauppi,Pavel Zemcik

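Task: Propose a novel method for synthesizing realistic 3D representations of wooden logs.

Motivation: Efficient sawmilling depends on accurate measurement of logs and of the knot distribution inside them. CT scanning yields accurate knot information but is often infeasible in a sawmill environment. Predicting a log's inner structure from surface measurements with machine learning is a promising alternative, but obtaining enough training data remains a challenge.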

Details

Method: Focuses on two aspects of log generation: modeling knot growth inside the tree, and realistic surface synthesis, including the regions where knots reach the surface. Result: The proposed mathematical log model accurately fits real data obtained from CT scans and generates realistic logs. Conclusion: This is the first log synthesis approach capable of generating both the internal knot structure and the external surface structure of wood. Abstract: In this work, we propose a novel method to synthetically generate realistic 3D representations of wooden logs. Efficient sawmilling heavily relies on accurate measurement of logs and the distribution of knots inside them. Computed Tomography (CT) can be used to obtain accurate information about the knots but is often not feasible in a sawmill environment. A promising alternative is to utilize surface measurements and machine learning techniques to predict the inner structure of the logs. However, obtaining enough training data remains a challenge. We focus mainly on two aspects of log generation: the modeling of knot growth inside the tree, and the realistic synthesis of the surface, including the regions where the knots reach the surface. This results in the first log synthesis approach capable of generating both the internal knot and external surface structures of wood. We demonstrate that the proposed mathematical log model accurately fits real data obtained from CT scans and enables the generation of realistic logs.

PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation

Baiqin Wang,Xiangyu Zhu,Fan Shen,Hao Xu,Zhen Lei

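Task: Improve lip-audio synchronization and emotion control in audio-driven talking face generation.

Motivation: Current methods lack sufficient control over facial animation, such as speaking style and emotional expression, which leads to uniform outputs.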

Details

Method: Proposes PC-Talk, a novel framework that achieves lip-audio alignment and emotion control through implicit keypoint deformations. Result: Demonstrates outstanding control capabilities and achieves state-of-the-art performance on the HDTF and MEAD datasets. Conclusion: PC-Talk significantly improves the diversity and user-friendliness of talking videos. Abstract: Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over facial animation such as speaking style and emotional expression, resulting in uniform outputs. In this paper, we focus on improving two key factors: lip-audio alignment and emotion control, to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control focuses on elements like speaking style and the scale of lip movements, whereas emotion control is centered on generating realistic emotional expressions, allowing for modifications in multiple attributes such as intensity. To achieve precise control of facial animation, we propose a novel framework, PC-Talk, which enables lip-audio alignment and emotion control through implicit keypoint deformations. First, our lip-audio alignment control module facilitates precise editing of speaking styles at the word level and adjusts lip movement scales to simulate varying vocal loudness levels, maintaining lip synchronization with the audio. Second, our emotion control module generates vivid emotional facial features with pure emotional deformation. This module also enables the fine modification of intensity and the combination of multiple emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves state-of-the-art performance on both HDTF and MEAD datasets in extensive experiments.

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Wei Song,Yuran Wang,Zijia Song,Yadong Li,Haoze Sun,Weipeng Chen,Zenan Zhou,Jianhua Xu,Jiaqi Wang,Kaicheng Yu

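Task: Propose DualToken, a method that unifies the representation spaces for visual understanding and generation.

Motivation: Existing vision tokenizers perform well in visual generation but lack the high-level semantic representations needed for understanding tasks, while vision encoders trained with contrastive learning align well with language but are hard to decode back into pixel space for generation.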

Details

Method: DualToken introduces separate codebooks for high-level and low-level features, disentangling semantic and perceptual information and thereby unifying understanding and generation representations in a single tokenizer. Result: DualToken achieves state-of-the-art performance on both reconstruction and semantic tasks and performs strongly on downstream MLLM understanding and generation tasks. Conclusion: As a unified tokenizer, DualToken surpasses the naive combination of two distinct types of vision encoders, providing superior performance within a unified MLLM. Abstract: The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details, making it well-suited for visual generation but lacking high-level semantic representations for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives in a single tokenizer creates conflicts, leading to degraded performance in both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high and low-level features, effectively transforming their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks while demonstrating remarkable effectiveness in downstream MLLM understanding and generation tasks. Notably, we also show that DualToken, as a unified tokenizer, surpasses the naive combination of two distinct types of vision encoders, providing superior performance within a unified MLLM.
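The idea of quantizing low-level and high-level features against separate codebooks can be sketched in a few lines of plain Python. This is only a toy nearest-neighbor vector quantizer under assumed names (`nearest_code`, `dual_tokenize`, `pixel_codebook`, `semantic_codebook`); the abstract does not describe DualToken's actual architecture or training objective at this level of detail.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def dual_tokenize(low_feat, high_feat, pixel_codebook, semantic_codebook):
    """Quantize low- and high-level features with two independent codebooks.

    Returns a (pixel_token, semantic_token) pair of integer indices.
    """
    return (
        nearest_code(low_feat, pixel_codebook),
        nearest_code(high_feat, semantic_codebook),
    )
```

Because each feature level has its own codebook, the perceptual token and the semantic token are chosen independently, so neither objective has to compromise the other's code assignment.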

LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models

Yu Cheng,Fajie Yuan

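Task: Propose LeanVAE, a novel and efficient video variational autoencoder framework that addresses the computational bottleneck in training Latent Video Diffusion Models (LVDMs).

Motivation: As LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly when encoding high-resolution videos.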

Details

Method: Proposes the LeanVAE framework, with two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, which drastically reduces computational cost; and (2) the integration of wavelet transforms and compressed sensing techniques to improve reconstruction quality. Result: Experiments validate LeanVAE's superiority in video reconstruction and generation, particularly in efficiency. LeanVAE offers up to 50x fewer FLOPs and 44x faster inference while maintaining competitive reconstruction quality. Conclusion: LeanVAE provides insights for scalable, efficient video generation; the models and code are open-sourced. Abstract: Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model offers up to 50x fewer FLOPs and 44x faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation. Our models and code are available at https://github.com/westlake-repl/LeanVAE.
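To make the wavelet ingredient concrete, here is a minimal sketch of one level of a 1D Haar transform and its inverse. The choice of the Haar wavelet and the function names are assumptions for illustration; the abstract does not say which wavelet LeanVAE uses or how it is wired into the network.

```python
import math

def haar_1d(signal):
    """One level of the orthonormal Haar transform on an even-length list.

    Returns (approx, detail): scaled pairwise sums (low-frequency content)
    and scaled pairwise differences (high-frequency content).
    """
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def inverse_haar_1d(approx, detail):
    """Reconstruct the original signal from one level of Haar coefficients."""
    s = 1 / math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out
```

The transform halves the resolution of each band while remaining exactly invertible, which is the property that makes wavelet decompositions attractive as a cheap, lossless front end before a learned encoder.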

EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment

Yufei Zhu,Yiming Zhong,Zemin Yang,Peishan Cong,Jingyi Yu,Xinge Zhu,Yuexin Ma

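Task: Propose an evolutionary grasp generation method that continuously improves grasping performance through efficient preference alignment.

Motivation: Models trained on low-diversity data struggle to generalize in complex environments, while real-world scenarios are unboundedly diverse, so robots need a way to learn from experience in complex environments.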

Details

Method: Proposes EvolvingGrasp, which combines Handpose wise Preference Optimization (HPO) with a Physics-aware Consistency Model to continuously align with preferences from positive and negative feedback while progressively refining the grasping strategy. Result: Extensive experiments on four benchmark datasets show state-of-the-art grasp success rate and sampling efficiency. Conclusion: EvolvingGrasp enables evolutionary grasp generation, ensuring robust, physically feasible, and preference-aligned grasping in both simulation and real scenarios. Abstract: Dexterous robotic hands often struggle to generalize effectively in complex environments due to the limitations of models trained on low-diversity data. However, the real world presents an inherently unbounded range of scenarios, making it impractical to account for every possible variation. A natural solution is to enable robots to learn from experience in complex environments, an approach akin to evolution, where systems improve through continuous feedback, learning from both failures and successes, and iterating toward optimal performance. Motivated by this, we propose EvolvingGrasp, an evolutionary grasp generation method that continuously enhances grasping performance through efficient preference alignment. Specifically, we introduce Handpose wise Preference Optimization (HPO), which allows the model to continuously align with preferences from both positive and negative feedback while progressively refining its grasping strategies. To further enhance efficiency and reliability during online adjustments, we incorporate a Physics-aware Consistency Model within HPO, which accelerates inference, reduces the number of timesteps needed for preference finetuning, and ensures physical plausibility throughout the process. Extensive experiments across four benchmark datasets demonstrate state-of-the-art performance of our method in grasp success rate and sampling efficiency. Our results validate that EvolvingGrasp enables evolutionary grasp generation, ensuring robust, physically feasible, and preference-aligned grasping in both simulation and real scenarios.

3D Densification for Multi-Map Monocular VSLAM in Endoscopy

X. Anadón,Javier Rodríguez-Puigvert,J. M. M. Montiel

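Task: Propose a method to remove outliers from, and densify the maps of, the sparse endoscopy multi-map CudaSIFT-SLAM.

Motivation: Sparse multi-maps work well for camera localization but represent the environment poorly: they are noisy, contain a high percentage of inaccurately reconstructed 3D points, and are too low in density for clinical applications.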

Details

Method: Uses the NN LightDepth for dense depth prediction and aligns the predictions with the sparse CudaSIFT submaps via LMedS, mitigating the scale ambiguity of monocular depth estimation while filtering outliers. Result: On the C3VD phantom colon dataset, the method produces accurate maps with 4.15 mm RMS accuracy at affordable computing time; qualitative results are also reported on real colonoscopies from the Endomapper dataset. Conclusion: The method effectively removes outliers and increases map density, yielding reliable 3D maps suited to clinical applications. Abstract: Multi-map Sparse Monocular visual Simultaneous Localization and Mapping applied to monocular endoscopic sequences has proven efficient to robustly recover tracking after the frequent losses in endoscopy due to motion blur, temporal occlusion, tool interaction or water jets. The sparse multi-maps are adequate for robust camera localization; however, they are very poor for environment representation: they are noisy, with a high percentage of inaccurately reconstructed 3D points, including significant outliers, and, more importantly, with an unacceptably low density for clinical applications. We propose a method to remove outliers and densify the maps of the state of the art for sparse endoscopy multi-map CudaSIFT-SLAM. The up-to-scale dense depth predictions of the NN LightDepth are aligned with the sparse CudaSIFT submaps by means of LMedS, which is robust to spurious data. Our system mitigates the inherent scale ambiguity in monocular depth estimation while filtering outliers, leading to reliable densified 3D maps. We provide experimental evidence of accurate densified maps, with 4.15 mm RMS accuracy at affordable computing time, in the C3VD phantom colon dataset. We report qualitative results on the real colonoscopy from the Endomapper dataset.
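The alignment step can be sketched as a robust scale fit with LMedS (least median of squares). The sketch below assumes the simplest possible setting: each sparse map point contributes one (predicted depth, SLAM depth) pair, and only a single global scale factor is estimated. The function name and this one-parameter model are illustrative assumptions, not the paper's pipeline.

```python
from statistics import median

def lmeds_scale(pred_depths, sparse_depths):
    """Estimate the scale s that maps predicted depths onto sparse SLAM depths.

    Every pair proposes a candidate scale z / d; the winner is the candidate
    with the smallest median squared residual over all pairs, so up to about
    half of the pairs can be outliers without corrupting the estimate.
    Returns (best_scale, best_median_squared_residual).
    """
    best_scale, best_cost = None, float("inf")
    for d, z in zip(pred_depths, sparse_depths):
        if d <= 0:
            continue  # a non-positive predicted depth cannot propose a scale
        scale = z / d
        cost = median(
            (zj - scale * dj) ** 2 for dj, zj in zip(pred_depths, sparse_depths)
        )
        if cost < best_cost:
            best_scale, best_cost = scale, cost
    return best_scale, best_cost
```

Once the scale is fixed, points whose residual is large relative to the median residual can be flagged as outliers, which is how a robust fit of this kind doubles as an outlier filter.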

VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation

Shoubin Yu,Difan Liu,Ziqiao Ma,Yicong Hong,Yang Zhou,Hao Tan,Joyce Chai,Mohit Bansal

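Task: Propose VEGGIE, a unified video editing framework that handles diverse user instructions, covering video concept editing, grounding, and reasoning.

Motivation: Existing video diffusion models still struggle with instructional editing and diverse tasks, which calls for a unified framework that supports them.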

Details

Method: VEGGIE first uses a multimodal large language model (MLLM) to interpret user instructions and ground them in the video context, generating frame-specific task queries; a diffusion model then renders these plans and generates edited videos that match user intent. A curriculum learning strategy first aligns the MLLM and the video diffusion model on large-scale instructional image editing data, then fine-tunes end-to-end on high-quality multitask video data. In addition, a new data synthesis pipeline uses Image-to-Video models to turn static image data into diverse, high-quality video editing samples. Result: VEGGIE performs strongly in instructional video editing, outperforming the baselines, and stands out in video object grounding and reasoning segmentation. Conclusion: VEGGIE shows strong performance across diverse tasks, reveals how the tasks help each other, and demonstrates promising applications such as zero-shot multimodal instructional and in-context video editing. Abstract: Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and text query, VEGGIE first utilizes an MLLM to interpret user intentions in instructions and ground them to the video contexts, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans and generates edited videos that align with user intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data. Additionally, we introduce a novel data synthesis pipeline to generate paired instructional video editing data for model training. It transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model, while other models struggle with multi-tasking. VEGGIE also excels in video object grounding and reasoning segmentation, where other baselines fail. We further reveal how the multiple tasks help each other and highlight promising applications like zero-shot multimodal instructional and in-context video editing.

MAST-Pro: Dynamic Mixture-of-Experts for Adaptive Segmentation of Pan-Tumors with Knowledge-Driven Prompts

Runqi Meng,Sifan Song,Pengfei Jin,Yujin Oh,Lin Teng,Yulin Wang,Yiqun Sun,Ling Chen,Xiang Li,Quanzheng Li,Ning Guo,Dinggang Shen

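Task: Propose MAST-Pro, a new framework for improving the accuracy of pan-tumor segmentation.

Motivation: Existing methods fall short in incorporating medical priors, in balancing generic and tumor-specific features, and in the high computational cost of clinical adaptation.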

Details

Method: MAST-Pro combines a dynamic Mixture-of-Experts (D-MoE) with knowledge-driven prompts: text and anatomical prompts supply domain-specific priors, experts are selected dynamically to balance generic and tumor-specific feature learning, and Parameter-Efficient Fine-Tuning (PEFT) improves efficiency. Result: Experiments on multi-anatomical tumor datasets show that MAST-Pro improves average DSC by up to 5.20% while reducing trainable parameters by 91.04%, without sacrificing accuracy. Conclusion: MAST-Pro performs strongly on pan-tumor segmentation, significantly improving segmentation accuracy while lowering computational cost. Abstract: Accurate tumor segmentation is crucial for cancer diagnosis and treatment. While foundation models have advanced general-purpose segmentation, existing methods still struggle with: (1) limited incorporation of medical priors, (2) imbalance between generic and tumor-specific features, and (3) high computational costs for clinical adaptation. To address these challenges, we propose MAST-Pro (Mixture-of-experts for Adaptive Segmentation of pan-Tumors with knowledge-driven Prompts), a novel framework that integrates dynamic Mixture-of-Experts (D-MoE) and knowledge-driven prompts for pan-tumor segmentation. Specifically, text and anatomical prompts provide domain-specific priors, guiding tumor representation learning, while D-MoE dynamically selects experts to balance generic and tumor-specific feature learning, improving segmentation accuracy across diverse tumor types. To enhance efficiency, we employ Parameter-Efficient Fine-Tuning (PEFT), optimizing MAST-Pro with significantly reduced computational overhead. Experiments on multi-anatomical tumor datasets demonstrate that MAST-Pro outperforms state-of-the-art approaches, achieving up to a 5.20% improvement in average DSC while reducing trainable parameters by 91.04%, without compromising accuracy.
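The reported metric, DSC, is the Dice similarity coefficient, 2|A∩B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. A minimal sketch for flat binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions, and the paper's exact evaluation protocol (per-case versus per-dataset averaging, for instance) is not stated in the abstract.

```python
def dice_score(pred, target):
    """Dice similarity coefficient for two equal-length flat binary masks (0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly; treat that case as a score of 1.0.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

For example, a prediction `[1, 1, 0, 0]` against a target `[1, 0, 1, 0]` shares one foreground pixel out of two in each mask, giving a score of 0.5.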

RFMI: Estimating Mutual Information on Rectified Flow for Text-to-Image Alignment

Chao Wang,Giulio Franzese,Alessandro Finamore,Pietro Michiardi

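Task: Propose RFMI, a new mutual information estimator, and study an RFMI-based self-supervised fine-tuning method for improving alignment in text-to-image generation.

Motivation: Text-to-image models still produce images that align poorly with the prompt, and existing remedies apply only to diffusion models and require auxiliary datasets, scoring models, and linguistic analysis of the prompt.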

Details

Method: Introduces RFMI, a mutual information estimator for RF models that uses the pre-trained model itself for the estimation, and studies a self-supervised fine-tuning method based on RFMI. Result: Experiments on mutual information estimation benchmarks demonstrate the validity of RFMI, and fine-tuning on SD3.5-Medium confirms that RFMI improves text-to-image alignment while maintaining image quality. Conclusion: RFMI is an effective mutual information estimator, and RFMI-based self-supervised fine-tuning improves the alignment of text-to-image generation. Abstract: Rectified Flow (RF) models trained with a Flow matching framework have achieved state-of-the-art performance on Text-to-Image (T2I) conditional generation. Yet, multiple benchmarks show that synthetic images can still suffer from poor alignment with the prompt, i.e., images show wrong attribute binding, subject positioning, numeracy, etc. While the literature offers many methods to improve T2I alignment, they all consider only Diffusion Models, and require auxiliary datasets, scoring models, and linguistic analysis of the prompt. In this paper we aim to address these gaps. First, we introduce RFMI, a novel Mutual Information (MI) estimator for RF models that uses the pre-trained model itself for the MI estimation. Then, we investigate a self-supervised fine-tuning approach for T2I alignment based on RFMI that does not require auxiliary information other than the pre-trained model itself. Specifically, a fine-tuning set is constructed by selecting synthetic images generated from the pre-trained RF model and having high point-wise MI between images and prompts. Our experiments on MI estimation benchmarks demonstrate the validity of RFMI, and empirical fine-tuning on SD3.5-Medium confirms the effectiveness of RFMI for improving T2I alignment while maintaining image quality.

ImViD: Immersive Volumetric Videos for Enhanced VR Engagement

Zhengxian Yang,Shi Pan,Shengqi Wang,Haoxiang Wang,Li Lin,Guanjun Li,Zhengqi Wen,Borong Lin,Jianhua Tao,Tao Yu

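Task: Introduce the ImViD dataset for reconstructing immersive volumetric videos and enabling 6-DoF multi-modal immersive VR experiences.

Motivation: Enhance user engagement and advance VR/AR technology, specifically immersive volumetric videos with complete scene capture, a large 6-DoF interaction space, multi-modal feedback, and high-resolution, high-frame-rate content.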

Details

Method: Introduces the ImViD dataset, which supports multi-view, multi-modal data capture, including 5K-resolution, 60 FPS multi-view videos with synchronized audio across a rich set of indoor and outdoor scenes. Result: The benchmark and the reconstruction and interaction results demonstrate the effectiveness of the dataset and the baseline method. Conclusion: The ImViD dataset and baseline method are expected to stimulate future research on immersive volumetric video production. Abstract: User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, large 6-DoF interaction space, multi-modal feedback, and high resolution & frame-rate contents. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audio) are in 5K resolution at 60FPS, lasting 1-5 minutes, and include rich foreground-background elements and complex dynamics. We benchmark existing methods using our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multi-modal immersive VR experiences. The benchmark and the reconstruction and interaction results demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.

Impossible Videos

Zechen Bai,Hai Ci,Mike Zheng Shou

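Task: Evaluate and foster progress in video understanding and generation models, specifically on impossible video content.

Motivation: Current synthetic datasets mainly replicate real-world scenarios and overlook impossible, counterfactual, and anti-reality video concepts.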

Details

Method: Introduces the IPV-Bench benchmark, built on a comprehensive taxonomy of 4 domains and 14 categories; a prompt suite evaluates video generation models, and a curated video benchmark evaluates video understanding models. Result: Comprehensive evaluations reveal the limitations of current video models and directions for future work. Conclusion: IPV-Bench paves the way for next-generation video models. Abstract: Synthetic videos are now widely used to compensate for the scarcity and limited diversity of real-world videos. Current synthetic datasets primarily replicate real-world scenarios, leaving impossible, counterfactual and anti-reality video concepts underexplored. This work aims to answer two questions: 1) Can today's video generation models effectively follow prompts to create impossible video content? 2) Are today's video understanding models good enough for understanding impossible videos? To this end, we introduce IPV-Bench, a novel benchmark designed to evaluate and foster progress in video understanding and generation. IPV-Bench is underpinned by a comprehensive taxonomy, encompassing 4 domains and 14 categories. It features diverse scenes that defy physical, biological, geographical, or social laws. Based on the taxonomy, a prompt suite is constructed to evaluate video generation models, challenging their prompt following and creativity capabilities. In addition, a video benchmark is curated to assess Video-LLMs on their ability to understand impossible videos, which particularly requires reasoning on temporal dynamics and world knowledge. Comprehensive evaluations reveal limitations and insights for future directions of video models, paving the way for next-generation video models.

Diffusion-based Facial Aesthetics Enhancement with 3D Structure Guidance

Lisha Li,Jingwen Hou,Weide Liu,Yuming Fang,Jiebin Yan

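Task: Enhance facial aesthetics by adjusting the structure and appearance of a facial image while preserving its identity as much as possible.

Motivation: Existing Facial Aesthetics Enhancement (FAE) methods achieve promising results but can produce excessively beautified faces with low identity consistency.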

Details

Method: Proposes Nearest Neighbor Structure Guidance based on Diffusion (NNSG-Diffusion), which beautifies a 2D facial image with 3D structure guidance. Specifically, FAE guidance is extracted from a nearest-neighbor reference face, and a 3D face model is recovered by referring to both the matched 2D reference face and the 2D input face, from which depth and contour guidance are extracted. Result: Experiments show that the method better preserves facial identity while enhancing facial aesthetics. Conclusion: NNSG-Diffusion outperforms existing methods in facial aesthetics enhancement, improving facial attractiveness while maintaining identity consistency. Abstract: Facial Aesthetics Enhancement (FAE) aims to improve facial attractiveness by adjusting the structure and appearance of a facial image while preserving its identity as much as possible. Most existing methods adopted deep feature-based or score-based guidance for generation models to conduct FAE. Although these methods achieved promising results, they potentially produced excessively beautified results with lower identity consistency or insufficiently improved facial attractiveness. To enhance facial aesthetics with less loss of identity, we propose the Nearest Neighbor Structure Guidance based on Diffusion (NNSG-Diffusion), a diffusion-based FAE method that beautifies a 2D facial image with 3D structure guidance. Specifically, we propose to extract FAE guidance from a nearest neighbor reference face. To allow for less change of facial structures in the FAE process, a 3D face model is recovered by referring to both the matched 2D reference face and the 2D input face, so that the depth and contour guidance can be extracted from the 3D face model. Then the depth and contour clues can provide effective guidance to Stable Diffusion with ControlNet for FAE. Extensive experiments demonstrate that our method is superior to previous relevant methods in enhancing facial aesthetics while preserving facial identity.

DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers

Mert Bulent Sariyildiz,Philippe Weinzaepfel,Thomas Lucas,Pau de Jorge,Diane Larlus,Yannis Kalantidis

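Task: Study heterogeneous teacher distillation (co-distillation), that is, multi-teacher distillation where the teacher models differ significantly in design objectives and training data.

Motivation: Explore whether multi-teacher distillation can still succeed when the pool of teachers includes vision models specialized in 2D and 3D perception tasks.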

Details

Method: Explores data-sharing strategies and teacher-specific encoding, and introduces DUNE, a single encoder that excels in 2D vision, 3D understanding, and 3D human perception. Result: DUNE performs comparably to its larger teachers on their respective tasks and sometimes outperforms them; notably, it surpasses MASt3R in Map-free Visual Relocalization. Conclusion: Heterogeneous teacher distillation is feasible, and DUNE's strong results across multiple tasks demonstrate its effectiveness. Abstract: Recent multi-teacher distillation methods have unified the encoders of multiple foundation models into a single encoder, achieving competitive performance on core vision tasks like classification, segmentation, and depth estimation. This led us to ask: Could similar success be achieved when the pool of teachers also includes vision models specialized in diverse tasks across both 2D and 3D perception? In this paper, we define and investigate the problem of heterogeneous teacher distillation, or co-distillation, a challenging multi-teacher distillation scenario where teacher models vary significantly in both (a) their design objectives and (b) the data they were trained on. We explore data-sharing strategies and teacher-specific encoding, and introduce DUNE, a single encoder excelling in 2D vision, 3D understanding, and 3D human perception. Our model achieves performance comparable to that of its larger teachers, sometimes even outperforming them, on their respective tasks. Notably, DUNE surpasses MASt3R in Map-free Visual Relocalization with a much smaller encoder.

ExDDV: A New Dataset for Explainable Deepfake Detection in Video

Vlad Hondru,Eduard Hogea,Darian Onchis,Radu Tudor Ionescu

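Task: Develop an explainable dataset and benchmark for deepfake detection in video.

Motivation: As generated videos improve in quality, humans find it increasingly hard to spot deepfake content, while automatic deepfake detectors are error-prone and their decisions are not explainable, leaving people vulnerable to deepfake-based fraud and misinformation.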

Details

Method: Introduces the ExDDV dataset of around 5.4K real and deepfake videos, manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). Evaluates a number of vision-language models with various fine-tuning and in-context learning strategies. Result: Both text and click supervision are required to develop robust explainable models for deepfake videos that can localize and describe the observed artifacts. Conclusion: The ExDDV dataset and the code to reproduce the results are available, providing a new explainability benchmark for deepfake detection. Abstract: The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, so they need to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.

Hongyu Zhang,Yufan Deng,Shenghai Yuan,Peng Jin,Zesen Cheng,Yian Zhao,Chang Liu,Jie Chen

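Task: Propose MagicComp, a training-free method that enhances compositional text-to-video generation through dual-phase refinement.

Motivation: Existing methods still struggle to bind attributes accurately, determine spatial relationships, and capture complex action interactions between multiple subjects.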

Details

Method: In the conditioning stage, introduces Semantic Anchor Disambiguation, which progressively injects the directional vectors of semantic anchors to reinforce subject-specific semantics and resolve inter-subject ambiguity; in the denoising stage, proposes Dynamic Layout Fusion Attention, which flexibly binds subjects to their spatiotemporal regions through masked attention modulation. Result: Extensive experiments on T2V-CompBench and VBench show that MagicComp outperforms state-of-the-art methods. Conclusion: MagicComp is a model-agnostic and versatile approach that integrates seamlessly into existing T2V architectures, showing potential for complex prompt-based and trajectory-controllable video generation. Abstract: Text-to-video (T2V) generation has made significant strides with diffusion models. However, existing methods still struggle with accurately binding attributes, determining spatial relationships, and capturing complex action interactions between multiple subjects. To address these limitations, we propose MagicComp, a training-free method that enhances compositional T2V generation through dual-phase refinement. Specifically, (1) During the Conditioning Stage: We introduce the Semantic Anchor Disambiguation to reinforce subject-specific semantics and resolve inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors into the original text embedding; (2) During the Denoising Stage: We propose Dynamic Layout Fusion Attention, which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation. Furthermore, MagicComp is a model-agnostic and versatile approach, which can be seamlessly integrated into existing T2V architectures. Extensive experiments on T2V-CompBench and VBench demonstrate that MagicComp outperforms state-of-the-art methods, highlighting its potential for applications such as complex prompt-based and trajectory-controllable video generation. Project page: https://hong-yu-zhang.github.io/MagicComp-Page/.
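As a rough illustration of what "masked attention modulation" can mean, the sketch below adds a bias to the attention logits at positions covered by a subject's layout mask before applying softmax. The additive form, the `boost` parameter, and the function name are assumptions for illustration; the abstract does not specify how MagicComp's Dynamic Layout Fusion Attention actually modulates attention.

```python
import math

def masked_attention_weights(scores, mask, boost=2.0):
    """Softmax over `scores` after adding `boost` where `mask` is 1.

    `scores` are raw attention logits from one text token to each spatial
    position; `mask` marks the positions inside the subject's region.
    """
    logits = [s + boost * m for s, m in zip(scores, mask)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With equal raw scores, positions inside the mask receive a larger share of the attention than those outside it, which pulls the token's influence toward its assigned region without changing any model weights.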

Joint Image-Instance Spatial-Temporal Attention for Few-shot Action Recognition

Zefeng Qian,Chongyang Zhang,Yifei Huang,Gang Wang,Jiangyong Ying

Task: Few-shot Action Recognition (FSAR) using a novel joint Image-Instance level Spatial-temporal attention approach (I2ST).

Motivation: Existing methods for FSAR mainly focus on image-level features, which incorporate background noise and insufficiently focus on real foreground (action-related instances), compromising recognition capability.

Details

Method: Proposes a novel joint Image-Instance level Spatial-temporal attention approach (I2ST) that perceives action-related instances and integrates them with image features via spatial-temporal attention. It consists of two key components: Action-related Instance Perception and Joint Image-Instance Spatial-temporal Attention. Result: The proposed I2ST approach effectively integrates action-related instances with image features, improving recognition capability in few-shot scenarios. Conclusion: The I2ST approach addresses the limitations of existing methods by focusing on action-related instances and integrating them with image features, thereby enhancing few-shot action recognition performance. Abstract: Few-shot Action Recognition (FSAR) constitutes a crucial challenge in computer vision, entailing the recognition of actions from a limited set of examples. Recent approaches mainly focus on employing image-level features to construct temporal dependencies and generate prototypes for each action category. However, a considerable number of these methods utilize mainly image-level features that incorporate background noise and focus insufficiently on real foreground (action-related instances), thereby compromising the recognition capability, particularly in the few-shot scenario. To tackle this issue, we propose a novel joint Image-Instance level Spatial-temporal attention approach (I2ST) for Few-shot Action Recognition. The core concept of I2ST is to perceive the action-related instances and integrate them with image features via spatial-temporal attention. Specifically, I2ST consists of two key components: Action-related Instance Perception and Joint Image-Instance Spatial-temporal Attention. Given the basic representations from the feature extractor, the Action-related Instance Perception is introduced to perceive action-related instances under the guidance of a text-guided segmentation model. 
Subsequently, the Joint Image-Instance Spatial-temporal Attention is used to construct the feature dependency between instances and images...

Bolt3D: Generating 3D Scenes in Seconds

Stanislaw Szymanowicz,Jason Y. Zhang,Pratul Srinivasan,Ruiqi Gao,Arthur Brussee,Aleksander Holynski,Ricardo Martin-Brualla,Jonathan T. Barron,Philipp Henzler

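Task: Propose a latent diffusion model for fast feed-forward 3D scene generation.

Motivation: Leverage existing powerful and scalable 2D diffusion network architectures to produce consistent, high-fidelity 3D scene representations.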

Details

Method: Applies state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets to create a large-scale, multiview-consistent dataset of 3D geometry and appearance, and trains the model on it. Result: Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU, reducing inference cost by a factor of up to 300 compared with prior multiview generative models. Conclusion: Bolt3D substantially lowers the inference cost of 3D scene generation and speeds up generation. Abstract: We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.

SIR-DIFF: Sparse Image Sets Restoration with Multi-View Diffusion Model

Yucheng Mao,Boyang Wang,Nilesh Kulkarni,Jeong Joon Park

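Task: Restore images by jointly denoising multiple degraded photographs of the same scene.

Motivation: Combining the complementary information in multiple degraded images better constrains the image restoration problem.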

Details

Method: Implements a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Result: Experiments show that the multi-view approach outperforms existing single-view image methods, and even video-based methods, on image deblurring and super-resolution. Conclusion: The model outputs 3D-consistent images, making it a promising tool for applications that require robust multi-view integration, such as 3D reconstruction or pose estimation. Abstract: The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM

Xinyu Fang,Zhijian Chen,Kai Lan,Shengyuan Ding,Yingji Liang,Xiangyu Zhao,Farong Wen,Zicheng Zhang,Guofeng Zhang,Haodong Duan,Kai Chen,Dahua Lin

Task: Evaluate the creative capabilities of multimodal large language models (MLLMs) on real-world, image-based tasks.

Motivation: Fill the gap in creativity evaluation for multimodal large language models.

Details

Method: Introduce Creation-MMBench, a multimodal benchmark comprising 765 test cases across 51 fine-grained tasks. Result: Current open-source MLLMs significantly underperform proprietary models on creative tasks, and visual fine-tuning can negatively affect the base LLM's creative abilities. Conclusion: Creation-MMBench provides valuable insights for advancing MLLM creativity and lays a foundation for future improvements in multimodal generative intelligence. Abstract: Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM's creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code are released at https://github.com/open-compass/Creation-MMBench.

ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing

Yulin Pan,Xiangteng He,Chaojie Mao,Zhen Han,Zeyinzi Jiang,Jingfeng Zhang,Yu Liu

Task: Propose ICE-Bench, a unified and comprehensive benchmark for rigorously evaluating image generation models.

Motivation: Evaluating the performance of image generation models remains a formidable challenge.

Details

Method: ICE-Bench achieves comprehensiveness through three key features: (1) coarse-to-fine tasks: image generation is decomposed into four task categories and further subdivided into 31 fine-grained tasks; (2) multi-dimensional metrics: the evaluation framework assesses image generation capabilities across 6 dimensions, supported by 11 metrics; (3) hybrid data: the data comes from real scenes and virtual generation, which improves data diversity and alleviates bias in model evaluation. Result: A thorough analysis of existing generation models with ICE-Bench reveals both the challenging nature of the benchmark and the gap between current model capabilities and real-world generation requirements. Conclusion: ICE-Bench will be open-sourced, including its dataset, evaluation code, and models, as a resource for the research community. Abstract: Image generation has witnessed significant advancements in the past few years. However, evaluating the performance of image generation models remains a formidable challenge. In this paper, we propose ICE-Bench, a unified and comprehensive benchmark designed to rigorously assess image generation models. Its comprehensiveness could be summarized in the following key features: (1) Coarse-to-Fine Tasks: We systematically deconstruct image generation into four task categories: No-ref/Ref Image Creating/Editing, based on the presence or absence of source images and reference images. We further decompose them into 31 fine-grained tasks covering a broad spectrum of image generation requirements, culminating in a comprehensive benchmark. (2) Multi-dimensional Metrics: The evaluation framework assesses image generation capabilities across 6 dimensions: aesthetic quality, imaging quality, prompt following, source consistency, reference consistency, and controllability. 11 metrics are introduced to support the multi-dimensional evaluation. Notably, we introduce VLLM-QA, an innovative metric designed to assess the success of image editing by leveraging large models. (3) Hybrid Data: The data comes from real scenes and virtual generation, which effectively improves data diversity and alleviates the bias problem in model evaluation. Through ICE-Bench, we conduct a thorough analysis of existing generation models, revealing both the challenging nature of our benchmark and the gap between current model capabilities and real-world generation requirements. To foster further advancements in the field, we will open-source ICE-Bench, including its dataset, evaluation code, and models, thereby providing a valuable resource for the research community.

Multi-view Reconstruction via SfM-guided Monocular Depth Estimation

Haoyu Guo,He Zhu,Sida Peng,Haotong Lin,Yunzhi Yan,Tao Xie,Wenguan Wang,Xiaowei Zhou,Hujun Bao

Task: Propose a new method for multi-view geometric reconstruction.

Motivation: Large vision models have developed rapidly in recent years, performing well across many tasks and generalizing remarkably. However, monocular depth estimation is inherently ambiguous, so the estimated depth values are usually not accurate enough, which limits their usefulness in aiding multi-view reconstruction.

Details

Method: Incorporate SfM information, a strong multi-view prior, into the depth estimation process, improving the quality of depth prediction and allowing it to be applied directly to multi-view geometric reconstruction. Result: Experiments on public real-world datasets show that the method significantly improves depth estimation quality over previous monocular depth estimation work. Reconstruction quality, evaluated on indoor, streetscape, and aerial scenes, surpasses state-of-the-art multi-view stereo (MVS) methods. Conclusion: Incorporating SfM information into depth estimation significantly improves depth quality and yields strong multi-view geometric reconstruction. Abstract: In this paper, we present a new method for multi-view geometric reconstruction. In recent years, large vision models have rapidly developed, performing excellently across various tasks and demonstrating remarkable generalization capabilities. Some works use large vision models for monocular depth estimation, which have been applied to facilitate multi-view reconstruction tasks in an indirect manner. Due to the ambiguity of the monocular depth estimation task, the estimated depth values are usually not accurate enough, limiting their utility in aiding multi-view reconstruction. We propose to incorporate SfM information, a strong multi-view prior, into the depth estimation process, thus enhancing the quality of depth prediction and enabling their direct application in multi-view geometric reconstruction. Experimental results on public real-world datasets show that our method significantly improves the quality of depth estimation compared to previous monocular depth estimation works. Additionally, we evaluate the reconstruction quality of our approach in various types of scenes including indoor, streetscape, and aerial views, surpassing state-of-the-art MVS methods. The code and supplementary materials are available at https://zju3dv.github.io/murre/.

DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers

Minglei Shi,Ziyang Yuan,Haotian Yang,Xintao Wang,Mingwu Zheng,Xin Tao,Wenliang Zhao,Wenzhao Zheng,Jie Zhou,Jiwen Lu,Pengfei Wan,Di Zhang,Kun Gai

Task: Propose DiffMoE, a new diffusion-model approach that addresses the uniform processing of inputs across varying conditions and noise levels in existing diffusion models.

Motivation: The performance of existing diffusion models is limited by treating all conditions and noise levels alike; a method is needed that fully exploits the heterogeneity of the diffusion process.

Details

Method: DiffMoE introduces a batch-level global token pool that lets experts access the global token distribution during training, combined with a capacity predictor that allocates computational resources dynamically. Result: DiffMoE achieves state-of-the-art performance on the ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while keeping 1x activated parameters. Conclusion: DiffMoE performs well not only on class-conditional generation but also on the more challenging text-to-image generation task, demonstrating broad applicability. Abstract: Diffusion models have demonstrated remarkable success in various image generation tasks, but their performance is often limited by the uniform processing of inputs across varying conditions and noise levels. To address this limitation, we propose a novel approach that leverages the inherent heterogeneity of the diffusion process. Our method, DiffMoE, introduces a batch-level global token pool that enables experts to access global token distributions during training, promoting specialized expert behavior. To unleash the full potential of the diffusion process, DiffMoE incorporates a capacity predictor that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion models on ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications. Project Page: https://shiml20.github.io/DiffMoE/
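The idea of a batch-level token pool can be illustrated with a minimal sketch. It assumes expert-choice top-k selection over router scores, which the abstract does not specify; the function names, the score matrix, and the fixed `capacity` argument are all hypothetical, and the learned capacity predictor is not modeled.

```python
def flatten_batch(batch):
    """Pool tokens from every sample; return (sample_index, position) per pooled token."""
    return [(s, p) for s, sample in enumerate(batch) for p in range(len(sample))]


def route_global_pool(scores, capacity):
    """Let each expert pick its `capacity` best tokens from the pooled batch.

    scores[t][e] is a hypothetical router affinity of pooled token t for expert e.
    Selection runs over the whole pool, so one sample may receive more of an
    expert's capacity than another.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = sorted(ranked[:capacity])
    return assignment


# Two samples with two tokens each, pooled into four tokens.
batch = [["a0", "a1"], ["b0", "b1"]]
pool = flatten_batch(batch)
scores = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.1, 0.6]]
assignment = route_global_pool(scores, capacity=2)
```

In this toy input, expert 0 spends its whole capacity on the first sample and expert 1 on the second. Routing that is restricted to each sample separately could not produce that allocation.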

Stable Virtual Camera: Generative View Synthesis with Diffusion Models

Jensen Zhou,Hang Gao,Vikram Voleti,Aaryaman Vasishta,Chun-Han Yao,Mark Boss,Philip Torr,Christian Rupprecht,Varun Jampani

Task: Propose a generalist diffusion model (Seva) that generates novel views of a scene given any number of input views and target cameras.

Motivation: Existing methods struggle to generate large viewpoint changes or temporally smooth samples, and rely on specific task configurations.

Details

Method: Overcome these limitations with a simple model design, an optimized training recipe, and a flexible sampling strategy that generalize across view synthesis tasks at test time. Result: The generated samples stay highly consistent without additional 3D representation-based distillation, which simplifies view synthesis in the wild. The method can also generate high-quality videos lasting up to half a minute with seamless loop closure. Conclusion: Extensive benchmarking shows that Seva outperforms existing methods across different datasets and settings. Abstract: We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through simple model design, optimized training recipe, and flexible sampling strategy that generalize across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings.

Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

NVIDIA: Hassan Abu Alhaija,Jose Alvarez,Maciej Bala,Tiffany Cai,Tianshi Cao,Liz Cha,Joshua Chen,Mike Chen,Francesco Ferroni,Sanja Fidler,Dieter Fox,Yunhao Ge,Jinwei Gu,Ali Hassani,Michael Isaev,Pooya Jannaty,Shiyi Lan,Tobias Lasser,Huan Ling,Ming-Yu Liu,Xian Liu,Yifan Lu,Alice Luo,Qianli Ma,Hanzi Mao,Fabio Ramos,Xuanchi Ren,Tianchang Shen,Shitao Tang,Ting-Chun Wang,Jay Wu,Jiashu Xu,Stella Xu,Kevin Xie,Yuchong Ye,Xiaodong Yang,Xiaohui Zeng,Yu Zeng

Task: Introduce Cosmos-Transfer, a conditional world generation model that generates world simulations from multiple spatial control inputs.

Motivation: Enable highly controllable world generation for world-to-world transfer use cases, including Sim2Real.

Details

Method: Design an adaptive and customizable spatial conditioning scheme that lets different conditional inputs be weighted differently at different spatial locations. Result: The proposed model is analyzed through extensive evaluation, and its applications to Physical AI are demonstrated, including robotics Sim2Real and autonomous vehicle data enrichment. Conclusion: An inference scaling strategy achieves real-time world generation, and the models and code are open-sourced to accelerate research in the field. Abstract: We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at https://github.com/nvidia-cosmos/cosmos-transfer1.
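Weighting control inputs differently at each spatial location can be sketched as a per-pixel weighted sum. This is only an illustration: the function name, the dictionary layout, and the rule that rescales weights when they sum to more than 1 are assumptions, not details taken from the abstract.

```python
def fuse_controls(controls, weights):
    """Blend control maps with per-location weights.

    controls, weights: dict mapping a modality name (e.g. "depth", "edge")
    to an H x W nested list. Both layouts are hypothetical.
    """
    names = list(controls)
    height = len(controls[names[0]])
    width = len(controls[names[0]][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = sum(weights[n][y][x] for n in names)
            # Assumption: rescale only when the weights at this pixel exceed 1 in total.
            scale = 1.0 / total if total > 1.0 else 1.0
            fused[y][x] = scale * sum(
                weights[n][y][x] * controls[n][y][x] for n in names
            )
    return fused


controls = {"depth": [[1.0, 1.0, 1.0]], "edge": [[3.0, 3.0, 3.0]]}
weights = {"depth": [[1.0, 0.0, 1.0]], "edge": [[0.0, 0.5, 1.0]]}
fused = fuse_controls(controls, weights)
```

Here the first pixel follows depth only, the second follows edge at half strength, and the third averages both because the weights are rescaled.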

State Space Model Meets Transformer: A New Paradigm for 3D Object Detection

Chuxin Wang,Wenfei Yang,Xiang Liu,Tianzhu Zhang

Task: Propose a new paradigm for 3D indoor object detection based on an interactive state space model (DEST).

Motivation: DETR-based methods perform well in 3D indoor object detection, but the scene point features stay fixed in the transformer decoder, so later decoder layers contribute little, which limits performance gains.

Details

Method: Design a new state-dependent SSM parameterization that lets system states serve effectively as queries in 3D indoor detection. In addition, introduce four key designs: serialization and bidirectional scanning strategies, an inter-state attention mechanism, and a gated feed-forward network. Result: Experiments on ScanNet V2 and SUN RGB-D show significant gains, with AP50 improved by 5.3 and 3.2 respectively. Conclusion: DEST is the first method to model queries as system states and scene points as system inputs, updating scene point features and query features simultaneously with linear complexity and significantly improving 3D indoor object detection. Abstract: DETR-based methods, which use multi-layer transformer decoders to refine object queries iteratively, have shown promising performance in 3D indoor object detection. However, the scene point features in the transformer decoder remain fixed, leading to minimal contributions from later decoder layers, thereby limiting performance improvement. Recently, State Space Models (SSM) have shown efficient context modeling ability with linear complexity through iterative interactions between system states and inputs. Inspired by SSMs, we propose a new 3D object DEtection paradigm with an interactive STate space model (DEST). In the interactive SSM, we design a novel state-dependent SSM parameterization method that enables system states to effectively serve as queries in 3D indoor detection tasks. In addition, we introduce four key designs tailored to the characteristics of point cloud and SSM: The serialization and bidirectional scanning strategies enable bidirectional feature interaction among scene points within the SSM. The inter-state attention mechanism models the relationships between state points, while the gated feed-forward network enhances inter-channel correlations. To the best of our knowledge, this is the first method to model queries as system states and scene points as system inputs, which can simultaneously update scene point features and query features with linear complexity. Extensive experiments on two challenging datasets demonstrate the effectiveness of our DEST-based method. Our method improves the GroupFree baseline in terms of AP50 on ScanNet V2 (+5.3) and SUN RGB-D (+3.2) datasets. Based on the VDETR baseline, our method sets a new SOTA on the ScanNetV2 and SUN RGB-D datasets.

Deeply Supervised Flow-Based Generative Models

Inkyu Shin,Chenglin Yang,Liang-Chieh Chen

Task: Propose DeepFlow, a new framework that improves flow-based generative models by enhancing velocity representation through inter-layer communication.

Motivation: Existing flow-based generative models train the velocity representation only from the final-layer output, underusing the rich inter-layer representations and potentially impeding convergence.

Details

Method: DeepFlow partitions transformer layers into balanced branches and inserts a lightweight Velocity Refiner with Acceleration (VeRA) block between adjacent branches to align the intermediate velocity features within transformer blocks. Result: On ImageNet, DeepFlow converges 8 times faster at equivalent performance and further reduces FID by 2.6 while halving training time. In text-to-image generation, it outperforms baselines on MSCOCO and zero-shot GenEval. Conclusion: Through inter-layer communication and deep supervision, DeepFlow significantly improves the convergence speed and generation quality of flow-based generative models. Abstract: Flow-based generative models have charted an impressive path across multiple visual generation tasks by adhering to a simple principle: learning velocity representations of a linear interpolant. However, we observe that training velocity solely from the final layer output underutilizes the rich inter-layer representations, potentially impeding model convergence. To address this limitation, we introduce DeepFlow, a novel framework that enhances velocity representation through inter-layer communication. DeepFlow partitions transformer layers into balanced branches with deep supervision and inserts a lightweight Velocity Refiner with Acceleration (VeRA) block between adjacent branches, which aligns the intermediate velocity features within transformer blocks. Powered by the improved deep supervision via the internal velocity alignment, DeepFlow converges 8 times faster on ImageNet with equivalent performance and further reduces FID by 2.6 while halving training time compared to previous flow-based models without classifier-free guidance. DeepFlow also outperforms baselines in text-to-image generation tasks, as evidenced by evaluations on MSCOCO and zero-shot GenEval.
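The "velocity of a linear interpolant" and the notion of deep supervision can be made concrete with a small sketch. It assumes the common convention x_t = (1 - t) * x0 + t * x1 with velocity target x1 - x0 (other work uses the opposite direction), and it averages a plain mean-squared error over branch outputs. The VeRA block and the paper's actual loss are not modeled, and all function names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Time derivative of the linear interpolant; it is constant along the path."""
    return [b - a for a, b in zip(x0, x1)]


def mse(prediction, target):
    return sum((p - q) ** 2 for p, q in zip(prediction, target)) / len(target)


def deep_supervised_loss(branch_predictions, target):
    """Average the velocity loss over every branch, not only the final one."""
    losses = [mse(p, target) for p in branch_predictions]
    return sum(losses) / len(losses)


x0, x1 = [0.0, 0.0], [2.0, 4.0]
x_t = interpolate(x0, x1, 0.25)
v = velocity_target(x0, x1)
# Two hypothetical branch outputs: the first is exact, the second is off in one coordinate.
loss = deep_supervised_loss([[2.0, 4.0], [0.0, 4.0]], v)
```

Because every branch contributes to the loss, intermediate layers receive a direct training signal instead of relying only on gradients from the final output.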

Tracking Meets Large Multimodal Models for Driving Scenario Understanding

Ayesha Ishaq,Jean Lahoud,Fahad Shahbaz Khan,Salman Khan,Hisham Cholakkal,Rao Muhammad Anwer

Task: Propose a new approach that integrates tracking information into large multimodal models (LMMs) to strengthen their spatiotemporal understanding of driving scenarios.

Motivation: Existing LMMs for autonomous driving rely mainly on image data and underuse 3D spatial and temporal elements, which limits their effectiveness in dynamic driving environments.

Details

Method: Embed 3D tracking data into LMMs through a track encoder, enriching visual queries with spatial and temporal cues, and pretrain the track encoder with a self-supervised approach. Result: On the DriveLM-nuScenes benchmark, the approach gains 9.5% in accuracy, 7.04 points in ChatGPT score, and 9.4% in overall score; on DriveLM-CARLA, the final score improves by 3.7%. Conclusion: Integrating 3D tracking data significantly improves LMM performance on perception, planning, and prediction tasks in autonomous driving. Abstract: Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research, showcasing promising capabilities across various emerging benchmarks. LMMs specifically designed for this domain have demonstrated effective perception, planning, and prediction skills. However, many of these methods underutilize 3D spatial and temporal elements, relying mainly on image data. As a result, their effectiveness in dynamic driving environments is limited. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details that are not effectively captured in the images. We introduce a novel approach for embedding this tracking information into LMMs to enhance their spatiotemporal understanding of driving scenarios. By incorporating 3D tracking data through a track encoder, we enrich visual queries with crucial spatial and temporal cues while avoiding the computational overhead associated with processing lengthy video sequences or extensive 3D inputs. Moreover, we employ a self-supervised approach to pretrain the tracking encoder to provide LMMs with additional contextual information, significantly improving their performance in perception, planning, and prediction tasks for autonomous driving. Experimental results demonstrate the effectiveness of our approach, with a gain of 9.5% in accuracy, an increase of 7.04 points in the ChatGPT score, and 9.4% increase in the overall score over baseline models on DriveLM-nuScenes benchmark, along with a 3.7% final score improvement on DriveLM-CARLA. Our code is available at https://github.com/mbzuai-oryx/TrackingMeetsLMM

Utilization of Neighbor Information for Image Classification with Different Levels of Supervision

Gihan Jayatilaka,Abhinav Shrivastava,Matthew Gwilliam

Task: Propose a flexible method that bridges semi-supervised and unsupervised image recognition and works for both generalized category discovery (GCD) and image clustering.

Motivation: Although GCD and image clustering share overlapping motivations, existing methods are usually restricted to one task: GCD methods rely on the labeled portion of the data, while deep image clustering methods have no mechanism for using labels effectively.

Details

Method: Propose an approach that Utilizes Neighbor Information for Classification (UNIC) and is effective in both the unsupervised (clustering) and semi-supervised (GCD) settings. A sampling and cleaning strategy identifies accurate positive and negative neighbors, and the backbone is fine-tuned with clustering losses computed by sampling both types of neighbors. Result: Clustering improves by 3% (ImageNet-100, ImageNet200), and GCD improves by 0.8%, 5%, 2%, and 4% on ImageNet-100, CUB, SCars, and Aircraft respectively. Conclusion: The method achieves state-of-the-art results on both clustering and GCD, successfully connecting the two tasks. Abstract: We propose to bridge the gap between semi-supervised and unsupervised image recognition with a flexible method that performs well for both generalized category discovery (GCD) and image clustering. Despite the overlap in motivation between these tasks, the methods themselves are restricted to a single task -- GCD methods are reliant on the labeled portion of the data, and deep image clustering methods have no built-in way to leverage the labels efficiently. We connect the two regimes with an innovative approach that Utilizes Neighbor Information for Classification (UNIC) both in the unsupervised (clustering) and semisupervised (GCD) setting. State-of-the-art clustering methods already rely heavily on nearest neighbors. We improve on their results substantially in two parts, first with a sampling and cleaning strategy where we identify accurate positive and negative neighbors, and secondly by finetuning the backbone with clustering losses computed by sampling both types of neighbors. We then adapt this pipeline to GCD by utilizing the labelled images as ground truth neighbors. Our method yields state-of-the-art results for both clustering (+3% ImageNet-100, Imagenet200) and GCD (+0.8% ImageNet-100, +5% CUB, +2% SCars, +4% Aircraft).
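Mining positive and negative neighbors from embeddings could look roughly like the sketch below. The abstract does not describe the sampling and cleaning strategy, so the specific rules here are assumptions made for illustration: positives are kept only if they are mutual k-nearest neighbors under cosine similarity, and negatives are the least similar samples. The function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mine_neighbors(embeddings, k_pos, k_neg):
    """Return (positives, negatives): one sorted index list per sample."""
    n = len(embeddings)
    sims = [[cosine(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    # For each sample, all other samples ordered from most to least similar.
    order = [
        sorted((j for j in range(n) if j != i), key=lambda j: sims[i][j], reverse=True)
        for i in range(n)
    ]
    knn = [set(o[:k_pos]) for o in order]
    # Assumed "cleaning" rule: keep a positive only if the relation is mutual.
    positives = [sorted(j for j in knn[i] if i in knn[j]) for i in range(n)]
    # Assumed negative rule: the k_neg least similar samples.
    negatives = [sorted(o[-k_neg:]) for o in order]
    return positives, negatives


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
positives, negatives = mine_neighbors(embeddings, k_pos=1, k_neg=2)
```

In the GCD setting described in the abstract, labeled images would supply ground-truth neighbors directly, so a step like this would only be needed for the unlabeled portion.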

Advances in 4D Generation: A Survey

Qiaowei Miao,Kehan Li,Jinsheng Quan,Zhiyuan Min,Shaojie Ma,Yichao Xu,Yi Yang,Yawei Luo

Task: Survey the field of 4D generation comprehensively, systematically examining its theoretical foundations, key methodologies, and practical applications.

Motivation: 4D generation, which adds the temporal dimension to generative tasks, is an emerging and rapidly evolving research area with broad application prospects.

Details

Method: Introduce the core concepts of 4D data representations; discuss the enabling technologies behind 4D generation, including advances in spatiotemporal modeling, neural representations, and generative frameworks; and review recent studies that generate 4D outputs with different control mechanisms and representation strategies. Result: The survey summarizes the research trajectories of 4D generation techniques and explores their applications in dynamic object modeling, scene generation, digital human synthesis, 4D content editing, and autonomous driving. Conclusion: The survey analyzes key challenges of 4D generation, such as data availability, computational efficiency, and spatiotemporal consistency, and proposes promising directions for future research. Abstract: Generative artificial intelligence has witnessed remarkable advancements across multiple domains in recent years. Building on the successes of 2D and 3D content generation, 4D generation, which incorporates the temporal dimension into generative tasks, has emerged as a burgeoning yet rapidly evolving research area. This paper presents a comprehensive survey of this emerging field, systematically examining its theoretical foundations, key methodologies, and practical applications, with the aim of providing readers with a holistic understanding of the current state and future potential of 4D generation. We begin by introducing the core concepts of 4D data representations, encompassing both structured and unstructured formats, and their implications for generative tasks. Building upon this foundation, we delve into the enabling technologies that drive 4D generation, including advancements in spatiotemporal modeling, neural representations, and generative frameworks. We further review recent studies that employ diverse control mechanisms and representation strategies for generating 4D outputs, categorizing these approaches and summarizing their research trajectories. In addition, we explore the wide-ranging applications of 4D generation techniques, spanning dynamic object modeling, scene generation, digital human synthesis, 4D content editing, and autonomous driving. Finally, we analyze the key challenges inherent to 4D generation, such as data availability, computational efficiency, and spatiotemporal consistency, and propose promising directions for future research. Our code is publicly available at: https://github.com/MiaoQiaowei/Awesome-4D.

The Power of Context: How Multimodality Improves Image Super-Resolution

Kangfu Mei,Hossein Talebi,Mojtaba Ardakani,Vishal M. Patel,Peyman Milanfar,Mauricio Delbracio

Task: Propose a single-image super-resolution (SISR) method that uses multimodal information (depth, segmentation, edges, and text prompts) to learn a powerful generative prior within a diffusion model framework.

Motivation: Existing SISR methods often rely on limited image priors, leading to suboptimal results. Multimodal information makes it possible to recover details better while preserving perceptual quality.

Details

Method: Introduce a flexible network architecture that effectively fuses multimodal information and learns a generative prior within a diffusion model framework. Spatial information from other modalities guides regional text-based conditioning, reducing the hallucinations that text prompts introduce. Result: Experiments show that the model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity. Conclusion: By exploiting multimodal information, the proposed method significantly improves single-image super-resolution in both visual quality and fidelity. Abstract: Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities -- including depth, segmentation, edges, and text prompts -- to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate hallucinations, often introduced by text prompts, by using spatial information from other modalities to guide regional text-based conditioning. Each modality's guidance strength can also be controlled independently, allowing steering outputs toward different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity. See project page at https://mmsr.kfmei.com/.

Aligning Multimodal LLM with Human Preference: A Survey

Tao Yu,Yi-Fan Zhang,Chaoyou Fu,Junkang Wu,Jinda Lu,Kun Wang,Xingyu Lu,Yunhang Shen,Guibin Zhang,Dingjie Song,Yibo Yan,Tianlong Xu,Qingsong Wen,Zhang Zhang,Yan Huang,Liang Wang,Tieniu Tan

Task: Provide a comprehensive and systematic review of alignment algorithms for multimodal large language models (MLLMs).

Motivation: Although MLLMs handle complex tasks impressively, they still fall short in truthfulness, safety, reasoning ability, and alignment with human preference, which has driven the development of alignment algorithms.

Details

Method: The survey explores four key aspects: the application scenarios of alignment algorithms, the core factors in constructing alignment datasets, the benchmarks used to evaluate alignment algorithms, and future directions. Result: The survey offers a comprehensive review of alignment algorithms that helps researchers organize current progress in the field and inspires better alignment methods. Conclusion: Alignment algorithms are a powerful approach to the challenges facing MLLMs, and the survey provides direction and inspiration for future research. Abstract: Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training. Multimodal Large Language Models (MLLMs), built upon LLMs, have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data. However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preference remain insufficiently addressed. This gap has spurred the emergence of various alignment algorithms, each targeting different application scenarios and optimization goals. Recent studies have shown that alignment algorithms are a powerful approach to resolving the aforementioned challenges. In this paper, we aim to provide a comprehensive and systematic review of alignment algorithms for MLLMs. Specifically, we explore four key aspects: (1) the application scenarios covered by alignment algorithms, including general image understanding, multi-image, video, and audio, and extended multimodal applications; (2) the core factors in constructing alignment datasets, including data sources, model responses, and preference annotations; (3) the benchmarks used to evaluate alignment algorithms; and (4) a discussion of potential future directions for the development of alignment algorithms. This work seeks to help researchers organize current advancements in the field and inspire better alignment methods. The project page of this paper is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Alignment.

MusicInfuser: Making Video Diffusion Listen and Dance

Susung Hong,Ira Kemelmacher-Shlizerman,Brian Curless,Steven M. Seitz

Task: Generate high-quality dance videos synchronized to a specified music track.

Motivation: Show how existing video diffusion models can be adapted to musical inputs by introducing lightweight music-video cross-attention and a low-rank adapter, rather than designing and training a new multimodal audio-video model.

Details

Method: Introduce lightweight music-video cross-attention and a low-rank adapter, fine-tuning only on dance videos. Result: MusicInfuser achieves high-quality music-driven video generation while preserving the flexibility and generative capabilities of the underlying models. Conclusion: An evaluation framework based on Video-LLMs, which assesses multiple dimensions of dance generation quality, demonstrates the effectiveness of MusicInfuser. Abstract: We introduce MusicInfuser, an approach for generating high-quality dance videos that are synchronized to a specified music track. Rather than attempting to design and train a new multimodal audio-video model, we show how existing video diffusion models can be adapted to align with musical inputs by introducing lightweight music-video cross-attention and a low-rank adapter. Unlike prior work requiring motion capture data, our approach fine-tunes only on dance videos. MusicInfuser achieves high-quality music-driven video generation while preserving the flexibility and generative capabilities of the underlying models. We introduce an evaluation framework using Video-LLMs to assess multiple dimensions of dance generation quality. The project page and code are available at https://susunghong.github.io/MusicInfuser.

Efficient Domain Augmentation for Autonomous Driving Testing Using Diffusion Models

Luciano Baresi,Davide Yi Xian Hu,Andrea Stocco,Paolo Tonella

Task: Explore integrating generative AI techniques with physics-based simulators to enhance system-level testing of Autonomous Driving Systems (ADS).

Motivation: Simulation-based testing is widely used to assess ADS reliability, but its effectiveness is limited by the operational design domain (ODD) conditions available in the simulator.

Details

Method: Evaluate the effectiveness and computational overhead of three diffusion-based generative strategies (instruction-editing, inpainting, and inpainting with refinement), using an automated detector based on semantic segmentation to ensure the semantic preservation and realism of the generated images. Result: Diffusion models help increase ODD coverage for system-level ADS testing; the automated semantic validator reaches a false-positive rate as low as 3%; and new ADS system failures are identified. Conclusion: Integrating generative AI with physics-based simulators effectively enhances system-level ADS testing, raising ODD coverage and revealing new system failures. Abstract: Simulation-based testing is widely used to assess the reliability of Autonomous Driving Systems (ADS), but its effectiveness is limited by the operational design domain (ODD) conditions available in such simulators. To address this limitation, in this work, we explore the integration of generative artificial intelligence techniques with physics-based simulators to enhance ADS system-level testing. Our study evaluates the effectiveness and computational overhead of three generative strategies based on diffusion models, namely instruction-editing, inpainting, and inpainting with refinement. Specifically, we assess these techniques' capabilities to produce augmented simulator-generated images of driving scenarios representing new ODDs. We employ a novel automated detector for invalid inputs based on semantic segmentation to ensure semantic preservation and realism of the neural generated images. We then perform system-level testing to evaluate the ADS's generalization ability to newly synthesized ODDs. Our findings show that diffusion models help increase the ODD coverage for system-level testing of ADS. Our automated semantic validator achieved a percentage of false positives as low as 3%, retaining the correctness and quality of the generated images for testing. Our approach successfully identified new ADS system failures before real-world testing.
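One way a segmentation-based validity check could work is sketched below. The abstract does not describe the detector's internals, so the function name, the choice of classes to preserve, and the agreement threshold are all hypothetical. The idea is that an augmented image is accepted only if pixels of driving-relevant classes keep their labels after augmentation.

```python
def is_valid_augmentation(seg_before, seg_after, preserved_classes, min_agreement=0.9):
    """Accept an augmented image if driving-relevant pixels keep their class.

    seg_before, seg_after: H x W nested lists of class labels, assumed to come
    from running the same segmentation model on the original and augmented image.
    preserved_classes and min_agreement are illustrative choices.
    """
    kept = 0
    total = 0
    for row_before, row_after in zip(seg_before, seg_after):
        for label_before, label_after in zip(row_before, row_after):
            if label_before in preserved_classes:
                total += 1
                if label_after == label_before:
                    kept += 1
    # With no relevant pixels there is nothing to violate.
    return total == 0 or kept / total >= min_agreement


original = [["road", "road"], ["sky", "car"]]
new_weather = [["road", "road"], ["cloud", "car"]]   # only the background changed
broken_road = [["grass", "road"], ["sky", "car"]]    # part of the road was overwritten

ok = is_valid_augmentation(original, new_weather, {"road", "car"})
bad = is_valid_augmentation(original, broken_road, {"road", "car"})
```

A check of this kind lets the background change freely to represent a new ODD while rejecting images in which the road layout or other vehicles were altered.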

AUTV: Creating Underwater Video Datasets with Pixel-wise Annotations

Quang Trung Truong,Wong Yuk Kwan,Duc Thanh Nguyen,Binh-Son Hua,Sai-Kit Yeung

Task: Propose AUTV, a framework for synthesizing marine video data with pixel-wise annotations.

Motivation: Under the dynamic marine environment and camera motion, existing training-free video generation techniques often produce results with motion interruptions and misalignments.

Details

Method: Propose the AUTV framework and use it to construct two video datasets, UTV and SUTV. UTV contains 2,000 video-text pairs, and SUTV contains 10,000 synthetic videos with segmentation masks for marine objects. Result: UTV provides diverse underwater videos with comprehensive annotations covering appearance, texture, camera intrinsics, lighting, and animal behavior. SUTV can be used to improve underwater downstream tasks such as video inpainting and video object segmentation. Conclusion: AUTV addresses the problems of existing techniques in marine video generation and shows its potential for video inpainting and video object segmentation. Abstract: Underwater video analysis, hampered by the dynamic marine environment and camera motion, remains a challenging task in computer vision. Existing training-free video generation techniques, learning motion dynamics on a frame-by-frame basis, often produce poor results with noticeable motion interruptions and misalignments. To address these issues, we propose AUTV, a framework for synthesizing marine video data with pixel-wise annotations. We demonstrate the effectiveness of this framework by constructing two video datasets, namely UTV, a real-world dataset comprising 2,000 video-text pairs, and SUTV, a synthetic video dataset including 10,000 videos with segmentation masks for marine objects. UTV provides diverse underwater videos with comprehensive annotations including appearance, texture, camera intrinsics, lighting, and animal behavior. SUTV can be used to improve underwater downstream tasks, which are demonstrated in video inpainting and video object segmentation.

Conditional Electrocardiogram Generation Using Hierarchical Variational Autoencoders

Ivan Sviridov,Konstantin Egorov

Task: Propose cNVAE-ECG, an ECG signal generation model based on a conditional Nouveau VAE, for generating high-resolution ECG signals with multiple pathologies.

Motivation: Cardiovascular diseases (CVDs) are the leading cause of death worldwide, and automatic ECG analysis can improve the availability, speed, and accuracy of CVD diagnostics. However, obtaining sufficient training data is the main difficulty in developing machine learning models.

Details

Method: Generate ECG signals with a conditional Nouveau VAE model and compare it with existing GAN-based models. Result: The proposed cNVAE-ECG model performs well on several practical downstream tasks, including transfer learning scenarios where AUROC increases by up to 2%, surpassing GAN-based competitors. Conclusion: cNVAE-ECG generates high-resolution ECG signals effectively and can help improve the accuracy and availability of CVD diagnostics. Abstract: Cardiovascular diseases (CVDs) are disorders impacting the heart and circulatory system. These disorders are the foremost and continuously escalating cause of mortality worldwide. One of the main tasks when working with CVDs is analyzing and identifying pathologies on a 12-lead electrocardiogram (ECG) with a standard 10-second duration. Using machine learning (ML) in automatic ECG analysis increases CVD diagnostics' availability, speed, and accuracy. However, the most significant difficulty in developing ML models is obtaining a sufficient training dataset. Due to the limitations of medical data usage, such as expensiveness, errors, the ambiguity of labels, imbalance of classes, and privacy issues, utilizing synthetic samples depending on specific pathologies bypasses these restrictions and improves algorithm quality. Existing solutions for the conditional generation of ECG signals are mainly built on Generative Adversarial Networks (GANs), and only a few papers consider the architectures based on Variational Autoencoders (VAEs), showing comparable results in recent works. This paper proposes the publicly available conditional Nouveau VAE model for ECG signal generation (cNVAE-ECG), which produces high-resolution ECGs with multiple pathologies. We provide an extensive comparison of the proposed model on various practical downstream tasks, including transfer learning scenarios showing an area under the receiver operating characteristic (AUROC) increase up to 2% surpassing GAN-like competitors.

Multimodal Lead-Specific Modeling of ECG for Low-Cost Pulmonary Hypertension Assessment

Mohammod N. I. Suvon,Shuo Zhou,Prasun C. Tripathi,Wenrui Fan,Samer Alabed,Bishesh Khanal,Venet Osmani,Andrew J. Swift,Chen Chen,Haiping Lu

Task: Propose a multimodal variational autoencoder (LS-EMVAE) based on 12-lead and 6-lead ECG for detecting and phenotyping pulmonary hypertension (PH).

Motivation: In low- and middle-income countries (LMICs), pulmonary hypertension (PH) is often missed because advanced diagnostic tools are scarce. Existing machine learning methods mainly target resource-limited areas and overlook rural areas with no diagnostic tools at all. The 6-lead ECG (6L-ECG) is a cheaper, portable alternative, but its clinical value for PH detection has not been well established.

Details

Method: Proposes the Lead-Specific Electrocardiogram Multimodal Variational Autoencoder (LS-EMVAE), which is pre-trained on large-scale 12-lead ECG data and fine-tuned on task-specific data (12L-ECG or 6L-ECG). The model treats each lead of the 12-lead ECG as a separate modality and introduces a hierarchical expert composition (Mixture and Product of Experts) for adaptive latent feature fusion. Result: LS-EMVAE outperforms existing baselines in both the 12-lead and 6-lead ECG settings, and 6-lead ECG performs comparably to 12-lead ECG, showing its potential for global PH screening. Conclusion: LS-EMVAE performs well in both settings, and the comparable performance of 6-lead ECG makes it an equitable solution for areas without diagnostic tools. Abstract: Pulmonary hypertension (PH) is frequently underdiagnosed in low- and middle-income countries (LMICs) primarily due to the scarcity of advanced diagnostic tools. Several studies in PH have applied machine learning to low-cost diagnostic tools like 12-lead ECG (12L-ECG), but they mainly focus on areas with limited resources, overlooking areas with no diagnostic tools, such as rural primary healthcare in LMICs. Recent studies have shown the effectiveness of 6-lead ECG (6L-ECG), as a cheaper and portable alternative in detecting various cardiac conditions, but its clinical value for PH detection is not well proved. Furthermore, existing methods treat 12L-/6L-ECG as a single modality, capturing only shared features while overlooking lead-specific features essential for identifying complex cardiac hemodynamic changes. In this paper, we propose Lead-Specific Electrocardiogram Multimodal Variational Autoencoder (LS-EMVAE), a model pre-trained on large-population 12L-ECG data and fine-tuned on task-specific data (12L-ECG or 6L-ECG). LS-EMVAE models each 12L-ECG lead as a separate modality and introduces a hierarchical expert composition using Mixture and Product of Experts for adaptive latent feature fusion between lead-specific and shared features. Unlike existing approaches, LS-EMVAE makes better predictions on both 12L-ECG and 6L-ECG at inference, making it an equitable solution for areas with limited or no diagnostic tools. 
We pre-trained LS-EMVAE on 800,000 publicly available 12L-ECG samples and fine-tuned it for two tasks: 1) PH detection and 2) phenotyping pre-/post-capillary PH, on in-house datasets of 892 and 691 subjects across 12L-ECG and 6L-ECG settings. Extensive experiments show that LS-EMVAE outperforms existing baselines in both ECG settings, while 6L-ECG achieves performance comparable to 12L-ECG, unlocking its potential for global PH screening in areas without diagnostic tools.
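
The abstract names Mixture of Experts and Product of Experts as the fusion operators but does not spell them out. As a minimal sketch of the two standard operators on one-dimensional Gaussian latents (an assumption; how LS-EMVAE composes them hierarchically across leads is not described here, and the function names are illustrative):

```python
def product_of_experts(means, variances):
    """Fuse 1-D Gaussian experts by multiplying their densities.

    The product of Gaussians is Gaussian: its precision is the sum of the
    expert precisions and its mean is the precision-weighted mean.
    """
    precisions = [1.0 / v for v in variances]
    fused_var = 1.0 / sum(precisions)
    fused_mean = fused_var * sum(m * p for m, p in zip(means, precisions))
    return fused_mean, fused_var


def mixture_of_experts(means, variances, weights=None):
    """Mean and variance of a weighted mixture of 1-D Gaussian experts.

    A mixture is not Gaussian in general; this returns only its first two
    moments. Uniform weights are an assumption when none are given.
    """
    n = len(means)
    weights = weights or [1.0 / n] * n
    mix_mean = sum(w * m for w, m in zip(weights, means))
    # Law of total variance: expected variance plus variance of the means.
    mix_var = sum(
        w * (v + (m - mix_mean) ** 2)
        for w, m, v in zip(weights, means, variances)
    )
    return mix_mean, mix_var
```

The product sharpens the estimate when experts agree (variance shrinks), while the mixture keeps disagreement visible as extra variance, which is one common motivation for combining the two.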

Robust Detection of Extremely Thin Lines Using 0.2mm Piano Wire

Jisoo Hong,Youngjin Jung,Jihwan Bae,Seungho Song,Sung-Woo Kang

Task: Develop an algorithm that detects a reference line inside an elevator shaft in order to determine the position of an automated installation robot.

Motivation: Improve the positioning accuracy of automated installation robots in elevator shafts, even in the presence of noise and other interfering factors.

Details

Method: Images are processed with Gaussian blurring, a sharpening filter, an embossing filter, and the Fourier Transform, followed by Canny edge detection and the Hough Transform. The reference line is extracted by averaging the x-coordinates of the lines detected by the Hough Transform. Result: The Fourier Transform preprocessing approach (FCH) achieved the highest detection rate on the LtoL, LtoR, and RtoL datasets, while the approach using Gaussian blurring and a sharpening filter (GSCH) performed best on the RtoR dataset. Conclusion: The study proposes a reference line detection algorithm that enables precise position calculation and control of automated robots in elevator shaft installation. The method may also apply in confined working spaces. Future work aims to develop a line detection algorithm with machine learning-based hyperparameter tuning. Abstract: This study developed an algorithm capable of detecting a reference line (a 0.2 mm thick piano wire) to accurately determine the position of an automated installation robot within an elevator shaft. A total of 3,245 images were collected from the experimental tower of H Company, the leading elevator manufacturer in South Korea, and the detection performance was evaluated using four experimental approaches (GCH, GSCH, GECH, FCH). During the initial image processing stage, Gaussian blurring, sharpening filter, embossing filter, and Fourier Transform were applied, followed by Canny Edge Detection and Hough Transform. Notably, the method was developed to accurately extract the reference line by averaging the x-coordinates of the lines detected through the Hough Transform. This approach enabled the detection of the 0.2 mm thick piano wire with high accuracy, even in the presence of noise and other interfering factors (e.g., concrete cracks inside the elevator shaft or safety bars for filming equipment). The experimental results showed that Experiment 4 (FCH), which utilized Fourier Transform in the preprocessing stage, achieved the highest detection rate for the LtoL, LtoR, and RtoL datasets. Experiment 2 (GSCH), which applied Gaussian blurring and a sharpening filter, demonstrated superior detection performance on the RtoR dataset. This study proposes a reference line detection algorithm that enables precise position calculation and control of automated robots in elevator shaft installation. Moreover, the developed method shows potential for applicability even in confined working spaces. 
Future work aims to develop a line detection algorithm equipped with machine learning-based hyperparameter tuning capabilities.
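
The final step described above, averaging the x-coordinates of the Hough lines, can be sketched in a few lines. This is an illustration only: it assumes the wire appears roughly vertical in the image and that line segments are already available as `(x1, y1, x2, y2)` tuples; the function name and the `max_tilt_deg` tolerance are hypothetical and not taken from the paper.

```python
import math


def reference_line_x(segments, max_tilt_deg=5.0):
    """Average the x-coordinates of near-vertical line segments.

    segments: iterable of (x1, y1, x2, y2) tuples, e.g. Hough output.
    max_tilt_deg: hypothetical tolerance from vertical, in degrees.
    Returns the mean x position, or None when no segment qualifies.
    """
    xs = []
    for x1, y1, x2, y2 in segments:
        # Angle between the segment and the vertical image axis.
        tilt = math.degrees(math.atan2(abs(x2 - x1), abs(y2 - y1)))
        if tilt <= max_tilt_deg:
            xs.append((x1 + x2) / 2.0)
    return sum(xs) / len(xs) if xs else None
```

Averaging several detections of the same thin wire damps the pixel-level jitter of any single Hough line, and the tilt filter discards distractors such as horizontal safety bars.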

Periodontal Bone Loss Analysis via Keypoint Detection With Heuristic Post-Processing

Ryan Banks,Vishal Thengane,María Eugenia Guerrero,Nelly Maria García-Madueño,Yunpeng Li,Hongying Tang,Akhilanand Chaurasia

Task: Evaluate the YOLOv8-pose deep learning model for automatically identifying localised periodontal bone loss landmarks, conditions, and staging.

Motivation: Manually calculating percentage periodontal bone loss is sometimes imprecise and time consuming, so an automated method is needed to improve accuracy and efficiency.

Details

Method: YOLOv8-pose is fine-tuned on 193 annotated periapical radiographs. The authors propose a keypoint detection metric, PRCK, and a heuristic post-processing module that adjusts keypoint predictions. Result: The model detects bone loss keypoints, tooth boxes, and alveolar ridge resorption adequately but performs poorly on detached periodontal ligament and furcation involvement. With post-processing, it reaches a PRCK 0.25 of 0.726, a PRCK 0.05 of 0.401, an mAP 0.5 of 0.715 for tooth object detection, a mesial Dice score of 0.593 for periodontal staging, and a Dice score of 0.280 for furcation involvement. Conclusion: The study provides a stage-agnostic approach to periodontal disease detection. The PRCK metric allows accurate evaluation of keypoints in dental domains, and the post-processing module adjusts predicted keypoints correctly but depends on a minimum prediction quality from the pose detection and segmentation models. Abstract: Calculating percentage bone loss is a critical test for periodontal disease staging but is sometimes imprecise and time consuming when manually calculated. This study evaluates the application of a deep learning keypoint and object detection model, YOLOv8-pose, for the automatic identification of localised periodontal bone loss landmarks, conditions and staging. YOLOv8-pose was fine-tuned on 193 annotated periapical radiographs. We propose a keypoint detection metric, Percentage of Relative Correct Keypoints (PRCK), which normalises the metric to the average tooth size of teeth in the image. We propose a heuristic post-processing module that adjusts certain keypoint predictions to align with the edge of the related tooth, using a supporting instance segmentation model trained on an open source auxiliary dataset. The model can sufficiently detect bone loss keypoints, tooth boxes, and alveolar ridge resorption, but has insufficient performance at detecting detached periodontal ligament and furcation involvement. The model with post-processing demonstrated a PRCK 0.25 of 0.726 and PRCK 0.05 of 0.401 for keypoint detection, mAP 0.5 of 0.715 for tooth object detection, mesial dice score of 0.593 for periodontal staging, and dice score of 0.280 for furcation involvement. Our annotation methodology provides a stage agnostic approach to periodontal disease detection, by ensuring most keypoints are present for each tooth in the image, allowing small imbalanced datasets. Our PRCK metric allows accurate evaluation of keypoints in dental domains. 
Our post-processing module adjusts predicted keypoints correctly but is dependent on a minimum quality of prediction by the pose detection and segmentation models. Code: https://anonymous.4open.science/r/Bone-Loss-Keypoint-Detection-Code. Dataset: https://bit.ly/4hJ3aE7.
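
The abstract defines PRCK only as a correct-keypoint percentage normalised to the average tooth size in the image. One plausible reading, sketched below, counts a keypoint as correct when its error is at most `alpha` times the mean tooth size. How tooth size is measured (for example a bounding-box diagonal) and the function signature are assumptions, not the authors' definition.

```python
import math


def prck(pred, truth, tooth_sizes, alpha):
    """Fraction of keypoints within alpha * mean tooth size of the truth.

    pred, truth: lists of (x, y) keypoints in matching order.
    tooth_sizes: one size per tooth in the image, in pixels (the exact
                 size measure is an assumption of this sketch).
    alpha: threshold fraction, e.g. 0.25 or 0.05.
    """
    if not pred or len(pred) != len(truth):
        raise ValueError("pred and truth must be non-empty and equal length")
    scale = sum(tooth_sizes) / len(tooth_sizes)
    hits = sum(
        1
        for (px, py), (tx, ty) in zip(pred, truth)
        if math.hypot(px - tx, py - ty) <= alpha * scale
    )
    return hits / len(pred)
```

Under this reading, PRCK 0.05 is a much stricter test than PRCK 0.25, which is consistent with the lower score reported at the tighter threshold.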

Rujia Wang,Xiangbo Gao,Hao Xiang,Runsheng Xu,Zhengzhong Tu

Task: Enhance multi-agent perception by sharing sensing information so that agents cooperatively perform robot perception tasks.

Motivation: Existing collaborative perception systems transmit intermediate feature maps (such as bird's-eye view representations) that contain a large amount of non-critical information, leading to high communication bandwidth requirements.

Details

Method: Proposes CoCMT, an object-query-based collaboration framework that optimizes communication bandwidth by selectively extracting and transmitting essential features. It introduces the Efficient Query Transformer (EQFormer) to fuse multi-agent object queries and applies synergistic deep supervision to strengthen positive reinforcement between stages. Result: Experiments on the OPV2V and V2V4Real datasets show that CoCMT outperforms existing methods while greatly reducing communication needs. On V2V4Real, the model (Top-50 object queries) needs only 0.416 Mb of bandwidth, 83 times less than existing methods, while improving AP70 by 1.1%. Conclusion: CoCMT enables practical collaborative perception deployment in bandwidth-constrained environments without sacrificing detection accuracy. Abstract: Multi-agent collaborative perception enhances each agent's perceptual capabilities by sharing sensing information to cooperatively perform robot perception tasks. This approach has proven effective in addressing challenges such as sensor deficiencies, occlusions, and long-range perception. However, existing representative collaborative perception systems transmit intermediate feature maps, such as bird's-eye view (BEV) representations, which contain a significant amount of non-critical information, leading to high communication bandwidth requirements. To enhance communication efficiency while preserving perception capability, we introduce CoCMT, an object-query-based collaboration framework that optimizes communication bandwidth by selectively extracting and transmitting essential features. Within CoCMT, we introduce the Efficient Query Transformer (EQFormer) to effectively fuse multi-agent object queries and implement a synergistic deep supervision to enhance the positive reinforcement between stages, leading to improved overall performance. Experiments on OPV2V and V2V4Real datasets show CoCMT outperforms state-of-the-art methods while drastically reducing communication needs. On V2V4Real, our model (Top-50 object queries) requires only 0.416 Mb bandwidth, 83 times less than SOTA methods, while improving AP70 by 1.1 percent. This efficiency breakthrough enables practical collaborative perception deployment in bandwidth-constrained environments without sacrificing detection accuracy.

Feasibility study for reconstruction of knee MRI from one corresponding X-ray via CNN

Zhe Wang,Aladine Chetouani,Rachid Jennane

Task: Propose a deep-learning-based method for generating MRI images from a corresponding X-ray image.

Motivation: X-ray is an inexpensive and widely available medical imaging technique and is widely used. With advances in medical technology, MRI has become a supplementary option for diagnosing knee osteoarthritis (KOA).

Details

Method: The hidden variables of a Convolutional Auto-Encoder (CAE) model are used as inputs to a generator model that produces 3D MRI images. Result: 3D MRI images are generated. Conclusion: The proposed method can generate MRI images from X-ray images, opening new possibilities for medical diagnosis. Abstract: Generally, X-ray, as an inexpensive and popular medical imaging technique, is widely chosen by medical practitioners. With the development of medical technology, Magnetic Resonance Imaging (MRI), an advanced medical imaging technique, has already become a supplementary diagnostic option for the diagnosis of KOA. We propose in this paper a deep-learning-based approach for generating MRI from one corresponding X-ray. Our method uses the hidden variables of a Convolutional Auto-Encoder (CAE) model, trained for reconstructing X-ray images, as inputs of a generator model to provide 3D MRI.

MSWAL: 3D Multi-class Segmentation of Whole Abdominal Lesions Dataset

Zhaodong Wu,Qiaochu Zhao,Ming Hu,Yulong Li,Haochen Xue,Kang Dang,Zhengyong Jiang,Angelos Stefanidis,Qiufeng Wang,Imran Razzak,Zongyuan Ge,Junjun He,Yu Qiao,Zhong Zheng,Feilong Tang,Jionglong Su

Task: Introduce the MSWAL dataset and the Inception nnU-Net framework for multi-class segmentation of abdominal lesions.

Motivation: Existing deep learning models are limited in abdominal lesion segmentation, mainly because their training datasets lack annotations for typical abdominal lesions.

Details

Method: Introduces the MSWAL dataset, containing CT scans from 694 patients, and proposes the Inception nnU-Net framework, which combines an Inception module with the nnU-Net architecture. Result: MSWAL shows strong robustness and generalizability, and Inception nnU-Net significantly improves segmentation performance on MSWAL. Conclusion: The MSWAL dataset and the Inception nnU-Net framework improve the accuracy and generalizability of abdominal lesion segmentation. Abstract: With the significantly increasing incidence and prevalence of abdominal diseases, there is a need to embrace greater use of new innovations and technology for the diagnosis and treatment of patients. Although deep-learning methods have notably been developed to assist radiologists in diagnosing abdominal diseases, existing models have a restricted ability to segment common lesions in the abdomen due to missing annotations for typical abdominal pathologies in their training datasets. To address the limitation, we introduce MSWAL, the first 3D Multi-class Segmentation of the Whole Abdominal Lesions dataset, which broadens the coverage of various common lesion types, such as gallstones, kidney stones, liver tumors, kidney tumors, pancreatic cancer, liver cysts, and kidney cysts. With CT scans collected from 694 patients (191,417 slices) of different genders across various scanning phases, MSWAL demonstrates strong robustness and generalizability. The transfer learning experiment from MSWAL to two public datasets, LiTS and KiTS, effectively demonstrates consistent improvements, with Dice Similarity Coefficient (DSC) increase of 3.00% for liver tumors and 0.89% for kidney tumors, demonstrating that the comprehensive annotations and diverse lesion types in MSWAL facilitate effective learning across different domains and data distributions. Furthermore, we propose Inception nnU-Net, a novel segmentation framework that effectively integrates an Inception module with the nnU-Net architecture to extract information from different receptive fields, achieving significant enhancement in both voxel-level DSC and region-level F1 compared to the cutting-edge public algorithms on MSWAL. 
Our dataset will be released after being accepted, and the code is publicly released at https://github.com/tiuxuxsh76075/MSWAL-.
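
The Dice Similarity Coefficient reported above has a short standard definition: twice the overlap divided by the total size of the two masks. A minimal sketch follows, assuming masks are given as collections of voxel indices; the function name and the empty-mask convention are choices made for this illustration.

```python
def dice(pred, truth):
    """Dice Similarity Coefficient between two sets of voxel indices.

    DSC = 2 * |pred ∩ truth| / (|pred| + |truth|)
    """
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # convention assumed here: two empty masks agree
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))
```

For example, two three-voxel masks that share two voxels score 2 * 2 / (3 + 3), or about 0.667.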

Online Signature Verification based on the Lagrange formulation with 2D and 3D robotic models

Moises Diaz,Miguel A. Ferrer,Juan M. Gil,Rafael Rodriguez,Peirong Zhang,Lianwen Jin

Task: Propose a new method for online signature verification based on dynamic features of online signatures.

Motivation: Signature data captured by a digitizer usually includes timing and pressure information, but inferring additional information about the writer's arm pose, kinematics, and dynamics is challenging.

Details

Method: Sequences of generalized coordinates and torques for 2D and 3D robotic arm models are inferred through a Lagrangian formulation, combining kinematic and dynamic features. Result: These features are highly effective for online automatic signature verification and achieve state-of-the-art results when integrated into deep learning models. Conclusion: The proposed dynamics-based features significantly improve online signature verification. Abstract: Online Signature Verification commonly relies on function-based features, such as time-sampled horizontal and vertical coordinates, as well as the pressure exerted by the writer, obtained through a digitizer. Although inferring additional information about the writer's arm pose, kinematics, and dynamics based on digitizer data can be useful, it constitutes a challenge. In this paper, we tackle this challenge by proposing a new set of features based on the dynamics of online signatures. These new features are inferred through a Lagrangian formulation, obtaining the sequences of generalized coordinates and torques for 2D and 3D robotic arm models. By combining kinematic and dynamic robotic features, our results demonstrate their significant effectiveness for online automatic signature verification and achieving state-of-the-art results when integrated into deep learning models.
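
To make "generalized coordinates for a 2D robotic arm model" concrete, the sketch below recovers the two joint angles of a planar two-link arm from a pen position using textbook inverse kinematics. It is an illustration under stated assumptions: the paper's arm models, link lengths, and the Lagrangian torque computation are not given in the abstract and are not reproduced here, and the function names are hypothetical.

```python
import math


def two_link_ik(x, y, l1, l2):
    """Joint angles (q1, q2) of a planar 2-link arm whose tip is at (x, y).

    Returns the solution with q2 >= 0; raises ValueError if the target
    is out of reach for link lengths l1 and l2.
    """
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0:
        raise ValueError("target out of reach")
    q2 = math.acos(c2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


def two_link_fk(q1, q2, l1, l2):
    """Forward kinematics: tip position for joint angles (q1, q2)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y
```

Applying `two_link_ik` to each time-sampled pen coordinate yields a joint-angle sequence; joint torques would then follow from the arm's equations of motion, which this sketch leaves out.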

Subgroup Performance of a Commercial Digital Breast Tomosynthesis Model for Breast Cancer Detection

Beatrice Brown-Mulry,Rohan Satya Isaac,Sang Hyup Lee,Ambika Seth,KyungJee Min,Theo Dapamede,Frank Li,Aawez Mansuri,MinJae Woo,Christian Allison Fauria-Robinson,Bhavna Paryani,Judy Wawira Gichoya,Hari Trivedi

Task: Evaluate the performance of the Lunit INSIGHT DBT model on digital breast tomosynthesis (DBT) imaging.

Motivation: Although research shows that AI models for mammography can improve breast cancer screening outcomes, no detailed subgroup evaluation has assessed the strengths and weaknesses of commercial models for DBT imaging.

Details

Method: A detailed evaluation of the Lunit INSIGHT DBT model on 163,449 screening mammography exams from the Emory Breast Imaging Dataset (EMBED), analyzing performance across demographic, imaging, and pathologic subgroups. Result: The model achieved an overall AUC of 0.91 (95% CI: 0.90-0.92), a precision of 0.08 (95% CI: 0.08-0.08), and a recall of 0.73 (95% CI: 0.71-0.76). Performance was significantly lower for non-invasive cancers, calcifications, and dense breast tissue. Conclusion: The results highlight the need for detailed evaluation of model characteristics and for vigilance when deploying new tools clinically. Abstract: While research has established the potential of AI models for mammography to improve breast cancer screening outcomes, there have not been any detailed subgroup evaluations performed to assess the strengths and weaknesses of commercial models for digital breast tomosynthesis (DBT) imaging. This study presents a granular evaluation of the Lunit INSIGHT DBT model on a large retrospective cohort of 163,449 screening mammography exams from the Emory Breast Imaging Dataset (EMBED). Model performance was evaluated in a binary context with various negative exam types (162,081 exams) compared against screen-detected cancers (1,368 exams) as the positive class. The analysis was stratified across demographic, imaging, and pathologic subgroups to identify potential disparities. The model achieved an overall AUC of 0.91 (95% CI: 0.90-0.92) with a precision of 0.08 (95% CI: 0.08-0.08), and a recall of 0.73 (95% CI: 0.71-0.76). Performance was found to be robust across demographics, but cases with non-invasive cancers (AUC: 0.85, 95% CI: 0.83-0.87), calcifications (AUC: 0.80, 95% CI: 0.78-0.82), and dense breast tissue (AUC: 0.90, 95% CI: 0.88-0.91) were associated with significantly lower performance compared to other groups. These results highlight the need for detailed evaluation of model characteristics and vigilance in considering adoption of new tools for clinical deployment.
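
A precision of 0.08 alongside an AUC of 0.91 can look contradictory until the class imbalance is worked through. The sketch below back-computes approximate confusion-matrix counts from the figures in the abstract (1,368 positives, 162,081 negatives, recall 0.73, precision 0.08). Because those inputs are rounded, the outputs are rough estimates rather than reported values, and the function name is illustrative.

```python
def implied_counts(n_pos, n_neg, recall, precision):
    """Approximate confusion-matrix counts implied by summary metrics."""
    tp = recall * n_pos            # recall = TP / (TP + FN)
    fp = tp / precision - tp       # precision = TP / (TP + FP)
    return {
        "tp": round(tp),
        "fn": round(n_pos - tp),
        "fp": round(fp),
        "false_positive_rate": fp / n_neg,
    }


# Inputs taken from the abstract; results are estimates only.
counts = implied_counts(n_pos=1368, n_neg=162081, recall=0.73, precision=0.08)
```

Under these inputs the model flags roughly a thousand true cancers against about eleven thousand false alarms, a false-positive rate near 7%. With positives under 1% of exams, even a modest false-positive rate dominates the flagged set, which is why precision is low.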

Next-Scale Autoregressive Models are Zero-Shot Single-Image Object View Synthesizers

Shiran Yuan,Hao Zhao

Task: Propose ArchonView, a novel view synthesis method trained only on 3D rendering data with no 2D pretraining.

Motivation: Existing diffusion-based novel view synthesis methods require pretrained 2D diffusion checkpoints, which limits their scalability.

Details

Method: Combines global (pose-augmented semantics) and local (multi-scale hierarchical encodings) conditioning in a model built on the next-scale autoregression paradigm. Result: ArchonView significantly exceeds existing methods, remains robust for difficult camera poses, and runs inference several times faster than diffusion models. Conclusion: ArchonView shows that high-quality novel view synthesis is possible without 2D pretraining, and that performance scales with model and dataset size. Abstract: Methods based on diffusion backbones have recently revolutionized novel view synthesis (NVS). However, those models require pretrained 2D diffusion checkpoints (e.g., Stable Diffusion) as the basis for geometrical priors. Since such checkpoints require exorbitant amounts of data and compute to train, this greatly limits the scalability of diffusion-based NVS models. We present Next-Scale Autoregression Conditioned by View (ArchonView), a method that significantly exceeds state-of-the-art methods despite being trained from scratch with 3D rendering data only and no 2D pretraining. We achieve this by incorporating both global (pose-augmented semantics) and local (multi-scale hierarchical encodings) conditioning into a backbone based on the next-scale autoregression paradigm. Our model also exhibits robust performance even for difficult camera poses where previous methods fail, and is several times faster in inference speed compared to diffusion. We experimentally verify that performance scales with model and dataset size, and conduct extensive demonstration of our method's synthesis quality across several tasks. Our code is open-sourced at https://github.com/Shiran-Yuan/ArchonView.

A Convex formulation for linear discriminant analysis

Sai Vijay Kumar Surineela,Prathyusha Kanakamalla,Harigovind Harikumar,Tomojit Ghosh

Task: Propose a supervised dimensionality reduction technique called Convex Linear Discriminant Analysis (ConvexLDA).

Motivation: Address the matrix inversion problems and high computational cost that conventional linear discriminant analysis (LDA) can encounter on high-dimensional data such as RNA-seq data.

Details

Method: Optimizes a multi-objective cost function that balances two complementary terms: one pulls samples toward their class centroid, and the other pushes classes apart by maximizing the hyperellipsoid scattering volume of the class centroids. Result: Experiments show that ConvexLDA outperforms several popular LDA-based methods on a range of high-dimensional biological and image data sets. Conclusion: The convex cost function ensures global optimality, which improves the reliability of the learned embedding, and the method requires no matrix inversion, making it more computationally efficient. Abstract: We present a supervised dimensionality reduction technique called Convex Linear Discriminant Analysis (ConvexLDA). The proposed model optimizes a multi-objective cost function by balancing two complementary terms. The first term pulls the samples of a class towards its centroid by minimizing a sample's distance from its class-centroid in low dimensional space. The second term pushes the classes far apart by maximizing their hyperellipsoid scattering volume via the logarithm of the determinant (log det) of the outer product matrix formed by the low-dimensional class-centroids. Using the negative of the log det, we pose the final cost as a minimization problem, which balances the two terms using a hyper-parameter $\lambda$. We demonstrate that the cost function is convex. Unlike Fisher LDA, the proposed method doesn't require computing the inverse of a matrix, hence avoiding any ill-conditioned problem where data dimension is very high, e.g. RNA-seq data. ConvexLDA doesn't require pair-wise distance calculation, making it faster and more easily scalable. Moreover, the convex nature of the cost function ensures global optimality, enhancing the reliability of the learned embedding. Our experimental evaluation demonstrates that ConvexLDA outperforms several popular linear discriminant analysis (LDA)-based methods on a range of high-dimensional biological data, image data sets, etc.
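
The abstract describes the ConvexLDA cost in words only. One way to write that description down is sketched below; the notation is assumed for illustration, and the paper's exact formulation (and the form for which convexity is proved) may differ from this reading.

```latex
% Assumed notation (not from the source):
%   x_i \in \mathbb{R}^D        : sample i with class label y_i, i = 1..N
%   \mu_k                       : centroid of class k in input space, k = 1..K
%   A \in \mathbb{R}^{d \times D} : linear map to the d-dimensional embedding
%   M = [A\mu_1, \dots, A\mu_K] : low-dimensional centroids as columns
\[
  J(A) \;=\;
  \underbrace{\sum_{i=1}^{N} \bigl\lVert A x_i - A \mu_{y_i} \bigr\rVert_2^2}_{\text{pull samples to their centroid}}
  \;-\;
  \lambda \,
  \underbrace{\log\det\!\bigl( M M^{\top} \bigr)}_{\text{spread of the centroids}},
  \qquad \lambda > 0 .
\]
```

Minimizing the first term tightens each class, while the negated log det term rewards a larger scatter volume of the centroids; the hyper-parameter $\lambda$ sets the balance. No matrix inverse appears in the cost itself, which matches the abstract's contrast with Fisher LDA.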

Evaluating Global Geo-alignment for Precision Learned Autonomous Vehicle Localization using Aerial Data

Yi Yang,Xuran Zhao,H. Charles Zhao,Shumin Yuan,Samuel M. Bateman,Tiffany A. Huang,Chris Beall,Will Maddern

Task: Study how improving the alignment between aerial data and autonomous vehicle sensor data improves the performance of learned localization systems.

Motivation: Aerial and satellite map data offer significant cost reduction and better scalability for autonomous vehicles, but bring challenges such as a sensor-modality gap and a viewpoint gap.

Details

Method: Compares two data alignment methods within a factor graph framework and uses ablation studies to evaluate the effect of closely aligned ground truth on learned localization accuracy. Result: A learned localization system using the data alignment methods, evaluated on a 1600 km autonomous vehicle dataset, achieves localization error below 0.3 m and 0.5 degrees, sufficient for autonomous vehicle applications. Conclusion: Improving the alignment between aerial data and autonomous vehicle sensor data at training time is critical to the performance of learned localization systems. Abstract: Recently there has been growing interest in the use of aerial and satellite map data for autonomous vehicles, primarily due to its potential for significant cost reduction and enhanced scalability. Despite the advantages, aerial data also comes with challenges such as a sensor-modality gap and a viewpoint difference gap. Learned localization methods have shown promise for overcoming these challenges to provide precise metric localization for autonomous vehicles. Most learned localization methods rely on coarsely aligned ground truth, or implicit consistency-based methods to learn the localization task -- however, in this paper we find that improving the alignment between aerial data and autonomous vehicle sensor data at training time is critical to the performance of a learning-based localization system. We compare two data alignment methods using a factor graph framework and, using these methods, we then evaluate the effects of closely aligned ground truth on learned localization accuracy through ablation studies. Finally, we evaluate a learned localization system using the data alignment methods on a comprehensive (1600km) autonomous vehicle dataset and demonstrate localization error below 0.3m and 0.5$^{\circ}$ sufficient for autonomous vehicle applications.

Fibonacci-Net: A Lightweight CNN model for Automatic Brain Tumor Classification

Santanu Roy,Ashvath Suresh,Archit Gupta,Shubhi Tiwari,Palak Sahu,Prashant Adhikari,Yuvraj S. Shekhawat

Task: Propose a lightweight model, "Fibonacci-Net", and a novel pooling technique for automatic brain tumor classification from imbalanced magnetic resonance imaging (MRI) datasets.

Motivation: Automatic brain tumor detection from MRI datasets has attracted wide attention in the research community since the advent of convolutional neural network (CNN) models. However, the performance of conventional CNN models is limited by class imbalance.

Details

Method: Proposes a lightweight CNN in which the number of filters in each convolutional layer is chosen according to the Fibonacci series. The last two blocks use depth-wise separable convolution (DWSC) layers to considerably reduce computational complexity. Two parallel concatenations (skip connections) run from the 2nd to the 4th and from the 3rd to the 5th convolutional block. Each skip connection contains a novel Average-2Max pooling layer that produces two stacks of convolved output with slightly different statistics. Result: On the most challenging 44-class MRI dataset, Fibonacci-Net achieves 96.2% accuracy, 97.17% precision, 95.9% recall, a 96.5% F1 score, and 99.9% specificity. Conclusion: Fibonacci-Net and the novel pooling technique alleviate the class imbalance problem and perform well on multiple MRI datasets. Abstract: This research proposes a very lightweight model "Fibonacci-Net" along with a novel pooling technique, for automatic brain tumor classification from imbalanced Magnetic Resonance Imaging (MRI) datasets. Automatic brain tumor detection from MRI dataset has garnered significant attention in the research community, since the inception of Convolutional Neural Network (CNN) models. However, the performance of conventional CNN models is hindered due to class imbalance problems. The novelties of this work are as follows: (I) A lightweight CNN model is proposed in which the number of filters in different convolutional layers is chosen according to the numbers of the Fibonacci series. (II) In the last two blocks of the proposed model, depth-wise separable convolution (DWSC) layers are employed to considerably reduce the computational complexity of the model. (III) Two parallel concatenations (or, skip connections) are deployed from 2nd to 4th, and 3rd to 5th convolutional block in the proposed Fibonacci-Net. This skip connection encompasses a novel Average-2Max pooling layer that produces two stacks of convoluted output with slightly different statistics. Therefore, this parallel concatenation block works as an efficient feature augmenter inside the model, thus automatically alleviating the class imbalance problem to a certain extent. For validation, we have implemented the proposed framework on three MRI datasets which are highly class-imbalanced. (a) The first dataset has four classes, i.e., glioma tumor, meningioma tumor, pituitary tumor, and no-tumor. 
(b) Second and third MRI datasets have 15 and 44 classes respectively. Experimental results reveal that, after employing the proposed Fibonacci-Net we have achieved 96.2% accuracy, 97.17% precision, 95.9% recall, 96.5% F1 score, and 99.9% specificity on the most challenging "44-class MRI dataset".
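
The abstract states that filter counts follow the Fibonacci series but does not list them. The sketch below shows one way such a schedule could be generated; the starting point (`min_filters`) and the resulting values are hypothetical and are not taken from the paper.

```python
def fibonacci_filters(n_blocks, min_filters=16):
    """Return n_blocks consecutive Fibonacci numbers to use as filter counts.

    The sequence starts at the first Fibonacci number >= min_filters,
    which is a hypothetical lower bound chosen for this illustration.
    """
    a, b = 1, 1
    while a < min_filters:
        a, b = b, a + b
    filters = []
    for _ in range(n_blocks):
        filters.append(a)
        a, b = b, a + b
    return filters
```

With five blocks and the default bound this yields 21, 34, 55, 89, 144. Each step grows by a factor of about 1.6, a gentler widening than the doubling (32, 64, 128, ...) common in CNNs, which is one plausible reason such a schedule keeps the parameter count low.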

BG-Triangle: Bézier Gaussian Triangle for 3D Vectorization and Rendering

Minye Wu,Haizhao Dai,Kaixin Yao,Tinne Tuytelaars,Jingyi Yu

Task: Propose a novel hybrid representation, Bézier Gaussian Triangle (BG-Triangle), for resolution-independent differentiable rendering that maintains accurate shape modeling.

Motivation: Existing differentiable rendering methods struggle to preserve sharp edges because they lack explicit boundary definitions.

Details

Method: Combines Bézier triangle-based vector graphics primitives with Gaussian-based probabilistic models into the hybrid BG-Triangle representation, and uses a robust and effective discontinuity-aware rendering technique to reduce uncertainty at object boundaries. Result: Experiments show that BG-Triangle achieves rendering quality comparable to 3DGS with better boundary preservation, while using fewer primitives. Conclusion: BG-Triangle shows the benefits of vectorized graphics primitives and could bridge the gap between classic and emerging representations. Abstract: Differentiable rendering enables efficient optimization by allowing gradients to be computed through the rendering process, facilitating 3D reconstruction, inverse rendering and neural scene representation learning. To ensure differentiability, existing solutions approximate or re-formulate traditional rendering operations using smooth, probabilistic proxies such as volumes or Gaussian primitives. Consequently, they struggle to preserve sharp edges due to the lack of explicit boundary definitions. We present a novel hybrid representation, Bézier Gaussian Triangle (BG-Triangle), that combines Bézier triangle-based vector graphics primitives with Gaussian-based probabilistic models, to maintain accurate shape modeling while conducting resolution-independent differentiable rendering. We present a robust and effective discontinuity-aware rendering technique to reduce uncertainties at object boundaries. We also employ an adaptive densification and pruning scheme for efficient training while reliably handling level-of-detail (LoD) variations. Experiments show that BG-Triangle achieves rendering quality comparable to 3DGS but with superior boundary preservation. More importantly, BG-Triangle uses a much smaller number of primitives than its alternatives, showcasing the benefits of vectorized graphics primitives and the potential to bridge the gap between classic and emerging representations.

Striving for Simplicity: Simple Yet Effective Prior-Aware Pseudo-Labeling for Semi-Supervised Ultrasound Image Segmentation

Yaxiong Chen,Yujie Wang,Zixuan Zheng,Jingliang Hu,Yilei Shi,Shengwu Xiong,Xiao Xiang Zhu,Lichao Mou

Task: Propose a simple yet effective pseudo-labeling method that uses an adversarially learned shape prior to regularize medical image segmentation.

Motivation: Demand for medical ultrasound image analysis is high, and manual analysis struggles to keep pace. Automated segmentation requires large labeled datasets, which are scarce. Semi-supervised learning that uses both unlabeled and limited labeled data is a promising approach.

Details

Method: Designs an encoder-twin-decoder network in which the shape prior acts as an implicit shape model, penalizing anatomically implausible but not ground-truth-deviating predictions. Result: The method achieves state-of-the-art performance on two benchmarks under different partition protocols. Conclusion: The method provides a strong baseline for future semi-supervised medical image segmentation. Abstract: Medical ultrasound imaging is ubiquitous, but manual analysis struggles to keep pace. Automated segmentation can help but requires large labeled datasets, which are scarce. Semi-supervised learning leveraging both unlabeled and limited labeled data is a promising approach. State-of-the-art methods use consistency regularization or pseudo-labeling but grow increasingly complex. Without sufficient labels, these models often latch onto artifacts or allow anatomically implausible segmentations. In this paper, we present a simple yet effective pseudo-labeling method with an adversarially learned shape prior to regularize segmentations. Specifically, we devise an encoder-twin-decoder network where the shape prior acts as an implicit shape model, penalizing anatomically implausible but not ground-truth-deviating predictions. Without bells and whistles, our simple approach achieves state-of-the-art performance on two benchmarks under different partition protocols. We provide a strong baseline for future semi-supervised medical image segmentation. Code is available at https://github.com/WUTCM-Lab/Shape-Prior-Semi-Seg.

TarPro: Targeted Protection against Malicious Image Editing

Kaixin Shen,Ruijie Quan,Jiaxu Miao,Jun Xiao,Yi Yang

Task: Propose a protection mechanism against malicious image editing that preserves normal editing.

Motivation: Existing protection methods cannot block malicious edits while preserving normal editability, and some harmful content can still be generated.

Details

Method: Proposes the TarPro framework, which achieves targeted protection through a semantic-aware constraint and a lightweight perturbation generator. Result: Experiments show that TarPro protects better than existing methods with minimal impact on normal edits. Conclusion: TarPro is a practical solution for secure image editing. Abstract: The rapid advancement of image editing techniques has raised concerns about their misuse for generating Not-Safe-for-Work (NSFW) content. This necessitates a targeted protection mechanism that blocks malicious edits while preserving normal editability. However, existing protection methods fail to achieve this balance, as they indiscriminately disrupt all edits while still allowing some harmful content to be generated. To address this, we propose TarPro, a targeted protection framework that prevents malicious edits while maintaining benign modifications. TarPro achieves this through a semantic-aware constraint that only disrupts malicious content and a lightweight perturbation generator that produces a more stable, imperceptible, and robust perturbation for image protection. Extensive experiments demonstrate that TarPro surpasses existing methods, achieving a high protection efficacy while ensuring minimal impact on normal edits. Our results highlight TarPro as a practical solution for secure and controlled image editing.

Uncertainty-Aware Global-View Reconstruction for Multi-View Multi-Label Feature Selection

Pingting Hao,Kunpeng Liu,Wanfu Gao

Task: Propose a unified model based on global-view reconstruction for feature selection and sample uncertainty awareness in multi-view multi-label learning.

Motivation: Existing multi-view multi-label learning methods usually extract information separately from the consistency part and the complementary part, which can introduce noise, and feature selection methods usually ignore sample uncertainty.

Details

Method: Reconstructs the global view through the graph structure between samples, sample confidence, and the view relationship, and incorporates the perception of sample uncertainty into the reconstruction process. Result: Experimental results show superior performance on multi-view datasets. Conclusion: Through global-view reconstruction and sample uncertainty awareness, the proposed method improves the performance and trustworthiness of multi-view multi-label learning. Abstract: In recent years, multi-view multi-label learning (MVML) has gained popularity due to its close resemblance to real-world scenarios. However, the challenge of selecting informative features to ensure both performance and efficiency remains a significant question in MVML. Existing methods often extract information separately from the consistency part and the complementary part, which may result in noise due to unclear segmentation. In this paper, we propose a unified model constructed from the perspective of global-view reconstruction. Additionally, while feature selection methods can discern the importance of features, they typically overlook the uncertainty of samples, which is prevalent in realistic scenarios. To address this, we incorporate the perception of sample uncertainty during the reconstruction process to enhance trustworthiness. Thus, the global-view is reconstructed through the graph structure between samples, sample confidence, and the view relationship. The accurate mapping is established between the reconstructed view and the label matrix. Experimental results demonstrate the superior performance of our method on multi-view datasets.

Shift, Scale and Rotation Invariant Multiple Object Detection using Balanced Joint Transform Correlator

Xi Shen,Julian Gamboa,Tabassom Hamidfar,Shamima Mitu,Selim M. Shahriar

Task: Propose a Segmented Polar Mellin Transform (SPMT) to handle multiple targets within a single frame.

Motivation: The conventional Polar Mellin Transform (PMT) cannot properly handle multiple targets in a single frame, so a new method is needed to extend its applicability.

Details

Method: Proposes the Segmented Polar Mellin Transform (SPMT) and integrates it into an opto-electronic joint transform correlator. Result: Simulations show that SPMT detects multiple objects simultaneously under various transformation conditions and discriminates clearly between matching and non-matching targets. Conclusion: SPMT extends the applicability of PMT, detecting multiple objects in a single frame with robust detection capability. Abstract: The Polar Mellin Transform (PMT) is a well-known technique that converts images into shift, scale and rotation invariant signatures for object detection using opto-electronic correlators. However, this technique cannot be properly applied when there are multiple targets in a single input. Here, we propose a Segmented PMT (SPMT) that extends this methodology for cases where multiple objects are present within the same frame. Simulations show that this SPMT can be integrated into an opto-electronic joint transform correlator to create a correlation system capable of detecting multiple objects simultaneously, presenting robust detection capabilities across various transformation conditions, with remarkable discrimination between matching and non-matching targets.
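
The scale and rotation invariance mentioned above is commonly obtained from a log-polar mapping, in which scaling and rotating a pattern become plain shifts along the two axes. The sketch below demonstrates that property on a single point. It covers only the log-polar step: shift invariance is typically obtained beforehand by working on the Fourier magnitude, and neither that step nor the segmentation that defines SPMT is shown. The function names are illustrative.

```python
import math


def log_polar(x, y):
    """Map a Cartesian point (origin at the pattern centre) to (log r, theta)."""
    return math.log(math.hypot(x, y)), math.atan2(y, x)


def scale_rotate(x, y, s, phi):
    """Scale a point by s and rotate it by phi radians about the origin."""
    c, sn = math.cos(phi), math.sin(phi)
    return s * (c * x - sn * y), s * (sn * x + c * y)


# Scaling by 2 and rotating by 0.5 rad shifts (log r, theta) by (log 2, 0.5).
lr0, th0 = log_polar(3.0, 4.0)
lr1, th1 = log_polar(*scale_rotate(3.0, 4.0, 2.0, 0.5))
shift = (lr1 - lr0, th1 - th0)
```

Because a correlator is insensitive to where the correlation peak lands, turning scale and rotation into shifts lets the same reference pattern match scaled or rotated targets.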

Binjie Liu,Lina Liu,Sanyi Zhang,Songen Gu,Yihao Zhi,Tianyi Zhu,Lei Yang,Long Ye

Task: This work focuses on full-body co-speech gesture generation.

Motivation: Existing methods typically use an autoregressive model with vector-quantized tokens for gesture generation, which causes information loss and compromises the realism of the generated gestures. To address this, the paper proposes MAG, a novel multi-modal aligned framework for high-quality and diverse co-speech gesture synthesis that does not rely on discrete tokenization.

Details

Method: Proposes MTA-VAE (a motion-text-audio-aligned variational autoencoder), which uses pre-trained WavCaps text and audio embeddings to enhance semantic and rhythmic alignment and so produce more realistic gestures. Building on this, a multimodal masked autoregressive model (MMAG) performs autoregressive modeling in continuous motion embeddings through diffusion, without vector quantization. Result: Extensive experiments on two benchmark datasets show that MAG achieves state-of-the-art performance both quantitatively and qualitatively, producing highly realistic and diverse co-speech gestures. Conclusion: MAG performs well at co-speech gesture generation, and the code will be released to facilitate future research. Abstract: This work focuses on full-body co-speech gesture generation. Existing methods typically employ an autoregressive model accompanied by vector-quantized tokens for gesture generation, which results in information loss and compromises the realism of the generated gestures. To address this, inspired by the natural continuity of real-world human motion, we propose MAG, a novel multi-modal aligned framework for high-quality and diverse co-speech gesture synthesis without relying on discrete tokenization. Specifically, (1) we introduce a motion-text-audio-aligned variational autoencoder (MTA-VAE), which leverages pre-trained WavCaps' text and audio embeddings to enhance both semantic and rhythmic alignment with motion, ultimately producing more realistic gestures. (2) Building on this, we propose a multimodal masked autoregressive model (MMAG) that enables autoregressive modeling in continuous motion embeddings through diffusion without vector quantization. To further ensure multi-modal consistency, MMAG incorporates a hybrid granularity audio-text fusion block, which serves as conditioning for the diffusion process. Extensive experiments on two benchmark datasets demonstrate that MAG achieves state-of-the-art performance both quantitatively and qualitatively, producing highly realistic and diverse co-speech gestures. The code will be released to facilitate future research.

Foundation Feature-Driven Online End-Effector Pose Estimation: A Marker-Free and Learning-Free Approach

Tianshu Wu,Jiyao Zhang,Shiqian Liang,Zhengxiao Han,Hao Dong

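Task: Proposes a Foundation feature-driven online End-Effector Pose Estimation (FEEPE) algorithm for accurate transformation estimation between camera space and robot space.

Motivation: Traditional hand-eye calibration methods require offline image collection, which limits their suitability for online self-calibration. Recent learning-based robot pose estimation methods advance online calibration but struggle with cross-robot generalization and require the robot to be fully visible.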

Details

Method: FEEPE uses pre-trained visual features to estimate 2D-3D correspondences derived from the CAD model and the target image, enabling 6D pose estimation via the PnP algorithm. To resolve ambiguities caused by partial observations and symmetry, it introduces a multi-historical key frame enhanced pose optimization algorithm that exploits temporal information to improve accuracy. Result: Compared with traditional hand-eye calibration, FEEPE enables marker-free online calibration. Unlike robot pose estimation methods, it generalizes across robots and end-effectors in a training-free manner. Conclusion: FEEPE shows strong flexibility, generalization, and performance, indicating its potential for practical use. Abstract: Accurate transformation estimation between camera space and robot space is essential. Traditional methods using markers for hand-eye calibration require offline image collection, limiting their suitability for online self-calibration. Recent learning-based robot pose estimation methods, while advancing online calibration, struggle with cross-robot generalization and require the robot to be fully visible. This work proposes a Foundation feature-driven online End-Effector Pose Estimation (FEEPE) algorithm, characterized by its training-free and cross end-effector generalization capabilities. Inspired by the zero-shot generalization capabilities of foundation models, FEEPE leverages pre-trained visual features to estimate 2D-3D correspondences derived from the CAD model and target image, enabling 6D pose estimation via the PnP algorithm. To resolve ambiguities from partial observations and symmetry, a multi-historical key frame enhanced pose optimization algorithm is introduced, utilizing temporal information for improved accuracy. Compared to traditional hand-eye calibration, FEEPE enables marker-free online calibration. Unlike robot pose estimation, it generalizes across robots and end-effectors in a training-free manner. Extensive experiments demonstrate its superior flexibility, generalization, and performance.

Image-Based Metrics in Ultrasound for Estimation of Global Speed-of-Sound

Roman Denkin,Orcun Goksel

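Task: Estimate tissue speed-of-sound (SoS) using conventional image analysis techniques and metrics.

Motivation: Conventional SoS estimation methods depend on complex physical models of acoustic propagation. This work proposes a novel and simple alternative that estimates tissue SoS with conventional image analysis techniques and metrics.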

Details

Method: Studies eleven metrics in three categories (image quality, image similarity, and multi-frame variation), testing them in numerical simulations and phantom experiments. Result: Among single-frame image quality metrics, the conventional Focus metric and the proposed Smoothed Threshold Tenengrad metric reached satisfactory accuracy on compounded images. Image quality metrics were surpassed by various image comparison metrics, whose errors stayed consistently under 8 m/s when applied to a single pair of images. Mean Square Error is a computationally efficient alternative for global estimation. Mutual Information and Correlation are robust when processing small image segments, which makes them suitable for multi-layer SoS estimation. Conclusion: Image-analysis-based SoS estimation offers a computationally efficient and data-accessible alternative, with potential extensions to layered or local SoS imaging. Abstract: Accurate speed-of-sound (SoS) estimation is crucial for ultrasound image formation, yet conventional systems often rely on an assumed value for imaging. While several methods exist for SoS estimation, they typically depend on complex physical models of acoustic propagation. We propose to leverage conventional image analysis techniques and metrics, as a novel and simple approach to estimate tissue SoS. We study eleven metrics in three categories for assessing image quality, image similarity and multi-frame variation, by testing them in numerical simulations and phantom experiments. Among single-frame image quality metrics, conventional Focus and our proposed Smoothed Threshold Tenengrad metrics achieved satisfactory accuracy, but only when applied to compounded images. Image quality metrics were largely surpassed by various image comparison metrics, which exhibited errors consistently under 8 m/s even when applied to a single pair of images. Particularly, Mean Square Error is a computationally efficient alternative for global estimation. Mutual Information and Correlation are found to be robust to processing small image segments, making them suitable, e.g., for multi-layer SoS estimation. The above metrics do not require access to raw channel data as they can operate on post-beamformed data, and in the case of image quality metrics they can operate on B-mode images, given that the SoS used for beamforming can be set to a multitude of values. These image-analysis-based SoS estimation methods offer a computationally efficient and data-accessible alternative to conventional physics-based methods, with potential extensions to layered or local SoS imaging.
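
The image-comparison idea in the abstract above can be illustrated with a minimal sketch. It assumes that each candidate beamforming SoS yields a pair of images of the same scene and that the candidate whose pair is most similar (lowest Mean Square Error) is the estimate. How the pair is acquired is not stated in the abstract, and the names `mean_square_error`, `estimate_sos`, and `image_pairs` are hypothetical, so this is an illustration rather than the authors' pipeline.

```python
def mean_square_error(img_a, img_b):
    """MSE between two equally sized 2D images given as nested lists."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(img_a, img_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def estimate_sos(image_pairs):
    """Pick the candidate SoS whose image pair has the lowest MSE.

    image_pairs: dict mapping a candidate SoS in m/s to a tuple
    (img_a, img_b) of images beamformed with that SoS (assumed input).
    """
    return min(image_pairs, key=lambda sos: mean_square_error(*image_pairs[sos]))


# Toy data: the pair beamformed at 1540 m/s agrees best.
toy_pairs = {
    1480: ([[0.0, 1.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]),
    1540: ([[0.0, 1.0], [1.0, 0.0]], [[0.0, 0.9], [1.0, 0.1]]),
    1600: ([[0.0, 1.0], [1.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]]),
}
best_sos = estimate_sos(toy_pairs)
```

In practice the candidate grid would be much denser, and the minimum could be refined by interpolating the metric curve; both are design choices not specified in the abstract.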

Bridging Past and Future: End-to-End Autonomous Driving with Historical Prediction and Planning

Bozhou Zhang,Nan Song,Xin Jin,Li Zhang

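Task: Proposes BridgeAD, a new end-to-end autonomous driving framework that uses multi-step queries to differentiate the queries for each future time step, so that historical information is used effectively for perception and motion planning.

Motivation: Existing methods either omit historical information in motion planning or fail to align with its multi-step nature, so they cannot effectively predict or plan multiple future time steps.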

Details

Method: Proposes the BridgeAD framework, which reformulates motion and planning queries as multi-step queries to differentiate the queries for each future time step, and combines historical queries with perception and motion planning. Result: Extensive experiments on the nuScenes dataset show that BridgeAD achieves state-of-the-art performance in both open-loop and closed-loop settings. Conclusion: By aggregating historical information at every time step, BridgeAD improves the overall coherence and accuracy of the end-to-end autonomous driving pipeline. Abstract: End-to-end autonomous driving unifies tasks in a differentiable framework, enabling planning-oriented optimization and attracting growing attention. Current methods aggregate historical information either through dense historical bird's-eye-view (BEV) features or by querying a sparse memory bank, following paradigms inherited from detection. However, we argue that these paradigms either omit historical information in motion planning or fail to align with its multi-step nature, which requires predicting or planning multiple future time steps. In line with the philosophy that the future is a continuation of the past, we propose BridgeAD, which reformulates motion and planning queries as multi-step queries to differentiate the queries for each future time step. This design enables the effective use of historical prediction and planning by applying them to the appropriate parts of the end-to-end system based on the time steps, which improves both perception and motion planning. Specifically, historical queries for the current frame are combined with perception, while queries for future frames are integrated with motion planning. In this way, we bridge the gap between past and future by aggregating historical insights at every time step, enhancing the overall coherence and accuracy of the end-to-end autonomous driving pipeline. Extensive experiments on the nuScenes dataset in both open-loop and closed-loop settings demonstrate that BridgeAD achieves state-of-the-art performance.

Yongqi Li,Lu Yang,Jian Wang,Runyang You,Wenjie Li,Liqiang Nie

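Task: Construct a preference dataset for harmless multimodal assistants and propose the Blind Preference Optimization (BPO) approach to improve the safety of Multimodal Large Language Models (MLLMs).

Motivation: Given the extensive applications of MLLMs in multimodal understanding, reasoning, and interaction, the associated safety issues have become increasingly critical.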

Details

Method: Constructs the MMSafe-PO preference dataset and proposes the Blind Preference Optimization (BPO) approach. Result: BPO improves the safety rate of the base MLLM by 45.0% and greatly reduces the unsafe rate on other safety benchmarks (14.5% on MM-SafetyBench and 82.9% on HarmEval). Conclusion: The MMSafe-PO dataset and the BPO approach are effective and robust for improving MLLM safety. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. Given the extensive applications of MLLMs, the associated safety issues have become increasingly critical. Due to the effectiveness of preference optimization in aligning MLLMs with human preferences, there is an urgent need for safety-related preference data for MLLMs. To address this, we construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, the conversational format, and ranked paired responses from human feedback. We also identify two insightful observations: modality co-defense and modality cheating, which illustrate that MLLMs possess a certain level of inherent defense while still presenting unique safety challenges. Based on these observations, we propose the Blind Preference Optimization (BPO) approach. Comprehensive experiments on three benchmarks show that BPO effectively enhances the safety capabilities of MLLMs. Notably, BPO significantly improves the safety rate of the base MLLM by 45.0%, outperforming the DPO approach. Additionally, applying BPO to the MMSafe-PO dataset greatly reduces the base MLLM's unsafe rate on other safety benchmarks (14.5% on MM-SafetyBench and 82.9% on HarmEval), demonstrating the effectiveness and robustness of both the dataset and the approach. We release code and data at https://lu-yang666.github.io/MMsafe-PO-Web/.

Mapping Urban Villages in China: Progress and Challenges

Rui Cao,Wei Tu,Dongsheng Chen,Wenyu Zhang

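Task: Assess current progress in urban village mapping and identify challenges and future directions.

Motivation: With the shift toward high-quality urbanization, urban villages have become a prominent social problem in China, yet geospatial data on them is lacking, which makes urban village mapping a priority.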

Details

Method: Through a comprehensive literature review, summarizes the study areas, data sources, and approaches used for urban village mapping in China. Result: Current studies cover only limited study areas and periods, and the scalability, transferability, and interpretability of identification approaches remain insufficiently investigated because of concept fuzziness, spatial heterogeneity, and data availability. Conclusion: Future research can complement and advance current work toward large-area mapping across the whole nation. Abstract: The shift toward high-quality urbanization has brought increased attention to the issue of "urban villages", which has become a prominent social problem in China. However, there is a lack of available geospatial data on urban villages, making it crucial to prioritize urban village mapping. In order to assess the current progress in urban village mapping and identify challenges and future directions, we have conducted a comprehensive review, which to the best of our knowledge is the first of its kind in this field. Our review begins by providing a clear context for urban villages and elaborating the method for literature review, then summarizes the study areas, data sources, and approaches used for urban village mapping in China. We also address the challenges and future directions for further research. Through thorough investigation, we find that current studies only cover very limited study areas and periods and lack sufficient investigation into the scalability, transferability, and interpretability of identification approaches due to the challenges in concept fuzziness and variances, spatial heterogeneity and variances of urban villages, and data availability. Future research can complement and further the current research in the following potential directions in order to achieve large-area mapping across the whole nation...

Yifei Dong,Fengyi Wu,Qi He,Heng Li,Minghan Li,Zebang Cheng,Yuxuan Zhou,Jingdong Sun,Qi Dai,Zhi-Qi Cheng,Alexander G Hauptmann

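Task: Proposes a unified Human-Aware VLN (HA-VLN) benchmark that merges the discrete and continuous navigation paradigms under explicit social-awareness constraints.

Motivation: Existing Vision-and-Language Navigation (VLN) systems usually focus on either the discrete (panoramic) or the continuous (free-motion) paradigm alone, overlooking the complexities of human-populated, dynamic environments.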

Details

Method: 1. A standardized task definition that balances discrete-continuous navigation with personal-space requirements; 2. An enhanced human motion dataset (HAPS 2.0) and upgraded simulators capturing realistic multi-human interactions, outdoor contexts, and refined motion-language alignment; 3. Extensive benchmarking on 16,844 human-centric instructions; 4. Real-world robot tests in crowded indoor spaces; 5. A public leaderboard supporting transparent comparisons across discrete and continuous tasks. Result: Empirical results show improved navigation success and fewer collisions when social context is integrated. Conclusion: By releasing all datasets, simulators, agent code, and evaluation tools, the work aims to advance safer, more capable, and more socially responsible VLN research. Abstract: Vision-and-Language Navigation (VLN) systems often focus on either discrete (panoramic) or continuous (free-motion) paradigms alone, overlooking the complexities of human-populated, dynamic environments. We introduce a unified Human-Aware VLN (HA-VLN) benchmark that merges these paradigms under explicit social-awareness constraints. Our contributions include: 1. A standardized task definition that balances discrete-continuous navigation with personal-space requirements; 2. An enhanced human motion dataset (HAPS 2.0) and upgraded simulators capturing realistic multi-human interactions, outdoor contexts, and refined motion-language alignment; 3. Extensive benchmarking on 16,844 human-centric instructions, revealing how multi-human dynamics and partial observability pose substantial challenges for leading VLN agents; 4. Real-world robot tests validating sim-to-real transfer in crowded indoor spaces; and 5. A public leaderboard supporting transparent comparisons across discrete and continuous tasks. Empirical results show improved navigation success and fewer collisions when social context is integrated, underscoring the need for human-centric design. By releasing all datasets, simulators, agent code, and evaluation tools, we aim to advance safer, more capable, and socially responsible VLN research.

RoMedFormer: A Rotary-Embedding Transformer Foundation Model for 3D Genito-Pelvic Structure Segmentation in MRI and CT

Yuheng Li,Mingzhe Hu,Richard L. J. Qiu,Maria Thor,Andre Williams,Deborah Marshall,Xiaofeng Yang

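Task: Proposes RoMedFormer, a rotary-embedding transformer foundation model for 3D female genito-pelvic structure segmentation in MRI and CT.

Motivation: Existing segmentation models struggle to generalize across imaging modalities and anatomical variations.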

Details

Method: RoMedFormer uses self-supervised learning and rotary positional embeddings to enhance spatial feature representation and capture long-range dependencies in 3D medical data. Result: Experiments show that RoMedFormer performs strongly in segmenting genito-pelvic organs. Conclusion: The findings highlight the potential of transformer-based architectures in medical image segmentation and pave the way for more transferable segmentation frameworks. Abstract: Deep learning-based segmentation of genito-pelvic structures in MRI and CT is crucial for applications such as radiation therapy, surgical planning, and disease diagnosis. However, existing segmentation models often struggle with generalizability across imaging modalities and anatomical variations. In this work, we propose RoMedFormer, a rotary-embedding transformer-based foundation model designed for 3D female genito-pelvic structure segmentation in both MRI and CT. RoMedFormer leverages self-supervised learning and rotary positional embeddings to enhance spatial feature representation and capture long-range dependencies in 3D medical data. We pre-train our model using a diverse dataset of 3D MRI and CT scans and fine-tune it for downstream segmentation tasks. Experimental results demonstrate that RoMedFormer achieves superior performance in segmenting genito-pelvic organs. Our findings highlight the potential of transformer-based architectures in medical image segmentation and pave the way for more transferable segmentation frameworks.

ADAPT: An Autonomous Forklift for Construction Site Operation

Johannes Huemer,Markus Murschitz,Matthias Schörghuber,Lukas Reisinger,Thomas Kadiofsky,Christoph Weidinger,Mario Niedermeyer,Benedikt Widy,Marcel Zeilinger,Csaba Beleznai,Tobias Glück,Andreas Kugi,Patrik Zips

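Task: Develop and evaluate ADAPT, a fully autonomous off-road forklift for construction environments.

Motivation: Material logistics in the construction industry suffers from inefficiencies, delays, and safety risks, as well as labor shortages.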

Details

Method: Integrates AI-driven perception techniques with traditional approaches for decision making, planning, and control to handle the challenges of complex environments. Result: Extensive field testing shows that ADAPT's long-term performance approaches that of a human operator. Conclusion: Autonomous outdoor forklifts can operate near human-level performance, offering a viable path toward safer and more efficient construction logistics. Abstract: Efficient material logistics play a critical role in controlling costs and schedules in the construction industry. However, manual material handling remains prone to inefficiencies, delays, and safety risks. Autonomous forklifts offer a promising solution to streamline on-site logistics, reducing reliance on human operators and mitigating labor shortages. This paper presents the development and evaluation of the Autonomous Dynamic All-terrain Pallet Transporter (ADAPT), a fully autonomous off-road forklift designed for construction environments. Unlike structured warehouse settings, construction sites pose significant challenges, including dynamic obstacles, unstructured terrain, and varying weather conditions. To address these challenges, our system integrates AI-driven perception techniques with traditional approaches for decision making, planning, and control, enabling reliable operation in complex environments. We validate the system through extensive real-world testing, comparing its long-term performance against an experienced human operator across various weather conditions. We also provide a comprehensive analysis of challenges and key lessons learned, contributing to the advancement of autonomous heavy machinery. Our findings demonstrate that autonomous outdoor forklifts can operate near human-level performance, offering a viable path toward safer and more efficient construction logistics.

Yali Bi,Enyu Che,Yinan Chen,Yuanpeng He,Jingwei Qu

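Task: Proposes a Multi-Prototype-based Embedding Refinement method for semi-supervised medical image segmentation.

Motivation: Segmentation accuracy in medical images relies on distinguishing voxel differences, and traditional linear classifiers struggle to capture these fine distinctions.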

Details

Method: Designs a multi-prototype-based classification strategy that explores intra-class variations by clustering voxels, and introduces a consistency constraint to alleviate the limitation of linear classifiers. Result: On two popular benchmarks, the method outperforms existing methods. Conclusion: The proposed method performs well in semi-supervised medical image segmentation, capturing intra-class variations effectively and improving segmentation accuracy. Abstract: Medical image segmentation aims to identify anatomical structures at the voxel level. Segmentation accuracy relies on distinguishing voxel differences. Compared to advancements achieved in studies of the inter-class variance, the intra-class variance receives less attention. Moreover, traditional linear classifiers, limited by a single learnable weight per class, struggle to capture this finer distinction. To address the above challenges, we propose a Multi-Prototype-based Embedding Refinement method for semi-supervised medical image segmentation. Specifically, we design a multi-prototype-based classification strategy, rethinking the segmentation from the perspective of structural relationships between voxel embeddings. The intra-class variations are explored by clustering voxels along the distribution of multiple prototypes in each class. Next, we introduce a consistency constraint to alleviate the limitation of linear classifiers. This constraint integrates different classification granularities from a linear classifier and the proposed prototype-based classifier. In the thorough evaluation on two popular benchmarks, our method achieves superior performance compared with state-of-the-art methods. Code is available at https://github.com/Briley-byl123/MPER.
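
To make the contrast with a single-weight linear classifier concrete, here is a minimal sketch of prototype-based classification with several prototypes per class. The abstract does not give the similarity measure or the assignment rule, so cosine similarity and a max over a class's prototypes are assumptions, and `cosine` and `classify` are hypothetical names rather than functions from the MPER repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(embedding, prototypes):
    """Assign the class whose closest prototype is most similar.

    prototypes: dict mapping a class label to a list of prototype
    vectors, so one class can cover several modes (intra-class variation).
    """
    best_label, best_score = None, -float("inf")
    for label, class_prototypes in prototypes.items():
        # Assumed rule: a class scores as well as its best-matching prototype.
        score = max(cosine(embedding, p) for p in class_prototypes)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy example: "organ" has two distinct prototypes, "background" has one.
toy_prototypes = {
    "organ": [[1.0, 0.0], [0.0, 1.0]],
    "background": [[-1.0, -1.0]],
}
label = classify([0.1, 0.9], toy_prototypes)
```

With one prototype per class this reduces to a nearest-prototype classifier; the extra prototypes are what let a class cover several distinct voxel appearances.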

Retrospective: A CORDIC Based Configurable Activation Function for NN Applications

Omkar Kokane,Gopal Raut,Salim Ullah,Mukul Lokhande,Adam Teman,Akash Kumar,Santosh Kumar Vishvakarma

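Task: Design a CORDIC-based configuration for activation functions to accelerate ASIC hardware design for resource-constrained systems.

Motivation: Provide functional reconfigurability to meet the activation-function needs of AI applications.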

Details

Method: Uses the Shift-and-Add CORDIC technique to design dynamically configurable and precision-adjustable activation function cores. Result: The design is optimized for MAC, Sigmoid, and Tanh functionalities and incorporated into ReLU AFs, forming the NEURIC compute unit, which achieves a quality of results (QoR) of 98.5%. Conclusion: NEURIC becomes a fundamental component of a resource-efficient vector engine for AI accelerators targeting DNNs, RNNs/LSTMs, and Transformers. Abstract: A CORDIC-based configuration for the design of Activation Functions (AF) was previously suggested to accelerate ASIC hardware design for resource-constrained systems by providing functional reconfigurability. Since its introduction, this new approach for neural network acceleration has gained widespread popularity, influencing numerous designs for activation functions in both academic and commercial AI processors. In this retrospective analysis, we explore the foundational aspects of this initiative, summarize key developments over recent years, and introduce the DA-VINCI AF tailored for the evolving needs of AI applications. This new generation of dynamically configurable and precision-adjustable activation function cores promises greater adaptability for a range of activation functions in AI workloads, including Swish, SoftMax, SeLU, and GeLU, utilizing the Shift-and-Add CORDIC technique. The previously presented design has been optimized for MAC, Sigmoid, and Tanh functionalities and incorporated into ReLU AFs, culminating in an accumulative NEURIC compute unit. These enhancements position NEURIC as a fundamental component in the resource-efficient vector engine for the realization of AI accelerators that focus on DNNs, RNNs/LSTMs, and Transformers, achieving a quality of results (QoR) of 98.5%.
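
For readers unfamiliar with the Shift-and-Add CORDIC technique mentioned above, the following is a minimal sketch of textbook hyperbolic CORDIC in rotation mode, used to approximate Tanh and Sigmoid. It is a floating-point simulation, not the DA-VINCI or NEURIC design: real hardware would use fixed-point registers, bit shifts in place of the multiplications by `2.0 ** -i`, and a small lookup table for the `atanh` constants. The function names are hypothetical.

```python
import math


def _iteration_indices(n):
    """Shift indices 1..n, repeating 4, 13, 40, ... as hyperbolic CORDIC
    requires for convergence."""
    indices = []
    repeat = 4
    for i in range(1, n + 1):
        indices.append(i)
        if i == repeat:
            indices.append(i)
            repeat = 3 * repeat + 1
    return indices


def cordic_tanh(theta, n=16):
    """Approximate tanh(theta) for |theta| up to roughly 1.1."""
    x, y, z = 1.0, 0.0, theta
    for i in _iteration_indices(n):
        shift = 2.0 ** -i  # a right shift by i bits in hardware
        d = 1.0 if z >= 0 else -1.0
        x, y = x + d * y * shift, y + d * x * shift
        z -= d * math.atanh(shift)  # table constant in hardware
    # x and y carry the same CORDIC gain, so it cancels in the ratio.
    return y / x


def cordic_sigmoid(v, n=16):
    """Sigmoid via the identity sigmoid(v) = 0.5 * (1 + tanh(v / 2))."""
    return 0.5 * (1.0 + cordic_tanh(0.5 * v, n))
```

Inputs outside the convergence range would need range reduction first, and the final division is itself typically done with a further CORDIC pass or a divider; neither is shown here.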

Advancing Medical Representation Learning Through High-Quality Data

Negin Baghbanzadeh,Adibvafa Fallahpour,Yasaman Parhizkar,Franklin Ogidi,Shuvendu Roy,Sajad Ashkezari,Vahid Reza Khazaie,Michael Colacci,Ali Etemad,Arash Afkanpour,Elham Dolatabadi

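Task: Explore the impact of high-quality medical vision-language datasets on model performance.

Motivation: Although medical vision-language datasets keep growing in scale, the impact of dataset quality on model performance remains under-explored.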

Details

Method: Introduces Open-PMC, a high-quality medical dataset from PubMed Central containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Result: Experiments show that dataset quality, not just size, significantly improves model performance. Conclusion: Data curation quality plays a crucial role in advancing multimodal medical AI. Abstract: Despite the growing scale of medical Vision-Language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.

Weakly Supervised Spatial Implicit Neural Representation Learning for 3D MRI-Ultrasound Deformable Image Registration in HDR Prostate Brachytherapy

Jing Wang,Ruirui Liu,Yu Lei,Michael J. Baine,Tian Liu,Yang Lei

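Task: Develop a weakly supervised spatial implicit neural representation (SINR) method for 3D MRI-ultrasound (US) deformable registration.

Motivation: Accurate 3D MRI-ultrasound (US) deformable registration is critical for real-time guidance in high-dose-rate (HDR) prostate brachytherapy.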

Details

Method: Uses sparse surface supervision from MRI/US segmentations instead of dense intensity matching. SINR models deformations as continuous spatial functions, with patient-specific surface priors guiding a stationary velocity field for biologically plausible deformations. Result: On the public dataset, prostate DSC was 0.93 ± 0.05, MSD 0.87 ± 0.10 mm, and HD95 1.58 ± 0.37 mm. On the institutional dataset, prostate CTV reached DSC 0.88 ± 0.09, MSD 1.21 ± 0.38 mm, and HD95 2.09 ± 1.48 mm. Bladder and rectum performance was lower because of ultrasound's limited field of view. Visual assessment confirmed accurate registration with minimal discrepancies. Conclusion: The study introduces a novel weakly supervised SINR-based approach for 3D MRI-US deformable registration. By leveraging sparse surface supervision and spatial priors, it achieves accurate, robust, and computationally efficient registration, enhancing real-time image guidance in HDR prostate brachytherapy and improving treatment precision. Abstract: Purpose: Accurate 3D MRI-ultrasound (US) deformable registration is critical for real-time guidance in high-dose-rate (HDR) prostate brachytherapy. We present a weakly supervised spatial implicit neural representation (SINR) method to address modality differences and pelvic anatomy challenges. Methods: The framework uses sparse surface supervision from MRI/US segmentations instead of dense intensity matching. SINR models deformations as continuous spatial functions, with patient-specific surface priors guiding a stationary velocity field for biologically plausible deformations. Validation included 20 public Prostate-MRI-US-Biopsy cases and 10 institutional HDR cases, evaluated via Dice similarity coefficient (DSC), mean surface distance (MSD), and 95% Hausdorff distance (HD95). Results: The proposed method achieved robust registration. For the public dataset, prostate DSC was $0.93 \pm 0.05$, MSD $0.87 \pm 0.10$ mm, and HD95 $1.58 \pm 0.37$ mm. For the institutional dataset, prostate CTV achieved DSC $0.88 \pm 0.09$, MSD $1.21 \pm 0.38$ mm, and HD95 $2.09 \pm 1.48$ mm. Bladder and rectum performance was lower due to ultrasound's limited field of view. Visual assessments confirmed accurate alignment with minimal discrepancies. Conclusion: This study introduces a novel weakly supervised SINR-based approach for 3D MRI-US deformable registration. By leveraging sparse surface supervision and spatial priors, it achieves accurate, robust, and computationally efficient registration, enhancing real-time image guidance in HDR prostate brachytherapy and improving treatment precision.
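
The Dice similarity coefficient (DSC) reported above follows the standard definition DSC = 2|A ∩ B| / (|A| + |B|) for two binary masks A and B. A minimal sketch on flattened masks is given below; the function name and the convention of returning 1.0 for two empty masks are assumptions made for illustration, not details taken from the paper.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient for two binary masks.

    mask_a, mask_b: flat sequences of 0/1 (or False/True) voxel labels
    of equal length, e.g. a flattened 3D segmentation.
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_sum = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if size_sum == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2.0 * intersection / size_sum


# Two 4-voxel masks that share one foreground voxel.
example_dsc = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```

DSC measures volume overlap only, which is why the abstract pairs it with the surface-distance metrics MSD and HD95.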

Optimized 3D Gaussian Splatting using Coarse-to-Fine Image Frequency Modulation

Umar Farooq,Jean-Yves Guillemaut,Adrian Hilton,Marco Volino

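Task: Proposes Opti3DGS, a frequency-modulated coarse-to-fine optimization framework that aims to reduce the number of Gaussian primitives used to represent a scene, thereby lowering memory and storage demands.

Motivation: 3D Gaussian Splatting (3DGS) has revolutionized high-quality scene reconstruction with real-time rendering, but its high GPU memory and disk storage requirements limit practical use on consumer-grade devices.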

Details

Method: Opti3DGS uses image frequency modulation, first enforcing a coarse scene representation and then progressively refining it by modulating frequency details in the training images. Result: On the baseline 3DGS, Opti3DGS reduces the number of Gaussian primitives by 62% on average, training GPU memory requirements by 40%, and optimization time by 20%, without sacrificing visual quality. Conclusion: Opti3DGS integrates seamlessly with many 3DGS-based techniques, consistently reducing the number of Gaussian primitives while maintaining or even improving visual quality, and it naturally produces a level-of-detail scene representation. Abstract: The field of Novel View Synthesis has been revolutionized by 3D Gaussian Splatting (3DGS), which enables high-quality scene reconstruction that can be rendered in real-time. 3DGS-based techniques typically suffer from high GPU memory and disk storage requirements, which limit their practical application on consumer-grade devices. We propose Opti3DGS, a novel frequency-modulated coarse-to-fine optimization framework that aims to minimize the number of Gaussian primitives used to represent a scene, thus reducing memory and storage demands. Opti3DGS leverages image frequency modulation, initially enforcing a coarse scene representation and progressively refining it by modulating frequency details in the training images. On the baseline 3DGS, we demonstrate an average reduction of 62% in Gaussians, a 40% reduction in the training GPU memory requirements and a 20% reduction in optimization time without sacrificing the visual quality. Furthermore, we show that our method integrates seamlessly with many 3DGS-based techniques, consistently reducing the number of Gaussian primitives while maintaining, and often improving, visual quality. Additionally, Opti3DGS inherently produces a level-of-detail scene representation at no extra cost, a natural byproduct of the optimization pipeline. Results and code will be made publicly available.

Lux Post Facto: Learning Portrait Performance Relighting with Conditional Video Diffusion and a Hybrid Dataset

Yiqun Mei,Mingming He,Li Ma,Julien Philip,Wenqi Xian,David M George,Xueming Yu,Gabriel Dedic,Ahmet Levent Taşel,Ning Yu,Vishal M. Patel,Paul Debevec

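Task: Proposes Lux Post Facto, a new video portrait relighting method that produces lighting effects that are both photorealistic and temporally consistent.

Motivation: Video portrait relighting remains challenging because results must be both photorealistic and temporally stable, which typically requires a strong model design and intensive training on a high-quality paired video dataset.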

Details

Method: Designs a new conditional video diffusion model built upon a state-of-the-art pre-trained video diffusion model, with a new lighting injection mechanism for precise control. A hybrid dataset (static expression OLAT data and in-the-wild portrait performance videos) is used to jointly learn relighting and temporal modeling. Result: Experiments show that the model achieves state-of-the-art results in both photorealism and temporal consistency. Conclusion: Lux Post Facto performs well on video portrait relighting and avoids the need to acquire paired video data under different lighting conditions. Abstract: Video portrait relighting remains challenging because the results need to be both photorealistic and temporally stable. This typically requires a strong model design that can capture complex facial reflections as well as intensive training on a high-quality paired video dataset, such as dynamic one-light-at-a-time (OLAT). In this work, we introduce Lux Post Facto, a novel portrait video relighting method that produces both photorealistic and temporally consistent lighting effects. From the model side, we design a new conditional video diffusion model built upon a state-of-the-art pre-trained video diffusion model, alongside a new lighting injection mechanism to enable precise control. This way we leverage strong spatial and temporal generative capability to generate plausible solutions to the ill-posed relighting problem. Our technique uses a hybrid dataset consisting of static expression OLAT data and in-the-wild portrait performance videos to jointly learn relighting and temporal modeling. This avoids the need to acquire paired video data in different lighting conditions. Our extensive experiments show that our model produces state-of-the-art results both in terms of photorealism and temporal consistency.