cs.CL

[1] FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition

Jonas Golde,Patrick Haller,Alan Akbik

Main category: cs.CL

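TL;DR: This paper presents FiNERweb, a dataset-creation pipeline for multilingual named entity recognition (NER). It uses regression models and multilingual large language models to generate high-quality synthetic annotations across 91 languages and 25 scripts, achieves zero-shot transfer performance comparable to or better than strong baselines with less data, and releases the dataset with target-language labels together with related artifacts.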

Details

Motivation: In existing multilingual NER research, synthetic data generated by large language models has mostly been a by-product of other experiments and lacks systematic, reusable form. In addition, current models lose performance when target-language labels are used, so high-quality, scalable multilingual NER datasets are needed.

Method: The FiNERweb pipeline builds on FineWeb-Edu. Regression models are first trained to identify passages suitable for NER, and multilingual LLMs then annotate those passages. The result is a large-scale dataset covering 91 languages and 25 scripts, released with English labels and with label sets translated into each target language.

Result: The regression model exceeds 84 F1 on the filtering task. In zero-shot transfer on English, Thai, and Swahili, models trained on FiNERweb match or outperform strong baselines trained on 19x more data. LLM-based evaluation gives annotation faithfulness a score of 3.99/5 and completeness 4.05/5. Model performance drops by 0.02 to 0.09 F1 when target-language labels are used.

Conclusion: FiNERweb uses a scalable teacher-student paradigm to generate high-quality multilingual NER data, delivers strong zero-shot transfer with little data, and has reliable annotation quality. Releasing target-language labels helps reduce evaluation bias and supports more effective multilingual NER research.

Abstract: Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as systematic, reusable resources. We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs, resulting in about 225k passages with 235k distinct entity labels. Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero-shot transfer settings on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines. In addition, we assess annotation quality using LLM-as-a-judge and observe consistently high scores for both faithfulness (3.99 out of 5) and completeness (4.05 out of 5), indicating reliable and informative annotations. Further, we release the dataset with both English labels and translated label sets in the respective target languages because we observe that the performance of current state-of-the-art models drops by 0.02 to 0.09 F1 when evaluated using target language labels instead of English ones.
We release FiNERweb together with all accompanying artifacts to the research community in order to facilitate more effective student-teacher training for multilingual named entity recognition.

[2] Olmo 3

Team Olmo: Allyson Ettinger,Amanda Bertsch,Bailey Kuehl,David Graham,David Heineman,Dirk Groeneveld,Faeze Brahman,Finbarr Timbers,Hamish Ivison,Jacob Morrison,Jake Poznanski,Kyle Lo,Luca Soldaini,Matt Jordan,Mayee Chen,Michael Noukhovitch,Nathan Lambert,Pete Walsh,Pradeep Dasigi,Robert Berry,Saumya Malik,Saurabh Shah,Scott Geng,Shane Arora,Shashank Gupta,Taira Anderson,Teng Xiao,Tyler Murray,Tyler Romero,Victoria Graf,Akari Asai,Akshita Bhagia,Alexander Wettig,Alisa Liu,Aman Rangapur,Chloe Anastasiades,Costa Huang,Dustin Schwenk,Harsh Trivedi,Ian Magnusson,Jaron Lochner,Jiacheng Liu,Lester James V. Miranda,Maarten Sap,Malia Morgan,Michael Schmitz,Michal Guerquin,Michael Wilson,Regan Huff,Ronan Le Bras,Rui Xin,Rulin Shao,Sam Skjonsberg,Shannon Zejiang Shen,Shuyue Stella Li,Tucker Wilde,Valentina Pyatkin,Will Merrill,Yapei Chang,Yuling Gu,Zhiyuan Zeng,Ashish Sabharwal,Luke Zettlemoyer,Pang Wei Koh,Ali Farhadi,Noah A. Smith,Hannaneh Hajishirzi

Main category: cs.CL

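TL;DR: Olmo 3 is a family of fully open language models at the 7B and 32B parameter scales, targeting long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall.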

Details

Motivation: To advance fully open language models by providing a transparent, reproducible model-building process that supports a range of task settings.

Method: The release covers the entire model flow, that is, the full lifecycle, including every stage, checkpoint, data point, and dependency.

Result: The flagship model, Olmo 3 Think 32B, is the strongest fully open thinking model released to date.

Conclusion: The Olmo 3 release gives the open-source community a high-performance, highly transparent model resource and supports open scientific research.

Abstract: We introduce Olmo 3, a family of state-of-the-art, fully-open language models at the 7B and 32B parameter scales. Olmo 3 model construction targets long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. This release includes the entire model flow, i.e., the full lifecycle of the family of models, including every stage, checkpoint, data point, and dependency used to build it. Our flagship model, Olmo 3 Think 32B, is the strongest fully-open thinking model released to date.

[3] Structure-Aware Decoding Mechanisms for Complex Entity Extraction with Large-Scale Language Models

Zhimin Qiu,Di Wu,Feng Liu,Chenrui Hu,Yuxiao Wang

Main category: cs.CL

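TL;DR: This paper proposes a structure-aware decoding method based on large language models to address the difficulty of maintaining both semantic integrity and structural consistency in nested and overlapping entity extraction.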

Details

Motivation: Traditional methods struggle to maintain semantic integrity and structural consistency at the same time when handling nested and overlapping entities, which limits information extraction performance in complex scenarios.

Method: The method introduces a candidate span generation mechanism and structured attention modeling. It uses a pretrained language model to obtain contextual representations, captures entity boundaries through multi-granular feature combinations, applies hierarchical structural constraints during decoding, and jointly optimizes a classification loss and a structural consistency loss.

Result: On the ACE 2005 dataset, the method significantly improves Accuracy, Precision, Recall, and F1-Score, with stronger boundary localization and structural modeling in nested and overlapping entity recognition in particular.

Conclusion: Structure-aware decoding effectively improves performance on complex semantic extraction tasks and offers a new perspective and methodological foundation for language models with hierarchical understanding.

Abstract: This paper proposes a structure-aware decoding method based on large language models to address the difficulty of traditional approaches in maintaining both semantic integrity and structural consistency in nested and overlapping entity extraction tasks. The method introduces a candidate span generation mechanism and structured attention modeling to achieve unified modeling of entity boundaries, hierarchical relationships, and cross-dependencies. The model first uses a pretrained language model to obtain context-aware semantic representations, then captures multi-granular entity span features through candidate representation combinations, and introduces hierarchical structural constraints during decoding to ensure consistency between semantics and structure. To enhance stability in complex scenarios, the model jointly optimizes classification loss and structural consistency loss, maintaining high recognition accuracy under multi-entity co-occurrence and long-sentence dependency conditions. Experiments conducted on the ACE 2005 dataset demonstrate significant improvements in Accuracy, Precision, Recall, and F1-Score, particularly in nested and overlapping entity recognition, where the model shows stronger boundary localization and structural modeling capability. This study verifies the effectiveness of structure-aware decoding in complex semantic extraction tasks, provides a new perspective for developing language models with hierarchical understanding, and establishes a methodological foundation for high-precision information extraction.

[4] What Affects the Effective Depth of Large Language Models?

Yi Hu,Cai Zhou,Muhan Zhang

Main category: cs.CL

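TL;DR: Current large language models do not put additional layers to meaningful use as depth increases: the effective depth ratio stays stable, reasoning gains come mainly from longer context rather than deeper computation, and models do not dynamically use more layers on harder tasks.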

Details

Motivation: To study how the effective depth of large language models changes with model scale, training type, and task difficulty, and to expose how poorly current models use their depth.

Method: The authors analyze the behavior of the Qwen-2.5 family (1.5B-32B), compare the effective depth of base models with that of the corresponding long chain-of-thought (long-CoT) models, and evaluate layer usage across tasks of varying difficulty.

Result: The number of effective layers grows with model size, but the effective depth ratio remains stable. Long-CoT models show no increase in effective depth, suggesting that improved reasoning stems from longer context rather than deeper computation. Models do not dynamically use more layers on harder tasks.

Conclusion: Current LLMs underuse network depth across scales, training paradigms, and task difficulties, which points to future work on raising layer utilization, model pruning, and early exiting.

Abstract: The scaling of large language models (LLMs) emphasizes increasing depth, yet performance gains diminish with added layers. Prior work introduces the concept of "effective depth", arguing that deeper models fail to fully utilize their layers for meaningful computation. Building on this, we systematically study how effective depth varies with model scale, training type, and task difficulty. First, we analyze the model behavior of the Qwen-2.5 family (1.5B-32B) and find that while the number of effective layers grows with model size, the effective depth ratio remains stable. In addition, comparisons between base and corresponding long-CoT models show no increase in effective depth, suggesting that improved reasoning stems from longer context rather than deeper per-token computation. Furthermore, evaluations across tasks of varying difficulty indicate that models do not dynamically use more layers for harder problems. Our results suggest that current LLMs underuse available depth across scales, training paradigms and tasks of varying difficulties, pointing to research opportunities on increasing the layer utilization rate of LLMs, model pruning, and early exiting. Our code is released at https://github.com/AheadOFpotato/what_affects_effective_depth.

[5] Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Yonggan Fu,Lexington Whalen,Zhifan Ye,Xin Dong,Shizhe Diao,Jingyu Liu,Chengyue Wu,Hao Zhang,Enze Xie,Song Han,Maksim Khadkevich,Jan Kautz,Yingyan Celine Lin,Pavlo Molchanov

Main category: cs.CL

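TL;DR: This paper studies how to efficiently convert pretrained autoregressive (AR) language models into diffusion language models (dLMs). It proposes a new AR-to-dLM conversion approach that uses a block-wise attention pattern and a position-dependent masking strategy to substantially improve generation efficiency while preserving accuracy.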

Details

Motivation: Diffusion language models trained from scratch learn less efficiently than autoregressive models, and existing methods for converting AR models into dLMs are limited in their attention patterns and training objectives, so a more effective conversion method is needed.

Method: 1) A block-wise attention pattern that is bidirectional within each block and causal across blocks, which better preserves the pretrained AR weight distributions; 2) a position-dependent masking strategy that assigns higher masking probabilities to later tokens during training to mimic the left-to-right pattern seen at test time and narrow the training-test gap; 3) a continuous pretraining scheme for a smooth conversion.

Result: The proposed Efficient-DLM family outperforms existing state-of-the-art models in both accuracy and throughput. For example, Efficient-DLM 8B achieves 5.4%/2.7% higher accuracy and 4.5x/2.7x higher throughput than Dream 7B and Qwen3 4B, respectively.

Conclusion: With improved attention structure and masking strategy, AR-to-dLM conversion can balance generation efficiency and task accuracy, offering a practical path to efficient language models.

Abstract: Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior.
Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
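To make the two ideas in this abstract concrete, here is a minimal Python sketch, not the authors' implementation. `block_causal_mask` builds an attention mask that is bidirectional within a block and causal across blocks, as the abstract describes. `position_dependent_mask_probs` uses a linear ramp, which is purely an assumption for illustration; the abstract only says that later tokens get higher masking probabilities. The function names and the `low`/`high` parameters are hypothetical.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Same block: bidirectional. Earlier block: allowed (causal).
            # Later block: blocked.
            row.append(j // block_size <= i // block_size)
        mask.append(row)
    return mask


def position_dependent_mask_probs(seq_len: int, low: float = 0.2,
                                  high: float = 0.8) -> list[float]:
    """Masking probability per position, rising linearly from low to high.

    The linear schedule is an assumed placeholder, not the paper's schedule.
    """
    if seq_len == 1:
        return [high]
    step = (high - low) / (seq_len - 1)
    return [low + step * i for i in range(seq_len)]


if __name__ == "__main__":
    for row in block_causal_mask(seq_len=4, block_size=2):
        print("".join("1" if allowed else "0" for allowed in row))
    print(position_dependent_mask_probs(5))
```

With `seq_len=4` and `block_size=2`, positions 0 and 1 see each other but not positions 2 and 3, while positions 2 and 3 see everything. Keeping earlier blocks fixed is what makes KV caching possible in this kind of layout.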

[6] A Unified Sparse Attention via Multi-Granularity Compression

Siran Liu,Zane Cao,Yongchao He

Main category: cs.CL

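TL;DR: UniSparse is a unified sparse attention mechanism that uses composite tokens and multi-granularity compression to substantially improve computational efficiency in long-context settings while maintaining high accuracy.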

Details

Motivation: Existing sparse attention methods are limited in training cost, generality, or inference efficiency, and struggle to combine accuracy with cross-modal applicability.

Method: UniSparse introduces composite tokens that aggregate multi-granularity contextual information, and combines multi-granularity compression with block-level selection to construct sparse attention dynamically.

Result: Across multiple modalities and tasks, UniSparse outperforms existing sparse attention methods, reaches at least 99% of full-attention accuracy, and computes attention up to 2.61x faster than FlashAttention.

Conclusion: UniSparse is an efficient, hardware-friendly sparse attention mechanism suited to long-context understanding and multimodal reasoning.

Abstract: Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be directly applied as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism that introduces the notion of composite tokens, compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and block-level selection, enabling efficient and hardware-friendly execution on GPU. Across multiple modalities and tasks ranging from synthetic benchmarks to real-world applications, UniSparse consistently surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, achieving at least 99% of full-attention accuracy and up to 2.61x faster attention computation than FlashAttention.

[7] Multilingual and Continuous Backchannel Prediction: A Cross-lingual Study

Koji Inoue,Mikey Elmers,Yahui Fu,Zi Haur Pang,Taiga Mori,Divesh Lala,Keiko Ochi,Tatsuya Kawahara

Main category: cs.CL

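TL;DR: This paper presents a multilingual, continuous backchannel prediction model and uses it to study cross-linguistic timing behavior in Japanese, English, and Chinese. It finds that the languages rely on different cues for backchannel timing and demonstrates the model's potential in real-time systems.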

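Details

Motivation: To examine how backchannel timing differs across languages and to build a unified multilingual model that captures both shared and language-specific behavior, improving the naturalness and cultural adaptability of dialogue systems.

Method: A Transformer-based model operating at the frame level is trained jointly with auxiliary tasks on about 300 hours of dyadic conversations in multiple languages. The study also covers zero-shot transfer, perturbation analysis, and the effect of context length.

Result: The multilingual model matches or surpasses monolingual baselines in all three languages. Zero-shot transfer remains limited, indicating substantial cross-lingual differences. Perturbation analysis shows that Japanese relies more on short-term linguistic information, whereas English and Chinese are more sensitive to silence duration and prosodic variation. Chinese performs better with longer contexts, while Japanese is more robust to shorter ones. Multilingual training reduces overreliance on pitch in Chinese.

Conclusion: The model effectively learns what backchannel timing has in common and how it differs across languages, providing empirical evidence and a practical tool for building more natural, culturally sensitive dialogue systems.

Abstract: We present a multilingual, continuous backchannel prediction model for Japanese, English, and Chinese, and use it to investigate cross-linguistic timing behavior. The model is Transformer-based and operates at the frame level, jointly trained with auxiliary tasks on approximately 300 hours of dyadic conversations. Across all three languages, the multilingual model matches or surpasses monolingual baselines, indicating that it learns both language-universal cues and language-specific timing patterns. Zero-shot transfer with two-language training remains limited, underscoring substantive cross-lingual differences. Perturbation analyses reveal distinct cue usage: Japanese relies more on short-term linguistic information, whereas English and Chinese are more sensitive to silence duration and prosodic variation; multilingual training encourages shared yet adaptable representations and reduces overreliance on pitch in Chinese. A context-length study further shows that Japanese is relatively robust to shorter contexts, while Chinese benefits markedly from longer contexts. Finally, we integrate the trained model into real-time processing software, demonstrating CPU-only inference. Together, these findings provide a unified model and empirical evidence for how backchannel timing differs across languages, informing the design of more natural, culturally-aware spoken dialogue systems.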

[8] CogMem: A Cognitive Memory Architecture for Sustained Multi-Turn Reasoning in Large Language Models

Yiran Zhang,Jincheng Hu,Mark Dras,Usman Naseem

Main category: cs.CL

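TL;DR: CogMem is a cognitively inspired, memory-augmented LLM architecture that supports sustained iterative reasoning through a layered memory system, mitigating reasoning errors, context bloat, and declining consistency in multi-turn interactions.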

Details

Motivation: Large language models perform well at single-turn reasoning but in multi-turn interactions often show reasoning bias, task drift, hallucination, overconfidence, and memory decay. Conventional approaches repeatedly append the full conversation history, so context grows without bound, computation is expensive, and reasoning is inefficient.

Method: The CogMem architecture has three memory layers: a Long-Term Memory (LTM) that consolidates reasoning strategies across sessions; a Direct Access (DA) memory that maintains session-level notes and retrieves relevant long-term memories; and a Focus of Attention (FoA) mechanism that dynamically reconstructs a concise, task-relevant context at each turn.

Result: Experiments on TurnBench show that CogMem reduces reasoning failures, controls context growth, and improves consistency across long reasoning chains.

Conclusion: Through structured, persistent memory, CogMem improves the accuracy and coherence of LLMs in multi-turn interactions and moves them toward more reliable, human-like sustained reasoning.

Abstract: Large language models (LLMs) excel at single-turn reasoning but often lose accuracy and coherence over extended, multi-turn interactions. Recent evaluations such as TurnBench highlight recurring failure modes: reasoning bias, task drift, hallucination, overconfidence, and memory decay. Current approaches typically append full conversational histories, causing unbounded context growth, higher computational costs, and degraded reasoning efficiency. We introduce CogMem, a cognitively inspired, memory-augmented LLM architecture that supports sustained iterative reasoning through structured, persistent memory. CogMem incorporates three layers: a Long-Term Memory (LTM) that consolidates cross-session reasoning strategies; a Direct Access (DA) memory that maintains session-level notes and retrieves relevant long-term memories; and a Focus of Attention (FoA) mechanism that dynamically reconstructs concise, task-relevant context at each turn. Experiments on TurnBench show that this layered design mitigates reasoning failures, controls context growth, and improves consistency across extended reasoning chains, moving toward more reliable, human-like reasoning in LLMs.

[9] Astraea: A State-Aware Scheduling Engine for LLM-Powered Agents

Hongqiu Ni,Jiabao Zhang,Guopeng Li,Zilong Wang,Ruiqi Wu,Chi Zhang,Haisheng Tan

Main category: cs.CL

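TL;DR: Astraea is a service engine for LLM agent workflows that uses state-aware hierarchical scheduling and adaptive KV cache management to optimize the global request lifecycle, substantially reducing end-to-end latency.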

Details

Motivation: Existing inference systems such as vLLM focus on per-segment optimization and therefore cannot effectively reduce the overall Job Completion Time (JCT) of multi-stage agent workflows that mix I/O and computation.

Method: Astraea uses a state-aware, hierarchical scheduling algorithm that combines a request's historical state with future predictions, dynamically classifies requests as I/O-intensive or compute-intensive, and schedules them with an enhanced HRRN policy. It also includes an adaptive KV cache manager that handles agent state during I/O waits according to memory pressure.

Result: Experiments show that Astraea reduces average JCT by up to 25.5% compared to baseline methods and remains robust and stable under high load and across model scales.

Conclusion: By shifting the optimization granularity from local segments to the global request lifecycle, Astraea effectively improves the end-to-end performance of LLM agent workflows.

Abstract: Large Language Models (LLMs) are increasingly being deployed as intelligent agents. Their multi-stage workflows, which alternate between local computation and calls to external network services like Web APIs, introduce a mismatch between their execution pattern and the scheduling granularity of existing inference systems such as vLLM. Existing systems typically focus on per-segment optimization, which prevents them from minimizing the end-to-end latency of the complete agentic workflow, i.e., the global Job Completion Time (JCT) over the entire request lifecycle. To address this limitation, we propose Astraea, a service engine designed to shift the optimization from local segments to the global request lifecycle. Astraea employs a state-aware, hierarchical scheduling algorithm that integrates a request's historical state with future predictions. It dynamically classifies requests as I/O-intensive or compute-intensive and uses an enhanced HRRN policy to balance efficiency and fairness. Astraea also implements an adaptive KV cache manager that intelligently handles the agent state during I/O waits based on the system memory pressure. Extensive experiments show that Astraea reduces average JCT by up to 25.5% compared to baseline methods. Moreover, our approach demonstrates strong robustness and stability under high load across various model scales.
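For readers unfamiliar with HRRN (Highest Response Ratio Next), the sketch below shows the textbook policy only, in which the scheduler picks the request with the largest (waiting time + service time) / service time. The abstract does not specify Astraea's enhancements, so none are modeled here; the `Request` class, its fields, and the use of a predicted service time are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: float      # time the request entered the queue
    est_service: float  # hypothetical predicted service time, must be > 0


def response_ratio(req: Request, now: float) -> float:
    """Textbook HRRN ratio: (waiting + service) / service."""
    waiting = max(0.0, now - req.arrival)
    return (waiting + req.est_service) / req.est_service


def pick_next(queue: list[Request], now: float) -> Request:
    """Select the queued request with the highest response ratio."""
    return max(queue, key=lambda r: response_ratio(r, now))


if __name__ == "__main__":
    queue = [Request("long", arrival=0.0, est_service=10.0),
             Request("short", arrival=8.0, est_service=1.0)]
    print(pick_next(queue, now=10.0).name)
```

Short jobs get a high ratio quickly, which favors efficiency, while the ratio of a long job keeps growing as it waits, which prevents starvation. This is the efficiency and fairness balance the abstract refers to.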

[10] A Comparative Analysis of Retrieval-Augmented Generation Techniques for Bengali Standard-to-Dialect Machine Translation Using LLMs

K. M. Jubair Sami,Dipto Sumit,Ariyan Hossain,Farig Sadeque

Main category: cs.CL

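TL;DR: This paper proposes two new RAG pipelines for translating standard Bengali into its dialects. The pipeline based on standardized sentence pairs performs better across dialects and LLMs, substantially reduces word error rate, and lets smaller models outperform larger ones, offering an effective fine-tuning-free solution for low-resource dialect translation.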

Details

Motivation: Because of scarce data and linguistic variation, translating a standard language into its dialects is a significant NLP challenge, and the problem is especially prominent in Bengali.

Method: The paper proposes and compares two RAG pipelines, one based on transcripts and one based on structured dialect-standard sentence pairs, and evaluates them on six Bengali dialects with multiple large language models.

Result: The sentence-pair pipeline performs better, reducing word error rate for the Chittagong dialect from 76% to 55%, and smaller models (e.g., Llama-3.1-8B) outperform larger ones (e.g., GPT-OSS-120B).

Conclusion: A good retrieval strategy can matter more than model size. The approach provides a practical, fine-tuning-free solution for low-resource dialect translation that helps preserve linguistic diversity.

Abstract: Translating from a standard language to its regional dialects is a significant NLP challenge due to scarce data and linguistic variation, a problem prominent in the Bengali language. This paper proposes and compares two novel RAG pipelines for standard-to-dialectal Bengali translation. The first, a Transcript-Based Pipeline, uses large dialect sentence contexts from audio transcripts. The second, a more effective Standardized Sentence-Pairs Pipeline, utilizes structured local_dialect:standard_bengali sentence pairs. We evaluated both pipelines across six Bengali dialects and multiple LLMs using BLEU, ChrF, WER, and BERTScore. Our findings show that the sentence-pair pipeline consistently outperforms the transcript-based one, reducing Word Error Rate (WER) from 76% to 55% for the Chittagong dialect. Critically, this RAG approach enables smaller models (e.g., Llama-3.1-8B) to outperform much larger models (e.g., GPT-OSS-120B), demonstrating that a well-designed retrieval strategy can be more crucial than model size. This work contributes an effective, fine-tuning-free solution for low-resource dialect translation, offering a practical blueprint for preserving linguistic diversity.
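Since the headline result is stated in Word Error Rate, a short sketch of the standard WER definition may help: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. This is the usual formula, not the paper's evaluation script; whitespace tokenization and the function name are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()  # assumed: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (b -> x) and one deletion (d) over 4 reference words.
    print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Under this definition, the reported drop from 76% to 55% means roughly one in five reference words no longer needs an edit, and WER can exceed 100% when the hypothesis contains many insertions.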

[11] Ladder Up, Memory Down: Low-Cost Fine-Tuning With Side Nets

Estelle Zheng,Nathan Cerisara,Sébastien Warichet,Emmanuel Helbert,Christophe Cerisara

Main category: cs.CL

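TL;DR: This paper revisits Ladder Side Tuning (LST), a parameter-efficient fine-tuning method that adds a lightweight side network to cut memory use substantially. It cuts peak memory by 50% while matching QLoRA's performance and supports long-context (2k-token) fine-tuning of 7B models on a single 12 GB consumer GPU. The authors also propose xLadder, a depth-extended variant that deepens reasoning via cross-connections without additional memory overhead.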

Details

Motivation: Existing fine-tuning methods for large models, such as QLoRA, reduce the number of trainable parameters, but the backward pass still consumes a lot of memory, which limits their use on memory-constrained devices. A more efficient fine-tuning method is needed to get past this bottleneck.

Method: The paper revisits and improves Ladder Side Tuning (LST), which fine-tunes through a lightweight side network, and designs a depth-extended version, xLadder, which raises effective depth and reasoning ability via cross-connections while keeping the total parameter count fixed.

Result: LST performs on par with QLoRA on several downstream tasks (natural language understanding, mathematics, and LLM-critic tasks) while cutting peak memory by 50%. It can fine-tune a 7B model with 2k-token contexts on a single 12 GB GPU without gradient checkpointing. xLadder further improves complex reasoning.

Conclusion: LST is an efficient fine-tuning method that beats QLoRA when memory is constrained, combining strong performance with low memory needs. xLadder builds on it to deepen reasoning, pointing to a new direction for efficient training.

Abstract: Fine-tuning large language models (LLMs) is often limited by the memory available on commodity GPUs. Parameter-efficient fine-tuning (PEFT) methods such as QLoRA reduce the number of trainable parameters, yet still incur high memory usage induced by the backward pass in the full model. We revisit Ladder Side Tuning (LST), a rarely explored PEFT technique that adds a lightweight side network, and show that it matches QLoRA's compute scaling slope while cutting peak memory by 50%. Across different downstream benchmarks spanning natural language understanding, mathematical and LLM-critic tasks, LST has competitive performance with QLoRA's accuracy on average while being much more memory-efficient. This efficiency enables fine-tuning of 7B-parameter models on a single 12 GB consumer GPU with 2k-token contexts, requiring no gradient checkpointing, conditions under which QLoRA exhausts memory. Beyond memory efficiency, we also establish scaling laws showing that LST scales similarly to QLoRA. We exploit Ladder's architectural flexibility by introducing xLadder, a depth-extended variant that increases effective depth via cross-connections and shortens chain-of-thought (CoT) at fixed parameter count. Ladder is strong when memory is the bottleneck; xLadder builds on this by enabling deeper reasoning without additional memory overhead.

[12] Two CFG Nahuatl for automatic corpora expansion

Juan-José Guzmán-Landa,Juan-Manuel Torres-Moreno,Miguel Figueroa-Saavedra,Ligia Quintana-Torres,Graham Ranger,Martha-Lorena Avendaño-Garrido

Main category: cs.CL

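TL;DR: This paper introduces two context-free grammars (CFGs) for expanding Nawatl corpora. Generating large numbers of grammatically valid artificial sentences enlarges the corpus and improves the learning of non-contextual embeddings, which the authors validate on a sentence semantic similarity task.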

Details

Motivation: Nawatl is a π-language, a language with few digital resources, and existing corpora are not sufficient for training large language models, so the corpus has to be expanded artificially.

Method: The authors design two new context-free grammars (CFGs) for Nawatl and use them in generative mode to produce large numbers of grammatically valid artificial sentences, which expand the original corpus and are used to train and evaluate word embedding models.

Result: Embedding models trained on the expanded corpus outperform those trained on the original corpus alone on a sentence semantic similarity task, and economical embedding models often perform better than some LLMs.

Conclusion: Artificial text generated with CFGs can effectively expand corpora for low-resource languages and noticeably improve embedding models, offering a workable path for NLP in under-resourced languages.

Abstract: The aim of this article is to introduce two Context-Free Grammars (CFG) for Nawatl Corpora expansion. Nawatl is an Amerindian language (it is a National Language of Mexico) of the π-language type, i.e. a language with few digital resources. For this reason the corpora available for the learning of Large Language Models (LLMs) are virtually non-existent, posing a significant challenge. The goal is to produce a substantial number of syntactically valid artificial Nawatl sentences and thereby to expand the corpora for the purpose of learning non-contextual embeddings. For this objective, we introduce two new Nawatl CFGs and use them in generative mode. Using these grammars, it is possible to expand the Nawatl corpus significantly and subsequently to use it to learn embeddings and to evaluate their relevance in a sentence semantic similarity task. The results show an improvement compared to the results obtained using only the original corpus without artificial expansion, and also demonstrate that economic embeddings often perform better than some LLMs.
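As a rough illustration of what using a CFG "in generative mode" means, the sketch below enumerates every sentence derivable from a tiny non-recursive grammar. The grammar and its terminals (`det_a`, `noun_a`, and so on) are placeholders invented for this example and are not Nawatl; the paper's actual grammars are not reproduced in the abstract.

```python
from itertools import product

# Hypothetical toy grammar with placeholder terminals (not real Nawatl).
# Any symbol that is not a key of GRAMMAR is treated as a terminal.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["DET", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "DET": [["det_a"]],
    "N": [["noun_a"], ["noun_b"]],
    "V": [["verb_a"]],
}


def expand(symbol: str) -> list[list[str]]:
    """Return every terminal sequence derivable from symbol.

    Only safe for non-recursive grammars; a recursive grammar would need
    a depth limit or random sampling instead of full enumeration.
    """
    if symbol not in GRAMMAR:
        return [[symbol]]
    results = []
    for production in GRAMMAR[symbol]:
        # Expand each right-hand-side symbol, then combine the alternatives.
        parts = [expand(s) for s in production]
        for combo in product(*parts):
            results.append([tok for part in combo for tok in part])
    return results


if __name__ == "__main__":
    sentences = expand("S")
    print(len(sentences))  # 20
    print(" ".join(sentences[0]))
```

Even this toy grammar yields 20 sentences from six terminals and rules; a realistic lexicon multiplies the count quickly, which is what makes grammar-based corpus expansion attractive for low-resource languages.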

[13] From Context to EDUs: Faithful and Structured Context Compression via Elementary Discourse Unit Decomposition

Yiqing Zhou,Yu Lei,Shuzheng Si,Qingyan Sun,Wei Wang,Yifei Wu,Hao Wen,Gang Chen,Fanchao Qi,Maosong Sun

Main category: cs.CL

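TL;DR: This paper proposes an explicit, EDU-based context compression framework built on LingoEDU. It converts text into a structured relation tree and selects the sub-trees relevant to the query, preserving both global structure and fine-grained detail, which substantially improves long-context task performance and reduces cost.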

Details

Motivation: Existing context compression methods break local coherence, suffer from positional bias, or are incompatible with closed-source APIs when handling long text, and they struggle to combine structure preservation with efficient computation.

Method: The EDU-based Context Compressor works in two steps. LingoEDU first converts linear text into a structural relation tree of Elementary Discourse Units (EDUs) anchored to source indices. A lightweight ranking module then selects the relevant sub-trees and linearizes them as output.

Result: On the StructBench dataset, the method reaches state-of-the-art structural prediction accuracy, significantly outperforming frontier LLMs while reducing computational cost, and it performs better on a range of downstream tasks such as long-document question answering and Deep Search.

Conclusion: This structure-aware, explicit compression method balances context length, information completeness, and computational efficiency, giving LLMs an interpretable and practical way to handle long inputs.

Abstract: Managing extensive context remains a critical bottleneck for Large Language Models (LLMs), particularly in applications like long-document question answering and autonomous agents where lengthy inputs incur high computational costs and introduce noise. Existing compression techniques often disrupt local coherence through discrete token removal or rely on implicit latent encoding that suffers from positional bias and incompatibility with closed-source APIs. To address these limitations, we introduce the EDU-based Context Compressor, a novel explicit compression framework designed to preserve both global structure and fine-grained details. Our approach reformulates context compression as a structure-then-select process. First, our LingoEDU transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) which are anchored strictly to source indices to eliminate hallucination. Second, a lightweight ranking module selects query-relevant sub-trees for linearization. To rigorously evaluate structural understanding, we release StructBench, a manually annotated dataset of 248 diverse documents. Empirical results demonstrate that our method achieves state-of-the-art structural prediction accuracy and significantly outperforms frontier LLMs while reducing costs. Furthermore, our structure-aware compression substantially enhances performance across downstream tasks ranging from long-context tasks to complex Deep Search scenarios.

[14] Inflation Attitudes of Large Language Models

Nikoleta Anesti,Edward Hill,Andreas Joseph

Main category: cs.CL

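TL;DR: This study examines whether large language models (specifically GPT-3.5-turbo) can form inflation perceptions and expectations from macroeconomic price signals, comparing their output with household survey data and official statistics. Exploiting GPT's training cut-off in September 2021, it finds that the model tracks aggregate trends at short horizons and, at a disaggregated level, replicates key empirical regularities related to income, housing tenure, and social class. A Shapley value decomposition of the effect of prompt content on model output shows that GPT is sensitive to food inflation information but lacks a consistent model of consumer price inflation. The approach can be used to assess the potential of LLMs in the social sciences.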

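Details

Motivation: To find out whether large language models can form inflation perceptions and expectations from macroeconomic information as humans do, and to assess their potential value in social science research.

Method: The output of GPT-3.5-turbo is compared with household data from the Bank of England's Inflation Attitudes Survey (IAS) and with official statistics. A quasi-experimental design exploits the fact that GPT's training data ends in September 2021 to analyze its response to the subsequent UK inflation surge. A Shapley value decomposition suited to the synthetic survey setting is used to analyze how prompt content drives model output.

Result: At short horizons GPT tracks aggregate inflation expectations and official statistics. At a disaggregated level it replicates key empirical regularities of households' inflation perceptions (for example by income, housing tenure, and social class). It shows a heightened sensitivity to food inflation information similar to that of human respondents, but it lacks a consistent model of consumer price inflation.

Conclusion: Under specific conditions GPT can mimic human patterns of inflation perception, which shows its potential as an analytical tool in the social sciences, but the lack of a stable understanding of inflation limits its reliability as a standalone forecasting tool.

Abstract: This paper investigates the ability of Large Language Models (LLMs), specifically GPT-3.5-turbo (GPT), to form inflation perceptions and expectations based on macroeconomic price signals. We compare the LLM's output to household survey data and official statistics, mimicking the information set and demographic characteristics of the Bank of England's Inflation Attitudes Survey (IAS). Our quasi-experimental design exploits the timing of GPT's training cut-off in September 2021, which means it has no knowledge of the subsequent UK inflation surge. We find that GPT tracks aggregate survey projections and official statistics at short horizons. At a disaggregated level, GPT replicates key empirical regularities of households' inflation perceptions, particularly for income, housing tenure, and social class. A novel Shapley value decomposition of LLM outputs suited for the synthetic survey setting provides well-defined insights into the drivers of model outputs linked to prompt content. We find that GPT demonstrates a heightened sensitivity to food inflation information similar to that of human respondents. However, we also find that it lacks a consistent model of consumer price inflation. More generally, our approach could be used to evaluate the behaviour of LLMs for use in the social sciences, to compare different models, or to assist in survey design.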
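The general idea behind a Shapley value decomposition can be sketched in a few lines: each prompt component is credited with its average marginal effect on the output over all subsets of the other components. The code below is the standard exact Shapley formula, not the paper's adapted variant. The component names and `toy_value`, a stand-in for "model output given these prompt components", are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; value maps a frozenset of players to a number."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


def toy_value(components: frozenset) -> float:
    """Hypothetical stand-in for the model's stated inflation expectation."""
    base = {"food": 2.0, "energy": 1.0, "rent": 0.5}
    v = sum(base[c] for c in components)
    if {"food", "energy"} <= components:
        v += 1.0  # assumed interaction effect, shared equally by both
    return v


if __name__ == "__main__":
    print(shapley_values(["food", "energy", "rent"], toy_value))
```

In the toy case the attributions are 2.5 for food, 1.5 for energy, and 0.5 for rent, which sum to the full-prompt value of 4.5. Exact enumeration needs 2^n evaluations, so with an LLM as the value function it is only feasible for a handful of prompt components.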

[15] Step-Tagging: Toward controlling the generation of Language Reasoning Models through step monitoring

Yannis Belkhiter,Seshu Tirupathi,Giulio Zizzo,John D. Kelleher

Main category: cs.CL

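TL;DR: This paper proposes the Step-Tagging framework and the ReasonType taxonomy for annotating and monitoring the reasoning steps of Language Reasoning Models (LRMs) in real time, enabling effective early stopping that substantially cuts computation while maintaining accuracy.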

Details

Motivation: Existing language reasoning models reason inefficiently, for example by over-generating verification and reflection steps, which wastes compute. A lightweight way to gain more control over the reasoning process is needed.

Method: The Step-Tagging framework combines a lightweight sentence classifier with the ReasonType taxonomy of reasoning steps to tag step types in real time, and uses the count of specific step types for online monitoring, yielding an interpretable early stopping mechanism.

Result: The framework is validated on several standard benchmarks (MATH500, GSM8K, AIME, GPQA, MMLU-Pro), achieving a 20% to 50% token reduction while maintaining accuracy comparable to standard generation, with the largest gains on computation-heavy tasks.

Conclusion: Step-Tagging offers a new tool for improving the efficiency and controllability of language reasoning models, helps in studying LRM behavior, and advances efficient reasoning.

Abstract: The field of Language Reasoning Models (LRMs) has been very active over the past few years with advances in training and inference techniques enabling LRMs to reason longer, and more accurately. However, a growing body of studies shows that LRMs are still inefficient, over-generating verification and reflection steps. To address this challenge, we introduce the Step-Tagging framework, a lightweight sentence-classifier enabling real-time annotation of the type of reasoning steps that an LRM is generating. To monitor reasoning behaviors, we introduce ReasonType: a novel taxonomy of reasoning steps. Building on this framework, we demonstrate that online monitoring of the count of specific steps can produce effective interpretable early stopping criteria of LRM inferences. We evaluate the Step-Tagging framework on three open-source reasoning models across standard benchmark datasets: MATH500, GSM8K, AIME and non-mathematical tasks (GPQA and MMLU-Pro). We achieve 20 to 50% token reduction while maintaining comparable accuracy to standard generation, with largest gains observed on more computation-heavy tasks. This work offers a novel way to increase control over the generation of LRMs, and a new tool to study behaviors of LRMs.
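The count-based early stopping idea can be sketched as a loop that tags each generated step and stops once a watched tag has appeared a set number of times. This is a minimal sketch under stated assumptions: the real framework uses a trained sentence classifier and the ReasonType taxonomy, neither of which is specified in the abstract, so `toy_tagger`, the tag names, and `max_count` are all hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable


def generate_with_early_stop(steps: Iterable[str],
                             tag_step: Callable[[str], str],
                             watched_tag: str,
                             max_count: int) -> list[str]:
    """Consume reasoning steps until watched_tag has been seen max_count times."""
    counts: Counter = Counter()
    kept = []
    for step in steps:
        kept.append(step)
        counts[tag_step(step)] += 1
        if counts[watched_tag] >= max_count:
            break  # stop generation; the caller would then request a final answer
    return kept


def toy_tagger(step: str) -> str:
    """Hypothetical keyword rule standing in for the trained classifier."""
    if step.lower().startswith(("wait", "let me check")):
        return "verification"
    return "other"


if __name__ == "__main__":
    trace = ["Compute 2 + 3 = 5.", "Wait, let me verify.",
             "So the answer is 5.", "Let me check again.", "Still 5."]
    kept = generate_with_early_stop(trace, toy_tagger, "verification", 2)
    print(len(kept))  # 4
```

Because `steps` can be any iterable, the same loop works on a live generator that yields sentences as the model produces them, which is what makes the criterion usable online.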

[16] Effect of Document Packing on the Latent Multi-Hop Reasoning Capabilities of Large Language Models

Gabriele Prato,Shagun Sodhani,Alessandro Sordoni,Sarath Chandar

Main category: cs.CL

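TL;DR: This paper studies how different document-packing strategies affect the latent multi-hop reasoning ability of large language models and finds that packing improves performance at the cost of more compute.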

Details

Motivation: How document packing affects the capabilities of large language models, and multi-hop reasoning in particular, remains largely unexplored.

Method: The authors compare model performance under different document-packing strategies experimentally and run an ablation study to identify the key factors.

Result: Packing improves model performance compared to training on individual documents, but requires more compute.

Conclusion: Document packing helps the multi-hop reasoning ability of large language models, and the study provides practical guidance for optimizing model training.

Abstract: The standard practice for training large language models involves packing multiple documents together to optimize computational efficiency. However, the impact of this process on the models' capabilities remains largely unexplored. To address this gap, we investigate how different document-packing strategies influence the latent multi-hop reasoning abilities of LLMs. Our findings indicate that packing can improve model performance compared to training on individual documents, at the expense of more compute. To further understand the underlying mechanisms, we conduct an ablation study, identifying key factors that explain the advantages of packing. Ultimately, our research deepens the understanding of LLM training dynamics and provides practical insights for optimizing model development.
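For context, the simplest form of document packing concatenates tokenized documents, separated by a boundary token, and cuts the stream into fixed-length training sequences. The sketch below shows that one common strategy only; the abstract does not say which strategies the paper compares. The function name, the separator id, and the choice to drop the incomplete tail are assumptions.

```python
def pack_documents(docs: list[list[int]], seq_len: int,
                   sep_id: int) -> list[list[int]]:
    """Concatenate token-id documents and split into seq_len-sized chunks."""
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(sep_id)  # mark the document boundary
    # Keep only full-length sequences; the incomplete tail is dropped.
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]


if __name__ == "__main__":
    docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(pack_documents(docs, seq_len=4, sep_id=0))
    # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Note that the second packed sequence contains tokens from two different documents. Whether a model may attend across such boundaries is one of the design choices that could plausibly affect cross-document (multi-hop) reasoning.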

[17] SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models

Shizhuo Mao,Song Chen,Yi Kang

Main category: cs.CL

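TL;DR: SASQ is a lightweight quantization-aware training framework for the activation quantization factors of large language models. By optimizing only the quantization factors and leaving pretrained weights unchanged, it achieves high accuracy while maintaining deployment efficiency.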

Details

Motivation: The growing size of large language models is outpacing GPU memory, which makes deployment difficult, and existing quantization methods trade off computational overhead, accuracy, and training cost.

Method: SASQ optimizes only the activation quantization factors rather than the model weights, and adaptively truncates outliers to make quantization easier while preserving the distributional characteristics of the activations. It supports static inference.

Result: On LLaMA2-7B, SASQ achieves 5.2% lower perplexity than QuaRot and 4.7% lower perplexity than the FP16 model on WikiText2, and it outperforms existing state-of-the-art quantization schemes.

Conclusion: SASQ substantially improves quantized model performance without adding deployment complexity, offering an effective solution for efficient LLM deployment.

Abstract: Large language models (LLMs) excel at natural language tasks but face deployment challenges due to their growing size outpacing GPU memory advancements. Model quantization mitigates this issue by lowering weight and activation precision, but existing solutions face fundamental trade-offs: dynamic quantization incurs high computational overhead and poses deployment challenges on edge devices, while static quantization sacrifices accuracy. Existing approaches of quantization-aware training (QAT) further suffer from weight training costs. We propose SASQ: a lightweight QAT framework specifically tailored for activation quantization factors. SASQ optimizes only the quantization factors (without changing pre-trained weights), enabling static inference with high accuracy while maintaining deployment efficiency. SASQ adaptively truncates some outliers, thereby reducing the difficulty of quantization while preserving the distributional characteristics of the activations. SASQ not only surpasses existing SOTA quantization schemes but also outperforms the corresponding FP16 models. On LLaMA2-7B, it achieves 5.2% lower perplexity than QuaRot and 4.7% lower perplexity than the FP16 model on WikiText2.
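To illustrate what a static activation quantization factor does, the sketch below applies symmetric quantize-then-dequantize with a fixed clipping threshold, so values beyond the threshold (outliers) are truncated. This is a generic sketch, not SASQ itself: in SASQ the factor is learned during quantization-aware training, whereas here `clip` is simply a fixed argument. The function and parameter names are assumptions.

```python
def quantize_static(values: list[float], clip: float,
                    num_bits: int = 8) -> list[float]:
    """Symmetric fake-quantization with a fixed (static) clipping threshold."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    scale = clip / qmax             # static factor; learned in SASQ, fixed here
    out = []
    for v in values:
        clipped = max(-clip, min(clip, v))  # truncate outliers
        q = round(clipped / scale)          # integer code in [-qmax, qmax]
        out.append(q * scale)               # dequantize back to float
    return out


if __name__ == "__main__":
    print(quantize_static([0.4, 3.6, 500.0, -500.0], clip=127.0))
    # [0.0, 4.0, 127.0, -127.0]
```

A smaller `clip` gives finer resolution for typical values but truncates more outliers, and a larger one does the opposite. Because the factor is fixed ahead of time, inference needs no per-batch statistics, which is the deployment advantage of static over dynamic quantization that the abstract mentions.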

[18] C-ing Clearly: Enhanced Binary Code Explanations using C code

Teodor Poncu,Ioana Pintilie,Marius Dragoi,Dragos Tantaru,Florin Brad

Main category: cs.CL

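TL;DR: The paper proposes C-ing Clearly, a synthetic data generation method that uses the corresponding C code to improve LLMs' understanding of assembly; fine-tuning on the generated data improves binary code summarization and vulnerability detection.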

Details Motivation: LLMs perform well on high-level programming languages but poorly on low-level languages such as assembly, so their understanding of assembly code needs to improve. Method: C-ing Clearly generates synthetic data from the corresponding C code and uses it to fine-tune LLMs. Result: The method consistently improves binary code summarization and vulnerability detection across several LLM families and model sizes. Conclusion: Using the C code that corresponds to the assembly as auxiliary information strengthens LLMs' understanding of low-level code, and the approach applies broadly. Abstract: Large Language Models (LLMs) typically excel at coding tasks involving high-level programming languages, as opposed to lower-level programming languages, such as assembly. We propose a synthetic data generation method named C-ing Clearly, which leverages the corresponding C code to enhance an LLM's understanding of assembly. By fine-tuning on data generated through our method, we demonstrate improved LLM performance for binary code summarization and vulnerability detection. Our approach demonstrates consistent gains across different LLM families and model sizes.

[19] Linguists should learn to love speech-based deep learning models

Marianne de Heer Kloots,Paul Boersma,Willem Zuidema

Main category: cs.CL

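TL;DR: This paper discusses the limitations of generative text-based large language models for engaging with linguistics and argues that audio-based deep learning models should play a crucial role.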

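Details Motivation: Because current research concentrates on text-based large language models, many important questions about human language cannot be adequately explored; audio-based models are needed to connect better with linguistic theory. Method: The authors analyze the framework proposed by Futrell and Mahowald, point out its limitations, and argue for the advantages and necessity of audio-based deep learning models. Result: They argue that audio-based deep learning models capture human language phenomena more fully and thereby enable more productive interaction between linguistics and deep learning. Conclusion: To understand human language better, future research should take audio-based deep learning methods seriously and integrate them. Abstract: Futrell and Mahowald present a useful framework bridging technology-oriented deep learning systems and explanation-oriented linguistic theories. Unfortunately, the target article's focus on generative text-based LLMs fundamentally limits fruitful interactions with linguistics, as many interesting questions on human language fall outside what is captured by written text. We argue that audio-based deep learning models can and should play a crucial role.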

[20] VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse

Ying Nie,Kai Han,Hongguang Li,Hang Zhou,Tianyu Guo,Enhua Wu,Xinghao Chen,Yunhe Wang

Main category: cs.CL

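TL;DR: The paper proposes VersatileFFN, a novel feed-forward network that reuses parameters along both the width and depth dimensions to increase representational capacity without additional memory cost.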

Details Motivation: Existing parameter-efficient approaches such as pruning and quantization mainly compress pretrained models and cannot increase architectural capacity, so representational power stays capped by the base model. Method: A dual-path adaptive design: a width-versatile path derives a mixture of sub-experts from a single shared FFN, mimicking sparse expert routing, while a depth-versatile path recursively applies the same FFN to process complex tokens more deeply; a difficulty-aware gate dynamically balances the two paths. Result: Experiments across multiple benchmarks and model scales show that the method improves performance without adding parameters. Conclusion: VersatileFFN trades computation for capacity, enabling flexible and efficient model scaling beyond the base model's representational ceiling. Abstract: The rapid scaling of Large Language Models (LLMs) has achieved remarkable performance, but it also leads to prohibitive memory costs. Existing parameter-efficient approaches such as pruning and quantization mainly compress pretrained models without enhancing architectural capacity, thereby hitting the representational ceiling of the base model. In this work, we propose VersatileFFN, a novel feed-forward network (FFN) that enables flexible reuse of parameters in both width and depth dimensions within a fixed parameter budget. Inspired by the dual-process theory of cognition, VersatileFFN comprises two adaptive pathways: a width-versatile path that generates a mixture of sub-experts from a single shared FFN, mimicking sparse expert routing without increasing parameters, and a depth-versatile path that recursively applies the same FFN to emulate deeper processing for complex tokens. A difficulty-aware gating dynamically balances the two pathways, steering "easy" tokens through the efficient width-wise route and allocating deeper iterative refinement to "hard" tokens. Crucially, both pathways reuse the same parameters, so all additional capacity comes from computation rather than memory. Experiments across diverse benchmarks and model scales demonstrate the effectiveness of the method. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/VersatileFFN.
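
The idea of reusing one FFN along width and depth can be illustrated with a toy NumPy sketch. This is one possible reading of the abstract, not the authors' implementation: treating contiguous slices of hidden units as sub-experts, the top-k routing, the residual connections, and the sigmoid gate on a scalar `difficulty` score are all assumptions made for illustration, and every name and size below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_HIDDEN, N_SUB = 8, 32, 4                      # hypothetical sizes
W_in = rng.standard_normal((D_MODEL, D_HIDDEN)) * 0.1    # the single shared FFN
W_out = rng.standard_normal((D_HIDDEN, D_MODEL)) * 0.1

def shared_ffn(x, hidden_weights=None):
    """ReLU FFN; optional per-unit weights select and weight sub-experts."""
    hidden = np.maximum(x @ W_in, 0.0)
    if hidden_weights is not None:
        hidden = hidden * hidden_weights
    return hidden @ W_out

def width_path(x, router_logits, top_k=2):
    """Assumed width reuse: slices of the hidden layer act as sub-experts."""
    slice_size = D_HIDDEN // N_SUB
    top = np.argsort(router_logits)[-top_k:]             # indices of top-k sub-experts
    probs = np.exp(router_logits[top] - router_logits[top].max())
    probs /= probs.sum()
    weights = np.zeros(D_HIDDEN)
    for p, e in zip(probs, top):
        weights[e * slice_size:(e + 1) * slice_size] = p
    return x + shared_ffn(x, weights)

def depth_path(x, steps=3):
    """Assumed depth reuse: apply the same FFN recursively with residuals."""
    for _ in range(steps):
        x = x + shared_ffn(x)
    return x

def versatile_ffn(x, router_logits, difficulty):
    """Blend both paths with a sigmoid gate on a scalar difficulty score."""
    gate = 1.0 / (1.0 + np.exp(-difficulty))
    return (1.0 - gate) * width_path(x, router_logits) + gate * depth_path(x)

token = rng.standard_normal(D_MODEL)
logits = np.array([0.1, 2.0, -1.0, 0.5])
out = versatile_ffn(token, logits, difficulty=0.0)
```

Both paths read only `W_in` and `W_out`, so the extra capacity in this sketch comes from additional computation rather than additional parameters, which is the property the abstract emphasizes.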

[21] Dual Language Models: Balancing Training Efficiency and Overfitting Resilience

David Samuel,Lucas Georges Gabriel Charpentier

Main category: cs.CL

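TL;DR: This paper combines autoregressive and masked-diffusion training objectives without architectural modifications, yielding flexible language models that outperform single-objective models.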

Details Motivation: Autoregressive models train efficiently but overfit easily, whereas masked-diffusion models resist overfitting but train less efficiently; the paper aims to combine the strengths of both. Method: The authors train and evaluate 50 language models under varying levels of data repetition to find the optimal ratio between the two objectives. Result: Dual-objective training outperforms single-objective models in all evaluated settings, and the optimal ratio is similar across downstream targets. Conclusion: Combining autoregressive and masked-diffusion objectives balances training efficiency with resilience to overfitting and yields better language models. Abstract: This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, that comes at the cost of sensitivity to overfitting. On the other hand, masked-diffusion models are less efficient to train while being more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal ratio between both objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that it is optimal to combine both objectives under all evaluated settings and that the optimal ratio is similar whether targeting autoregressive or masked-diffusion downstream performance.

Nguyen Tien Dong,Minh-Anh Nguyen,Thanh Dat Hoang,Nguyen Tuan Ngoc,Dao Xuan Quang Minh,Phan Phi Hai,Nguyen Thi Ngoc Anh,Dang Van Tu,Binh Vu

Main category: cs.CL

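TL;DR: VLegal-Bench is the first comprehensive benchmark for evaluating large language models on Vietnamese legal tasks. Designed around Bloom's cognitive taxonomy, it contains 10,450 samples labeled and validated by legal experts and covers a range of practical legal use cases.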

Details Motivation: Vietnamese legislation is structurally complex and frequently revised, and existing methods struggle to evaluate how well large language models understand and apply legal knowledge in this setting. Method: The authors build a multi-level evaluation scheme based on Bloom's cognitive taxonomy and create a high-quality dataset through a rigorous annotation process involving legal experts, covering question answering, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving. Result: The resulting VLegal-Bench benchmark contains 10,450 samples, all grounded in authoritative legal documents and reflecting real-world legal assistant workflows; it is standardized, transparent, and cognitively informed. Conclusion: VLegal-Bench provides a reliable basis for evaluating LLMs in Vietnamese legal contexts and supports the development of more trustworthy, interpretable, and ethically aligned AI-assisted legal systems. Abstract: The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, Vietnamese Legal Benchmark (VLegal-Bench) is introduced, the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom's cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross-validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems.

[23] Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis

Hongli Li,Che Han Chen,Kevin Fan,Chiho Young-Johnson,Soyoung Lim,Yali Feng

Main category: cs.CL

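TL;DR: This study synthesizes 65 studies from 2022 to 2025 on agreement between large language models (LLMs) and human raters in automatic essay scoring (AES). LLM-human agreement is generally moderate to good but varies considerably, mainly because of differences in study design and inconsistent reporting standards.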

Details Motivation: To examine the reliability of large language models in automatic essay scoring and to assess systematically how closely they agree with human raters. Method: Following the PRISMA 2020 guidelines, the authors systematically review 65 published and unpublished studies from January 2022 to August 2025 and analyze LLM-human agreement indices such as Quadratic Weighted Kappa, Pearson correlation, and Spearman's rho. Result: Agreement indices between LLMs and human raters mostly fall between 0.30 and 0.80, a moderate-to-good level overall, with substantial heterogeneity across studies. Conclusion: LLMs show promise for AES, but current studies lack standardization in methods and reporting; a unified evaluation framework is needed to improve comparability and reliability. Abstract: Despite the growing promise of large language models (LLMs) in automatic essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLMs and human raters in AES. Across studies, reported LLM-human agreement was generally moderate to good, with agreement indices (e.g., Quadratic Weighted Kappa, Pearson correlation, and Spearman's rho) mostly ranging between 0.30 and 0.80. Substantial variability in agreement levels was observed across studies, reflecting differences in study-specific factors as well as the lack of standardized reporting practices. Implications and directions for future research are discussed.
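
For readers unfamiliar with Quadratic Weighted Kappa, the agreement index named above, the following sketch computes it from two lists of integer scores using the standard definition (observed versus chance-expected disagreement, weighted by squared score distance). The example scores are made up for illustration and do not come from any of the reviewed studies; the function name is hypothetical.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """QWK for integer scores in 0..n_classes-1 (needs n_classes >= 2).

    Undefined (division by zero) if both raters use a single identical score.
    """
    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    numerator = denominator = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n   # agreement expected by chance
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator

human = [0, 1, 2, 3, 4]          # illustrative scores only
model_same = [0, 1, 2, 3, 4]
model_reversed = [4, 3, 2, 1, 0]
kappa_same = quadratic_weighted_kappa(human, model_same, 5)
kappa_reversed = quadratic_weighted_kappa(human, model_reversed, 5)
```

Identical scores give a kappa of 1 and fully reversed scores give -1 in this example, which brackets the 0.30 to 0.80 range reported in the synthesis.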

[24] Polypersona: Persona-Grounded LLM for Synthetic Survey Responses

Tejaswani Dash,Dinesh Karri,Anudeep Vurity,Gautam Datla,Tazeem Ahmad,Saima Rafi,Rohith Tangudu

Main category: cs.CL

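TL;DR: This paper presents PolyPersona, a framework for generating persona-conditioned survey responses across multiple domains. It instruction-tunes small language models using LoRA adapters and 4-bit quantization under a resource-adaptive training setup.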

Details Motivation: To generate consistent and reliable persona-conditioned survey data across domains at lower computational cost, the authors explore how well small language models can generate under explicit persona guidance. Method: A dialogue-based data pipeline is combined with LoRA adapters and 4-bit quantization to instruction-tune small models such as TinyLlama and Phi-2, preserving persona cues and keeping the generated responses behaviorally consistent. Result: The authors construct a dataset of 3,568 synthetic responses spanning 10 domains and 433 personas; experiments show that small models perform comparably to larger 7B to 8B models, with a highest BLEU of 0.090 and ROUGE-1 of 0.429. Conclusion: Persona-conditioned fine-tuning lets small language models efficiently generate coherent and reliable synthetic survey data, and the framework offers a transparent, reproducible basis for scalable evaluation and bias analysis. Abstract: This paper introduces PolyPersona, a generative framework for synthesizing persona-conditioned survey responses across multiple domains. The framework instruction-tunes compact chat models using parameter-efficient LoRA adapters with 4-bit quantization under a resource-adaptive training setup. A dialogue-based data pipeline explicitly preserves persona cues, ensuring consistent behavioral alignment across generated responses. Using this pipeline, we construct a dataset of 3,568 synthetic survey responses spanning ten domains and 433 distinct personas, enabling controlled instruction tuning and systematic multi-domain evaluation. We evaluate the generated responses using a multi-metric evaluation suite that combines standard text generation metrics, including BLEU, ROUGE, and BERTScore, with survey-specific metrics designed to assess structural coherence, stylistic consistency, and sentiment alignment. Experimental results show that compact models such as TinyLlama 1.1B and Phi-2 achieve performance comparable to larger 7B to 8B baselines, with a highest BLEU score of 0.090 and ROUGE-1 of 0.429. These findings demonstrate that persona-conditioned fine-tuning enables small language models to generate reliable and coherent synthetic survey data. The proposed framework provides an efficient and reproducible approach for survey data generation, supporting scalable evaluation while facilitating bias analysis through transparent and open protocols.

[25] Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies

Ekaterina Artemova,Laurie Burchell,Daryna Dementieva,Shu Okabe,Mariya Shmatova,Pedro Ortiz Suarez

Main category: cs.CL

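TL;DR: This tutorial gives NLP practitioners working with multilingual and low-resource languages a practical toolkit for building end-to-end NLP pipelines, from data collection to downstream applications, with an emphasis on fair, reproducible, and community-informed development.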

Details Motivation: Low-resource languages face data scarcity and cultural variance in natural language processing; the tutorial aims to promote more equitable and socially impactful language technologies. Method: It presents hands-on strategies and modeling frameworks for end-to-end NLP pipelines, from web crawling, parallel sentence mining, and machine translation to text classification and multimodal reasoning, illustrated with real-world scenarios and use cases in many languages. Result: Participants acquire practical skills for building NLP systems for low-resource languages and learn how to carry out fair and reproducible research and applications across languages and cultural contexts. Conclusion: Community involvement and empirically grounded methods can address the technical challenges of low-resource languages and make language technology more inclusive and diverse. Abstract: This tutorial (https://tum-nlp.github.io/low-resource-tutorial) is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages -- from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We will focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We will showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.

[26] Towards Nepali-language LLMs: Efficient GPT training with a Nepali BPE tokenizer

Adarsha Shrestha,Basanta Pokharel,Binit Shrestha,Smriti Adhikari,Dinesh Gothe

Main category: cs.CL

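TL;DR: Building on the GPT-2 architecture and training strategies inspired by GPT-3, this study develops an efficient language model for Nepali. It uses a custom BPE tokenizer, is pretrained on large-scale Nepali text, and generates coherent Nepali text.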

Details Motivation: Nepali is a low-resource language with complex grammar, rich morphology, and scarce high-quality corpora, and existing NLP models handle Nepali text generation poorly. Method: The model uses the GPT-2 architecture with optimized learning rate schedules, batch scaling, and architectural refinements; a 16k BPE tokenizer is trained exclusively on Nepali text, pretraining uses a 10.75GB cleaned NepBERTa corpus plus web-scraped news articles, and FlashAttention is integrated to reduce memory usage. Result: After two epochs, the model reaches a training loss of 3.168177, a validation loss of 3.081982, and a final perplexity of 21.80, and it generates coherent Nepali news-style text. Conclusion: The model performs well on Nepali text generation and offers a workable technical path for NLP research on low-resource languages. Abstract: Nepali, a low-resource language spoken by over 32 million people, continues to face challenges in natural language processing (NLP) due to its complex grammar, agglutinative morphology, and limited availability of high-quality corpora. Most efforts to date have centered on basic encoder architectures; they remain insufficient for Nepali-specific text generation. This study presents a GPT-2-based Nepali language model trained using several training strategies inspired by GPT-3, including optimized learning rate schedules, batch scaling, and architectural refinements. A custom 16k Byte-Pair Encoding (BPE) tokenizer was trained exclusively on Nepali text to ensure more consistent segmentation and improved input representation. The model was pretrained on a combined dataset comprising a 10.75GB cleaned NepBERTa corpus and additional web-scraped Nepali news articles. FlashAttention was integrated to reduce memory usage and stabilize training. After two epochs, the model achieved a training loss of 3.168177, a validation loss of 3.081982, and a final perplexity of 21.80, demonstrating its capability to generate coherent Nepali news-style text.
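
The reported loss and perplexity can be related with a short calculation. The sketch assumes the loss is a mean per-token cross-entropy in nats, so that perplexity is its exponential; under that assumption the reported perplexity of 21.80 is consistent with the validation loss rather than the training loss. The function name is hypothetical.

```python
import math

def perplexity(mean_cross_entropy_nats):
    """Perplexity of a language model from its mean per-token loss in nats."""
    return math.exp(mean_cross_entropy_nats)

validation_ppl = perplexity(3.081982)   # validation loss from the abstract
training_ppl = perplexity(3.168177)     # training loss from the abstract
print(f"validation perplexity: {validation_ppl:.2f}")
print(f"training perplexity:   {training_ppl:.2f}")
```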

[27] JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

Atsuyuki Miyai,Shota Onohara,Jeonghun Baek,Kiyoharu Aizawa

Main category: cs.CL

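TL;DR: This paper introduces JMMMU-Pro, an image-based Japanese multi-discipline multimodal understanding benchmark, together with Vibe Benchmark Construction, a method for building high-quality visual question answering benchmarks efficiently.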

Details Motivation: To evaluate the Japanese multimodal understanding of large multimodal models (LMMs) more rigorously and to address the weakness of existing benchmarks in testing integrated visual-textual understanding. Method: In Vibe Benchmark Construction, an image generative model (e.g., Nano Banana Pro) produces candidate visual questions; humans verify the outputs and, where necessary, regenerate them with adjusted prompts to ensure quality. Result: The resulting JMMMU-Pro benchmark is high quality and low cost and covers a wide range of background and layout designs; experiments show that all open-source LMMs struggle on it. Conclusion: JMMMU-Pro provides a more rigorous tool for assessing the Japanese capabilities of LMMs, and Vibe Benchmark Construction offers an efficient guideline for developing future image-based VQA benchmarks. Abstract: This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.

[28] TiME: Tiny Monolingual Encoders for Efficient NLP Pipelines

David Schulmeister,Valentin Hartmann,Lars Klein,Robert West

Main category: cs.CL

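TL;DR: This paper presents TiME (Tiny Monolingual Encoders), small monolingual encoder models for efficiency-critical NLP applications. Trained with modern techniques such as knowledge distillation, they offer a better trade-off between benchmark performance and throughput, latency, and energy consumption.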

Details Motivation: Large language models are powerful but inefficient and energy-hungry for narrowly defined tasks; they struggle to meet real-time and sustainability requirements and are hard to deploy on resource-constrained devices, so smaller and more efficient specialized models are needed. Method: The authors train small monolingual encoder models (TiME) with modern techniques such as knowledge distillation, and they explore distilling monolingual students from multilingual teachers as well as students with absolute positional embeddings from teachers with relative positional embeddings. Result: TiME models perform well on a range of common NLP tasks while substantially improving throughput and reducing latency and energy consumption; cross-type distillation (multilingual to monolingual, relative to absolute positional embeddings) is shown to be feasible. Conclusion: Small models combined with modern distillation can outperform large general-purpose models in efficiency-critical settings, and TiME offers an effective solution for low-resource language support and energy-efficient NLP systems. Abstract: Today, a lot of research on language models is focused on large, general-purpose models. However, many NLP pipelines only require models with a well-defined, small set of capabilities. While large models are capable of performing the tasks of those smaller models, they are simply not fast enough to process large amounts of data or offer real-time responses. Furthermore, they often use unnecessarily large amounts of energy, leading to sustainability concerns and problems when deploying them on battery-powered devices. In our work, we show how to train small models for such efficiency-critical applications. As opposed to many off-the-shelf NLP pipelines, our models use modern training techniques such as distillation, and offer support for low-resource languages. We call our models TiME (Tiny Monolingual Encoders) and comprehensively evaluate them on a range of common NLP tasks, observing an improved trade-off between benchmark performance on one hand, and throughput, latency and energy consumption on the other. Along the way, we show that distilling monolingual models from multilingual teachers is possible, and likewise distilling models with absolute positional embeddings from teachers with relative positional embeddings.

[29] Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

Lanxiang Hu,Siqi Kou,Yichao Fu,Samyam Rajbhandari,Tajana Rosing,Yuxiong He,Zhijie Deng,Hao Zhang

Main category: cs.CL

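TL;DR: This paper proposes Jacobi Forcing, a progressive distillation paradigm that smoothly turns autoregressive models into efficient parallel decoders while preserving their pretrained causal inference property. Combined with multi-block decoding and rejection recycling, it substantially speeds up inference.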

Details Motivation: Existing diffusion large language models (dLLMs) that adapt pretrained models for parallel decoding suffer from a mismatch between pretraining and post-training data distributions, and their bidirectional attention conflicts with the causal prior learned during pretraining, which limits KV cache reuse and speedup. Method: Jacobi Forcing trains models on their own generated parallel decoding trajectories, gradually converting an autoregressive model into a parallel decoder; multi-block decoding with rejection recycling is added to increase the number of tokens accepted per iteration. Result: The Jacobi Forcing Model achieves a 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance; with multi-block decoding, the speedup reaches nearly 4.0x and the token acceptance count per iteration rises by up to 4.5x. Conclusion: Jacobi Forcing offers an effective path for converting autoregressive models into efficient parallel decoders, resolving the distribution mismatch between pretraining and post-training and the attention conflict, and substantially reducing inference latency. Abstract: Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. The model trained under this paradigm, the Jacobi Forcing Model, achieves 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models' trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.
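
The "parallel decoding trajectories" mentioned above come from Jacobi decoding, which treats greedy generation of a block as a fixed-point problem. The toy sketch below shows only plain Jacobi iteration, not Jacobi Forcing training or rejection recycling. The deterministic `next_token` function is a hypothetical stand-in for a greedy language model, and in a real system all positions of one iteration would be evaluated in a single forward pass.

```python
def next_token(prefix):
    """Hypothetical stand-in for greedy argmax of a causal language model."""
    return (3 * sum(prefix) + len(prefix)) % 7

def greedy_decode(prompt, n_tokens):
    """Ordinary autoregressive decoding: one token per step."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n_tokens):
    """Refine a guess for the whole block until it stops changing.

    Each iteration updates every position from the previous guess, so the
    positions are independent and could be computed in parallel.
    """
    guess = [0] * n_tokens
    for iteration in range(1, n_tokens + 1):
        new_guess = [next_token(list(prompt) + guess[:i]) for i in range(n_tokens)]
        if new_guess == guess:          # fixed point reached
            return guess, iteration
        guess = new_guess
    return guess, n_tokens              # at most n_tokens updates are needed

prompt = [1, 2]
block, iterations = jacobi_decode(prompt, 6)
reference = greedy_decode(prompt, 6)
```

After k iterations the first k tokens are guaranteed to match greedy decoding, so the fixed point equals the autoregressive output; speedups arise only when a model gets several tokens right per iteration, which is the behavior Jacobi Forcing is described as training for.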

[30] Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization

Yen-Ju Lu,Kunxiao Gao,Mingrui Liang,Helin Wang,Thomas Thebaud,Laureano Moro-Velazquez,Najim Dehak,Jesus Villalba

Main category: cs.CL

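TL;DR: This paper introduces Spoken DialogSum, the first dataset that aligns raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels (speaker age, gender, and emotion), with the aim of advancing research on emotion-aware and spoken dialogue summarization.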

Details Motivation: Research on emotion-aware or spoken dialogue summarization is limited by the lack of data linking speech, summaries, and paralinguistic cues, so a multimodal dialogue dataset with rich emotional information is needed. Method: First, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels and tags each utterance with emotion, pitch, and speaking rate; then an expressive TTS engine synthesizes speech aligned with the paralinguistic labels. Result: Spoken DialogSum contains 13,460 emotion-diverse dialogues, each paired with a factual and an emotion-focused summary; baselines show that an Audio-LLM raises emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system. Conclusion: Spoken DialogSum fills a data gap in emotion-aware dialogue summarization and confirms the value of end-to-end speech modeling for emotional summary generation. Abstract: Recent audio language models can follow long conversations. However, research on emotion-aware or spoken dialogue summarization is constrained by the lack of data that links speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus aligning raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate. Second, an expressive TTS engine synthesizes speech from the tagged scripts, aligned with paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. The dataset is available online at https://fatfat-emosum.github.io/EmoDialog-Sum-Audio-Samples/. Baselines show that an Audio-LLM raises emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.

[31] MMGR: Multi-Modal Generative Reasoning

Zefan Cai,Haoyi Qiu,Tianyi Ma,Haozhe Zhao,Gengze Zhou,Kung-Hsiang Huang,Parisa Kordjamshidi,Minjia Zhang,Xiao Wen,Jiuxiang Gu,Nanyun Peng,Junjie Hu

Main category: cs.CL

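TL;DR: This paper presents MMGR, a benchmark for evaluating multi-modal generative reasoning. It assesses the physical, logical, temporal, and spatial reasoning of video and image generation models and reveals severe weaknesses in abstract reasoning and long-horizon spatial planning.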

Details Motivation: Existing video generation metrics such as FVD focus on perceptual quality and overlook reasoning abilities such as causality, physical laws, and global consistency, so a new evaluation framework is needed to measure whether generative models actually capture world dynamics. Method: The MMGR framework covers five reasoning abilities (Physical, Logical, 3D Spatial, 2D Spatial, and Temporal) and tests them in three domains (Abstract Reasoning, Embodied Navigation, and Physical Commonsense), using fine-grained metrics that require holistic correctness across video and image generation. Result: Evaluation of leading models (Veo-3, Sora-2, GPT-4o-image, and others) shows moderate success on Physical Commonsense tasks but poor performance on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and on long-horizon planning in embodied navigation. Conclusion: Current generative models still rely too heavily on perceptual data and lack causal correctness and global state consistency; MMGR provides a unified diagnostic tool that supports the development of reasoning-aware generative world models. Abstract: Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.

cs.CV [Back]

[32] Complex Mathematical Expression Recognition: Benchmark, Large-Scale Dataset and Strong Baseline

Weikang Bai,Yongkun Du,Yuchen Su,Yazhen Xie,Zhineng Chen

Main category: cs.CV

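TL;DR: This paper introduces CMER-Bench, a benchmark for evaluating mathematical expression recognition (MER), and builds the large-scale complex-expression datasets MER-17M and CMER-3M. It also proposes a new expression tokenizer and a Structured Mathematical Language representation, and develops CMERNet, a small but strong model that significantly outperforms existing methods on complex expressions.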

Details Motivation: Existing MER models degrade significantly on complex, multi-line expressions with many tokens, mainly because training datasets are dominated by simple samples and cover complex expressions poorly; better benchmarks and datasets are needed. Method: The authors build the graded benchmark CMER-Bench and two large-scale datasets, MER-17M and CMER-3M; propose Structured Mathematical Language (SML) and a new expression tokenizer to model the hierarchical and spatial structure of expressions; and design CMERNet, a lightweight encoder-decoder model trained on CMER-3M. Result: Current models perform well on easy and moderate expressions but degrade clearly on complex ones; CMERNet, with only 125 million parameters, significantly outperforms existing MER models and multimodal large language models on CMER-Bench. Conclusion: High-quality benchmarks and datasets, together with representations and models that explicitly capture expression structure, improve the recognition of complex mathematical expressions, and CMERNet offers an efficient solution for the task. Abstract: Mathematical Expression Recognition (MER) has made significant progress in recognizing simple expressions, but the robust recognition of complex mathematical expressions with many tokens and multiple lines remains a formidable challenge. In this paper, we first introduce CMER-Bench, a carefully constructed benchmark that categorizes expressions into three difficulty levels: easy, moderate, and complex. Leveraging CMER-Bench, we conduct a comprehensive evaluation of existing MER models and general-purpose multimodal large language models (MLLMs). The results reveal that while current methods perform well on easy and moderate expressions, their performance degrades significantly when handling complex mathematical expressions, mainly because existing public training datasets are primarily composed of simple samples. In response, we propose MER-17M and CMER-3M that are large-scale datasets emphasizing the recognition of complex mathematical expressions. The datasets provide rich and diverse samples to support the development of accurate and robust complex MER models. Furthermore, to address the challenges posed by the complicated spatial layout of complex expressions, we introduce a novel expression tokenizer, and a new representation called Structured Mathematical Language, which explicitly models the hierarchical and spatial structure of expressions beyond LaTeX format. Based on these, we propose a specialized model named CMERNet, built upon an encoder-decoder architecture and trained on CMER-3M. Experimental results show that CMERNet, with only 125 million parameters, significantly outperforms existing MER models and MLLMs on CMER-Bench.

[33] Human-AI Collaboration Mechanism Study on AIGC Assisted Image Production for Special Coverage

Yajie Yang,Yuqing Zhao,Xiaochao Xi,Yinan Zhu

Main category: cs.CV

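TL;DR: This paper examines the use of AI-generated content (AIGC) for image production in journalistic special coverage. Through two experiments it proposes a controllable generation pathway and advocates a human-AI collaboration mechanism to address authenticity, semantic alignment, and sociotechnical trust.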

Details Motivation: Most AIGC tools are black boxes, which leads to misinformation, loss of authenticity, semantic drift, and poor interpretability in news reporting; a controllable and traceable image generation mechanism is needed to meet the demands of journalistic accuracy and public trust. Method: Two experiments are conducted. Experiment 1 tests the cross-platform adaptability of standardized prompts on three platforms. Experiment 2 builds a human-in-the-loop modular pipeline combining high-precision segmentation, semantic alignment, and style regulation, with CLIP-based semantic scoring, NSFW/OCR/YOLO filtering, and verifiable content credentials to ensure editorial fidelity. Result: Platforms differ in semantic alignment, cultural specificity, and visual realism, mainly because of training-data bias and platform-level filtering; the proposed human-in-the-loop pipeline improves controllability, semantic consistency, and traceability. Conclusion: The paper proposes a human-AI collaboration framework for AIGC-assisted image production in special coverage and recommends Character Identity Stability (CIS), Cultural Expression Accuracy (CEA), and User-Public Appropriateness (U-PA) as evaluation criteria, in support of transparent and trustworthy use of AIGC in journalism. Abstract: Artificial Intelligence Generated Content (AIGC) assisting image production triggers controversy in journalism while attracting attention from media agencies. Key issues involve misinformation, authenticity, semantic fidelity, and interpretability. Most AIGC tools are opaque "black boxes," hindering the dual demands of content accuracy and semantic alignment and creating ethical, sociotechnical, and trust dilemmas. This paper explores pathways for controllable image production in journalism's special coverage and conducts two experiments with projects from China's media agency: (1) Experiment 1 tests cross-platform adaptability via standardized prompts across three scenes, revealing disparities in semantic alignment, cultural specificity, and visual realism driven by training-corpus bias and platform-level filtering. (2) Experiment 2 builds a human-in-the-loop modular pipeline combining high-precision segmentation (SAM, GroundingDINO), semantic alignment (BrushNet), and style regulating (Style-LoRA, Prompt-to-Prompt), ensuring editorial fidelity through CLIP-based semantic scoring, NSFW/OCR/YOLO filtering, and verifiable content credentials. Traceable deployment preserves semantic representation. Consequently, we propose a human-AI collaboration mechanism for AIGC assisted image production in special coverage and recommend evaluating Character Identity Stability (CIS), Cultural Expression Accuracy (CEA), and User-Public Appropriateness (U-PA).

[34] DL$^3$M: A Vision-to-Language Framework for Expert-Level Medical Reasoning through Deep Learning and Large Language Models

Md. Najib Hasan,Imran Ahmad,Sourav Basak Shuvo,Md. Mahadi Hasan Ankon,Sunanda Das,Nazmul Siddique,Hui Wang

Main category: cs.CV

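TL;DR: The paper proposes a framework that combines deep learning image classification with clinical reasoning by large language models (LLMs) for gastric endoscopic image analysis. A new model, MobileCoAtNet, provides high-accuracy classification and drives LLMs to generate structured clinical explanations. Evaluating 32 LLMs on expert-verified benchmarks shows that stronger classification improves explanation quality, but the LLMs remain unstable and therefore unreliable for high-stakes medical decisions.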

Details Motivation: Existing medical image classifiers lack interpretability, while large language models can generate clinical text but perform poorly on visual reasoning and produce unstable explanations, leaving a gap between model outputs and the reasoning clinicians expect. Method: A hybrid model, MobileCoAtNet, classifies gastric endoscopic images with high accuracy, and its outputs drive several LLMs to produce structured clinical reasoning; two expert-verified benchmarks covering causes, symptoms, treatment, and other dimensions are built to evaluate the quality of the LLM explanations. Result: MobileCoAtNet achieves high classification accuracy across eight stomach-related classes; among the 32 LLMs evaluated on its outputs, strong classification improves explanation quality, but every LLM shows unstable reasoning under varied prompts and none reaches human-level stability. Conclusion: Combining deep learning with LLMs can produce useful clinical narratives, but current LLMs remain unreliable for high-stakes medical decisions; the framework clarifies their limits and points toward safer reasoning systems. Abstract: Medical image classifiers detect gastrointestinal diseases well, but they do not explain their decisions. Large language models can generate clinical text, yet they struggle with visual reasoning and often produce unstable or incorrect explanations. This leaves a gap between what a model sees and the type of reasoning a clinician expects. We introduce a framework that links image classification with structured clinical reasoning. A new hybrid model, MobileCoAtNet, is designed for endoscopic images and achieves high accuracy across eight stomach-related classes. Its outputs are then used to drive reasoning by several LLMs. To judge this reasoning, we build two expert-verified benchmarks covering causes, symptoms, treatment, lifestyle, and follow-up care. Thirty-two LLMs are evaluated against these gold standards. Strong classification improves the quality of their explanations, but none of the models reach human-level stability. Even the best LLMs change their reasoning when prompts vary. Our study shows that combining DL with LLMs can produce useful clinical narratives, but current LLMs remain unreliable for high-stakes medical decisions. The framework provides a clearer view of their limits and a path for building safer reasoning systems. The complete source code and datasets used in this study are available at https://github.com/souravbasakshuvo/DL3M.

[35] Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making

Siyuan Dai,Lunxiao Li,Kun Zhao,Eardi Lila,Paul K. Crane,Heng Huang,Dongkuan Xu,Haoteng Tang,Liang Zhan

Main category: cs.CV

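TL;DR: Current multimodal large language models perform poorly on biomedical decision-making tasks, and text-only reasoning outperforms vision-only or multimodal inputs, indicating a lack of grounded visual understanding. In-context learning, vision captioning, and fine-tuning of the vision tower partially mitigate the problem.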

Details Motivation: To investigate why state-of-the-art multimodal large language models underperform on medical decision-making tasks, especially disease classification where visual differences are subtle. Method: An empirical study on two challenging datasets, three-stage Alzheimer's disease classification and MIMIC-CXR chest radiograph classification, compares model performance with text, vision, and multimodal inputs and tests three mitigation strategies: in-context learning with reason-annotated exemplars, text-only inference on image captions, and few-shot fine-tuning of the vision tower with classification supervision. Result: Text-only reasoning consistently outperforms vision-only or vision-text settings, and multimodal inputs often perform worse than text alone; the three strategies improve performance to some extent. Conclusion: Current multimodal large language models lack genuine visual grounding for medical images, and targeted strategies that strengthen their visual representations are needed to improve medical decision making. Abstract: With the rapid progress of large language models (LLMs), advanced multimodal large language models (MLLMs) have demonstrated impressive zero-shot capabilities on vision-language tasks. In the biomedical domain, however, even state-of-the-art MLLMs struggle with basic Medical Decision Making (MDM) tasks. We investigate this limitation using two challenging datasets: (1) three-stage Alzheimer's disease (AD) classification (normal, mild cognitive impairment, dementia), where category differences are visually subtle, and (2) MIMIC-CXR chest radiograph classification with 14 non-mutually exclusive conditions. Our empirical study shows that text-only reasoning consistently outperforms vision-only or vision-text settings, with multimodal inputs often performing worse than text alone. To mitigate this, we explore three strategies: (1) in-context learning with reason-annotated exemplars, (2) vision captioning followed by text-only inference, and (3) few-shot fine-tuning of the vision tower with classification supervision. These findings reveal that current MLLMs lack grounded visual understanding and point to promising directions for improving multimodal decision making in healthcare.

[36] STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning

Jie Qin,Jiancheng Huang,Limeng Qiao,Lin Ma

Main category: cs.CV

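TL;DR: This paper proposes STAR, a stacked autoregressive scheme for task-progressive unified multimodal learning. It stacks isomorphic modules stage by stage (understanding, generation, editing) to improve generation while preserving comprehension, and introduces a high-capacity VQ and an implicit reasoning mechanism to improve generation quality under complex conditions.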

Details Motivation: Existing multimodal large models face optimization conflicts and performance trade-offs when unifying understanding and generation, which makes it hard to achieve both, so a framework is needed that extends capabilities step by step while avoiding cross-task interference. Method: STAR decomposes multimodal learning into three stages (understanding, generation, and editing), freezes the parameters of the base autoregressive model, and progressively stacks isomorphic autoregressive modules; it introduces a high-capacity vector quantizer (VQ) to refine the granularity of image representations and uses an implicit reasoning mechanism to improve generation quality. Result: STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), supporting its effectiveness for unified multimodal learning. Conclusion: Through its task-progressive architecture, STAR improves multimodal understanding and generation together, avoids optimization conflicts between tasks, and offers an effective path toward unified multimodal models. Abstract: Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified target for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity VQ to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.

[37] Time-aware UNet and super-resolution deep residual networks for spatial downscaling

Mika Sipilä,Sabrina Maggio,Sandra De Iaco,Klaus Nordhausen,Monica Palma,Sara Taskinen

Main category: cs.CV

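TL;DR: This paper studies deep learning models for spatial downscaling of satellite-observed tropospheric ozone data and introduces a lightweight temporal module to improve model performance.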

Details Motivation: Satellite data of atmospheric pollutants usually have coarse spatial resolution, which limits their use in local-scale environmental analysis and decision-making, so effective spatial downscaling methods are needed. Method: Two deep learning architectures (SRDRN and UNet) are extended with a temporal encoding module based on sinusoidal or radial basis function (RBF) encoding, which fuses temporal features with the spatial representations for high-resolution reconstruction of ozone concentrations. Result: In a case study over Italy, the models with the temporal module significantly improve downscaling performance and convergence speed over the baselines, with only a small increase in computational cost. Conclusion: A lightweight temporal module effectively strengthens deep learning models for spatial downscaling of ozone data and offers a practical option for high-resolution pollution monitoring. Abstract: Satellite data of atmospheric pollutants are often available only at coarse spatial resolution, limiting their applicability in local-scale environmental analysis and decision-making. Spatial downscaling methods aim to transform the coarse satellite data into high-resolution fields. In this work, two widely used deep learning architectures, the super-resolution deep residual network (SRDRN) and the encoder-decoder-based UNet, are considered for spatial downscaling of tropospheric ozone. Both methods are extended with a lightweight temporal module, which encodes observation time using either sinusoidal or radial basis function (RBF) encoding, and fuses the temporal features with the spatial representations in the networks. The proposed time-aware extensions are evaluated against their baseline counterparts in a case study on ozone downscaling over Italy. The results suggest that, while only slightly increasing computational complexity, the temporal modules significantly improve downscaling performance and convergence speed.
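As a rough illustration of the two temporal encodings named in the abstract above, the following Python sketch turns a day-of-year value into sinusoidal features and into Gaussian radial basis function features. The function names, the yearly period, the number of frequencies, and the RBF centers and width are all assumptions chosen for illustration; the abstract does not specify them, and the fusion with the network's spatial features is not shown.

```python
import math

def sinusoidal_encoding(day_of_year, period=365.25, n_freqs=2):
    """Encode a time value as sin/cos pairs at integer multiples of one base frequency.

    The period and n_freqs defaults are illustrative assumptions.
    """
    features = []
    for k in range(1, n_freqs + 1):
        angle = 2.0 * math.pi * k * day_of_year / period
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features

def rbf_encoding(day_of_year, n_centers=4, period=365.25, width=None):
    """Encode a time value as Gaussian bumps around evenly spaced centers.

    Distances wrap around the period, so late December is close to early January.
    The center placement and default width are illustrative assumptions.
    """
    if width is None:
        width = period / n_centers
    features = []
    for i in range(n_centers):
        center = i * period / n_centers
        d = abs(day_of_year - center) % period
        d = min(d, period - d)  # circular distance to the center
        features.append(math.exp(-(d ** 2) / (2.0 * width ** 2)))
    return features
```

Either feature vector could then be broadcast over the spatial grid or passed through a small dense layer before being merged with the image features; which of these the authors use is not stated in the abstract.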

[38] Nexels: Neurally-Textured Surfels for Real-Time Novel View Synthesis with Sparse Geometries

Victor Rong,Jan Held,Victor Chu,Daniel Rebain,Marc Van Droogenbroeck,Kiriakos N. Kutulakos,Andrea Tagliasacchi,David B. Lindell

Main category: cs.CV

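TL;DR: Proposes a compact representation that decouples geometry from appearance, using surfels and a neural field to keep rendering quality high with fewer primitives and less memory.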

Details Motivation: Gaussian splatting needs a very large number of primitives for highly textured scenes, which leads to high storage and compute costs and a lack of compactness. Method: Surfels represent geometry, and a global neural field combined with per-primitive colours represents appearance; the neural field textures a fixed number of primitives per pixel to bound the added compute. Result: The method uses 9.7x fewer primitives and 5.5x less memory on outdoor scenes and 31x fewer primitives and 3.7x less memory on indoor scenes; it renders twice as fast as existing textured primitives with better visual quality. Conclusion: The method gives a more compact and efficient representation than 3D Gaussian splatting, substantially reducing resource use while maintaining visual quality. Abstract: Though Gaussian splatting has achieved impressive results in novel view synthesis, it requires millions of primitives to model highly textured scenes, even when the geometry of the scene is simple. We propose a representation that goes beyond point-based rendering and decouples geometry and appearance in order to achieve a compact representation. We use surfels for geometry and a combination of a global neural field and per-primitive colours for appearance. The neural field textures a fixed number of primitives for each pixel, ensuring that the added compute is low. Our representation matches the perceptual quality of 3D Gaussian splatting while using 9.7x fewer primitives and 5.5x less memory on outdoor scenes and using 31x fewer primitives and 3.7x less memory on indoor scenes. Our representation also renders twice as fast as existing textured primitives while improving upon their visual quality.

[39] VajraV1 -- The most accurate Real Time Object Detector of the YOLO family

Naman Balbir Singh Makkar

Main category: cs.CV

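TL;DR: This paper proposes the VajraV1 object detection model, which makes architectural improvements on top of the YOLO family to achieve higher accuracy in real-time detection while keeping inference speed competitive.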

Details Motivation: To improve the accuracy of real-time object detection without sacrificing inference speed by exploring effective combinations of design choices from existing YOLO models. Method: VajraV1 combines efficient design strategies from prior YOLO models into a new architecture, which is optimized and validated at scales from Nano to Xlarge. Result: On the COCO validation set, the VajraV1 variants outperform the YOLOv12 and YOLOv13 counterparts they are compared against; VajraV1-Nano reaches 44.3% mAP and VajraV1-Xlarge reaches 56.2% mAP, the highest among existing real-time detectors. Conclusion: Through effective architectural design, VajraV1 significantly improves detection accuracy while remaining real-time, making it the most accurate real-time object detector reported so far. Abstract: Recent years have seen significant advances in real-time object detection, with the release of YOLOv10, YOLO11, YOLOv12, and YOLOv13 between 2024 and 2025. This technical report presents the VajraV1 model architecture, which introduces architectural enhancements over existing YOLO-based detectors. VajraV1 combines effective design choices from prior YOLO models to achieve state-of-the-art accuracy among real-time object detectors while maintaining competitive inference speed. On the COCO validation set, VajraV1-Nano achieves 44.3% mAP, outperforming YOLOv12-N by 3.7% and YOLOv13-N by 2.7% at latency competitive with YOLOv12-N and YOLOv11-N. VajraV1-Small achieves 50.4% mAP, exceeding YOLOv12-S and YOLOv13-S by 2.4%. VajraV1-Medium achieves 52.7% mAP, outperforming YOLOv12-M by 0.2%. VajraV1-Large achieves 53.7% mAP, surpassing YOLOv13-L by 0.3%. VajraV1-Xlarge achieves 56.2% mAP, outperforming all existing real-time object detectors.

[40] MoLingo: Motion-Language Alignment for Text-to-Motion Generation

Yannan He,Garvita Tiwari,Xiaohan Zhang,Pankaj Bora,Tolga Birdal,Jan Eric Lenssen,Gerard Pons-Moll

Main category: cs.CV

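TL;DR: This paper proposes MoLingo, a text-to-motion generation model based on denoising in a continuous latent space, which achieves state-of-the-art motion generation with a semantically aligned latent space and cross-attention text conditioning.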

Details Motivation: To study how diffusion models can more effectively generate realistic, text-faithful human motion in a continuous motion latent space, with a focus on semantic alignment of the latent space and on how text conditioning is injected. Method: A semantic-aligned motion encoder is trained with frame-level text labels so that latents with similar meaning stay close; single-token conditioning is compared with multi-token cross-attention, and the latter is adopted for better text-motion alignment and realism; auto-regressive generation is combined with diffusion. Result: The model reaches state-of-the-art performance on standard metrics and in a user study, supporting the effectiveness of the semantically aligned latent space and of cross-attention. Conclusion: By building a semantically aligned latent space and using multi-token cross-attention for text conditioning, MoLingo significantly improves the quality and alignment of text-to-motion generation and sets a new benchmark for the field. Abstract: We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. Recent works perform latent space diffusion, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so diffusion becomes more effective, and (2) how to best inject text conditioning so the motion follows the description closely. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text-motion alignment. With semantically aligned latents, auto-regressive generation, and cross-attention text conditioning, our model sets a new state of the art in human motion generation on standard metrics and in a user study. We will release our code and models for further research and downstream usage.

[41] Improvise, Adapt, Overcome -- Telescopic Adapters for Efficient Fine-tuning of Vision Language Models in Medical Imaging

Ujjwal Mishra,Vinita Shukla,Praful Hambarde,Amit Shukla

Main category: cs.CV

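TL;DR: Proposes Telescopic Adapters, a parameter-efficient fine-tuning framework based on depth-aware scaling that progressively increases adapter capacity from shallow to deep transformer layers, significantly improving adaptation efficiency for medical image segmentation models.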

Details Motivation: Existing parameter-efficient fine-tuning methods use a uniform adapter dimension across all transformer layers, which leads to suboptimal parameter allocation and poor adaptation efficiency and is a poor fit for the compute-sensitive medical imaging domain. Method: Lightweight bottleneck modules are inserted into CLIPSeg's vision and text encoders, with per-layer adapter dimensions scaled according to layer depth and semantic relevance, giving a "telescopic" structure that widens from shallow to deep layers. Result: With only 613k trainable parameters (244x fewer than end-to-end fine-tuning), the method outperforms existing methods on five medical datasets spanning polyp segmentation, skin lesion detection, and breast ultrasound imaging. Conclusion: Telescopic Adapters establish a new paradigm for efficient fine-tuning of medical vision-language segmentation models, keep segmentation accuracy competitive in resource-constrained clinical settings, and support the hypothesis that deeper layers need more adaptation capacity. Abstract: Adapting Vision Language Segmentation Models (VLSMs) to medical imaging domains requires significant computational overhead when using conventional fine-tuning approaches. Existing Parameter-Efficient Fine-Tuning (PEFT) methods apply uniform adapter dimensions across all transformer layers, leading to suboptimal parameter allocation and reduced adaptation efficiency. We introduce Telescopic Adapters, a novel PEFT framework that employs depth-aware scaling to progressively increase adapter capacity from shallow to deep transformer layers. Our method integrates lightweight bottleneck modules within CLIPSeg's vision and text encoders, with adapter dimensions dynamically scaled based on layer depth and semantic relevance. Using only 613k trainable parameters--244x fewer than end-to-end fine-tuning, Telescopic Adapters achieve superior performance across five diverse medical datasets spanning polyp segmentation, skin lesion detection, and breast ultrasound imaging. Comprehensive ablation studies demonstrate that deeper layers require substantially more adaptation capacity than shallow layers, validating our telescopic scaling hypothesis. Our approach establishes a new paradigm for efficient medical VLSM fine-tuning, enabling deployment in resource-constrained clinical environments while maintaining competitive segmentation accuracy.
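To make the idea of depth-aware adapter scaling concrete, here is a minimal Python sketch that assigns a bottleneck width to each transformer layer and counts the parameters of a plain down-projection/up-projection adapter. The linear schedule, the `min_dim` and `max_dim` values, and both function names are assumptions for illustration only; the abstract also mentions scaling by semantic relevance, which is not modelled here, and the authors' actual schedule may differ.

```python
def telescopic_dims(n_layers, min_dim=4, max_dim=32):
    """Return one bottleneck width per layer, growing linearly with depth.

    A linear schedule is an assumption; any monotone schedule fits the same idea.
    """
    if n_layers == 1:
        return [max_dim]
    dims = []
    for layer in range(n_layers):
        frac = layer / (n_layers - 1)  # 0.0 at the first layer, 1.0 at the last
        dims.append(round(min_dim + frac * (max_dim - min_dim)))
    return dims

def adapter_param_count(hidden_size, bottleneck):
    """Count parameters of a down-projection plus an up-projection, each with a bias."""
    down = hidden_size * bottleneck + bottleneck
    up = bottleneck * hidden_size + hidden_size
    return down + up

# Example: total adapter parameters for a hypothetical 5-layer encoder of width 64.
total = sum(adapter_param_count(64, d) for d in telescopic_dims(5))
```

Compared with giving every layer the largest width, a schedule like this spends most of the parameter budget on the deep layers, which is the allocation the abstract's ablation is said to favour.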

[42] Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models

Wenda Li,Meng Wu,Sungmin Eum,Heesung Kwon,Qing Qu

Main category: cs.CV

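TL;DR: This paper proposes Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage diffusion-based framework that narrows the domain gap between synthetic and real data to improve UAV-based human detection when training on synthetic data.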

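Details Motivation: In UAV scenes the target distribution shifts frequently and labeled real data are scarce, so detectors that depend on large annotated sets are impractical. Synthetic data provide cheap annotations, but the domain gap to real data limits model performance, so an effective way to align the two distributions is needed. Method: CFHA has three modules: (1) Global Style Transfer, where a diffusion model aligns the colour, illumination, and texture statistics of synthetic images to the style of real images; (2) Local Refinement, where a super-resolution diffusion model adds fine-grained, photorealistic detail to small objects such as human instances; (3) Hallucination Removal, which filters out human instances that do not match real-world characteristics so that human appearance is closer to the target distribution. The original synthetic labels are preserved throughout. Result: Experiments on public UAV Sim2Real detection benchmarks show that the method clearly outperforms the non-transformed baselines, with up to +14.1 mAP50 on the Semantic-Drone benchmark. Ablation studies support the complementary roles of the global and local stages and the importance of hierarchical alignment. Conclusion: With its hierarchical global and local alignment, CFHA effectively narrows the gap between synthetic and real images, providing an efficient, low-cost way to use synthetic data for UAV-based human detection with significant performance gains. Abstract: Training object detectors demands extensive, task-specific annotations, yet this requirement becomes impractical in UAV-based human detection due to constantly shifting target distributions and the scarcity of labeled images. As a remedy, synthetic simulators are adopted to generate annotated data, with a low annotation cost. However, the domain gap between synthetic and real images hinders the model from being effectively applied to the target domain. Accordingly, we introduce Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage diffusion-based framework designed to transform synthetic data for UAV-based human detection, narrowing the domain gap while preserving the original synthetic labels. CFHA explicitly decouples global style and local content domain discrepancies and bridges those gaps using three modules: (1) Global Style Transfer -- a diffusion model aligns color, illumination, and texture statistics of synthetic images to the realistic style, using only a small real reference set; (2) Local Refinement -- a super-resolution diffusion model is used to facilitate fine-grained and photorealistic details for the small objects, such as human instances, preserving shape and boundary integrity; (3) Hallucination Removal -- a module that filters out human instances whose visual attributes do not align with real-world data to make the human appearance closer to the target distribution.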
Extensive experiments on public UAV Sim2Real detection benchmarks demonstrate that our methods significantly improve the detection accuracy compared to the non-transformed baselines. Specifically, our method achieves up to +14.1 improvement of mAP50 on Semantic-Drone benchmark. Ablation studies confirm the complementary roles of the global and local stages and highlight the importance of hierarchical alignment. The code is released at https://github.com/liwd190019/CFHA.

[43] SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

Jitesh Jain,Jialuo Li,Zixian Ma,Jieyu Zhang,Chris Dongjoo Kim,Sangho Lee,Rohun Tripathi,Tanmay Gupta,Christopher Clark,Humphrey Shi

Main category: cs.CV

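TL;DR: This paper proposes SAGE, a human-inspired multi-turn video reasoning system that flexibly handles videos of different lengths; trained with synthetic data and reinforcement learning, it significantly improves reasoning on long videos.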

Details Motivation: Existing video reasoning models usually process a large number of frames and answer in a single turn, which is resource-intensive and adapts poorly to different video durations, whereas humans choose a viewing strategy as needed; a more efficient and flexible any-horizon reasoning system is therefore needed. Method: SAGE is an agent system built around an orchestrator, SAGE-MM, that supports multi-turn reasoning; synthetic data generated with Gemini-2.5-Flash are used to train the orchestrator; an RL post-training recipe strengthens its any-horizon reasoning ability; and SAGE-Bench, a long-video evaluation set with an average duration above 700 seconds, is curated. Result: Improvements of up to 6.1% on open-ended video reasoning tasks and of 8.2% on videos longer than 10 minutes. Conclusion: SAGE achieves efficient any-horizon video reasoning and, by combining synthetic data with reinforcement learning, significantly improves understanding of and reasoning over long videos, showing potential for practical use. Abstract: As humans, we are natural any-horizon reasoners, i.e., we can decide whether to iteratively skim long videos or watch short ones in full when necessary for a given task. With this in mind, one would expect video reasoning models to reason flexibly across different durations. However, SOTA models are still trained to predict answers in a single turn while processing a large number of frames, akin to watching an entire long video, requiring significant resources. This raises the question: Is it possible to develop performant any-horizon video reasoning systems? Inspired by human behavior, we first propose SAGE, an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. Secondly, we introduce an easy synthetic data generation pipeline using Gemini-2.5-Flash to train the orchestrator, SAGE-MM, which lies at the core of SAGE. We further propose an effective RL post-training recipe essential for instilling any-horizon reasoning ability in SAGE-MM. Thirdly, we curate SAGE-Bench with an average duration of greater than 700 seconds for evaluating video reasoning ability in real-world entertainment use cases. Lastly, we empirically validate the effectiveness of our system, data, and RL recipe, observing notable improvements of up to 6.1% on open-ended video reasoning tasks, as well as an impressive 8.2% improvement on videos longer than 10 minutes.

[44] Route-DETR: Pairwise Query Routing in Transformers for Object Detection

Ye Zhang,Qi Chen,Wenyou Huang,Rui Liu,Zhengjian Kang

Main category: cs.CV

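TL;DR: This paper proposes Route-DETR, which adds adaptive pairwise routing to DETR decoder self-attention to distinguish competing queries from complementary ones, improving detection efficiency and performance.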

Details Motivation: DETR suffers from inefficient query competition: multiple queries converge to similar positions, which causes redundant computation and hurts convergence speed and detection accuracy. Method: Two routing mechanisms are proposed: suppressor routes reduce redundancy between competing queries, and delegator routes encourage queries to explore different regions; learnable low-rank attention biases implement asymmetric query interactions, and a dual-branch training strategy means there is no extra cost at inference. Result: The method is validated on COCO and Cityscapes, with a +1.7% mAP gain over DINO on ResNet-50 and 57.6% mAP on Swin-L, surpassing prior state-of-the-art models. Conclusion: Through adaptive routing, Route-DETR effectively mitigates query competition in DETR and significantly improves detection performance without increasing inference cost, showing good generality and practicality. Abstract: Detection Transformer (DETR) offers an end-to-end solution for object detection by eliminating hand-crafted components like non-maximum suppression. However, DETR suffers from inefficient query competition where multiple queries converge to similar positions, leading to redundant computations. We present Route-DETR, which addresses these issues through adaptive pairwise routing in decoder self-attention layers. Our key insight is distinguishing between competing queries (targeting the same object) versus complementary queries (targeting different objects) using inter-query similarity, confidence scores, and geometry. We introduce dual routing mechanisms: suppressor routes that modulate attention between competing queries to reduce duplication, and delegator routes that encourage exploration of different regions. These are implemented via learnable low-rank attention biases enabling asymmetric query interactions. A dual-branch training strategy incorporates routing biases only during training while preserving standard attention for inference, ensuring no additional computational cost. Experiments on COCO and Cityscapes demonstrate consistent improvements across multiple DETR baselines, achieving +1.7% mAP gain over DINO on ResNet-50 and reaching 57.6% mAP on Swin-L, surpassing prior state-of-the-art models.
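The phrase "low-rank attention biases enabling asymmetric query interactions" can be pictured with a small Python sketch: a bias matrix B = U V^T is added to the self-attention logits before the softmax, and because U and V are different factors, B[i][j] need not equal B[j][i]. Everything here is an assumption for illustration: the factor values are set by hand rather than learned, the function names are hypothetical, and the paper's use of similarity, confidence, and geometry to decide which pairs get suppressed is not modelled.

```python
import math

def low_rank_bias(U, V):
    """Return B = U V^T as a nested list; B[i][j] biases how query i attends to query j."""
    return [[sum(u * v for u, v in zip(U[i], V[j])) for j in range(len(V))]
            for i in range(len(U))]

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention(scores, bias):
    """Add the bias to the raw attention logits, then normalise each row."""
    return [softmax([s + b for s, b in zip(srow, brow)])
            for srow, brow in zip(scores, bias)]

# Two queries with rank-1 factors (hand-set, not learned): query 0 is pushed
# away from query 1, while query 1's view of query 0 is left unchanged.
U = [[1.0], [0.0]]
V = [[0.0], [-2.0]]
B = low_rank_bias(U, V)
attn = biased_attention([[0.0, 0.0], [0.0, 0.0]], B)
```

A negative entry plays the role of a suppressor in this toy setting; the rank of the factors (1 here) controls how many parameters the bias costs relative to a full query-by-query matrix.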

[45] KLO-Net: A Dynamic K-NN Attention U-Net with CSP Encoder for Efficient Prostate Gland Segmentation from MRI

Anning Tian,Byunghyun Ko,Kaichen Qu,Mengyuan Liu,Jeongkyu Lee

Main category: cs.CV

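TL;DR: Proposes KLO-Net, a U-Net that combines a dynamic K-nearest-neighbor attention mechanism with a CSP structure for efficient and accurate prostate MRI segmentation.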

Details Motivation: To address the computational load and memory footprint that bottleneck real-time deployment of prostate MRI segmentation on clinical workstations, while also handling the segmentation difficulty caused by anatomical variability. Method: A dynamic K-NN attention mechanism adaptively determines the number of attention connections for each spatial location, and a CSP encoder reduces computational load and memory consumption. Result: Experiments and ablation studies on two public datasets, PROMISE12 and PROSTATEx, indicate that the model has an advantage in both computational efficiency and segmentation quality. Conclusion: KLO-Net significantly improves computational efficiency while keeping segmentation accuracy high, making it suitable for deployment in resource-constrained clinical environments. Abstract: Real-time deployment of prostate MRI segmentation on clinical workstations is often bottlenecked by computational load and memory footprint. Deep learning-based prostate gland segmentation approaches remain challenging due to anatomical variability. To bridge this efficiency gap while still maintaining reliable segmentation accuracy, we propose KLO-Net, a dynamic K-Nearest Neighbor attention U-Net with Cross Stage Partial, i.e., CSP, encoder for efficient prostate gland segmentation from MRI scans. Unlike the regular K-NN attention mechanism, the proposed dynamic K-NN attention mechanism allows the model to adaptively determine the number of attention connections for each spatial location within a slice. In addition, CSP blocks address the computational load to reduce memory consumption. To evaluate the model's performance, comprehensive experiments and ablation studies are conducted on two public datasets, i.e., PROMISE12 and PROSTATEx, to validate the proposed architecture. The detailed comparative analysis demonstrates the model's advantage in computational efficiency and segmentation quality.

[46] An evaluation of SVBRDF Prediction from Generative Image Models for Appearance Modeling of 3D Scenes

Alban Gauthier,Valentin Deschaintre,Alexandre Lanvin,Fredo Durand,Adrien Bousseau,George Drettakis

Main category: cs.CV

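TL;DR: This paper examines the challenges and opportunities of SVBRDF prediction in fast appearance modeling pipelines, proposes using multi-view generated images and their conditioning information to improve the consistency and accuracy of material estimation, and finds that a standard UNet architecture performs very well.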

Details Motivation: With the rise of deep generative models, obtaining consistent and accurate SVBRDF predictions in multi-view texture generation has become a key problem, and existing methods suffer from multi-view inconsistency. Method: Several neural architectures and conditioning inputs are compared in terms of multi-view coherence and prediction accuracy, using generated RGB images and their conditioning signals to support SVBRDF estimation. Result: Experiments show that, despite differences in model complexity, a standard UNet is competitive with more complex architectures in both accuracy and multi-view coherence. Conclusion: A standard UNet combined with the multimodal conditioning information of generated images is an effective approach to efficient, consistent SVBRDF texture atlas generation. Abstract: Digital content creation is experiencing a profound change with the advent of deep generative models. For texturing, conditional image generators now allow the synthesis of realistic RGB images of a 3D scene that align with the geometry of that scene. For appearance modeling, SVBRDF prediction networks recover material parameters from RGB images. Combining these technologies allows us to quickly generate SVBRDF maps for multiple views of a 3D scene, which can be merged to form a SVBRDF texture atlas of that scene. In this paper, we analyze the challenges and opportunities for SVBRDF prediction in the context of such a fast appearance modeling pipeline. On the one hand, single-view SVBRDF predictions might suffer from multiview incoherence and yield inconsistent texture atlases. On the other hand, generated RGB images, and the different modalities on which they are conditioned, can provide additional information for SVBRDF estimation compared to photographs. We compare neural architectures and conditions to identify designs that achieve high accuracy and coherence. We find that, surprisingly, a standard UNet is competitive with more complex designs. Project page: http://repo-sam.inria.fr/nerphys/svbrdf-evaluation

[47] From Unlearning to UNBRANDING: A Benchmark for Trademark-Safe Text-to-Image Generation

Dawid Malarz,Artur Kasymov,Filip Manjak,Maciej Zięba,Przemysław Spurek

Main category: cs.CV

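TL;DR: This paper introduces "unbranding", a new task that aims at fine-grained removal of trademarks and implicit structural brand features in text-to-image generation, and builds a benchmark dataset and a VLM-based evaluation metric to address brand infringement in diffusion models.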

Details Motivation: Existing text-to-image models readily reproduce protected brand identifiers, but prior work mostly targets general concepts and fails to handle specific brand identifiers, especially non-explicit, structural brand features such as product silhouettes or design styles, so research dedicated to removing brand elements is needed. Method: The unbranding task is defined and a benchmark dataset of branded and unbranded images is built; a VLM-based evaluation metric uses question answering to detect both explicit and implicit brand features in images; and diffusion models of different generations, including SD, SDXL, and FLUX, are analysed experimentally. Result: The experiments show that as generation quality improves (for example in SDXL and FLUX), brand identifiers are reproduced more readily, which worsens brand leakage; the proposed VLM-based metric measures unbranding effectively and supports the claim that unbranding is a distinct task in need of a solution. Conclusion: Unbranding is an important direction for protecting intellectual property; specialized techniques are needed to remove explicit and implicit brand features from generated images while preserving semantic coherence, and future work should pair them with stronger brand-aware evaluation. Abstract: The rapid progress of text-to-image diffusion models raises significant concerns regarding the unauthorized reproduction of trademarked content. While prior work targets general concepts (e.g., styles, celebrities), it fails to address specific brand identifiers. Crucially, we note that brand recognition is multi-dimensional, extending beyond explicit logos to encompass distinctive structural features (e.g., a car's front grille). To tackle this, we introduce unbranding, a novel task for the fine-grained removal of both trademarks and subtle structural brand features, while preserving semantic coherence. To facilitate research, we construct a comprehensive benchmark dataset. Recognizing that existing brand detectors are limited to logos and fail to capture abstract trade dress (e.g., the shape of a Coca-Cola bottle), we introduce a novel evaluation metric based on Vision Language Models (VLMs). This VLM-based metric uses a question-answering framework to probe images for both explicit logos and implicit, holistic brand characteristics. Furthermore, we observe that as model fidelity increases, with newer systems (SDXL, FLUX) synthesizing brand identifiers more readily than older models (Stable Diffusion), the urgency of the unbranding challenge is starkly highlighted. Our results, validated by our VLM metric, confirm unbranding is a distinct, practically relevant problem requiring specialized techniques. Project Page: https://gmum.github.io/UNBRANDING/.

[48] Quality-Driven and Diversity-Aware Sample Expansion for Robust Marine Obstacle Segmentation

Miaohua Zhang,Mohammad Ali Armin,Xuesong Li,Sisi Liang,Lars Petersson,Changming Sun,David Ahmedt-Aristizabal,Zeeshan Hayder

Main category: cs.CV

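TL;DR: This paper proposes a quality-driven, diversity-aware sample expansion method that generates training data for marine obstacle detection at inference time, without retraining the diffusion model.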

Details Motivation: Marine images are affected by lighting, fog, and waves, and marine datasets lack diversity and have few annotations, which limits the robustness of existing segmentation models. Method: A class-aware style bank builds high-entropy semantic prompts, and an adaptive annealing sampler is paired with a COD-guided proportional controller to increase the diversity of generated samples while preserving layout fidelity. Result: On several marine obstacle benchmarks, the synthetic data generated by the method improve segmentation performance across different backbones and, in particular, increase visual variation in rare and texture-sensitive classes. Conclusion: The framework effectively improves the robustness of semantic segmentation when samples are few and diversity is low, providing an efficient data augmentation approach for marine obstacle detection. Abstract: Marine obstacle detection demands robust segmentation under challenging conditions, such as sun glitter, fog, and rapidly changing wave patterns. These factors degrade image quality, while the scarcity and structural repetition of marine datasets limit the diversity of available training data. Although mask-conditioned diffusion models can synthesize layout-aligned samples, they often produce low-diversity outputs when conditioned on low-entropy masks and prompts, limiting their utility for improving robustness. In this paper, we propose a quality-driven and diversity-aware sample expansion pipeline that generates training data entirely at inference time, without retraining the diffusion model. The framework combines two key components: (i) a class-aware style bank that constructs high-entropy, semantically grounded prompts, and (ii) an adaptive annealing sampler that perturbs early conditioning, while a COD-guided proportional controller regulates this perturbation to boost diversity without compromising layout fidelity. Across marine obstacle benchmarks, augmenting training data with these controlled synthetic samples consistently improves segmentation performance across multiple backbones and increases visual variation in rare and texture-sensitive classes.

[49] XAI-Driven Diagnosis of Generalization Failure in State-Space Cerebrovascular Segmentation Models: A Case Study on Domain Shift Between RSNA and TopCoW Datasets

Youssef Abuzeid,Shimaa El-Bana,Ahmad Al-Kabbany

Main category: cs.CV

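TL;DR: This paper proposes a two-phase approach that uses explainable AI (XAI) to diagnose the generalization failure under domain shift of state-space models (such as UMamaba) in cerebrovascular segmentation, finding that the model's attention drifts away from true anatomical structures and relies on spurious correlations, which causes a sharp drop in performance.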

Details Motivation: Clinical deployment of deep learning models in medical imaging is severely hindered by domain shift, and conventional performance metrics do not reveal the root cause of failure, so explainable AI is needed to understand model behaviour in depth and identify dataset bias. Method: First, the domain gap between the source (RSNA CTA Aneurysm) and target (TopCoW CT) datasets is quantified, for example in Z-resolution and background noise, and the cross-domain performance drop of the UMamaba model is measured; then Seg-XRes-CAM attention maps are generated, and the failure is diagnosed by computing the overlap (IoU) of the attention with the ground truth and with the predicted mask. Result: The Dice score is 0.8604 on the source domain and falls to 0.2902 on the target domain; the attention analysis shows an IoU of only 0.101 (at a 0.3 threshold) with the true vessels but 0.282 with the model's own wrong predictions, indicating that attention has drifted away from true anatomical features. Conclusion: The model fails in the new domain because it learned spurious correlations, which supports the value of explainable AI for diagnosing dataset bias and generalization problems in emerging architectures and suggests that XAI should be a key diagnostic tool in medical AI deployment. Abstract: The clinical deployment of deep learning models in medical imaging is severely hindered by domain shift. This challenge, where a high-performing model fails catastrophically on external datasets, is a critical barrier to trustworthy AI. Addressing this requires moving beyond simple performance metrics toward deeper understanding, making Explainable AI (XAI) an essential diagnostic tool in medical image analysis. We present a rigorous, two-phase approach to diagnose the generalization failure of state-of-the-art State-Space Models (SSMs), specifically UMamaba, applied to cerebrovascular segmentation. We first established a quantifiable domain gap between our Source (RSNA CTA Aneurysm) and Target (TopCoW Circle of Willis CT) datasets, noting significant differences in Z-resolution and background noise. The model's Dice score subsequently plummeted from 0.8604 (Source) to 0.2902 (Target). In the second phase, which is our core contribution, we utilized Seg-XRes-CAM to diagnose the cause of this failure. We quantified the model's focus by measuring the overlap between its attention maps and the Ground Truth segmentations, and between its attention maps and its own Prediction Mask. Our analysis proves the model failed to generalize because its attention mechanism abandoned true anatomical features in the Target domain. Quantitative metrics confirm the model's focus shifted away from the Ground Truth vessels (IoU~0.101 at 0.3 threshold) while still aligning with its own wrong predictions (IoU~0.282 at 0.3 threshold).
This demonstrates the model learned spurious correlations, confirming XAI is a powerful diagnostic tool for identifying dataset bias in emerging architectures.
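The overlap measurements quoted above (Dice between masks, and IoU between a thresholded attention map and a mask) follow standard definitions, sketched below in plain Python on tiny 2D grids. The function names, the use of `>=` at the threshold, and the convention of returning 0.0 when both masks are empty are assumptions for illustration; the abstract does not state how the authors handle these details, and real volumes would be 3D arrays.

```python
def binarize(attention_map, threshold=0.3):
    """Turn a map of values in [0, 1] into a 0/1 mask (>= threshold is an assumption)."""
    return [[1 if v >= threshold else 0 for v in row] for row in attention_map]

def iou(mask_a, mask_b):
    """Intersection over union of two 0/1 masks of equal shape."""
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0  # empty-vs-empty convention is an assumption

def dice(mask_a, mask_b):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    inter = 0
    size_a = 0
    size_b = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            size_a += 1 if a else 0
            size_b += 1 if b else 0
    total = size_a + size_b
    return 2 * inter / total if total else 0.0

# Toy example: compare a thresholded attention map with a ground-truth mask.
attention = [[0.9, 0.2], [0.4, 0.1]]
ground_truth = [[1, 1], [0, 0]]
attention_mask = binarize(attention, threshold=0.3)
```

Running the same `iou` call once against the ground truth and once against the model's own prediction mask gives the pair of numbers the abstract contrasts.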

[50] FocalComm: Hard Instance-Aware Multi-Agent Perception

Dereje Shenkut,Vijayakumar Bhagavatula

Main category: cs.CV

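TL;DR: This paper proposes FocalComm, a new multi-agent collaborative perception framework that focuses on exchanging hard-instance-oriented features among collaborating agents to improve 3D detection of small objects such as pedestrians in autonomous driving.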

Details Motivation: Existing collaborative perception methods mainly optimize vehicle detection, perform poorly on small, safety-critical objects such as pedestrians, and usually exchange full feature maps, which is communication-inefficient. Method: FocalComm has two core designs: (1) a learnable progressive hard instance mining (HIM) module that extracts hard-instance features for each agent; (2) a query-based feature-level fusion mechanism that dynamically weights those features during collaboration. Result: FocalComm outperforms existing state-of-the-art methods on two real-world datasets, V2X-Real and DAIR-V2X, with a particularly strong gain in pedestrian detection on V2X-Real. Conclusion: By focusing the exchange on hard-instance features, FocalComm improves detection of small but critical objects such as pedestrians in multi-agent collaborative perception while balancing performance and communication efficiency. Abstract: Multi-agent collaborative perception (CP) is a promising paradigm for improving autonomous driving safety, particularly for vulnerable road users like pedestrians, via robust 3D perception. However, existing CP approaches often optimize for vehicle detection performance metrics, underperforming on smaller, safety-critical objects such as pedestrians, where detection failures can be catastrophic. Furthermore, previous CP methods rely on full feature exchange rather than communicating only salient features that help reduce false negatives. To this end, we present FocalComm, a novel collaborative perception framework that focuses on exchanging hard-instance-oriented features among connected collaborative agents. FocalComm consists of two key novel designs: (1) a learnable progressive hard instance mining (HIM) module to extract hard instance-oriented features per agent, and (2) a query-based feature-level (intermediate) fusion technique that dynamically weights these identified features during collaboration. We show that FocalComm outperforms state-of-the-art collaborative perception methods on two challenging real-world datasets (V2X-Real and DAIR-V2X) across both vehicle-centric and infrastructure-centric collaborative setups. FocalComm also shows a strong performance gain in pedestrian detection in V2X-Real.

[51] Repurposing 2D Diffusion Models for 3D Shape Completion

Yao He,Youngjoong Kwon,Tiange Xiang,Wenxiao Cai,Ehsan Adeli

Main category: cs.CV

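TL;DR: Proposes the Shape Atlas framework, which applies 2D diffusion models to 3D shape completion from incomplete point clouds, using a compact 2D representation to achieve cross-modal alignment and efficient generation.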

Details Motivation: 3D diffusion models lag behind because high-quality 3D datasets are scarce and there is a modality gap between 3D inputs and 2D latent spaces. Method: The Shape Atlas is introduced as a compact 2D representation of 3D geometry; it makes use of pretrained 2D diffusion models and aligns the modalities of the conditional input and the output in a unified 2D space. Result: The method is validated on the PCN and ShapeNet-55 datasets, producing high-quality, detail-preserving 3D shape completions that can also be used downstream to create artist-style meshes. Conclusion: The Shape Atlas overcomes 3D data scarcity and modality mismatch, uses the generative power of 2D diffusion models, and achieves strong 3D shape completion. Abstract: We present a framework that adapts 2D diffusion models for 3D shape completion from incomplete point clouds. While text-to-image diffusion models have achieved remarkable success with abundant 2D data, 3D diffusion models lag due to the scarcity of high-quality 3D datasets and a persistent modality gap between 3D inputs and 2D latent spaces. To overcome these limitations, we introduce the Shape Atlas, a compact 2D representation of 3D geometry that (1) enables full utilization of the generative power of pretrained 2D diffusion models, and (2) aligns the modalities between the conditional input and output spaces, allowing more effective conditioning. This unified 2D formulation facilitates learning from limited 3D data and produces high-quality, detail-preserving shape completions. We validate the effectiveness of our results on the PCN and ShapeNet-55 datasets. Additionally, we show the downstream application of creating artist-created meshes from our completed point clouds, further demonstrating the practicality of our method.

[52] Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

Shufan Li,Jiuxiang Gu,Kangning Liu,Zhe Lin,Zijun Wei,Aditya Grover,Jason Kuen

Main category: cs.CV

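TL;DR: Sparse-LaViDa is a new framework for accelerating inference of masked discrete diffusion models (MDMs); it dynamically prunes redundant masked tokens and introduces register tokens to preserve generation quality, achieving up to a 2x speedup across a range of tasks.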

Details Motivation: MDMs perform well on multimodal tasks, but inference is slow because redundant masked tokens are processed again at every step. Method: The Sparse-LaViDa framework dynamically truncates unnecessary masked tokens; register tokens retain the information of the truncated tokens, and a dedicated attention mask aligns training with the truncated inference procedure. Result: Built on LaViDa-O, the method achieves up to a 2x speedup on text-to-image generation, image editing, and mathematical reasoning while maintaining generation quality. Conclusion: Sparse-LaViDa effectively improves the inference efficiency of MDMs, offering a faster solution for practical use without sacrificing quality. Abstract: Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as compact representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.

[53] KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding

Zongyao Li,Kengo Ishida,Satoshi Yamazaki,Xiaotong Ji,Jianquan Liu

Main category: cs.CV

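TL;DR: This paper proposes KFS-Bench, the first benchmark for key frame sampling in long video question answering. It uses multi-scene annotations to evaluate sampling strategies directly, and introduces a new sampling quality metric and an adaptively balanced sampling method that improves QA performance.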

Details Motivation: Prior work assesses key frame sampling quality only indirectly through QA accuracy and lacks direct, robust evaluation of sampling strategies; a benchmark with multi-scene annotations is needed to analyze how different sampling methods behave on long videos. Method: Builds the KFS-Bench benchmark, containing long videos with key-scene annotations for each question; proposes a new key frame sampling quality metric; and develops an adaptively balanced sampling method based on question-video relevance that trades off sampling diversity against question-frame similarity. Result: Experiments show that sampling precision, scene coverage, and sampling balance are the key factors affecting QA performance; the proposed sampling quality metric is highly correlated with QA accuracy; and the new sampling method outperforms existing methods on both key frame sampling and QA. Conclusion: KFS-Bench provides a measurable standard for key frame sampling, reveals the key factors that affect long video understanding, and, through an adaptive sampling strategy, improves the efficiency and accuracy of multimodal large language models on long video QA. Abstract: We propose KFS-Bench, the first benchmark for key frame sampling in long video question answering (QA), featuring multi-scene annotations to enable direct and robust evaluation of sampling strategies. Key frame sampling is crucial for efficient long-form video understanding. In long video QA, selecting informative frames enables multimodal large language models (MLLMs) to improve both accuracy and efficiency. KFS-Bench addresses the limitation of prior works that only indirectly assess frame selection quality via QA accuracy. By providing ground-truth annotations of multiple disjoint scenes required per question, KFS-Bench allows us to directly analyze how different sampling approaches capture essential content across an entire long video. Using KFS-Bench, we conduct a comprehensive study of key frame sampling methods and identify that not only sampling precision but also scene coverage and sampling balance are the key factors influencing QA performance. Taking all of these factors into account, we design a novel sampling quality metric that correlates with QA accuracy. Furthermore, we develop a novel key frame sampling method that leverages question-video relevance to balance sampling diversity against question-frame similarity, thereby improving coverage of relevant scenes. Our adaptively balanced sampling approach achieves superior performance in both key frame sampling and QA performance. The benchmark is available at https://github.com/NEC-VID/KFS-Bench.
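
The trade-off between sampling diversity and question-frame similarity mentioned in this entry can be illustrated with a generic greedy selection in the style of maximal marginal relevance. This is only a sketch under that assumption and is not the authors' method: the function name `select_frames`, the weight `lam`, and the toy feature vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_frames(frame_feats, question_feat, k, lam=0.5):
    """Greedily pick k frame indices (hypothetical helper).

    lam weights question-frame similarity; (1 - lam) penalises
    similarity to frames that were already selected.
    """
    relevance = [cosine(f, question_feat) for f in frame_feats]
    selected = []
    candidates = list(range(len(frame_feats)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(frame_feats[i], frame_feats[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)


# Toy features: frames 0 and 1 are near-duplicates, frame 2 shows a different scene.
frames = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.7, 0.7]]
question = [1.0, 0.2]
picked = select_frames(frames, question, k=2, lam=0.5)
```

With `lam=1.0` the selection reduces to plain top-k by question-frame similarity and returns the two near-duplicate frames; lowering `lam` swaps one of them for a frame from a different scene.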

[54] Deep Learning Perspective of Scene Understanding in Autonomous Robots

Afia Maham,Dur E Nayab Tashfa

Main category: cs.CV

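TL;DR: This paper reviews deep learning applications in scene understanding for autonomous robots, covering innovations in object detection, semantic and instance segmentation, depth estimation, 3D reconstruction, and visual SLAM.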

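Details Motivation: Traditional geometric models are limited in complex environments and struggle with occlusions, textureless surfaces, and similar problems, so deep learning is needed to improve perception and semantic understanding. Method: Reviews deep learning across several perception modules, including object detection, segmentation, depth estimation, 3D reconstruction, and visual SLAM, and discusses how they are integrated in dynamic, unstructured environments. Result: Deep learning substantially improves robots' depth perception, semantic reasoning, and environment understanding, strengthening their navigation, decision-making, and interaction. Conclusion: Despite progress, challenges remain; future research should focus on improving the robustness, generalization, and real-time performance of learning-based scene understanding systems. Abstract: This paper provides a review of deep learning applications in scene understanding in autonomous robots, including innovations in object detection, semantic and instance segmentation, depth estimation, 3D reconstruction, and visual SLAM. It emphasizes how these techniques address limitations of traditional geometric models, improve depth perception in real time despite occlusions and textureless surfaces, and enhance semantic reasoning to understand the environment better. When these perception modules are integrated into dynamic and unstructured environments, they become more effective in decision-making, navigation, and interaction. Lastly, the review outlines the existing problems and research directions to advance learning-based scene understanding of autonomous robots.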

[55] Unleashing the Power of Image-Tabular Self-Supervised Learning via Breaking Cross-Tabular Barriers

Yibing Fu,Yunpeng Zhao,Zhitao Zeng,Cheng Chen,Yueming Jin

Main category: cs.CV

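TL;DR: This paper proposes CITab, a new self-supervised learning framework that uses semantic-aware tabular modeling and a prototype-guided mixture-of-linear layer to learn multimodal medical representations across data cohorts, outperforming existing methods on Alzheimer's disease diagnosis.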

Details Motivation: Existing self-supervised learning methods are constrained by rigid tabular modeling mechanisms when handling heterogeneous tabular data, which makes it hard to transfer knowledge across data cohorts and limits the generalization and scalability of multimodal models. Method: Proposes the CITab framework, which uses column headers as semantic cues for semantic-aware tabular modeling and designs a prototype-guided mixture-of-linear layer (P-MoLin) to handle tabular heterogeneity, supporting joint pretraining on tabular data from multiple cohorts. Result: Experiments on three public Alzheimer's disease cohorts (4,461 subjects in total) show that CITab significantly outperforms existing state-of-the-art methods on the diagnosis task. Conclusion: CITab enables transferable, scalable cross-tabular multimodal learning, providing an effective solution for self-supervised pretraining on diverse medical data sources. Abstract: Multi-modal learning integrating medical images and tabular data has significantly advanced clinical decision-making in recent years. Self-Supervised Learning (SSL) has emerged as a powerful paradigm for pretraining these models on large-scale unlabeled image-tabular data, aiming to learn discriminative representations. However, existing SSL methods for image-tabular representation learning are often confined to specific data cohorts, mainly due to their rigid tabular modeling mechanisms when modeling heterogeneous tabular data. This inter-tabular barrier hinders the multi-modal SSL methods from effectively learning transferable medical knowledge shared across diverse cohorts. In this paper, we propose a novel SSL framework, namely CITab, designed to learn powerful multi-modal feature representations in a cross-tabular manner. We design the tabular modeling mechanism from a semantic-awareness perspective by integrating column headers as semantic cues, which facilitates transferable knowledge learning and the scalability in utilizing multiple data sources for pretraining. Additionally, we propose a prototype-guided mixture-of-linear layer (P-MoLin) module for tabular feature specialization, empowering the model to effectively handle the heterogeneity of tabular data and explore the underlying medical concepts. We conduct comprehensive evaluations on the Alzheimer's disease diagnosis task across three publicly available data cohorts containing 4,461 subjects. Experimental results demonstrate that CITab outperforms state-of-the-art approaches, paving the way for effective and scalable cross-tabular multi-modal learning.

[56] Robust Single-shot Structured Light 3D Imaging via Neural Feature Decoding

Jiaheng Li,Qiyu Dai,Lihan Li,Praneeth Chakravarthula,He Sun,Baoquan Chen,Wenzheng Chen

Main category: cs.CV

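TL;DR: Proposes a learning-based structured light decoding framework that matches in feature space rather than the pixel domain. Combining neural feature matching with priors from large-scale monocular depth estimation, it substantially improves the robustness and accuracy of single-shot structured light 3D imaging.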

Details Motivation: Traditional structured light methods match in the pixel domain and are not robust to occlusions, fine structures, and non-Lambertian surfaces; a more robust correspondence decoding method is needed. Method: Extracts neural features from the projected patterns and infrared images, builds cost volumes in feature space that incorporate geometric priors for matching, and adds a depth refinement module based on priors from large models; a physically based rendering pipeline generates nearly one million synthetic samples for training. Result: Trained only on synthetic data, the method performs well in real indoor environments, handles multiple pattern types without retraining, and outperforms commercial structured light systems and passive stereo RGB methods in depth estimation quality. Conclusion: A learning framework built on feature-space matching and depth priors substantially improves the robustness and generalization of single-shot structured light systems and demonstrates the potential of synthetic-data training in practice. Abstract: We consider the problem of active 3D imaging using single-shot structured light systems, which are widely employed in commercial 3D sensing devices such as Apple Face ID and Intel RealSense. Traditional structured light methods typically decode depth correspondences through pixel-domain matching algorithms, resulting in limited robustness under challenging scenarios like occlusions, fine-structured details, and non-Lambertian surfaces. Inspired by recent advances in neural feature matching, we propose a learning-based structured light decoding framework that performs robust correspondence matching within feature space rather than the fragile pixel domain. Our method extracts neural features from the projected patterns and captured infrared (IR) images, explicitly incorporating their geometric priors by building cost volumes in feature space, achieving substantial performance improvements over pixel-domain decoding approaches. To further enhance depth quality, we introduce a depth refinement module that leverages strong priors from large-scale monocular depth estimation models, improving fine detail recovery and global structural coherence. To facilitate effective learning, we develop a physically-based structured light rendering pipeline, generating nearly one million synthetic pattern-image pairs with diverse objects and materials for indoor settings. 
Experiments demonstrate that our method, trained exclusively on synthetic data with multiple structured light patterns, generalizes well to real-world indoor environments, effectively processes various pattern types without retraining, and consistently outperforms both commercial structured light systems and passive stereo RGB-based depth estimation methods. Project page: https://namisntimpot.github.io/NSLweb/.

[57] ACE-SLAM: Scene Coordinate Regression for Neural Implicit Real-Time SLAM

Ignacio Alzugaray,Marwan Taher,Andrew J. Davison

Main category: cs.CV

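TL;DR: This paper proposes a new neural implicit RGB-D SLAM system based on scene coordinate regression (SCR). It is the first to use SCR as the core map representation in neural SLAM, achieving strict real-time performance along with low memory use, fast relocalization, and privacy preservation.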

Details Motivation: Existing neural implicit SLAM systems are limited in real-time performance, memory efficiency, and relocalization speed; this paper explores SCR as an efficient, lightweight implicit map representation suited to real-time SLAM. Method: Designs a new SCR network architecture tailored to SLAM and integrates it into a real-time SLAM pipeline that supports both sparse and dense feature inputs, using a lightweight network to map 2D image features directly to 3D global coordinates and so build an implicit map. Result: The system performs on par with current state-of-the-art methods on several synthetic and real-world benchmarks and is the first neural implicit RGB-D SLAM system to run in strict real time, while offering fast relocalization and robustness to dynamic environments. Conclusion: SCR is a promising implicit map representation for efficient, low-resource, privacy-conscious real-time SLAM, and offers a viable new path for neural SLAM. Abstract: We present a novel neural RGB-D Simultaneous Localization And Mapping (SLAM) system that learns an implicit map of the scene in real time. For the first time, we explore the use of Scene Coordinate Regression (SCR) as the core implicit map representation in a neural SLAM pipeline, a paradigm that trains a lightweight network to directly map 2D image features to 3D global coordinates. SCR networks provide efficient, low-memory 3D map representations, enable extremely fast relocalization, and inherently preserve privacy, making them particularly suitable for neural implicit SLAM. Our system is the first one to achieve strict real-time in neural implicit RGB-D SLAM by relying on a SCR-based representation. We introduce a novel SCR architecture specifically tailored for this purpose and detail the critical design choices required to integrate SCR into a live SLAM pipeline. The resulting framework is simple yet flexible, seamlessly supporting both sparse and dense features, and operates reliably in dynamic environments without special adaptation. We evaluate our approach on established synthetic and real-world benchmarks, demonstrating competitive performance against the state of the art. Project Page: https://github.com/ialzugaray/ace-slam

[58] ASAP-Textured Gaussians: Enhancing Textured Gaussians with Adaptive Sampling and Anisotropic Parameterization

Meng Wei,Cheng Zhang,Jianmin Zheng,Hamid Rezatofighi,Jianfei Cai

Main category: cs.CV

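TL;DR: Proposes ASAP, which optimizes texture allocation through adaptive sampling and anisotropic parameterization to improve the rendering efficiency and quality of 3D Gaussian Splatting.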

Details Motivation: Existing textured Gaussian methods define textures in canonical space and allocate parameters uniformly, leading to inefficient sampling and over-parameterization. Method: Uses adaptive sampling based on the Gaussian density distribution and an anisotropic parameterization strategy driven by rendering error to allocate texture resources dynamically. Result: Achieves high-quality rendering with substantially fewer texture parameters and a better quality-efficiency tradeoff than existing methods. Conclusion: ASAP Textured Gaussians addresses low texture utilization and parameter redundancy, improving the memory efficiency and rendering performance of 3D Gaussian Splatting. Abstract: Recent advances have equipped 3D Gaussian Splatting with texture parameterizations to capture spatially varying attributes, improving the performance of both appearance modeling and downstream tasks. However, the added texture parameters introduce significant memory efficiency challenges. Rather than proposing new texture formulations, we take a step back to examine the characteristics of existing textured Gaussian methods and identify two key limitations in common: (1) Textures are typically defined in canonical space, leading to inefficient sampling that wastes textures' capacity on low-contribution regions; and (2) texture parameterization is uniformly assigned across all Gaussians, regardless of their visual complexity, resulting in over-parameterization. In this work, we address these issues through two simple yet effective strategies: adaptive sampling based on the Gaussian density distribution and error-driven anisotropic parameterization that allocates texture resources according to rendering error. Our proposed ASAP Textured Gaussians, short for Adaptive Sampling and Anisotropic Parameterization, significantly improve the quality-efficiency tradeoff, achieving high-fidelity rendering with far fewer texture parameters.

[59] ChartAgent: A Chart Understanding Framework with Tool Integrated Reasoning

Boran Wang,Xinming Wang,Yi Chen,Xiang Li,Jian Xu,Jing Yuan,Chenglin Liu

Main category: cs.CV

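TL;DR: This paper proposes ChartAgent, a chart understanding framework based on Tool-Integrated Reasoning (TIR). A modular tool library enables robust analysis of charts that lack textual annotations, improving the traceability and extensibility of multimodal models in chart understanding.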

Details Motivation: Existing vision-language models depend on explicit textual annotations for chart understanding, and performance drops markedly when key numerals are missing; more robust, interpretable methods are needed to parse complex charts. Method: Proposes a Tool-Integrated Reasoning (TIR) framework that decomposes chart analysis into observable steps, uses a modular tool library including key element detection, instance segmentation, and OCR, dynamically orchestrates tool execution, and consolidates intermediate results into a structured Evidence Package to support reasoning. Result: Experiments show that ChartAgent substantially improves the robustness of chart understanding under sparse annotation, with stronger parsing ability and transparency than conventional MLLM approaches. Conclusion: By introducing a verifiable reasoning process and modular tool orchestration, ChartAgent advances trustworthy, extensible chart understanding systems and offers an effective path for chart analysis that does not rely on textual annotations. Abstract: With their high information density and intuitive readability, charts have become the de facto medium for data analysis and communication across disciplines. Recent multimodal large language models (MLLMs) have made notable progress in automated chart understanding, yet they remain heavily dependent on explicit textual annotations, and their performance degrades markedly when key numerals are absent. To address this limitation, we introduce ChartAgent, a chart understanding framework grounded in Tool-Integrated Reasoning (TIR). Inspired by human cognition, ChartAgent decomposes complex chart analysis into a sequence of observable, replayable steps. Supporting this architecture is an extensible, modular tool library comprising more than a dozen core tools, such as key element detection, instance segmentation, and optical character recognition (OCR), which the agent dynamically orchestrates to achieve systematic visual parsing across diverse chart types. Leveraging TIR's transparency and verifiability, ChartAgent moves beyond the black-box paradigm by standardizing and consolidating intermediate outputs into a structured Evidence Package, providing traceable and reproducible support for final conclusions. Experiments show that ChartAgent substantially improves robustness under sparse annotation settings, offering a practical path toward trustworthy and extensible systems for chart understanding.

[60] OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

Zhenguo Zhang,Haohan Zhen,Yishen Wang,Le Xu,Tianchen Deng,Xuefeng Chen,Qu Chen,Bo Zhang,Wuxiong Huang

Main category: cs.CV

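TL;DR: This paper proposes OmniDrive-R1, an end-to-end vision-language model framework for autonomous driving. Using an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism and reinforcement-learning-based visual grounding, it addresses the decoupling of perception and reasoning and the reliance on dense annotations in conventional methods.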

Details Motivation: Existing vision-language models suffer from object hallucination in safety-critical domains such as autonomous driving, mainly because they rely on ungrounded text-based reasoning; current multimodal reasoning methods also decouple perception from reasoning and depend on expensive annotated data. Method: Proposes the OmniDrive-R1 framework, which adopts an interleaved Multi-modal Chain-of-Thought (iMCoT) and uses a pure two-stage reinforcement learning training pipeline with the Clip-GRPO algorithm to provide an annotation-free visual grounding reward, enforcing real-time cross-modal consistency between visual focus and textual reasoning. Result: On the DriveLMM-o1 dataset, OmniDrive-R1 raises the overall reasoning score from 51.77% to 80.35% and the final answer accuracy from 37.81% to 73.62% relative to the Qwen2.5VL-7B baseline. Conclusion: Through end-to-end joint optimization and an annotation-free reinforcement learning mechanism, OmniDrive-R1 improves the reliability and reasoning performance of VLMs in autonomous driving, offering a new approach for safety-critical applications. Abstract: The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. Thus we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. 
Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.

[61] SELECT: Detecting Label Errors in Real-world Scene Text Data

Wenjun Liu,Qian Wu,Yifeng Hu,Yuke Li

Main category: cs.CV

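TL;DR: Proposes SELECT, which uses multi-modal training to detect label errors in real-world scene text datasets. It combines an image-text encoder with a character-level tokenizer and introduces SSLC, a strategy that simulates realistic errors, to handle variable-length sequences, alignment problems, and character-level errors.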

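Details Motivation: Existing methods struggle to detect label errors in real-world scene text datasets caused by inconsistent annotation and visual similarity, and perform poorly on variable-length sequences and character-level errors in particular. Method: Proposes the SELECT framework, which trains multi-modally with an image-text encoder and a character-level tokenizer, and introduces SSLC (Similarity-based Sequence Label Corruption), which simulates visually confusable character errors during training to make the model more robust to real label errors. Result: SELECT significantly outperforms existing methods on several real-world scene text datasets, detecting label errors and improving text recognition accuracy, with particular strength on variable-length labels and character-level mislabeling. Conclusion: SELECT is the first method to successfully detect label errors in real-world scene text data with variable-length labels; combined with the SSLC training strategy, it improves practicality and accuracy and has broad application prospects. Abstract: We introduce SELECT (Scene tExt Label Errors deteCTion), a novel approach that leverages multi-modal training to detect label errors in real-world scene text datasets. Utilizing an image-text encoder and a character-level tokenizer, SELECT addresses the issues of variable-length sequence labels, label sequence misalignment, and character-level errors, outperforming existing methods in accuracy and practical utility. In addition, we introduce Similarity-based Sequence Label Corruption (SSLC), a process that intentionally introduces errors into the training labels to mimic real-world error scenarios during training. SSLC can change the sequence length and also takes into account the visual similarity between characters during corruption. Our method is the first to successfully detect label errors in real-world scene text datasets while accounting for variable-length labels. Experimental results demonstrate the effectiveness of SELECT in detecting label errors and improving scene text recognition (STR) accuracy on real-world text datasets, showcasing its practical utility.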
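
The abstract describes SSLC only at a high level: corruption that can change the label length and that respects visual similarity between characters. A minimal sketch of that idea follows. It is an illustration under stated assumptions, not the authors' implementation: the `CONFUSABLE` table, the function name `corrupt_label`, and the three probabilities are hypothetical.

```python
import random

# Hypothetical table of visually confusable characters (an assumption for illustration).
CONFUSABLE = {
    "0": "O", "O": "0",
    "1": "l", "l": "1",
    "5": "S", "S": "5",
    "8": "B", "B": "8",
}


def corrupt_label(label, rng, p_sub=0.3, p_del=0.1, p_dup=0.1):
    """Return a corrupted copy of label (hypothetical helper).

    Each character is dropped with probability p_del, duplicated with
    probability p_dup, and otherwise replaced by a look-alike with
    probability p_sub when the table has one. Deletion and duplication
    change the sequence length.
    """
    out = []
    for ch in label:
        r = rng.random()
        if r < p_del:
            continue  # deletion shortens the label
        if r < p_del + p_dup:
            out.extend([ch, ch])  # duplication lengthens the label
            continue
        if ch in CONFUSABLE and rng.random() < p_sub:
            out.append(CONFUSABLE[ch])  # visually similar substitution
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(0)
noisy = corrupt_label("S0l0 B1KE", rng)
```

Passing an explicit `random.Random` instance keeps the corruption reproducible, which helps when corrupted labels need to be regenerated for debugging.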

[62] HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

HyperAI Team,Yuchen Liu,Kaiyang Han,Zhiqiang Xia,Yuhang Dong,Chen Song,Kangyu Tang,Jiaming Xu,Xiushi Feng,WenXuan Yu,Li Peng,Mingyang Wang,Kai Wang,Changpeng Yang,Yang Li,Haoyu Lu,Hao Wang,Bingna Xu,Guangyao Liu,Long Huang,Kaibin Guo,Jinyang Wu,Dan Wu,Hongzhen Wang,Peng Zhou,Shuai Nie,Shande Wang,Runyu Shi,Ying Huang

Main category: cs.CV

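TL;DR: HyperVL is an efficient multimodal large language model designed for on-device inference. Using image tiling, visual resolution compression, and dual consistency learning, it significantly reduces latency and power consumption while maintaining strong performance.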

Details Motivation: Existing multimodal large models have high compute and memory requirements that make on-device deployment difficult; standard ViT encoders are a latency and memory bottleneck when processing high-resolution inputs. Method: Adopts an image-tiling strategy to cap peak memory, proposes a Visual Resolution Compressor (VRC) to adaptively select the encoding resolution, and introduces Dual Consistency Learning (DCL) to align multi-scale ViT encoders, enabling dynamic switching between visual branches under a shared LLM. Result: Achieves state-of-the-art performance among models of comparable size on multiple benchmarks and significantly reduces latency and power consumption on real mobile devices. Conclusion: HyperVL addresses the efficiency problem of on-device multimodal inference, combining strong performance with low latency and low power consumption, making it suitable for practical use. Abstract: Current multimodal large language models possess strong perceptual and reasoning capabilities; however, high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
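
The abstract does not specify how HyperVL tiles images, so the following is only a generic sketch of why tiling caps peak memory: if the encoder processes one tile at a time, the largest input it ever sees is bounded by the tile size rather than the image size. The function name `tile_boxes` and the 448-pixel tile size are assumptions made for illustration.

```python
def tile_boxes(width, height, tile=448):
    """Split a width x height image into non-overlapping boxes.

    Each box is (left, top, right, bottom) and is at most tile x tile
    pixels; boxes on the right and bottom edges may be smaller.
    The default tile size is a hypothetical value.
    """
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append(
                (left, top, min(left + tile, width), min(top + tile, height))
            )
    return boxes


# A 1000 x 500 image yields a 3 x 2 grid of tiles.
boxes = tile_boxes(1000, 500, tile=448)
largest_tile_pixels = max((r - l) * (b - t) for l, t, r, b in boxes)
```

Here `largest_tile_pixels` never exceeds `tile * tile`, regardless of the input resolution, which is the property that bounds per-step encoder memory in this simplified picture.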

[63] FacEDiT: Unified Talking Face Editing and Generation via Facial Motion Infilling

Kim Sung-Bin,Joohyun Chang,David Harwath,Tae-Hyun Oh

Main category: cs.CV

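TL;DR: This paper proposes speech-conditional facial motion infilling as a unified framework for dynamic talking face synthesis, and introduces FacEDiT, a Diffusion Transformer trained with flow matching that achieves high-quality talking face editing and generation.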

Details Motivation: Talking face editing and generation are usually treated as separate tasks without unified modeling, and a standard editing benchmark is missing. This paper aims to unify the two tasks through a self-supervised facial motion infilling pretext task and to advance both. Method: Proposes FacEDiT, a Diffusion Transformer-based model trained with flow matching; inspired by masked autoencoders, it recovers masked facial motions conditioned on surrounding facial motions and speech; biased attention and temporal smoothness constraints improve boundary continuity and lip synchronization. Result: FacEDiT performs well on both talking face editing and generation, aligning precisely with speech, preserving identity, and producing visually smooth transitions; the authors also build FacEDiTBench, the first talking face editing benchmark, with diverse edit types and lengths and new evaluation metrics. Conclusion: Talking face editing and generation can be unified as speech-conditional facial motion infilling; through self-supervised learning, FacEDiT supports a variety of local editing operations and outperforms existing methods in generation quality, consistency, and generalization. Abstract: Talking face editing and face generation have often been studied as distinct problems. In this work, we propose viewing both not as separate tasks but as subtasks of a unifying formulation, speech-conditional facial motion infilling. We explore facial motion infilling as a self-supervised pretext task that also serves as a unifying formulation of dynamic talking face synthesis. To instantiate this idea, we propose FacEDiT, a speech-conditional Diffusion Transformer trained with flow matching. Inspired by masked autoencoders, FacEDiT learns to synthesize masked facial motions conditioned on surrounding motions and speech. This formulation enables both localized generation and edits, such as substitution, insertion, and deletion, while ensuring seamless transitions with unedited regions. In addition, biased attention and temporal smoothness constraints enhance boundary continuity and lip synchronization. To address the lack of a standard editing benchmark, we introduce FacEDiTBench, the first dataset for talking face editing, featuring diverse edit types and lengths, along with new evaluation metrics. Extensive experiments validate that talking face editing and generation emerge as subtasks of speech-conditional motion infilling; FacEDiT produces accurate, speech-aligned facial edits with strong identity preservation and smooth visual continuity while generalizing effectively to talking face generation.

[64] Real-time prediction of workplane illuminance distribution for daylight-linked controls using non-intrusive multimodal deep learning

Zulin Zhuang,Yu Bian

Main category: cs.CV

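TL;DR: Proposes a real-time indoor illuminance prediction framework based on multimodal deep learning that uses non-intrusive images and temporal-spatial features to support accurate daylight-linked lighting control in dynamically occupied indoor spaces.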

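Details Motivation: Existing indoor daylight prediction methods mostly target static scenes and adapt poorly to dynamically used spaces; a method is needed that predicts workplane illuminance accurately and in real time while occupants are active, to increase energy-saving potential. Method: Proposes a multimodal deep learning framework that predicts the indoor workplane illuminance distribution from temporal-spatial features of non-intrusive images, extracted only from the side-lit window areas, avoiding dependence on information about occupants or furniture. The model is trained and validated on 17,344 samples collected in a test room in Guangzhou. Result: R2 > 0.98 and RMSE < 0.14 on the same-distribution test set; R2 > 0.82 and RMSE < 0.17 on the unseen-day test set, indicating high prediction accuracy and acceptable temporal generalization. Conclusion: The method can support daylight-based lighting control in dynamic indoor environments, with good prospects for practical application and energy savings. Abstract: Daylight-linked controls (DLCs) have significant potential for energy savings in buildings, especially when abundant daylight is available and indoor workplane illuminance can be accurately predicted in real time. Most existing studies on indoor daylight predictions were developed and tested for static scenes. This study proposes a multimodal deep learning framework that predicts indoor workplane illuminance distributions in real time from non-intrusive images with temporal-spatial features. By extracting image features only from the side-lit window areas rather than interior pixels, the approach remains applicable in dynamically occupied indoor spaces. A field experiment was conducted in a test room in Guangzhou (China), where 17,344 samples were collected for model training and validation. The model achieved R2 > 0.98 with RMSE < 0.14 on the same-distribution test set and R2 > 0.82 with RMSE < 0.17 on an unseen-day test set, indicating high accuracy and acceptable temporal generalization.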
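
For readers less familiar with the two metrics reported above, the sketch below computes RMSE and the coefficient of determination (R2) using their standard definitions. The toy values are invented for illustration and are not data from the study; the abstract also does not state the units or normalization behind its RMSE figures, so the numbers here are not comparable to the reported ones.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy illuminance-like values (hypothetical, arbitrary units).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

error = rmse(y_true, y_pred)
fit = r2_score(y_true, y_pred)
```

R2 measures the share of variance in the targets that the predictions explain, so it is unitless, whereas RMSE carries the units of the predicted quantity.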

[65] Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution

Hao Chen,Junyang Chen,Jinshan Pan,Jiangxin Dong

Main category: cs.CV

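TL;DR: This paper proposes CODSR, a controllable one-step diffusion network that addresses information loss, insufficient activation of generative priors, and misalignment between text prompts and semantic regions in existing diffusion-based image super-resolution methods. With LQ-guided feature modulation, region-adaptive generative prior activation, and a text-matching guidance strategy, CODSR significantly improves perceptual quality while maintaining high fidelity, and supports efficient one-step inference.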

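Details Motivation: Existing diffusion-based one-step super-resolution methods have three key limitations: information loss from compressing low-quality inputs, insufficient region-discriminative activation of generative priors, and misalignment between text prompts and semantic regions. These limitations reduce the fidelity and perceptual quality of reconstructed images. Method: Proposes CODSR with three core components: (1) an LQ-guided feature modulation module that uses uncompressed LQ input information to provide high-fidelity conditioning; (2) a region-adaptive generative prior activation method that enhances perceptual richness without harming local structural fidelity; (3) a text-matching guidance strategy that fully exploits the conditioning potential of text prompts. Result: Extensive experiments show that CODSR achieves superior perceptual quality and competitive fidelity compared with state-of-the-art methods while supporting efficient one-step inference. Conclusion: By introducing LQ-guided modulation, region-adaptive activation, and text-matching guidance, CODSR overcomes key shortcomings of current one-step diffusion super-resolution methods and achieves a better balance between performance and efficiency. Abstract: Recent diffusion-based one-step methods have shown remarkable progress in the field of image super-resolution, yet they remain constrained by three critical limitations: (1) inferior fidelity performance caused by the information loss from compression encoding of low-quality (LQ) inputs; (2) insufficient region-discriminative activation of generative priors; (3) misalignment between text prompts and their corresponding semantic regions. To address these limitations, we propose CODSR, a controllable one-step diffusion network for image super-resolution. First, we propose an LQ-guided feature modulation module that leverages original uncompressed information from LQ inputs to provide high-fidelity conditioning for the diffusion process. We then develop a region-adaptive generative prior activation method to effectively enhance perceptual richness without sacrificing local structural fidelity. Finally, we employ a text-matching guidance strategy to fully harness the conditioning potential of text prompts. Extensive experiments demonstrate that CODSR achieves superior perceptual quality and competitive fidelity compared with state-of-the-art methods with efficient one-step inference.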

[66] SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

Shuang Cheng,Yuhua Jiang,Zineng Zhou,Dawei Liu,Wang Tao,Linfeng Zhang,Biqing Qi,Bowen Zhou

Main category: cs.CV

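TL;DR: This paper proposes SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding. An integrated framework enables efficient, stable training; on 21 benchmarks the model sets a new state of the art among diffusion models and matches or surpasses strong autoregressive baselines.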

Details Motivation: Block-wise discrete diffusion balances parallel generation and causal modeling, but high training cost, slow convergence, and instability have limited its practical use; this paper aims to solve these problems and make it practical for vision-language understanding. Method: Proposes an integrated training framework comprising asynchronous block-wise noise scheduling, effective mask ratio scaling, and a progressive Beta noise curriculum to improve training efficiency and stability. Result: Across 21 single-image, multi-image, and video benchmarks, SDAR-VL improves training efficiency, convergence stability, and task performance over conventional block diffusion, and under matched settings matches or surpasses strong autoregressive models such as LLaVA-OneVision and the global diffusion model LLaDA-V. Conclusion: SDAR-VL demonstrates the potential of block-wise discrete diffusion as a practical backbone for vision-language understanding and offers a viable path for diffusion models in this area. Abstract: Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. This framework unifies three components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.

[67] GaussianPlant: Structure-aligned Gaussian Splatting for 3D Reconstruction of Plants

Yang Yang,Risa Shinoda,Hiroaki Santo,Fumio Okura

Main category: cs.CV

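TL;DR: Proposes GaussianPlant, a hierarchical representation based on 3D Gaussian Splatting for jointly recovering plant appearance and internal structure from multi-view images.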

Details Motivation: Conventional 3D Gaussian Splatting (3DGS) reconstructs scene appearance with high fidelity but lacks an explicit representation of plant structure (such as branching patterns), which limits its use in tasks such as plant phenotyping. Method: Proposes a hierarchical representation of structure primitives (StPs) and appearance primitives (ApPs): StPs model branch and leaf geometry with cylinders and disks and distinguish branch from leaf attributes through self-organized optimization; ApPs describe appearance with 3D Gaussians and are bound to StPs. The two are jointly optimized through a re-rendering loss and gradient flow. Result: Experiments show that the method performs well in both appearance reconstruction and structure extraction, accurately recovering plant branch structure and leaf instances, with qualitative and real-world validation. Conclusion: GaussianPlant disentangles appearance and structure, combining high-fidelity rendering with accurate structure recovery and extending 3DGS to structure-sensitive tasks such as plant phenotyping. Abstract: We present a method for jointly recovering the appearance and internal structure of botanical plants from multi-view images based on 3D Gaussian Splatting (3DGS). While 3DGS exhibits robust reconstruction of scene appearance for novel-view synthesis, it lacks structural representations underlying those appearances (e.g., branching patterns of plants), which limits its applicability to tasks such as plant phenotyping. To achieve both high-fidelity appearance and structural reconstruction, we introduce GaussianPlant, a hierarchical 3DGS representation, which disentangles structure and appearance. Specifically, we employ structure primitives (StPs) to explicitly represent branch and leaf geometry, and appearance primitives (ApPs) to represent the plants' appearance using 3D Gaussians. StPs represent a simplified structure of the plant, i.e., modeling branches as cylinders and leaves as disks. To accurately distinguish the branches and leaves, each StP's attribute (i.e., branch or leaf) is optimized in a self-organized manner. ApPs are bound to each StP to represent the appearance of branches or leaves as in conventional 3DGS. StPs and ApPs are jointly optimized using a re-rendering loss on the input multi-view images, as well as the gradient flow from ApP to StP using the binding correspondence information. We conduct experiments to qualitatively evaluate the reconstruction accuracy of both appearance and structure, as well as real-world experiments to qualitatively validate the practical performance. 
Experiments show that GaussianPlant achieves both high-fidelity appearance reconstruction via ApPs and accurate structural reconstruction via StPs, enabling the extraction of branch structure and leaf instances.

[68] TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Jun Zhang,Teng Wang,Yuying Ge,Yixiao Ge,Xinhao Li,Ying Shan,Limin Wang

Main category: cs.CV

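TL;DR: This paper presents TimeLens, a systematic study of data quality and algorithmic design that establishes a new baseline for video temporal grounding (VTG). It releases the high-quality datasets TimeLens-Bench and TimeLens-100K and proposes a set of efficient training practices, achieving leading VTG performance among open-source models.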

Details Motivation: Existing VTG benchmarks suffer from insufficient data quality, making model evaluation unreliable, and there has been no systematic study of how to optimize multimodal large language models (MLLMs) for VTG. Method: Builds the high-quality evaluation suite TimeLens-Bench and the large-scale training set TimeLens-100K; represents time with interleaved textual encoding; proposes a thinking-free reinforcement learning with verifiable rewards (RLVR) training paradigm; and designs effective training recipes. Result: Significant model re-rankings are observed on the re-annotated benchmarks, confirming the unreliability of the original ones; TimeLens models reach the state of the art among open-source models on VTG and even surpass closed-source models such as GPT-5 and Gemini-2.5-Flash. Conclusion: High-quality data annotation and careful algorithmic design are essential for VTG; TimeLens provides a reliable benchmark and an effective path for optimizing MLLMs on video temporal grounding. Abstract: This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. 
These efforts culminate in TimeLens models, a family of MLLMs with state-of-the-art VTG performance among open-source models that even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.

[69] ProtoFlow: Interpretable and Robust Surgical Workflow Modeling with Learned Dynamic Scene Graph Prototypes

Felix Holm,Ghazal Ghazaei,Nassir Navab

Main category: cs.CV

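TL;DR: This paper proposes ProtoFlow, a framework that models complex surgical workflows by learning dynamic scene graph prototypes; combining self-supervised pretraining with prototype-based fine-tuning, it delivers interpretable and robust surgical workflow analysis.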

Details Motivation: Research on AI-assisted surgery is held back by high annotation costs, data scarcity, and a lack of interpretable models, and the potential of scene graphs remains largely untapped. Method: A graph neural network (GNN) encoder-decoder architecture combines self-supervised pretraining with prototype-based fine-tuning to discover and refine core prototypes that represent common clinical interaction patterns. Result: On the CAT-SG dataset, ProtoFlow outperforms standard GNN baselines, is especially strong in few-shot settings, maintains strong performance even when trained on a single surgical video, and identifies surgical sub-techniques and anomalies. Conclusion: By combining strong representation learning with inherent interpretability, ProtoFlow advances transparent, reliable, and data-efficient AI systems and may help accelerate their clinical adoption. Abstract: Purpose: Detailed surgical recognition is critical for advancing AI-assisted surgery, yet progress is hampered by high annotation costs, data scarcity, and a lack of interpretable models. While scene graphs offer a structured abstraction of surgical events, their full potential remains untapped. In this work, we introduce ProtoFlow, a novel framework that learns dynamic scene graph prototypes to model complex surgical workflows in an interpretable and robust manner. Methods: ProtoFlow leverages a graph neural network (GNN) encoder-decoder architecture that combines self-supervised pretraining for rich representation learning with a prototype-based fine-tuning stage. This process discovers and refines core prototypes that encapsulate recurring, clinically meaningful patterns of surgical interaction, forming an explainable foundation for workflow analysis. Results: We evaluate our approach on the fine-grained CAT-SG dataset. ProtoFlow not only outperforms standard GNN baselines in overall accuracy but also demonstrates exceptional robustness in limited-data, few-shot scenarios, maintaining strong performance when trained on as few as one surgical video. Our qualitative analyses further show that the learned prototypes successfully identify distinct surgical sub-techniques and provide clear, interpretable insights into workflow deviations and rare complications.
Conclusion: By uniting robust representation learning with inherent explainability, ProtoFlow represents a significant step toward developing more transparent, reliable, and data-efficient AI systems, accelerating their potential for clinical adoption in surgical training, real-time decision support, and workflow optimization.

[70] Quality-Aware Framework for Video-Derived Respiratory Signals

Nhi Nguyen,Constantino Álvarez Casado,Le Nguyen,Manuel Lage Cañellas,Miguel Bordallo López

Main category: cs.CV

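TL;DR: Proposes a quality-aware framework for video-based respiratory rate estimation that improves accuracy by fusing multiple signal sources with dynamic reliability assessment.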

Details Motivation: Video-based respiratory rate estimation is often unreliable because signal quality varies across extraction methods, so a framework is needed that can dynamically assess and fuse signals from multiple sources. Method: Ten signals are extracted from facial remote photoplethysmography (rPPG), upper-body motion, and deep learning pipelines and analyzed with four spectral estimators; segment-level quality indices are then used to train machine learning models that predict accuracy or select the most reliable signal. Result: Experiments on three public datasets show that the framework yields lower errors than individual methods in most cases, with gains depending on dataset characteristics. Conclusion: Quality-driven predictive modeling is a promising route to scalable, generalizable video-based respiratory monitoring. Abstract: Video-based respiratory rate (RR) estimation is often unreliable due to inconsistent signal quality across extraction methods. We present a predictive, quality-aware framework that integrates heterogeneous signal sources with dynamic assessment of reliability. Ten signals are extracted from facial remote photoplethysmography (rPPG), upper-body motion, and deep learning pipelines, and analyzed using four spectral estimators: Welch's method, Multiple Signal Classification (MUSIC), Fast Fourier Transform (FFT), and peak detection. Segment-level quality indices are then used to train machine learning models that predict accuracy or select the most reliable signal. This enables adaptive signal fusion and quality-based segment filtering. Experiments on three public datasets (OMuSense-23, COHFACE, MAHNOB-HCI) show that the proposed framework achieves lower RR estimation errors than individual methods in most cases, with performance gains depending on dataset characteristics. These findings highlight the potential of quality-driven predictive modeling to deliver scalable and generalizable video-based respiratory monitoring solutions.
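
As a rough illustration of the simplest estimator named in the abstract (the FFT-style spectral peak), the sketch below picks the dominant frequency of a 1D respiratory signal inside a plausible breathing band. It is not the authors' pipeline: the function name `estimate_rr_bpm`, the band limits of 0.1 to 0.5 Hz, and the synthetic test signal are all assumptions made for this example, and a plain discrete Fourier sum stands in for an optimized FFT.

```python
import math


def estimate_rr_bpm(signal, fs, f_lo=0.1, f_hi=0.5):
    """Return the dominant frequency within [f_lo, f_hi] Hz, in breaths per minute.

    `signal` is a list of samples, `fs` the sampling rate in Hz.
    The band limits are assumed values for this sketch.
    """
    n = len(signal)
    mean = sum(signal) / n
    x = [s - mean for s in signal]  # remove the DC offset
    best_f, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        f = k * fs / n
        if f < f_lo or f > f_hi:
            continue
        # Direct DFT coefficient for bin k (an FFT would give the same value).
        re = sum(x[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(x[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        power = re * re + im * im
        if power > best_power:
            best_f, best_power = f, power
    if best_f is None:
        raise ValueError("no frequency bin falls inside the requested band")
    return best_f * 60.0


# Synthetic 40-second signal sampled at 10 Hz, breathing at 0.25 Hz.
fs = 10.0
sig = [math.sin(2 * math.pi * 0.25 * t / fs) for t in range(400)]
rr = estimate_rr_bpm(sig, fs)
print(rr)
```

A quality-aware system in the spirit of the paper would run several such estimators on several signals and then use per-segment quality indices to decide which estimate to trust; that selection step is not shown here.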

[71] AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation

Sisi Dai,Kai Xu

Main category: cs.CV

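TL;DR: This paper proposes AnchorHOI, a framework that uses video diffusion models and an anchor-based prior distillation strategy to improve the diversity and generalization of text-driven 4D human-object interaction generation.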

Details Motivation: Existing 4D human-object interaction (HOI) generation methods are limited by the scarcity of large-scale 4D datasets, and zero-shot methods distill too few interaction cues to handle diverse scenarios. Method: AnchorHOI combines video diffusion models with image diffusion models and introduces an anchor-based prior distillation strategy, using anchor NeRFs to guide interaction composition and anchor keypoints to guide motion synthesis. Result: Experiments show that AnchorHOI outperforms existing methods in diversity, realism, and generalization. Conclusion: By combining hybrid priors with anchor-based guidance, AnchorHOI improves the quality and applicability of zero-shot 4D HOI generation. Abstract: Despite significant progress in text-driven 4D human-object interaction (HOI) generation with supervised methods, the scalability remains limited by the scarcity of large-scale 4D HOI datasets. To overcome this, recent approaches attempt zero-shot 4D HOI generation with pre-trained image diffusion models. However, interaction cues are minimally distilled during the generation process, restricting their applicability across diverse scenarios. In this paper, we propose AnchorHOI, a novel framework that thoroughly exploits hybrid priors by incorporating video diffusion models beyond image diffusion models, advancing 4D HOI generation. Nevertheless, directly optimizing high-dimensional 4D HOI with such priors remains challenging, particularly for human pose and compositional motion. To address this challenge, AnchorHOI introduces an anchor-based prior distillation strategy, which constructs interaction-aware anchors and then leverages them to guide generation in a tractable two-step process. Specifically, two tailored anchors are designed for 4D HOI generation: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition, and anchor keypoints for realistic motion synthesis. Extensive experiments demonstrate that AnchorHOI outperforms previous methods with superior diversity and generalization.

[72] OUSAC: Optimized Guidance Scheduling with Adaptive Caching for DiT Acceleration

Ruitong Sun,Tianze Yang,Wei Niu,Jin Sun

Main category: cs.CV

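TL;DR: This paper proposes OUSAC, a framework that accelerates diffusion transformer generation through optimized guidance scheduling and adaptive caching, reducing computational cost while improving generation quality.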

Details Motivation: Diffusion models excel at high-quality image generation, but iterative denoising is computationally expensive; Classifier-Free Guidance (CFG) improves quality yet doubles the computation, so an efficient method is needed that cuts cost while preserving quality. Method: OUSAC works in two stages: the first uses evolutionary algorithms to jointly optimize which timesteps to skip and which guidance scales to use, reducing unconditional forward passes; the second introduces adaptive rank allocation that adjusts the caching strategy per transformer block to keep caching effective under variable guidance. Result: OUSAC achieves 53% computational savings with a 15% quality improvement on DiT-XL/2, 60% savings with a 16.1% improvement on PixArt-alpha, and a 5x speedup on FLUX with a CLIP Score above the 50-step baseline. Conclusion: Through variable guidance scheduling and adaptive caching, OUSAC balances generation speed and quality in diffusion models and significantly outperforms existing acceleration methods. Abstract: Diffusion models have emerged as the dominant paradigm for high-quality image generation, yet their computational expense remains substantial due to iterative denoising. Classifier-Free Guidance (CFG) significantly enhances generation quality and controllability but doubles the computation by requiring both conditional and unconditional forward passes at every timestep. We present OUSAC (Optimized gUidance Scheduling with Adaptive Caching), a framework that accelerates diffusion transformers (DiT) through systematic optimization. Our key insight is that variable guidance scales enable sparse computation: adjusting scales at certain timesteps can compensate for skipping CFG at others, enabling both fewer total sampling steps and fewer CFG steps while maintaining quality. However, variable guidance patterns introduce denoising deviations that undermine standard caching methods, which assume constant CFG scales across steps. Moreover, different transformer blocks are affected at different levels under dynamic conditions. This paper develops a two-stage approach leveraging these insights. Stage-1 employs evolutionary algorithms to jointly optimize which timesteps to skip and what guidance scale to use, eliminating up to 82% of unconditional passes. Stage-2 introduces adaptive rank allocation that tailors calibration efforts per transformer block, maintaining caching effectiveness under variable guidance.
Experiments demonstrate that OUSAC significantly outperforms state-of-the-art acceleration methods, achieving 53% computational savings with 15% quality improvement on DiT-XL/2 (ImageNet 512x512), 60% savings with 16.1% improvement on PixArt-alpha (MSCOCO), and 5x speedup on FLUX while improving CLIP Score over the 50-step baseline.
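
To make the cost argument concrete, here is a minimal sketch of the standard classifier-free guidance combination with a per-timestep schedule in which some steps skip the unconditional pass. It only illustrates the idea of a variable guidance schedule; the schedule values, the `predict_noise` helper, and `toy_model` are hypothetical, and the evolutionary search and caching stages of OUSAC are not shown.

```python
def guided_noise(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: extrapolate away from the unconditional prediction."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def predict_noise(model, x, t, schedule):
    """Return (noise prediction, number of forward passes) for timestep t.

    `schedule` maps a timestep to its guidance scale; a value of None means
    CFG is skipped at that step and only the conditional pass is run.
    """
    eps_cond = model(x, t, True)
    scale = schedule[t]
    if scale is None:
        return eps_cond, 1
    eps_uncond = model(x, t, False)
    return guided_noise(eps_cond, eps_uncond, scale), 2


def toy_model(x, t, conditional):
    # Stand-in for a diffusion transformer: the conditional branch adds 1.0.
    return [v + (1.0 if conditional else 0.0) for v in x]


# Hypothetical hand-written schedule; in the paper it is found by search.
schedule = {2: 3.0, 1: None, 0: 1.5}
passes = 0
for t in (2, 1, 0):
    eps, used = predict_noise(toy_model, [0.0, 2.0], t, schedule)
    passes += used
print(passes)  # constant-scale CFG over the same three steps would use 6 passes
```

Each skipped step saves one of the two forward passes, which is where the savings in unconditional computation come from; the larger scale at a neighbouring step is what is meant to compensate for the skip.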

[73] ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models

Ruishu Zhu,Zhihao Huang,Jiacheng Sun,Ping Luo,Hongyuan Zhang,Xuelong Li

Main category: cs.CV

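TL;DR: ViewMask-1-to-3 is a multi-view image generation method based on discrete diffusion models: it converts images into visual tokens with MAGVIT-v2 and performs masked token prediction together with text input, generating multi-view images from a single image and text without complex 3D priors while leading on several metrics.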

Details Motivation: Existing methods rely on complex 3D-aware architectures or multi-view training data to ensure geometric consistency, which limits their generalization and practicality; this paper aims for a simpler multi-view generation method that does not depend on complex geometric constraints. Method: Multi-view synthesis is cast as a discrete sequence modeling problem: images are tokenized with MAGVIT-v2, a masked token prediction mechanism unifies language and vision, self-attention provides cross-view consistency, and the views are decoded progressively. Result: On the GSO and 3D-FUTURE datasets, the method ranks first on average across PSNR, SSIM, and LPIPS. Conclusion: Discrete diffusion is a simple and effective alternative for multi-view image generation; ViewMask-1-to-3 achieves good cross-view consistency with simple random masking and self-attention, without complex 3D structural design. Abstract: Multi-view image generation from a single image and text description remains challenging due to the difficulty of maintaining geometric consistency across different viewpoints. Existing approaches typically rely on 3D-aware architectures or specialized diffusion models that require extensive multi-view training data and complex geometric priors. In this work, we introduce ViewMask-1-to-3, a pioneering approach to apply discrete diffusion models to multi-view image generation. Unlike continuous diffusion methods that operate in latent spaces, ViewMask-1-to-3 formulates multi-view synthesis as a discrete sequence modeling problem, where each viewpoint is represented as visual tokens obtained through MAGVIT-v2 tokenization. By unifying language and vision through masked token prediction, our approach enables progressive generation of multiple viewpoints through iterative token unmasking with text input. ViewMask-1-to-3 achieves cross-view consistency through simple random masking combined with self-attention, eliminating the requirement for complex 3D geometric constraints or specialized attention architectures. Our approach demonstrates that discrete diffusion provides a viable and simple alternative to existing multi-view generation methods, ranking first on average across GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS, while maintaining architectural simplicity.

[74] Neurosymbolic Inference On Foundation Models For Remote Sensing Text-to-image Retrieval With Complex Queries

Emanuele Mezzi,Gertjan Burghouts,Maarten Kruithof

Main category: cs.CV

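TL;DR: RUNE is a remote sensing text-to-image retrieval method that combines large language models with neurosymbolic AI, reasoning explicitly about entity compatibility through first-order logic expressions to improve retrieval performance, robustness, and interpretability.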

Details Motivation: Existing remote sensing large vision-language models (RS-LVLMs) offer limited explainability and handle complex spatial relations poorly, which restricts their real-world use. Method: RUNE uses an LLM to translate text queries into first-order logic (FOL) expressions and reasons explicitly over the detected entities; a logic decomposition strategy improves scalability, and two new metrics, RRQC and RRIU, evaluate retrieval robustness. Result: On an augmented version of the DOTA dataset, RUNE outperforms existing RS-LVLMs on complex queries while offering greater interpretability and shorter execution time. Conclusion: Neurosymbolic reasoning improves remote sensing text-to-image retrieval, and RUNE shows strong performance and practical potential in real applications such as post-flood image retrieval. Abstract: Text-to-image retrieval in remote sensing (RS) has advanced rapidly with the rise of large vision-language models (LVLMs) tailored for aerial and satellite imagery, culminating in remote sensing large vision-language models (RS-LVLMs). However, limited explainability and poor handling of complex spatial relations remain key challenges for real-world use. To address these issues, we introduce RUNE (Reasoning Using Neurosymbolic Entities), an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability. For scalability, we propose a logic decomposition strategy that operates on conditioned subsets of detected entities, guaranteeing shorter execution time compared to neural approaches. Rather than using foundation models for end-to-end retrieval, we leverage them only to generate FOL expressions, delegating reasoning to a neurosymbolic inference module. For evaluation, we repurpose the DOTA dataset, originally designed for object detection, by augmenting it with more complex queries than in existing benchmarks. We show the LLM's effectiveness in text-to-logic translation and compare RUNE with state-of-the-art RS-LVLMs, demonstrating superior performance. We introduce two metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), which evaluate performance relative to query complexity and image uncertainty.
RUNE outperforms joint-embedding models in complex RS retrieval tasks, offering gains in performance, robustness, and explainability. We show RUNE's potential for real-world RS applications through a use case on post-flood satellite image retrieval.

[75] Selective, Controlled and Domain-Agnostic Unlearning in Pretrained CLIP: A Training- and Data-Free Approach

Ashish Mishra,Gyanaranjan Nayak,Tarun Kumar,Arpit Shah,Suparna Bhattacharya,Martin Foltin

Main category: cs.CV

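TL;DR: Proposes a training- and data-free selective unlearning framework for CLIP that uses a multimodal nullspace to remove knowledge globally, for specific domains, or completely within selected domains.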

Details Motivation: Existing unlearning methods depend on retraining and additional data, which makes the scope of forgetting hard to control and the computation expensive. Method: Text prompts and synthesized visual prototypes from CLIP's joint embedding space are used to construct a multimodal nullspace, enabling knowledge removal without training or data. Result: The method supports three forgetting modes (global object unlearning, domain-specific knowledge removal, and complete unlearning in selected domains) while preserving performance on other tasks. Conclusion: The method gives pretrained models a flexible, efficient, and controllable forgetting mechanism suited to real-world privacy and compliance needs. Abstract: Pretrained models like CLIP have demonstrated impressive zero-shot classification capabilities across diverse visual domains, spanning natural images, artistic renderings, and abstract representations. However, real-world applications often demand the removal (or "unlearning") of specific object classes without requiring additional data or retraining, or affecting the model's performance on unrelated tasks. In this paper, we propose a novel training- and data-free unlearning framework that enables three distinct forgetting paradigms: (1) global unlearning of selected objects across all domains, (2) domain-specific knowledge removal (e.g., eliminating sketch representations while preserving photo recognition), and (3) complete unlearning in selective domains. By leveraging a multimodal nullspace through synergistic integration of text prompts and synthesized visual prototypes derived from CLIP's joint embedding space, our method efficiently removes undesired class information while preserving the remaining knowledge. This approach overcomes the limitations of existing retraining-based methods and offers a flexible and computationally efficient solution for controlled model forgetting.

[76] MFE-GAN: Efficient GAN-based Framework for Document Image Enhancement and Binarization with Multi-scale Feature Extraction

Rui-Yang Ju,KokSheik Wong,Yanlin Jin,Jen-Shiun Chiang

Main category: cs.CV

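TL;DR: Proposes MFE-GAN, an efficient GAN framework for document image enhancement and binarization that uses multi-scale feature extraction and the Haar wavelet transformation to substantially reduce training and inference time while remaining comparable to state-of-the-art methods on several datasets.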

Details Motivation: Existing methods train separate generative adversarial networks (GANs) for different color channels, which leads to long training and inference times and makes efficient document image enhancement and binarization difficult. Method: MFE-GAN combines multi-scale feature extraction (MFE), Haar wavelet transformation (HWT), and normalization as preprocessing, and introduces new generators, discriminators, and loss functions to improve efficiency and performance. Result: Experiments on the Benchmark, Nabuco, and CMATERdb datasets show that MFE-GAN significantly shortens training and inference time while maintaining performance comparable to SOTA methods. Conclusion: MFE-GAN is an efficient and effective approach to document image enhancement and binarization that greatly increases processing speed without sacrificing accuracy, making it suitable for practical deployment. Abstract: Document image enhancement and binarization are commonly performed prior to document analysis and recognition tasks for improving the efficiency and accuracy of optical character recognition (OCR) systems. This is because directly recognizing text in degraded documents, particularly in color images, often results in unsatisfactory recognition performance. To address these issues, existing methods train independent generative adversarial networks (GANs) for different color channels to remove shadows and noise, which, in turn, facilitates efficient text information extraction. However, deploying multiple GANs results in long training and inference times. To reduce both training and inference times of document image enhancement and binarization models, we propose MFE-GAN, an efficient GAN-based framework with multi-scale feature extraction (MFE), which incorporates Haar wavelet transformation (HWT) and normalization to process document images before feeding them into GANs for training. In addition, we present novel generators, discriminators, and loss functions to improve the model's performance, and we conduct ablation studies to demonstrate their effectiveness. Experimental results on the Benchmark, Nabuco, and CMATERdb datasets demonstrate that the proposed MFE-GAN significantly reduces the total training and inference times while maintaining comparable performance with respect to state-of-the-art (SOTA) methods. The implementation of this work is available at https://ruiyangju.github.io/MFE-GAN.
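
For readers unfamiliar with the Haar wavelet transformation mentioned in the abstract, the sketch below computes one level of the 2D orthonormal Haar transform on a small grayscale image stored as nested lists. It is a generic textbook version, not code from the MFE-GAN repository; the function name `haar2d_level` and the sub-band ordering are choices made for this example.

```python
def haar2d_level(image):
    """One level of the 2D Haar transform for an image with even height and width.

    Returns four half-resolution sub-bands:
    (approximation, column detail, row detail, diagonal detail).
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must both be even")
    approx, d_cols, d_rows, d_diag = [], [], [], []
    for r in range(0, rows, 2):
        row_a, row_c, row_r, row_d = [], [], [], []
        for c in range(0, cols, 2):
            # The 2x2 block [[a, b], [d, e]] is mapped to four coefficients.
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            row_a.append((a + b + d + e) / 2.0)  # local average (low-pass)
            row_c.append((a - b + d - e) / 2.0)  # change across columns
            row_r.append((a + b - d - e) / 2.0)  # change across rows
            row_d.append((a - b - d + e) / 2.0)  # diagonal change
        approx.append(row_a)
        d_cols.append(row_c)
        d_rows.append(row_r)
        d_diag.append(row_d)
    return approx, d_cols, d_rows, d_diag


bands = haar2d_level([[1, 2], [3, 4]])
print(bands)
```

The approximation band is a half-resolution version of the input, which is one reason a wavelet step can shrink the data fed to a network; how MFE-GAN actually uses the sub-bands is described in the paper, not here.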

[77] SportsGPT: An LLM-driven Framework for Interpretable Sports Motion Assessment and Training Guidance

Wenbo Tian,Ruting Lin,Hongxian Zheng,Yaodong Yang,Geng Wu,Zihao Zhang,Zhang Zhang

Main category: cs.CV

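TL;DR: This paper proposes SportsGPT, an LLM-based framework for interpretable sports motion assessment and training guidance that closes the loop from motion data to professional guidance through MotionDTW, KISMAM, and SportsRAG.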

Details Motivation: Existing intelligent sports analysis systems focus mainly on scoring and visualization and lack automatic performance diagnosis and interpretable training advice, so they fall short of professional training needs. Method: A three-stage framework: (1) MotionDTW performs two-stage time series alignment to extract keyframes; (2) KISMAM compares the keyframes with target models to produce interpretable assessment metrics; (3) SportsRAG uses retrieval-augmented generation (RAG) over a 6B-token knowledge base to prompt the LLM to generate professional training guidance. Result: MotionDTW outperforms traditional methods in temporal error and IoU; ablation studies confirm the effectiveness of KISMAM and SportsRAG; SportsGPT surpasses general-purpose LLMs in diagnostic accuracy and professionalism. Conclusion: SportsGPT closes the loop from raw motion data to interpretable assessment and professional training guidance, offering a more precise and interpretable paradigm for intelligent sports training. Abstract: Existing intelligent sports analysis systems mainly focus on "scoring and visualization," often lacking automatic performance diagnosis and interpretable training guidance. Recent advances in Large Language Models (LLMs) and motion analysis techniques provide new opportunities to address the above limitations. In this paper, we propose SportsGPT, an LLM-driven framework for interpretable sports motion assessment and training guidance, which establishes a closed loop from motion time-series input to professional training guidance. First, given a set of high-quality target models, we introduce MotionDTW, a two-stage time series alignment algorithm designed for accurate keyframe extraction from skeleton-based motion sequences. Subsequently, we design a Knowledge-based Interpretable Sports Motion Assessment Model (KISMAM) to obtain a set of interpretable assessment metrics (e.g., insufficient extension) by contrasting the keyframes with the target models. Finally, we propose SportsRAG, a RAG-based training guidance model based on Qwen3. Leveraging a 6B-token knowledge base, it prompts the LLM to generate professional training guidance by retrieving domain-specific QA pairs. Experimental results demonstrate that MotionDTW significantly outperforms traditional methods with lower temporal error and higher IoU scores. Furthermore, ablation studies validate the KISMAM and SportsRAG, confirming that SportsGPT surpasses general LLMs in diagnostic accuracy and professionalism.
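
MotionDTW builds on time series alignment, so a sketch of classic dynamic time warping (DTW) may help as background. This is the textbook single-stage algorithm on scalar sequences, not the two-stage MotionDTW from the paper; the `dtw` function and the toy sequences are assumptions for illustration, and real skeleton data would need a per-frame distance over joint coordinates.

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Classic DTW: return (total alignment cost, warping path).

    The path is a list of (index in a, index in b) pairs.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = dist(a[i - 1], b[j - 1]) + step
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda s: s[0])
    path.reverse()
    return cost[n][m], path


# A reference motion and a performance that lingers on the first pose.
template = [0, 1, 2, 3]
performance = [0, 0, 1, 2, 3]
total, path = dtw(template, performance)
print(total, path)
```

Once the path is known, a keyframe annotated on the template can be transferred to the performance by looking up the frames it is aligned with, which is the role alignment plays in keyframe extraction.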

[78] Consistent Instance Field for Dynamic Scene Understanding

Junyi Wu,Van Nguyen Nguyen,Benjamin Planche,Jiachen Tao,Changchang Sun,Zhongpai Gao,Zhenghao Zhao,Anwesa Choudhuri,Gengyu Zhang,Meng Zheng,Feiran Wang,Terrence Chen,Yan Yan,Ziyan Wu

Main category: cs.CV

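TL;DR: Proposes Consistent Instance Field, a continuous, probabilistic spatio-temporal representation for dynamic scene understanding that jointly models radiance and semantics with deformable 3D Gaussians and significantly outperforms existing methods on novel-view panoptic segmentation and open-vocabulary 4D querying.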

Details Motivation: Existing methods rely on discrete tracking or view-dependent features and struggle to keep instance identities consistent across space and time, which calls for a continuous representation that disentangles visibility from object identity. Method: Consistent Instance Field models each space-time point with an occupancy probability and a conditional instance distribution; an instance-embedded representation based on deformable 3D Gaussians is learned from RGB images and instance masks through differentiable rendering, with mechanisms that calibrate per-Gaussian identities and resample Gaussians toward semantically active regions. Result: On the HyperNeRF and Neu3D datasets, the method significantly outperforms state-of-the-art methods on novel-view panoptic segmentation and open-vocabulary 4D querying. Conclusion: The method achieves instance representations that are consistent across space and time, improves dynamic scene understanding, and offers a new solution for 4D scene analysis. Abstract: We introduce Consistent Instance Field, a continuous and probabilistic spatio-temporal representation for dynamic scene understanding. Unlike prior methods that rely on discrete tracking or view-dependent features, our approach disentangles visibility from persistent object identity by modeling each space-time point with an occupancy probability and a conditional instance distribution. To realize this, we introduce a novel instance-embedded representation based on deformable 3D Gaussians, which jointly encode radiance and semantic information and are learned directly from input RGB images and instance masks through differentiable rasterization. Furthermore, we introduce new mechanisms to calibrate per-Gaussian identities and resample Gaussians toward semantically active regions, ensuring consistent instance representations across space and time. Experiments on HyperNeRF and Neu3D datasets demonstrate that our method significantly outperforms state-of-the-art methods on novel-view panoptic segmentation and open-vocabulary 4D querying tasks.

[79] Erasing CLIP Memories: Non-Destructive, Data-Free Zero-Shot class Unlearning in CLIP Models

Ashish Mishra,Tarun Kumar,Gyanaranjan Nayak,Arpit Shah,Suparna Bhattacharya,Martin Foltin

Main category: cs.CV

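TL;DR: Proposes a closed-form selective unlearning method based on nullspace projection that efficiently and precisely removes specific class information from multimodal models such as CLIP, without retraining or forget-set images.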

Details Motivation: Privacy preservation and model decontamination in multimodal models call for an efficient, precise selective unlearning method that avoids the retraining and large data requirements of traditional approaches. Method: Nullspace projection is applied to the final projection layer of a pretrained model such as CLIP: an orthonormal basis is computed for the subspace spanned by the target text embeddings, and those directions are projected out to remove the corresponding class information. Result: Zero-shot classification performance on the target classes drops sharply while the model's overall multimodal knowledge is preserved; even a partial projection can balance complete unlearning against information retention. Conclusion: The method is a computationally efficient, training-free selective unlearning scheme suited to privacy preservation and model decontamination, and it offers a new approach to controllable forgetting in multimodal models. Abstract: We introduce a novel, closed-form approach for selective unlearning in multimodal models, specifically targeting pretrained models such as CLIP. Our method leverages nullspace projection to erase the target class information embedded in the final projection layer, without requiring any retraining or the use of images from the forget set. By computing an orthonormal basis for the subspace spanned by target text embeddings and projecting out these directions, we dramatically reduce the alignment between image features and undesired classes. Unlike traditional unlearning techniques that rely on iterative fine-tuning and extensive data curation, our approach is both computationally efficient and surgically precise. This leads to a pronounced drop in zero-shot performance for the target classes while preserving the overall multimodal knowledge of the model. Our experiments demonstrate that even a partial projection can balance between complete unlearning and retaining useful information, addressing key challenges in model decontamination and privacy preservation.
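
The projection step described in the abstract can be illustrated with a few lines of linear algebra. The sketch below builds an orthonormal basis for a set of target embeddings by Gram-Schmidt and removes those directions from a feature vector. It is a toy under stated assumptions: the vectors are made-up 3D examples rather than CLIP embeddings, the projection is applied to a single feature rather than to the projection layer's weights, and the `strength` parameter for partial projection is a hypothetical knob introduced here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(x, basis, strength=1.0):
    """Remove the component of x lying in span(basis).

    strength=1.0 is a full nullspace projection; values between 0 and 1
    remove only part of the component (a hypothetical partial projection).
    """
    out = list(x)
    for q in basis:
        c = dot(x, q)  # valid because the basis vectors are orthonormal
        out = [oi - strength * c * qi for oi, qi in zip(out, q)]
    return out


# Toy "text embeddings" of the classes to forget, and one "image feature".
targets = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(targets)
feature = [2.0, 3.0, 4.0]
erased = project_out(feature, basis)
partial = project_out(feature, basis, strength=0.5)
print(erased, partial)
```

After the full projection the feature has zero dot product with every target embedding, so a similarity-based zero-shot classifier can no longer align it with those classes, while the component outside the target subspace is left untouched.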

[80] SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing

Han Zou,Yan Zhang,Ruiqi Yu,Cong Xie,Jie Huang,Zhenpeng Zhan

Main category: cs.CV

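TL;DR: This paper proposes SketchAssist, an interactive sketch editing assistant that unifies instruction-guided global edits with line-guided region redrawing, enabling efficient semantic changes and local redrawing while preserving the sketch's structure and style.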

Details Motivation: Existing image editing systems struggle to preserve the sparse, style-sensitive nature of line art while supporting high-level semantic changes and precise local redrawing, so a controllable editing method designed for sketches is needed. Method: A controllable data generation pipeline constructs attribute-addition sequences and multi-step edit chains, and a style-preserving attribute-removal model broadens data diversity; on this data, a unified framework built on DiT-based editors repurposes the RGB channels to encode the inputs and adds a task-guided mixture-of-experts to the LoRA layers, switching editing modes according to text and visual cues. Result: Experiments show that SketchAssist outperforms existing baselines in instruction adherence, style preservation, and structural fidelity, achieving state-of-the-art performance. Conclusion: The dataset and SketchAssist together provide a practical, controllable solution for sketch creation and revision that balances semantic control with preservation of style and structure. Abstract: Sketch editing is central to digital illustration, yet existing image editing systems struggle to preserve the sparse, style-sensitive structure of line art while supporting both high-level semantic changes and precise local redrawing. We present SketchAssist, an interactive sketch drawing assistant that accelerates creation by unifying instruction-guided global edits with line-guided region redrawing, while keeping unrelated regions and overall composition intact. To enable this assistant at scale, we introduce a controllable data generation pipeline that (i) constructs attribute-addition sequences from attribute-free base sketches, (ii) forms multi-step edit chains via cross-sequence sampling, and (iii) expands stylistic coverage with a style-preserving attribute-removal model applied to diverse sketches. Building on this data, SketchAssist employs a unified sketch editing framework with minimal changes to DiT-based editors. We repurpose the RGB channels to encode the inputs, enabling seamless switching between instruction-guided edits and line-guided redrawing within a single input interface. To further specialize behavior across modes, we integrate a task-guided mixture-of-experts into LoRA layers, routing by text and visual cues to improve semantic controllability, structural fidelity, and style preservation. Extensive experiments show state-of-the-art results on both tasks, with superior instruction adherence and style/structure preservation compared to recent baselines. Together, our dataset and SketchAssist provide a practical, controllable assistant for sketch creation and revision.

[81] TorchTraceAP: A New Benchmark Dataset for Detecting Performance Anti-Patterns in Computer Vision Models

Hanning Chen,Keyu Man,Kevin Zhu,Chenguang Zhu,Haonan Li,Tongbo Luo,Xizhou Feng,Wei Sun,Sreen Tallam,Mohsen Imani,Partha Kanuparthy

Main category: cs.CV

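TL;DR: This paper presents the first benchmark dataset for evaluating and improving machine learning models' ability to detect performance anti-patterns in PyTorch traces, containing over 600 traces from a variety of computer vision models, and proposes an iterative method that combines a lightweight model with a large language model to improve detection accuracy and work around LLM context length limits.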

Details Motivation: Detecting performance anti-patterns in machine learning models is critical for efficient training and inference, but it currently depends on experts manually analyzing execution traces, which is time-consuming, hard to automate, and especially out of reach for researchers without dedicated resources. Method: The authors build a dataset of over 600 PyTorch traces collected across hardware platforms, covering classification, detection, segmentation, and generation models, and propose a two-stage iterative method: a lightweight model first locates suspicious trace segments, and a large language model then performs fine-grained classification and generates feedback. Result: The method significantly outperforms unsupervised clustering and rule-based statistical techniques at detecting anti-pattern regions and mitigates LLM limitations in context length and reasoning efficiency. Conclusion: The work offers a scalable approach to automatically diagnosing ML system performance problems; the lightweight-model-plus-LLM framework reduces manual effort while improving the accuracy and practicality of anti-pattern detection. Abstract: Identifying and addressing performance anti-patterns in machine learning (ML) models is critical for efficient training and inference, but it typically demands deep expertise spanning system infrastructure, ML models and kernel development. While large tech companies rely on dedicated ML infrastructure engineers to analyze torch traces and benchmarks, such resource-intensive workflows are largely inaccessible to computer vision researchers in general. Among the challenges, pinpointing problematic trace segments within lengthy execution traces remains the most time-consuming task, and is difficult to automate with current ML models, including LLMs. In this work, we present the first benchmark dataset specifically designed to evaluate and improve ML models' ability to detect anti-patterns in traces. Our dataset contains over 600 PyTorch traces from diverse computer vision models (classification, detection, segmentation, and generation) collected across multiple hardware platforms. We also propose a novel iterative approach: a lightweight ML model first detects trace segments with anti-patterns, followed by a large language model (LLM) for fine-grained classification and targeted feedback. Experimental results demonstrate that our method significantly outperforms unsupervised clustering and rule-based statistical techniques for detecting anti-pattern regions. Our method also effectively compensates for the LLM's limited context length and reasoning inefficiencies.

[82] CIS-BA: Continuous Interaction Space Based Backdoor Attack for Object Detection in the Real-World

Shuxin Zhao,Bo Lang,Nan Xiao,Yilang Zhang

Main category: cs.CV

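TL;DR: This paper proposes CIS-BA, a new backdoor attack paradigm that models continuous interaction patterns between objects to enable multi-trigger-multi-object attacks and achieves high attack success rates and robustness in complex environments.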

Details Motivation: Existing backdoor attacks are limited by single-trigger-single-object mappings and fragile pixel-level cues, which restricts their capability and robustness in practice. Method: CIS-BA shifts trigger design from static object features to continuous inter-object interaction patterns and builds space triggers from them; the CIS-Frame framework generates class-geometry constraints through interaction analysis for sample poisoning and embeds the backdoor during detector training. Result: Experiments on MS-COCO and real-world videos show that CIS-BA achieves over 97% attack success in complex environments, stays above 95% under dynamic multi-trigger conditions, and evades three state-of-the-art defenses. Conclusion: CIS-BA extends backdoor attack research to interaction-intensive scenarios and offers new insight into the security of object detection systems. Abstract: Object detection models deployed in real-world applications such as autonomous driving face serious threats from backdoor attacks. Despite their practical effectiveness, existing methods are inherently limited in both capability and robustness due to their dependence on single-trigger-single-object mappings and fragile pixel-level cues. We propose CIS-BA, a novel backdoor attack paradigm that redefines trigger design by shifting from static object features to continuous inter-object interaction patterns that describe how objects co-occur and interact in a scene. By modeling these patterns as a continuous interaction space, CIS-BA introduces space triggers that, for the first time, enable a multi-trigger-multi-object attack mechanism while achieving robustness through invariant geometric relations. To implement this paradigm, we design CIS-Frame, which constructs space triggers via interaction analysis, formalizes them as class-geometry constraints for sample poisoning, and embeds the backdoor during detector training. CIS-Frame supports both single-object attacks (object misclassification and disappearance) and multi-object simultaneous attacks, enabling complex and coordinated effects across diverse interaction states. Experiments on MS-COCO and real-world videos show that CIS-BA achieves over 97% attack success under complex environments and maintains over 95% effectiveness under dynamic multi-trigger conditions, while evading three state-of-the-art defenses. In summary, CIS-BA extends the landscape of backdoor attacks in interaction-intensive scenarios and provides new insights into the security of object detection systems.

[83] FastDDHPose: Towards Unified, Efficient, and Disentangled 3D Human Pose Estimation

Qingyuan Cai,Linxin Zhang,Xuecai Hu,Saihui Hou,Yongzhen Huang

Main category: cs.CV

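TL;DR: Proposes Fast3DHPE, a modular framework for monocular 3D human pose estimation that enables rapid reproduction and fair comparison of methods, and builds on it FastDDHPose, a disentangled diffusion-based method that significantly improves performance, generalization, and robustness.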

Details Motivation: Existing 3D human pose estimation methods lack a unified training and evaluation framework, which makes fair comparison difficult, while hierarchical error accumulation and complex joint modeling hurt performance. Method: Fast3DHPE is a modular framework that standardizes training and evaluation; FastDDHPose uses a disentangled diffusion model to model the distributions of bone length and bone direction separately, together with a spatial and temporal denoiser based on the kinematic hierarchy that reduces redundant topology modeling. Result: Experiments on Human3.6M and MPI-INF-3DHP validate the framework, which enables efficient training and fair comparison; FastDDHPose achieves SOTA performance with strong generalization and robustness in in-the-wild scenarios. Conclusion: Fast3DHPE provides a reproducible, efficient, and fair benchmark framework for 3D human pose estimation, and FastDDHPose improves accuracy and practicality through disentangled modeling and hierarchical denoising. Abstract: Recent approaches for monocular 3D human pose estimation (3D HPE) have achieved leading performance by directly regressing 3D poses from 2D keypoint sequences. Despite the rapid progress in 3D HPE, existing methods are typically trained and evaluated under disparate frameworks, lacking a unified framework for fair comparison. To address these limitations, we propose Fast3DHPE, a modular framework that facilitates rapid reproduction and flexible development of new methods. By standardizing training and evaluation protocols, Fast3DHPE enables fair comparison across 3D human pose estimation methods while significantly improving training efficiency. Within this framework, we introduce FastDDHPose, a Disentangled Diffusion-based 3D Human Pose Estimation method which leverages the strong latent distribution modeling capability of diffusion models to explicitly model the distributions of bone length and bone direction while avoiding further amplification of hierarchical error accumulation. Moreover, we design an efficient Kinematic-Hierarchical Spatial and Temporal Denoiser that encourages the model to focus on kinematic joint hierarchies while avoiding unnecessary modeling of overly complex joint topologies. Extensive experiments on Human3.6M and MPI-INF-3DHP show that the Fast3DHPE framework enables fair comparison of all methods while significantly improving training efficiency. Within this unified framework, FastDDHPose achieves state-of-the-art performance with strong generalization and robustness in in-the-wild scenarios. The framework and models will be released at: https://github.com/Andyen512/Fast3DHPE
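
The disentanglement the abstract refers to, representing a pose by bone lengths and bone directions instead of raw joint coordinates, can be shown with a small round-trip example. The sketch below is only the geometric decomposition, not the diffusion model; the three-joint chain, the `parents` array convention (root marked with -1, parents listed before children), and the function names are assumptions made for this example.

```python
import math


def decompose(joints, parents):
    """Split a 3D pose into per-bone lengths and unit directions.

    joints[j] is the 3D position of joint j, parents[j] the index of its
    parent (-1 for the root). Bones are keyed by their child joint index.
    Assumes no bone has zero length.
    """
    lengths, directions = {}, {}
    for j, p in enumerate(parents):
        if p < 0:
            continue
        bone = [a - b for a, b in zip(joints[j], joints[p])]
        length = math.sqrt(sum(c * c for c in bone))
        lengths[j] = length
        directions[j] = [c / length for c in bone]
    return lengths, directions


def recompose(root, parents, lengths, directions):
    """Rebuild joint positions by walking the kinematic tree from the root.

    Assumes every parent index appears before its children.
    """
    joints = [None] * len(parents)
    for j, p in enumerate(parents):
        if p < 0:
            joints[j] = list(root)
        else:
            joints[j] = [pc + lengths[j] * d for pc, d in zip(joints[p], directions[j])]
    return joints


# Toy chain: hip -> knee -> ankle.
pose = [[0.0, 0.0, 0.0], [0.0, 3.0, 4.0], [0.0, 3.0, 6.0]]
parents = [-1, 0, 1]
lengths, directions = decompose(pose, parents)
rebuilt = recompose(pose[0], parents, lengths, directions)
print(lengths, directions)
```

Because each joint is rebuilt from its parent, an error in one bone's direction shifts every joint further down the chain, which is the hierarchical error accumulation the abstract says the method tries not to amplify.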

[84] Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes

Joseph Hoche,Andrei Bursuc,David Brellmann,Gilles Louppe,Pavel Izmailov,Angela Yao,Gianni Franchi

Main category: cs.CV

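TL;DR: Proposes SGPU, a Bayesian framework based on the geometric structure of semantic embeddings for uncertainty estimation in large vision-language models; it avoids the fragility of traditional clustering methods and achieves the best calibration and discrimination across multiple models and datasets.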

Details Motivation: Existing methods that cluster generated answers with external models are sensitive to minor phrasing variations, which makes semantic uncertainty estimates unreliable. Method: Generated answers are mapped into a dense semantic space, the Gram matrix of their embeddings is computed, and its eigenspectrum summarizes the semantic structure; a Gaussian Process Classifier then learns the relationship between patterns of semantic consistency and predictive uncertainty. Result: Across six models and eight datasets, SGPU achieves SOTA results on ECE, AUROC, and AUARC and transfers across models and modalities. Conclusion: By capturing general patterns of semantic uncertainty through a spectral representation, SGPU provides a robust, generalizable uncertainty estimation method. Abstract: Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar answers, leading to unreliable uncertainty estimates. We propose Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. SGPU maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is then fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty, and that can be applied in both black-box and white-box settings. Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. We further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.

[85] Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere

Francesco Di Sario,Daniel Rebain,Dor Verbin,Marco Grangetto,Andrea Tagliasacchi

Main category: cs.CV

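TL;DR: Proposes Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting. Compared with conventional Spherical Harmonics (SH), it handles high-frequency signals and specular reflections better, is simpler to optimize, and reaches state-of-the-art results.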

Details Motivation: Radiance field methods such as 3D Gaussian Splatting rely on Spherical Harmonics (SH) to model appearance, but SH are limited on high-frequency signals, suffer from Gibbs ringing, and model specular reflections poorly; existing alternatives add optimization complexity, so a more efficient and general representation is needed. Method: Spherical Voronoi (SV) partitions the directional domain into learnable regions with smooth boundaries and is used for both diffuse and specular appearance; for reflections, the reflected direction is taken as input, following the idea of reflection probes from classical graphics. Result: State-of-the-art results on synthetic and real-world datasets, with clear gains over SH and other alternatives on specular reflections, while keeping optimization simpler. Conclusion: SV is a principled, efficient, and general appearance model for explicit 3D representations, with advantages for view-dependent effects. Abstract: Radiance field methods (e.g. 3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations. SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and fail to capture specular reflections, a key component of realistic rendering. Although alternatives like spherical Gaussians offer improvements, they add significant optimization complexity. We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting. SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while keeping optimization simpler than existing alternatives. For reflections, where SH fail, we leverage SV as learnable reflection probes, taking reflected directions as input following principles from classical graphics. This formulation attains state-of-the-art results on synthetic and real-world datasets, demonstrating that SV offers a principled, efficient, and general solution for appearance modeling in explicit 3D representations.
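
One way to picture a partition of the sphere with smooth boundaries is a softmax over the similarity between the view direction and a set of site directions. The sketch below is only an illustration of that idea: the softmax form, the `temperature` parameter, and the function name `soft_voronoi_shade` are assumptions, not the parameterization used in the paper.

```python
import math

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def soft_voronoi_shade(direction, sites, colors, temperature=20.0):
    """Blend per-site RGB colors with softmax weights over cosine similarity.

    A large temperature approaches a hard Voronoi partition of the sphere;
    a small one gives wide, smooth transitions between cells.
    """
    d = _normalize(direction)
    logits = [temperature * sum(a * b for a, b in zip(d, _normalize(s))) for s in sites]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    color = [sum(w * c[ch] for w, c in zip(weights, colors)) for ch in range(3)]
    return color, weights
```

In a learned setting the site directions and per-site colors would be the optimized parameters; here they are plain lists supplied by the caller.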

[86] Fracture Morphology Classification: Local Multiclass Modeling for Multilabel Complexity

Cassandra Krause,Mattias P. Heinrich,Ron Keuth

Main category: cs.CV

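TL;DR: Proposes extracting fracture morphology by automatically assigning global AO codes to the corresponding fracture bounding boxes, which turns the global multilabel task into a local multiclass one. This improves the average F1 score by 7.89% on public datasets, but performance declines with imperfect fracture detectors, which highlights challenges for real-world use.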

Details Motivation: Fractures are common in children, so accurate diagnosis is essential; fracture morphology is a key diagnostic feature and needs to be extracted effectively. Method: Global AO codes are automatically assigned to the corresponding fracture bounding boxes, which reformulates the global multilabel classification problem as a local multiclass one and allows public datasets to be used for training and evaluation. Result: The average F1 score improves by 7.89%, but performance declines when an imperfect fracture detector is used. Conclusion: The method improves recognition of fracture morphology, but detector quality remains a challenge for real-world deployment. Abstract: Between $15\,\%$ and $45\,\%$ of children experience a fracture during their growth years, making accurate diagnosis essential. Fracture morphology, alongside location and fragment angle, is a key diagnostic feature. In this work, we propose a method to extract fracture morphology by automatically assigning global AO codes to corresponding fracture bounding boxes. This approach enables the use of public datasets and reformulates the global multilabel task into a local multiclass one, improving the average F1 score by $7.89\,\%$. However, performance declines when using imperfect fracture detectors, highlighting challenges for real-world deployment. Our code is available on GitHub.

[87] Beyond a Single Light: A Large-Scale Aerial Dataset for Urban Scene Reconstruction Under Varying Illumination

Zhuoxiao Li,Wenzong Ma,Taoyu Wu,Jinjing Zhu,Zhenchao Q,Shuai Zhang,Jing Ou,Yinrui Ren,Weiqing Qi,Guobin Shen,Hui Xiong,Wufan Zhao

Main category: cs.CV

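TL;DR: Introduces SkyLume, a large-scale real-world UAV dataset for studying robust 3D reconstruction of urban scenes under varying illumination, with multi-period imagery, LiDAR ground truth, and a new evaluation metric, TCC.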

Details Motivation: Existing 3D reconstruction methods produce color artifacts and geometric errors on multi-temporal UAV imagery because of inconsistent illumination, and no dataset systematically supports research on this problem. Method: More than 100k high-resolution multi-view UAV images are collected over 10 urban regions at three periods of the day, together with LiDAR scans and accurate 3D ground truth; the Temporal Consistency Coefficient (TCC) is proposed to evaluate albedo stability in inverse rendering. Result: SkyLume supports precise evaluation of geometry and appearance, and TCC quantifies the robustness of light-material disentanglement, supporting research on large-scale inverse rendering and novel view synthesis. Conclusion: SkyLume provides a benchmark resource for illumination-robust 3D reconstruction and helps improve consistent modeling of geometry and appearance in real scenes. Abstract: Recent advances in Neural Radiance Fields and 3D Gaussian Splatting have demonstrated strong potential for large-scale UAV-based 3D reconstruction tasks by fitting the appearance of images. However, real-world large-scale captures are often based on multi-temporal data capture, where illumination inconsistencies across different times of day can lead to color artifacts, geometric inaccuracies, and inconsistent appearance. Due to the lack of UAV datasets that systematically capture the same areas under varying illumination conditions, this challenge remains largely underexplored. To fill this gap, we introduce SkyLume, a large-scale, real-world UAV dataset specifically designed for studying illumination-robust 3D reconstruction in urban scene modeling: (1) We collect data from 10 urban regions, comprising more than 100k high-resolution UAV images (four oblique views and nadir), where each region is captured at three periods of the day to systematically isolate illumination changes. (2) To support precise evaluation of geometry and appearance, we provide per-scene LiDAR scans and accurate 3D ground-truth for assessing depth, surface normals, and reconstruction quality under varying illumination. (3) For the inverse rendering task, we introduce the Temporal Consistency Coefficient (TCC), a metric that measures cross-time albedo stability and directly evaluates the robustness of the disentanglement of light and material. We aim for this resource to serve as a foundation that advances research and real-world evaluation in large-scale inverse rendering, geometry reconstruction, and novel view synthesis.

[88] DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos

Yang Bai,Liudi Yang,George Eskandar,Fengyi Shen,Mohammad Altillawi,Ziyuan Liu,Gitta Kutyniok

Main category: cs.CV

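TL;DR: DRAW2ACT is a depth-aware, trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory (depth, semantics, shape, and motion) and injects them into a diffusion model to generate more controllable and consistent robotic manipulation videos.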

Details Motivation: Existing trajectory-conditioned video generation methods mostly rely on 2D trajectories or single-modality conditioning, which limits controllability and consistency for robotic manipulation; a method is needed that fuses multimodal information and improves spatio-temporal consistency. Method: DRAW2ACT extracts several orthogonal representations from the input trajectory (depth, semantics, shape, and motion) and injects them into the diffusion model; it jointly generates spatially aligned RGB and depth videos, using cross-modality attention and depth supervision to strengthen spatio-temporal consistency; a multimodal policy model then regresses robot joint angles from the generated RGB and depth sequences. Result: On Bridge V2, Berkeley Autolab, and simulation benchmarks, DRAW2ACT outperforms existing baselines in visual fidelity and consistency and achieves higher manipulation success rates. Conclusion: Through multimodal conditioning and depth-aware joint generation, DRAW2ACT improves the quality of trajectory-conditioned video generation and its usefulness for robotic manipulation. Abstract: Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot's joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.

[89] History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation

Xichen Ding,Jianzhe Gao,Cong Pan,Wenguan Wang,Jie Qin

Main category: cs.CV

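TL;DR: Proposes HETT, a History-Enhanced Two-Stage Transformer framework that improves language-instructed visual navigation for UAVs in large-scale urban environments. A coarse-to-fine navigation pipeline and a historical grid map give a better balance between global environmental reasoning and local scene comprehension.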

Details Motivation: Existing UAV navigation systems typically adopt mono-granularity frameworks that struggle to balance global environmental reasoning with local scene comprehension, which limits navigation performance in complex urban environments. Method: HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions through fine-grained visual analysis; a historical grid map dynamically aggregates visual features into a structured spatial memory. Result: On the manually refined CityNav dataset, HETT delivers significant gains in navigation performance, and ablation studies verify the contribution of each component. Conclusion: With a coarse-to-fine two-stage strategy and a history-enhanced spatial memory, HETT balances global and local perception and offers a new approach to aerial vision-and-language navigation. Abstract: Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. In addition, a historical grid map is designed to dynamically aggregate visual features into a structured spatial memory, enhancing comprehensive scene awareness. Furthermore, the CityNav dataset annotations are manually refined to enhance data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, while extensive ablation studies further verify the effectiveness of each component.

[90] OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving

Tao Tang,Enhui Ma,xia zhou,Letian Wang,Tianyi Yan,Xueyang Zhang,Kun Zhan,Peng Jia,XianPeng Lang,Jia-Wang Bian,Kaicheng Yu,Xiaodan Liang

Main category: cs.CV

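TL;DR: OmniGen is a unified framework for multimodal sensor data generation. It uses a shared bird's-eye-view space and a new decoding method, UAE, to jointly generate LiDAR and multi-view camera data, and it uses a DiT with ControlNet for controllable generation.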

Details Motivation: Existing generative models focus mainly on single-modality generation, which leads to inconsistency and inefficiency in multimodal sensor data, while real-world data collection is costly and covers limited scenarios. Method: OmniGen uses a shared BEV space to unify multimodal features, introduces UAE (a generalizable multimodal reconstruction method) to jointly decode LiDAR and multi-view image data through volume rendering, and combines a DiT with a ControlNet branch for controllable generation. Result: Experiments show that OmniGen generates consistent multimodal sensor data, supports flexible changes to the sensor configuration, and performs well on generation quality and multimodal alignment. Conclusion: OmniGen offers an efficient and controllable way to generate multimodal simulation data for autonomous driving and supports training and testing based on generative models. Abstract: Autonomous driving has seen remarkable advancements, largely driven by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising solution by synthesizing realistic sensor data. However, existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OmniGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird's Eye View (BEV) space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, to jointly decode LiDAR and multi-view camera data. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Our comprehensive experiments demonstrate that OmniGen achieves desired performances in unified multimodal sensor data generation with multimodal consistency and flexible sensor adjustments.

[91] Multi-View MRI Approach for Classification of MGMT Methylation in Glioblastoma Patients

Rawan Alyahya,Asrar Alruwayqi,Atheer Alqarni,Asma Alkhaldi,Metab Alkubeyyer,Xin Gao,Mona Alshahrani

Main category: cs.CV

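TL;DR: Proposes an MRI-based multi-view deep learning method for non-invasive prediction of MGMT promoter methylation status in glioblastoma (GBM) patients. It avoids the high computational cost of conventional 3D models and improves performance with a newly proposed tumor slice extraction technique.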

Details Motivation: MGMT promoter methylation status affects how well chemotherapy works in GBM, but confirming it currently relies on invasive brain tissue biopsy; a non-invasive and accurate alternative is needed. Method: A multi-view MRI fusion strategy combined with deep learning models uses spatial relationships to integrate information from the three MRI views; a new tumor slice selection method improves feature representativeness, and complex 3D models are avoided to reduce the computational burden. Result: The method outperforms existing techniques on multiple evaluation metrics, identifies MGMT methylation status effectively, and is reproducible with low resource consumption. Conclusion: The study demonstrates the potential of radiogenomics for non-invasive detection of MGMT methylation, advances precision medicine in GBM treatment, and provides a reproducible model pipeline to encourage transparent research. Abstract: The presence of MGMT promoter methylation significantly affects how well chemotherapy works for patients with Glioblastoma Multiforme (GBM). Currently, confirmation of MGMT promoter methylation relies on invasive brain tumor tissue biopsies. In this study, we explore radiogenomics techniques, a promising approach in precision medicine, to identify genetic markers from medical images. Using MRI scans and deep learning models, we propose a new multi-view approach that considers spatial relationships between MRI views to detect MGMT methylation status. Importantly, our method extracts information from all three views without using a complicated 3D deep learning model, avoiding issues associated with high parameter count, slow convergence, and substantial memory demands. We also introduce a new technique for tumor slice extraction and show its superiority over existing methods based on multiple evaluation metrics. By comparing our approach to state-of-the-art models, we demonstrate the efficacy of our method. Furthermore, we share a reproducible pipeline of published models, encouraging transparency and the development of robust diagnostic tools. Our study highlights the potential of non-invasive methods for identifying MGMT promoter methylation and contributes to advancing precision medicine in GBM treatment.

[92] ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

Juze Zhang,Changan Chen,Xin Chen,Heng Yu,Tiange Xiang,Ali Sartaz Khan,Shrinidhi K. Lakshmikanth,Ehsan Adeli

Main category: cs.CV

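TL;DR: ViBES is a conversational 3D agent that jointly plans language and movement, generating speech, language, and behavior together in a dialogue-conditioned way, with support for multimodal interaction and controllable behavior output.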

Details Motivation: Existing systems model human behavior as a translation task from a fixed utterance to motion clips, without agent-level decision-making, which leads to rigid motion timing, weak social interaction, and fragmented modules. Method: ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) architecture: separate Transformer experts handle speech, facial expression, and body motion, share information through modality routing and cross-expert attention, and build on pretrained speech-language models to support mixed-initiative interaction. Result: On a multi-turn conversation benchmark, ViBES consistently outperforms strong baselines on automatic metrics of dialogue-motion alignment and behavior quality. Conclusion: ViBES jointly generates language, prosody, and movement, moving virtual bodies toward controllable, socially competent interactive agents. Abstract: Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task (co-speech gesture or text-to-motion) that maps a fixed utterance to motion clips, without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue-motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond "speech-conditioned motion generation" toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction. Code and data will be made available at: ai.stanford.edu/~juze/ViBES/
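
Hard routing by modality, as mentioned in the abstract, means each token is processed only by the expert that owns its modality. The toy sketch below shows just that dispatch step on an interleaved stream; the token representation, the expert callables, and the name `route_by_modality` are hypothetical, and the cross-expert attention that shares information between experts is not modeled.

```python
def route_by_modality(tokens, experts):
    """Send each (modality, value) token to the expert for its modality.

    `experts` maps a modality name to a callable. Routing is "hard": a token
    is processed by exactly one expert, and the output keeps the original
    interleaved order of the stream.
    """
    outputs = []
    for modality, value in tokens:
        if modality not in experts:
            raise KeyError(f"no expert registered for modality {modality!r}")
        outputs.append((modality, experts[modality](value)))
    return outputs

# Stand-in experts: real ones would be transformer blocks with separate weights.
toy_experts = {
    "speech": lambda x: x * 2,
    "face": lambda x: x + 100,
    "body": lambda x: -x,
}
stream = [("speech", 1), ("body", 2), ("face", 3), ("speech", 4)]
routed = route_by_modality(stream, toy_experts)
```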

[93] 4D-RaDiff: Latent Diffusion for 4D Radar Point Cloud Generation

Jimmie Kwok,Holger Caesar,Andras Palffy

Main category: cs.CV

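TL;DR: Proposes 4D-RaDiff, a framework for generating 4D radar point clouds that applies a diffusion model in latent space to address the scarcity of annotated radar data. The generated data can be used for augmentation and pre-training and substantially reduces dependence on real annotated data.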

Details Motivation: Limited annotated radar data holds back radar-based perception systems, so an effective way to generate high-quality radar point clouds is needed. Method: 4D-RaDiff applies diffusion to a latent point cloud representation and controls generation through object-level or scene-level conditioning, converting unlabeled bounding boxes into radar annotations or transforming LiDAR data into realistic radar scenes. Result: Using synthetic radar data for augmentation consistently improves object detection performance, and pre-training reduces the amount of real annotated data needed by up to 90% while reaching comparable performance. Conclusion: 4D-RaDiff generates high-quality 4D radar point clouds, substantially reduces dependence on annotated data, and offers a practical option for training and evaluating radar perception systems. Abstract: Automotive radar has shown promising developments in environment perception due to its cost-effectiveness and robustness in adverse weather conditions. However, the limited availability of annotated radar data poses a significant challenge for advancing radar-based perception systems. To address this limitation, we propose a novel framework to generate 4D radar point clouds for training and evaluating object detectors. Unlike image-based diffusion, our method is designed to consider the sparsity and unique characteristics of radar point clouds by applying diffusion to a latent point cloud representation. Within this latent space, generation is controlled via conditioning at either the object or scene level. The proposed 4D-RaDiff converts unlabeled bounding boxes into high-quality radar annotations and transforms existing LiDAR point cloud data into realistic radar scenes. Experiments demonstrate that incorporating synthetic radar data from 4D-RaDiff as a data augmentation method during training consistently improves object detection performance compared to training on real data only. In addition, pre-training on our synthetic data reduces the amount of required annotated radar data by up to 90% while achieving comparable object detection performance.

[94] Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding

Nando Metzger,Prune Truong,Goutam Bhat,Konrad Schindler,Federico Tombari

Main category: cs.CV

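TL;DR: Elastic3D is an end-to-end monocular-to-stereo video conversion method based on conditional latent diffusion. A guided VAE decoder avoids depth-estimation and warping artifacts and produces high-quality, disparity-consistent stereo video, and a scalar parameter controls the strength of the stereo effect at inference time.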

Details Motivation: Growing demand for immersive 3D content calls for automated conversion of monocular video to stereo video, but existing methods often introduce artifacts through explicit depth estimation and image warping, which hurts visual quality. Method: Elastic3D is an end-to-end framework based on conditional latent diffusion with a new guided VAE decoder that keeps the output stereo video sharp and epipolar-consistent; at inference time the user can adjust the disparity range with a single scalar parameter to control the strength of the stereo effect. Result: On three real-world stereo video datasets, the method outperforms both traditional warping-based and recent warping-free baselines in visual quality and stereo consistency. Conclusion: Elastic3D sets a new standard for reliable and controllable stereo video conversion, addressing artifacts while allowing flexible control of the stereo effect. Abstract: The growing demand for immersive 3D content calls for automated monocular-to-stereo video conversion. We present Elastic3D, a controllable, direct end-to-end method for upgrading a conventional video to a binocular one. Our approach, based on (conditional) latent diffusion, avoids artifacts due to explicit depth estimation and warping. The key to its high-quality stereo video output is a novel, guided VAE decoder that ensures sharp and epipolar-consistent stereo video output. Moreover, our method gives the user control over the strength of the stereo effect (more precisely, the disparity range) at inference time, via an intuitive, scalar tuning knob. Experiments on three different datasets of real-world stereo videos show that our method outperforms both traditional warping-based and recent warping-free baselines and sets a new standard for reliable, controllable stereo video conversion. Video samples are available on the project page: https://elastic3d.github.io.

[95] Enhancing Visual Programming for Visual Reasoning via Probabilistic Graphs

Wentao Wan,Kaiyu Wu,Qingyang Ma,Nan Kang,Yunjie Chen,Liang Lin,Keze Wang

Main category: cs.CV

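TL;DR: Proposes EVPG, a method that enhances Visual Programming (VP) with probabilistic graphs. It recasts the non-differentiable VP execution process as a differentiable probabilistic inference process, which enables end-to-end gradient-based optimization and significantly improves performance on complex visual reasoning tasks.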

Details Motivation: Existing visual programming methods neglect optimization of the pre-trained models used as sub-modules, and end-to-end learning is difficult because sub-task labels are missing and the VP process is non-differentiable. Method: A directed probabilistic graph is built from variable dependency relationships, reformulating VP execution as a differentiable exact probability inference process, so that final task labels can drive gradient-based end-to-end supervised learning. Result: On three complex visual reasoning tasks (GQA, NLVRv2, and Open Images), EVPG significantly improves the performance of the VP framework. Conclusion: EVPG addresses the inability of VP to use final labels to optimize pre-trained modules, enabling more efficient end-to-end learning and stronger visual reasoning. Abstract: Recently, Visual Programming (VP) based on large language models (LLMs) has rapidly developed and demonstrated significant potential in complex Visual Reasoning (VR) tasks. Previous works to enhance VP have primarily focused on improving the quality of LLM-generated visual programs. However, they have neglected to optimize the VP-invoked pre-trained models, which serve as modules for the visual sub-tasks decomposed from the targeted tasks by VP. The difficulty is that there are only final labels of targeted VR tasks rather than labels of sub-tasks. Besides, the non-differentiable nature of VP impedes the direct use of efficient gradient-based optimization methods to leverage final labels for end-to-end learning of the entire VP framework. To overcome these issues, we propose EVPG, a method to Enhance Visual Programming for visual reasoning via Probabilistic Graphs. Specifically, we creatively build a directed probabilistic graph according to the variable dependency relationships during the VP executing process, which reconstructs the non-differentiable VP executing process into a differentiable exact probability inference process on this directed probabilistic graph. As a result, this enables the VP framework to utilize the final labels for efficient, gradient-based optimization in end-to-end supervised learning on targeted VR tasks. Extensive and comprehensive experiments demonstrate the effectiveness and advantages of our EVPG, showing significant performance improvements for VP on three classical complex VR tasks: GQA, NLVRv2, and Open Images.
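
To make the idea of replacing hard program execution with probability inference concrete, consider a toy program that asks whether any object is both red and a cube. The sketch below is an illustration only, not the EVPG graph construction: it assumes the module outputs are independent probabilities, and the helper names (`prob_and`, `prob_exists`, `answer_probability`, `nll`) are hypothetical.

```python
import math

def prob_and(p, q):
    """P(A and B), assuming A and B are independent."""
    return p * q

def prob_exists(probs):
    """P(at least one event holds), assuming independent events."""
    none_hold = 1.0
    for p in probs:
        none_hold *= (1.0 - p)
    return 1.0 - none_hold

def answer_probability(red_probs, cube_probs):
    """P(answer = yes) for the toy program: exists obj: red(obj) and cube(obj).

    A hard executor would threshold each module output to True/False, which
    blocks gradients. Here every step is a smooth function of the module
    probabilities, so a loss on the final answer can reach each module.
    """
    per_object = [prob_and(r, c) for r, c in zip(red_probs, cube_probs)]
    return prob_exists(per_object)

def nll(p_yes, label):
    """Negative log-likelihood of the final yes/no label only."""
    p = p_yes if label else 1.0 - p_yes
    return -math.log(max(p, 1e-12))

p_yes = answer_probability([0.9, 0.2], [0.8, 0.5])
loss = nll(p_yes, True)
```

With an autodiff framework, the same composition would let the final-label loss update the attribute and shape modules without any sub-task labels, which is the property the abstract emphasizes.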

[96] DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance

Shreedhar Govil,Didier Stricker,Jason Rambach

Main category: cs.CV

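TL;DR: Introduces DriverGaze360, a large-scale driver attention dataset with a 360-degree field of view, together with DriverGaze360-Net, a panoramic attention prediction model that captures gaze behavior during driving more comprehensively.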

Details Motivation: Existing driver attention research is constrained by a narrow frontal field of view and limited driving diversity, so it cannot capture the full spatial context during lane changes, turns, and interactions with peripheral objects. Method: DriverGaze360, a large-scale 360-degree driver attention dataset with about 1 million gaze-labeled frames, is constructed, and DriverGaze360-Net is proposed to jointly learn attention maps and attended objects through an auxiliary semantic segmentation head. Result: Experiments show that DriverGaze360-Net achieves state-of-the-art attention prediction on panoramic driving images across multiple metrics. Conclusion: The work advances omnidirectional driver attention modeling and improves understanding of driver behavior in complex traffic scenes. Abstract: Predicting driver attention is a critical problem for developing explainable autonomous driving systems and understanding driver behavior in mixed human-autonomous vehicle traffic scenarios. Although significant progress has been made through large-scale driver attention datasets and deep learning architectures, existing works are constrained by narrow frontal field-of-view and limited driving diversity. Consequently, they fail to capture the full spatial context of driving environments, especially during lane changes, turns, and interactions involving peripheral objects such as pedestrians or cyclists. In this paper, we introduce DriverGaze360, a large-scale 360$^\circ$ field of view driver attention dataset, containing $\sim$1 million gaze-labeled frames collected from 19 human drivers, enabling comprehensive omnidirectional modeling of driver gaze behavior. Moreover, our panoramic attention prediction approach, DriverGaze360-Net, jointly learns attention maps and attended objects by employing an auxiliary semantic segmentation head. This improves spatial awareness and attention prediction across wide panoramic inputs. Extensive experiments demonstrate that DriverGaze360-Net achieves state-of-the-art attention prediction performance on multiple metrics on panoramic driving images. Dataset and method available at https://av.dfki.de/drivergaze360.

[97] Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in

Xiaoqian Shen,Min-Hung Chen,Yu-Chiang Frank Wang,Mohamed Elhoseiny,Ryo Hachiuma

Main category: cs.CV

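TL;DR: Proposes Zoom-Zero, a coarse-to-fine framework that improves temporal localization in video question answering with large video-language models. A zoom-in accuracy reward and token-selective credit assignment yield clear gains in temporal grounding and answer accuracy on several benchmarks.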

Details Motivation: Existing video question answering methods have limited temporal awareness and are prone to temporal mislocalization and hallucination, especially on long videos where detail and global context are hard to keep together. Method: Zoom-Zero first coarsely localizes the video segments relevant to the question, then zooms in on the most salient frames for fine-grained verification; it introduces two techniques, a zoom-in accuracy reward (which validates the fidelity of temporal grounding) and token-selective credit assignment (which separates the tokens responsible for temporal localization from those responsible for answer generation when attributing rewards). Result: Temporal grounding improves by 5.2% on NExT-GQA and 4.6% on ReXTime, average answer accuracy rises by 2.4%, and long-video benchmarks improve by 6.4% on average. Conclusion: The coarse-to-fine temporal focusing mechanism strengthens the temporal grounding of LVLMs, mitigates the weakness of GRPO with multi-faceted reward signals, and improves understanding and reasoning in GVQA, particularly on long videos. Abstract: Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limits of GRPO for the GVQA task with two key innovations: (i) a zoom-in accuracy reward that validates the fidelity of temporal grounding prediction and facilitates fine-grained visual verification on grounded frames; (ii) token-selective credit assignment, which attributes rewards to the tokens responsible for temporal localization or answer generation, mitigating GRPO's issue in handling multi-faceted reward signals. Our proposed method advances grounded video question answering, improving temporal grounding by 5.2\% on NExT-GQA and 4.6\% on ReXTime, while also enhancing average answer accuracy by 2.4\%. Additionally, the coarse-to-fine zoom-in during inference further benefits long-form video understanding by preserving critical visual details without compromising global context, yielding an average improvement of 6.4\% on long-video benchmarks.

[98] TUN: Detecting Significant Points in Persistence Diagrams with Deep Learning

Yu Chen,Hongwei Lin

Main category: cs.CV

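TL;DR: Proposes Topology Understanding Net (TUN), a multi-modal network for automatically detecting significant points in one-dimensional persistence diagrams. It combines enhanced PD descriptors, self-attention, a PointNet-style encoder, and imbalance-aware training to improve significance detection.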

Details Motivation: The difficulty of identifying significant points in persistence diagrams limits automated interpretation and decision-making with topological data analysis in practice, so a reliable automatic detection method is needed. Method: TUN combines enhanced persistence diagram descriptors, self-attention, a PointNet-style point cloud encoder, a learned fusion strategy, and per-point classification, together with stable preprocessing and imbalance-aware training. Result: Experiments show that TUN outperforms classic methods at detecting significant points in persistence diagrams, supporting its effectiveness in real-world scenarios. Conclusion: TUN offers an effective solution for automatically identifying significant topological features in persistence diagrams and makes topological data analysis more practical for downstream tasks. Abstract: Persistence diagrams (PDs) provide a powerful tool for understanding the topology of the underlying shape of a point cloud. However, identifying which points in PDs encode genuine signals remains challenging. This challenge directly hinders the practical adoption of topological data analysis in many applications, where automated and reliable interpretation of persistence diagrams is essential for downstream decision-making. In this paper, we study automatic significance detection for one-dimensional persistence diagrams. Specifically, we propose Topology Understanding Net (TUN), a multi-modal network that combines enhanced PD descriptors with self-attention, a PointNet-style point cloud encoder, learned fusion, and per-point classification, alongside stable preprocessing and imbalance-aware training. It provides an automated and effective solution for identifying significant points in PDs, which are critical for downstream applications. Experiments show that TUN outperforms classic methods in detecting significant points in PDs, illustrating its effectiveness in real-world applications.
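
For context on what a non-learned baseline for this task might look like, the sketch below flags points of a persistence diagram by their persistence (death minus birth), with the threshold placed at the largest gap between sorted persistence values. This is a simple heuristic chosen for illustration; the abstract does not say which classic methods TUN is compared against, and the function names are hypothetical.

```python
def persistence(point):
    """Lifetime of a topological feature given as a (birth, death) pair."""
    birth, death = point
    return death - birth

def largest_gap_threshold(diagram):
    """Midpoint of the largest gap between consecutive sorted persistences."""
    values = sorted((persistence(p) for p in diagram), reverse=True)
    if len(values) < 2:
        return 0.0
    gaps = [values[i] - values[i + 1] for i in range(len(values) - 1)]
    i = gaps.index(max(gaps))
    return (values[i] + values[i + 1]) / 2.0

def significant_points(diagram, threshold=None):
    """Points whose persistence exceeds the threshold."""
    if threshold is None:
        threshold = largest_gap_threshold(diagram)
    return [p for p in diagram if persistence(p) > threshold]

# One long-lived loop among short-lived noise features.
diagram = [(0.0, 0.1), (0.05, 0.12), (0.1, 1.0), (0.2, 0.25)]
selected = significant_points(diagram)
```

A rule like this uses only the diagram, whereas the described network also encodes the underlying point cloud.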

[99] SS4D: Native 4D Generative Model via Structured Spacetime Latents

Zhibing Li,Mengchen Zhang,Tong Wu,Jing Tan,Jiaqi Wang,Dahua Lin

Main category: cs.CV

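TL;DR: Proposes SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Training directly on 4D data yields high-fidelity, temporally coherent, and structurally consistent generation.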

Details Motivation: Existing methods build 4D representations on top of 3D or video generative models, which makes it hard to achieve both temporal coherence and structural consistency, and they are limited by the scarcity of 4D data. Method: A structured spacetime latent representation is introduced; a pre-trained single-image-to-3D model preserves spatial consistency, dedicated temporal layers strengthen temporal consistency, and factorized 4D convolutions with temporal downsampling blocks compress the temporal axis for efficient training and inference. Result: The model generates high-fidelity, temporally coherent, and structurally stable 4D dynamic objects and supports efficient modeling of long video sequences. Conclusion: By training directly on 4D data with a compressed spacetime latent space, SS4D improves both the quality and the efficiency of dynamic 3D generation. Abstract: We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency. At the core of our method is a compressed set of structured spacetime latents. Specifically, (1) To address the scarcity of 4D training data, we build on a pre-trained single-image-to-3D model, preserving strong spatial consistency. (2) Temporal consistency is enforced by introducing dedicated temporal layers that reason across frames. (3) To support efficient training and inference over long video sequences, we compress the latent sequence along the temporal axis using factorized 4D convolutions and temporal downsampling blocks. In addition, we employ a carefully designed training strategy to enhance robustness against occlusion

[100] PSMamba: Progressive Self-supervised Vision Mamba for Plant Disease Recognition

Abdullah Al Mamun,Miaohua Zhang,David Ahmedt-Aristizabal,Zeeshan Hayder,Mohammad Awrangjeb

Main category: cs.CV

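TL;DR: Proposes PSMamba, a progressive self-supervised framework that combines Vision Mamba with a dual-student hierarchical distillation strategy to learn representations of multi-scale lesion patterns in plant disease images.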

Details Motivation: Existing self-supervised learning methods focus on global alignment and struggle to capture the hierarchical, multi-scale lesion patterns in plant disease imagery. Method: PSMamba uses a shared global teacher and two specialised student networks: one processes mid-scale views to capture lesion distributions and vein structures, and the other focuses on local views to extract fine-grained cues such as texture irregularities and early-stage lesions; consistency losses align the scales. Result: On three benchmark datasets, PSMamba outperforms current state-of-the-art self-supervised methods in both domain-shifted and fine-grained scenarios. Conclusion: Through multi-granular supervision and cross-scale consistency learning, PSMamba improves the accuracy and robustness of representation learning for plant disease images. Abstract: Self-supervised Learning (SSL) has become a powerful paradigm for representation learning without manual annotations. However, most existing frameworks focus on global alignment and struggle to capture the hierarchical, multi-scale lesion patterns characteristic of plant disease imagery. To address this gap, we propose PSMamba, a progressive self-supervised framework that integrates the efficient sequence modelling of Vision Mamba (VM) with a dual-student hierarchical distillation strategy. Unlike conventional single teacher-student designs, PSMamba employs a shared global teacher and two specialised students: one processes mid-scale views to capture lesion distributions and vein structures, while the other focuses on local views to capture fine-grained cues such as texture irregularities and early-stage lesions. This multi-granular supervision facilitates the joint learning of contextual and detailed representations, with consistency losses ensuring coherent cross-scale alignment. Experiments on three benchmark datasets show that PSMamba consistently outperforms state-of-the-art SSL methods, delivering superior accuracy and robustness in both domain-shifted and fine-grained scenarios.

[101] From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region

Akila Premarathna,Kanishka Hewageegana,Garcia Andarcia Mariangel

Main category: cs.CV

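TL;DR: Presents a zero-shot and few-shot approach based on vision-language models (VLMs) for identifying wastewater treatment plants (WWTPs) in the Middle East and North Africa from high-resolution satellite imagery. Several VLMs exceed the true positive rate of a conventional YOLOv8 model, with Gemma-3 performing best, which supports the feasibility of efficient remote sensing classification without large amounts of labeled data.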

Details Motivation: Demand for WWTPs is high in the Middle East and North Africa, and identifying them precisely supports sustainable water management; traditional methods depend on extensive manual labeling, which is costly and slow, so a more efficient alternative is needed. Method: YOLOv8 is trained on a governmental dataset of 83,566 satellite images as the baseline, and several VLMs (including LLaMA 3.2 Vision, Qwen 2.5 VL, and Gemma 3) are evaluated in zero-shot and few-shot settings on identifying WWTPs and their components (circular/rectangular tanks, aeration basins), using expert-designed prompts that produce JSON outputs with confidence scores and descriptions. Result: On 1,207 validated WWTP locations and an equal number of non-WWTP sites, several VLMs exceed YOLOv8's true positive rate in the zero-shot setting, with Gemma-3 performing best. Conclusion: VLMs, especially in zero-shot mode, can replace YOLOv8 for large-scale, annotation-free WWTP identification from remote sensing imagery, offering a more scalable option for environmental monitoring. Abstract: In regions of the Middle East and North Africa (MENA), there is a high demand for wastewater treatment plants (WWTPs), crucial for sustainable water management. Precise identification of WWTPs from satellite images enables environmental monitoring. Traditional methods like YOLOv8 segmentation require extensive manual labeling. However, studies indicate that vision-language models (VLMs) are an efficient alternative, achieving equivalent or superior results through inherent reasoning and annotation. This study presents a structured methodology for VLM comparison, divided into zero-shot and few-shot streams specifically to identify WWTPs. YOLOv8 was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and UAE: ~85% WWTPs (positives), 15% non-WWTPs (negatives). Evaluated VLMs include LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral), used to identify WWTP components such as circular/rectangular tanks and aeration basins, and to distinguish confounders via expert prompts producing JSON outputs with confidence and descriptions. The dataset comprises 1,207 validated WWTP locations (198 UAE, 354 KSA, 655 Egypt) and equal non-WWTP sites from field/AI data, as 600m x 600m Geo-TIFF images (Zoom 18, EPSG:4326). Zero-shot evaluations on WWTP images showed several VLMs outperforming YOLOv8's true positive rate, with Gemma-3 the highest. Results confirm that VLMs, particularly with zero-shot, can replace YOLOv8 for efficient, annotation-free WWTP classification, enabling scalable remote sensing.
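As a rough illustration of how a true positive rate could be computed from the JSON outputs described above, here is a minimal sketch. The field names `is_wwtp`, `confidence`, and `description`, the 0.5 threshold, and the sample strings are all hypothetical; the abstract only states that the outputs carry a confidence and a description.

```python
import json

def true_positive_rate(raw_outputs, threshold=0.5):
    """Fraction of known-positive images that the model flags as a WWTP.

    raw_outputs: one JSON string per known-WWTP image (hypothetical schema).
    """
    if not raw_outputs:
        return 0.0
    hits = 0
    for raw in raw_outputs:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # an unparseable answer is counted as a miss
        if record.get("is_wwtp") and record.get("confidence", 0.0) >= threshold:
            hits += 1
    return hits / len(raw_outputs)

# Hypothetical model answers for four images that are all true WWTPs.
sample_outputs = [
    '{"is_wwtp": true, "confidence": 0.92, "description": "two circular clarifiers"}',
    '{"is_wwtp": true, "confidence": 0.41, "description": "possible rectangular tank"}',
    '{"is_wwtp": false, "confidence": 0.80, "description": "fish farm ponds"}',
    'not valid json',
]
print(true_positive_rate(sample_outputs))
```

Counting malformed answers as misses is one possible design choice; dropping them from the denominator instead would inflate the rate for models that often fail to emit valid JSON.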

[102] Semantic Mismatch and Perceptual Degradation: A New Perspective on Image Editing Immunity

Shuai Dong,Jie Zhang,Guoying Zhao,Shiguang Shan,Xilin Chen

Main category: cs.CV

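TL;DR: This paper proposes SIFM, a new image immunization method that defends against text-based image editing attacks by manipulating the intermediate features of diffusion models, and introduces a new evaluation metric, ISR, to measure immunization effectiveness more accurately.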

Details Motivation: Existing evaluations of image immunization mainly compare the visual difference between edits of the protected and unprotected image, overlooking the core goal of immunization (breaking the semantic consistency of the edit or inducing perceptual degradation), so they cannot properly assess defense against malicious editing. Method: Synergistic Intermediate Feature Manipulation (SIFM) optimizes two objectives: maximizing feature divergence from the original edit trajectory to break semantic alignment, and minimizing feature norms to induce perceptual degradation. The Immunization Success Rate (ISR), assessed by multimodal large language models (MLLMs), is introduced as a new metric. Result: Experiments show that SIFM achieves state-of-the-art performance in preventing malicious diffusion-based edits, and that ISR reflects the actual effectiveness of immunization methods more accurately. Conclusion: By intervening on intermediate features of the diffusion process, SIFM resists unauthorized text-guided image editing, and together with ISR it gives image immunization a more principled, quantifiable footing. Abstract: Text-guided image editing via diffusion models, while powerful, raises significant concerns about misuse, motivating efforts to immunize images against unauthorized edits using imperceptible perturbations. Prevailing metrics for evaluating immunization success typically rely on measuring the visual dissimilarity between the output generated from a protected image and a reference output generated from the unprotected original. This approach fundamentally overlooks the core requirement of image immunization, which is to disrupt semantic alignment with attacker intent, regardless of deviation from any specific output. We argue that immunization success should instead be defined by the edited output either semantically mismatching the prompt or suffering substantial perceptual degradations, both of which thwart malicious intent. To operationalize this principle, we propose Synergistic Intermediate Feature Manipulation (SIFM), a method that strategically perturbs intermediate diffusion features through dual synergistic objectives: (1) maximizing feature divergence from the original edit trajectory to disrupt semantic alignment with the expected edit, and (2) minimizing feature norms to induce perceptual degradations. Furthermore, we introduce the Immunization Success Rate (ISR), a novel metric designed to rigorously quantify true immunization efficacy for the first time. ISR quantifies the proportion of edits where immunization induces either semantic failure relative to the prompt or significant perceptual degradations, assessed via Multimodal Large Language Models (MLLMs). Extensive experiments show our SIFM achieves the state-of-the-art performance for safeguarding visual content against malicious diffusion-based manipulation.
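The ISR definition in the abstract (the proportion of edits showing either semantic failure or significant perceptual degradation) can be sketched as a simple ratio. The dictionary keys and the sample verdicts below are hypothetical stand-ins for whatever the MLLM judge returns; the judging step itself is not modeled.

```python
def immunization_success_rate(judgements):
    """Share of edits judged to fail semantically OR to be perceptually degraded.

    judgements: list of dicts with two boolean fields (hypothetical names),
    e.g. as produced by an MLLM-based judge.
    """
    if not judgements:
        return 0.0
    successes = sum(
        1 for j in judgements
        if j["semantic_mismatch"] or j["perceptual_degradation"]
    )
    return successes / len(judgements)

# Hypothetical verdicts for four edited outputs of immunized images.
sample_judgements = [
    {"semantic_mismatch": True, "perceptual_degradation": False},
    {"semantic_mismatch": False, "perceptual_degradation": True},
    {"semantic_mismatch": False, "perceptual_degradation": False},
    {"semantic_mismatch": True, "perceptual_degradation": True},
]
print(immunization_success_rate(sample_judgements))
```

An edit that meets both criteria is counted once, which follows from the "either ... or" wording of the definition.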

[103] Dual Attention Guided Defense Against Malicious Edits

Jie Zhang,Shuai Dong,Shiguang Shan,Xilin Chen

Main category: cs.CV

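TL;DR: Proposes Dual Attention-Guided Noise Perturbation (DANP), an immunization method that defends against malicious text-driven image editing by interfering with both the attention mechanism and the noise prediction process of diffusion models.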

Details Motivation: Text-to-image diffusion models are easily misused to create harmful content, and current defenses offer limited protection against malicious tampering, so a more robust immunization mechanism is needed. Method: DANP operates over multiple timesteps, uses a dynamic threshold to generate masks that separate text-relevant from irrelevant regions, and manipulates cross-attention maps while interfering with noise prediction; it lowers attention in relevant regions and raises it in irrelevant ones to misdirect the edit, and maximizes the discrepancy between the injected noise and the predicted noise. Result: Experiments show that DANP prevents malicious edits effectively, clearly outperforms existing methods, and achieves state-of-the-art defense performance. Conclusion: By jointly manipulating attention and noise prediction, DANP provides an effective and robust immunization scheme for diffusion models with good practical potential. Abstract: Recent progress in text-to-image diffusion models has transformed image editing via text prompts, yet this also introduces significant ethical challenges from potential misuse in creating deceptive or harmful content. While current defenses seek to mitigate this risk by embedding imperceptible perturbations, their effectiveness is limited against malicious tampering. To address this issue, we propose a Dual Attention-Guided Noise Perturbation (DANP) immunization method that adds imperceptible perturbations to disrupt the model's semantic understanding and generation process. DANP functions over multiple timesteps to manipulate both cross-attention maps and the noise prediction process, using a dynamic threshold to generate masks that identify text-relevant and irrelevant regions. It then reduces attention in relevant areas while increasing it in irrelevant ones, thereby misguiding the edit towards incorrect regions and preserving the intended targets. Additionally, our method maximizes the discrepancy between the injected noise and the model's predicted noise to further interfere with the generation. By targeting both attention and noise prediction mechanisms, DANP exhibits impressive immunity against malicious edits, and extensive experiments confirm that our method achieves state-of-the-art performance.

[104] Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure

Jooyeol Yun,Jaegul Choo

Main category: cs.CV

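TL;DR: This paper proposes a framework that automates the animation of Scalable Vector Graphics by recovering the semantic structure of SVGs, statistically aggregating multiple weak part predictions so that vision-language models can generate more coherent animations.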

Details Motivation: SVGs are central to web design and the demand to animate them is growing, but current vision-language models struggle with the fragmented, low-level structure of SVGs and produce incoherent animations, so the semantic structure needs to be recovered. Method: A statistical aggregation method combines multiple weak part-segmentation predictions to stably infer the semantic components of an SVG, then reorganizes them into semantically meaningful groups so that a VLM can generate coherent animations. Result: Experiments show that the method clearly outperforms existing approaches, with large gains in the coherence and interpretability of SVG animation. Conclusion: Recovering the semantic structure of SVGs is the key step toward reliable automatic animation and fills in a layer that current VLM systems overlook when handling vector graphics. Abstract: Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision-language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mishandle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance on which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.
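To make the idea of aggregating several noisy part predictions concrete, the sketch below uses plain majority voting over per-element labels and then regroups elements by the winning label. Majority voting is only an illustrative stand-in: the abstract says "statistical aggregation" without specifying the estimator, and the element ids and part names are hypothetical.

```python
from collections import Counter

def aggregate_part_labels(predictions):
    """Majority-vote a part label for each SVG element.

    predictions: list of dicts mapping element id -> predicted part label,
    one dict per weak prediction pass.
    """
    votes = {}
    for prediction in predictions:
        for element_id, label in prediction.items():
            votes.setdefault(element_id, Counter())[label] += 1
    # Highest vote count wins; ties are broken alphabetically for determinism.
    return {
        element_id: min(counter.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for element_id, counter in votes.items()
    }

def group_by_label(labels):
    """Reorganize elements into semantic groups keyed by part label."""
    groups = {}
    for element_id, label in sorted(labels.items()):
        groups.setdefault(label, []).append(element_id)
    return groups

# Three hypothetical noisy passes over the same three SVG paths.
passes = [
    {"path1": "wheel", "path2": "wheel", "path3": "body"},
    {"path1": "wheel", "path2": "body", "path3": "body"},
    {"path1": "wheel", "path2": "wheel", "path3": "window"},
]
print(group_by_label(aggregate_part_labels(passes)))
```

The resulting groups are the kind of structure a VLM could then animate as units, since elements that belong to the same part end up under one key.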

[105] Towards Transferable Defense Against Malicious Image Edits

Jie Zhang,Shuai Dong,Shiguang Shan,Xilin Chen

Main category: cs.CV

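TL;DR: Proposes TDAE, a bimodal defense framework that improves the cross-model transferability of image protection against malicious edits through coordinated optimization of the visual and textual sides.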

Details Motivation: Existing defenses transfer poorly in cross-model evaluations and struggle to counter malicious image editing with diffusion models. Method: The TDAE framework includes the FlatGrad Defense Mechanism (FDM), which adds gradient regularization to steer perturbations toward flat minima and strengthen visual robustness, and Dynamic Prompt Defense (DPD), which periodically refines text embeddings so that edits of the immunized image align with edits of the original image, improving cross-model transferability. Result: Experiments show that TDAE achieves the best performance in both intra-model and cross-model evaluations, markedly improving defense effectiveness and transferability. Conclusion: By coordinating adversarial optimization over image and text, TDAE delivers stronger and more transferable image immunization, offering an effective safeguard for diffusion-based image editing. Abstract: Recent approaches employing imperceptible perturbations in input images have demonstrated promising potential to counter malicious manipulations in diffusion-based image editing systems. However, existing methods suffer from limited transferability in cross-model evaluations. To address this, we propose Transferable Defense Against Malicious Image Edits (TDAE), a novel bimodal framework that enhances image immunity against malicious edits through coordinated image-text optimization. Specifically, at the visual defense level, we introduce FlatGrad Defense Mechanism (FDM), which incorporates gradient regularization into the adversarial objective. By explicitly steering the perturbations toward flat minima, FDM amplifies immune robustness against unseen editing models. For textual enhancement protection, we propose an adversarial optimization paradigm named Dynamic Prompt Defense (DPD), which periodically refines text embeddings to align the editing outcomes of immunized images with those of the original images, then updates the images under optimized embeddings. Through iterative adversarial updates to diverse embeddings, DPD enforces the generation of immunized images that seek a broader set of immunity-enhancing features, thereby achieving cross-model transferability. Extensive experimental results demonstrate that our TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra- and cross-model evaluations.

[106] HGS: Hybrid Gaussian Splatting with Static-Dynamic Decomposition for Compact Dynamic View Synthesis

Kaizhe Zhang,Yijie Zhou,Weizhan Zhang,Caixia Yan,Haipeng Du,yugui xie,Yu-Hui Wen,Yong-Jin Liu

Main category: cs.CV

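TL;DR: This paper proposes Hybrid Gaussian Splatting (HGS), an efficient framework for dynamic novel view synthesis that explicitly separates static and dynamic regions of a scene, achieving high-quality real-time rendering with a much smaller model.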

Details Motivation: Existing 3D Gaussian Splatting methods for dynamic novel view synthesis suffer from high model complexity and redundant parameters, leading to large models and slow rendering that are hard to deploy on resource-constrained devices and in real-time settings. Method: Hybrid Gaussian Splatting (HGS) uses a Static-Dynamic Decomposition (SDD) strategy with Radial Basis Function (RBF) modeling of Gaussian primitives: dynamic regions use time-dependent RBFs to capture temporal variation, static regions share time-invariant parameters to cut redundancy, and a two-stage training strategy improves temporal coherence at the boundaries. Result: HGS reduces model size by up to 98% and renders in real time at up to 125 FPS at 4K resolution on a single RTX 3090; it sustains 160 FPS at 1352 x 1014 on an RTX 3050, has been integrated into a VR system, and shows better visual quality on high-frequency details and abrupt scene changes. Conclusion: Through explicit static-dynamic decoupling and a compact parameterization, HGS keeps rendering quality competitive while greatly improving efficiency and practicality, making it suitable for real-time dynamic novel view synthesis in resource-constrained environments. Abstract: Dynamic novel view synthesis (NVS) is essential for creating immersive experiences. Existing approaches have advanced dynamic NVS by introducing 3D Gaussian Splatting (3DGS) with implicit deformation fields or indiscriminately assigned time-varying parameters, surpassing NeRF-based methods. However, due to excessive model complexity and parameter redundancy, they incur large model sizes and slow rendering speeds, making them inefficient for real-time applications, particularly on resource-constrained devices. To obtain a more efficient model with fewer redundant parameters, in this paper, we propose Hybrid Gaussian Splatting (HGS), a compact and efficient framework explicitly designed to disentangle static and dynamic regions of a scene within a unified representation. The core innovation of HGS lies in our Static-Dynamic Decomposition (SDD) strategy, which leverages Radial Basis Function (RBF) modeling for Gaussian primitives. Specifically, for dynamic regions, we employ time-dependent RBFs to effectively capture temporal variations and handle abrupt scene changes, while for static regions, we reduce redundancy by sharing temporally invariant parameters. Additionally, we introduce a two-stage training strategy tailored for explicit models to enhance temporal coherence at static-dynamic boundaries. Experimental results demonstrate that our method reduces model size by up to 98% and achieves real-time rendering at up to 125 FPS at 4K resolution on a single RTX 3090 GPU. It further sustains 160 FPS at 1352 x 1014 on an RTX 3050 and has been integrated into a VR system. Moreover, HGS achieves comparable rendering quality to state-of-the-art methods while providing significantly improved visual fidelity for high-frequency details and abrupt scene changes.

[107] Enhancing Interpretability for Vision Models via Shapley Value Optimization

Kanglong Fan,Yunqiao Yang,Chen Ma

Main category: cs.CV

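TL;DR: Proposes a new self-explaining neural network framework that adds Shapley value estimation as an auxiliary training task, yielding a fair attribution of model decisions and high-fidelity explanations while preserving performance.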

Details Motivation: Existing explanation methods are either unfaithful to model behavior (post-hoc methods) or sacrifice performance and compatibility (self-explaining models), so an approach is needed that balances fidelity, performance, and interpretability. Method: Shapley value estimation is integrated as an auxiliary task into the training of a self-explaining network; explanations consistent with the model's logic are produced by fairly allocating the prediction score to input features such as image patches, with only minor structural modifications so that compatibility and performance are preserved. Result: The method achieves state-of-the-art interpretability on multiple benchmarks while maintaining model performance and compatibility. Conclusion: The framework balances interpretability and model performance and offers a new route to self-explaining deep networks that are both trustworthy and efficient. Abstract: Deep neural networks have demonstrated remarkable performance across various domains, yet their decision-making processes remain opaque. Although many explanation methods are dedicated to bringing the obscurity of DNNs to light, they exhibit significant limitations: post-hoc explanation methods often struggle to faithfully reflect model behaviors, while self-explaining neural networks sacrifice performance and compatibility due to their specialized architectural designs. To address these challenges, we propose a novel self-explaining framework that integrates Shapley value estimation as an auxiliary task during training, which achieves two key advancements: 1) a fair allocation of the model prediction scores to image patches, ensuring explanations inherently align with the model's decision logic, and 2) enhanced interpretability with minor structural modifications, preserving model performance and compatibility. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art interpretability.
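For readers unfamiliar with the quantity being estimated, the sketch below computes exact Shapley values by enumerating all coalitions of a tiny set of "patches". This is the textbook definition, not the paper's training procedure, which learns to estimate these values because exact enumeration is exponential in the number of patches. The value function `toy_score` is a made-up example.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition (exponential cost).

    players: list of hashable player ids (e.g. image patch names).
    value: function mapping a frozenset of players to a score.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of coalitions of size k that exclude player p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = frozenset(subset)
                phi[p] += weight * (value(coalition | {p}) - value(coalition))
    return phi

def toy_score(coalition):
    """Hypothetical prediction score: A and B add alone, B and C interact."""
    score = 0.0
    if "A" in coalition:
        score += 0.5
    if "B" in coalition:
        score += 0.2
    if "B" in coalition and "C" in coalition:
        score += 0.3
    return score

print(shapley_values(["A", "B", "C"], toy_score))
```

The attributions sum to the score of the full input minus the score of the empty input, which is the "fair allocation" property the abstract refers to; the interaction term between B and C is split evenly between them.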

[108] Mimicking Human Visual Development for Learning Robust Image Representations

Ankita Raj,Kaashika Prajaapat,Tapan Kumar Gandhi,Chetan Arora

Main category: cs.CV

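TL;DR: Proposes a progressive blurring curriculum inspired by human visual development, training on blurred images at first and gradually reducing the blur to improve the generalization and robustness of CNNs.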

Details Motivation: Modern CNNs adapt to shifts in the input distribution far less well than the human visual system, and the way infant vision develops from low acuity offers inspiration for improving them. Method: Training starts on highly blurred images and the blur is gradually reduced as training proceeds, so the network learns global structure before high-frequency detail. Result: The method reduces mean corruption error (mCE) by up to 8.30% on CIFAR-10-C and 4.43% on ImageNet-100-C with minimal impact on in-domain accuracy; it outperforms static blur augmentation and is compatible with techniques such as CutMix and MixUp. Conclusion: A progressive blurring curriculum improves the robustness and generalization of CNNs and challenges the earlier view that early blurring harms model performance. Abstract: The human visual system is remarkably adept at adapting to changes in the input distribution, a capability modern convolutional neural networks (CNNs) still struggle to match. Drawing inspiration from the developmental trajectory of human vision, we propose a progressive blurring curriculum to improve the generalization and robustness of CNNs. Human infants are born with poor visual acuity, gradually refining their ability to perceive fine details. Mimicking this process, we begin training CNNs on highly blurred images during the initial epochs and progressively reduce the blur as training advances. This approach encourages the network to prioritize global structures over high-frequency artifacts, improving robustness against distribution shifts and noisy inputs. Challenging prior claims that blurring in the initial training epochs imposes a stimulus deficit and irreversibly harms model performance, we reveal that early-stage blurring enhances generalization with minimal impact on in-domain accuracy. Our experiments demonstrate that the proposed curriculum reduces mean corruption error (mCE) by up to 8.30% on CIFAR-10-C and 4.43% on ImageNet-100-C datasets, compared to standard training without blurring. Unlike static blur-based augmentation, which applies blurred images randomly throughout training, our method follows a structured progression, yielding consistent gains across various datasets. Furthermore, our approach complements other augmentation techniques, such as CutMix and MixUp, and enhances both natural and adversarial robustness against common attack methods. Code is available at https://github.com/rajankita/Visual_Acuity_Curriculum.
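A blur curriculum of this kind can be sketched as a schedule that maps the epoch to a Gaussian blur strength, plus the kernel that strength implies. The linear decay, the starting sigma of 4.0, and the 10-epoch blur phase are assumptions chosen for illustration; the abstract does not give the actual schedule, which is available in the linked repository.

```python
import math

def blur_sigma(epoch, max_sigma=4.0, blur_epochs=10):
    """Linearly decay the blur strength to zero over the first blur_epochs.

    All three parameters are illustrative assumptions, not the paper's values.
    """
    if epoch >= blur_epochs:
        return 0.0
    return max_sigma * (1.0 - epoch / blur_epochs)

def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian kernel; a single-tap identity when sigma <= 0."""
    if sigma <= 0:
        return [1.0]
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

for epoch in (0, 5, 10):
    sigma = blur_sigma(epoch)
    print(epoch, sigma, len(gaussian_kernel_1d(sigma)))
```

In a training loop, the kernel for the current epoch would be applied separably along image rows and columns before the batch reaches the network; once sigma reaches zero the images pass through unchanged.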

[109] Unified Semantic Transformer for 3D Scene Understanding

Sebastian Koch,Johanna Wald,Hide Matsuki,Pedro Hermosilla,Timo Ropinski,Federico Tombari

Main category: cs.CV

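TL;DR: This paper proposes UNITE, a unified semantic Transformer for 3D scene understanding that predicts multiple 3D semantic tasks (segmentation, instance embeddings, open-vocabulary features, and more) end to end from RGB images in a single model. Trained with 2D distillation and multi-view losses under strong self-supervision, it matches or exceeds task-specific models on several tasks.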

Details Motivation: Existing 3D scene understanding models are usually designed for a specific task, which makes them hard to apply to complex, unstructured real-world environments and leaves them without generality or unification. Method: UNITE is a Transformer-based feed-forward network that performs end-to-end 3D semantic geometry inference directly from RGB images; it is trained with 2D distillation and self-supervision, and introduces new multi-view consistency losses to keep 3D semantics consistent across views. Result: UNITE achieves state-of-the-art performance on several 3D semantic tasks, including 3D scene segmentation, instance embeddings, open-vocabulary features, affordance, and articulation, and on some tasks outperforms specialized models that rely on ground-truth 3D geometry. Conclusion: UNITE unifies diverse 3D semantic tasks in a single model, showing strong generalization and efficient inference and advancing general-purpose 3D scene understanding. Abstract: Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been developed as, and limited to, task-specific solutions. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, as well as affordance and articulations, solely from RGB images. The method is trained using 2D distillation, relies heavily on self-supervision, and leverages novel multi-view losses designed to ensure 3D view consistency. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models, in many cases surpassing methods that operate on ground truth 3D geometry. See the project website at unite-page.github.io.

[110] Optimizing Rank for High-Fidelity Implicit Neural Representations

Julian McGinnis,Florian A. Hölzl,Suprosanna Shit,Florentin Bieder,Paul Friedrich,Mark Mühlau,Björn Menze,Daniel Rueckert,Benedikt Wiestler

Main category: cs.CV

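TL;DR: This paper challenges the conventional view that MLPs cannot represent high-frequency content, arguing that the problem stems from stable rank degradation during training rather than from the architecture. Keeping updates high-rank (for example with the Muon optimizer) substantially improves implicit neural representations across a range of tasks.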

Details Motivation: The authors aim to explain why standard MLPs struggle to fit high-frequency signals in implicit neural representations and to re-examine whether this is an inherent architectural flaw. Method: By analyzing how the network's singular values evolve during training, they find that stable rank degradation is the main cause of the low-frequency bias; they use optimizers with high-rank, near-orthogonal updates (such as Muon) to preserve the network's expressive power. Result: Across natural images, medical images, and novel view synthesis, high-rank optimizers substantially improve INR performance, with PSNR gains of up to 9 dB over the previous state of the art. Conclusion: The limited ability of MLPs to fit high-frequency signals is not an inherent structural problem but a result of rank degradation in the training dynamics; regulating rank during training greatly increases their expressiveness. Abstract: Implicit Neural Representations (INRs) based on vanilla Multi-Layer Perceptrons (MLPs) are widely believed to be incapable of representing high-frequency content. This has directed research efforts towards architectural interventions, such as coordinate embeddings or specialized activation functions, to represent high-frequency signals. In this paper, we challenge the notion that the low-frequency bias of vanilla MLPs is an intrinsic architectural limitation on learning high-frequency content, and argue that it is instead a symptom of stable rank degradation during training. We empirically demonstrate that regulating the network's rank during training substantially improves the fidelity of the learned signal, rendering even simple MLP architectures expressive. Extensive experiments show that using optimizers like Muon, with high-rank, near-orthogonal updates, consistently enhances INR architectures even beyond simple ReLU MLPs. These substantial improvements hold across a diverse range of domains, including natural and medical images, and novel view synthesis, with up to 9 dB PSNR improvements over the previous state-of-the-art. Our project page, which includes code and experimental results, is available at https://muon-inrs.github.io.
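The stable rank the abstract refers to has a standard definition: the squared Frobenius norm of a matrix divided by its squared spectral norm. The sketch below computes it with NumPy and shows how an exactly orthogonalized update has the maximum possible stable rank. The `orthogonalize` helper uses a full SVD purely for illustration; it is a stand-in for, not a reproduction of, the approximate orthogonalization an optimizer like Muon performs.

```python
import numpy as np

def stable_rank(matrix):
    """Stable rank: sum of squared singular values over the largest one squared."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return float(np.sum(singular_values ** 2) / singular_values[0] ** 2)

def orthogonalize(update):
    """Replace every singular value of the update by 1 (illustrative exact-SVD version)."""
    u, _, vt = np.linalg.svd(update, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
raw_update = rng.normal(size=(5, 3))          # hypothetical gradient-like update
low_rank = np.outer([1.0, 2.0, 3.0], [4.0, 5.0])  # rank-one matrix

print(stable_rank(np.eye(4)))                 # equal singular values
print(stable_rank(low_rank))                  # one dominant direction
print(stable_rank(raw_update), stable_rank(orthogonalize(raw_update)))
```

A stable rank near 1 means one direction dominates the matrix, while a value near min(rows, columns) means the singular values are spread evenly, which is the regime that high-rank, near-orthogonal updates push toward.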

[111] EcoScapes: LLM-Powered Advice for Crafting Sustainable Cities

Martin Röhn,Nora Gourmelon,Vincent Christlein

Main category: cs.CV

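TL;DR: Proposes a multi-layered system that combines specialized large language models, satellite imagery analysis, and a knowledge base to help small cities develop effective climate adaptation strategies.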

Details Motivation: With limited personnel, small cities find it hard to integrate data from multiple sources for a comprehensive climate adaptation analysis. Method: A multi-layered system is built by combining specialized large language models (LLMs), satellite imagery analysis, and a knowledge base. Result: The system can help small cities develop climate adaptation strategies more effectively. Conclusion: The proposed system helps small cities overcome the data-integration and resource constraints they face in climate adaptation. Abstract: Climate adaptation is vital for the sustainability and sometimes the mere survival of our urban areas. However, small cities often struggle with limited personnel resources and with integrating vast amounts of data from multiple sources for a comprehensive analysis. To overcome these challenges, this paper proposes a multi-layered system combining specialized LLMs, satellite imagery analysis and a knowledge base to aid in developing effective climate adaptation strategies. The corresponding code can be found at https://github.com/Photon-GitHub/EcoScapes.

[112] Broadening View Synthesis of Dynamic Scenes from Constrained Monocular Videos

Le Jiang,Shaotong Zhu,Yedi Luo,Shayda Moezzi,Sarah Ostadabbas

Main category: cs.CV

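TL;DR: Proposes ExpanDyNeRF, a monocular dynamic NeRF framework built on Gaussian splatting priors and a pseudo-ground-truth generation strategy, which markedly improves novel view synthesis quality at large viewing angles.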

Details Motivation: Existing dynamic NeRF methods render unstable and unrealistic results under large viewpoint changes and cannot handle extreme viewpoint shifts well. Method: Gaussian splatting priors and a pseudo-ground-truth generation strategy are introduced to optimize density and color features, with the SynDM dataset used for training and validation. Result: On SynDM and real-world datasets, ExpanDyNeRF significantly outperforms existing dynamic NeRF methods under extreme viewpoint changes, with higher rendering fidelity. Conclusion: By introducing prior information and pseudo-supervision, ExpanDyNeRF improves novel view synthesis of dynamic scenes under large-angle rotations. Abstract: In dynamic Neural Radiance Fields (NeRF) systems, state-of-the-art novel view synthesis methods often fail under significant viewpoint deviations, producing unstable and unrealistic renderings. To address this, we introduce Expanded Dynamic NeRF (ExpanDyNeRF), a monocular NeRF framework that leverages Gaussian splatting priors and a pseudo-ground-truth generation strategy to enable realistic synthesis under large-angle rotations. ExpanDyNeRF optimizes density and color features to improve scene reconstruction from challenging perspectives. We also present the Synthetic Dynamic Multiview (SynDM) dataset, the first synthetic multiview dataset for dynamic scenes with explicit side-view supervision, created using a custom GTA V-based rendering pipeline. Quantitative and qualitative results on SynDM and real-world datasets demonstrate that ExpanDyNeRF significantly outperforms existing dynamic NeRF methods in rendering fidelity under extreme viewpoint shifts. Further details are provided in the supplementary materials.

[113] DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning

Nakamasa Inoue,Kanoko Goto,Masanari Oi,Martyna Gruszka,Mahiro Ukai,Takumi Hirose,Yusuke Sekikawa

Main category: cs.CV

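TL;DR: This paper proposes DISCODE, a finetuning-free method for evaluating image captions that improves cross-domain robustness through an Adaptive Test-Time (ATT) loss with a Gaussian prior, and builds the multi-domain benchmark MCEval. Experiments show state-of-the-art performance on multiple benchmarks.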

Details Motivation: Image caption evaluation with large vision-language models is unstable under domain shift, and robust metrics that agree with human judgment are lacking. Method: DISCODE introduces an Adaptive Test-Time (ATT) loss based on a Gaussian prior and optimizes the evaluation score efficiently at test time through an analytical solution, with no finetuning. The MCEval benchmark, covering six domains, is built to assess the robustness of evaluation metrics. Result: As a reference-free metric, DISCODE achieves state-of-the-art performance on MCEval and four existing benchmarks, with clearly better cross-domain stability and agreement with human judgments. Conclusion: DISCODE's test-time adaptation improves the cross-domain robustness of image caption evaluation, and MCEval provides a more comprehensive testbed for future work. Abstract: Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.

[114] LCMem: A Universal Model for Robust Image Memorization Detection

Mischa Dombrowski,Felix Nützel,Bernhard Kainz

Main category: cs.CV

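TL;DR: This paper proposes LCMem, a new cross-domain privacy auditing model that treats memorization detection in generative image models as a combination of re-identification and copy detection, and substantially outperforms existing methods on multiple datasets.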

Details Motivation: Existing privacy auditing methods suffer from unreliable memorization detection, limited quantitative evaluation, and poor cross-domain generalization, which holds back the use of generative image models for privacy-preserving data sharing. Method: The Latent Contrastive Memorization Network (LCMem) uses a two-stage training strategy: the first stage learns identity consistency and the second adds augmentation-robust copy detection, treating memorization detection as a unified problem of re-identification and copy detection. Result: Across six benchmark datasets, LCMem improves re-identification by up to 16 percentage points and copy detection by up to 30 percentage points, making memorization detection considerably more reliable and scalable. Existing privacy filters are also found to offer limited performance and robustness. Conclusion: LCMem sets a new standard for cross-domain privacy auditing, offering more reliable and scalable memorization detection and underlining the need for stronger protection mechanisms. Abstract: Recent advances in generative image modeling have achieved visual realism sufficient to deceive human experts, yet their potential for privacy-preserving data sharing remains insufficiently understood. A central obstacle is the absence of reliable memorization detection mechanisms, limited quantitative evaluation, and poor generalization of existing privacy auditing methods across domains. To address this, we propose to view memorization detection as a unified problem at the intersection of re-identification and copy detection, whose complementary goals cover both identity consistency and augmentation-robust duplication, and introduce Latent Contrastive Memorization Network (LCMem), a cross-domain model evaluated jointly on both tasks. LCMem achieves this through a two-stage training strategy that first learns identity consistency before incorporating augmentation-robust copy detection. Across six benchmark datasets, LCMem achieves improvements of up to 16 percentage points on re-identification and 30 percentage points on copy detection, enabling substantially more reliable memorization detection at scale. Our results show that existing privacy filters provide limited performance and robustness, highlighting the need for stronger protection mechanisms. We show that LCMem sets a new standard for cross-domain privacy auditing, offering reliable and scalable memorization detection. Code and model are publicly available at https://github.com/MischaD/LCMem.

[115] The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy

Zhuo Chen,Fanyue Wei,Runze Xu,Jingjing Li,Lixin Duan,Angela Yao,Wen Li

Main category: cs.CV

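TL;DR: This paper proposes SynPS, a new method that addresses the "attention collapse" problem in training-free image editing with large diffusion models when performing complex non-rigid edits such as pose or shape changes. By jointly exploiting positional embeddings and semantic information, SynPS achieves more precise editing control.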

Details Motivation: Existing attention sharing mechanisms suffer from attention collapse on non-rigid edits: either positional embeddings or semantic features dominate content retrieval, causing over-editing or under-editing, so a method is needed that balances the two. Method: An editing measurement first quantifies the required editing magnitude at each denoising step; based on it, an attention synergy mechanism dynamically modulates the influence of positional embeddings and adaptively fuses positional and semantic cues for faithful, accurate edits. Result: Extensive experiments on public and newly built benchmarks show that SynPS clearly outperforms existing methods in editing quality and fidelity. Conclusion: By integrating positional and semantic information, SynPS resolves attention collapse and enables training-free, high-fidelity complex non-rigid image editing. Abstract: Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or under-editing. To address this issue, we introduce SynPS, a method that Synergistically leverages Positional embeddings and Semantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modifications and fidelity preservation. By adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach.

[116] Score-Based Turbo Message Passing for Plug-and-Play Compressive Imaging

Chang Cai,Hao Jiang,Xiaojun Yuan,Ying-Jun Angela Zhang

Main category: cs.CV

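TL;DR: This paper proposes a turbo message passing framework based on score-based generative models (STMP and Q-STMP) for image recovery in compressive imaging. It combines the fast convergence of message passing with the expressive prior of score-based models, its asymptotic performance is accurately predicted by state-evolution equations, and on the FFHQ dataset it shows a favorable performance-complexity tradeoff and robustness to quantized measurements, even at 1 bit.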

Details Motivation: Traditional denoiser-based plug-and-play methods rely on hand-crafted priors that cannot accurately model the complex statistics of natural images, which leads to poor reconstruction in highly underdetermined regimes; score-based generative models characterize image distributions accurately, but using them directly for posterior sampling is computationally prohibitive. Method: Exploiting the theoretical link between score-based generative modeling and empirical Bayes denoising, the authors design a message-passing algorithm (STMP) that integrates a score-based minimum mean-squared error (MMSE) denoiser; for quantized measurements they further propose Q-STMP, which adds a component-wise MMSE dequantization module, and they derive state-evolution (SE) equations describing the asymptotic performance. Result: Experiments on FFHQ show that STMP achieves a significantly better performance-complexity balance than existing methods, that Q-STMP remains robust under 1-bit quantization, and that both algorithms typically converge within 10 iterations. Conclusion: STMP and Q-STMP combine the efficiency of message passing with the strong prior of score-based models, providing a high-performance, low-iteration, quantization-friendly solution for image recovery in compressed sensing. Abstract: Message-passing algorithms have been adapted for compressive imaging by incorporating various off-the-shelf image denoisers. However, these denoisers rely largely on generic or hand-crafted priors and often fall short in accurately capturing the complex statistical structure of natural images. As a result, traditional plug-and-play (PnP) methods often lead to suboptimal reconstruction, especially in highly underdetermined regimes. Recently, score-based generative models have emerged as a powerful framework for accurately characterizing sophisticated image distributions. Yet, their direct use for posterior sampling typically incurs prohibitive computational complexity. In this paper, by exploiting the close connection between score-based generative modeling and empirical Bayes denoising, we devise a message-passing framework that integrates a score-based minimum mean-squared error (MMSE) denoiser for compressive image recovery. The resulting algorithm, named score-based turbo message passing (STMP), combines the fast convergence of message passing with the expressive power of score-based generative priors. For practical systems with quantized measurements, we further propose quantized STMP (Q-STMP), which augments STMP with a component-wise MMSE dequantization module. We demonstrate that the asymptotic performance of STMP and Q-STMP can be accurately predicted by a set of state-evolution (SE) equations. Experiments on the FFHQ dataset demonstrate that STMP strikes a significantly better performance-complexity tradeoff compared with competing baselines, and that Q-STMP remains robust even under 1-bit quantization. Remarkably, both STMP and Q-STMP typically converge within 10 iterations.
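The connection between score functions and empirical Bayes denoising mentioned in the abstract is commonly expressed by Tweedie's formula: for an observation y = x + n with Gaussian noise of variance σ², the MMSE estimate is E[x | y] = y + σ² ∇ log p(y). The sketch below checks this on a scalar Gaussian prior, where the score is known in closed form; in a real system a learned score network would replace `gaussian_marginal_score`. The function names are illustrative, and this shows only the denoiser identity, not the message-passing algorithm.

```python
def gaussian_marginal_score(y, prior_var, noise_var):
    """Score of the noisy marginal p(y) when x ~ N(0, prior_var), n ~ N(0, noise_var)."""
    return -y / (prior_var + noise_var)

def tweedie_denoise(y, score, noise_var):
    """MMSE estimate of x from y via Tweedie's formula."""
    return y + noise_var * score

def gaussian_posterior_mean(y, prior_var, noise_var):
    """Closed-form posterior mean for the same Gaussian model, for comparison."""
    return y * prior_var / (prior_var + noise_var)

y, prior_var, noise_var = 2.0, 3.0, 1.0
score = gaussian_marginal_score(y, prior_var, noise_var)
print(tweedie_denoise(y, score, noise_var))
print(gaussian_posterior_mean(y, prior_var, noise_var))
```

Both printed values agree, which is the point of the identity: a model that supplies the score of the noisy distribution can serve directly as an MMSE denoiser inside an iterative recovery loop.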

[117] S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation

Leon Sick,Lukas Hoyer,Dominik Engel,Pedro Hermosilla,Timo Ropinski

Main category: cs.CV

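TL;DR: This paper proposes an unsupervised video instance segmentation method trained only on real video data. It uses deep motion priors to select high-quality keymasks and applies sparse-to-dense distillation with a Temporal DropLoss for implicit mask propagation, yielding clear performance gains.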

Details Motivation: Existing methods depend on synthetic video data, which cannot accurately model real motion in videos (perspective changes, part motion, camera motion), limiting segmentation performance. Method: Unsupervised instance segmentation is first run on individual frames; deep motion priors are then used to identify high-quality keymasks and build sparse pseudo-annotations; finally, Sparse-To-Dense Distillation with a Temporal DropLoss extends the sparse annotations into dense labels on which the final model is trained. Result: The method outperforms the current state of the art in unsupervised video instance segmentation on multiple benchmarks. Conclusion: Training only on real video data, combined with motion priors and distillation, improves unsupervised video instance segmentation and avoids the distortions introduced by synthetic data. Abstract: In recent years, the state-of-the-art in unsupervised video instance segmentation has heavily relied on synthetic video data, generated from object-centric image datasets such as ImageNet. However, video synthesis by artificially shifting and scaling image instance masks fails to accurately model realistic motion in videos, such as perspective changes, movement by parts of one or multiple instances, or camera motion. To tackle this issue, we propose an unsupervised video instance segmentation model trained exclusively on real video data. We start from unsupervised instance segmentation masks on individual video frames. However, these single-frame segmentations exhibit temporal noise and their quality varies through the video. Therefore, we establish temporal coherence by identifying high-quality keymasks in the video by leveraging deep motion priors. The sparse keymask pseudo-annotations are then used to train a segmentation model for implicit mask propagation, for which we propose a Sparse-To-Dense Distillation approach aided by a Temporal DropLoss. After training the final model on the resulting dense labelset, our approach outperforms the current state-of-the-art across various benchmarks.

[118] A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

Zixin Zhang,Kanghao Chen,Hanqing Wang,Hongfei Zhang,Harold Haodong Chen,Chenfei Liao,Litao Guo,Ying-Cong Chen

Main category: cs.CV

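TL;DR: This paper proposes A4-Agent, a training-free agentic framework for predicting object interaction regions from language instructions. By decoupling high-level reasoning from low-level grounding, it significantly outperforms existing supervised methods in the zero-shot setting.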

Details Motivation: Existing end-to-end models generalize poorly to novel objects and unseen environments and depend on large annotated datasets, which makes them hard to apply in practice. Method: Affordance prediction is decoupled into three stages: a Dreamer generates a visual imagination of the interaction, a Thinker uses a vision-language model to decide which object part to interact with, and a Spotter uses vision foundation models to precisely locate the interaction region, all without fine-tuning. Result: The framework significantly outperforms state-of-the-art supervised methods on multiple benchmarks and shows strong robustness and generalization in real-world scenes. Conclusion: By composing specialized foundation models, A4-Agent achieves training-free, decoupled affordance prediction and offers a more flexible and scalable paradigm for embodied AI. Abstract: Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a Dreamer that employs generative models to visualize how an interaction would look; (2) a Thinker that utilizes large vision-language models to decide what object part to interact with; and (3) a Spotter that orchestrates vision foundation models to precisely locate where the interaction area is. By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.

[119] TACK Tunnel Data (TTD): A Benchmark Dataset for Deep Learning-Based Defect Detection in Tunnels

Andreas Sjölander,Valeria Belloni,Robel Fekadu,Andrea Nascetti

Main category: cs.CV

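TL;DR: This paper introduces a new public image dataset of tunnel defects, covering typical defects such as cracks, water infiltration, and leaching across three different lining types, to support deep learning for automated tunnel inspection.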

Details Motivation: Manual inspection of tunnel defects is time-consuming, subjective, and costly, and existing deep learning methods are held back by the scarcity of tunnel datasets, so a diverse public dataset is needed to advance automated inspection. Method: The authors build an annotated image dataset covering several tunnel lining types and typical defects (cracks, leaching, water infiltration), designed to support supervised, semi-supervised, and unsupervised deep learning methods, with diversity in texture and construction technique to enable studies of model generalization and transferability. Result: The dataset fills a gap in tunnel-specific data, supports a range of deep learning methods for defect detection and segmentation, and enables research on model generality across tunnel types. Conclusion: The dataset is an important resource for advancing automated visual inspection of tunnels and helps improve the safety and efficiency of infrastructure maintenance. Abstract: Tunnels are essential elements of transportation infrastructure, but are increasingly affected by ageing and deterioration mechanisms such as cracking. Regular inspections are required to ensure their safety, yet traditional manual procedures are time-consuming, subjective, and costly. Recent advances in mobile mapping systems and Deep Learning (DL) enable automated visual inspections. However, their effectiveness is limited by the scarcity of tunnel datasets. This paper introduces a new publicly available dataset containing annotated images of three different tunnel linings, capturing typical defects: cracks, leaching, and water infiltration. The dataset is designed to support supervised, semi-supervised, and unsupervised DL methods for defect detection and segmentation. Its diversity in texture and construction techniques also enables investigation of model generalization and transferability across tunnel types. By addressing the critical lack of domain-specific data, this dataset contributes to advancing automated tunnel inspection and promoting safer, more efficient infrastructure maintenance strategies.

[120] SuperCLIP: CLIP with Simple Classification Supervision

Weiheng Zhao,Zilong Huang,Jiashi Feng,Xinggang Wang

Main category: cs.CV

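TL;DR: SuperCLIP is a simple yet effective framework that adds classification supervision to contrastive learning to strengthen fine-grained image-text alignment; with only a tiny increase in compute, it significantly improves CLIP on zero-shot classification, image-text retrieval, and related tasks, and alleviates the performance drop seen in small-batch training.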

Details Motivation: CLIP-like models fail to fully exploit fine-grained semantic signals in text when handling long captions, because the training objective only considers global image-text similarity and lacks token-level supervision, which limits fine-grained alignment. Method: SuperCLIP adds a lightweight linear layer on the vision encoder and uses token-level signals from the text as classification supervision to strengthen visual-textual alignment, requiring no additional annotated data and only a 0.077% increase in total FLOPs. Result: SuperCLIP yields consistent gains on zero-shot classification, image-text retrieval, and purely visual tasks, whether trained on original web data or on rich re-captioned data; it also alleviates CLIP's performance drop under small-batch training. Conclusion: By introducing classification supervision, SuperCLIP effectively strengthens CLIP's fine-grained alignment and improves a range of downstream tasks at almost no additional compute cost, showing broad applicability and practicality. Abstract: Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text, and this issue becomes even more pronounced when dealing with long and detailed captions. This stems from CLIP's training objective, which optimizes only global image-text similarity and overlooks token-level supervision - limiting its ability to achieve fine-grained visual-text alignment. To address this, we propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. By adding only a lightweight linear layer to the vision encoder, SuperCLIP leverages token-level cues to enhance visual-textual alignment - with just a 0.077% increase in total FLOPs, and no need for additional annotated data. Experiments show that SuperCLIP consistently improves zero-shot classification, image-text retrieval, and purely visual tasks. These gains hold regardless of whether the model is trained on original web data or rich re-captioned data, demonstrating SuperCLIP's ability to recover textual supervision in both cases. Furthermore, SuperCLIP alleviates CLIP's small-batch performance drop through classification-based supervision that avoids reliance on large batch sizes. Code and models will be made open source.
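
The idea of adding classification supervision through a linear layer on the vision encoder can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the multi-hot "which tokens occur in the caption" target, the binary cross-entropy form of the loss, the weighting factor `lam`, and all function names are hypothetical; the abstract only states that a lightweight linear layer and token-level cues are used.

```python
import math


def bag_of_tokens_target(caption_token_ids, vocab_size):
    """Multi-hot target: 1.0 for every vocabulary id that occurs in the caption."""
    target = [0.0] * vocab_size
    for token_id in caption_token_ids:
        target[token_id] = 1.0
    return target


def linear(x, weight, bias):
    """Lightweight linear head: one logit per vocabulary entry."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def bce_with_logits(logits, target):
    """Numerically stable binary cross-entropy, averaged over the vocabulary."""
    total = 0.0
    for z, y in zip(logits, target):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def combined_loss(contrastive_loss, image_feature, weight, bias,
                  caption_token_ids, lam=1.0):
    """Assumed objective: contrastive term plus a weighted classification term."""
    logits = linear(image_feature, weight, bias)
    target = bag_of_tokens_target(caption_token_ids, len(bias))
    return contrastive_loss + lam * bce_with_logits(logits, target)
```

Because the classification term is computed per image-caption pair, it does not depend on how many negatives are in the batch, which is one plausible reading of why such supervision would be less sensitive to batch size.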

[121] SignIT: A Comprehensive Dataset and Multimodal Analysis for Italian Sign Language Recognition

Alessia Micieli,Giovanni Maria Farinella,Francesco Ragusa

Main category: cs.CV

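TL;DR: This paper introduces SignIT, a dataset for studying Italian Sign Language (LIS) recognition, with 644 videos and 94 sign classes, together with 2D keypoints and benchmark results.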

Details Motivation: To advance research on Italian Sign Language recognition and fill the gap in high-quality annotated datasets for this area. Method: The authors build a dataset of videos annotated with 2D human keypoints (hands, face, body) and benchmark sign recognition with several state-of-the-art models using RGB frames and temporal information. Result: Experiments show that existing models have performance limitations on this challenging LIS dataset. Conclusion: SignIT provides a valuable resource and benchmark for Italian Sign Language recognition, exposes the shortcomings of current methods, and helps drive future research. Abstract: In this work we present SignIT, a new dataset to study the task of Italian Sign Language (LIS) recognition. The dataset is composed of 644 videos covering 3.33 hours. We manually annotated videos considering a taxonomy of 94 distinct sign classes belonging to 5 macro-categories: Animals, Food, Colors, Emotions and Family. We also extracted 2D keypoints related to the hands, face and body of the users. With the dataset, we propose a benchmark for the sign recognition task, adopting several state-of-the-art models and showing how temporal information, 2D keypoints and RGB frames can influence the performance of these models. Results show the limitations of these models on this challenging LIS dataset. We release data and annotations at the following link: https://fpv-iplab.github.io/SignIT/.

[122] Native Intelligence Emerges from Large-Scale Clinical Practice: A Retinal Foundation Model with Deployment Efficiency

Jia Guo,Jiawei Du,Shengzhu Yang,Shuai Lu,Wenquan Cheng,Kaiwen Zhang,Yihua Sun,Chuhong Yang,Weihang Zhang,Fang Chen,Yilan Wu,Lie Ju,Guochen Ning,Longfei Ma,Huiping Yao,Jinyuan Wang,Peilun Shi,Yukun Zhou,Jie Xu,Pearse A. Keane,Hanruo Liu,Hongen Liao,Ningli Wang,Huiqi Li

Main category: cs.CV

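TL;DR: ReVision is a retinal foundation model learned from a real-world telemedicine program; using about 485,000 fundus images and their diagnostic reports, it performs zero-shot disease detection without task-specific training, performs strongly across 27 ophthalmic benchmarks, and substantially improves clinicians' diagnostic accuracy.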

Details Motivation: Existing retinal foundation models are constrained by curated research datasets that lack real clinical context, and they require extensive task-specific optimization, which makes them hard to deploy in resource-limited settings. Method: Using 485,980 pairs of color fundus photographs and diagnostic reports accumulated over a decade across 162 medical institutions in China, the authors build ReVision, which learns clinical native intelligence directly from the image-text alignment present in real clinical practice. Result: Zero-shot AUROC reaches 0.946 on 12 public benchmarks and 0.952 on 3 independent clinical cohorts; with minimal adaptation it matches extensively fine-tuned models while needing far fewer trainable parameters and labels; the representations transfer to new institutions, imaging modalities, and systemic health prediction; a prospective reader study shows assisted diagnostic accuracy improving by 14.8% on average. Conclusion: Clinical native intelligence can be extracted directly from clinical archives without additional annotation to build medical AI systems suited to a variety of low-resource settings. Abstract: Current retinal foundation models remain constrained by curated research datasets that lack authentic clinical context, and require extensive task-specific optimization for each application, limiting their deployment efficiency in low-resource settings. Here, we show that these barriers can be overcome by building clinical native intelligence directly from real-world medical practice. Our key insight is that large-scale telemedicine programs, where expert centers provide remote consultations across distributed facilities, represent a natural reservoir for learning clinical image interpretation. We present ReVision, a retinal foundation model that learns from the natural alignment between 485,980 color fundus photographs and their corresponding diagnostic reports, accumulated through a decade-long telemedicine program spanning 162 medical institutions across China. Through extensive evaluation across 27 ophthalmic benchmarks, we demonstrate that ReVision enables deployment efficiency with minimal local resources. Without any task-specific training, ReVision achieves zero-shot disease detection with an average AUROC of 0.946 across 12 public benchmarks and 0.952 on 3 independent clinical cohorts. When minimal adaptation is feasible, ReVision matches extensively fine-tuned alternatives while requiring orders of magnitude fewer trainable parameters and labeled examples. The learned representations also transfer effectively to new clinical sites, imaging domains, imaging modalities, and systemic health prediction tasks. 
In a prospective reader study with 33 ophthalmologists, ReVision's zero-shot assistance improved diagnostic accuracy by 14.8% across all experience levels. These results demonstrate that clinical native intelligence can be directly extracted from clinical archives without any further annotation to build medical AI systems suited to various low-resource settings.
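
Since the headline numbers above are AUROC values, a small worked example of the metric may help. This is a generic sketch of the rank-based (Mann-Whitney) definition, not code from the paper; the function name and the toy scores are hypothetical, and in a zero-shot setting the scores would presumably come from image-text similarity, which the abstract does not detail.

```python
def auroc(scores, labels):
    """Probability that a random positive scores higher than a random negative.

    Ties count as half a win. Assumes at least one positive and one negative.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUROC of 0.946 under this definition means that a diseased image receives a higher score than a healthy one in roughly 94.6% of positive-negative pairs.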

[123] DASP: Self-supervised Nighttime Monocular Depth Estimation with Domain Adaptation of Spatiotemporal Priors

Yiheng Huang,Junhong Chen,Anqi Ning,Zhanhong Liang,Nick Michiels,Luc Claesen,Wenyin Liu

Main category: cs.CV

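TL;DR: This paper proposes DASP, a self-supervised framework that uses spatiotemporal priors for nighttime monocular depth estimation; an adversarial branch extracts the priors and a 3D consistency projection loss improves structural consistency, achieving state-of-the-art performance on the Oxford RobotCar and nuScenes datasets.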

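Details Motivation: Existing self-supervised monocular depth estimation methods work well in daytime, but at night their performance drops markedly because of insufficient light, missing texture, and blurry regions caused by moving objects, so a method that adapts to complex nighttime conditions is needed. Method: DASP consists of an adversarial branch and a self-supervised branch. The adversarial branch uses four spatiotemporal priors learning blocks (SPLB), each containing a spatial-based temporal learning module (STLM) and an axial spatial learning module (ASLM), to capture motion variations and multiscale structural information. The self-supervised branch introduces a 3D consistency projection loss that bilaterally projects the target and source frames into a shared 3D space and uses their discrepancy to optimize 3D structure and daytime priors. Result: Experiments on Oxford RobotCar and nuScenes show state-of-the-art nighttime depth estimation; ablation studies validate each component, with particular strength in restoring textureless areas and handling dynamic blur. Conclusion: By introducing spatiotemporal priors and a 3D consistency loss, DASP improves the robustness and accuracy of self-supervised monocular depth estimation at night and offers a new solution for depth estimation under low-light conditions. Abstract: Self-supervised monocular depth estimation has achieved notable success under daytime conditions. However, its performance deteriorates markedly at night due to low visibility and varying illumination, e.g., insufficient light causes textureless areas, and moving objects bring blurry regions. To this end, we propose a self-supervised framework named DASP that leverages spatiotemporal priors for nighttime depth estimation. Specifically, DASP consists of an adversarial branch for extracting spatiotemporal priors and a self-supervised branch for learning. In the adversarial branch, we first design an adversarial network where the discriminator is composed of four devised spatiotemporal priors learning blocks (SPLB) to exploit the daytime priors. In particular, the SPLB contains a spatial-based temporal learning module (STLM) that uses orthogonal differencing to extract motion-related variations along the time axis and an axial spatial learning module (ASLM) that adopts local asymmetric convolutions with global axial attention to capture the multiscale structural information. By combining STLM and ASLM, our model can acquire sufficient spatiotemporal features to restore textureless areas and estimate the blurry regions caused by dynamic objects. 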
In the self-supervised branch, we propose a 3D consistency projection loss to bilaterally project the target frame and source frame into a shared 3D space, and calculate the 3D discrepancy between the two projected frames as a loss to optimize the 3D structural consistency and daytime priors. Extensive experiments on the Oxford RobotCar and nuScenes datasets demonstrate that our approach achieves state-of-the-art performance for nighttime depth estimation. Ablation studies further validate the effectiveness of each component.
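
The 3D consistency idea, lifting both frames into a shared 3D space and penalizing their discrepancy, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's loss: it assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, a known relative pose (`rotation`, `translation`) from source to target, already-matched pixel pairs, and an L1 discrepancy; all names are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with predicted depth to a 3D point in camera coordinates."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def transform(point, rotation, translation):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) to a 3D point."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def consistency_3d(target_points, source_points, rotation, translation):
    """Mean L1 distance between target points and source points moved into the target frame."""
    total = 0.0
    for p_t, p_s in zip(target_points, source_points):
        p_s_in_t = transform(p_s, rotation, translation)
        total += sum(abs(a - b) for a, b in zip(p_t, p_s_in_t))
    return total / len(target_points)
```

When predicted depths and pose are mutually consistent, corresponding points coincide in the shared space and the term goes to zero; any residual signals a depth or pose error that a photometric loss might miss in dark, textureless regions.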

[124] CAPRMIL: Context-Aware Patch Representations for Multiple Instance Learning

Andreas Lolos,Theofilos Christodoulou,Aris L. Moustakas,Stergios Christodoulidis,Maria Vakalopoulou

Main category: cs.CV

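TL;DR: This paper proposes CAPRMIL, a new multiple instance learning (MIL) framework for whole-slide analysis in computational pathology. It simplifies correlation learning by producing context-aware patch embeddings, removing the need for complex attention-based aggregation; it reaches state-of-the-art performance on several public pathology benchmarks while substantially reducing trainable parameters and inference FLOPs and improving GPU memory efficiency and training speed.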

Details Motivation: Because of the gigapixel scale of WSIs and the scarcity of pixel-level annotations, weak supervision has become the standard for deep learning in computational pathology. Existing MIL methods rely on complex attention mechanisms for aggregation, which increases model complexity and compute cost, so a more efficient method that does not depend on a complex aggregator is needed. Method: CAPRMIL projects patch features extracted by a frozen patch encoder into a small set of global context/morphology-aware tokens and uses multi-head self-attention to inject global context with linear computational complexity. The framework is paired with a simple Mean MIL aggregator for efficient slide-level classification. Result: CAPRMIL matches state-of-the-art slide-level performance on several public pathology benchmarks while reducing trainable parameters by 48%-92.8% and inference FLOPs by 52%-99% relative to the best existing MIL methods, and it ranks among the best on GPU memory efficiency and training time. Conclusion: Learning rich, context-aware instance representations before aggregation is an effective and scalable alternative to complex pooling for whole-slide analysis. Abstract: In computational pathology, weak supervision has become the standard for deep learning due to the gigapixel scale of WSIs and the scarcity of pixel-level annotations, with Multiple Instance Learning (MIL) established as the principal framework for slide-level model training. In this paper, we introduce a novel setting for MIL methods, inspired by proceedings in Neural Partial Differential Equation (PDE) Solvers. Instead of relying on complex attention-based aggregation, we propose an efficient, aggregator-agnostic framework that removes the complexity of correlation learning from the MIL aggregator. CAPRMIL produces rich context-aware patch embeddings that promote effective correlation learning on downstream tasks. By projecting patch features -- extracted using a frozen patch encoder -- into a small set of global context/morphology-aware tokens and utilizing multi-head self-attention, CAPRMIL injects global context with linear computational complexity with respect to the bag size. Paired with a simple Mean MIL aggregator, CAPRMIL matches state-of-the-art slide-level performance across multiple public pathology benchmarks, while reducing the total number of trainable parameters by 48%-92.8% versus SOTA MILs, lowering FLOPs during inference by 52%-99%, and ranking among the best models on GPU memory efficiency and training time. Our results indicate that learning rich, context-aware instance representations before aggregation is an effective and scalable alternative to complex pooling for whole-slide analysis. 
Our code is available at https://github.com/mandlos/CAPRMIL
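
To make the linear-complexity claim concrete, here is a rough sketch of routing N patch features through a small set of M tokens. It is an illustrative assumption rather than the released implementation: the soft-assignment routing, single-head attention (the abstract mentions multi-head), residual broadcast, and the names `context_aware_patches`, `token_queries`, and `mean_mil` are all hypothetical. The point is only that patch-to-token and token-to-patch steps cost O(N·M) and attention among tokens costs O(M²), so the total is linear in the bag size N for fixed M.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def context_aware_patches(patches, token_queries):
    """Inject global context into each patch via a small set of tokens."""
    dim, m = len(patches[0]), len(token_queries)
    # 1) Each patch softly assigns itself to the M tokens: O(N * M).
    assign = [softmax([dot(p, q) for q in token_queries]) for p in patches]
    # 2) Each token is the assignment-weighted mean of patch features.
    tokens = []
    for j in range(m):
        mass = sum(a[j] for a in assign)
        tokens.append([sum(a[j] * p[d] for a, p in zip(assign, patches)) / mass
                       for d in range(dim)])
    # 3) Self-attention among the M tokens only: O(M^2), independent of N.
    mixed = []
    for t in tokens:
        w = softmax([dot(t, other) / math.sqrt(dim) for other in tokens])
        mixed.append([sum(w[k] * tokens[k][d] for k in range(m)) for d in range(dim)])
    # 4) Broadcast the mixed tokens back to every patch as a residual: O(N * M).
    return [[p[d] + sum(a[j] * mixed[j][d] for j in range(m)) for d in range(dim)]
            for a, p in zip(assign, patches)]


def mean_mil(embeddings):
    """Mean MIL aggregator: average the (context-aware) patch embeddings."""
    n = len(embeddings)
    return [sum(e[d] for e in embeddings) / n for d in range(len(embeddings[0]))]
```

With the context already baked into each patch embedding, the slide-level representation is just `mean_mil(context_aware_patches(patches, token_queries))`, which is why a trivial aggregator can suffice in this design.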

[125] HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion

Yifang Xu,Benxiang Zhai,Yunzhuo Sun,Ming Li,Yang Li,Sidan Du

Main category: cs.CV

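TL;DR: This paper proposes HiFi-Portrait for zero-shot high-fidelity portrait generation; a face refiner and a landmark generator provide fine-grained multi-face features and 3D-aware face landmarks, and HiFi-Net fuses the features and combines them with the landmarks to improve identity fidelity and face control.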

Details Motivation: When given multiple reference images of the same identity, existing methods produce lower-fidelity portraits and struggle to customize face attributes precisely. Method: A face refiner and a landmark generator extract fine-grained multi-face features and 3D-aware face landmarks; HiFi-Net fuses the multi-face features and aligns them with the landmarks; an ID-based dataset is constructed to train the model. Result: Experiments show the method outperforms current state-of-the-art approaches in face similarity and controllability and is compatible with previous SDXL-based works. Conclusion: HiFi-Portrait effectively improves the quality and controllability of identity-preserved portrait generation from multiple reference images and has good application potential. Abstract: Recent advancements in diffusion-based technologies have made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images from the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce the face refiner and landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks. The landmarks include the reference ID and the target attributes. Then, we design HiFi-Net to fuse multi-face features and align them with landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training HiFi-Portrait. Extensive experimental results demonstrate that our method surpasses the SOTA approaches in face similarity and controllability. Furthermore, our method is also compatible with previous SDXL-based works.

[126] TAT: Task-Adaptive Transformer for All-in-One Medical Image Restoration

Zhiwen Yang,Jiaju Zhang,Yang Yi,Jian Liang,Bingzheng Wei,Yan Xu

Main category: cs.CV

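TL;DR: Proposes a task-adaptive Transformer (TAT) framework that addresses task interference and task imbalance in medical image restoration, reaching state-of-the-art performance on PET synthesis, CT denoising, and MRI super-resolution.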

Details Motivation: Because medical imaging modalities and degradation types differ substantially, conventional All-in-One models face task interference and task imbalance in multi-task learning, so gradient updates and optimization across tasks need to be coordinated effectively. Method: A task-adaptive weight generation strategy produces task-specific parameters to mitigate gradient conflicts, and a task-adaptive loss balancing strategy dynamically adjusts each task's loss weight to account for differences in learning difficulty. Result: On PET synthesis, CT denoising, and MRI super-resolution, TAT achieves state-of-the-art performance in both task-specific and All-in-One settings. Conclusion: Through its task-adaptive mechanisms, TAT effectively addresses interference and imbalance in multi-task medical image restoration and shows good generality and application potential. Abstract: Medical image restoration (MedIR) aims to recover high-quality medical images from their low-quality counterparts. Recent advancements in MedIR have focused on All-in-One models capable of simultaneously addressing multiple different MedIR tasks. However, due to significant differences in both modality and degradation types, using a shared model for these diverse tasks requires careful consideration of two critical inter-task relationships: task interference, which occurs when conflicting gradient update directions arise across tasks on the same parameter, and task imbalance, which refers to uneven optimization caused by varying learning difficulties inherent to each task. To address these challenges, we propose a task-adaptive Transformer (TAT), a novel framework that dynamically adapts to different tasks through two key innovations. First, a task-adaptive weight generation strategy is introduced to mitigate task interference by generating task-specific weight parameters for each task, thereby eliminating potential gradient conflicts on shared weight parameters. Second, a task-adaptive loss balancing strategy is introduced to dynamically adjust loss weights based on task-specific learning difficulties, preventing task domination or undertraining. Extensive experiments demonstrate that our proposed TAT achieves state-of-the-art performance in three MedIR tasks--PET synthesis, CT denoising, and MRI super-resolution--both in task-specific and All-in-One settings. Code is available at https://github.com/Yaziwel/TAT.
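
One way to picture dynamic loss balancing based on learning difficulty is the sketch below. It is not the TAT formulation, which the abstract does not specify; it is a generic scheme that assumes difficulty can be proxied by how little a task's loss has fallen relative to its initial value. The function names, the `temperature` parameter, and the softmax normalization are all hypothetical choices.

```python
import math


def difficulty_weights(current_losses, initial_losses, temperature=1.0):
    """Give larger weights to tasks whose loss has decreased the least.

    Weights are a softmax over loss ratios, rescaled to sum to the task count
    so that equally difficult tasks each receive weight 1.0.
    """
    ratios = [c / i for c, i in zip(current_losses, initial_losses)]
    peak = max(ratios)
    exps = [math.exp((r - peak) / temperature) for r in ratios]
    total = sum(exps)
    return [len(ratios) * e / total for e in exps]


def balanced_loss(current_losses, initial_losses, temperature=1.0):
    """Weighted sum of per-task losses using the difficulty-based weights."""
    weights = difficulty_weights(current_losses, initial_losses, temperature)
    return sum(w * loss for w, loss in zip(weights, current_losses))
```

For example, with three tasks such as PET synthesis, CT denoising, and MRI super-resolution, a task that is still at its initial loss while the others have halved theirs would receive the largest weight, which is the behavior intended to prevent undertraining.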

[127] CLNet: Cross-View Correspondence Makes a Stronger Geo-Localizationer

Xianwei Cao,Dou Quan,Shuang Wang,Ning Huyan,Wei Wang,Yunan Li,Licheng Jiao

Main category: cs.CV

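TL;DR: This paper proposes CLNet, a novel correspondence-aware feature refinement framework for image retrieval-based cross-view geo-localization (IRCVGL); by explicitly modeling spatial correspondences, it achieves state-of-the-art performance on several public datasets.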

Details Motivation: Existing methods rely mainly on learning global representations or implicit feature alignment and struggle to model the explicit spatial correspondences that accurate geo-localization requires. Method: CLNet contains three learnable modules: a Neural Correspondence Map (NCM) for spatial alignment via latent correspondence fields, a Nonlinear Embedding Converter (NEC) for cross-view feature mapping, and a Global Feature Recalibration (GFR) module that reweights feature channels using spatial cues. Result: Extensive experiments on four benchmarks (CVUSA, CVACT, VIGOR, and University-1652) show that CLNet outperforms existing methods in both accuracy and interpretability. Conclusion: CLNet effectively bridges the semantic and geometric gaps between cross-view images, jointly captures high-level semantics and fine-grained alignment, and improves the performance and generalization of cross-view geo-localization. Abstract: Image retrieval-based cross-view geo-localization (IRCVGL) aims to match images captured from significantly different viewpoints, such as satellite and street-level images. Existing methods predominantly rely on learning robust global representations or implicit feature alignment, which often fail to model explicit spatial correspondences crucial for accurate localization. In this work, we propose a novel correspondence-aware feature refinement framework, termed CLNet, that explicitly bridges the semantic and geometric gaps between different views. CLNet decomposes the view alignment process into three learnable and complementary modules: a Neural Correspondence Map (NCM) that spatially aligns cross-view features via latent correspondence fields; a Nonlinear Embedding Converter (NEC) that remaps features across perspectives using an MLP-based transformation; and a Global Feature Recalibration (GFR) module that reweights informative feature channels guided by learned spatial cues. The proposed CLNet can jointly capture both high-level semantics and fine-grained alignments. Extensive experiments on four public benchmarks, CVUSA, CVACT, VIGOR, and University-1652, demonstrate that our proposed CLNet achieves state-of-the-art performance while offering better interpretability and generalizability.

[128] FoodLogAthl-218: Constructing a Real-World Food Image Dataset Using Dietary Management Applications

Mitsuki Watanabe,Sosuke Amano,Kiyoharu Aizawa,Yoko Yamakata

Main category: cs.CV

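TL;DR: This paper introduces FoodLogAthl-218, a food image dataset built from real users' dietary records, with 6,925 images and 218 food categories, rich metadata, and characteristics closer to real usage; it also proposes three evaluation tasks: standard classification, incremental fine-tuning, and context-aware classification.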

Details Motivation: Existing food image datasets are mostly web-crawled and differ from the meal photos users actually take, which limits model performance in real applications, so a dataset drawn from real usage is needed to make models more practical. Method: Real users' meal records are collected from the dietary management application FoodLog Athl to build a dataset with images, bounding boxes, and rich metadata, and three evaluation tasks are proposed: standard classification, incremental fine-tuning along the temporal stream of logs, and multi-dish classification using whole-meal context. Large multimodal models (LMMs) are evaluated on these tasks. Result: The resulting FoodLogAthl-218 dataset contains 6,925 images, 218 food categories, and 14,349 bounding boxes, offering real, diverse meal images with contextual information, along with three challenging tasks for evaluating model performance. Conclusion: FoodLogAthl-218 is a food image dataset closer to real application scenarios and can foster more practical, context-aware dietary recognition models. Abstract: Food image classification models are crucial for dietary management applications because they reduce the burden of manual meal logging. However, most publicly available datasets for training such models rely on web-crawled images, which often differ from users' real-world meal photos. In this work, we present FoodLogAthl-218, a food image dataset constructed from real-world meal records collected through the dietary management application FoodLog Athl. The dataset contains 6,925 images across 218 food categories, with a total of 14,349 bounding boxes. Rich metadata, including meal date and time, anonymized user IDs, and meal-level context, accompany each image. Unlike conventional datasets, where a predefined class set guides web-based image collection, our data begins with user-submitted photos, and labels are applied afterward. This yields greater intra-class diversity, a natural frequency distribution of meal types, and casual, unfiltered images intended for personal use rather than public sharing. In addition to (1) a standard classification benchmark, we introduce two FoodLog-specific tasks: (2) an incremental fine-tuning protocol that follows the temporal stream of users' logs, and (3) a context-aware classification task where each image contains multiple dishes, and the model must classify each dish by leveraging the overall meal context. We evaluate these tasks using large multimodal models (LMMs). The dataset is publicly available at https://huggingface.co/datasets/FoodLog/FoodLogAthl-218.

[129] LLM-driven Knowledge Enhancement for Multimodal Cancer Survival Prediction

Chenyu Zhao,Yingxue Xu,Fengtao Zhou,Yihui Wang,Hao Chen

Main category: cs.CV

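TL;DR: Proposes KEMM, which enhances multimodal cancer survival prediction with expert reports and prognostic background knowledge refined by large language models; a KECM attention module improves feature discriminability and modality alignment, and the model significantly outperforms existing methods.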

Details Motivation: Existing multimodal survival prediction methods based on pathology images and genomic data face high-dimensional redundancy, difficult feature extraction, and modality alignment challenges, and a simple follow-up label alone cannot effectively supervise such a complex task. Method: KEMM integrates LLM-refined expert reports with LLM-generated prognostic background knowledge (PBK) and introduces a knowledge-enhanced cross-modal (KECM) attention module that guides the network toward key survival-relevant features for more effective multimodal fusion and prediction. Result: Experiments on five datasets show that KEMM achieves state-of-the-art performance in cancer survival prediction, significantly outperforming existing methods. Conclusion: Incorporating clinical expert reports and LLM-generated background knowledge effectively improves multimodal survival prediction, and the KECM module helps reduce high-dimensional redundancy and strengthen cross-modal alignment, suggesting a new direction for future research. Abstract: Current multimodal survival prediction methods typically rely on pathology images (WSIs) and genomic data, both of which are high-dimensional and redundant, making it difficult to extract discriminative features from them and align different modalities. Moreover, using a simple survival follow-up label is insufficient to supervise such a complex task. To address these challenges, we propose KEMM, an LLM-driven Knowledge-Enhanced Multimodal Model for cancer survival prediction, which integrates expert reports and prognostic background knowledge. 1) Expert reports, provided by pathologists on a case-by-case basis and refined by a large language model (LLM), offer succinct and clinically focused diagnostic statements. This information may typically suggest different survival outcomes. 2) Prognostic background knowledge (PBK), generated concisely by an LLM, provides valuable prognostic background knowledge on different cancer types, which also enhances survival prediction. To leverage this knowledge, we introduce the knowledge-enhanced cross-modal (KECM) attention module. KECM can effectively guide the network to focus on discriminative and survival-relevant features from highly redundant modalities. Extensive experiments on five datasets demonstrate that KEMM achieves state-of-the-art performance. The code will be released upon acceptance.

[130] TUMTraf EMOT: Event-Based Multi-Object Tracking Dataset and Baseline for Traffic Scenarios

Mengyu Li,Xingcheng Zhou,Guang Chen,Alois Knoll,Hu Cao

Main category: cs.CV

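TL;DR: Proposes the first event-camera multi-object tracking dataset for Intelligent Transportation Systems (ITS), aimed at the poor performance of conventional frame-based cameras under low light and high-speed motion, and builds a tracking-by-detection benchmark with a specialized feature extractor on it, achieving excellent performance.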

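Details Motivation: Conventional frame-based cameras perform poorly under dim lighting and high-speed motion, whereas event cameras offer low latency, high dynamic range, and high temporal resolution; however, event-based research in ITS is sparse and lacks dedicated datasets. Method: The authors build a pilot dataset for event-based multi-object detection and tracking in ITS, establish a tracking-by-detection benchmark, and design a specialized feature extractor for experimental validation. Result: The proposed baseline achieves excellent detection and tracking performance on the dataset, demonstrating the potential of event cameras in ITS. Conclusion: The dataset provides a foundation for applying event cameras in intelligent transportation systems and advances event-based vision in the ITS field. Abstract: In Intelligent Transportation Systems (ITS), multi-object tracking is primarily based on frame-based cameras. However, these cameras tend to perform poorly under dim lighting and high-speed motion conditions. Event cameras, characterized by low latency, high dynamic range and high temporal resolution, have considerable potential to mitigate these issues. Compared to frame-based vision, there are far fewer studies on event-based vision. To address this research gap, we introduce an initial pilot dataset tailored for event-based ITS, covering vehicle and pedestrian detection and tracking. We establish a tracking-by-detection benchmark with a specialized feature extractor based on this dataset, achieving excellent performance.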

[131] FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos

Zhaolun Li,Jichang Li,Yinqi Cai,Junye Chen,Xiaonan Luo,Guanbin Li,Rushi Lan

Main category: cs.CV

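TL;DR: This paper proposes FakeRadar, a novel deepfake video detection framework that uses large-scale pretrained models and a forgery outlier probing mechanism to improve generalization in cross-domain scenarios.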

Details Motivation: Existing detection methods rely on manipulation-specific cues, generalize poorly to new forgery techniques, and struggle to adapt to unseen forgery patterns. Method: Forgery Outlier Probing uses dynamic subcluster modeling and cluster-conditional outlier generation to simulate novel forgery features; Outlier-Guided Tri-Training optimizes the detector with outlier-driven contrastive learning and an outlier-conditioned cross-entropy loss. Result: Experiments show that FakeRadar outperforms existing methods on multiple benchmark datasets, especially in cross-domain evaluations. Conclusion: FakeRadar effectively improves the generalization of deepfake detection in real-world scenarios and copes better with continually emerging forgery techniques. Abstract: In this paper, we propose FakeRadar, a novel deepfake video detection framework designed to address the challenges of cross-domain generalization in real-world scenarios. Existing detection methods typically rely on manipulation-specific cues, performing well on known forgery types but exhibiting severe limitations against emerging manipulation techniques. This poor generalization stems from their inability to adapt effectively to unseen forgery patterns. To overcome this, we leverage large-scale pretrained models (e.g., CLIP) to proactively probe the feature space, explicitly highlighting distributional gaps between real videos, known forgeries, and unseen manipulations. Specifically, FakeRadar introduces Forgery Outlier Probing, which employs dynamic subcluster modeling and cluster-conditional outlier generation to synthesize outlier samples near boundaries of estimated subclusters, simulating novel forgery artifacts beyond known manipulation types. Additionally, we design Outlier-Guided Tri-Training, which optimizes the detector to distinguish real, fake, and outlier samples using proposed outlier-driven contrastive learning and outlier-conditioned cross-entropy losses. Experiments show that FakeRadar outperforms existing methods across various benchmark datasets for deepfake video detection, particularly in cross-domain evaluations, by handling the variety of emerging manipulation techniques.

[132] WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Wenqiang Sun,Haiyu Zhang,Haoyuan Wang,Junta Wu,Zehan Wang,Zhenwei Wang,Yunhong Wang,Jun Zhang,Tengfei Wang,Chunchao Guo

Main category: cs.CV

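TL;DR: WorldPlay is a streaming video diffusion model that achieves fast, interactive world modeling with long-term geometric consistency through a Dual Action Representation, a Reconstituted Context Memory, and a Context Forcing distillation method.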

Details Motivation: To resolve the trade-off between speed and memory in existing methods and enable real-time interactive world modeling that maintains long-term geometric consistency. Method: A Dual Action Representation strengthens control from user inputs; a Reconstituted Context Memory dynamically rebuilds context and uses temporal reframing to retain key past frames; Context Forcing, a distillation method, aligns the memory context of teacher and student to preserve the ability to use long-range information. Result: WorldPlay generates long-horizon streaming video at 720p and 24 FPS with superior consistency and generalization across scenes, comparing favorably with existing techniques. Conclusion: WorldPlay effectively resolves the tension between real-time speed and long-term consistency and offers an efficient, scalable approach to interactive 3D scene modeling. Abstract: This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. Project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.

[133] Distill Video Datasets into Images

Zhenghao Zhao,Haoxuan Wang,Kai Wang,Yuzhang Shang,Yuan Hong,Yan Yan

Main category: cs.CV

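TL;DR: This paper proposes SFVD, a single-frame video set distillation framework that distills videos into key frames for each class and uses differentiable interpolation to generate video sequences, addressing the optimization difficulty that the temporal dimension introduces in video dataset distillation.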

Details Motivation: In existing video dataset distillation methods, the temporal dimension of video introduces a large number of learnable parameters, which complicates optimization, hinders convergence, and leads to poor performance. Method: Single-Frame Video set Distillation (SFVD) captures video semantics with single frames, generates video sequences through differentiable interpolation, and updates only the key frames for better optimization efficiency; temporal information is incorporated by combining sampled real videos through a channel reshaping layer. Result: Experiments on multiple benchmarks show that SFVD substantially outperforms prior methods, with improvements of up to 5.3% on MiniUCF. Conclusion: SFVD offers a more effective solution for video dataset distillation, maintaining high performance while substantially reducing data size and optimization difficulty. Abstract: Dataset distillation aims to synthesize compact yet informative datasets that allow models trained on them to achieve performance comparable to training on the full dataset. While this approach has shown promising results for image data, extending dataset distillation methods to video data has proven challenging and often leads to suboptimal performance. In this work, we first identify the core challenge in video set distillation as the substantial increase in learnable parameters introduced by the temporal dimension of video, which complicates optimization and hinders convergence. To address this issue, we observe that a single frame is often sufficient to capture the discriminative semantics of a video. Leveraging this insight, we propose Single-Frame Video set Distillation (SFVD), a framework that distills videos into highly informative frames for each class. Using differentiable interpolation, these frames are transformed into video sequences and matched with the original dataset, while updates are restricted to the frames themselves for improved optimization efficiency. To further incorporate temporal information, the distilled frames are combined with sampled real videos during the matching process through a channel reshaping layer. Extensive experiments on multiple benchmarks demonstrate that SFVD substantially outperforms prior methods, achieving improvements of up to 5.3% on MiniUCF, thereby offering a more effective solution.
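
The step of turning distilled frames into video sequences by differentiable interpolation could look roughly like the sketch below. The abstract does not say which interpolation is used, so linear interpolation along time is an assumption here, as are the function name and the flattened-list frame representation. Linear blending is differentiable with respect to the key-frame values, so in an autodiff framework gradients from a video-level matching loss would flow back only to the frames.

```python
def interpolate_frames(key_frames, num_frames):
    """Expand K key frames (flattened pixel lists) into a num_frames-long clip.

    A single key frame is simply repeated; several key frames are linearly
    blended so that the first and last output frames equal the first and last keys.
    """
    k = len(key_frames)
    if k == 1:
        return [list(key_frames[0]) for _ in range(num_frames)]
    video = []
    for t in range(num_frames):
        # Fractional position of output frame t on the key-frame axis [0, k-1].
        pos = t * (k - 1) / (num_frames - 1) if num_frames > 1 else 0.0
        lo = min(int(pos), k - 2)
        alpha = pos - lo
        video.append([(1 - alpha) * a + alpha * b
                      for a, b in zip(key_frames[lo], key_frames[lo + 1])])
    return video
```

Under this reading, the learnable parameter count scales with the number of key frames rather than with clip length, which matches the stated motivation of avoiding the parameter blow-up from the temporal dimension.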

[134] AMD-HookNet++: Evolution of AMD-HookNet with Hybrid CNN-Transformer Feature Enhancement for Glacier Calving Front Segmentation

Fei Wu,Marcel Dreier,Nora Gourmelon,Sebastian Wind,Jianlin Zhang,Thorsten Seehaus,Matthias Braun,Andreas Maier,Vincent Christlein

Main category: cs.CV

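TL;DR: This paper proposes AMD-HookNet++, a new hybrid CNN-Transformer model for glacier segmentation and precise calving front delineation; by combining global context with local detail, it achieves state-of-the-art performance on synthetic aperture radar images.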

Details Motivation: The local nature and translational invariance of convolution limit a model's ability to capture long-range dependencies, leaving existing methods inadequate for calving front segmentation, so a new method that preserves both local detail and global context is needed. Method: AMD-HookNet++ uses a two-branch structure: a Transformer-based context branch captures long-range dependencies and a CNN-based target branch preserves local detail; an enhanced spatial-channel attention module strengthens the hybrid feature representation, and a pixel-to-pixel contrastive deep supervision optimizes the model. Result: On the CaFFe dataset, AMD-HookNet++ reaches an IoU of 78.2 and an HD95 of 1,318 m with an MDE of 367 m, outperforming existing methods and producing smoother calving front delineations. Conclusion: AMD-HookNet++ effectively combines the strengths of CNNs and Transformers, achieves state-of-the-art glacier segmentation, resolves the jagged-edge problem of pure Transformer approaches, and has good application prospects. Abstract: The dynamics of glaciers and ice shelf fronts significantly impact the mass balance of ice sheets and coastal sea levels. To effectively monitor glacier conditions, it is crucial to consistently estimate positional shifts of glacier calving fronts. AMD-HookNet first introduced a pure two-branch convolutional neural network (CNN) for glacier segmentation. Yet, the local nature and translational invariance of convolution operations, while beneficial for capturing low-level details, restrict the model's ability to maintain long-range dependencies. In this study, we propose AMD-HookNet++, a novel advanced hybrid CNN-Transformer feature enhancement method for segmenting glaciers and delineating calving fronts in synthetic aperture radar images. Our hybrid structure consists of two branches: a Transformer-based context branch to capture long-range dependencies, which provides global contextual information in a larger view, and a CNN-based target branch to preserve local details. To strengthen the representation of the connected hybrid features, we devise an enhanced spatial-channel attention module to foster interactions between the hybrid CNN-Transformer branches through dynamically adjusting the token relationships from both spatial and channel perspectives. Additionally, we develop a pixel-to-pixel contrastive deep supervision to optimize our hybrid model by integrating pixelwise metric learning into glacier segmentation. 
Through extensive experiments and comprehensive quantitative and qualitative analyses on the challenging glacier segmentation benchmark dataset CaFFe, we show that AMD-HookNet++ sets a new state of the art with an IoU of 78.2 and a HD95 of 1,318 m, while maintaining a competitive MDE of 367 m. More importantly, our hybrid model produces smoother delineations of calving fronts, resolving the issue of jagged edges typically seen in pure Transformer-based approaches.
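
The IoU figure reported above measures the overlap between a predicted mask and its ground truth. The following is a minimal sketch of that metric for a single pair of binary masks; the function name `iou_percent` and the nested-list mask format are assumptions made for illustration, and the benchmark's actual evaluation protocol (for example, how classes or images are averaged) is not described in this entry.

```python
def iou_percent(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1), in percent."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1  # pixel is foreground in both masks
            if p or t:
                union += 1         # pixel is foreground in at least one mask
    if union == 0:
        return 100.0               # both masks empty: treat as perfect agreement
    return 100.0 * intersection / union


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are foreground in either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(iou_percent(pred, target))
```

Distance-based metrics such as HD95 and MDE compare front positions in metres and need georeferenced contours, so they are left out of this sketch.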

[135] A Multicenter Benchmark of Multiple Instance Learning Models for Lymphoma Subtyping from HE-stained Whole Slide Images

Rao Muhammad Umer,Daniel Sens,Jonathan Noll,Christian Matek,Lukas Wolfseher,Rainer Spang,Ralf Huss,Johannes Raffler,Sarah Reinke,Wolfram Klapper,Katja Steiger,Kristina Schwamborn,Carsten Marr

Main category: cs.CV

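TL;DR: This study presents the first multicenter benchmark dataset for lymphoma subtyping, covering four common lymphoma subtypes and normal tissue, and systematically evaluates five pathology foundation models at different magnifications. Models perform well on in-distribution tests (balanced accuracy above 80%) but drop substantially on out-of-distribution data (around 60%), exposing a generalization challenge.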

Details Motivation: Current lymphoma diagnosis depends on costly equipment and skilled personnel, and the workflow is complex and slow. Deep learning could support diagnosis from routine HE-stained slides, but a comprehensive benchmark on multicenter data is lacking. Method: The authors build a multicenter dataset covering four common lymphoma subtypes and healthy controls, then evaluate five publicly available pathology foundation models (H-optimus-1, H0-mini, Virchow2, UNI2, Titan) combined with two multiple instance learning aggregators (AB-MIL and TransMIL) at three magnifications (10x, 20x, 40x). Result: On in-distribution test sets, all models exceed 80% multiclass balanced accuracy at every magnification, with small differences between models and similar results for both aggregators. 40x resolution is sufficient, and neither higher resolutions nor cross-magnification aggregation brings gains. On out-of-distribution test sets, performance drops sharply to around 60%. Conclusion: Current deep learning models show promise for multicenter lymphoma subtyping, but generalization remains the main bottleneck. Larger multicenter studies covering more rare subtypes are needed, and the authors provide an automated benchmarking pipeline to support follow-up work. Abstract: Timely and accurate lymphoma diagnosis is essential for guiding cancer treatment. Standard diagnostic practice combines hematoxylin and eosin (HE)-stained whole slide images with immunohistochemistry, flow cytometry, and molecular genetic tests to determine lymphoma subtypes, a process that requires costly equipment and skilled personnel and causes treatment delays. Deep learning methods could assist pathologists by extracting diagnostic information from routinely available HE-stained slides, yet comprehensive benchmarks for lymphoma subtyping on multicenter data are lacking. In this work, we present the first multicenter lymphoma benchmarking dataset covering four common lymphoma subtypes and healthy control tissue. We systematically evaluate five publicly available pathology foundation models (H-optimus-1, H0-mini, Virchow2, UNI2, Titan) combined with attention-based (AB-MIL) and transformer-based (TransMIL) multiple instance learning aggregators across three magnifications (10x, 20x, 40x). On in-distribution test sets, models achieve multiclass balanced accuracies exceeding 80% across all magnifications, with all foundation models performing similarly and both aggregation methods showing comparable results. The magnification study reveals that 40x resolution is sufficient, with no performance gains from higher resolutions or cross-magnification aggregation. However, on out-of-distribution test sets, performance drops substantially to around 60%, highlighting significant generalization challenges.
To advance the field, larger multicenter studies covering additional rare lymphoma subtypes are needed. We provide an automated benchmarking pipeline to facilitate such future research.
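
Attention-based multiple instance learning (AB-MIL), one of the two aggregators evaluated above, turns a bag of patch features into one slide-level feature through a learned weighted average. The sketch below shows the basic (non-gated) attention pooling step in plain Python. The parameter names `V` and `w` and the toy values are assumptions for illustration; in practice they are learned, the patch features come from a foundation model, and the benchmark's exact configuration is not given in this entry.

```python
import math


def attention_mil_pool(instances, V, w):
    """Attention pooling over a bag of instance features.

    instances: K feature vectors of length D (one per patch)
    V:         L x D projection matrix (hand-set here, learned in practice)
    w:         length-L scoring vector (hand-set here, learned in practice)
    Returns the pooled bag feature and the K attention weights.
    """
    scores = []
    for h in instances:
        # Hidden representation tanh(V h), then a scalar score w . tanh(V h).
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))

    # Numerically stable softmax over the instances in the bag.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Bag feature: attention-weighted sum of the instance features.
    dim = len(instances[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return pooled, weights


# Toy bag with two 2-dimensional patch features.
bag = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = attention_mil_pool(bag, V=[[1.0, 0.0]], w=[1.0])
print(weights, pooled)
```

A classifier head applied to the pooled feature then gives the slide-level subtype prediction; the attention weights indicate which patches contributed most.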

[136] Adaptable Segmentation Pipeline for Diverse Brain Tumors with Radiomic-guided Subtyping and Lesion-Wise Model Ensemble

Daniel Capellán-Martín,Abhijeet Parida,Zhifan Jiang,Nishad Kulkarni,Krithika Iyer,Austin Tapp,Syed Muhammad Anwar,María J. Ledesma-Carbayo,Marius George Linguraru

Main category: cs.CV

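TL;DR: Proposes a flexible, modular, and adaptable brain tumor segmentation pipeline that combines state-of-the-art models with lesion-specific processing to achieve robust, generalizable segmentation on multi-parametric MRI.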

Details Motivation: Brain tumor types differ widely, which makes robust and generalizable segmentation on multi-parametric MRI difficult; a unified approach across adult and pediatric cases and multiple tumor types is especially needed. Method: The authors build a modular pipeline that combines state-of-the-art models, uses radiomic features extracted from MRI to identify tumor subtype, and applies tumor-specific pre-processing and post-processing before and after training. Custom lesion-level performance metrics are used to optimize the model ensemble and the post-processing. Result: On the test sets of several BraTS 2025 challenges (PED, MEN, MEN-RT, MET), the method performs comparably to top-ranked algorithms, supporting its generalization across diverse data. Conclusion: Lesion-aware custom processing and model selection yield robust segmentations without depending on a specific network architecture, with potential clinical use in quantitative tumor measurement, diagnosis, and prognosis. Abstract: Robust and generalizable segmentation of brain tumors on multi-parametric magnetic resonance imaging (MRI) remains difficult because tumor types differ widely. The BraTS 2025 Lighthouse Challenge benchmarks segmentation methods on diverse high-quality datasets of adult and pediatric tumors: multi-consortium international pediatric brain tumor segmentation (PED), preoperative meningioma tumor segmentation (MEN), meningioma radiotherapy segmentation (MEN-RT), and segmentation of pre- and post-treatment brain metastases (MET). We present a flexible, modular, and adaptable pipeline that improves segmentation performance by selecting and combining state-of-the-art models and applying tumor- and lesion-specific processing before and after training. Radiomic features extracted from MRI help detect tumor subtype, ensuring a more balanced training. Custom lesion-level performance metrics determine the influence of each model in the ensemble and optimize post-processing that further refines the predictions, enabling the workflow to tailor every step to each case. On the BraTS testing sets, our pipeline achieved performance comparable to top-ranked algorithms across multiple challenges. These findings confirm that custom lesion-aware processing and model selection yield robust segmentations without locking the method to a specific network architecture. Our method has the potential for quantitative tumor measurement in clinical practice, supporting diagnosis and prognosis.

[137] ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking

Lihong Wang,Liangqi Li,Weiwei Feng,Jiamin Wu,Changtao Miao,Tieru Wu,Rui Ma,Bo Zhang,Zhe Li

Main category: cs.CV

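TL;DR: Proposes the ViRC framework and the CRUX dataset, which improve reasoning on multimodal mathematical tasks by imitating the step-by-step reasoning pattern of humans, clearly outperforming baseline models.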

Details Motivation: Existing MLLMs reason in text from a single static image on multimodal mathematical tasks and ignore dynamic visual acquisition during reasoning, so they struggle to imitate how humans verify propositions step by step. Method: The ViRC framework introduces a Reason Chunking mechanism that decomposes multimodal CoT into consecutive Critical Reasoning Units (CRUs) and integrates visual information between CRUs. The CRUX dataset provides explicitly annotated CRUs across multiple reasoning paths. A progressive training strategy inspired by human cognition (Instructional SFT, Practice SFT, and Strategic RL) strengthens the model's reasoning ability. Result: The ViRC-7B model improves over baselines by 18.8% on average across multiple mathematical benchmarks. Conclusion: Imitating the human pattern of repeatedly inspecting the image while reasoning, combined with structured reasoning units and progressive training, effectively improves the reasoning performance of multimodal large models on mathematical tasks. Abstract: Chain-of-thought (CoT) reasoning has significantly enhanced the reasoning ability of LLMs, but it faces challenges when extended to multimodal domains, particularly in mathematical tasks. Existing MLLMs typically perform textual reasoning solely from a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine the visual image and employ step-by-step reasoning to prove intermediate propositions. This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller's Law in cognitive science. Inspired by this insight, we propose a ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate human expert problem-solving patterns. CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning. To this end, we present the CRUX dataset by using three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem. Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, which includes Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the Reason Chunking ability of the model. The resulting ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks.
Code is available at https://github.com/Leon-LihongWang/ViRC.

[138] Enhancing Visual Sentiment Analysis via Semiotic Isotopy-Guided Dataset Construction

Marco Blanchini,Giovanna Maria Dimitri,Benedetta Tondi,Tarcisio Lancioni,Mauro Barni

Main category: cs.CV

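TL;DR: Proposes a dataset construction method for visual sentiment analysis based on semiotic isotopy. By merging multiple source datasets and better modeling combinations of emotionally relevant elements, it markedly improves model generalization in cross-dataset tasks.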

Details Motivation: Existing visual sentiment analysis (VSA) datasets lack diversity and have limited annotations, so models generalize poorly across datasets. A richer data construction method with more emotional-semantic structure is needed. Method: The semiotic concept of isotopy guides dataset construction: several existing datasets are merged, and emotionally related combinations of image elements are selected and organized to build a larger, semantically richer VSA dataset. Result: In cross-dataset tests on major VSA benchmarks, models trained on the new dataset clearly outperform models trained on the original datasets, supporting the method's benefit for generalization. Conclusion: Integrating semiotic theory into data construction improves how well VSA models understand and generalize over emotional content, and offers a new data-driven direction for emotion perception tasks. Abstract: Visual Sentiment Analysis (VSA) is a challenging task due to the vast diversity of emotionally salient images and the inherent difficulty of acquiring sufficient data to capture this variability comprehensively. Key obstacles include building large-scale VSA datasets and developing effective methodologies that enable algorithms to identify emotionally significant elements within an image. These challenges are reflected in the limited generalization performance of VSA algorithms and models when trained and tested across different datasets. Starting from a pool of existing data collections, our approach enables the creation of a new larger dataset that not only contains a wider variety of images than the original ones, but also permits training new models with improved capability to focus on emotionally relevant combinations of image elements. This is achieved through the integration of the semiotic isotopy concept within the dataset creation process, providing deeper insights into the emotional content of images. Empirical evaluations show that models trained on a dataset generated with our method consistently outperform those trained on the original data collections, achieving superior generalization across major VSA benchmarks.

[139] ART: Articulated Reconstruction Transformer

Zizhang Li,Cheng Zhang,Zhengqin Li,Henry Howard-Jenkins,Zhaoyang Lv,Chen Geng,Jiajun Wu,Richard Newcombe,Jakob Engel,Zhao Dong

Main category: cs.CV

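TL;DR: ART is a category-agnostic feed-forward model that treats articulated objects as assemblies of rigid parts and uses a Transformer architecture to reconstruct complete 3D articulated objects from sparse, multi-state RGB images.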

Details Motivation: Existing articulated object reconstruction methods either rely on slow, unstable optimization or are limited to specific categories, so they lack generality and efficiency. Method: ART uses a newly designed Transformer architecture that maps image inputs to learnable part slots and jointly decodes each part's 3D geometry, texture, and articulation parameters, giving part-level predictions. Result: Trained on a large-scale, diverse dataset and evaluated on multiple benchmarks, ART clearly outperforms existing baselines. Conclusion: ART sets a new state of the art for articulated object reconstruction from image inputs; its results are physically interpretable and can be used directly in simulation. Abstract: We introduce ART (Articulated Reconstruction Transformer), a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.

[140] VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image

Sicheng Xu,Guojun Chen,Jiaolong Yang,Yizhong Zhang,Yu Deng,Steve Lin,Baining Guo

Main category: cs.CV

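TL;DR: VASA-3D is an audio-driven, single-image 3D head avatar generator. From one portrait image it produces a realistic 3D talking head with rich expression detail and supports high-frame-rate free-viewpoint video generation.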

Details Motivation: Existing methods fall short in capturing the subtle expressions of real faces and in reconstructing an intricate 3D head avatar from a single image. Method: VASA-3D uses the motion latent of VASA-1 to model expression detail, designs a 3D head model conditioned on that latent, and customizes it to a single image through an optimization framework based on many synthesized video frames. Result: Experiments show that VASA-3D produces more realistic 3D talking heads than prior methods and supports online generation of 512x512 free-viewpoint videos at up to 75 FPS. Conclusion: VASA-3D delivers high-quality, real-time, audio-driven 3D head avatar generation and advances immersive interaction with personalized virtual avatars. Abstract: We propose VASA-3D, an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details present in real human faces, and reconstructing an intricate 3D head avatar from a single portrait image. To accurately model expression details, VASA-3D leverages the motion latent of VASA-1, a method that yields exceptional realism and vividness in 2D talking heads. A critical element of our work is translating this motion latent to 3D, which is accomplished by devising a 3D head model that is conditioned on the motion latent. Customization of this model to a single image is achieved through an optimization framework that employs numerous video frames of the reference head synthesized from the input image. The optimization takes various training losses robust to artifacts and limited pose coverage in the generated training data. Our experiment shows that VASA-3D produces realistic 3D talking heads that cannot be achieved by prior art, and it supports the online generation of 512x512 free-viewpoint videos at up to 75 FPS, facilitating more immersive engagements with lifelike 3D avatars.

[141] Native and Compact Structured Latents for 3D Generation

Jianfeng Xiang,Xiaoxue Chen,Sicheng Xu,Ruicheng Wang,Zelong Lv,Yu Deng,Hongyuan Zhu,Yue Dong,Hao Zhao,Nicholas Jing Yuan,Jiaolong Yang

Main category: cs.CV

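TL;DR: This paper proposes O-Voxel, a new sparse voxel structure for learning structured latent representations from native 3D data, improving how well 3D generative models handle complex topology and detailed appearance.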

Details Motivation: Existing 3D generative models are held back by their representations and struggle with complex topology and high-fidelity appearance. The paper aims to overcome these limits with a structured representation. Method: The authors propose O-Voxel, a sparse omni-voxel representation that encodes both geometry and appearance, and build a Sparse Compression VAE on top of it for efficient compression and a compact latent space. Large-scale flow-matching models (4B parameters) are trained on diverse 3D datasets for 3D content generation. Result: The method robustly models arbitrary topology (open, non-manifold, and fully enclosed surfaces) and captures rich surface attributes including PBR parameters. Generated assets clearly surpass existing methods in geometry and material quality, and inference is efficient. Conclusion: O-Voxel offers an efficient, high-quality approach to 3D generative modeling and advances the generation of complex 3D assets. Abstract: Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.

[142] CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives

Zihan Wang,Jiashun Wang,Jeff Tan,Yiwen Zhao,Jessica Hodgins,Shubham Tulsiani,Deva Ramanan

Main category: cs.CV

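TL;DR: CRISP recovers simulatable human motion and scene geometry from monocular video. It fits planar primitives and models human-scene contact to produce clean, convex, simulation-ready geometry, and uses reinforcement learning to ensure physical plausibility.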

Details Motivation: Existing joint human-scene reconstruction methods rely on data-driven priors or on optimization without physical constraints, so the recovered geometry contains noise and artifacts that hurt tracking of interactive motion. Method: CRISP fits planar primitives to a point cloud reconstruction by clustering over depth, normals, and optical flow, which yields clean convex geometry. It uses human posture to infer the geometry of occluded regions (such as a chair seat), and it checks the physical plausibility of the human and scene reconstructions by using them to drive a humanoid controller trained with reinforcement learning. Result: On benchmarks such as EMDB and PROX, the motion tracking failure rate falls from 55.2% to 6.9% and RL simulation throughput is 43% faster; the method is also validated on casually captured videos, Internet videos, and Sora-generated videos. Conclusion: CRISP can generate physically valid motion and interaction environments at scale, advancing real-to-sim applications in robotics and AR/VR. Abstract: We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry with artifacts that cause motion tracking policies with scene interactions to fail. In contrast, our key insight is to recover convex, clean, and simulation-ready geometry by fitting planar primitives to a point cloud reconstruction of the scene, via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that might be occluded during interactions, we make use of human-scene contact modeling (e.g., we use human posture to reconstruct the occluded seat of a chair). Finally, we ensure that human and scene reconstructions are physically plausible by using them to drive a humanoid controller via reinforcement learning. Our approach reduces motion tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks (EMDB, PROX), while delivering a 43% faster RL simulation throughput. We further validate it on in-the-wild videos including casually captured videos, Internet videos, and even Sora-generated videos. This demonstrates CRISP's ability to generate physically valid human motion and interaction environments at scale, greatly advancing real-to-sim applications for robotics and AR/VR.

[143] Spherical Leech Quantization for Visual Tokenization and Generation

Yue Zhao,Hanwen Jiang,Zhenlin Xu,Chutong Yang,Ehsan Adeli,Philipp Krähenbühl

Main category: cs.CV

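TL;DR: This paper proposes a spherical quantization method based on the Leech lattice (Λ₂₄-SQ). It uses the geometry of lattice codes to give a unified account of non-parametric quantization methods and outperforms prior work on image reconstruction and generation.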

Details Motivation: Non-parametric quantization has drawn attention for its parameter efficiency and scalability to large codebooks, but existing methods need auxiliary loss terms and have limited performance, so a better quantization scheme is needed. Method: The authors analyze non-parametric quantization methods in a unified way through lattice coding, explore several candidate lattices (random lattices, generalized Fibonacci lattices, and densest sphere packing lattices), and propose a spherical quantization method based on the Leech lattice (Λ₂₄-SQ). Result: Λ₂₄-SQ gives better image reconstruction quality than prior methods such as BSQ while using fewer bits, and performs better in both image compression and auto-regressive generation frameworks. Conclusion: Thanks to its high symmetry and even distribution on the hypersphere, the Leech lattice is a better choice for non-parametric quantization, simplifying the training recipe and improving the reconstruction-compression tradeoff. Abstract: Non-parametric quantization has received much attention due to its efficiency on parameters and scalability to a large codebook. In this paper, we present a unified formulation of different non-parametric quantization methods through the lens of lattice coding. The geometry of lattice codes explains the necessity of auxiliary loss terms when training auto-encoders with certain existing lookup-free quantization variants such as BSQ. As a step forward, we explore a few possible candidates, including random lattices, generalized Fibonacci lattices, and densest sphere packing lattices. Among all, we find the Leech lattice-based quantization method, which is dubbed Spherical Leech Quantization ($\Lambda_{24}$-SQ), leads to both a simplified training recipe and an improved reconstruction-compression tradeoff thanks to its high symmetry and even distribution on the hypersphere. In image tokenization and compression tasks, this quantization approach achieves better reconstruction quality across all metrics than BSQ, the best prior art, while consuming slightly fewer bits. The improvement also extends to state-of-the-art auto-regressive image generation frameworks.
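
To make the idea of quantizing on a hypersphere concrete, the sketch below shows two small pieces in plain Python: the BSQ-style rule (normalize, then snap each coordinate to ±1/√d) and a generic nearest-codeword lookup for any fixed set of unit vectors. The function names and the toy 2D codebook are assumptions for illustration. The 24-dimensional Leech lattice codebook itself is not constructed here, and the paper's actual implementation is not described in this entry.

```python
import math


def normalize(v):
    """Project a non-zero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def bsq_quantize(v):
    """BSQ-style quantization: normalize, then snap each coordinate to +-1/sqrt(d)."""
    u = normalize(v)
    scale = 1.0 / math.sqrt(len(u))
    return [scale if x >= 0 else -scale for x in u]


def nearest_codeword(v, codebook):
    """Index of the unit-norm codeword closest to v on the sphere.

    For unit vectors, the largest dot product is the smallest Euclidean distance.
    """
    u = normalize(v)
    best_index, best_dot = 0, float("-inf")
    for index, code in enumerate(codebook):
        dot = sum(a * b for a, b in zip(u, code))
        if dot > best_dot:
            best_index, best_dot = index, dot
    return best_index


# Toy fixed codebook: four unit vectors on the circle (a stand-in for lattice points).
toy_codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
print(bsq_quantize([3.0, -4.0]))
print(nearest_codeword([0.9, 0.1], toy_codebook))
```

Both quantizers use a fixed, parameter-free codebook, which is what makes them non-parametric; they differ only in which set of points on the sphere serves as the codebook.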

[144] MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Sihui Ji,Xi Chen,Shuai Yang,Xin Tao,Pengfei Wan,Hengshuang Zhao

Main category: cs.CV

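TL;DR: Proposes MemFlow, which dynamically updates and retrieves from a memory bank to improve long-range content consistency in streaming video generation while keeping generation efficient.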

Details Motivation: Existing methods compress historical frames with fixed strategies, which cannot adapt to the different historical cues that different video chunks need, leading to inconsistent content. Method: Before generating a new chunk, MemFlow dynamically retrieves from and updates the memory bank based on the chunk's text prompt; during generation, it activates only the memory tokens most relevant to each query in the attention layers. Result: MemFlow clearly improves long-range narrative coherence with very small overhead (only 7.9% slower than the memory-free baseline) and is compatible with any streaming video generation model that uses a KV cache. Conclusion: Dynamic memory updates combined with sparse attention address long-range consistency in streaming video generation while preserving efficiency and compatibility. Abstract: The core challenge for streaming video generation is maintaining content consistency over a long context, which places high demands on the memory design. Most existing solutions maintain the memory by compressing historical frames with predefined strategies. However, different to-generate video chunks should refer to different historical cues, which is hard to satisfy with fixed strategies. In this work, we propose MemFlow to address this problem. Specifically, before generating the coming chunk, we dynamically update the memory bank by retrieving the most relevant historical frames with the text prompt of this chunk. This design enables narrative coherence even if a new event happens or the scenario switches in future frames. In addition, during generation, we only activate the most relevant tokens in the memory bank for each query in the attention layers, which effectively guarantees the generation efficiency. In this way, MemFlow achieves outstanding long-context consistency with negligible computation burden (7.9% speed reduction compared with the memory-free baseline) and keeps the compatibility with any streaming video generation model with KV cache.
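
The retrieval step described above, picking the historical frames most relevant to the next chunk's text prompt, can be pictured as a top-k similarity search. The sketch below is an illustration under stated assumptions: it assumes the prompt and each historical frame already have embedding vectors in a shared space and that relevance is cosine similarity. Neither detail, nor the function names, comes from the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def retrieve_memory(prompt_embedding, frame_embeddings, k):
    """Return the indices of the k frames most similar to the prompt, in temporal order."""
    scored = [(cosine(prompt_embedding, frame), index)
              for index, frame in enumerate(frame_embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most relevant first
    return sorted(index for _, index in scored[:k])       # restore temporal order


# Toy history of four frame embeddings and one prompt embedding (hypothetical values).
history = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
prompt = [1.0, 0.0]
print(retrieve_memory(prompt, history, k=2))
```

The same top-k pattern applies in spirit to the second step in the abstract, where each attention query activates only its most relevant memory tokens, although that selection happens inside the attention layers and is not shown here.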