cs.CL
[1] Short-Context Dominance: How Much Local Context Natural Language Actually Needs?
Vala Vakilian,Zimeng Wang,Ankit Singh Rawat,Christos Thrampoulidis
Main category: cs.CL
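TL;DR: This paper studies the short-context dominance hypothesis, proposes DaMCL, a method for detecting long-context sequences that does not require the ground-truth next token, and designs a decoding algorithm that mitigates the output bias caused by large models' preference for short context.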
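The minimum-context-length idea behind this entry can be sketched as follows. This is an illustrative sketch only: `predict_next` is a hypothetical stand-in for a language model's top-1 prediction, `toy_predict` is a made-up toy model, and the candidate lengths are arbitrary; the paper's exact criterion may differ.

```python
from typing import Callable, Optional, Sequence


def minimum_context_length(
    tokens: Sequence[str],
    target: str,
    predict_next: Callable[[Sequence[str]], str],
    candidate_lengths: Sequence[int] = (1, 2, 4, 8, 16, 32, 64, 96),
) -> Optional[int]:
    """Return the shortest suffix length whose prediction equals `target`.

    Falls back to the full length if only the full context succeeds, and
    returns None if even the full context mispredicts.
    """
    for k in sorted(candidate_lengths):
        if k >= len(tokens):
            break
        if predict_next(tokens[-k:]) == target:
            return k
    if predict_next(tokens) == target:
        return len(tokens)
    return None


def toy_predict(context: Sequence[str]) -> str:
    """Hypothetical toy model: copy what followed the previous occurrence
    of the last visible token, or emit <unk> if there is none."""
    last = context[-1]
    for i in range(len(context) - 2, -1, -1):
        if context[i] == last:
            return context[i + 1]
    return "<unk>"


# The only earlier "a" is at the very start, so the full context is needed.
long_range = minimum_context_length(
    ["a", "b", "c", "d", "e", "a"], "b", toy_predict, (1, 2, 4)
)  # 6
# Here the earlier "a" is two tokens back, so a 3-token suffix suffices.
short_range = minimum_context_length(
    ["x", "a", "b", "a"], "b", toy_predict, (1, 2, 3)
)  # 3
```

A sequence whose value stays small relative to its length would count as short-context in this toy setup; the first example is the long-context case.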
Details
Motivation: Because large language models tend to rely on the most recent local context when generating, they may ignore relevant long-range information. This paper aims to verify the short-context dominance phenomenon and to identify and correct the prediction bias it induces.
Method: The context needed for prediction is assessed by measuring the minimum context length (MCL). Distributionally Aware MCL (DaMCL) is proposed as a proxy metric that can be used with sampling-based decoding, and a detection-based decoding algorithm is designed to boost the generation of long-range-relevant tokens.
Result: Experiments show that 75-80% of sequences can be predicted accurately from only the last 96 tokens; DaMCL combined with a threshold efficiently separates long-context from short-context sequences; the proposed decoding algorithm improves performance across several Q&A tasks and models.
Conclusion: Short-context dominance is widespread, but the bias it introduces can be mitigated with an effective detection mechanism and an adjusted decoding strategy, improving the model's use of long-range information.
Abstract: We investigate the short-context dominance hypothesis: that for most sequences, a small local prefix suffices to predict their next tokens. Using large language models as statistical oracles, we measure the minimum context length (MCL) needed to reproduce accurate full-context predictions across datasets with sequences of varying lengths. For sequences with 1-7k tokens from long-context documents, we consistently find that 75-80% require only the last 96 tokens at most. Given the dominance of short-context tokens, we then ask whether it is possible to detect challenging long-context sequences for which a short local prefix does not suffice for prediction. We introduce a practical proxy to MCL, called Distributionally Aware MCL (DaMCL), that does not require knowledge of the actual next-token and is compatible with sampling strategies beyond greedy decoding. Our experiments validate that simple thresholding of the metric defining DaMCL achieves high performance in detecting long vs. short context sequences. Finally, to counter the bias that short-context dominance induces in LLM output distributions, we develop an intuitive decoding algorithm that leverages our detector to identify and boost tokens that are long-range-relevant. Across Q&A tasks and model architectures, we confirm that mitigating the bias improves performance.
[2] Adaptation of Embedding Models to Financial Filings via LLM Distillation
Eliot Brenner,Dominic Seyler,Manjunath Hegde,Andrei Simion,Koustuv Dasgupta,Bing Xiang
Main category: cs.CL
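TL;DR: This paper proposes a scalable training pipeline that uses an unlabeled corpus and a general-purpose retrieval embedding model, iteratively mining hard examples and retraining a student model, to significantly improve information retrieval in the financial domain without human annotation.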
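The retrieval metrics reported for this entry (MRR@5 and DCG@5) can be sketched with their common textbook definitions. This is an assumption about the formulas, since the digest does not state which gain or discount variant the authors use; the relevance lists below are made-up illustrative data.

```python
import math
from typing import Sequence


def mrr_at_k(ranked_relevance: Sequence[Sequence[int]], k: int = 5) -> float:
    """Mean reciprocal rank of the first relevant item within the top k."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels[:k], start=1):
            if rel > 0:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_relevance)


def dcg_at_k(rels: Sequence[float], k: int = 5) -> float:
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    return sum(
        rel / math.log2(rank + 1) for rank, rel in enumerate(rels[:k], start=1)
    )


# Each inner list marks the relevance of the top-5 results for one query.
queries = [[0, 1, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
mrr = mrr_at_k(queries)          # (1/2 + 1 + 0) / 3 = 0.5
dcg = dcg_at_k([1, 0, 1, 0, 0])  # 1/log2(2) + 1/log2(4) = 1.5
```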
Details
Motivation: Existing embedding models underperform on information retrieval in specialized domains such as finance, and relying on human annotation is costly, so an efficient, low-cost domain adaptation method is needed.
Method: Starting from a general-purpose retrieval model and using LLM-judged relevance signals, a student-teacher interaction iteratively mines hard positives and hard negatives from the unlabeled corpus and uses them to keep refining the student model.
Result: Across 14 financial filing types and 21,800 query-document pairs, MRR@5 improves by 27.7% on average and mean DCG@5 by 44.6%; NDCG improves on 3 of 4 document classes in FinanceBench.
Conclusion: The method offers an efficient, low-cost way to adapt general-purpose embedding models to specialized domains, and is particularly suited to settings where labeled data is scarce.
Abstract: Despite advances in generative large language models (LLMs), practical application of specialized conversational AI agents remains constrained by computation costs, latency requirements, and the need for precise domain-specific relevance measures. While existing embedding models address the first two constraints, they underperform on information retrieval in specialized domains like finance. This paper introduces a scalable pipeline that trains specialized models from an unlabeled corpus using a general purpose retrieval embedding model as foundation. Our method yields an average of 27.7% improvement in MRR@5, 44.6% improvement in mean DCG@5 across 14 financial filing types measured over 21,800 query-document pairs, and improved NDCG on 3 of 4 document classes in FinanceBench. We adapt retrieval embeddings (bi-encoder) for RAG, not LLM generators, using LLM-judged relevance to distill domain knowledge into a compact retriever. There are prior works which pair synthetically generated queries with real passages to directly fine-tune the retrieval model. Our pipeline differs from these by introducing interaction between student and teacher models that interleaves retrieval-based mining of hard positive/negative examples from the unlabeled corpus with iterative retraining of the student model's weights using these examples. Each retrieval iteration uses the refined student model to mine the corpus for progressively harder training examples for the subsequent training iteration. The methodology provides a cost-effective solution to bridging the gap between general-purpose models and specialized domains without requiring labor-intensive human annotation.
[3] Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing
Zifan Jiang,Youngjoon Jang,Liliane Momeni,Gül Varol,Sarah Ebling,Andrew Zisserman
Main category: cs.CL
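TL;DR: Proposes SEA (Segment, Embed, and Align), a universal subtitle alignment method that aligns spoken-language text to continuous sign language video, supports multiple languages and domains, and achieves efficient, accurate alignment using pretrained models.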
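The kind of lightweight dynamic programming alignment this entry refers to can be illustrated with a generic monotonic assignment. This is a sketch under stated assumptions, not SEA's actual procedure: the similarity matrix `sim` (sign segments as rows, subtitles as columns) and the objective of maximizing total similarity under a non-decreasing order constraint are hypothetical choices made for illustration.

```python
from typing import List


def monotonic_align(sim: List[List[float]]) -> List[int]:
    """Assign each segment (row) to a subtitle (column) so that column
    indices never decrease, maximizing the summed similarity."""
    n, m = len(sim), len(sim[0])
    best = [[0.0] * m for _ in range(n)]  # best[i][j]: best score ending at (i, j)
    back = [[0] * m for _ in range(n)]    # back[i][j]: column chosen for row i - 1
    for j in range(m):
        best[0][j] = sim[0][j]
    for i in range(1, n):
        arg = 0  # column of the running maximum of best[i - 1][0..j]
        for j in range(m):
            if best[i - 1][j] > best[i - 1][arg]:
                arg = j
            best[i][j] = sim[i][j] + best[i - 1][arg]
            back[i][j] = arg
    # Backtrack from the best final column.
    j = max(range(m), key=lambda c: best[n - 1][c])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]


# Three sign segments, two subtitles (made-up similarities).
sim = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7]]
path = monotonic_align(sim)  # [0, 0, 1]
```

The running prefix maximum keeps the cost at O(n·m), which is one way such a procedure can stay cheap on CPU for long videos.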
Details
Motivation: Existing methods typically rely on end-to-end training tied to a specific language or dataset, which generalizes poorly and limits their use across scenarios.
Method: Uses two pretrained models in sequence: the first segments the video frame sequence into individual signs, and the second embeds each sign's video clip into a latent space shared with text. Alignment is then performed by a lightweight dynamic programming algorithm.
Result: Achieves state-of-the-art alignment performance on four sign language datasets, processes hour-long videos within a minute on CPU, and applies to scenarios ranging from small lexicons to large continuous corpora.
Conclusion: SEA provides a flexible, efficient, universal framework that can be used to generate high-quality parallel data and advance sign language processing.
Abstract: The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
[4] Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation
Sampriti Soor,Suklav Ghosh,Arijit Sur
Main category: cs.CL
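TL;DR: Proposes a universal adversarial suffix method that learns transferable discrete suffixes via Gumbel-Softmax, achieving effective black-box attacks across multiple tasks and models.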
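The Gumbel-Softmax relaxation named in this entry can be sketched for a single suffix position as below. This shows the generic technique only, with hypothetical logits and temperature; it does not reproduce the paper's training loss, masking, or entropy regularization.

```python
import math
import random
from typing import List


def gumbel_softmax(
    logits: List[float], temperature: float, rng: random.Random
) -> List[float]:
    """Draw one relaxed (soft) one-hot sample over the vocabulary."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1),
        # clamped away from 0 and 1 to keep the logarithms finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    norm = sum(exps)
    return [e / norm for e in exps]


def discretize(soft: List[float]) -> int:
    """Pick the hard token index that would be used at inference time."""
    return max(range(len(soft)), key=soft.__getitem__)


rng = random.Random(0)
soft = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
token_id = discretize(soft)
```

Lower temperatures push the soft sample toward a one-hot vector, which is why the final discretization step loses less as training anneals the temperature.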
Details
Motivation: Existing adversarial attacks usually design triggers for a specific task or model, which makes results hard to compare and limits transferability. This paper studies universal adversarial suffixes that transfer broadly across tasks and models.
Method: The adversarial suffix is learned in a soft vector space via Gumbel-Softmax relaxation. Training maximizes calibrated cross-entropy on the label region while masking gold label tokens to prevent leakage, adds entropy regularization to avoid collapse, and finally discretizes the suffix for inference.
Result: On sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning, the method achieves effective, transferable attacks on Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B, substantially lowering accuracy and calibrated confidence.
Conclusion: Universal adversarial suffixes transfer strongly, revealing a general fragility of language models in zero-shot and few-shot classification and offering a new perspective for robustness evaluation.
Abstract: Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.
[5] Universal Adversarial Suffixes for Language Models Using Reinforcement Learning with Calibrated Reward
Sampriti Soor,Suklav Ghosh,Arijit Sur
Main category: cs.CL
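TL;DR: Proposes a reinforcement learning framework that trains adversarial suffixes with Proximal Policy Optimization (PPO), using calibrated cross-entropy reward shaping to improve transferability across tasks and models.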
Details
Motivation: Adversarial suffixes produced by existing methods are usually brittle and tied to a specific task or model, and do not generalize across tasks and models.
Method: The adversarial suffix is treated as a policy and trained with Proximal Policy Optimization against a frozen language model. Rewards are built from calibrated cross-entropy, removing label bias and aggregating across surface forms to improve transfer.
Result: Evaluated on five NLP benchmark datasets and three language models, RL-trained suffixes consistently reduce model accuracy and transfer more strongly in cross-task and cross-model settings.
Conclusion: The reinforcement learning framework generates more transferable adversarial suffixes than traditional gradient search and rule-based methods, revealing a general vulnerability of language models to a unified attack.
Abstract: Language models are vulnerable to short adversarial suffixes that can reliably alter predictions. Previous works usually find such suffixes with gradient search or rule-based methods, but these are brittle and often tied to a single task or model. In this paper, a reinforcement learning framework is used where the suffix is treated as a policy and trained with Proximal Policy Optimization against a frozen model as a reward oracle. Rewards are shaped using calibrated cross-entropy, removing label bias and aggregating across surface forms to improve transferability. The proposed method is evaluated on five diverse NLP benchmark datasets, covering sentiment, natural language inference, paraphrase, and commonsense reasoning, using three distinct language models: Qwen2-1.5B Instruct, TinyLlama-1.1B Chat, and Phi-1.5. Results show that RL-trained suffixes consistently degrade accuracy and transfer more effectively across tasks and models than previous adversarial triggers of similar genres.
[6] ClinicalTrialsHub: Bridging Registries and Literature for Comprehensive Clinical Trial Access
Jiwoo Park,Ruoqi Liu,Avani Jagdale,Andrew Srisuwananukorn,Jing Zhao,Lang Li,Ping Zhang,Sachin Kumar
Main category: cs.CL
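TL;DR: ClinicalTrialsHub is an interactive platform that consolidates ClinicalTrials.gov data and automatically extracts structured clinical trial information from PubMed literature, using large language models to improve data accessibility and supporting user query parsing, structured search, and attributed question answering.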
Details
Motivation: Existing clinical trial data is scattered and insufficiently structured, which limits access to evidence for patients, clinicians, researchers, and policymakers; data accessibility must improve to advance evidence-based medicine.
Method: Large language models such as GPT-5.1 and Gemini-3-Pro are used to parse full-text PubMed articles, extract structured trial information, translate user queries into database searches, and build an attributed question-answering system. Effectiveness is validated through a user study and a systematic automatic evaluation.
Result: Access to structured clinical trial data increases by 83.8% compared with using ClinicalTrials.gov alone; the user study and automatic evaluation indicate that the system has practical value for information extraction and question answering.
Conclusion: ClinicalTrialsHub substantially extends the coverage of structured clinical trial data and makes data retrieval and comprehension more efficient, helping to advance evidence-based medicine.
Abstract: We present ClinicalTrialsHub, an interactive search-focused platform that consolidates all data from ClinicalTrials.gov and augments it by automatically extracting and structuring trial-relevant information from PubMed research articles. Our system effectively increases access to structured clinical trial data by 83.8% compared to relying on ClinicalTrials.gov alone, with potential to make access easier for patients, clinicians, researchers, and policymakers, advancing evidence-based medicine. ClinicalTrialsHub uses large language models such as GPT-5.1 and Gemini-3-Pro to enhance accessibility. The platform automatically parses full-text research articles to extract structured trial information, translates user queries into structured database searches, and provides an attributed question-answering system that generates evidence-grounded answers linked to specific source sentences. We demonstrate its utility through a user study involving clinicians, clinical researchers, and PhD students of pharmaceutical sciences and nursing, and a systematic automatic evaluation of its information extraction and question answering capabilities.
[7] Are generative AI text annotations systematically biased?
Sjoerd B. Stolwijk,Mark Boukes,Damian Trilling
Main category: cs.CL
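TL;DR: The study evaluates bias in GLLM annotations of five concepts (political content, interactivity, rationality, incivility, and ideology) by conceptually replicating the manual annotations of Boukes (2024). Although GLLMs achieve adequate F1 scores, they differ markedly from manual annotations in prevalence, downstream results, and systematic bias, and the models agree more with each other than with manual annotations, indicating that F1 scores do not reflect the actual degree of bias.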
Details
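Motivation: To examine how reliable generative large language models (GLLMs) are for automatic annotation in social science research, in particular their systematic bias relative to manual annotation.
Method: Several GLLMs (Llama3.1:8b, Llama3.3:70b, GPT4o, Qwen2.5:72b) are combined with five different prompts to annotate five concepts. The results are compared with the manual annotations of Boukes (2024) in terms of F1 score, prevalence, consistency of downstream results, and overlap between models.
Result: GLLMs achieve adequate F1 scores but differ from manual annotations in prevalence, which leads to different downstream results; the GLLMs also agree more with each other than with manual annotations, showing systematic bias.
Conclusion: Conventional metrics such as F1 are insufficient for assessing bias in GLLM annotations; more sensitive measures are needed to identify and correct systematic bias and ensure validity and fairness in social science applications.
Abstract: This paper investigates bias in GLLM annotations by conceptually replicating manual annotations of Boukes (2024). We use various GLLMs (Llama3.1:8b, Llama3.3:70b, GPT4o, Qwen2.5:72b) in combination with five different prompts for five concepts (political content, interactivity, rationality, incivility, and ideology). We find GLLMs perform adequately in terms of F1 scores, but differ from manual annotations in terms of prevalence, yield substantively different downstream results, and display systematic bias in that they overlap more with each other than with manual annotations. Differences in F1 scores fail to account for the degree of bias.
[8] What Triggers my Model? Contrastive Explanations Inform Gender Choices by Translation Models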
Janiça Hackenbuchner,Arda Tezcan,Joke Daems
Main category: cs.CL
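TL;DR: Using contrastive explanations and saliency attribution, this study explores the sources of gender bias in machine translation models, finds that attributions to salient source-sentence words overlap noticeably with human perceptions of gender, and stresses the importance of understanding model decisions for mitigating gender bias.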
Details
Motivation: Existing research is mostly limited to measuring gender bias and rarely investigates its causes. This paper analyzes the contextual factors that trigger a translation model's gender choices in order to advance understanding of how gender bias arises.
Method: Working with gender-ambiguous natural source data, the study combines contrastive explanations and saliency attribution to analyze how input tokens in the source sentence influence the choice of gender inflection in the target language. It sets different attribution thresholds, compares salient tokens with human perceptions of gender, and provides a linguistic analysis.
Result: Certain source tokens strongly influence the model's gender decisions, and these salient tokens overlap noticeably with the cues humans perceive as gendered; analysis at different attribution levels reveals the role of key trigger words.
Conclusion: Understanding the basis of a model's gender choices in translation is essential. Its attribution patterns partly agree with human perception, and such interpretability methods should be used to identify and mitigate gender bias in models.
Abstract: Interpretability can be implemented as a means to understand decisions taken by (black box) models, such as machine translation (MT) or large language models (LLMs). Yet, research in this area has been limited in relation to a manifested problem in these models: gender bias. With this research, we aim to move away from simply measuring bias to exploring its origins. Working with gender-ambiguous natural source data, this study examines which context, in the form of input tokens in the source sentence, influences (or triggers) the translation model choice of a certain gender inflection in the target language. To analyse this, we use contrastive explanations and compute saliency attribution. We first address the challenge of a lacking scoring threshold and specifically examine different attribution levels of source words on the model gender decisions in the translation. We compare salient source words with human perceptions of gender and demonstrate a noticeable overlap between human perceptions and model attribution. Additionally, we provide a linguistic analysis of salient words. Our work showcases the relevance of understanding model translation decisions in terms of gender, how this compares to human decisions and that this information should be leveraged to mitigate gender bias.
[9] Soft Inductive Bias Approach via Explicit Reasoning Perspectives in Inappropriate Utterance Detection Using Large Language Models
Ju-Young Kim,Ji-Hong Park,Se-Yeon Lee,Sujin Park,Gun-Woo Kim
Main category: cs.CL
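TL;DR: This study proposes a soft inductive bias approach that explicitly defines reasoning perspectives to guide the inference process of Korean large language models and improve the accuracy of inappropriate utterance detection. Experiments show that the Kanana-1.5 model reaches an average accuracy of 87.0046%, about 3.89% higher than standard supervised learning.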
Details
Motivation: In online games and communities, anonymity often lets inappropriate remarks escalate into verbal abuse and even criminal behavior, so demand for automatic detection of inappropriate utterances is growing. Research applying Korean large language models and chain-of-thought reasoning to this area nevertheless remains limited.
Method: A soft inductive bias approach is proposed that explicitly sets reasoning perspectives to guide the model toward rational judgments. A Korean large language model is fine-tuned with this approach and evaluated both quantitatively and qualitatively.
Result: The Kanana-1.5 model achieves an average accuracy of 87.0046% on inappropriate utterance detection, about 3.89% higher than standard supervised learning. Qualitative analysis also shows that the model makes more consistent and interpretable judgments.
Conclusion: The proposed reasoning-perspective guidance goes beyond simple knowledge imitation by large language models and enables more precise and stable detection of inappropriate utterances through constrained reasoning paths, demonstrating its effectiveness and practical potential.
Abstract: Recent incidents in certain online games and communities, where anonymity is guaranteed, show that unchecked inappropriate remarks frequently escalate into verbal abuse and even criminal behavior, raising significant social concerns. Consequently, there is a growing need for research on techniques that can detect inappropriate utterances within conversational texts to help build a safer communication environment. Although large-scale language models trained on Korean corpora and chain-of-thought reasoning have recently gained attention, research applying these approaches to inappropriate utterance detection remains limited. In this study, we propose a soft inductive bias approach that explicitly defines reasoning perspectives to guide the inference process, thereby promoting rational decision-making and preventing errors that may arise during reasoning. We fine-tune a Korean large language model using the proposed method and conduct both quantitative performance comparisons and qualitative evaluations across different training strategies. Experimental results show that the Kanana-1.5 model achieves an average accuracy of 87.0046, improving by approximately 3.89 percent over standard supervised learning. These findings indicate that the proposed method goes beyond simple knowledge imitation by large language models and enables more precise and consistent judgments through constrained reasoning perspectives, demonstrating its effectiveness for inappropriate utterance detection.
[10] Curriculum Guided Massive Multi Agent System Solving For Robust Long Horizon Tasks
Indrajit Kar,Kalathur Chenchu Kishore Kumar
Main category: cs.CL
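TL;DR: Proposes a hierarchical multi-agent architecture that performs long-horizon reasoning with a 64*64 grid of lightweight agents and a selective oracle; combined with a training strategy based on a spatial curriculum and NLL confidence, it improves stability and reasoning ability on a Tower of Hanoi-style task.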
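The Thompson Sampling curriculum manager mentioned in this entry can be illustrated with a standard Beta-Bernoulli bandit over training zones. This is a generic sketch, not the authors' implementation: the `ZoneSampler` class, the zone names, and the mapping of competence and NLL into a reward in [0, 1] are all hypothetical.

```python
import random
from typing import Dict, List


class ZoneSampler:
    """Beta-Bernoulli Thompson Sampling over training zones (illustrative)."""

    def __init__(self, zones: List[str], seed: int = 0) -> None:
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per zone, stored as [alpha, beta] pseudo-counts.
        self.counts: Dict[str, List[float]] = {z: [1.0, 1.0] for z in zones}

    def choose(self) -> str:
        """Sample a plausible success rate per zone and pick the largest."""
        draws = {
            zone: self.rng.betavariate(alpha, beta)
            for zone, (alpha, beta) in self.counts.items()
        }
        return max(draws, key=draws.get)

    def update(self, zone: str, reward: float) -> None:
        """Fold a reward in [0, 1] into the chosen zone's posterior."""
        self.counts[zone][0] += reward
        self.counts[zone][1] += 1.0 - reward


sampler = ZoneSampler(["center", "edge"])
for _ in range(200):
    zone = sampler.choose()
    # Assumed toy reward: agents always succeed in the center, never at the edge.
    sampler.update(zone, 1.0 if zone == "center" else 0.0)
```

In this toy run the sampler quickly concentrates on the zone that pays off; a curriculum would instead shape the reward so that zones at the edge of current competence are preferred.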
Details
Motivation: Existing large models and multi-agent systems perform poorly on long-horizon reasoning tasks and incur high computation cost, so the efficiency and stability of distributed reasoning need to improve.
Method: A hierarchical multi-agent architecture distributes reasoning across a 64*64 grid. A spatial curriculum expands the training region gradually from the center toward the periphery. Negative Log-Likelihood (NLL) serves as a confidence measure, and a Thompson Sampling strategy dynamically selects training zones.
Result: On a spatially grounded Tower of Hanoi benchmark, the system shows higher stability, fewer oracle calls, and stronger long-range reasoning.
Conclusion: Combining a hierarchical structure with an adaptive curriculum effectively improves multi-agent performance on complex long-horizon tasks and offers a scalable approach for robotic manipulation and planning.
Abstract: Large Language Models and multi-agent systems have shown promise in decomposing complex tasks, yet they struggle with long-horizon reasoning tasks and escalating computation cost. This work introduces a hierarchical multi-agent architecture that distributes reasoning across a 64*64 grid of lightweight agents, supported by a selective oracle. A spatial curriculum progressively expands the operational region of the grid, ensuring that agents master easier central tasks before tackling harder peripheral ones. To improve reliability, the system integrates Negative Log-Likelihood as a measure of confidence, allowing the curriculum to prioritize regions where agents are both accurate and well calibrated. A Thompson Sampling curriculum manager adaptively chooses training zones based on competence and NLL-driven reward signals. We evaluate the approach on a spatially grounded Tower of Hanoi benchmark, which mirrors the long-horizon structure of many robotic manipulation and planning tasks. Results demonstrate improved stability, reduced oracle usage, and stronger long-range reasoning from distributed agent cooperation.
[11] HealthcareNLP: where are we and what is next?
Lifeng Han,Paul Rayson,Suzan Verberne,Andrew Moore,Goran Nenadic
Main category: cs.CL
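TL;DR: This tutorial introduces the key sub-areas of natural language processing for healthcare (HealthcareNLP), organized into three layers (data resources, NLP evaluation, and patient engagement); it aims to give newcomers a comprehensive overview and includes a hands-on session.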
Details
Motivation: Existing reviews often overlook important healthcare NLP tasks (such as synthetic data generation and explainable clinical NLP) or key methodologies (such as retrieval-augmented generation and the neural-symbolic integration of LLMs and knowledge graphs), so a systematic introduction is needed.
Method: The tutorial follows a three-layer structure: a data/resource layer (annotation guidelines, ethical governance, synthetic data); an NLP-Eval layer (tasks such as named entity recognition, relation extraction, and sentiment analysis, with categorised methods); and a patients layer (public involvement, health literacy, text simplification, and decision support). It also includes a hands-on session.
Result: Provides a systematic framework for healthcare NLP that covers privacy protection, explainable AI, and practical applications, supporting integration from technology through to clinical deployment.
Conclusion: The tutorial gives NLP researchers, healthcare researchers, and students an introductory, comprehensive overview of healthcare NLP that requires no prior knowledge, and promotes interdisciplinary collaboration and practical application.
Abstract: This proposed tutorial focuses on Healthcare Domain Applications of NLP, what we have achieved around HealthcareNLP, and the challenges that lie ahead for the future. Existing reviews in this domain either overlook some important tasks, such as synthetic data generation for addressing privacy concerns, or explainable clinical NLP for improved integration and implementation, or fail to mention important methodologies, including retrieval augmented generation and the neural symbolic integration of LLMs and KGs. In light of this, the goal of this tutorial is to provide an introductory overview of the most important sub-areas of a patient- and resource-oriented HealthcareNLP, with three layers of hierarchy: data/resource layer: annotation guidelines, ethical approvals, governance, synthetic data; NLP-Eval layer: NLP tasks such as NER, RE, sentiment analysis, and linking/coding with categorised methods, leading to explainable HealthAI; patients layer: Patient Public Involvement and Engagement (PPIE), health literacy, translation, simplification, and summarisation (also NLP tasks), and shared decision-making support. A hands-on session will be included in the tutorial for the audience to use HealthcareNLP applications. The target audience includes NLP practitioners in the healthcare application domain, NLP researchers who are interested in domain applications, healthcare researchers, and students from NLP fields. The type of tutorial is "Introductory to CL/NLP topics (HealthcareNLP)" and the audience does not need prior knowledge to attend this.
Tutorial materials: https://github.com/4dpicture/HealthNLP
[12] QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
Maximilian Kreutner,Jens Rupprecht,Georg Ahnert,Ahmed Salem,Markus Strohmaier
Main category: cs.CL
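TL;DR: QSTN is an open-source Python framework for systematically generating large language model (LLM) responses from questionnaire-style prompts; it supports no-code experiment setup and significantly improves the reproducibility and reliability of LLMs in survey and annotation tasks.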
Details
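Motivation: To improve the reproducibility, reliability, and ease of use of LLM-based research on survey and data annotation tasks while lowering compute cost.
Method: An open-source Python framework, QSTN, is developed to support systematic generation of responses to questionnaire prompts and robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods. It comes with a no-code user interface for researchers without a programming background.
Result: A large-scale evaluation of more than 40 million survey responses finds that question structure and response generation method significantly affect how well LLM-generated answers align with human answers, and that high performance can be achieved at lower compute cost.
Conclusion: The QSTN framework effectively supports the use of LLMs in simulated surveys and annotation tasks, improves reproducibility and accessibility of experiments, and may help standardize the use of LLMs in research.
Abstract: We introduce QSTN, an open-source Python framework for systematically generating responses from questionnaire-style prompts to support in-silico surveys and annotation tasks with large language models (LLMs). QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods. Our extensive evaluation (>40 million survey responses) shows that question structure and response generation methods have a significant impact on the alignment of generated survey responses with human answers, and can be obtained for a fraction of the compute cost. In addition, we offer a no-code user interface that allows researchers to set up robust experiments with LLMs without coding knowledge. We hope that QSTN will support the reproducibility and reliability of LLM-based research in the future.
[13] An Agentic AI System for Multi-Framework Communication Coding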
Bohao Yang,Rui Yang,Joshua M. Biro,Haoyuan Wang,Jessica L. Handley,Brianna Richardson,Sophia Bessias,Nicoleta Economou-Zavlanos,Armando D. Bedoya,Monica Agrawal,Michael M. Zavlanos,Anand Chowdhury,Raj M. Ratwani,Kai Sun,Kathryn I. Pollak,Michael J. Pencina,Chuan Hong
Main category: cs.CL
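TL;DR: This study develops MOSAIC, a LangGraph-based multi-framework structured agentic AI system for automatically annotating clinical patient-provider conversations; it markedly improves the scalability, adaptability, and reliability of annotation and reaches an overall F1 score of 0.928 in testing.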
Details
Motivation: Clinical communication is critical to patient outcomes, but manually annotating patient-provider conversations is slow, laborious, and hard to scale. Existing LLM-based approaches are mostly single-task models that lack adaptability and reliability across frameworks and domains.
Method: The MOSAIC system comprises four agents for planning, updating, annotation, and verification. It applies codebook-guided retrieval-augmented generation (RAG) with dynamic few-shot prompting, and coordinates the agents on a LangGraph-based architecture.
Result: On 50 test transcripts, MOSAIC achieves an overall F1 score of 0.928, with the best performance on the rheumatology subset (F1 = 0.962) and particular strength in identifying patient behavior; ablation experiments show it outperforms baseline models.
Conclusion: MOSAIC annotates multi-domain clinical conversations efficiently and accurately, with good interpretability and scalability, providing a reliable tool for large-scale analysis of clinical communication.
Abstract: Clinical communication is central to patient outcomes, yet large-scale human annotation of patient-provider conversation remains labor-intensive, inconsistent, and difficult to scale. Existing approaches based on large language models typically rely on single-task models that lack adaptability, interpretability, and reliability, especially when applied across various communication frameworks and clinical domains. In this study, we developed a Multi-framework Structured Agentic AI system for Clinical Communication (MOSAIC), built on a LangGraph-based architecture that orchestrates four core agents, including a Plan Agent for codebook selection and workflow planning, an Update Agent for maintaining up-to-date retrieval databases, a set of Annotation Agents that apply codebook-guided retrieval-augmented generation (RAG) with dynamic few-shot prompting, and a Verification Agent that provides consistency checks and feedback. To evaluate performance, we compared MOSAIC outputs against gold-standard annotations created by trained human coders. We developed and evaluated MOSAIC using 26 gold standard annotated transcripts for training and 50 transcripts for testing, spanning rheumatology and OB/GYN domains. On the test set, MOSAIC achieved an overall F1 score of 0.928. Performance was highest in the Rheumatology subset (F1 = 0.962) and strongest for Patient Behavior (e.g., patients asking questions, expressing preferences, or showing assertiveness). Ablations revealed that MOSAIC outperforms baseline benchmarking.
[14] Automatic Essay Scoring and Feedback Generation in Basque Language Learning
Ekhi Azurmendi,Xabier Arregi,Oier Lopez de Lacalle
Main category: cs.CL
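TL;DR: This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting CEFR level C1 and comprising 3,200 essays scored by experts with detailed feedback. The study fine-tunes several open-source models, which surpass closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality, and proposes a new evaluation method that combines automatic metrics with expert validation.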
Details
Motivation: To advance transparent, reproducible, and educationally meaningful NLP research in low-resource languages such as Basque, and to fill the resource gap in automatic essay scoring and feedback generation.
Method: An annotated dataset covering multiple scoring criteria is built from the HABE exam. Open-source models such as RoBERTa-EusCrawl and Latxa 8B/70B are trained with supervised fine-tuning (SFT), and a new feedback evaluation method is proposed that combines automatic consistency metrics with expert validation of extracted errors.
Result: The fine-tuned Latxa model surpasses closed-source models such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality, identifies a wider range of learner error types, and produces criterion-aligned, pedagogically meaningful feedback.
Conclusion: The dataset and benchmark lay a solid foundation for AES research in low-resource languages and show that properly fine-tuned open-source models can outperform closed-source systems on educational NLP tasks, underlining the importance of transparency and educational relevance.
Abstract: This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion-specific scores covering correctness, richness, coherence, cohesion, and task alignment, enriched with detailed feedback and error examples. We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for both scoring and explanation generation. Our experiments show that encoder models remain highly reliable for AES, while supervised fine-tuning (SFT) of Latxa significantly enhances performance, surpassing state-of-the-art (SoTA) closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality. We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors. Results demonstrate that the fine-tuned Latxa model produces criterion-aligned, pedagogically meaningful feedback and identifies a wider range of error types than proprietary models. This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.
[15] Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages
David Samuel,Lilja Øvrelid,Erik Velldal,Andrey Kutuzov
Main category: cs.CL
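TL;DR: Proposes a post-training method for lower-resource languages that preserves language-model fluency through on-policy training, requires no instruction-tuning data in the target language, and is validated on Norwegian.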
Details
Motivation: Lower-resource languages lack both datasets written by native speakers and language models that can generate fluent synthetic data, and existing preference-alignment research focuses mostly on English and Chinese. A way to align fluent language models without relying on scarce data is therefore needed.
Method: An on-policy training method is used and compared with two common approaches: supervised fine-tuning on machine-translated data and multilingual fine-tuning. No instruction-tuning data in the target language is used at any point.
Result: Experiments on Norwegian Bokmål show that the method clearly outperforms the other two, particularly in preserving fluency, without relying on any hard-to-obtain data.
Conclusion: On-policy training is crucial for preference alignment in lower-resource languages and can improve and preserve language-model fluency without native instruction data.
Abstract: We propose a post-training method for lower-resource languages that preserves fluency of language models even when aligned by disfluent reward models. Preference-optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and language models capable of generating fluent synthetic data. Thus, in this work, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.
[16] A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs
Mahmoud Srewa,Tianyu Zhao,Salma Elmalaki
Main category: cs.CL
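TL;DR: This paper presents an evaluation framework for aligning large language models with diverse human preferences in federated learning settings and introduces an adaptive aggregation method that dynamically adjusts weights according to each group's historical performance, significantly improving fairness while maintaining alignment performance.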
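The aggregation strategies compared in this entry (min, max, average, and an adaptive weighting) can be sketched as below. The adaptive rule here is purely an assumption for illustration: it weights each group in proportion to its historical shortfall, assuming scores lie in [0, 1], because the digest only says that weights depend on a group's historical alignment performance. The group names and numbers are made up.

```python
from typing import Dict, List


def aggregate(rewards: Dict[str, float], mode: str) -> float:
    """Standard server-side aggregation of group-level rewards."""
    values = list(rewards.values())
    if mode == "min":
        return min(values)
    if mode == "max":
        return max(values)
    if mode == "average":
        return sum(values) / len(values)
    raise ValueError(f"unknown mode: {mode}")


def adaptive_weights(history: Dict[str, List[float]]) -> Dict[str, float]:
    """Assumed rule: weight each group by its shortfall 1 - mean(history)."""
    gaps = {g: 1.0 - sum(h) / len(h) for g, h in history.items()}
    total = sum(gaps.values())
    if total <= 0.0:
        # Every group is already fully served: fall back to uniform weights.
        return {g: 1.0 / len(gaps) for g in gaps}
    return {g: gap / total for g, gap in gaps.items()}


def adaptive_aggregate(
    rewards: Dict[str, float], history: Dict[str, List[float]]
) -> float:
    weights = adaptive_weights(history)
    return sum(weights[g] * rewards[g] for g in rewards)


rewards = {"group_a": 0.9, "group_b": 0.3}
history = {"group_a": [0.8, 0.8], "group_b": [0.4, 0.4]}
weights = adaptive_weights(history)            # about {a: 0.25, b: 0.75}
combined = adaptive_aggregate(rewards, history)  # about 0.45
```

Under this assumed rule the historically under-served group dominates the aggregate, which sits between the `min` and `average` strategies in how strongly it favors the worst-off group.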
Details
Motivation: Standard alignment methods struggle to represent diverse user preferences in federated learning, and the trade-off between alignment quality and fairness has not been evaluated systematically.
Method: In the federated setting, each group produces reward signals locally, and the server aggregates the group-level rewards with standard strategies (min, max, average) and the proposed adaptive strategy, without accessing raw data. Experiments use a PPO-based RLHF pipeline on question-answering tasks.
Result: The proposed adaptive method clearly outperforms standard aggregation on several Q/A tasks, achieving higher fairness while keeping competitive alignment scores.
Conclusion: The work provides a robust method for evaluating LLM behavior across diverse populations and advances the development of genuinely pluralistic and fairly aligned models.
Abstract: This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning (FL) environments, where standard methods often fail to adequately represent diverse viewpoints. We introduce a comprehensive evaluation framework that systematically assesses the trade-off between alignment quality and fairness when using different aggregation strategies for human preferences. In our federated setting, each group locally evaluates rollouts and produces reward signals, and the server aggregates these group-level rewards without accessing any raw data. Specifically, we evaluate standard reward aggregation techniques (min, max, and average) and introduce a novel adaptive scheme that dynamically adjusts preference weights based on a group's historical alignment performance. Our experiments on question-answering (Q/A) tasks using a PPO-based RLHF pipeline demonstrate that our adaptive approach consistently achieves superior fairness while maintaining competitive alignment scores. This work offers a robust methodology for evaluating LLM behavior across diverse populations and provides a practical solution for developing truly pluralistic and fairly aligned models.
[17] Ask, Answer, and Detect: Role-Playing LLMs for Personality Detection with Question-Conditioned Mixture-of-Experts
Yifan Lyu,Liang Zhang
Main category: cs.CL
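TL;DR: Proposes the ROME framework, which uses large language models to simulate user answers to psychometric questionnaires, turning social media text into interpretable questionnaire-style evidence to ease label scarcity and strengthen the semantic mapping in personality detection.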
Details
Motivation: Existing personality detection methods are limited by sparse labels and by an under-specified semantic mapping between language and psychological constructs; they lack intermediate supervision signals and interpretability.
Method: Following the design of psychometric scales, the method uses LLM role-play to generate a user's answers to standardized questionnaire items. A question-conditioned Mixture-of-Experts module jointly learns post and question representations, and the answer vector is fused with the user representation for personality prediction within a multi-task framework.
Result: Experiments on two real-world datasets show that ROME clearly outperforms state-of-the-art methods, with a 15.41% improvement on the Kaggle dataset.
Conclusion: By introducing psychological knowledge and an interpretable intermediate question-answering task, ROME eases label scarcity and improves both the accuracy and the interpretability of personality detection.
Abstract: Understanding human personality is crucial for web applications such as personalized recommendation and mental health assessment. Existing studies on personality detection predominantly adopt a "posts -> user vector -> labels" modeling paradigm, which encodes social media posts into user representations for predicting personality labels (e.g., MBTI labels). While recent advances in large language models (LLMs) have improved text encoding capacities, these approaches remain constrained by limited supervision signals due to label scarcity, and under-specified semantic mappings between user language and abstract psychological constructs. We address these challenges by proposing ROME, a novel framework that explicitly injects psychological knowledge into personality detection. Inspired by standardized self-assessment tests, ROME leverages LLMs' role-play capability to simulate user responses to validated psychometric questionnaires. These generated question-level answers transform free-form user posts into interpretable, questionnaire-grounded evidence linking linguistic cues to personality labels, thereby providing rich intermediate supervision to mitigate label scarcity while offering a semantic reasoning chain that guides and simplifies the text-to-personality mapping learning. A question-conditioned Mixture-of-Experts module then jointly routes over post and question representations, learning to answer questionnaire items under explicit supervision. The predicted answers are summarized into an interpretable answer vector and fused with the user representation for final prediction within a multi-task learning framework, where question answering serves as a powerful auxiliary task for personality detection. Extensive experiments on two real-world datasets demonstrate that ROME consistently outperforms state-of-the-art baselines, achieving improvements (15.41% on the Kaggle dataset).
[18] Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis
Ferdinand Kapl,Emmanouil Angelis,Tobias Höppe,Kaitlin Maile,Johannes von Oswald,Nino Scherrer,Stefan Bauer
Main category: cs.CL
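TL;DR: This paper studies the mechanisms behind the performance gains from gradually increasing Transformer depth during training (e.g., MIDAS). It finds that gradual growth uses model depth more effectively, alters the residual stream structure, and promotes the formation of permutable computational blocks, thereby alleviating the "Curse of Depth" seen in standard models.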
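As a rough illustration of the depth growth summarized above, the sketch below grows a layer stack by duplicating a block from its middle. It is a minimal sketch, not the MIDAS procedure itself: the function name `grow_by_middle_stacking`, the `block_size` parameter, and the choice to copy a centred block and insert the copy right after the original are all assumptions made for illustration, since this entry does not specify the growth schedule.

```python
from typing import List, TypeVar

T = TypeVar("T")


def grow_by_middle_stacking(layers: List[T], block_size: int) -> List[T]:
    """Return a deeper stack by duplicating a block taken from the middle.

    Assumption (not from the source): growth copies `block_size`
    consecutive layers centred in the stack and inserts the copy
    directly after the original block.
    """
    if not 0 < block_size <= len(layers):
        raise ValueError("block_size must be between 1 and len(layers)")
    start = (len(layers) - block_size) // 2
    end = start + block_size
    middle_block = layers[start:end]
    # In a real model the copied layers would carry copied parameters;
    # here the layers are plain labels.
    return layers[:end] + list(middle_block) + layers[end:]


stack = ["L0", "L1", "L2", "L3"]
grown = grow_by_middle_stacking(stack, block_size=2)
print(grown)
```

With a four-layer stack and `block_size=2`, the grown stack has six entries and the two middle layers appear twice.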
Details
Motivation: Gradually increasing Transformer depth during training is known to reduce cost and improve reasoning ability, but the underlying mechanism remains unclear. This paper aims to explain mechanistically why such depth growth works, particularly with respect to uneven depth utilization (the "Curse of Depth").
Method: Using depth-wise analyses, the paper compares standard Transformers with models grown via gradual middle stacking in terms of residual stream structure and each layer's contribution to the output, and proposes a lightweight modification of MIDAS.
Result: Gradual depth growth uses the layers more evenly, improves information flow in the residual stream, and forms permutable computational blocks; the proposed modification further improves performance on downstream reasoning tasks.
Conclusion: Gradual depth growth not only alleviates the Curse of Depth but also promotes the formation of distinct computational circuits, improving the model's expressiveness and reasoning performance.
Abstract: Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half, a phenomenon known as the Curse of Depth (Sun et al., 2025; Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.
[19] Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders
Guangzhi Xiong,Zhenghao He,Bohan Liu,Sanchit Sinha,Aidong Zhang
Main category: cs.CL
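TL;DR: This paper proposes RAGLens, a lightweight method that uses sparse autoencoders (SAEs) to analyze the internal activations of large language models and detect hallucinations in retrieval-augmented generation (RAG), with high accuracy and interpretability.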
Details
Motivation: Existing RAG hallucination detectors rely on large-scale training data or on external LLM judges, which is costly, and methods based on internal representations have limited accuracy. An efficient, low-overhead, and accurate detector is therefore needed.
Method: Sparse autoencoders (SAEs) from mechanistic interpretability are used to disentangle the LLM's internal activations. Information-based feature selection and additive feature modeling then identify features specifically associated with RAG hallucinations, yielding the lightweight detector RAGLens.
Result: RAGLens outperforms existing methods in detection performance, accurately flags unfaithful RAG outputs, and provides interpretable rationales that support post-hoc mitigation; it also reveals how hallucination-related signals are distributed within LLMs.
Conclusion: RAGLens is an efficient, interpretable RAG hallucination detector that avoids large-scale detector training and external LLM judges, offering a new way to understand and mitigate LLM hallucinations.
Abstract: Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, we introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves superior detection performance compared to existing methods, but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, we justify our design choices and reveal new insights into the distribution of hallucination-related signals within LLMs. The code is available at https://github.com/Teddy-XiongGZ/RAGLens.
cs.CV [Back]
[20] Detection of Cyberbullying in GIF using AI
Pal Dave,Xiaohong Yuan,Madhuri Siddula,Kaushik Roy
Main category: cs.CV
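TL;DR: This paper proposes a deep learning approach that uses a VGG16 model to detect cyberbullying in a GIF dataset collected from Twitter, reaching 97% accuracy and providing a new dataset for related research.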
Details
Motivation: Cyberbullying on social media is growing more severe, yet existing research pays little attention to detecting it in GIFs and stickers, so effective methods for identifying such content are needed.
Method: Hashtags related to cyberbullying are extracted from Twitter and used to download GIF files through the GIPHY API, producing a dataset of more than 4100 GIFs; a pre-trained VGG16 deep learning model is then applied for classification.
Result: The model reaches 97% accuracy in detecting cyberbullying in GIFs, and the GIF dataset is released for follow-up research.
Conclusion: Deep learning models are effective for detecting cyberbullying in GIF content, and the constructed dataset can support future research in this area.
Abstract: Cyberbullying is a well-known social issue, and it is escalating day by day. With the rapid development of the internet, social media provide many ways for users to express opinions and exchange information. Cyberbullying occurs on social media through text messages, comments, shared images, GIFs or stickers, and audio and video. Much research has addressed cyberbullying detection on textual data, and some work exists for images, but very few studies address GIFs/stickers. We collect a GIF dataset from Twitter and apply a deep learning model to detect cyberbullying in it. First, we extracted hashtags related to cyberbullying from Twitter. We then used these hashtags to download GIF files through the publicly available GIPHY API. We collected over 4100 GIFs, covering both cyberbullying and non-cyberbullying content. We applied the pre-trained deep learning model VGG16 to detect cyberbullying. The model achieved an accuracy of 97%. Our work provides the GIF dataset for researchers working in this area.
[21] Near-real time fires detection using satellite imagery in Sudan conflict
Kuldip Singh Atwal,Dieter Pfoser,Daniel Rothbart
Main category: cs.CV
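TL;DR: This paper proposes an automated method based on Planet Labs 4-band imagery and a deep learning model for near-real-time monitoring of fire damage in the armed conflict in Sudan, and validates through five case studies that it is more accurate than a baseline.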
Details
Motivation: The ongoing war makes rapid monitoring and analysis of conflict areas essential, and traditional approaches struggle to meet the timeliness requirement.
Method: 4-band satellite remote sensing imagery from Planet Labs is combined with a deep learning model to automatically identify and monitor fires and burned areas.
Result: Across five case studies in Sudan, the method captures active fires and charred areas more accurately than the baseline; using 8-band imagery or time series of imagery brings only marginal gains.
Conclusion: A deep learning method based on 4-band imagery enables efficient, low-cost, near-real-time monitoring of fire damage in conflict areas and has practical potential.
Abstract: The challenges of the ongoing war in Sudan highlight the need for rapid monitoring and analysis of such conflicts. Advances in deep learning and readily available satellite remote sensing imagery allow for near-real-time monitoring. This paper uses 4-band imagery from Planet Labs with a deep learning model to show that fire damage in armed conflicts can be monitored with minimal delay. We demonstrate the effectiveness of our approach using five case studies in Sudan. We show that, compared to a baseline, the automated method captures the active fires and charred areas more accurately. Our results indicate that using 8-band imagery or time series of such imagery yields only marginal gains.
[22] Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
Zekai Luo,Zongze Du,Zhouhang Zhu,Hao Zhong,Muzhi Zhu,Wen Wang,Yuling Xi,Chenchen Jing,Hao Chen,Chunhua Shen
Main category: cs.CV
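TL;DR: This paper presents LivingSwap, the first video-reference-guided video face swapping model, which achieves high-fidelity, temporally consistent face swapping over long sequences through keyframe conditioning and video reference guidance.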
Details
Motivation: In film and entertainment production, existing methods struggle to maintain both high fidelity and temporal consistency over long, complex video sequences, which calls for a new approach that exploits the rich visual attributes of the source video.
Method: LivingSwap uses keyframes as conditioning signals to inject the target identity and combines them with video reference guidance to perform temporal stitching. A paired dataset, Face2Face, is constructed, and its data pairs are reversed to strengthen supervision.
Result: Experiments show state-of-the-art fidelity, identity stability, and temporal consistency, with the target identity seamlessly integrated with the source video's expressions, lighting, and motion.
Conclusion: LivingSwap is the first to realize reference-guided video face swapping, markedly improving editing quality and production efficiency on long video sequences, with strong application potential.
Abstract: Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video-reference-guided face swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video's expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap
[23] Restrictive Hierarchical Semantic Segmentation for Stratified Tooth Layer Detection
Ryan Banks,Camila Lindoni Azevedo,Hongying Tang,Yunpeng Li
Main category: cs.CV
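TL;DR: Proposes a semantic segmentation framework that explicitly embeds an anatomical hierarchy and, through recurrent level-wise prediction, restrictive output heads, and top-down feature modulation, improves segmentation performance and anatomical consistency for fine-grained structures in dental panoramic radiographs.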
Details
Motivation: Existing hierarchy-aware segmentation methods encode anatomical structure mainly indirectly through the loss function, which gives weak supervision and no explicit modeling of fine anatomical levels.
Method: A general framework in which the backbone is re-run at each level of the class tree (on the original image concatenated with the previous level's logits), child-class features are modulated by parent-class probabilities via FiLM, and a probabilistic composition rule plus a hierarchical loss (Dice + cross-entropy + consistency loss) keep parent and child predictions consistent.
Result: On the newly built TL-pano dataset, hierarchical variants of UNet and HRNet outperform the baselines in IoU, Dice, and recall, especially for fine-grained structures, and produce more anatomically coherent masks, although false positives increase.
Conclusion: Explicitly modeling the anatomical hierarchy improves both the performance and the clinical plausibility of semantic segmentation in low-data dental imaging.
Abstract: Accurate understanding of anatomical structures is essential for reliably staging certain dental diseases. One way of introducing this into semantic segmentation models is to use hierarchy-aware methodologies. However, existing hierarchy-aware segmentation methods largely encode anatomical structure through the loss functions, providing weak and indirect supervision. We introduce a general framework that embeds an explicit anatomical hierarchy into semantic segmentation by coupling a recurrent, level-wise prediction scheme with restrictive output heads and top-down feature conditioning. At each depth of the class tree, the backbone is re-run on the original image concatenated with logits from the previous level. Child-class features are conditioned on their parent-class probabilities via Feature-wise Linear Modulation (FiLM), which modulates child feature spaces for fine-grained detection. A probabilistic composition rule enforces consistency between parent and descendant classes. The hierarchical loss combines per-level class-weighted Dice and cross-entropy losses with a consistency term that ensures parent predictions are the sum of their children. We validate our approach on our proposed dataset, TL-pano, containing 194 panoramic radiographs with dense instance and semantic segmentation annotations of tooth layers and alveolar bone. Utilising UNet and HRNet as donor models across a 5-fold cross-validation scheme, the hierarchical variants consistently increase IoU, Dice, and recall, particularly for fine-grained anatomies, and produce more anatomically coherent masks.
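The composition rule and consistency term described above can be illustrated with a small numeric sketch. It assumes, purely for illustration, that each child probability is the parent probability multiplied by a softmax over the sibling logits, so that children sum to their parent by construction; the function names and example values are hypothetical and are not taken from the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compose_children(parent_prob: float, child_logits: List[float]) -> List[float]:
    """Assumed rule: P(child) = P(parent) * P(child | parent)."""
    return [parent_prob * p for p in softmax(child_logits)]


def consistency_penalty(parent_prob: float, child_probs: List[float]) -> float:
    """Squared gap between a parent probability and the sum of its children."""
    return (parent_prob - sum(child_probs)) ** 2


# Hypothetical pixel: parent class probability 0.8 with three child classes.
parent_prob = 0.8
child_probs = compose_children(parent_prob, [2.0, 1.0, 0.1])
penalty = consistency_penalty(parent_prob, child_probs)
print(child_probs, penalty)
```

Under this assumed rule the penalty is zero up to floating-point error; a consistency loss of this kind only becomes active when parent and child probabilities are predicted independently.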
However, the hierarchical variants also favoured recall over precision, implying more false positives. The results demonstrate that explicit hierarchical structuring improves both performance and clinical plausibility, especially in low-data dental imaging regimes.
[24] FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models
Jiyoon Pyo,Yuankun Jiao,Dongwon Jung,Zekun Li,Leeje Jang,Sofia Kirsanova,Jina Kim,Yijun Lin,Qin Liu,Junyi Xie,Hadi Askari,Nan Xu,Muhao Chen,Yao-Yi Chiang
Main category: cs.CV
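TL;DR: This paper introduces FRIEDA, a benchmark for evaluating large vision-language models (LVLMs) on complex, open-ended cartographic reasoning. Unlike prior work that treats maps as a special case of charts, FRIEDA emphasizes understanding of layered symbology, spatial relations (topological, metric, directional), and cross-map reasoning. Experiments show that current state-of-the-art models perform far below human level, revealing a marked deficit in the spatial intelligence of LVLMs.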
Details
Motivation: Existing vision-language model work on map understanding mostly treats maps as a kind of chart and fails to fully evaluate complex reasoning over spatial relations and multiple maps. Real applications such as disaster response and urban planning require strong cartographic reasoning, so a dedicated benchmark is needed.
Method: Following classifications in the GIS literature, the FRIEDA benchmark is built from real map images across many domains and geographic regions and covers three categories of spatial relations: topological, metric, and directional. All questions require multi-step reasoning, and some require cross-map grounding. Eleven state-of-the-art LVLMs are evaluated under a direct setting and a contextual setting.
Result: Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, reach only 38.20% and 37.20% accuracy on FRIEDA, far below human accuracy of 84.87%, which shows a significant gap in multi-step cartographic reasoning.
Conclusion: FRIEDA exposes the limitations of current LVLMs on complex cartographic reasoning and provides a rigorous benchmark for advancing spatial intelligence.
Abstract: Cartographic reasoning is the skill of interpreting geographic relationships by aligning legends, map scales, compass directions, map texts, and geometries across one or more map images. Although essential as a concrete cognitive capability and for critical tasks such as disaster response and urban planning, it remains largely unevaluated. Building on progress in chart and infographic understanding, recent large vision language model studies on map visual question-answering often treat maps as a special case of charts. In contrast, map VQA demands comprehension of layered symbology (e.g., symbols, geometries, and text labels) as well as spatial relations tied to orientation and distance that often span multiple maps and are not captured by chart-style evaluations. To address this gap, we introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs. FRIEDA sources real map images from documents and reports in various domains and geographical areas. Following classifications in Geographic Information System (GIS) literature, FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation). All questions require multi-step inference, and many require cross-map grounding and reasoning. We evaluate eleven state-of-the-art LVLMs under two settings: (1) the direct setting, where we provide the maps relevant to the question, and (2) the contextual setting, where the model may have to identify the maps relevant to the question before reasoning. Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, respectively, far below human performance of 84.87%. These results reveal a persistent gap in multi-step cartographic reasoning, positioning FRIEDA as a rigorous benchmark to drive progress on spatial intelligence in LVLMs.
[25] SSplain: Sparse and Smooth Explainer for Retinopathy of Prematurity Classification
Elifnur Sunger,Tales Imbiriba,Peter Campbell,Deniz Erdogmus,Stratis Ioannidis,Jennifer Dy
Main category: cs.CV
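TL;DR: This paper proposes Sparse and Smooth Explainer (SSplain), an explanation method for classifying Retinopathy of Prematurity (ROP) from fundus images. It formulates an optimization problem solved with ADMM to generate pixel-wise explanations that preserve image structure (smoothness and sparsity). Experiments show that it outperforms existing methods in post-hoc accuracy and smoothness analyses, identifies key regions consistent with clinically discriminative features, and generalizes well.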
Details
Motivation: Existing explanation methods for neural networks used in medical diagnosis often fail to preserve structural properties of the input image, such as smoothness and sparsity, which makes explanations less realistic and reliable and undermines clinicians' trust in the model.
Method: SSplain formulates an optimization problem with combinatorial constraints and solves it with the Alternating Direction Method of Multipliers (ADMM) to produce pixel-wise explanation maps that are both smooth and sparse.
Result: SSplain outperforms commonly used explainers in post-hoc accuracy and smoothness, identifies ROP-discriminative features consistent with clinical understanding, and generalizes to additional public datasets.
Conclusion: SSplain generates higher-quality explanations that respect image structure, improving the interpretability and trustworthiness of black-box models in medical image diagnosis and supporting clinical use.
Abstract: Neural networks are frequently used in medical diagnosis. However, because these networks are black boxes, model explainers are used to help clinicians better understand and trust model outputs. This paper introduces an explainer method for classifying Retinopathy of Prematurity (ROP) from fundus images. Previous methods fail to generate explanations that preserve input image structures such as smoothness and sparsity. We introduce Sparse and Smooth Explainer (SSplain), a method that generates pixel-wise explanations while preserving image structures by enforcing smoothness and sparsity. This results in realistic explanations to enhance the understanding of the given black-box model. To achieve this goal, we define an optimization problem with combinatorial constraints and solve it using the Alternating Direction Method of Multipliers (ADMM). Experimental results show that SSplain outperforms commonly used explainers in terms of both post-hoc accuracy and smoothness analyses. Additionally, SSplain identifies features that are consistent with domain-understandable features that clinicians consider as discriminative factors for ROP. We also show SSplain's generalization by applying it to additional publicly available datasets. Code is available at https://github.com/neu-spiral/SSplain.
[26] Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment
Youngjoon Jang,Liliane Momeni,Zifan Jiang,Joon Son Chung,Gül Varol,Andrew Zisserman
Main category: cs.CV
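TL;DR: This paper proposes a unified model for sign language understanding that performs both sign language translation (SLT) and sign-subtitle alignment (SSA), achieving state-of-the-art performance on multiple datasets and showing cross-lingual generalization.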
Details
Motivation: A unified model that handles both sign language translation and temporal alignment is needed to support practical communication, large-scale corpus construction, and educational applications.
Method: A lightweight visual backbone extracts features from human keypoints and lip-region images; a Sliding Perceiver mapping network aggregates consecutive visual features into word-level embeddings; and a multi-task scalable training strategy jointly optimizes SLT and SSA.
Result: The model achieves state-of-the-art SLT and SSA performance on BOBSL (BSL) and shows robust zero-shot generalization and fine-tuned performance on How2Sign (ASL).
Conclusion: With multilingual pretraining and an effective architecture, the model supports understanding and translation across sign languages, with good cross-lingual generalization and practical promise.
Abstract: Our aim is to develop a unified model for sign language understanding that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Together, these two tasks enable the conversion of continuous signing videos into spoken language text and also the temporal alignment of signing with subtitles, both of which are essential for practical communication, large-scale corpus construction, and educational applications. To achieve this, our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images while preserving signer privacy; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings to bridge the vision-text gap; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA, reinforcing both linguistic and temporal alignment. To promote cross-linguistic generalisation, we pretrain our model on large-scale sign-text corpora covering British Sign Language (BSL) and American Sign Language (ASL) from the BOBSL and YouTube-SL-25 datasets. With this multilingual pretraining and strong model design, we achieve state-of-the-art results on the challenging BOBSL (BSL) dataset for both SLT and SSA. Our model also demonstrates robust zero-shot generalisation and finetuned SLT performance on How2Sign (ASL), highlighting the potential of scalable translation across different sign languages.
[27] Towards Sustainable Universal Deepfake Detection with Frequency-Domain Masking
Chandler Timm C. Doloriel,Habib Ullah,Kristian Hovde Liland,Fadi Al Machot,Ngai-Man Cheung
Main category: cs.CV
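TL;DR: Proposes a frequency-domain masking training strategy for deepfake detection that uses random masking and geometric transformations to improve generalization to unseen generators. It keeps detection accuracy high while tolerating substantial model pruning, yielding an efficient, scalable, and energy-conscious universal deepfake detector.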
Details
Motivation: Universal deepfake detection must generalize well to newly emerging generative models, including unseen ones, while keeping computational cost low enough for large-scale screening, in line with the goals of Green AI.
Method: Frequency-domain masking is introduced as a training strategy, combined with random masking and geometric transformations, with a focus on using frequency information to improve robustness and generalization; its stability under model pruning is also verified.
Result: The method achieves state-of-the-art generalization on GAN- and diffusion-generated image datasets and remains stable under structured pruning, significantly reducing resource consumption.
Conclusion: Frequency-domain masking is an efficient, sustainable approach to deepfake detection that generalizes well and offers a practical path for Green AI applications.
Abstract: Universal deepfake detection aims to identify AI-generated images across a broad range of generative models, including unseen ones. This requires robust generalization to new and unseen deepfakes, which emerge frequently, while minimizing computational overhead to enable large-scale deepfake screening, a critical objective in the era of Green AI. In this work, we explore frequency-domain masking as a training strategy for deepfake detectors. Unlike traditional methods that rely heavily on spatial features or large-scale pretrained models, our approach introduces random masking and geometric transformations, with a focus on frequency masking due to its superior generalization properties. We demonstrate that frequency masking not only enhances detection accuracy across diverse generators but also maintains performance under significant model pruning, offering a scalable and resource-conscious solution. Our method achieves state-of-the-art generalization on GAN- and diffusion-generated image datasets and exhibits consistent robustness under structured pruning. These results highlight the potential of frequency-based masking as a practical step toward sustainable and generalizable deepfake detection. Code and models are available at: https://github.com/chandlerbing65nm/FakeImageDetection
[28] Mask to Adapt: Simple Random Masking Enables Robust Continual Test-Time Learning
Chandler Timm C. Doloriel
Main category: cs.CV
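TL;DR: This paper proposes Mask to Adapt (M2A), a simple continual test-time adaptation method that combines random masking with a mask consistency loss and an entropy minimization loss to improve the robustness of image classifiers under strong corruption, without relying on uncertainty estimates or attention mechanisms.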
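The two objectives named above can be sketched for a single test sample as follows. The exact loss forms are not given in this entry, so the sketch assumes a variance-style consistency term (the squared distance of each view's prediction from the mean prediction across views) and the Shannon entropy of each view's softmax output; `m2a_style_loss` and `entropy_weight` are hypothetical names.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; low values mean confident predictions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mask_consistency(view_probs: List[List[float]]) -> float:
    """Assumed form: mean squared distance of each view from the mean view."""
    n_views = len(view_probs)
    n_classes = len(view_probs[0])
    mean = [sum(v[c] for v in view_probs) / n_views for c in range(n_classes)]
    total = sum((v[c] - mean[c]) ** 2 for v in view_probs for c in range(n_classes))
    return total / n_views


def m2a_style_loss(view_logits: List[List[float]], entropy_weight: float = 1.0) -> float:
    """Consistency across masked views plus weighted mean entropy."""
    view_probs = [softmax(logits) for logits in view_logits]
    mean_entropy = sum(entropy(p) for p in view_probs) / len(view_probs)
    return mask_consistency(view_probs) + entropy_weight * mean_entropy


# Logits of one image under two masked views (illustrative values).
agreeing = m2a_style_loss([[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
disagreeing = m2a_style_loss([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
print(agreeing, disagreeing)
```

Views that agree and are confident give a lower loss than views that disagree, which is the behaviour both objectives are meant to encourage.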
Details
Motivation: Existing CTTA methods usually depend on calibrated uncertainty or stable attention scores and are complex by design. This paper asks whether a simple random masking strategy suffices for effective adaptation under severe distribution shift.
Method: M2A generates a sequence of masked views in the spatial or frequency domain, studies two masking families (spatial: patch vs. pixel; frequency: all vs. low vs. high), and adapts the model with a mask consistency loss and an entropy minimization loss.
Result: On CIFAR10C/CIFAR100C/ImageNetC (severity 5), M2A (Spatial) reaches mean error rates of 8.3%/19.8%/39.2%, outperforming or matching strong existing baselines, while M2A (Frequency) performs worse. Ablations show that random masking is effective and robust.
Conclusion: A simple random masking strategy combined with consistency and entropy objectives is enough for effective test-time adaptation, without custom masking designs or additional signals.
Abstract: Distribution shifts at test time degrade image classifiers. Recent continual test-time adaptation (CTTA) methods use masking to regulate learning, but often depend on calibrated uncertainty or stable attention scores and introduce added complexity. We ask: do we need custom-made masking designs, or can a simple random masking schedule suffice under strong corruption? We introduce Mask to Adapt (M2A), a simple CTTA approach that generates a short sequence of masked views (spatial or frequency) and adapts with two objectives: a mask consistency loss that aligns predictions across different views and an entropy minimization loss that encourages confident outputs. Motivated by masked image modeling, we study two common masking families, spatial masking and frequency masking, and further compare subtypes within each (spatial: patch vs. pixel; frequency: all vs. low vs. high). On CIFAR10C/CIFAR100C/ImageNetC (severity 5), M2A (Spatial) attains 8.3%/19.8%/39.2% mean error, outperforming or matching strong CTTA baselines, while M2A (Frequency) lags behind. Ablations further show that simple random masking is effective and robust. These results indicate that a simple random masking schedule, coupled with consistency and entropy objectives, is sufficient to drive effective test-time adaptation without relying on uncertainty or attention signals.
[29] Identification of Deforestation Areas in the Amazon Rainforest Using Change Detection Models
Christian Massao Konishi,Helio Pedrini
Main category: cs.CV
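TL;DR: This study evaluates several Transformer-based and convolutional change detection models on remote sensing imagery for monitoring deforestation in the Amazon rainforest, improves performance through a unified dataset and pre- and post-processing techniques, and reaches an F1-score of 80.41% through model combination.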
Details
Motivation: Existing deforestation detection methods perform unsatisfactorily, often omit modern architectures such as self-attention, and lack methodological standardization, which makes different models hard to compare.
Method: Multiple change detection models are evaluated on a unified dataset, including fully convolutional networks and Transformer-based self-attention models. Pre- and post-processing techniques such as connected-component filtering, texture replacement, and image enhancement are applied, and model combination strategies are tested.
Result: Pre- and post-processing significantly improve model performance, and model combination reaches an F1-score of 80.41%, comparable to results in the recent literature.
Conclusion: Combining modern architectures, suitable pre- and post-processing, and model combination effectively improves Amazon deforestation detection and provides a comparable benchmark framework for the field.
Abstract: The preservation of the Amazon Rainforest is one of the global priorities in combating climate change, protecting biodiversity, and safeguarding indigenous cultures. The Satellite-based Monitoring Project of Deforestation in the Brazilian Legal Amazon (PRODES), a project of the National Institute for Space Research (INPE), stands out as a fundamental initiative in this effort, annually monitoring deforested areas not only in the Amazon but also in other Brazilian biomes. Recently, machine learning models have been developed using PRODES data to support this effort through the comparative analysis of multitemporal satellite images, treating deforestation detection as a change detection problem. However, existing approaches present significant limitations: models evaluated in the literature still show unsatisfactory effectiveness, many do not incorporate modern architectures, such as those based on self-attention mechanisms, and there is a lack of methodological standardization that allows direct comparisons between different studies. In this work, we address these gaps by evaluating various change detection models on a unified dataset, including fully convolutional models and networks incorporating self-attention mechanisms based on Transformers. We investigate the impact of different pre- and post-processing techniques, such as filtering deforested areas predicted by the models based on the size of connected components, texture replacement, and image enhancements; we demonstrate that such approaches can significantly improve individual model effectiveness.
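Filtering predicted areas by connected-component size, mentioned above, can be sketched in a few lines. This is a minimal illustration under stated assumptions: a binary mask stored as nested lists, 4-connectivity, and a hypothetical `min_size` threshold in pixels; the paper's actual connectivity and threshold are not given here.

```python
from collections import deque
from typing import List


def filter_small_components(mask: List[List[int]], min_size: int) -> List[List[int]]:
    """Keep only 4-connected regions of 1s that have at least min_size pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Small components are treated as noise and left as 0.
            if len(component) >= min_size:
                for y, x in component:
                    out[y][x] = 1
    return out


predicted = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
cleaned = filter_small_components(predicted, min_size=2)
print(cleaned)
```

In this toy mask the three-pixel region survives and the isolated pixel in the second row is removed.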
Additionally, we test different strategies for combining the evaluated models to achieve results superior to those obtained individually, reaching an F1-score of 80.41%, a value comparable to other recent works in the literature.
[30] CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning
Zeyuan Chen,Xiang Zhang,Haiyang Xu,Jianwen Xie,Zhuowen Tu
Main category: cs.CV
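TL;DR: Proposes CVP, a multimodal model inspired by central-peripheral vision that strengthens spatial reasoning in 3D scene understanding through a target-affinity token and an allocentric grid.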
Details
Motivation: Existing methods rely on unstructured representations and lack an explicit, high-level structural understanding of the scene, which limits spatial reasoning.
Method: Two complementary components are introduced into a Large Multimodal Model-based architecture: a target-affinity token, analogous to central vision, that directs the model's attention to query-relevant objects; and an allocentric grid, analogous to peripheral vision, that captures global scene context and spatial layout.
Result: CVP achieves state-of-the-art performance on multiple 3D scene understanding benchmarks.
Conclusion: By combining central and peripheral vision mechanisms, CVP improves structured, context-aware understanding of complex 3D environments.
Abstract: We present a central-peripheral vision-inspired framework (CVP), a simple yet effective multimodal model for spatial reasoning that draws inspiration from the two types of human visual fields: central vision and peripheral vision. Existing approaches primarily rely on unstructured representations, such as point clouds, voxels, or patch features, and inject scene context implicitly via coordinate embeddings. However, this often results in limited spatial reasoning capabilities due to the lack of explicit, high-level structural understanding. To address this limitation, we introduce two complementary components into a Large Multimodal Model-based architecture: a target-affinity token, analogous to central vision, which guides the model's attention toward query-relevant objects; and an allocentric grid, akin to peripheral vision, which captures global scene context and spatial arrangements. These components work in tandem to enable structured, context-aware understanding of complex 3D environments. Experiments show that CVP achieves state-of-the-art performance across a range of 3D scene understanding benchmarks.
[31] Fourier-RWKV: A Multi-State Perception Network for Efficient Image Dehazing
Lirong Zheng,Yanshan Li,Rui Yu,Kaihao Zhang
Main category: cs.CV
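TL;DR: This paper proposes Fourier-RWKV, a new image dehazing framework based on a Multi-State Perception paradigm that combines spatial, frequency-domain, and semantic perceptual states to achieve efficient dehazing with linear complexity.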
Details
Motivation: Existing Transformer methods are hard to deploy in real time because of their quadratic computational complexity and have limited modeling capacity under non-uniform haze, so a more efficient and comprehensive dehazing model is needed.
Method: Fourier-RWKV comprises the DQ-Shift operation for spatial-form perception, the Fourier Mix block for frequency-domain perception, and the SBM module, which uses DSK-Fusion, for semantic-relation perception. Together they capture global context and align features with linear complexity.
Result: The model achieves leading performance on multiple benchmarks with significantly lower computational overhead, balancing dehazing quality and runtime efficiency.
Conclusion: Through its multi-state perception mechanism, Fourier-RWKV addresses dehazing under real-world non-uniform haze and offers a feasible route to real-time, high-quality dehazing.
Abstract: Image dehazing is crucial for reliable visual perception, yet it remains highly challenging under real-world non-uniform haze conditions. Although Transformer-based methods excel at capturing global context, their quadratic computational complexity hinders real-time deployment. To address this, we propose Fourier Receptance Weighted Key Value (Fourier-RWKV), a novel dehazing framework based on a Multi-State Perception paradigm. The model achieves comprehensive haze degradation modeling with linear complexity by synergistically integrating three distinct perceptual states: (1) Spatial-form Perception, realized through the Deformable Quad-directional Token Shift (DQ-Shift) operation, which dynamically adjusts receptive fields to accommodate local haze variations; (2) Frequency-domain Perception, implemented within the Fourier Mix block, which extends the core WKV attention mechanism of RWKV from the spatial domain to the Fourier domain, preserving the long-range dependencies essential for global haze estimation while mitigating spatial attenuation; (3) Semantic-relation Perception, facilitated by the Semantic Bridge Module (SBM), which utilizes Dynamic Semantic Kernel Fusion (DSK-Fusion) to precisely align encoder-decoder features and suppress artifacts. Extensive experiments on multiple benchmarks demonstrate that Fourier-RWKV delivers state-of-the-art performance across diverse haze scenarios while significantly reducing computational overhead, establishing a favorable trade-off between restoration quality and practical efficiency. Code is available at: https://github.com/Dilizlr/Fourier-RWKV.
[32] Accuracy Does Not Guarantee Human-Likeness in Monocular Depth Estimators
Yuki Kubota,Taiki Fukiage
Main category: cs.CV
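TL;DR: This study systematically analyzes 69 monocular depth estimation models on the KITTI dataset. Although DNNs exceed human accuracy, their error patterns differ markedly from human perception: higher accuracy does not imply closer agreement with human perception, which calls for multifaceted, human-centric evaluation.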
Details
Motivation: Aligning model representations with human perception is a key strategy for improving robustness and interpretability, but it is unclear whether depth estimation also exhibits a trade-off between accuracy and human similarity, especially for natural outdoor scenes where benchmarks rely on sensor ground truth rather than human perceptual annotations.
Method: 69 monocular depth estimation models are evaluated on the KITTI dataset. Affine fitting decomposes prediction errors so that error structure can be analyzed factor by factor, and the error correlations and trade-offs between models and human perception are compared.
Result: Humans and DNNs share certain estimation biases (positive error correlations), but there is a clear trade-off between model accuracy and human similarity: highly accurate models are not necessarily closer to human perceptual patterns.
Conclusion: Improving the accuracy of depth estimation models does not guarantee more human-like behavior, so multifaceted, human-centric evaluation methods beyond traditional accuracy are needed.
Abstract: Monocular depth estimation is a fundamental capability for real-world applications such as autonomous driving and robotics. Although deep neural networks (DNNs) have achieved superhuman accuracy on physical-based benchmarks, a key challenge remains: aligning model representations with human perception, a promising strategy for enhancing model robustness and interpretability. Research in object recognition has revealed a complex trade-off between model accuracy and human-like behavior, raising the question of whether a similar divergence exists in depth estimation, particularly for natural outdoor scenes where benchmarks rely on sensor-based ground truth rather than human perceptual estimates. In this study, we systematically investigated the relationship between model accuracy and human similarity across 69 monocular depth estimators using the KITTI dataset. To dissect the structure of error patterns on a factor-by-factor basis, we applied affine fitting to decompose prediction errors into interpretable components. Intriguingly, our results reveal that while humans and DNNs share certain estimation biases (positive error correlations), there are distinct trade-off relationships between model accuracy and human similarity. This finding indicates that improving accuracy does not necessarily lead to more human-like behavior, underscoring the necessity of developing multifaceted, human-centric evaluations beyond traditional accuracy.
[33] GeoLoom: High-quality Geometric Diagram Generation from Textual Input
Xiaojing Wei,Ting Zhang,Wei He,Jingdong Wang,Hua Huang
Main category: cs.CV
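TL;DR: Proposes GeoLoom, a framework that combines natural-language-to-formal-language translation with coordinate solving to generate high-precision geometric diagrams.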
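To make the coordinate-solving step concrete, the sketch below places points so that pairwise distance constraints are approximately met, using random perturbations that are kept only when they reduce the total violation. It is a generic Monte Carlo search written for illustration: the constraint format, the function names, and the step parameters are assumptions, and GeoLingua's real constraint types and GeoLoom's solver are not described in this entry.

```python
import math
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Constraint = Tuple[str, str, float]  # (point_a, point_b, required distance)


def violation(coords: Dict[str, Point], constraints: List[Constraint]) -> float:
    """Sum of squared gaps between actual and required distances."""
    return sum((math.dist(coords[a], coords[b]) - d) ** 2 for a, b, d in constraints)


def solve_coordinates(names: List[str], constraints: List[Constraint],
                      steps: int = 20000, step_size: float = 0.5,
                      seed: int = 0) -> Tuple[Dict[str, Point], float]:
    """Random-perturbation search; a move is kept only if it lowers the violation."""
    rng = random.Random(seed)
    coords = {n: (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for n in names}
    best = violation(coords, constraints)
    for _ in range(steps):
        name = rng.choice(names)
        x, y = coords[name]
        candidate = dict(coords)
        candidate[name] = (x + rng.gauss(0.0, step_size), y + rng.gauss(0.0, step_size))
        cost = violation(candidate, constraints)
        if cost < best:
            coords, best = candidate, cost
    return coords, best


# Hypothetical input: a triangle with side lengths 3, 4, and 5.
constraints = [("A", "B", 3.0), ("B", "C", 4.0), ("A", "C", 5.0)]
coords, best = solve_coordinates(["A", "B", "C"], constraints)
print(coords, best)
```

A squared-violation score of this kind could also serve as a simple structural-deviation measure, though the metric proposed in the paper may be defined differently.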
Details
Motivation: Geometric diagram generation demands high spatial accuracy, and existing methods struggle to guarantee both correctness and interpretability.
Method: The paper designs the formal language GeoLingua with an autoformalization module and a coordinate solver based on Monte Carlo optimization, and builds the GeoNF dataset and a constraint-based evaluation metric.
Result: Experiments show that GeoLoom significantly outperforms state-of-the-art baselines in structural fidelity.
Conclusion: GeoLoom provides a principled foundation for interpretable and scalable geometric diagram generation.
Abstract: High-quality geometric diagram generation presents both a challenge and an opportunity: it demands strict spatial accuracy while offering well-defined constraints to guide generation. Inspired by recent advances in geometry problem solving that employ formal languages and symbolic solvers for enhanced correctness and interpretability, we propose GeoLoom, a novel framework for text-to-diagram generation in geometric domains. GeoLoom comprises two core components: an autoformalization module that translates natural language into a specifically designed generation-oriented formal language, GeoLingua, and a coordinate solver that maps formal constraints to precise coordinates using efficient Monte Carlo optimization. To support this framework, we introduce GeoNF, a dataset aligning natural language geometric descriptions with formal GeoLingua descriptions. We further propose a constraint-based evaluation metric that quantifies structural deviation, offering mathematically grounded supervision for iterative refinement. Empirical results demonstrate that GeoLoom significantly outperforms state-of-the-art baselines in structural fidelity, providing a principled foundation for interpretable and scalable diagram generation.
[34] Animal Re-Identification on Microcontrollers
Yubo Chen,Di Zhao,Yun Sing Koh,Talia Xu
Main category: cs.CV
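TL;DR: Proposes a low-power animal re-identification framework for microcontrollers that, through an optimized MobileNetV2 architecture and a data-efficient fine-tuning strategy, achieves competitive accuracy at a very small model size and supports fully on-device inference in field environments.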
Details
Motivation: Most existing animal re-identification models are designed for servers and are hard to deploy on microcontroller devices with small memory and low-resolution inputs, which limits their use in wildlife monitoring and livestock management.
Method: The paper characterizes the gap between state-of-the-art models and MCU hardware, adapts a systematically scaled CNN-based MobileNetV2 backbone to low-resolution inputs, and proposes a data-efficient fine-tuning strategy that needs only three images per animal identity.
Result: Across six public datasets the model is more than two orders of magnitude smaller while keeping retrieval accuracy competitive; on a self-collected cattle dataset it runs fully on-device with Top-1 accuracy matching its cluster version.
Conclusion: High-accuracy, adaptable animal re-identification is feasible on MCU-class devices, paving the way for large-scale deployment in real field settings.
Abstract: Camera-based animal re-identification (Animal Re-ID) can support wildlife monitoring and precision livestock management in large outdoor environments with limited wireless connectivity. In these settings, inference must run directly on collar tags or low-power edge nodes built around microcontrollers (MCUs), yet most Animal Re-ID models are designed for workstations or servers and are too large for devices with small memory and low-resolution inputs. We propose an on-device framework. First, we characterise the gap between state-of-the-art Animal Re-ID models and MCU-class hardware, showing that straightforward knowledge distillation from large teachers offers limited benefit once memory and input resolution are constrained. Second, guided by this analysis, we design a high-accuracy Animal Re-ID architecture by systematically scaling a CNN-based MobileNetV2 backbone for low-resolution inputs. Third, we evaluate the framework with a real-world dataset and introduce a data-efficient fine-tuning strategy to enable fast adaptation with just three images per animal identity at a new site. Across six public Animal Re-ID datasets, our compact model achieves competitive retrieval accuracy while reducing model size by over two orders of magnitude. On a self-collected cattle dataset, the deployed model performs fully on-device inference with only a small accuracy drop and unchanged Top-1 accuracy relative to its cluster version. We demonstrate that practical, adaptable Animal Re-ID is achievable on MCU-class devices, paving the way for scalable deployment in real field environments.
[35] Blur2Sharp: Human Novel Pose and View Synthesis with Generative Prior Refinement
Chia-Hern Lai,I-Hsuan Lo,Yen-Ku Yeh,Thanh-Nguyen Truong,Ching-Chun Huang
Main category: cs.CV
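TL;DR: Proposes Blur2Sharp, a framework that combines 3D-aware neural rendering with diffusion models to generate sharp, geometrically consistent novel-view human images from a single reference view.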
Details
Motivation: Existing methods suffer from geometric inconsistency or blurry images in multi-view generation, especially under complex poses and occlusions.
Method: A dual-conditioning architecture: a Human NeRF first produces multi-view renderings that carry 3D structural guidance; a diffusion model then refines the images conditioned on these renderings, with hierarchical feature fusion injecting texture, normal, and semantic priors from the SMPL model.
Result: Significantly outperforms existing state-of-the-art methods on novel pose and novel view generation, with stronger detail preservation and structural consistency, particularly for loose clothing and occlusions.
Conclusion: Blur2Sharp effectively addresses realistic human image generation driven by a single view, achieving high-fidelity, geometrically consistent multi-view synthesis.
Abstract: The creation of lifelike human avatars capable of realistic pose variation and viewpoint flexibility remains a fundamental challenge in computer vision and graphics. Current approaches typically yield either geometrically inconsistent multi-view images or sacrifice photorealism, resulting in blurry outputs under diverse viewing angles and complex motions. To address these issues, we propose Blur2Sharp, a novel framework integrating 3D-aware neural rendering and diffusion models to generate sharp, geometrically consistent novel-view images from only a single reference view. Our method employs a dual-conditioning architecture: initially, a Human NeRF model generates geometrically coherent multi-view renderings for target poses, explicitly encoding 3D structural guidance. Subsequently, a diffusion model conditioned on these renderings refines the generated images, preserving fine-grained details and structural fidelity. We further enhance visual quality through hierarchical feature fusion, incorporating texture, normal, and semantic priors extracted from parametric SMPL models to simultaneously improve global coherence and local detail accuracy. Extensive experiments demonstrate that Blur2Sharp consistently surpasses state-of-the-art techniques in both novel pose and view generation tasks, particularly excelling under challenging scenarios involving loose clothing and occlusions.
[36] Beyond Real Weights: Hypercomplex Representations for Stable Quantization
Jawad Ibn Ahad,Maisha Rahman,Amrijit Biswas,Muhammad Rafsan Kabir,Robin Krambroeckers,Sifat Momen,Nabeel Mohammed,Shafin Rahman
Main category: cs.CV
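TL;DR: Proposes a progressive reparameterization strategy that compresses multimodal language models by gradually replacing dense feed-forward network blocks with compact PHM layers, substantially reducing parameter count and inference latency while preserving performance.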
Details
Motivation: Multimodal language models must align high-dimensional visual and linguistic features, which makes them parameter-heavy, computationally expensive, and hard to deploy efficiently.
Method: A progressive reparameterization strategy replaces dense FFN blocks with PHM layers, combined with a residual interpolation schedule, a lightweight reconstruction loss, and a knowledge distillation loss so that the PHM modules inherit the original functional behavior.
Result: Validated on multiple vision-language models, the method yields substantial reductions in parameters and FLOPs while keeping performance and multimodal alignment comparable to the base models.
Conclusion: Progressive PHM substitution offers a compression path for efficient multimodal inference that is compatible with existing architectures and complementary to low-bit quantization.
Abstract: Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual interpolation schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training. This transition yields substantial parameter and FLOP reductions while preserving strong multimodal alignment, enabling faster inference without degrading output quality. We evaluate the approach on multiple vision-language models (VLMs). Our method maintains performance comparable to the base models while delivering significant reductions in model size and inference latency. Progressive PHM substitution thus offers an architecture-compatible path toward more efficient multimodal reasoning and complements existing low-bit quantization techniques.
[37] VisKnow: Constructing Visual Knowledge Base for Object Understanding
Ziwei Yao,Qiyang Wan,Ruiping Wang,Xilin Chen
Main category: cs.CV
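TL;DR: This paper proposes VisKnow, a framework for constructing structured multi-modal visual knowledge bases to support in-depth understanding of object categories. As a case study, it builds AnimalKB, covering 406 animal categories, and validates its effectiveness on tasks such as zero-shot recognition and fine-grained visual question answering.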
Details
Motivation: Existing data for object understanding is usually task-oriented and not systematically organized, so it struggles to support a comprehensive understanding of object categories; a systematic, multi-modal knowledge structure is needed to improve visual understanding.
Method: The VisKnow framework combines expert design with large-scale models to extract object-level multi-modal knowledge from image and text data and organizes it as a graph-structured visual knowledge base; the concrete instance is AnimalKB, which contains textual triplets, images, and region annotations.
Result: AnimalKB contains 22K textual triplets, 420K images, and region annotations; it improves zero-shot recognition and fine-grained VQA and serves as a challenging benchmark for knowledge graph completion and part segmentation.
Conclusion: Automatically constructing structured visual knowledge bases is an effective route to deeper object understanding and a range of vision tasks, with broad application potential.
Abstract: Understanding objects is fundamental to computer vision. Beyond object recognition that provides only a category label as typical output, in-depth object understanding represents a comprehensive perception of an object category, involving its components, appearance characteristics, inter-category relationships, contextual background knowledge, etc. Developing such capability requires sufficient multi-modal data, including visual annotations such as parts, attributes, and co-occurrences for specific tasks, as well as textual knowledge to support high-level tasks like reasoning and question answering. However, these data are generally task-oriented and not systematically organized enough to achieve the expected understanding of object categories. In response, we propose the Visual Knowledge Base that structures multi-modal object knowledge as graphs, and present a construction framework named VisKnow that extracts multi-modal, object-level knowledge for object understanding. This framework integrates enriched aligned text and image-source knowledge with region annotations at both object and part levels through a combination of expert design and large-scale model application. As a specific case study, we construct AnimalKB, a structured animal knowledge base covering 406 animal categories, which contains 22K textual knowledge triplets extracted from encyclopedic documents, 420K images, and corresponding region annotations.
A series of experiments showcase how AnimalKB enhances object-level visual tasks such as zero-shot recognition and fine-grained VQA, and serves as challenging benchmarks for knowledge graph completion and part segmentation. Our findings highlight the potential of automatically constructing visual knowledge bases to advance visual understanding and its practical applications. The project page is available at https://vipl-vsu.github.io/VisKnow.
[38] Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture
Samuel Ebimobowei Johnny,Blessed Guda,Emmanuel Enejo Aaron,Assane Gueye
Main category: cs.CV
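TL;DR: This paper introduces a new task, Sign Language Spotting, which aims to detect whether a specific sign video is present in a continuous signed sentence. The authors propose an end-to-end model based on pose keypoints that matches the query sign directly against the sentence-level sign sequence, avoiding intermediate gloss recognition or text matching. On the WSLP 2025 shared task dataset, the method reaches 61.88% accuracy and a 60.00% F1-score, with low computational cost and strong robustness to noise.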
Details
Motivation: Existing sign language recognition research mostly targets whole-sentence recognition or classification and lacks the ability to retrieve a specific sign within a sign sequence. More flexible sign language retrieval and interaction require a method that can decide whether a given sign appears in a long sequence, which motivates the new Sign Language Spotting task.
Method: An encoder-only backbone directly processes pose keypoints extracted from sign videos, and a binary classification head decides whether the query sign is present in the target sequence. The model does not rely on RGB frames or intermediate gloss representations, so matching is learned end to end.
Result: The model reaches 61.88% accuracy and a 60.00% F1-score on the Word Presence Prediction dataset, supporting the effectiveness of pose representations for sign language spotting while markedly reducing computational overhead and the impact of visual noise.
Conclusion: An end-to-end framework based on pose keypoints is an effective solution for sign language spotting and advances research on sign language retrieval and verification, with practical application potential.
Abstract: Automatic Sign Language Recognition (ASLR) has emerged as a vital field for bridging the gap between deaf and hearing communities. However, the problem of sign-to-sign retrieval or detecting a specific sign within a sequence of continuous signs remains largely unexplored. We define this novel task as Sign Language Spotting. In this paper, we present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video within a sentence-level gloss or sign video. Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos. Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence. By focusing on pose representations instead of raw RGB frames, our method significantly reduces computational cost and mitigates visual noise. We evaluate our approach on the Word Presence Prediction dataset from the WSLP 2025 shared task, achieving 61.88% accuracy and 60.00% F1-score. These results demonstrate the effectiveness of our pose-based framework for Sign Language Spotting, establishing a strong foundation for future research in automatic sign language retrieval and verification. Code is available at https://github.com/EbimoJohnny/Pose-Based-Sign-Language-Spotting
[39] SOP^2: Transfer Learning with Scene-Oriented Prompt Pool on 3D Object Detection
Ching-Hung Cheng,Hsiu-Fu Wu,Bing-Chen Wu,Khanh-Phong Bui,Van-Tin Luu,Ching-Chun Huang
Main category: cs.CV
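TL;DR: This paper examines the effectiveness of prompt tuning methods in 3D object detection, proposes a Scene-Oriented Prompt Pool (SOP^2), and validates its effectiveness for cross-dataset transfer.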
Details
Motivation: To explore whether prompt tuning techniques from large language models transfer effectively to 3D object detection, and to build a general foundation model from a large-scale dataset.
Method: After analysing the impact of prompt tokens and prompt generators, the authors design and propose the Scene-Oriented Prompt Pool (SOP^2) to improve the model's adaptability across scenes.
Result: Experiments indicate that SOP^2 performs well when transferring from large-scale datasets such as Waymo to other 3D detection scenarios, supporting the effectiveness of prompt pools.
Conclusion: Prompt tuning shows promise for 3D object detection; SOP^2 offers a new direction for future work and encourages deeper exploration of prompting mechanisms in 3D vision.
Abstract: With the rise of Large Language Models (LLMs) such as GPT-3, these models exhibit strong generalization capabilities. Through transfer learning techniques such as fine-tuning and prompt tuning, they can be adapted to various downstream tasks with minimal parameter adjustments. This approach is particularly common in the field of Natural Language Processing (NLP). This paper aims to explore the effectiveness of common prompt tuning methods in 3D object detection. We investigate whether a model trained on the large-scale Waymo dataset can serve as a foundation model and adapt to other scenarios within the 3D object detection field. This paper sequentially examines the impact of prompt tokens and prompt generators, and further proposes a Scene-Oriented Prompt Pool (SOP^2). We demonstrate the effectiveness of prompt pools in 3D object detection, with the goal of inspiring future researchers to delve deeper into the potential of prompts in the 3D field.
[40] New VVC profiles targeting Feature Coding for Machines
Md Eimran Hossain Eimon,Ashan Perera,Juan Merlos,Velibor Adzic,Hari Kalva
Main category: cs.CV
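TL;DR: This paper studies the use of Versatile Video Coding (VVC) to compress intermediate neural network features in split inference systems and proposes three lightweight VVC profiles (Fast, Faster, Fastest) for the MPEG-AI FCM standard, which substantially cut encoding time while keeping high compression efficiency and task accuracy.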
Details
Motivation: Conventional video codecs are optimized for pixel data and human visual perception, but these assumptions no longer hold in split inference systems that transmit intermediate neural network features, so efficient compression tailored to abstract, sparse, task-specific features needs to be studied.
Method: Building on the Versatile Video Coding (VVC) standard, the authors analyse each coding tool individually, assess its effect on compression efficiency and downstream vision task accuracy, and use the findings to design three lightweight VVC profiles: Fast, Faster, and Fastest.
Result: The Fast profile achieves a 2.96% BD-Rate gain while reducing encoding time by 21.8%; Faster achieves a 1.85% BD-Rate gain with a 51.5% speedup; Fastest reduces encoding time by 95.6% with only a 1.71% BD-Rate loss.
Conclusion: For compressing intermediate features intended for machine perception, efficient lightweight coding configurations can be designed by simplifying the VVC tool set, greatly lowering computational cost while maintaining good compression performance and task accuracy, which makes practical deployment feasible.
Abstract: Modern video codecs have been extensively optimized to preserve perceptual quality, leveraging models of the human visual system. However, in split inference systems, where intermediate features from a neural network are transmitted instead of pixel data, these assumptions no longer apply. Intermediate features are abstract, sparse, and task-specific, making perceptual fidelity irrelevant. In this paper, we investigate the use of Versatile Video Coding (VVC) for compressing such features under the MPEG-AI Feature Coding for Machines (FCM) standard. We perform a tool-level analysis to understand the impact of individual coding components on compression efficiency and downstream vision task accuracy. Based on these insights, we propose three lightweight essential VVC profiles: Fast, Faster, and Fastest. The Fast profile provides 2.96% BD-Rate gain while reducing encoding time by 21.8%. Faster achieves a 1.85% BD-Rate gain with a 51.5% speedup. Fastest reduces encoding time by 95.6% with only a 1.71% loss in BD-Rate.
[41] MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models
Jusheng Zhang,Kaitong Cai,Xiaoyang Guo,Sidi Liu,Qinhan Lv,Ruiqi Chen,Jing Yang,Yijia Fan,Xiaofei Sun,Jian Wang,Ziliang Chen,Liang Lin,Keze Wang
Main category: cs.CV
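TL;DR: This paper presents MM-CoT, a diagnostic benchmark for evaluating whether multimodal models' visual reasoning is truly visually grounded and logically consistent. Unlike existing generative tasks, MM-CoT requires models to select the single correct reasoning chain that satisfies both visual consistency and logical coherence, and uses adversarial distractors to expose reasoning flaws. Experiments show that current state-of-the-art models perform poorly on this task, revealing a large gap between generative fluency and genuine reasoning ability.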
Details
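Motivation: Existing multimodal reasoning benchmarks focus on generating reasoning traces but lack a mechanism for verifying whether a reasoning chain is grounded in visual evidence and logically self-consistent, so a new benchmark is needed that specifically evaluates visual grounding and logical consistency.
Method: The MM-CoT benchmark requires models to choose, from several event chains, the only option that satisfies two orthogonal constraints: (i) visual consistency (every step is supported by observable visual evidence) and (ii) logical coherence (causal and commonsense validity). Adversarial distractors are designed to violate one of the constraints each, so that specific types of reasoning failure can be detected.
Result: Evaluating mainstream vision-language models on MM-CoT shows that even the most advanced models perform poorly, indicating a marked disconnect between generative fluency and actual reasoning fidelity. MM-CoT correlates weakly with existing benchmarks, which indicates that it measures a distinct combination of abilities.
Conclusion: MM-CoT exposes fundamental weaknesses of current multimodal models in chain-of-thought reasoning and provides a foundation for developing future models that reason faithfully and coherently on the basis of visual evidence.
Abstract: The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning.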
This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.
[42] Geometry-Aware Sparse Depth Sampling for High-Fidelity RGB-D Depth Completion in Robotic Systems
Tony Salloom,Dandi Zhou,Xinhai Sun
Main category: cs.CV
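TL;DR: This paper proposes a normal-guided sparse depth sampling strategy that estimates surface normals with PCA to compute per-pixel depth reliability and uses it to generate sparse depth maps closer to real sensor behavior; combined with the Marigold-DC model, it improves depth completion accuracy and edge quality on NYU Depth v2.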
Details
Motivation: Existing depth completion methods generate sparse depth in an overly idealized way (for example, random sampling), ignoring that real sensors have spatially nonuniform depth reliability that depends on geometry, which disconnects training from real use.
Method: PCA is applied to the RGB-D point cloud to estimate surface normals, from which a per-pixel depth reliability measure is built; sparse depth is then sampled according to this probability distribution. The strategy is integrated into the diffusion model Marigold-DC and trained and evaluated on NYU Depth v2.
Result: Experiments indicate that the method improves depth completion accuracy under the standard metrics, reduces artifacts at edges and depth discontinuities, and yields training conditions closer to real sensor behavior.
Conclusion: A geometry-aware sparse depth sampling strategy can improve both the performance and the real-world applicability of depth completion models and offers a more realistic modelling approach for future work.
Abstract: Accurate three-dimensional perception is essential for modern industrial robotic systems that perform manipulation, inspection, and navigation tasks. RGB-D and stereo vision sensors are widely used for this purpose, but the depth maps they produce are often noisy, incomplete, or biased due to sensor limitations and environmental conditions. Depth completion methods aim to generate dense, reliable depth maps from RGB images and sparse depth input. However, a key limitation in current depth completion pipelines is the unrealistic generation of sparse depth: sparse pixels are typically selected uniformly at random from dense ground-truth depth, ignoring the fact that real sensors exhibit geometry-dependent and spatially nonuniform reliability. In this work, we propose a normal-guided sparse depth sampling strategy that leverages PCA-based surface normal estimation on the RGB-D point cloud to compute a per-pixel depth reliability measure. The sparse depth samples are then drawn according to this reliability distribution. We integrate this sampling method with the Marigold-DC diffusion-based depth completion model and evaluate it on NYU Depth v2 using the standard metrics. Experiments show that our geometry-aware sparse depth improves accuracy, reduces artifacts near edges and discontinuities, and produces more realistic training conditions that better reflect real sensor behavior.
[43] FastBEV++: Fast by Algorithm, Deployable by Design
Yuanpeng Chen,Hui Song,Wei Tao,ShanHui Mo,Shuang Zhang,Xiao Hua,TianKun Zhao
Main category: cs.CV
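TL;DR: FastBEV++ is an efficient, deployment-friendly BEV perception framework that unifies high performance and high efficiency through algorithm design and standardized operators, reaching 0.359 NDS on nuScenes at over 134 FPS.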
Details
Motivation: Existing camera-based BEV perception methods depend on computationally expensive view transformations and custom CUDA kernels, which makes it hard to combine strong performance with practical on-vehicle deployment.
Method: The work rests on two principles, "Fast by Algorithm" and "Deployable by Design". It decomposes the projection into an Index-Gather-Reshape pipeline so that the view transformation runs on standard operators with no custom kernels, and it improves BEV representation quality with depth-aware fusion, temporal aggregation, and data augmentation.
Result: On the nuScenes dataset, the method sets a new state-of-the-art NDS of 0.359 while running in real time at over 134 FPS on a Tesla T4.
Conclusion: FastBEV++ offers an accurate, scalable BEV perception solution that needs no custom plugins, advancing the industrial deployment of camera-only BEV perception in autonomous driving systems.
Abstract: The advancement of camera-only Bird's-Eye-View (BEV) perception is currently impeded by a fundamental tension between state-of-the-art performance and on-vehicle deployment tractability. This bottleneck stems from a deep-rooted dependency on computationally prohibitive view transformations and bespoke, platform-specific kernels. This paper introduces FastBEV++, a framework engineered to reconcile this tension, demonstrating that high performance and deployment efficiency can be achieved in unison via two guiding principles: Fast by Algorithm and Deployable by Design. We realize the "Deployable by Design" principle through a novel view transformation paradigm that decomposes the monolithic projection into a standard Index-Gather-Reshape pipeline. Enabled by a deterministic pre-sorting strategy, this transformation is executed entirely with elementary, operator-native primitives (e.g., Gather, Matrix Multiplication), which eliminates the need for specialized CUDA kernels and ensures fully TensorRT-native portability. Concurrently, our framework is "Fast by Algorithm", leveraging this decomposed structure to seamlessly integrate an end-to-end, depth-aware fusion mechanism. This jointly learned depth modulation, further bolstered by temporal aggregation and robust data augmentation, significantly enhances the geometric fidelity of the BEV representation. Empirical validation on the nuScenes benchmark corroborates the efficacy of our approach. FastBEV++ establishes a new state-of-the-art 0.359 NDS while maintaining exceptional real-time performance, exceeding 134 FPS on automotive-grade hardware (e.g., Tesla T4).
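The Index-Gather-Reshape decomposition described in this abstract can be illustrated with a minimal NumPy sketch. It is not the FastBEV++ implementation: the function name `gather_bev`, the flat lookup-table layout, and the use of -1 for BEV cells that no pixel projects into are all assumptions made for illustration. The point is only that, once the voxel-to-pixel index is precomputed, the view transformation reduces to one indexed read and one reshape.

```python
import numpy as np


def gather_bev(image_feats, voxel_to_pixel, bev_shape):
    """Fill a BEV grid from image features using only gather and reshape.

    image_feats:    (num_pixels, channels) flattened camera feature map
    voxel_to_pixel: (num_voxels,) precomputed flat pixel index per BEV cell,
                    or -1 where no pixel projects into the cell (assumed convention)
    bev_shape:      (height, width) of the BEV grid, height * width == num_voxels
    """
    num_pixels, channels = image_feats.shape
    # Append one all-zero row so that empty cells can be redirected to it.
    zero_row = np.zeros((1, channels), image_feats.dtype)
    padded = np.concatenate([image_feats, zero_row])
    safe_index = np.where(voxel_to_pixel < 0, num_pixels, voxel_to_pixel)
    # Gather: a single indexed read over the precomputed table.
    gathered = padded[safe_index]
    # Reshape: lay the gathered rows out as a (height, width, channels) grid.
    return gathered.reshape(bev_shape[0], bev_shape[1], channels)


# Toy example: 4 image pixels with 2 channels, projected onto a 2x3 BEV grid.
feats = np.arange(8, dtype=np.float32).reshape(4, 2)
lookup = np.array([0, 3, -1, 2, 2, 1])
bev = gather_bev(feats, lookup, (2, 3))
```

In this toy setup the lookup table is written by hand; in a real pipeline it would be derived once from camera calibration, which is what allows the per-frame work to stay within standard operators.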
By offering a solution that is free of custom plugins yet highly accurate, FastBEV++ presents a mature and scalable design philosophy for production autonomous systems. The code is released at: https://github.com/ymlab/advanced-fastbev
[44] HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
Jusheng Zhang,Xiaoyang Guo,Kaitong Cai,Qinhan Lv,Yijia Fan,Wenhao Chai,Jian Wang,Keze Wang
Main category: cs.CV
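TL;DR: HTC-VLM is a hybrid vision-language model framework that handles fine-grained details and high-level semantics through separate continuous and discrete channels, retaining 87.2% of performance at a 580:1 compression ratio and effectively addressing the trade-off between computational efficiency and representational fidelity.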
Details
Motivation: Existing vision-language models face a conflict between efficiency and fidelity when compressing visual tokens: continuous compression weakens high-level semantics, while discrete quantization loses fine detail.
Method: HTC-VLM uses a dual-channel architecture. A continuous pathway keeps the detail of ViT image patches, and a discrete pathway uses MGVQ quantization to produce four symbolic anchors. A disentanglement attention mask and a bottleneck compress the 580 hybrid tokens into a single voco token, giving an efficient and grounded fused representation.
Result: Across seven benchmarks (GQA, VQAv2, and others), the model retains 87.2% of performance on average, ahead of the best continuous baseline at 81.0%, at a 580:1 compression ratio. Attention analysis shows that the compressed token attends more to the discrete anchors, supporting their role as semantic guidance.
Conclusion: With a minimalist hybrid design, HTC-VLM eases the efficiency-fidelity dilemma in vision-language models and points to a new direction for scalable multimodal models.
Abstract: Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed into a single voco token via a disentanglement attention mask and bottleneck, ensuring efficient and grounded representations. HTC-VLM achieves an average performance retention of 87.2 percent across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0 percent with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid design can resolve the efficiency-fidelity dilemma and advance scalable VLMs.
[45] Residual-SwinCA-Net: A Channel-Aware Integrated Residual CNN-Swin Transformer for Malignant Lesion Segmentation in BUSI
Saeeda Naz,Saddam Hussain Khan
Main category: cs.CV
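TL;DR: Proposes Residual-SwinCA-Net, a new deep hybrid segmentation framework for breast lesion segmentation that combines residual CNNs with a modified Swin Transformer and adds several attention and boundary-refinement modules, achieving 99.29% accuracy, 98.74% IoU, and a 0.9041 Dice coefficient on the BUSI dataset, a marked improvement in diagnostic performance.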
Details
Motivation: Breast ultrasound images pose difficulties for lesion feature extraction, including blurred boundaries, noise, and multi-scale structural variation, so a robust segmentation model is needed that captures both locally correlated features and global dependencies.
Method: Residual-SwinCA-Net uses residual CNN modules to extract strong local features; customized Swin Transformer blocks with internal residual pathways to learn global dependencies and improve gradient stability; a Laplacian-of-Gaussian operator to suppress noise and enhance tissue continuity; a boundary-oriented operator to preserve the morphological integrity of malignant lesions; a stage-wise contraction strategy for scale invariance; an MSCAS module in the decoder for multi-scale channel attention and information squeezing; and a final Pixel-Attention module that adaptively emphasizes lesion pixels and suppresses background interference.
Result: On the public BUSI dataset, the framework reaches 99.29% mean accuracy, 98.74% intersection over union (IoU), and a 0.9041 Dice coefficient, outperforming existing CNN and ViT methods.
Conclusion: Residual-SwinCA-Net fuses local and global features and combines multi-scale attention with boundary refinement, markedly improving breast lesion segmentation accuracy and supporting more efficient and timely clinical decisions.
Abstract: A novel deep hybrid segmentation framework, Residual-SwinCA-Net, is proposed in this study to address the challenges of breast ultrasound lesion segmentation by extracting locally correlated and robust features with residual CNN modules. Furthermore, to learn global dependencies, Swin Transformer blocks are customized with internal residual pathways, which reinforce gradient stability, refine local patterns, and facilitate global feature fusion. In addition, a Laplacian-of-Gaussian regional operator is applied to enhance tissue continuity, suppress ultrasound noise, and accentuate fine structural transitions, and a boundary-oriented operator is incorporated to maintain the morphological integrity of malignant lesion contours. Subsequently, a stage-wise contraction strategy progressively reduces the feature maps to capture scale invariance and improve robustness to structural variability. In addition, each decoder level integrates, prior to augmentation, a new Multi-Scale Channel Attention and Squeezing (MSCAS) module. The MSCAS module selectively emphasizes salient encoder maps and retains discriminative global context and complementary local structures at minimal computational cost while suppressing redundant activations. Finally, the Pixel-Attention module encodes class-relevant spatial cues by adaptively weighting malignant lesion pixels while suppressing background interference.
Residual-SwinCA-Net and existing CNN and ViT techniques were implemented on the publicly available BUSI dataset. The proposed framework outperformed them, achieving 99.29% mean accuracy, 98.74% IoU, and 0.9041 Dice for breast lesion segmentation. It improves lesion diagnostic performance on BUSI and supports timely clinical decision-making.
[46] Distilling Future Temporal Knowledge with Masked Feature Reconstruction for 3D Object Detection
Haowen Zheng,Hu Zhu,Lu Deng,Weihao Gu,Yang Yang,Yanyan Liang
Main category: cs.CV
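TL;DR: Proposes Future Temporal Knowledge Distillation (FTKD), a sparse query-based method that uses future-aware feature reconstruction and future-guided logit distillation to transfer future-frame knowledge from an offline teacher model to an online student model, significantly improving 3D object detection without adding inference cost.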
Details
Motivation: Existing knowledge distillation methods ignore future-frame information, so online models struggle to learn the rich temporal knowledge that offline models gain from future frames, which limits performance gains.
Method: FTKD uses a sparse query mechanism, a future-aware feature reconstruction strategy that removes the strict frame alignment constraint, and future-guided logit distillation that exploits the teacher model's foreground and background context.
Result: On the nuScenes dataset, applied to two high-performing 3D detection baselines, FTKD yields gains of up to 1.3 mAP and 1.3 NDS together with the most accurate velocity estimation, without increasing inference cost.
Conclusion: FTKD transfers future temporal knowledge effectively and significantly improves online 3D object detection models, offering a new approach to efficient knowledge distillation for autonomous driving.
Abstract: Camera-based temporal 3D object detection has shown impressive results in autonomous driving, with offline models improving accuracy by using future frames. Knowledge distillation (KD) can be an appealing framework for transferring rich information from offline models to online models. However, existing KD methods overlook future frames, as they mainly focus on spatial feature distillation under strict frame alignment or on temporal relational distillation, thereby making it challenging for online models to effectively learn future knowledge. To this end, we propose a sparse query-based approach, Future Temporal Knowledge Distillation (FTKD), which effectively transfers future frame knowledge from an offline teacher model to an online student model. Specifically, we present a future-aware feature reconstruction strategy to encourage the student model to capture future features without strict frame alignment. In addition, we further introduce future-guided logit distillation to leverage the teacher's stable foreground and background context. FTKD is applied to two high-performing 3D object detection baselines, achieving up to 1.3 mAP and 1.3 NDS gains on the nuScenes dataset, as well as the most accurate velocity estimation, without increasing inference cost.
[47] Query-aware Hub Prototype Learning for Few-Shot 3D Point Cloud Semantic Segmentation
YiLin Zhou,Lili Wei,Zheming Xu,Ziyi Chen,Congyan Lang
Main category: cs.CV
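TL;DR: This paper proposes a new Query-aware Hub Prototype (QHP) learning method that tackles prototype bias in few-shot 3D point cloud semantic segmentation by modelling semantic correlations between the support and query sets, significantly improving segmentation performance.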
Details
Motivation: Existing metric-based prototype learning methods build prototypes from the support set alone, without considering their relevance to the query data, so prototypes overfit support-set characteristics and generalize poorly under distribution shift, which hurts segmentation.
Method: The Query-aware Hub Prototype (QHP) method includes a Hub Prototype Generation (HPG) module, which builds a bipartite graph between support and query points, identifies frequently linked support hubs, and generates query-relevant prototypes, and a Prototype Distribution Optimization (PDO) module, which uses a purity-reweighted contrastive loss to refine prototype representations by pulling bad hubs and outlier prototypes toward their class centers.
Result: Extensive experiments on S3DIS and ScanNet show that QHP significantly outperforms existing state-of-the-art methods on few-shot 3D point cloud semantic segmentation and narrows the semantic gap between prototypes and the query set.
Conclusion: By explicitly modelling semantic correlations between the support and query sets, QHP mitigates prototype bias and improves generalization under distribution shift as well as segmentation performance.
Abstract: Few-shot 3D point cloud semantic segmentation (FS-3DSeg) aims to segment novel classes with only a few labeled samples. However, existing metric-based prototype learning methods generate prototypes solely from the support set, without considering their relevance to query data. This often results in prototype bias, where prototypes overfit support-specific characteristics and fail to generalize to the query distribution, especially in the presence of distribution shifts, which leads to degraded segmentation performance. To address this issue, we propose a novel Query-aware Hub Prototype (QHP) learning method that explicitly models semantic correlations between support and query sets. Specifically, we propose a Hub Prototype Generation (HPG) module that constructs a bipartite graph connecting query and support points, identifies frequently linked support hubs, and generates query-relevant prototypes that better capture cross-set semantics. To further mitigate the influence of bad hubs and ambiguous prototypes near class boundaries, we introduce a Prototype Distribution Optimization (PDO) module, which employs a purity-reweighted contrastive loss to refine prototype representations by pulling bad hubs and outlier prototypes closer to their corresponding class centers. Extensive experiments on S3DIS and ScanNet demonstrate that QHP achieves substantial performance gains over state-of-the-art methods, effectively narrowing the semantic gap between prototypes and query sets in FS-3DSeg.
[48] SFP: Real-World Scene Recovery Using Spatial and Frequency Priors
Yun Liu,Tao Li,Cosmin Ancuti,Wenqi Ren,Weisi Lin
Main category: cs.CV
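TL;DR: This paper proposes Spatial and Frequency Priors (SFP) for real-world scene recovery, combining transmission map estimation in the spatial domain with adaptive enhancement in the frequency domain to handle multiple degradations, and using a weighted fusion strategy to improve the recovered result.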
Details
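Motivation: Existing methods usually rely on a single prior or on complex networks, generalize poorly across multiple real-world degradations, and, when trained on synthetic data, adapt badly to the diversity of real scenes, so more robust priors are needed.
Method: In the spatial domain, the projection of the inverse of the degraded image along its spectral direction is used to estimate the transmission map and handle scattering degradation. In the frequency domain, two new priors are proposed: first, that the mean of the DC components over three channels approximates that of the clear image; second, that low radial frequencies account for about 1% of the spectral magnitude. An adaptive frequency enhancement mask is built, and a weighted fusion strategy combines spatial restoration, frequency enhancement, and salient features of the input image.
Result: SFP shows strong recovery performance under a variety of degradation conditions, and extensive evaluation supports its effectiveness and superiority, particularly on real scenes where it outperforms existing methods.
Conclusion: Combining spatial and frequency priors improves the robustness and generalization of real-world scene recovery, and the proposed prior assumptions and fusion strategy offer a new approach to image restoration.
Abstract: Scene recovery serves as a critical task for various computer vision applications. Existing methods typically rely on a single prior, which is inherently insufficient to handle multiple degradations, or employ complex network architectures trained on synthetic data, which suffer from poor generalization for diverse real-world scenarios. In this paper, we propose Spatial and Frequency Priors (SFP) for real-world scene recovery. In the spatial domain, we observe that the inverse of the degraded image exhibits a projection along its spectral direction that resembles the scene transmission. Leveraging this spatial prior, the transmission map is estimated to recover the scene from scattering degradation. In the frequency domain, a mask is constructed for adaptive frequency enhancement, with two parameters estimated using our proposed novel priors. Specifically, one prior assumes that the mean intensity of the degraded image's direct current (DC) components across three channels in the frequency domain closely approximates that of each channel in the clear image. The second prior is based on the observation that, for clear images, the magnitude of low radial frequencies below 0.001 constitutes approximately 1% of the total spectrum. Finally, we design a weighted fusion strategy to integrate spatial-domain restoration, frequency-domain enhancement, and salient features from the input image, yielding the final recovered result.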
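The two frequency-domain quantities named in this abstract can be measured with a short NumPy sketch. The function names `dc_mean` and `low_frequency_share` are hypothetical, and the abstract does not say how radial frequency is normalised or whether the DC term is counted, so the sketch assumes the cycles-per-sample units returned by `np.fft.fftfreq` and includes the DC term. It only measures the quantities; it does not reproduce SFP's mask construction or parameter estimation.

```python
import numpy as np


def dc_mean(image):
    """Mean DC magnitude over channels for an (H, W, C) image."""
    spectrum = np.fft.fft2(image, axes=(0, 1))
    # The DC coefficient of each channel sits at index (0, 0) and equals the pixel sum.
    return float(np.mean(np.abs(spectrum[0, 0, :])))


def low_frequency_share(channel, cutoff):
    """Share of total spectral magnitude at radial frequencies below `cutoff`."""
    height, width = channel.shape
    magnitude = np.abs(np.fft.fft2(channel))
    # Frequency coordinates in cycles per sample (assumed normalisation).
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return float(magnitude[radius < cutoff].sum() / magnitude.sum())


# A constant image has all of its spectral magnitude in the DC term.
flat = np.full((8, 8, 3), 0.5)
flat_dc = dc_mean(flat)                               # pixel sum per channel: 8 * 8 * 0.5
flat_share = low_frequency_share(flat[..., 0], 0.001)
```

The constant image is a sanity check rather than a test of the prior: it confirms that the DC term is the per-channel pixel sum and that the share is computed relative to the whole spectrum.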
Extensive evaluations demonstrate the effectiveness and superiority of our proposed SFP for scene recovery under various degradation conditions.
[49] RLCNet: An end-to-end deep learning framework for simultaneous online calibration of LiDAR, RADAR, and Camera
Hafeez Husain Cholakkal,Stefano Arrigoni,Francesco Braghin
Main category: cs.CV
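TL;DR: This paper proposes RLCNet, an end-to-end trainable deep learning framework for joint online extrinsic calibration of LiDAR, RADAR, and camera sensors, offering high accuracy, strong robustness, and support for real-time operation.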
Details
Motivation: Mechanical vibration and sensor drift in dynamic environments make extrinsic calibration of multimodal sensors (LiDAR, RADAR, camera) challenging, and existing methods struggle to meet the accuracy and stability requirements of real deployments.
Method: RLCNet is an end-to-end trainable deep learning framework that, combined with a weighted moving average and outlier rejection, performs simultaneous online calibration of multiple sensors, reducing prediction noise and improving robustness to drift.
Result: RLCNet is validated on real-world datasets; experiments show strong calibration accuracy and robustness under varied conditions, ahead of existing methods.
Conclusion: RLCNet provides an efficient and reliable solution for online multi-sensor calibration, suited to applications such as autonomous driving that need stable long-term perception.
Abstract: Accurate extrinsic calibration of LiDAR, RADAR, and camera sensors is essential for reliable perception in autonomous vehicles. Still, it remains challenging due to factors such as mechanical vibrations and cumulative sensor drift in dynamic environments. This paper presents RLCNet, a novel end-to-end trainable deep learning framework for the simultaneous online calibration of these multimodal sensors. Validated on real-world datasets, RLCNet is designed for practical deployment and demonstrates robust performance under diverse conditions. To support real-time operation, an online calibration framework is introduced that incorporates a weighted moving average and outlier rejection, enabling dynamic adjustment of calibration parameters with reduced prediction noise and improved resilience to drift. An ablation study highlights the significance of architectural choices, while comparisons with existing methods demonstrate the superior accuracy and robustness of the proposed approach.
[50] EgoX: Egocentric Video Generation from a Single Exocentric Video
Taewoong Kang,Kinam Kim,Dohyeon Kim,Minho Park,Junha Hyung,Jaegul Choo
Main category: cs.CV
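TL;DR: Proposes EgoX, a framework that generates realistic egocentric videos from a single exocentric video through lightweight LoRA adaptation and a unified conditioning strategy.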
Details
Motivation: Converting exocentric (third-person) video into egocentric (first-person) video is challenging because camera poses differ greatly and the views overlap little; visible content must be preserved while unseen regions are synthesized plausibly.
Method: EgoX exploits the spatiotemporal priors of a large-scale video diffusion model through lightweight LoRA adaptation, fuses exocentric and egocentric priors via width-wise and channel-wise concatenation, and adds a geometry-guided self-attention mechanism to strengthen spatial consistency and visual fidelity.
Result: The method produces coherent, realistic egocentric videos on a variety of unseen, in-the-wild videos, showing good scalability and robustness.
Conclusion: EgoX effectively converts a single exocentric video into an egocentric video, with strong results in both geometric consistency and visual quality.
Abstract: Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatiotemporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width-wise and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.
[51] PAVAS: Physics-Aware Video-to-Audio Synthesis
Oh Hyun-Bin,Yuhta Takida,Toshimitsu Uesaka,Tae-Hyun Oh,Yuki Mitsufuji
Main category: cs.CV
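TL;DR: Proposes PAVAS, which improves the physical realism of video-to-audio generation by introducing physical parameters (such as mass and velocity), using a physics-driven audio adapter and a diffusion model to generate sounds that better match real-world physics.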
Details
Motivation: Existing video-to-audio generation models rely mostly on appearance features and ignore the physical factors behind real sound production, so the generated audio lacks physical plausibility.
Method: Designs the Physics-Driven Audio Adapter (Phy-Adapter) together with a Physical Parameter Estimator (PPE) that estimates the mass and trajectory velocity of moving objects; the physical parameters are obtained with a vision-language model (VLM) and segmentation-guided 3D reconstruction and are fed into a diffusion model to generate audio.
Result: On the newly built VGG-Impact benchmark, PAVAS outperforms existing methods in both physical realism and audio quality; the APCC metric is introduced to measure audio-physics consistency.
Conclusion: Physical priors effectively improve the physical credibility of video-to-audio generation, and PAVAS offers a more physically grounded paradigm for cross-modal generation.
Abstract: Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object-object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations. Visit https://physics-aware-video-to-audio-synthesis.github.io for demo videos.
[52] OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation
Yexin Liu,Manyuan Zhang,Yueze Wang,Hongyu Li,Dian Zheng,Weiming Zhang,Changsheng Lu,Xunliang Cai,Yan Feng,Peng Pei,Harry Yang
Main category: cs.CV
TL;DR: Introduces OpenSubject, a large-scale video-derived dataset for improving subject-driven image generation and editing, with the largest gains in complex scenes.
Details
Motivation: Existing subject-driven generation models often drift from the reference identity in complex multi-subject scenes and lack high-quality, large-scale data, so a more effective dataset is needed to improve consistency and identity fidelity.
Method: Proposes a four-stage construction pipeline: video curation, cross-frame subject mining and pairing, reference image synthesis guided by segmentation maps and bounding boxes, and VLM-driven verification and captioning. The resulting dataset contains 2.5M samples and 4.35M images, and a benchmark is designed to evaluate generation and editing performance.
Result: Training with OpenSubject improves subject-driven generation and editing, with gains in identity fidelity, prompt adherence, manipulation consistency, and background consistency.
Conclusion: OpenSubject provides effective data support for subject-driven visual generation in complex scenes, and its construction pipeline and benchmark can help advance the field.
Abstract: Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We utilize vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to generate input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We utilize a VLM to validate synthesized samples, re-synthesize failed samples based on stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
[53] Terrain Diffusion: A Diffusion-Based Successor to Perlin Noise in Infinite, Real-Time Terrain Generation
Alexander Goslin
Main category: cs.CV
TL;DR: Proposes Terrain Diffusion, a diffusion-based method for infinite terrain generation, positioned as an AI-era successor to Perlin noise.
Details
Motivation: Traditional procedural noise functions such as Perlin noise are limited in realism and large-scale coherence.
Method: Introduces the InfiniteDiffusion algorithm, which combines a hierarchical stack of diffusion models with a Laplacian encoding and supports infinite generation, seed-consistency, and constant-time random access.
Result: Achieves seamless, real-time, high-fidelity synthesis of infinite terrain at Earth scale, with large-range coherence and control over detail.
Conclusion: Diffusion models can serve as a practical foundation for procedural world generation, overcoming the limits of traditional noise functions.
Abstract: For decades, procedural worlds have been built on procedural noise functions such as Perlin noise, which are fast and infinite, yet fundamentally limited in realism and large-scale coherence. We introduce Terrain Diffusion, an AI-era successor to Perlin noise that bridges the fidelity of diffusion models with the properties that made procedural noise indispensable: seamless infinite extent, seed-consistency, and constant-time random access. At its core is InfiniteDiffusion, a novel algorithm for infinite generation, enabling seamless, real-time synthesis of boundless landscapes. A hierarchical stack of diffusion models couples planetary context with local detail, while a compact Laplacian encoding stabilizes outputs across Earth-scale dynamic ranges. An open-source infinite-tensor framework supports constant-memory manipulation of unbounded tensors, and few-step consistency distillation enables efficient generation. Together, these components establish diffusion models as a practical foundation for procedural world generation, capable of synthesizing entire planets coherently, controllably, and without limits.
[54] GeoDM: Geometry-aware Distribution Matching for Dataset Distillation
Xuhui Li,Zhengquan Luo,Zihui Cui,Zhiqiang Xu
Main category: cs.CV
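TL;DR: Proposes GeoDM, a geometry-aware dataset distillation framework that matches data distributions in the Cartesian product of Euclidean, hyperbolic, and spherical manifolds. Learnable curvature and weight parameters adapt to the intrinsic geometry of the data, and an optimal transport loss improves distribution fidelity; theory and experiments indicate that it outperforms existing methods.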
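To make the product-space idea in this summary concrete, here is a minimal sketch of a distance on a product of Euclidean, hyperbolic (Poincare ball), and spherical components. The weighted root-sum-of-squares combination, the function names, and the fixed `weights`, `c`, and `kappa` values are illustrative assumptions; the summary only states that curvature and weights are learnable, not how GeoDM combines them.

```python
import math

def euclidean_dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def poincare_dist(x, y, c=1.0):
    """Geodesic distance in the Poincare ball of curvature -c (needs c * ||x||^2 < 1)."""
    sq = lambda v: sum(a * a for a in v)
    diff = sq([a - b for a, b in zip(x, y)])
    arg = 1.0 + 2.0 * c * diff / ((1.0 - c * sq(x)) * (1.0 - c * sq(y)))
    return math.acosh(arg) / math.sqrt(c)

def spherical_dist(x, y, kappa=1.0):
    """Great-circle distance on a sphere of curvature kappa (radius 1/sqrt(kappa))."""
    radius = 1.0 / math.sqrt(kappa)
    cos_angle = sum(a * b for a, b in zip(x, y)) / (radius * radius)
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return radius * math.acos(max(-1.0, min(1.0, cos_angle)))

def product_dist(p, q, weights=(1.0, 1.0, 1.0), c=1.0, kappa=1.0):
    """p and q are (euclidean_part, hyperbolic_part, spherical_part) tuples.

    Assumption: per-geometry distances are combined as a weighted
    root-sum-of-squares; in a learned setting weights, c, and kappa
    would be trainable parameters.
    """
    d_e = euclidean_dist(p[0], q[0])
    d_h = poincare_dist(p[1], q[1], c)
    d_s = spherical_dist(p[2], q[2], kappa)
    w_e, w_h, w_s = weights
    return math.sqrt(w_e * d_e ** 2 + w_h * d_h ** 2 + w_s * d_s ** 2)
```

A distribution-matching loss could then compare real and distilled embeddings using `product_dist` in place of a purely Euclidean distance.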
Details
Motivation: Existing dataset distillation methods are confined to Euclidean space and cannot capture nonlinear structure (such as hierarchy and periodicity) in the latent manifolds of high-dimensional data; ignoring the data's intrinsic geometry limits distillation quality.
Method: Proposes GeoDM, which performs distribution matching in the product space of Euclidean, hyperbolic, and spherical manifolds; introduces learnable curvature and weight parameters to adapt to the data geometry; and designs an optimal-transport-based loss to improve distribution fidelity.
Result: Theoretical analysis shows a smaller generalization error bound on the product manifold; experiments on several standard benchmarks show better performance than current state-of-the-art dataset distillation methods, and the approach remains effective across distribution-matching strategies for the single geometries.
Conclusion: By jointly modeling several geometric structures, GeoDM distills data more effectively, confirming the value of accounting for intrinsic data geometry and suggesting a direction for future work on data compression and representation learning.
Abstract: Dataset distillation aims to synthesize a compact subset of the original data, enabling models trained on it to achieve performance comparable to those trained on the original large dataset. Existing distribution-matching methods are confined to Euclidean spaces, so they capture only linear structures and overlook the intrinsic geometry of real data, e.g., curvature. However, high-dimensional data often lie on low-dimensional manifolds, suggesting that dataset distillation should have the distilled data manifold aligned with the original data manifold. In this work, we propose a geometry-aware distribution-matching framework, called GeoDM, which operates in the Cartesian product of Euclidean, hyperbolic, and spherical manifolds, with flat, hierarchical, and cyclical structures all captured by a unified representation. To adapt to the underlying data geometry, we introduce learnable curvature and weight parameters for three kinds of geometries. At the same time, we design an optimal transport loss to enhance the distribution fidelity. Our theoretical analysis shows that the geometry-aware distribution matching in a product space yields a smaller generalization error bound than the Euclidean counterparts. Extensive experiments conducted on standard benchmarks demonstrate that our algorithm outperforms state-of-the-art data distillation methods and remains effective across various distribution-matching strategies for the single geometries.
[55] Detecting Dental Landmarks from Intraoral 3D Scans: the 3DTeethLand challenge
Achraf Ben-Hamadou,Nour Neifar,Ahmed Rekik,Oussama Smaoui,Firas Bouzguenda,Sergi Pujades,Niels van Nistelrooij,Shankeeth Vinayahalingam,Kaibo Shi,Hairong Jin,Youyi Zheng,Tibor Kubík,Oldřich Kodym,Petr Šilling,Kateřina Trávníčková,Tomáš Mojžiš,Jan Matula,Jeffry Hartanto,Xiaoying Zhu,Kim-Ngan Nguyen,Tudor Dascalu,Huikai Wu,Weijie Liu,Shaojie Zhuang,Guangshun Wei,Yuanfeng Zhou
Main category: cs.CV
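TL;DR: Presents the 3DTeethLand challenge held at MICCAI 2024, which aims to advance teeth landmark detection from intraoral 3D scans and released the first public dataset for 3D teeth landmark detection.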
Details
Motivation: Tooth geometry is complex and varies widely between individuals, so traditional methods struggle to detect 3D teeth landmarks precisely, while clinical orthodontics urgently needs accurate automated detection.
Method: Organizes the international 3DTeethLand challenge and provides a public dataset for 3D teeth landmark detection to encourage research on and application of advanced algorithms such as deep learning.
Result: Released the first public dataset for 3D teeth landmark detection and drew the research community's attention and methodological contributions to this clinically important problem.
Conclusion: The 3DTeethLand challenge provides an important platform for advancing teeth landmark detection and can help improve the accuracy of clinical diagnosis, personalized treatment, and treatment monitoring.
Abstract: Teeth landmark detection is a critical task in modern clinical orthodontics. Their precise identification enables advanced diagnostics, facilitates personalized treatment strategies, and supports more effective monitoring of treatment progress in clinical dentistry. However, several significant challenges may arise due to the intricate geometry of individual teeth and the substantial variations observed across different individuals. To address these complexities, the development of advanced techniques, especially through the application of deep learning, is essential for the precise and reliable detection of 3D tooth landmarks. In this context, the 3DTeethLand challenge was held in collaboration with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2024, calling for algorithms focused on teeth landmark detection from intraoral 3D scans. This challenge introduced the first publicly available dataset for 3D teeth landmark detection, offering a valuable resource to assess the state-of-the-art methods in this task and encourage the community to provide methodological contributions towards the resolution of this problem with significant clinical implications.
[56] GeoDiffMM: Geometry-Guided Conditional Diffusion for Motion Magnification
Xuedeng Liu,Jiabao Guo,Zheng Zhang,Fei Wang,Zhi Liu,Dan Guo
Main category: cs.CV
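TL;DR: Proposes GeoDiffMM, a diffusion-based, optical-flow-guided video motion magnification framework that achieves structurally consistent, high-quality magnification through noise-free optical flow augmentation and a geometry-aware diffusion network.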
Details
Motivation: Existing Eulerian methods struggle to separate photon noise from true micro-motion when motion is very small, so noise becomes prominent after magnification.
Method: Designs a noise-free optical flow augmentation strategy that generates noise-free non-rigid motion fields; builds a diffusion motion magnifier conditioned on an optical-flow geometry prior and a learnable magnification factor; and uses flow-based video synthesis to recover a high-fidelity magnified video.
Result: Clearly outperforms existing state-of-the-art methods on real and synthetic datasets, suppressing noise effectively and improving magnification quality.
Conclusion: By combining Lagrangian modeling with a diffusion architecture, GeoDiffMM advances structure preservation and noise suppression and offers a new approach to video motion magnification.
Abstract: Video Motion Magnification (VMM) amplifies subtle macroscopic motions to a perceptible level. Recently, existing mainstream Eulerian approaches address amplification-induced noise via decoupling representation learning such as texture, shape and frequency schemes, but they still struggle to separate photon noise from true micro-motion when motion displacements are very small. We propose GeoDiffMM, a novel diffusion-based Lagrangian VMM framework conditioned on optical flow as a geometric cue, enabling structurally consistent motion magnification. Specifically, we design a Noise-free Optical Flow Augmentation strategy that synthesizes diverse nonrigid motion fields without photon noise as supervision, helping the model learn more accurate geometry-aware optical flow and generalize better. Next, we develop a Diffusion Motion Magnifier that conditions the denoising process on (i) optical flow as a geometry prior and (ii) a learnable magnification factor controlling magnitude, thereby selectively amplifying motion components consistent with scene semantics and structure while suppressing content-irrelevant perturbations. Finally, we perform Flow-based Video Synthesis to map the amplified motion back to the image domain with high fidelity. Extensive experiments on real and synthetic datasets show that GeoDiffMM outperforms state-of-the-art methods and significantly improves motion magnification.
[57] Low Rank Support Quaternion Matrix Machine
Wang Chen,Ziyan Luo,Shuangyue Wang
Main category: cs.CV
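TL;DR: Proposes the quaternion-based Low-rank Support Quaternion Matrix Machine (LSQMM) for color image classification; preserving inter-channel coupling and adding quaternion nuclear norm regularization improves classification accuracy and efficiency.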
Details
Motivation: Conventional methods represent color image features as real-valued vectors, matrices, or tensors, which do not fully preserve the intrinsic coupling among RGB channels; inspired by the success of quaternions in image recovery and denoising, the authors aim to use quaternion algebra to improve classification.
Method: Treats the RGB channels as pure quaternions to build the Low-rank Support Quaternion Matrix Machine (LSQMM), adds a quaternion nuclear norm regularizer to promote the low-rank structure of strongly correlated channels, and solves the resulting quaternion optimization problem with an ADMM-based algorithm.
Result: Experiments on several color image classification datasets show that LSQMM outperforms existing support vector machine, support matrix machine, and support tensor machine methods in classification accuracy, robustness, and computational efficiency.
Conclusion: LSQMM makes effective use of the quaternion representation and low-rank regularization to model inter-channel relationships in color images, yielding an efficient and accurate classification method.
Abstract: Input features are conventionally represented as vectors, matrices, or third order tensors in the real field, for color image classification. Inspired by the success of quaternion data modeling for color images in image recovery and denoising tasks, we propose a novel method for color image classification, named the Low-rank Support Quaternion Matrix Machine (LSQMM), in which the RGB channels are treated as pure quaternions to effectively preserve the intrinsic coupling relationships among channels via the quaternion algebra. For the purpose of promoting low-rank structures resulting from strongly correlated color channels, a quaternion nuclear norm regularization term, serving as a natural extension of the conventional matrix nuclear norm to the quaternion domain, is added to the hinge loss in our LSQMM model. An Alternating Direction Method of Multipliers (ADMM)-based iterative algorithm is designed to effectively resolve the proposed quaternion optimization model. Experimental results on multiple color image classification datasets demonstrate that our proposed classification approach exhibits advantages in classification accuracy, robustness and computational efficiency, compared to several state-of-the-art methods using support vector machines, support matrix machines, and support tensor machines.
[58] Interpreting Structured Perturbations in Image Protection Methods for Diffusion Models
Michael R. Martin,Garrick Chan,Kwan-Liu Ma
Main category: cs.CV
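TL;DR: Uses an explainable AI framework to systematically analyze image protection mechanisms such as Glaze and Nightshade, showing that they act through structured, low-entropy perturbations tightly coupled to image content across the feature, spatial, and spectral domains rather than by inducing a global semantic shift.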
Details
Motivation: Image protection mechanisms such as Glaze and Nightshade are empirically effective, but the internal structure, detectability, and representational behavior of their perturbations remain unclear, and a deeper understanding is needed to guide future defense design.
Method: Uses a unified framework that combines white-box feature-space inspection with black-box signal-level probing, drawing on latent-space clustering, feature-channel activation analysis, occlusion-based spatial sensitivity mapping, and frequency-domain characterization.
Result: Protection mechanisms produce structured, low-entropy perturbations aligned with image content, preserving content-driven feature organization while adding protection-specific substructure; sequential protection amplifies rather than suppresses detectability; and spectral energy is redistributed along dominant frequency axes rather than diffused.
Conclusion: Current image protection mechanisms operate through structured feature-level deformation, which explains why they are visually subtle yet consistently detectable; the findings improve the interpretability of adversarial image protection and inform future protection and detection strategies for generative AI systems.
Abstract: Recent image protection mechanisms such as Glaze and Nightshade introduce imperceptible, adversarially designed perturbations intended to disrupt downstream text-to-image generative models. While their empirical effectiveness is known, the internal structure, detectability, and representational behavior of these perturbations remain poorly understood. This study provides a systematic, explainable AI analysis using a unified framework that integrates white-box feature-space inspection and black-box signal-level probing. Through latent-space clustering, feature-channel activation analysis, occlusion-based spatial sensitivity mapping, and frequency-domain characterization, we show that protection mechanisms operate as structured, low-entropy perturbations tightly coupled to underlying image content across representational, spatial, and spectral domains. Protected images preserve content-driven feature organization with protection-specific substructure rather than inducing global representational drift. Detectability is governed by interacting effects of perturbation entropy, spatial deployment, and frequency alignment, with sequential protection amplifying detectable structure rather than suppressing it. Frequency-domain analysis shows that Glaze and Nightshade redistribute energy along dominant image-aligned frequency axes rather than introducing diffuse noise. These findings indicate that contemporary image protection operates through structured feature-level deformation rather than semantic dislocation, explaining why protection signals remain visually subtle yet consistently detectable. This work advances the interpretability of adversarial image protection and informs the design of future defenses and detection strategies for generative AI systems.
[59] PointDico: Contrastive 3D Representation Learning Guided by Diffusion Models
Pengbo Li,Yiding Sun,Haozhe Cheng
Main category: cs.CV
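TL;DR: Proposes PointDico, a 3D representation learning method that combines diffusion models with contrastive learning, handles unordered point clouds better than existing methods, and reaches SOTA performance on several benchmarks.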
Details
Motivation: Existing self-supervised methods for 3D data either overfit or fail to handle unordered point clouds effectively, so a method that combines the strengths of generative models and contrastive learning is needed.
Method: Proposes PointDico, which uses knowledge distillation to let a diffusion model guide the contrastive model, employs a hierarchical pyramid conditional generator to extract multi-scale geometric features, and adopts a dual-channel design to fuse local and global context.
Result: PointDico reaches 94.32% accuracy on ScanObjectNN and 86.5% instance mIoU on ShapeNetPart, outperforming existing methods.
Conclusion: PointDico successfully combines the strengths of diffusion models and contrastive learning, improving 3D representation learning and suggesting a direction for future 3D self-supervised learning.
Abstract: Self-supervised representation learning has shown significant improvement in Natural Language Processing and 2D Computer Vision. However, existing methods face difficulties in representing 3D data because of its unordered and uneven density. Through an in-depth analysis of mainstream contrastive and generative approaches, we find that contrastive models tend to suffer from overfitting, while 3D Mask Autoencoders struggle to handle unordered point clouds. This motivates us to learn 3D representations by sharing the merits of diffusion and contrast models, which is non-trivial due to the pattern difference between the two paradigms. In this paper, we propose PointDico, a novel model that seamlessly integrates these methods. PointDico learns from both denoising generative modeling and cross-modal contrastive learning through knowledge distillation, where the diffusion model serves as a guide for the contrastive model. We introduce a hierarchical pyramid conditional generator for multi-scale geometric feature extraction and employ a dual-channel design to effectively integrate local and global contextual information. PointDico achieves a new state-of-the-art in 3D representation learning, e.g., 94.32% accuracy on ScanObjectNN, 86.5% Inst. mIoU on ShapeNetPart.
[60] Bi^2MAC: Bimodal Bi-Adaptive Mask-Aware Convolution for Remote Sensing Pansharpening
Xianghong Xiao,Zeyu Xia,Zhou Fei,Jinliang Xiao,Haorui Chen,Liangjian Deng
Main category: cs.CV
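TL;DR: Proposes Bimodal Bi-Adaptive Mask-Aware Convolution (Bi^2MAC) to model regional heterogeneity in remote sensing pansharpening, achieving state-of-the-art performance on multiple benchmark datasets while substantially reducing computational cost and parameter count.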
Details
Motivation: Conventional deep learning methods are limited in handling regional heterogeneity in feature representations, and existing adaptive convolution methods are computationally expensive and capture heterogeneous regions poorly.
Method: Designs a lightweight module that generates soft and hard masks; the soft mask preliminarily modulates the input features, and the hard mask routes different region types into dedicated branches: redundant (homogeneous) regions go to a compact branch for low-cost processing, and heterogeneous regions go to a focused branch for fine-grained modeling.
Result: Experiments on multiple benchmark datasets show that Bi^2MAC reaches SOTA performance with lower training time, parameter count, and computational cost.
Conclusion: By allocating computation intelligently, Bi^2MAC balances heterogeneous-region modeling against computational efficiency and offers a new approach to efficient pansharpening.
Abstract: Pansharpening aims to fuse a high-resolution panchromatic (PAN) image with a low-resolution multispectral (LRMS) image to generate a high-resolution multispectral image (HRMS). Conventional deep learning-based methods are inherently limited in their ability to adapt to regional heterogeneity within feature representations. Although various adaptive convolution methods have been proposed to address this limitation, they often suffer from excessive computational costs and a limited ability to capture heterogeneous regions in remote sensing images effectively. To overcome these challenges, we propose Bimodal Bi-Adaptive Mask-Aware Convolution (Bi^2MAC), which effectively exploits information from different types of regions while intelligently allocating computational resources. Specifically, we design a lightweight module to generate both soft and hard masks, which are used to modulate the input features preliminarily and to guide different types of regions into separate processing branches, respectively. Redundant features are directed to a compact branch for low-cost global processing. In contrast, heterogeneous features are routed to a focused branch that invests more computational resources for fine-grained modeling. Extensive experiments on multiple benchmark datasets demonstrate that Bi^2MAC achieves state-of-the-art (SOTA) performance while requiring substantially lower training time and parameter counts, and the minimal computational cost among adaptive convolution models.
[61] HybridSplat: Fast Reflection-baked Gaussian Tracing using Hybrid Splatting
Chang Liu,Hongliang Yuan,Lianghao Zhang,Sichao Wang,Jianwei Guo,Shi-Sheng Huang
Main category: cs.CV
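TL;DR: Proposes HybridSplat, a hybrid splatting mechanism for efficient, high-quality rendering of complex reflective scenes with 3D Gaussian splatting. Reflection-baked Gaussian tracing and pipeline-level acceleration substantially increase rendering speed and reduce memory use.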
Details
Motivation: Existing 3D Gaussian splatting methods render complex reflective scenes slowly and with high storage overhead, making it hard to balance realism and efficiency, so a new mechanism is needed to optimize reflection modeling and rendering.
Method: Proposes reflection-baked Gaussian tracing, which embeds view-dependent reflection in each Gaussian primitive and renders it with tile-based splatting; designs a unified hybrid splatting framework that combines reflective and base Gaussian primitives; and introduces pipeline-level acceleration and reflection-sensitive Gaussian pruning to improve efficiency.
Result: On complex reflective scenes from Ref-NeRF and NeRF-Casting, HybridSplat renders about 7x faster than ray-tracing-based Gaussian splatting baselines with 4x fewer Gaussian primitives, lowering memory use while preserving reflection quality.
Conclusion: HybridSplat provides a new SOTA approach to high-fidelity view synthesis of complex reflective scenes, with a strong balance of rendering speed, memory efficiency, and visual quality.
Abstract: Rendering complex reflection of real-world scenes using 3D Gaussian splatting has been a quite promising solution for photorealistic novel view synthesis, but still faces bottlenecks especially in rendering speed and memory storage. This paper proposes a new Hybrid Splatting (HybridSplat) mechanism for Gaussian primitives. Our key idea is a new reflection-baked Gaussian tracing, which bakes the view-dependent reflection within each Gaussian primitive while rendering the reflection using tile-based Gaussian splatting. Then we integrate the reflective Gaussian primitives with base Gaussian primitives using a unified hybrid splatting framework for high-fidelity scene reconstruction. Moreover, we further introduce a pipeline-level acceleration for the hybrid splatting, and reflection-sensitive Gaussian pruning to reduce the model size, thus achieving much faster rendering speed and lower memory storage while preserving the reflection rendering quality. By extensive evaluation, our HybridSplat accelerates about 7x rendering speed across complex reflective scenes from Ref-NeRF, NeRF-Casting with 4x fewer Gaussian primitives than similar ray-tracing based Gaussian splatting baselines, serving as a new state-of-the-art method especially for complex reflective scenes.
[62] DINO-BOLDNet: A DINOv3-Guided Multi-Slice Attention Network for T1-to-BOLD Generation
Jianwei Wang,Qing Wang,Menglan Ruan,Rongjun Ge,Chunfeng Yang,Yang Chen,Chunming Xie
Main category: cs.CV
TL;DR: Proposes DINO-BOLDNet, the first framework to generate BOLD images directly from T1w images, using DINOv3-guided multi-slice attention to map structure to function.
Details
Motivation: When BOLD images are missing or corrupted, generating them can recover functional information and support downstream tasks.
Method: Combines a frozen self-supervised DINOv3 encoder with a lightweight trainable decoder: DINOv3 extracts within-slice structural features, a slice-attention module fuses context from neighboring slices, a multi-scale decoder restores functional contrast, and a DINO-based perceptual loss encourages structural and textural consistency.
Result: On a clinical dataset of 248 subjects, DINO-BOLDNet outperforms a conditional GAN baseline in both PSNR and MS-SSIM.
Conclusion: DINO-BOLDNet is the first framework that generates mean BOLD images directly from T1w images, showing the potential of self-supervised Transformers for structure-to-function mapping.
Abstract: Generating BOLD images from T1w images offers a promising solution for recovering missing BOLD information and enabling downstream tasks when BOLD images are corrupted or unavailable. Motivated by this, we propose DINO-BOLDNet, a DINOv3-guided multi-slice attention framework that integrates a frozen self-supervised DINOv3 encoder with a lightweight trainable decoder. The model uses DINOv3 to extract within-slice structural representations, and a separate slice-attention module to fuse contextual information across neighboring slices. A multi-scale generation decoder then restores fine-grained functional contrast, while a DINO-based perceptual loss encourages structural and textural consistency between predictions and ground-truth BOLD in the transformer feature space. Experiments on a clinical dataset of 248 subjects show that DINO-BOLDNet surpasses a conditional GAN baseline in both PSNR and MS-SSIM. To our knowledge, this is the first framework capable of generating mean BOLD images directly from T1w images, highlighting the potential of self-supervised transformer guidance for structural-to-functional mapping.
[63] TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels
Jiahao Lu,Weitao Xiong,Jiacheng Deng,Peng Li,Tianyu Huang,Zhiyang Dou,Cheng Lin,Sai-Kit Yeung,Yuan Liu
Main category: cs.CV
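TL;DR: Proposes TrackingWorld, a method for dense 3D tracking of almost all pixels in monocular video, addressing existing methods' weaknesses in separating camera motion from foreground dynamic motion and in tracking newly emerging objects.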
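The world-centric step mentioned in this summary rests on standard pinhole back-projection. The sketch below is not the paper's optimization framework; it only shows how a tracked pixel with a depth value and a camera-to-world pose maps to a world point. The function name, the intrinsics `fx, fy, cx, cy`, and the pose convention `(R, t)` are assumptions made for illustration.

```python
def backproject_to_world(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) with known depth into world coordinates.

    Assumes a pinhole camera and a camera-to-world pose:
    X_world = R @ X_cam + t, with R a 3x3 nested list and t a 3-vector.
    """
    x_cam = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    return [sum(R[r][c] * x_cam[c] for c in range(3)) + t[r] for r in range(3)]

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A static world point at (0, 0, 2) seen from two camera positions.
# Frame A: camera at the origin, so the point projects to the principal point.
point_a = backproject_to_world(50.0, 50.0, 2.0, 100.0, 100.0, 50.0, 50.0,
                               IDENTITY, [0.0, 0.0, 0.0])
# Frame B: camera shifted by +1 along x, so the same point appears at u = 0.
point_b = backproject_to_world(0.0, 50.0, 2.0, 100.0, 100.0, 50.0, 50.0,
                               IDENTITY, [1.0, 0.0, 0.0])
```

Both frames recover the same world coordinate, so the pixel's apparent 2D motion is attributed entirely to the camera; a dynamic object would instead yield different world points over time.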
Details
Motivation: Existing monocular 3D tracking methods struggle to separate camera motion from foreground dynamic motion and cannot densely track dynamic subjects that newly appear in a video.
Method: Proposes TrackingWorld, which includes a tracking upsampler that lifts sparse 2D tracks into dense 2D tracks, generalizes to new objects by applying the upsampler to all frames and removing redundant tracks in overlapping regions, and uses an optimization-based framework with camera pose estimation to back-project 2D tracks into a world-centric 3D coordinate system.
Result: Experiments on synthetic and real-world datasets show accurate and dense world-centric 3D tracking.
Conclusion: TrackingWorld overcomes the limitations of existing methods and achieves high-quality, dense, long-term 3D pixel tracking across varied scenes.
Abstract: Monocular 3D tracking aims to capture the long-term motion of pixels in 3D space from a single monocular video and has witnessed rapid progress in recent years. However, we argue that the existing monocular 3D tracking methods still fall short in separating the camera motion from foreground dynamic motion and cannot densely track newly emerging dynamic subjects in the videos. To address these two limitations, we propose TrackingWorld, a novel pipeline for dense 3D tracking of almost all pixels within a world-centric 3D coordinate system. First, we introduce a tracking upsampler that efficiently lifts the arbitrary sparse 2D tracks into dense 2D tracks. Then, to generalize the current tracking methods to newly emerging objects, we apply the upsampler to all frames and reduce the redundancy of 2D tracks by eliminating the tracks in overlapped regions. Finally, we present an efficient optimization-based framework to back-project dense 2D tracks into world-centric 3D trajectories by estimating the camera poses and the 3D coordinates of these 2D tracks. Extensive evaluations on both synthetic and real-world datasets demonstrate that our system achieves accurate and dense 3D tracking in a world-centric coordinate frame.
[64] SCU-CGAN: Enhancing Fire Detection through Synthetic Fire Image Generation and Dataset Augmentation
Ju-Young Kim,Ji-Hong Park,Gun-Woo Kim
Main category: cs.CV
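TL;DR: Proposes SCU-CGAN, which combines U-Net, CBAM, and an additional discriminator to generate realistic fire images from non-fire images, markedly improving household fire detection performance.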
Details
Motivation: The lack of sufficient fire datasets limits the performance of fire detection models and calls for data augmentation.
Method: Proposes SCU-CGAN, which integrates U-Net, CBAM, and an additional discriminator to convert non-fire images into realistic fire images that are then used to augment the training data.
Result: SCU-CGAN improves the KID score by 41.5% over CycleGAN, and the mAP@0.5:0.95 of YOLOv5 nano increases by 56.5%.
Conclusion: SCU-CGAN generates high-quality fire images and markedly improves fire detection accuracy, making it suitable for early fire detection in home IoT environments.
Abstract: Fire has long been linked to human life, causing severe disasters and losses. Early detection is crucial, and with the rise of home IoT technologies, household fire detection systems have emerged. However, the lack of sufficient fire datasets limits the performance of detection models. We propose the SCU-CGAN model, which integrates U-Net, CBAM, and an additional discriminator to generate realistic fire images from non-fire images. We evaluate the image quality and confirm that SCU-CGAN outperforms existing models. Specifically, SCU-CGAN achieved a 41.5% improvement in KID score compared to CycleGAN, demonstrating the superior quality of the generated fire images. Furthermore, experiments demonstrate that the augmented dataset significantly improves the accuracy of fire detection models without altering their structure. For the YOLOv5 nano model, the most notable improvement was observed in the mAP@0.5:0.95 metric, which increased by 56.5%, highlighting the effectiveness of the proposed approach.
[65] The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss
Bozhou Li,Xinda Xue,Sihan Yang,Yang Shi,Xinlong Chen,Yushuo Guan,Yuanxing Zhang,Wentao Zhang
Main category: cs.CV
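TL;DR: Finds that a norm disparity between visual tokens and text tokens in multimodal large language models (MLLMs) causes an "asymmetric update dynamic" that impairs cross-modal fusion; proposes adding a carefully initialized LayerNorm layer after the visual projector to align norms, which significantly improves performance.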
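The norm-alignment idea in this summary can be illustrated with a plain LayerNorm applied to a single token vector. This is a toy sketch, not the paper's implementation: the token values are made up, and setting the gain so that the output norm matches the text-token norm is an assumed reading of "carefully initialized".

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def layer_norm(v, gain=1.0, bias=0.0, eps=1e-5):
    """Standard LayerNorm over the feature dimension of one token."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [gain * (x - mean) / math.sqrt(var + eps) + bias for x in v]

# Hypothetical toy tokens: a high-norm "visual" token and a low-norm "text" token.
visual_token = [40.0, -35.0, 52.0, -48.0]
text_token = [0.4, -0.3, 0.5, -0.6]

# With zero bias, the LayerNorm output has L2 norm of about gain * sqrt(d),
# so this gain (an assumed initialization) targets the text-token scale.
target_norm = l2_norm(text_token)
gain = target_norm / math.sqrt(len(visual_token))
aligned_visual = layer_norm(visual_token, gain=gain)
```

Before alignment the visual token's norm is roughly two orders of magnitude larger than the text token's; after the LayerNorm the two norms agree to within numerical tolerance.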
Details
Motivation: Addresses the norm imbalance between visual and text tokens caused by the Pre-Norm architecture in MLLMs, which makes cross-modal representation learning inefficient.
Method: Uses theoretical analysis to show that the norm disparity induces an asymmetric update dynamic, and proposes inserting a single LayerNorm layer after the visual projector to align norms.
Result: The phenomenon is confirmed across mainstream MLLMs, and on LLaVA-1.5 the proposed method yields significant gains on multiple multimodal benchmarks as well as text-only benchmarks such as MMLU.
Conclusion: Norm alignment is key to improving cross-modal fusion in MLLMs, and a simple architectural change can bring broad and significant performance gains.
Abstract: Multimodal Large Language Models (MLLMs), which couple pre-trained vision encoders and language models, have shown remarkable capabilities. However, their reliance on the ubiquitous Pre-Norm architecture introduces a subtle yet critical flaw: a severe norm disparity between the high-norm visual tokens and the low-norm text tokens. In this work, we present a formal theoretical analysis demonstrating that this imbalance is not a static issue. Instead, it induces an "asymmetric update dynamic," where high-norm visual tokens exhibit a "representational inertia," causing them to transform semantically much slower than their textual counterparts. This fundamentally impairs effective cross-modal feature fusion. Our empirical validation across a range of mainstream MLLMs confirms that this theoretical dynamic -- the persistence of norm disparity and the resulting asymmetric update rates -- is a prevalent phenomenon. Based on this insight, we propose a remarkably simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment. Experiments conducted on the LLaVA-1.5 architecture show that this intervention yields significant performance gains not only on a wide suite of multimodal benchmarks but also, notably, on text-only evaluations such as MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.
[66] Simultaneous Enhancement and Noise Suppression under Complex Illumination Conditions
Jing Tao,You Li,Banglei Guan,Yang Shang,Qifeng Yu
Main category: cs.CV
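TL;DR: Proposes a framework for simultaneous image enhancement and noise suppression under complex illumination, combining gradient-domain weighted guided filtering, Retinex decomposition, and multi-exposure fusion to markedly improve low-light image quality.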
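The Retinex step named in this summary models an image as illumination times reflectance. The sketch below works on a single row of intensities in [0, 1] and is heavily simplified: a box filter stands in for the paper's gradient-domain weighted guided filter, gamma correction stands in for its illumination correction, and multi-exposure fusion is omitted. All function names and parameter values are illustrative assumptions.

```python
def box_smooth(row, radius=1):
    """Crude illumination estimate: local mean (stand-in for GDWGIF)."""
    out = []
    for i in range(len(row)):
        lo, hi = max(0, i - radius), min(len(row), i + radius + 1)
        out.append(sum(row[lo:hi]) / (hi - lo))
    return out

def retinex_enhance(row, gamma=0.5, eps=1e-6):
    """Decompose I = L * R, brighten L with gamma < 1, then recombine."""
    illum = box_smooth(row)
    reflect = [p / (l + eps) for p, l in zip(row, illum)]  # R = I / L
    illum_corrected = [l ** gamma for l in illum]          # lifts dark illumination
    return [r * l for r, l in zip(reflect, illum_corrected)]

def linear_stretch(row):
    """Map the intensity range of the row linearly onto [0, 1]."""
    lo, hi = min(row), max(row)
    if hi == lo:
        return [0.0 for _ in row]
    return [(p - lo) / (hi - lo) for p in row]

dark_row = [0.02, 0.04, 0.03, 0.05, 0.04]
enhanced = retinex_enhance(dark_row)
stretched = linear_stretch(enhanced)
```

Because only the illumination layer is brightened, local contrast carried by the reflectance layer is kept; denoising that layer separately is where noise suppression would enter.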
Details
Motivation: Existing image enhancement methods either amplify noise in low light or work only under specific illumination, making it hard to balance enhancement and noise suppression.
Method: Uses a gradient-domain weighted guided filter (GDWGIF) to estimate illumination and improve quality, applies the Retinex model to decompose the image into illumination and reflection layers that are processed separately, and finally optimizes the dynamic range through multi-exposure fusion and linear stretching.
Result: Evaluated on real-world datasets, the method outperforms existing state-of-the-art methods in both contrast enhancement and noise suppression.
Conclusion: The proposed framework handles image degradation under complex illumination effectively, improving image quality while suppressing noise, and is practical to apply.
Abstract: Under challenging light conditions, captured images often suffer from various degradations, leading to a decline in the performance of vision-based applications. Although numerous methods have been proposed to enhance image quality, they either significantly amplify inherent noise or are only effective under specific illumination conditions. To address these issues, we propose a novel framework for simultaneous enhancement and noise suppression under complex illumination conditions. Firstly, a gradient-domain weighted guided filter (GDWGIF) is employed to accurately estimate illumination and improve image quality. Next, the Retinex model is applied to decompose the captured image into separate illumination and reflection layers. These layers undergo parallel processing, with the illumination layer being corrected to optimize lighting conditions and the reflection layer enhanced to improve image quality. Finally, the dynamic range of the image is optimized through multi-exposure fusion and a linear stretching strategy. The proposed method is evaluated on real-world datasets obtained from practical applications. Experimental results demonstrate that our proposed method achieves better performance compared to state-of-the-art methods in both contrast enhancement and noise suppression.
[67] Detection of Digital Facial Retouching utilizing Face Beauty Information
Philipp Srock,Juan E. Tapia,Christoph Busch
Main category: cs.CV
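TL;DR: Studies the impact of facial retouching on face recognition systems and proposes using beauty assessment algorithms and AI-based feature extraction to improve retouching detection accuracy; when the attacking retouching algorithm is unknown, single-image detection reaches a D-EER of 1.1%.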
Details
Motivation: Retouched images degrade face recognition performance when they are used as biometric samples, so facial retouching needs to be detected effectively.
Method: Analyzes how beauty assessment algorithms respond to retouched images, evaluates different AI-based feature extraction methods, and examines whether face beauty information helps improve retouching detection.
Result: In a scenario where the attacking retouching algorithm is unknown, the approach achieves a D-EER (detection equal error rate) of 1.1% on single-image detection.
Conclusion: Detecting facial retouching is essential, and combining beauty assessment with AI-driven feature extraction improves detection performance, particularly in the single-image setting.
Abstract: Facial retouching to beautify images is widespread in social media and advertisements, and it is even applied in professional photo studios to let individuals appear younger, remove wrinkles and skin impurities. Generally speaking, this is done to enhance beauty. This is not a problem itself, but when retouched images are used as biometric samples and enrolled in a biometric system, it is one. Since previous work has proven facial retouching to be a challenge for face recognition systems, the detection of facial retouching becomes increasingly necessary. This work proposes to study and analyze changes in beauty assessment algorithms of retouched images, assesses different feature extraction methods based on artificial intelligence in order to improve retouching detection, and evaluates whether face beauty can be exploited to enhance the detection rate. In a scenario where the attacking retouching algorithm is unknown, this work achieved 1.1% D-EER on single image detection.
[68] Towards Visual Re-Identification of Fish using Fine-Grained Classification for Electronic Monitoring in Fisheries
Samitha Nuwan Thilakarathna,Ercan Avsar,Martin Mathias Nielsen,Malte Pedersen
Main category: cs.CV
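TL;DR: This paper proposes a deep-learning pipeline for fish re-identification that uses the AutoFish dataset and a Swin-T model to achieve strong performance on within-species re-identification; the key improvements are hard triplet mining and dataset-specific normalization.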
Details
Motivation: Electronic monitoring systems generate more video than can be analyzed manually, so automated fish re-identification is needed to support fisheries resource management.
Method: A deep learning approach that combines hard triplet mining with a custom image transformation pipeline (including dataset-specific normalization) and compares Swin-T with ResNet-50 on the AutoFish dataset.
Result: Swin-T outperforms ResNet-50, reaching 41.65% mAP@k and 90.43% Rank-1 accuracy; viewpoint inconsistency hurts recognition more than partial occlusion.
Conclusion: Transformer-based models with an optimized training strategy improve fish re-identification accuracy, especially when distinguishing visually similar species, and show potential for real-world fisheries monitoring.
Abstract: Accurate fisheries data are crucial for effective and sustainable marine resource management. With the recent adoption of Electronic Monitoring (EM) systems, more video data is now being collected than can be feasibly reviewed manually. This paper addresses this challenge by developing an optimized deep learning pipeline for automated fish re-identification (Re-ID) using the novel AutoFish dataset, which simulates EM systems with conveyor belts with six similar-looking fish species. We demonstrate that key Re-ID metrics (R1 and mAP@k) are substantially improved by using hard triplet mining in conjunction with a custom image transformation pipeline that includes dataset-specific normalization. By employing these strategies, we demonstrate that the Vision Transformer-based Swin-T architecture consistently outperforms the Convolutional Neural Network-based ResNet-50, achieving peak performance of 41.65% mAP@k and 90.43% Rank-1 accuracy. An in-depth analysis reveals that the primary challenge is distinguishing visually similar individuals of the same species (Intra-species errors), where viewpoint inconsistency proves significantly more detrimental than partial occlusion. The source code and documentation are available at: https://github.com/msamdk/Fish_Re_Identification.git
[69] SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos
Mingqi Gao,Yunqi Miao,Jungong Han
Main category: cs.CV
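TL;DR: Proposes SAM-Body4D, a training-free framework for recovering 3D human pose and shape from videos that exploits human continuity across frames to achieve temporal consistency and robustness to occlusion.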
Details
Motivation: Existing image-based HMR methods run per-frame inference on videos, which causes temporal inconsistency and degraded performance under occlusion; exploiting human continuity in videos can improve results.
Method: A promptable video segmentation model generates identity-consistent masklets, which an occlusion-aware module refines; the refined masklets guide SAM 3D Body to produce coherent full-body mesh trajectories, and a padding-based parallel strategy enables efficient multi-human inference.
Result: Experiments show that SAM-Body4D markedly improves temporal stability and occlusion robustness on challenging in-the-wild videos, without retraining.
Conclusion: SAM-Body4D offers an efficient, training-free solution for human mesh recovery from videos that strengthens robustness to occlusion while preserving identity consistency.
Abstract: Human Mesh Recovery (HMR) aims to reconstruct 3D human pose and shape from 2D observations and is fundamental to human-centric understanding in real-world scenarios. While recent image-based HMR methods such as SAM 3D Body achieve strong robustness on in-the-wild images, they rely on per-frame inference when applied to videos, leading to temporal inconsistency and degraded performance under occlusions. We address these issues without extra training by leveraging the inherent human continuity in videos. We propose SAM-Body4D, a training-free framework for temporally consistent and occlusion-robust HMR from videos. We first generate identity-consistent masklets using a promptable video segmentation model, then refine them with an Occlusion-Aware module to recover missing regions. The refined masklets guide SAM 3D Body to produce consistent full-body mesh trajectories, while a padding-based parallel strategy enables efficient multi-human inference. Experimental results demonstrate that SAM-Body4D achieves improved temporal stability and robustness in challenging in-the-wild videos, without any retraining. Our code and demo are available at: https://github.com/gaomingqi/sam-body4d.
[70] Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
Tao Chen,Shaobo Ju,Qiong Wu,Chenxin Fang,Kun Zhang,Jun Peng,Hui Li,Yiyi Zhou,Rongrong Ji
Main category: cs.CV
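TL;DR: This paper proposes OneClip-RAG, an effective and efficient video understanding paradigm that uses one-shot clip retrieval augmentation and a query-guided video chunking algorithm to significantly improve both the performance and the efficiency of multimodal large language models on long videos.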
Details
Motivation: Because of excessive memory overhead, most existing multimodal large language models (MLLMs) can only process videos with a limited number of frames and struggle to understand long videos.
Method: Proposes One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG), which exploits the knowledge integrity and semantic coherence of video clips for augmented understanding, and designs a query-guided video chunking algorithm that unifies clip chunking and cross-modal retrieval in a single step. It also builds the SynLongVideo dataset and adopts a progressive training regime to improve instruction following.
Result: Plugged into five mainstream MLLMs, OneClip-RAG performs strongly on several long-video benchmarks, for example lifting InternLV2 8B and Qwen2-VL 7B to GPT-4o level, and it can process an hour-long video in under 2.2 minutes on a single 4090 GPU.
Conclusion: OneClip-RAG eases the memory and efficiency bottlenecks that MLLMs face on long videos, combining strong performance with high efficiency, and is general and practical.
Abstract: Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos of limited frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. Besides, it is also equipped with a novel query-guided video chunking algorithm that can unify clip chunking and cross-modal retrieval in one processing step, avoiding redundant computations. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into five recent MLLMs and validated on a set of long-video benchmarks. Experimental results not only show the obvious performance gains by OneClip-RAG over MLLMs, e.g., boosting InternLV2 8B and Qwen2-VL 7B to the level of GPT-4o on MLVU, but also show its superior efficiency in handling long videos, e.g., enabling LLaVA-Video to understand up to an hour of video in less than 2.2 minutes on a single 4090 GPU.
[71] SDT-6D: Fully Sparse Depth-Transformer for Staged End-to-End 6D Pose Estimation in Industrial Multi-View Bin Picking
Nico Leuze,Maximilian Hoh,Samed Doğan,Nicolas R. -Peña,Alfred Schoettl
Main category: cs.CV
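TL;DR: Proposes a fully sparse, depth-image-based 6D pose estimation framework that uses multi-view fusion and a density-aware sparse transformer to achieve accurate and efficient multi-object pose estimation in heavily occluded industrial bin-picking scenes.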
Details
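Motivation: Accurately recovering 6D object poses remains very challenging in industrial picking environments with dense stacking, occlusions, reflections, and textureless parts, and existing methods struggle to combine high-resolution detail with computational efficiency.
Method: Proposes a depth-only, fully sparse 6D pose estimation method that fuses multi-view depth maps into a fine-grained point cloud or a sparse TSDF; a staged heatmap mechanism produces scene-adaptive attention priors at multiple resolutions to focus on foreground regions; a density-aware sparse transformer block dynamically handles self-occlusion and the non-uniform distribution of 3D data; and a per-voxel voting strategy predicts poses jointly for the whole scene.
Result: Validated on the IPD and MV-YCB multi-view datasets, the method is competitive in heavily cluttered industrial and household bin-picking scenes, handles high-resolution volumetric representations efficiently, captures fine geometric detail, and supports simultaneous pose estimation for an arbitrary number of objects.
Conclusion: The fully sparse framework addresses 6D pose estimation in dense environments by combining attention mechanisms with sparse representations, improving accuracy while staying memory-efficient, and advances the use of sparse 3D methods in close-range robotic manipulation.
Abstract: Accurately recovering 6D poses in densely packed industrial bin-picking environments remains a serious challenge, owing to occlusions, reflections, and textureless parts. We introduce a holistic depth-only 6D pose estimation approach that fuses multi-view depth maps into either a fine-grained 3D point cloud in its vanilla version, or a sparse Truncated Signed Distance Field (TSDF). At the core of our framework lies a staged heatmap mechanism that yields scene-adaptive attention priors across different resolutions, steering computation toward foreground regions, thus keeping memory requirements at high resolutions feasible. Alongside, we propose a density-aware sparse transformer block that dynamically attends to (self-) occlusions and the non-uniform distribution of 3D data. While sparse 3D approaches have proven effective for long-range perception, their potential in close-range robotic applications remains underexplored. Our framework operates fully sparse, enabling high-resolution volumetric representations to capture fine geometric details crucial for accurate pose estimation in clutter. Our method processes the entire scene integrally, predicting the 6D pose via a novel per-voxel voting strategy, allowing simultaneous pose predictions for an arbitrary number of target objects.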
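As a rough illustration of the multi-view depth fusion mentioned in this abstract, the sketch below back-projects each depth map through a pinhole camera model, moves the points into a shared world frame, and records which voxels are occupied. It is not the authors' pipeline: the function names, the 3x4 camera-to-world pose layout, and the plain occupancy set (used here instead of a TSDF) are all assumptions made for the example.

```python
import math

def backproject_depth(depth, fx, fy, cx, cy, cam_to_world):
    """Back-project one depth map into world-frame 3D points.

    depth: 2D list of metric depths, where values <= 0 mark invalid pixels.
    cam_to_world: 3x4 row-major [R|t] pose (hypothetical layout).
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels without a depth measurement
            # Pinhole model: pixel (u, v) at depth z -> camera-frame point.
            p = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            # Rigid transform into the shared world frame.
            points.append(tuple(
                sum(r[i] * p[i] for i in range(3)) + r[3]
                for r in cam_to_world
            ))
    return points

def fuse_views(views):
    """Concatenate the back-projected points of several views."""
    cloud = []
    for depth, fx, fy, cx, cy, pose in views:
        cloud.extend(backproject_depth(depth, fx, fy, cx, cy, pose))
    return cloud

def occupied_voxels(points, voxel_size):
    """Sparse occupancy: store only the indices of voxels that hold points."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}
```

Storing only occupied voxel indices is what keeps memory proportional to the observed surface rather than to the full volume, which is the general motivation for sparse representations at high resolution.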
We validate our method on the recently published IPD and MV-YCB multi-view datasets, demonstrating competitive performance in heavily cluttered industrial and household bin picking scenarios.
[72] LapFM: A Laparoscopic Segmentation Foundation Model via Hierarchical Concept Evolving Pre-training
Qing Xu,Kun Yuan,Yuxiang Luo,Yuhao Zhai,Wenting Duan,Nassir Navab,Zhen Chen
Main category: cs.CV
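TL;DR: This paper proposes LapFM, built on a hierarchical concept evolving pre-training paradigm over large numbers of unlabeled laparoscopic images, to address annotation scarcity and semantic inconsistency in surgical segmentation.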
Details
Motivation: Existing methods fine-tune natural-image foundation models with limited supervision, generalize poorly across diverse surgical targets, and lack a unified semantic structure for entities such as anatomy, tissue, and instruments.
Method: Proposes LapFM, which builds a Laparoscopic Concept Hierarchy (LCH) with a hierarchical mask decoder and uses confidence-driven evolving labeling to iteratively generate and filter pseudo-labels, self-training on unlabeled data and producing the large-scale dataset LapBench-114K.
Result: Experiments show that LapFM significantly outperforms existing state-of-the-art methods on universal laparoscopic segmentation and achieves strong granularity-adaptive generalization.
Conclusion: Through hierarchical concept modeling and progressive pseudo-label learning, LapFM makes effective use of unlabeled data and establishes a new foundation-model paradigm for surgical scene understanding.
Abstract: Surgical segmentation is pivotal for scene understanding yet remains hindered by annotation scarcity and semantic inconsistency across diverse procedures. Existing approaches typically fine-tune natural foundation models (e.g., SAM) with limited supervision, functioning merely as domain adapters rather than surgical foundation models. Consequently, they struggle to generalize across the vast variability of surgical targets. To bridge this gap, we present LapFM, a foundation model designed to evolve robust segmentation capabilities from massive unlabeled surgical images. Distinct from medical foundation models relying on inefficient self-supervised proxy tasks, LapFM leverages a Hierarchical Concept Evolving Pre-training paradigm. First, we establish a Laparoscopic Concept Hierarchy (LCH) via a hierarchical mask decoder with parent-child query embeddings, unifying diverse entities (i.e., Anatomy, Tissue, and Instrument) into a scalable knowledge structure with cross-granularity semantic consistency. Second, we propose a Confidence-driven Evolving Labeling that iteratively generates and filters pseudo-labels based on hierarchical consistency, progressively incorporating reliable samples from unlabeled images into training. This process yields LapBench-114K, a large-scale benchmark comprising 114K image-mask pairs. Extensive experiments demonstrate that LapFM significantly outperforms state-of-the-art methods, establishing new standards for granularity-adaptive generalization in universal laparoscopic segmentation. The source code is available at https://github.com/xq141839/LapFM.
[73] Leveraging Multispectral Sensors for Color Correction in Mobile Cameras
Luca Cogo,Marco Buzzelli,Simone Bianco,Javier Vazquez-Corral,Raimondo Schettini
Main category: cs.CV
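TL;DR: Proposes a learning-based end-to-end framework that jointly uses a high-resolution RGB sensor and a low-resolution multispectral sensor for color correction, significantly improving color accuracy and stability.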
Details
Motivation: Existing color correction methods usually work in separate stages and discard multispectral data too early, so they do not fully exploit the advantages of multispectral imaging.
Method: Proposes a unified learning framework that integrates RGB and multispectral sensor data to perform color correction end to end, and refactors two state-of-the-art image-to-image architectures to demonstrate generality.
Result: Experiments show that, compared with RGB-only or conventional multispectral methods, the approach reduces color error by up to 50% and significantly improves color accuracy and stability.
Conclusion: The proposed framework fuses multispectral and RGB data effectively to achieve high-quality color correction, and it is general with good application potential.
Abstract: Recent advances in snapshot multispectral (MS) imaging have enabled compact, low-cost spectral sensors for consumer and mobile devices. By capturing richer spectral information than conventional RGB sensors, these systems can enhance key imaging tasks, including color correction. However, most existing methods treat the color correction pipeline in separate stages, often discarding MS data early in the process. We propose a unified, learning-based framework that (i) performs end-to-end color correction and (ii) jointly leverages data from a high-resolution RGB sensor and an auxiliary low-resolution MS sensor. Our approach integrates the full pipeline within a single model, producing coherent and color-accurate outputs. We demonstrate the flexibility and generality of our framework by refactoring two different state-of-the-art image-to-image architectures. To support training and evaluation, we construct a dedicated dataset by aggregating and repurposing publicly available spectral datasets, rendering under multiple RGB camera sensitivities. Extensive experiments show that our approach improves color accuracy and stability, reducing error by up to 50% compared to RGB-only and MS-driven baselines. Datasets, code, and models will be made available upon acceptance.
[74] Uncertainty-Aware Subset Selection for Robust Visual Explainability under Distribution Shifts
Madhav Gupta,Vishak Prasad C,Ganesh Ramakrishnan
Main category: cs.CV
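TL;DR: Proposes a framework that combines submodular selection with uncertainty estimation to improve the robustness and fidelity of visual model explanations in out-of-distribution scenarios.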
Details
Motivation: Existing subset-selection-based explanation methods perform poorly under out-of-distribution (OOD) conditions and lack stability and reliability, so their robustness and trustworthiness need to improve.
Method: Combines submodular subset selection with layer-wise, gradient-driven uncertainty estimation; uncertainty is estimated through adaptive weight perturbations and then used to guide the subset optimization.
Result: Validated on multiple ID-OOD datasets, the method improves explanation quality under OOD conditions and also outperforms existing methods in ID settings, producing more stable, diverse, and informative explanations.
Conclusion: An uncertainty-aware optimization mechanism strengthens the robustness and interpretability of subset-selection explanation methods and advances transparent, trustworthy AI systems in real-world settings.
Abstract: Subset selection-based methods are widely used to explain deep vision models: they attribute predictions by highlighting the most influential image regions and support object-level explanations. While these methods perform well in in-distribution (ID) settings, their behavior under out-of-distribution (OOD) conditions remains poorly understood. Through extensive experiments across multiple ID-OOD sets, we find that the reliability of existing subset-based methods degrades markedly, yielding redundant, unstable, and uncertainty-sensitive explanations. To address these shortcomings, we introduce a framework that combines submodular subset selection with layer-wise, gradient-based uncertainty estimation to improve robustness and fidelity without requiring additional training or auxiliary models. Our approach estimates uncertainty via adaptive weight perturbations and uses these estimates to guide submodular optimization, ensuring diverse and informative subset selection. Empirical evaluations show that, beyond mitigating the weaknesses of existing methods under OOD scenarios, our framework also yields improvements in ID settings. These findings highlight limitations of current subset-based approaches and demonstrate how uncertainty-driven optimization can enhance attribution and object-level interpretability, paving the way for more transparent and trustworthy AI in real-world vision applications.
[75] Team-Aware Football Player Tracking with SAM: An Appearance-Based Approach to Occlusion Recovery
Chamath Ranasinghe,Uthayasanker Thayasivam
Main category: cs.CV
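TL;DR: Proposes a lightweight football player tracking method that combines SAM with CSRT and uses team awareness and jersey-color-based re-identification to improve occlusion recovery; it performs well under resource constraints, but long-term occlusion remains challenging.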
Details
Motivation: To address player tracking in football video under frequent occlusions, similar appearances, and crowded scenes, and to make existing methods more robust and practical in complex match conditions.
Method: Combines the Segment Anything Model (SAM) for precise initialization with CSRT trackers and a jersey-color appearance model based on HSV histograms, and adds a team-aware mechanism and a re-identification strategy after occlusion.
Result: The system runs at 7.6-7.7 FPS with memory stable at about 1880 MB; tracking success is 100% under light occlusion and 90% in crowded penalty-box scenes; appearance-based re-identification recovers 50% of heavy occlusions, but re-acquisition after players leave the frame for a long time is only 8.66%.
Conclusion: The SAM + CSRT combination is stable across crowd densities and suits scenes with continuous visibility, but it needs stronger re-identification for long-term occlusion; the results give practical deployment guidance for resource-constrained football tracking systems.
Abstract: Football player tracking is challenged by frequent occlusions, similar appearances, and rapid motion in crowded scenes. This paper presents a lightweight SAM-based tracking method combining the Segment Anything Model (SAM) with CSRT trackers and jersey color-based appearance models. We propose a team-aware tracking system that uses SAM for precise initialization and HSV histogram-based re-identification to improve occlusion recovery. Our evaluation measures three dimensions: processing speed (FPS and memory), tracking accuracy (success rate and box stability), and robustness (occlusion recovery and identity consistency). Experiments on football video sequences show that the approach achieves 7.6-7.7 FPS with stable memory usage (~1880 MB), maintaining 100 percent tracking success in light occlusions and 90 percent in crowded penalty-box scenarios with 5 or more players. Appearance-based re-identification recovers 50 percent of heavy occlusions, demonstrating the value of domain-specific cues. Analysis reveals key trade-offs: the SAM + CSRT combination provides consistent performance across crowd densities but struggles with long-term occlusions where players leave the frame, achieving only 8.66 percent re-acquisition success. These results offer practical guidelines for deploying football tracking systems under resource constraints, showing that classical tracker-based methods work well with continuous visibility but require stronger re-identification mechanisms for extended absences.
[76] ContextDrag: Precise Drag-Based Image Editing via Context-Preserving Token Injection and Position-Consistent Attention
Huiguo He,Pengyu Yan,Ziqi Yi,Weizhi Zhong,Zheng Liu,Yejun Tang,Huan Yang,Kun Gai,Guanbin Li,Lianwen Jin
Main category: cs.CV
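TL;DR: Proposes ContextDrag, a new context-modeling approach to drag-based image editing that injects noise-free reference features and uses a position-consistent attention mechanism to achieve high-fidelity, semantically consistent edits without finetuning or inversion.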
Details
Motivation: Existing drag-based image editing methods do not fully exploit the contextual information in the reference image (such as texture details), so their edits lack coherence and fidelity.
Method: Proposes ContextDrag with two core components: 1) Context-preserving Token Injection (CTI), which uses Latent-space Reverse Mapping (LRM) to inject VAE-encoded, noise-free reference features at their destination locations; and 2) Position-Consistent Attention (PCA), which reduces interference from irrelevant features through positional re-encoding and overlap-aware masking.
Result: Experiments on DragBench-SR and DragBench-DR show that the method outperforms all existing SOTA methods and markedly improves semantic consistency and detail preservation.
Conclusion: By making effective use of the reference image's context, ContextDrag achieves more precise and natural drag-based editing and offers a new route to high-quality image editing without finetuning.
Abstract: Drag-based image editing aims to modify visual content following user-specified drag operations. Although existing methods have made notable progress, they still fail to fully exploit the contextual information in the reference image, including fine-grained texture details, leading to edits with limited coherence and fidelity. To address this challenge, we introduce ContextDrag, a new paradigm for drag-based editing that leverages the strong contextual modeling capability of editing models, such as FLUX-Kontext. By incorporating VAE-encoded features from the reference image, ContextDrag can leverage rich contextual cues and preserve fine-grained details, without the need for finetuning or inversion. Specifically, ContextDrag first introduces a novel Context-preserving Token Injection (CTI) that injects noise-free reference features into their correct destination locations via a Latent-space Reverse Mapping (LRM) algorithm. This strategy enables precise drag control while preserving consistency in both semantics and texture details. Second, ContextDrag adopts a novel Position-Consistent Attention (PCA), which positionally re-encodes the reference tokens and applies overlap-aware masking to eliminate interference from irrelevant reference features. Extensive experiments on DragBench-SR and DragBench-DR demonstrate that our approach surpasses all existing SOTA methods. Code will be publicly available.
[77] Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform
Yuning Gong,Yifei Liu,Yifan Zhan,Muyao Niu,Xueying Li,Yuanjun Liao,Jiaming Chen,Yuanyuan Gao,Jiaqi Chen,Minming Chen,Li Zhou,Yuning Zhang,Wei Wang,Xiaoqing Hou,Huaxi Huang,Shixiang Tang,Le Ma,Dingwen Zhang,Xue Yang,Junchi Yan,Yanchi Zhang,Yinqiang Zheng,Xiao Sun,Zhihang Zhong
Main category: cs.CV
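TL;DR: This paper presents Visionary, an open-source, web-native platform built on WebGPU that renders various 3D Gaussian Splatting variants and meshes in real time and integrates per-frame ONNX inference, enabling dynamic neural processing and generative post-processing while remaining lightweight, click-to-run, and easy to integrate.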
Details
Motivation: Existing 3D Gaussian Splatting (3DGS) viewers are fragmented, heavy, or constrained by legacy pipelines, making them hard to deploy and poorly suited to dynamic content and generative models; an efficient, open, lightweight web-based solution is needed.
Method: Builds an efficient WebGPU-based renderer combined with per-frame ONNX inference, defines a standardized Gaussian Generator contract that supports plug-and-play algorithms for multiple 3DGS variants, and provides a TypeScript API that integrates with three.js.
Result: With identical 3DGS assets, Visionary renders more efficiently than existing Web viewers; it already supports MLP-based 3DGS, 4DGS, neural avatars, style transfer, and other variants, and implements generative post-processing.
Conclusion: By unifying inference and rendering in the browser, Visionary significantly lowers the barrier to reproducing, comparing, and deploying 3DGS-family methods, serving as a unified World Model Carrier for both reconstructive and generative paradigms.
Abstract: Neural rendering, particularly 3D Gaussian Splatting (3DGS), has evolved rapidly and become a key component for building world models. However, existing viewer solutions remain fragmented, heavy, or constrained by legacy pipelines, resulting in high deployment friction and limited support for dynamic content and generative models. In this work, we present Visionary, an open, web-native platform for real-time rendering of various Gaussian Splatting variants and meshes. Built on an efficient WebGPU renderer with per-frame ONNX inference, Visionary enables dynamic neural processing while maintaining a lightweight, "click-to-run" browser experience. It introduces a standardized Gaussian Generator contract, which not only supports standard 3DGS rendering but also allows plug-and-play algorithms to generate or update Gaussians each frame. Such inference also enables us to apply feedforward generative post-processing. The platform further offers a plug-in three.js library with a concise TypeScript API for seamless integration into existing web applications. Experiments show that, under identical 3DGS assets, Visionary achieves superior rendering efficiency compared to current Web viewers due to GPU-based primitive sorting. It already supports multiple variants, including MLP-based 3DGS, 4DGS, neural avatars, and style transformation or enhancement networks.
By unifying inference and rendering directly in the browser, Visionary significantly lowers the barrier to reproduction, comparison, and deployment of 3DGS-family methods, serving as a unified World Model Carrier for both reconstructive and generative paradigms.
[78] Temporal Concept Dynamics in Diffusion Models via Prompt-Conditioned Interventions
Ada Gorgun,Fawaz Sammani,Nikos Deligiannis,Bernt Schiele,Jonas Fischer
Main category: cs.CV
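TL;DR: This paper proposes PCI (Prompt-Conditioned Intervention), a training-free, model-agnostic framework for analyzing when concepts (such as age) form and stabilize during denoising in diffusion models. By defining Concept Insertion Success (CIS), it studies whether concepts inserted at different timesteps persist in the final image, reveals how the dynamics of different concepts differ along the generation trajectory, and offers practical guidance on when to intervene in text-driven image editing.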
Details
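Motivation: Diffusion models are usually evaluated only by their final outputs, yet generation is a dynamic process. Understanding how concepts form and lock in during that process helps make models more controllable, reliable, and predictable. Existing methods lack a systematic analysis of how concepts evolve along the denoising trajectory.
Method: Proposes the PCI framework, which injects a specific concept into the latent representation at different timesteps (for example, through prompt guidance) and measures whether the concept is preserved in the final image, yielding the Concept Insertion Success (CIS). The method needs no training or access to model internals and applies to a range of text-to-image diffusion models.
Result: PCI shows that different concepts have different sensitive phases along the diffusion timeline, with certain concept types more easily inserted and preserved in particular time windows; that models differ in their concept dynamics; and that, in image editing, choosing the intervention time with PCI gives a better balance of semantic accuracy and content preservation than strong baselines.
Conclusion: PCI is a general tool for analyzing the temporal dynamics of concepts in diffusion models. It exposes the key time windows in generation, which deepens understanding of model behavior and gives practical guidance for training-free external control, particularly for text-driven image editing.
Abstract: Diffusion models are usually evaluated by their final outputs, gradually denoising random noise into meaningful images. Yet, generation unfolds along a trajectory, and analyzing this dynamic process is crucial for understanding how controllable, reliable, and predictable these models are in terms of their success/failure modes. In this work, we ask the question: when does noise turn into a specific concept (e.g., age) and lock in the denoising trajectory? We propose PCI (Prompt-Conditioned Intervention) to study this question. PCI is a training-free and model-agnostic framework for analyzing concept dynamics through diffusion time. The central idea is the analysis of Concept Insertion Success (CIS), defined as the probability that a concept inserted at a given timestep is preserved and reflected in the final image, offering a way to characterize the temporal dynamics of concept formation. Applied to several state-of-the-art text-to-image diffusion models and a broad taxonomy of concepts, PCI reveals diverse temporal behaviors across diffusion models, in which certain phases of the trajectory are more favorable to specific concepts even within the same concept type. These findings also provide actionable insights for text-driven image editing, highlighting when interventions are most effective without requiring access to model internals or training, and yielding quantitatively stronger edits that achieve a better balance of semantic accuracy and content preservation than strong baselines.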
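Since CIS is defined as a probability per insertion timestep, one natural way to estimate it is as a success fraction over repeated trials. The sketch below assumes hypothetical inputs that the abstract does not specify: a list of `(timestep, preserved)` records, where `preserved` would come from some external check that the inserted concept appears in the final image. The function names and the threshold are illustrative only.

```python
from collections import defaultdict

def estimate_cis(trials):
    """Estimate Concept Insertion Success per insertion timestep.

    trials: iterable of (timestep, preserved) pairs, where `preserved`
    is truthy when the inserted concept was detected in the final image
    (the detector itself is outside this sketch).
    """
    counts = defaultdict(lambda: [0, 0])  # timestep -> [successes, total]
    for timestep, preserved in trials:
        counts[timestep][0] += int(bool(preserved))
        counts[timestep][1] += 1
    # Empirical probability: successes divided by trials at each timestep.
    return {t: s / n for t, (s, n) in sorted(counts.items())}

def effective_timesteps(cis, threshold=0.5):
    """Timesteps whose estimated CIS reaches the (assumed) threshold."""
    return [t for t, p in cis.items() if p >= threshold]
```

For example, `estimate_cis([(10, True), (10, False), (40, True)])` gives one estimate per timestep, and `effective_timesteps` then picks the insertion points worth using for an edit under the chosen threshold.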
Code is available at: https://github.com/adagorgun/PCI-Prompt-Controlled-Interventions
[79] On-the-fly Large-scale 3D Reconstruction from Multi-Camera Rigs
Yijia Guo,Tong Hu,Zhiwei Li,Liwen Hu,Keming Qian,Xitong Lin,Shengbo Chen,Tiejun Huang,Lei Ma
Main category: cs.CV
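TL;DR: This paper proposes the first on-the-fly 3D Gaussian Splatting reconstruction framework for multi-camera rigs, fusing dense RGB streams to achieve fast, robust, and high-fidelity reconstruction of large-scale scenes.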
Details
Motivation: Existing monocular 3DGS methods are limited by their field of view and struggle to achieve complete 3D coverage; a multi-camera rig fundamentally eases this problem, but an efficient online fusion framework has been missing.
Method: Proposes a hierarchical camera initialization scheme for coarse alignment without calibration, combined with a lightweight multi-camera bundle adjustment to refine trajectories, plus a redundancy-free Gaussian sampling strategy and a frequency-aware optimization scheduler to improve efficiency and fidelity.
Result: Using only raw multi-camera video streams, the method reconstructs hundreds of meters of 3D scenes within 2 minutes, with unprecedented speed, robustness, and fidelity.
Conclusion: The method offers an efficient solution for on-the-fly 3D scene reconstruction with multi-camera rigs and has broad application potential in large-scale dynamic environments.
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled efficient free-viewpoint rendering and photorealistic scene reconstruction. While on-the-fly extensions of 3DGS have shown promise for real-time reconstruction from monocular RGB streams, they often fail to achieve complete 3D coverage due to the limited field of view (FOV). Employing a multi-camera rig fundamentally addresses this limitation. In this paper, we present the first on-the-fly 3D reconstruction framework for multi-camera rigs. Our method incrementally fuses dense RGB streams from multiple overlapping cameras into a unified Gaussian representation, achieving drift-free trajectory estimation and efficient online reconstruction. We propose a hierarchical camera initialization scheme that enables coarse inter-camera alignment without calibration, followed by a lightweight multi-camera bundle adjustment that stabilizes trajectories while maintaining real-time performance. Furthermore, we introduce a redundancy-free Gaussian sampling strategy and a frequency-aware optimization scheduler to reduce the number of Gaussian primitives and the required optimization iterations, thereby maintaining both efficiency and reconstruction fidelity. Our method reconstructs hundreds of meters of 3D scenes within just 2 minutes using only raw multi-camera video streams, demonstrating unprecedented speed, robustness, and fidelity for on-the-fly 3D scene reconstruction.
[80] Disrupting Hierarchical Reasoning: Adversarial Protection for Geographic Privacy in Multimodal Reasoning Models
Jiaming Zhang,Che Wang,Yang Cao,Longtao Huang,Wei Yang Bryan Lim
Main category: cs.CV
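TL;DR: This paper proposes ReasonBreak, a novel adversarial framework that disrupts the hierarchical reasoning process of multimodal large reasoning models (MLRMs) to counter the privacy risk of geographic locations being inferred from personal images.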
Details
Motivation: Existing privacy protection techniques mainly target perception-based models and cannot effectively defend against MLRMs that infer precise locations by analyzing environmental cues through multi-level reasoning, so a protection mechanism aimed at the reasoning process itself is needed.
Method: Proposes the ReasonBreak framework, which uses a concept-aware perturbation strategy to interfere with the key conceptual dependencies in the reasoning chain, and builds GeoPrivacy-6K, a dataset of 6,341 ultra-high-resolution images with hierarchical concept annotations, to support the research.
Result: Experiments on seven state-of-the-art MLRMs (including GPT-o3, GPT-5, and Gemini 2.5 Pro) show that ReasonBreak reaches 33.8% tract-level protection (a 14.4-point improvement over 19.4%) and nearly doubles block-level protection to 33.5%.
Conclusion: ReasonBreak establishes a new protection paradigm against reasoning-based privacy threats and shows that perturbations aligned with conceptual hierarchies are more effective than uniform noise.
Abstract: Multi-modal large reasoning models (MLRMs) pose significant privacy risks by inferring precise geographic locations from personal images through hierarchical chain-of-thought reasoning. Existing privacy protection techniques, primarily designed for perception-based models, prove ineffective against MLRMs' sophisticated multi-step reasoning processes that analyze environmental cues. We introduce ReasonBreak, a novel adversarial framework specifically designed to disrupt hierarchical reasoning in MLRMs through concept-aware perturbations. Our approach is founded on the key insight that effective disruption of geographic reasoning requires perturbations aligned with conceptual hierarchies rather than uniform noise. ReasonBreak strategically targets critical conceptual dependencies within reasoning chains, generating perturbations that invalidate specific inference steps and cascade through subsequent reasoning stages. To facilitate this approach, we contribute GeoPrivacy-6K, a comprehensive dataset comprising 6,341 ultra-high-resolution images (≥2K) with hierarchical concept annotations. Extensive evaluation across seven state-of-the-art MLRMs (including GPT-o3, GPT-5, Gemini 2.5 Pro) demonstrates ReasonBreak's superior effectiveness, achieving a 14.4% improvement in tract-level protection (33.8% vs 19.4%) and nearly doubling block-level protection (33.5% vs 16.8%). This work establishes a new paradigm for privacy protection against reasoning-based threats.
[81] Beyond the Noise: Aligning Prompts with Latent Representations in Diffusion Models
Vasco Ramos,Regev Cohen,Idan Szpektor,Joao Magalhaes
Main category: cs.CV
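TL;DR: Proposes NoisyCLIP, the first method to evaluate text-image alignment in the noisy latent space during denoising; it can detect misalignment in real time during generation, cutting computational cost by 50% while retaining 98% of CLIP performance.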
Details
Motivation: Existing conditional diffusion models often suffer from text-image misalignment and hallucinations; alignment has to be checked after generation, which is slow and expensive and rules out real-time quality control.
Method: Proposes NoisyCLIP, which uses dual encoders to measure semantic alignment in the noisy latent space of the reverse diffusion process, enabling early misalignment detection during generation.
Result: Experiments show that in the Best-of-N setting NoisyCLIP cuts computational cost by 50% while reaching 98% of CLIP alignment performance, supporting real-time alignment assessment during generation.
Conclusion: Early alignment detection in the noisy latent space is feasible and efficient, and NoisyCLIP offers a new route to lower generation cost and better semantic fidelity.
Abstract: Conditional diffusion models rely on language-to-image alignment methods to steer the generation towards semantically accurate outputs. Despite the success of this architecture, misalignment and hallucinations remain common issues and require automatic misalignment detection tools to improve quality, for example by applying them in a Best-of-N (BoN) post-generation setting. Unfortunately, measuring the alignment after the generation is an expensive step since we need to wait for the overall generation to finish to determine prompt adherence. In contrast, this work hypothesizes that text/image misalignments can be detected early in the denoising process, enabling real-time alignment assessment without waiting for the complete generation. In particular, we propose NoisyCLIP, a method that measures semantic alignment in the noisy latent space. This work is the first to explore and benchmark prompt-to-latent misalignment detection during image generation using dual encoders in the reverse diffusion process. We evaluate NoisyCLIP qualitatively and quantitatively and find it reduces computational cost by 50% while achieving 98% of CLIP alignment performance in BoN settings. This approach enables real-time alignment assessment during generation, reducing costs without sacrificing semantic fidelity.
[82] OCCDiff: Occupancy Diffusion Model for High-Fidelity 3D Building Reconstruction from Noisy Point Clouds
Jialu Sui,Rui Liu,Hongsheng Zhang
Main category: cs.CV
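TL;DR: Proposes OCCDiff, which applies latent diffusion in the occupancy function space and combines a function autoencoder with a point cloud encoder to reconstruct building surfaces from LiDAR point clouds accurately and robustly.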
Details
Motivation: Existing methods struggle to reconstruct building surfaces accurately under varying point densities and noise, and lack a flexible way to model continuous 3D structure.
Method: Designs the OCCDiff framework, comprising a latent-diffusion generative model, a function autoencoder that learns continuous occupancy functions, and a point encoder for conditional feature extraction and multi-modal feature fusion, trained with a multi-task strategy to strengthen feature representations.
Result: Experiments show that the method generates physically consistent, high-fidelity occupancy functions, is robust to noisy data, and maintains high reconstruction quality across resolutions.
Conclusion: By introducing a diffusion model in the occupancy function space, OCCDiff addresses the challenges of reconstructing building surfaces from LiDAR point clouds and delivers flexible, accurate, and robust 3D modeling.
Abstract: A major challenge in reconstructing buildings from LiDAR point clouds lies in accurately capturing building surfaces under varying point densities and noise interference. To flexibly gather high-quality 3D profiles of the building at diverse resolutions, we propose OCCDiff, which applies latent diffusion in the occupancy function space. Our OCCDiff combines a latent diffusion process with a function autoencoder architecture to generate continuous occupancy functions evaluable at arbitrary locations. Moreover, a point encoder is proposed to provide condition features to diffusion learning, constrain the final occupancy prediction for the occupancy decoder, and inject multi-modal features for latent generation into the latent encoder. To further enhance the model performance, a multi-task training strategy is employed, ensuring that the point encoder learns diverse and robust feature representations. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy data.
[83] Thinking with Images via Self-Calling Agent
Wenxi Yang,Yuzhong Zhao,Fang Wan,Qixiang Ye
Main category: cs.CV
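TL;DR: Proposes sCoT, a new visual reasoning paradigm that reformulates multimodal reasoning as a language-only, self-calling chain of thought, improving training efficiency and performance.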
Details
Motivation: Existing interleaved multimodal chain-of-thought (iMCoT) approaches depend on scarce high-quality reasoning data and are hard to optimize effectively with reinforcement learning.
Method: Decomposes a complex visual task into atomic subtasks that a main agent delegates to parameter-sharing virtual subagents working in isolated contexts, and uses group-relative policy optimization to reinforce effective reasoning behavior.
Result: On HR-Bench 4K, sCoT improves reasoning performance by up to 1.9% over strong baselines while using about 75% fewer GPU training hours.
Conclusion: sCoT is an effective and efficient new paradigm for visual reasoning that needs no explicit interleaving of modalities and markedly improves training efficiency and reasoning performance.
Abstract: Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior to enhance optimization. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to 1.9% with ~75% fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
[84] MVP: Multiple View Prediction Improves GUI Grounding
Yunzhu Zhang,Zeyu Pan,Zhengwen Zeng,Shuheng Shen,Changhua Meng,Linchao Zhu
Main category: cs.CV
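TL;DR: Proposes MVP, a training-free multi-view prediction framework that uses attention-guided view generation and multi-coordinate clustering to make coordinate predictions in GUI grounding more stable and accurate.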
Details
Motivation: Existing GUI grounding models are sensitive to minor visual perturbations, which makes coordinate predictions unstable, especially for high-resolution screens and small UI elements, and this limits practical use.
Method: The Multi-View Prediction (MVP) framework has two parts: (1) attention-guided view proposal, which uses instruction-to-image attention scores to generate diverse cropped views; and (2) multi-coordinate clustering, which spatially clusters the predictions and takes the centroid of the densest cluster as the final coordinate.
Result: MVP substantially improves several models on benchmarks such as ScreenSpot-Pro, for example raising Qwen3VL-32B-Instruct to 74.0% accuracy, and shows good robustness and generalization.
Conclusion: MVP is an effective, general, training-free enhancement that markedly improves the stability and accuracy of GUI grounding models and applies to a range of existing models.
Abstract: GUI grounding, which translates natural language instructions into precise pixel coordinates, is essential for developing practical GUI agents. However, we observe that existing grounding models exhibit significant coordinate prediction instability: minor visual perturbations (e.g. cropping a few pixels) can drastically alter predictions, flipping results between correct and incorrect. This instability severely undermines model performance, especially for high-resolution samples with small UI elements. To address this issue, we propose Multi-View Prediction (MVP), a training-free framework that enhances grounding performance through multi-view inference. Our key insight is that while single-view predictions may be unstable, aggregating predictions from multiple carefully cropped views can effectively distinguish correct coordinates from outliers. MVP comprises two components: (1) Attention-Guided View Proposal, which derives diverse views guided by instruction-to-image attention scores, and (2) Multi-Coordinates Clustering, which ensembles predictions by selecting the centroid of the densest spatial cluster. Extensive experiments demonstrate MVP's effectiveness across various models and benchmarks. Notably, on ScreenSpot-Pro, MVP boosts UI-TARS-1.5-7B to 56.1%, GTA1-7B to 61.7%, Qwen3VL-8B-Instruct to 65.3%, and Qwen3VL-32B-Instruct to 74.0%. The code is available at https://github.com/ZJUSCL/MVP.
[85] PaintFlow: A Unified Framework for Interactive Oil Paintings Editing and Generation
Zhangli Hu,Ye Chen,Jiajun Yao,Bingbing Ni
Main category: cs.CV
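TL;DR: Proposes a unified multimodal framework for oil painting generation and editing that accepts reference images, hand-drawn sketches, and text prompts together, giving consistent control over semantics, structure, and style.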
Details
Motivation: Existing methods are constrained by the distribution of their training data and struggle with fine-grained editing and generation of artistic images such as oil paintings; in particular, they do not effectively model the combination of abstract thinking and artistic expression.
Method: (1) During training, a spatial alignment and semantic enhancement conditioning strategy treats sketches and masks as spatial constraints and encodes reference images and text as feature constraints; (2) a self-supervised style transfer pipeline based on Stroke-Based Rendering (SBR) builds a large-scale paired dataset; (3) at inference, the AdaIN operator fuses features to keep the style consistent.
Result: Experiments show that the method enables fine-grained editing while preserving the artistic character of oil paintings, reaching a higher level of imagination realization in stylized oil painting generation and editing.
Conclusion: The framework addresses style consistency and semantic control in oil painting generation and editing and offers a new approach to interactive creation of artistic images.
Abstract: Oil painting, as a high-level medium that blends human abstract thinking with artistic expression, poses substantial challenges for digital generation and editing due to its intricate brushstroke dynamics and stylized characteristics. Existing generation and editing techniques are often constrained by the distribution of training data and primarily focus on modifying real photographs. In this work, we introduce a unified multimodal framework for oil painting generation and editing. The proposed system allows users to incorporate reference images for precise semantic control, hand-drawn sketches for spatial structure alignment, and natural language prompts for high-level semantic guidance, while consistently maintaining a unified painting style across all outputs. Our method achieves interactive oil painting creation through three crucial technical advancements. First, we enhance the training stage with a spatial alignment and semantic enhancement conditioning strategy, which maps masks and sketches into spatial constraints and encodes contextual embeddings from reference images and text into feature constraints, enabling object-level semantic alignment. Second, to overcome data scarcity, we propose a self-supervised style transfer pipeline based on Stroke-Based Rendering (SBR), which simulates the inpainting dynamics of oil painting restoration, converting real images into stylized oil paintings with preserved brushstroke textures to construct a large-scale paired training dataset. Finally, during inference, we integrate features using the AdaIN operator to ensure stylistic consistency.
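For readers unfamiliar with the AdaIN operator named in the inference step, the following is a minimal sketch of standard adaptive instance normalization, not the authors' implementation. It assumes features are given as a list of channels, each a flat list of activations; the function names `channel_stats` and `adain` and the `eps` value are illustrative choices. Each content channel is normalized to zero mean and unit variance, then rescaled to the mean and standard deviation of the matching style channel.

```python
import math


def channel_stats(values, eps=1e-5):
    """Return (mean, std) of one feature channel; eps keeps std strictly positive."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def adain(content, style, eps=1e-5):
    """Adaptive instance normalization (illustrative, pure-Python version).

    content, style: lists of channels; each channel is a flat list of floats.
    Output channel = style_std * (content - content_mean) / content_std + style_mean.
    """
    fused = []
    for c_chan, s_chan in zip(content, style):
        c_mean, c_std = channel_stats(c_chan, eps)
        s_mean, s_std = channel_stats(s_chan, eps)
        fused.append([s_std * (v - c_mean) / c_std + s_mean for v in c_chan])
    return fused


if __name__ == "__main__":
    content = [[1.0, 2.0, 3.0, 4.0]]      # one channel of content features
    style = [[10.0, 20.0, 30.0, 40.0]]    # one channel of style features
    print(adain(content, style))
```

After the call, each output channel carries the style channel's first- and second-order statistics while keeping the relative ordering of the content activations, which is why the operator is a common choice for keeping style consistent.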
Extensive experiments demonstrate that our interactive system enables fine-grained editing while preserving the artistic qualities of oil paintings, achieving an unprecedented level of imagination realization in stylized oil painting generation and editing.
[86] Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement
Xinyue Liang,Zhinyuan Ma,Lingchen Sun,Yanjun Guo,Lei Zhang
Main category: cs.CV
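TL;DR: Proposes Photo3D, a photorealistic 3D generation framework driven by GPT-4o-generated images, which improves the visual realism of 3D models through structure-aligned multi-view synthesis and a detail enhancement scheme.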
Details
Motivation: Existing 3D generative models produce good geometry but unrealistic appearance, mainly because diverse, high-quality real-world 3D assets (especially ones with rich textures) are scarce and hard to capture.
Method: Photo3D is driven by images generated with GPT-4o. A structure-aligned multi-view synthesis pipeline mitigates the 3D structural inconsistency of generated images; a detail-enhanced multi-view dataset paired with 3D geometry is constructed; perceptual feature adaptation and semantic structure matching enhance texture detail while preserving geometric consistency; and dedicated training strategies cover both coupled and decoupled 3D generation paradigms.
Result: Experiments show that Photo3D performs well across multiple 3D generation paradigms, achieves state-of-the-art photorealistic 3D generation, and generalizes well.
Conclusion: Photo3D addresses the lack of realistic appearance in 3D generation: by combining images from a large generative model with structure-aware detail enhancement, it produces geometry and texture of high quality together.
Abstract: Although recent 3D-native generators have made great progress in synthesizing reliable geometry, they still fall short in achieving realistic appearances. A key obstacle lies in the lack of diverse and high-quality real-world 3D assets with rich texture details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, non-rigid motions of objects, and the limited precision of 3D scanners. We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT-4o-Image model. Considering that the generated images can distort 3D structures due to their lack of multi-view consistency, we design a structure-aligned multi-view synthesis pipeline and construct a detail-enhanced multi-view dataset paired with 3D geometry. Building on it, we present a realistic detail enhancement scheme that leverages perceptual feature adaptation and semantic structure matching to enforce appearance consistency with realistic details while preserving the structural consistency with the 3D-native geometry. Our scheme is general to different 3D-native generators, and we present dedicated training strategies to facilitate the optimization of geometry-texture coupled and decoupled 3D-native generation paradigms. Experiments demonstrate that Photo3D generalizes well across diverse 3D-native generation paradigms and achieves state-of-the-art photorealistic 3D generation performance.
[87] Fast-ARDiff: An Entropy-informed Acceleration Framework for Continuous Space Autoregressive Generation
Zhen Zou,Xiaoxiao Ma,Jie Huang,Zichao Yu,Feng Zhao
Main category: cs.CV
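TL;DR: Proposes Fast-ARDiff, a unified AR-diffusion hybrid framework that uses entropy-informed speculative decoding and joint distillation to accelerate both the AR and diffusion components, reaching up to 4.3× lossless speedup.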
Details
Motivation: AR-diffusion hybrid models combine structured modeling with high-quality image generation, but sequential AR generation and iterative diffusion denoising cause high latency, so end-to-end joint optimization is needed to speed up inference.
Method: (1) An entropy-informed speculative decoding strategy makes the draft model's outputs better match the target model's entropy characteristics, lowering the rejection rate; (2) the diffusion module is brought into the same framework, with a dynamic scheduler that prioritizes AR optimization and uses it to guide diffusion; (3) joint distillation with trajectory and distribution matching enables high-quality generation in few steps; (4) at inference, shallow-feature entropy from the AR module pre-filters low-entropy drafts to cut redundant computation.
Result: On ImageNet 256×256, TransDiff attains a 4.3× lossless speedup, and NextStep-1 achieves a 3× speedup on text-conditioned generation, significantly outperforming existing methods.
Conclusion: By optimizing the AR and diffusion modules jointly, Fast-ARDiff addresses the high latency of hybrid models and offers a new paradigm for efficient generation.
Abstract: Autoregressive (AR)-diffusion hybrid paradigms combine AR's structured modeling with diffusion's photorealistic synthesis, yet suffer from high latency due to sequential AR generation and iterative denoising. In this work, we tackle this bottleneck and propose a unified AR-diffusion framework Fast-ARDiff that jointly optimizes both components, accelerating AR speculative decoding while simultaneously facilitating faster diffusion decoding. Specifically: (1) The entropy-informed speculative strategy encourages the draft model to produce higher-entropy representations aligned with the target model's entropy characteristics, mitigating entropy mismatch and high rejection rates caused by draft overconfidence. (2) For diffusion decoding, rather than treating it as an independent module, we integrate it into the same end-to-end framework using a dynamic scheduler that prioritizes AR optimization to guide the diffusion part in further steps. The diffusion part is optimized through a joint distillation framework combining trajectory and distribution matching, ensuring stable training and high-quality synthesis with extremely few steps. During inference, shallow feature entropy from the AR module is used to pre-filter low-entropy drafts, avoiding redundant computation and improving latency. Fast-ARDiff achieves state-of-the-art acceleration across diverse models: on ImageNet 256$\times$256, TransDiff attains 4.3$\times$ lossless speedup, and NextStep-1 achieves 3$\times$ acceleration on text-conditioned generation.
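The entropy-based pre-filtering of drafts described above can be illustrated with a small sketch. This is an assumption-laden stand-in, not the Fast-ARDiff implementation: the abstract refers to entropy of shallow AR features, whereas the sketch below uses the Shannon entropy of a softmax over hypothetical per-draft logits, and the function names and the `min_entropy` threshold are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def prefilter_drafts(draft_logits, min_entropy):
    """Return indices of drafts whose entropy is at least min_entropy.

    Drafts below the threshold (overconfident, low-entropy) are dropped
    before any further, more expensive processing. The threshold is a
    hypothetical tuning parameter.
    """
    kept = []
    for index, logits in enumerate(draft_logits):
        if shannon_entropy(softmax(logits)) >= min_entropy:
            kept.append(index)
    return kept


if __name__ == "__main__":
    drafts = [
        [10.0, 0.0, 0.0],  # sharply peaked distribution: low entropy
        [1.0, 1.0, 1.0],   # uniform distribution: maximal entropy
    ]
    print(prefilter_drafts(drafts, min_entropy=0.5))
```

The point of the sketch is only the control flow: a cheap entropy score is computed per draft, and drafts that fall below a threshold are skipped so that later stages do not spend computation on them.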
Code will be available at https://github.com/aSleepyTree/Fast-ARDiff.
[88] A Novel Wasserstein Quaternion Generative Adversarial Network for Color Image Generation
Zhigang Jia,Duan Wang,Hengkai Wang,Yajun Xie,Meixiang Zhao,Xiaoyu Zhao
Main category: cs.CV
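TL;DR: Proposes a new quaternion Wasserstein distance with its dual theory and builds a Wasserstein quaternion generative adversarial network on it, which shows higher efficiency and image quality in color image generation.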
Details
Motivation: Existing generative models ignore the correlation among color channels, which can cause chromatic aberration, and there is no theory that systematically explains the data distribution of color images.
Method: A new quaternion Wasserstein distance is defined and its strong duality form is derived using the quaternion convex set separation theorem and the quaternion Farkas lemma; a Wasserstein quaternion generative adversarial network is built on this distance.
Result: The new model outperforms conventional generative adversarial networks and the Wasserstein generative adversarial network in generation efficiency and image quality.
Conclusion: The method improves color image generation and provides a new theoretical basis for measuring different color image datasets.
Abstract: Color image generation has a wide range of applications, but the existing generation models ignore the correlation among color channels, which may lead to chromatic aberration problems. In addition, the data distribution problem of color images has not been systematically elaborated and explained, so a theory for measuring different color image datasets is still lacking. In this paper, we define a new quaternion Wasserstein distance and develop its dual theory. To deal with the quaternion linear programming problem, we derive the strong duality form with the help of the quaternion convex set separation theorem and the quaternion Farkas lemma. Using the quaternion Wasserstein distance, we propose a novel Wasserstein quaternion generative adversarial network. Experiments demonstrate that this novel model surpasses both the (quaternion) generative adversarial networks and the Wasserstein generative adversarial network in terms of generation efficiency and image quality.
[89] An Iteration-Free Fixed-Point Estimator for Diffusion Inversion
Yifei Chen,Kaiyu Song,Yan Pan,Jianxing Yu,Jian Yin,Hanjiang Lai
Main category: cs.CV
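TL;DR: Proposes an iteration-free fixed-point estimator for diffusion inversion that uses error approximation to reconstruct images efficiently and accurately.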
Details
Motivation: Existing diffusion inversion methods based on fixed-point iteration are computationally expensive and require complex hyperparameter selection, so a more efficient non-iterative solution is needed.
Method: An explicit expression for the fixed point is derived from an ideal inversion step, and the computable prediction error from the previous step is used to approximate the unknown error at the current step, giving a computable, unbiased, low-variance fixed-point estimator.
Result: The method is validated on NOCAPS and MS-COCO; compared with DDIM inversion and iteration-based methods, it achieves better reconstruction without additional training or iterations.
Conclusion: The iteration-free fixed-point estimator keeps reconstruction accuracy high while substantially reducing computational cost, offering an efficient and practical approach to diffusion inversion.
Abstract: Diffusion inversion aims to recover the initial noise corresponding to a given image such that this noise can reconstruct the original image through the denoising diffusion process. The key component of diffusion inversion is to minimize errors at each inversion step, thereby mitigating cumulative inaccuracies. Recently, fixed-point iteration has emerged as a widely adopted approach to minimize reconstruction errors at each inversion step. However, it suffers from high computational costs due to its iterative nature and the complexity of hyperparameter selection. To address these issues, we propose an iteration-free fixed-point estimator for diffusion inversion. First, we derive an explicit expression of the fixed point from an ideal inversion step. Unfortunately, it inherently contains an unknown data prediction error. Building upon this, we introduce the error approximation, which uses the calculable error from the previous inversion step to approximate the unknown error at the current inversion step. This yields a calculable, approximate expression for the fixed point, which is an unbiased estimator characterized by low variance, as shown by our theoretical analysis. We evaluate reconstruction performance on two text-image datasets, NOCAPS and MS-COCO. Compared to DDIM inversion and other inversion methods based on the fixed-point iteration, our method achieves consistent and superior performance in reconstruction tasks without additional iterations or training.
[90] SSCATeR: Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling for Real-Time 3D Object Detection in LiDAR Point Clouds
Alexander Dow,Manduhu Manduhu,Matheus Santos,Ben Bartlett,Gerard Dooly,James Riordan
Main category: cs.CV
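TL;DR: Proposes SSCATeR, a LiDAR object detection method based on sparse scatter-based convolution with temporal data reuse; by processing only the regions of the point cloud that change, it greatly reduces computation and cuts processing time by up to 6.61-fold without losing accuracy.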
Details
Motivation: Conventional LiDAR object detection convolves over the full region of every frame, which wastes computation. This work exploits the way consecutive LiDAR sweeps change to avoid reprocessing unchanged regions and thereby speed up inference.
Method: A sliding time window with short strides stores and reuses convolution results across frames, so convolution runs only where the point cloud has changed. The authors' earlier scatter-based convolution work is extended into the SSCATeR algorithm, which supports temporal data reuse and treats LiDAR data as a continuous stream.
Result: Processing time falls by as much as 6.61-fold with no loss of detection accuracy; the output feature maps are identical to those of conventional sparse convolution, confirming equivalence.
Conclusion: By reusing data along the time axis, SSCATeR greatly improves the computational efficiency of LiDAR object detection while preserving detection performance, making it suitable for real-time settings such as autonomous driving.
Abstract: This work leverages the continuous sweeping motion of LiDAR scanning to concentrate object detection efforts on specific regions that receive a change in point data from one frame to another. We achieve this by using a sliding time window with short strides and considering the temporal dimension by storing convolution results between passes. This allows us to ignore unchanged regions, significantly reducing the number of convolution operations per forward pass without sacrificing accuracy. This data reuse scheme introduces extreme sparsity to detection data. To exploit this sparsity, we extend our previous work on scatter-based convolutions to allow for data reuse, and as such propose Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling (SSCATeR). This operation treats incoming LiDAR data as a continuous stream and acts only on the changing parts of the point cloud. By doing so, we achieve the same results with as much as a 6.61-fold reduction in processing time. Our test results show that the feature maps output by our method are identical to those produced by traditional sparse convolution techniques, whilst greatly increasing the computational efficiency of the network.
[91] BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain
Navve Wasserman,Matias Cosarinsky,Yuval Golbari,Aude Oliva,Antonio Torralba,Tamar Rott Shaham,Michal Irani
Main category: cs.CV
TL;DR: Proposes a large-scale, automated framework for discovering and explaining visual representations across the human cortex.
Details
Motivation: Existing studies are small in scale, rely on manual inspection, and lack systematic validation, which makes it hard to understand fully how complex brain signals relate to the vast space of visual concepts.
Method: Unsupervised, data-driven decomposition discovers interpretable patterns in fMRI activity; an automated pipeline then identifies the natural images that most strongly elicit each pattern and generates a natural-language description of their shared visual meaning.
Result: Thousands of interpretable patterns spanning many visual concepts were found, including fine-grained representations not previously reported.
Conclusion: The framework systematically reveals visual representations across the whole brain and provides a new tool for understanding how the brain encodes visual concepts.
Abstract: Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and properties, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns quantitative reliability scores, and selects the most consistent description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.
[92] Modular Neural Image Signal Processing
Mahmoud Afifi,Zhongling Wang,Ran Zhang,Michael S. Brown
Main category: cs.CV
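TL;DR: Proposes a modular neural image signal processing (ISP) framework that processes raw inputs and renders high-quality display-referred images, offering high rendering accuracy, scalability, debuggability, generalization, and style flexibility.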
Details
Motivation: Existing neural ISP designs lack fine control over intermediate rendering stages and struggle to combine generality, debuggability, and matching of user preferences, so a more flexible, controllable, and efficient modular solution is needed.
Method: A fully learning-based modular neural ISP framework with multiple independently adjustable intermediate stages and variants of different capacities (about 0.5M to 3.9M parameters), together with a user-interactive photo-editing tool that demonstrates its flexibility and re-rendering ability.
Result: The method achieves competitive qualitative and quantitative results on multiple test sets, generalizes to unseen cameras, is debuggable, supports diverse user-preference styles, and allows unlimited post-editable re-rendering.
Conclusion: The modular neural ISP framework stays lightweight while markedly improving rendering quality, flexibility, and practicality, offering an effective basis for image processing and user-customized editing.
Abstract: This paper presents a modular neural image signal processing (ISP) framework that processes raw inputs and renders high-quality display-referred images. Unlike prior neural ISP designs, our method introduces a high degree of modularity, providing full control over multiple intermediate stages of the rendering process. This modular design not only achieves high rendering accuracy but also improves scalability, debuggability, generalization to unseen cameras, and flexibility to match different user-preference styles. To demonstrate the advantages of this design, we built a user-interactive photo-editing tool that leverages our neural ISP to support diverse editing operations and picture styles. The tool is carefully engineered to take advantage of the high-quality rendering of our neural ISP and to enable unlimited post-editable re-rendering. Our method is a fully learning-based framework with variants of different capacities, all of moderate size (ranging from ~0.5 M to ~3.9 M parameters for the entire pipeline), and consistently delivers competitive qualitative and quantitative results across multiple test sets. Watch the supplemental video at: https://youtu.be/ByhQjQSjxVM
[93] Instance-Aware Test-Time Segmentation for Continual Domain Shifts
Seunghwan Lee,Inyoung Jung,Hojoon Lee,Eunil Park,Sungeun Hong
Main category: cs.CV
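TL;DR: Proposes adaptive pseudo-label adjustment for Continual Test-Time Adaptation (CTTA), enabling fine-grained, class- and instance-aware semantic segmentation that mitigates error accumulation and outperforms state-of-the-art methods in eight scenarios.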
Details
Motivation: Existing CTTA methods rely on fixed or batch-level thresholds, which cannot handle the varying difficulty of different classes and instances; this is especially limiting in semantic segmentation, which requires dense prediction.
Method: Pseudo labels are adjusted adaptively according to the confidence distribution within each image, and learning is dynamically balanced toward the classes most affected by domain shift, giving class- and instance-aware continual adaptation.
Result: Experiments in eight CTTA and TTA scenarios, including synthetic-to-real and long-term shifts, show that the method consistently outperforms state-of-the-art techniques in semantic segmentation.
Conclusion: The fine-grained adaptive pseudo-labeling mechanism improves adaptation to continually evolving domains and sets a new standard for continual test-time adaptation in semantic segmentation.
Abstract: Continual Test-Time Adaptation (CTTA) enables pre-trained models to adapt to continuously evolving domains. Existing methods have improved robustness but typically rely on fixed or batch-level thresholds, which cannot account for varying difficulty across classes and instances. This limitation is especially problematic in semantic segmentation, where each image requires dense, multi-class predictions. We propose an approach that adaptively adjusts pseudo labels to reflect the confidence distribution within each image and dynamically balances learning toward classes most affected by domain shifts. This fine-grained, class- and instance-aware adaptation produces more reliable supervision and mitigates error accumulation throughout continual adaptation. Extensive experiments across eight CTTA and TTA scenarios, including synthetic-to-real and long-term shifts, show that our method consistently outperforms state-of-the-art techniques, setting a new standard for semantic segmentation under evolving conditions.
[94] From Cells to Survival: Hierarchical Analysis of Cell Inter-Relations in Multiplex Microscopy for Lung Cancer Prognosis
Olle Edgren Schüllerqvist,Jens Baumann,Joakim Lindblad,Love Nordling,Artur Mezheyeuski,Patrick Micke,Nataša Sladoje
Main category: cs.CV
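TL;DR: HiGINE is a hierarchical graph-based model that predicts lung cancer patient survival from tumor microenvironment features in multiplex immunofluorescence images, improving risk stratification by combining local and global cell-neighborhood relations with multimodal information.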
Details
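Motivation: The tumor microenvironment (TME) is an important source of potential prognostic biomarkers, but existing methods do not fully capture the complex interactions between cell types, so more effective analysis is needed to improve survival prediction and risk stratification.
Method: HiGINE is a hierarchical graph model that encodes cell type and morphology from multiplex immunofluorescence (mIF) images, models local and global relations in cell neighborhoods, and fuses multimodal data such as cancer stage to predict patient survival (short vs. long).
Result: Validated on two public datasets, HiGINE outperforms existing methods in risk stratification, robustness, and generalizability.
Conclusion: HiGINE makes effective use of the complex structure of the tumor microenvironment and, combined with multimodal data, markedly improves survival prediction for lung cancer patients, showing promise for clinical use.
Abstract: The tumor microenvironment (TME) has emerged as a promising source of prognostic biomarkers. To fully leverage its potential, analysis methods must capture complex interactions between different cell types. We propose HiGINE -- a hierarchical graph-based approach to predict patient survival (short vs. long) from TME characterization in multiplex immunofluorescence (mIF) images and enhance risk stratification in lung cancer. Our model encodes both local and global inter-relations in cell neighborhoods, incorporating information about cell types and morphology. Multimodal fusion, aggregating cancer stage with mIF-derived features, further boosts performance. We validate HiGINE on two public datasets, demonstrating improved risk stratification, robustness, and generalizability.
[95] Disturbance-Free Surgical Video Generation from Multi-Camera Shadowless Lamps for Open Surgery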
Yuna Kato,Shohei Mori,Hideo Saito,Yoshifumi Takatsume,Hiroki Kajita,Mariko Isogawa
Main category: cs.CV
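TL;DR: Proposes an automated method that solves multi-camera image alignment when the shadowless lamp moves during surgical recording and selects the least-occluded view to produce stable, high-quality video of the surgical field, markedly improving viewing comfort and visibility of the operative area.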
Details
Motivation: When recording open surgery, surgeons often block the camera's view, and moving the shadowless lamp forces frequent, laborious manual realignment of the multi-camera system. A fully automatic method for image alignment and best-view selection is therefore needed.
Method: The method detects the frames in which the shadowless lamp moves, automatically aligns the images from the different cameras, selects the best camera view by degree of occlusion, and synthesizes a video that shows the surgical field continuously from a fixed viewpoint. Several video synthesis options are implemented for comparison.
Result: In a user study, surgeons rated the generated videos above those from conventional methods for ease of confirming the surgical area and for viewing comfort, and video quality also exceeded existing techniques. Preferences among the synthesis options were analyzed in a user study as well.
Conclusion: The method automates alignment and low-occlusion view selection in surgical video capture, improving the quality and usability of video for medical education and research, and has practical clinical value.
Abstract: Video recordings of open surgeries are in great demand for education and research purposes. However, capturing unobstructed videos is challenging since surgeons frequently block the camera field of view. To avoid occlusion, the positions and angles of the camera must be frequently adjusted, which is highly labor-intensive. Prior work has addressed this issue by installing multiple cameras on a shadowless lamp and arranging them to fully surround the surgical area. This setup increases the chances of some cameras capturing an unobstructed view. However, manual image alignment is needed in post-processing since camera configurations change every time surgeons move the lamp for optimal lighting. This paper aims to fully automate this alignment task. The proposed method identifies frames in which the lighting system moves, realigns them, and selects the camera with the least occlusion to generate a video that consistently presents the surgical field from a fixed perspective. A user study involving surgeons demonstrated that videos generated by our method were superior to those produced by conventional methods in terms of the ease of confirming the surgical area and the comfort during video viewing. Additionally, our approach showed improvements in video quality over existing techniques. Furthermore, we implemented several synthesis options for the proposed view-synthesis method and conducted a user study to assess surgeons' preferences for each option.
[96] Automated Pollen Recognition in Optical and Holographic Microscopy Images
Swarn Singh Warshaneyan,Maksims Ivanovs,Blaž Cugmas,Inese Bērziņa,Laura Goldberga,Mindaugas Tamosiunas,Roberts Kadiķis
Main category: cs.CV
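TL;DR: Uses YOLOv8s and MobileNetV3L to improve detection and classification of pollen grains in optical and holographic microscopy images, with a focus on veterinary cytology; dataset expansion and bounding-box enlargement substantially improve results on holographic images.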
Details
Motivation: To improve the accuracy of automated pollen grain detection and classification in veterinary cytology, and to explore whether low-cost lensless holographic microscopy can be combined with deep learning.
Method: YOLOv8s is used for object detection and MobileNetV3L for classification; their performance is compared on optical and greyscale holographic images, and the holographic dataset is strengthened through automated labeling and enlarged bounding-box areas.
Result: On optical images, detection reaches 91.3% mAP50 and classification 97% accuracy. On holographic images, initial performance is low; after the improvements, detection rises from 2.49% to 13.3% mAP50 and classification accuracy from 42% to 54%.
Conclusion: At least for image classification, deep learning can be paired with low-cost lensless digital holographic microscopy devices and shows potential for application.
Abstract: This study explores the application of deep learning to improve and automate pollen grain detection and classification in both optical and holographic microscopy images, with a particular focus on veterinary cytology use cases. We used YOLOv8s for object detection and MobileNetV3L for the classification task, evaluating their performance across imaging modalities. The models achieved 91.3% mAP50 for detection and 97% overall accuracy for classification on optical images, whereas the initial performance on greyscale holographic images was substantially lower. We addressed the performance gap issue through dataset expansion using automated labeling and bounding box area enlargement. These techniques, applied to holographic images, improved detection performance from 2.49% to 13.3% mAP50 and classification performance from 42% to 54%. Our work demonstrates that, at least for image classification tasks, it is possible to pair deep learning techniques with cost-effective lensless digital holographic microscopy devices.
[97] Decoupling Template Bias in CLIP: Harnessing Empty Prompts for Enhanced Few-Shot Learning
Zhenyu Zhang,Guangyao Chen,Yixiong Zou,Zhimeng Huang,Yuhua Li
Main category: cs.CV
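TL;DR: Proposes using empty prompts to decouple template bias in CLIP, reducing the classification bias caused by template-sample similarity (TSS) and improving accuracy and robustness.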
Details
Motivation: In few-shot learning, CLIP is biased by template-sample similarity (TSS), which leads to inaccurate classification and reduced robustness.
Method: Text prompts that express "emptiness" and carry no category information are introduced to expose and reduce template-induced bias during pre-training; during few-shot fine-tuning, a bias calibration loss ensures that images align correctly with their categories.
Result: Across multiple benchmarks, the method significantly reduces the performance fluctuations caused by TSS and improves classification accuracy and robustness.
Conclusion: Empty prompts and a bias calibration loss effectively mitigate template bias in CLIP and strengthen the model's ability to learn true semantic alignment.
Abstract: The Contrastive Language-Image Pre-Training (CLIP) model excels in few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), defined as the resemblance between a text template and an image sample, introduces bias. This bias leads the model to rely on template proximity rather than true sample-to-category alignment, reducing both accuracy and robustness in classification. We present a framework that uses empty prompts, textual inputs that convey the idea of "emptiness" without category information. These prompts capture unbiased template features and offset TSS bias. The framework employs two stages. During pre-training, empty prompts reveal and reduce template-induced bias within the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces correct alignment between images and their categories, ensuring the model focuses on relevant visual cues. Experiments across multiple benchmarks demonstrate that our template correction method significantly reduces performance fluctuations caused by TSS, yielding higher classification accuracy and stronger robustness. The repository of this project is available at https://github.com/zhenyuZ-HUST/Decoupling-Template-Bias-in-CLIP.
[98] OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics
Jisang Yoo,Gyeongjin Kang,Hyun-kyu Ko,Hyeonwoo Yu,Eunbyung Park
Main category: cs.CV
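TL;DR: Proposes OpenMonoGS-SLAM, the first monocular SLAM framework to combine 3D Gaussian Splatting with open-vocabulary semantic understanding; it uses visual foundation models for localization, mapping, and open-world semantics without depth input or semantic annotations.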
Details
Motivation: Existing SLAM systems mostly depend on depth sensors or closed-set semantic models, which limits scalability and adaptability in open environments; a method is needed that builds semantic maps without depth information and with an open vocabulary.
Method: OpenMonoGS-SLAM combines 3D Gaussian Splatting (3DGS) with visual foundation models (MASt3R for geometry, SAM and CLIP for open-vocabulary semantics), performs monocular SLAM through self-supervised learning, and adds a memory mechanism that manages high-dimensional semantic features to build Gaussian semantic feature maps.
Result: Experiments show performance comparable to or better than existing baselines in both closed-set and open-set segmentation, without depth sensors or semantic annotations.
Conclusion: OpenMonoGS-SLAM is the first open-vocabulary semantic SLAM system based entirely on monocular images; it demonstrates the potential of visual foundation models in spatial perception and offers richer environment understanding for future intelligent systems.
Abstract: Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To achieve our goal, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models provide robust generalization across diverse tasks, enabling accurate monocular camera tracking and mapping, as well as a rich understanding of semantics in open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features, which effectively constructs Gaussian semantic feature maps, leading to strong overall performance.
Experimental results demonstrate that our approach achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, all without relying on supplementary sensors such as depth maps or semantic annotations.
[99] Trajectory Densification and Depth from Perspective-based Blur
Tianchen Qiu,Qirun Zhang,Jiajian He,Zhengyue Zhuge,Jiahui Xu,Yueting Chen
Main category: cs.CV
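TL;DR: Proposes a new method that estimates metric depth by analyzing blur patterns and dense trajectories in a video stream, combining optical design with algorithms to achieve accurate depth estimation without a mechanical stabilizer.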
Details
Motivation: Without a mechanical stabilizer, the camera rotates during capture and produces perspective-based blur, which is most pronounced under long exposure and degrades depth estimation.
Method: An off-the-shelf vision encoder and point tracker extract information from the video; the depth map is estimated through windowed embedding and multi-window aggregation, and a vision-language model densifies the sparse trajectory.
Result: Experiments on multiple depth datasets show strong performance over a large depth range and good generalization; measured against the real trajectory in handheld shooting, the method achieves higher precision and accurate reconstruction.
Conclusion: The method makes effective use of perspective-based blur for depth estimation and, by combining optical design with algorithmic optimization, is accurate and robust in practical settings.
Abstract: In the absence of a mechanical stabilizer, the camera undergoes inevitable rotational dynamics during capturing, which induces perspective-based blur especially under long-exposure scenarios. From an optical standpoint, perspective-based blur is depth-position-dependent: objects residing at distinct spatial locations incur different blur levels even under the same imaging settings. Inspired by this, we propose a novel method that estimates metric depth by examining the blur pattern of a video stream and the dense trajectory via a joint optical design algorithm. Specifically, we employ an off-the-shelf vision encoder and point tracker to extract video information. Then, we estimate the depth map via windowed embedding and multi-window aggregation, and densify the sparse trajectory from the optical algorithm using a vision-language model. Evaluations on multiple depth datasets demonstrate that our method attains strong performance over a large depth range, while maintaining favorable generalization. Relative to the real trajectory in handheld shooting settings, our optical algorithm achieves superior precision and the dense reconstruction maintains strong accuracy.
[100] Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Huilin Xu,Zhuoyang Liu,Yixiang Luomei,Feng Xu
Main category: cs.CV
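TL;DR: Proposes a unified framework for UAV vision-language navigation from monocular RGB images and natural language instructions; prompt-guided multi-task learning jointly optimizes spatial perception, trajectory reasoning, and action prediction, while keyframe selection and label reweighting improve efficiency and training stability, outperforming existing RGB-only methods on the Aerial VLN benchmark.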
Details
Motivation: Existing aerial VLN methods rely on panoramic images, depth, or odometry, which raises system cost and complexity and hinders deployment on lightweight UAVs. A solution that navigates efficiently from monocular RGB images alone is needed.
Method: Navigation is formulated as next-token prediction within a prompt-guided multi-task learning framework; a keyframe selection strategy reduces visual redundancy, and an action merging and label reweighting mechanism mitigates long-tailed supervision imbalance, giving stable multi-task co-training.
Result: On the Aerial VLN benchmark, the method significantly outperforms existing RGB-only baselines in both seen and unseen environments and narrows the gap to state-of-the-art panoramic RGB-D methods.
Conclusion: The method achieves efficient aerial VLN without additional sensor inputs, supporting its use on lightweight UAV platforms, and its key components improve the performance and stability of multi-task learning.
Abstract: Aerial Vision-and-Language Navigation (VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and navigate complex urban environments using onboard visual observation. This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery. Existing methods often rely on panoramic images, depth inputs, or odometry to support spatial reasoning and action planning. These requirements increase system cost and integration complexity, thus hindering practical deployment for lightweight UAVs. We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning. Moreover, we propose a keyframe selection strategy to reduce visual redundancy by retaining semantically informative frames, along with an action merging and label reweighting mechanism that mitigates long-tailed supervision imbalance and facilitates stable multi-task co-training. Extensive experiments on the Aerial VLN benchmark validate the effectiveness of our method. Under the challenging monocular RGB-only setting, our model achieves strong results across both seen and unseen environments. It significantly outperforms existing RGB-only baselines and narrows the performance gap with state-of-the-art panoramic RGB-D counterparts.
Comprehensive ablation studies further demonstrate the contribution of our task design and architectural choices.
[101] Chain-of-Image Generation: Toward Monitorable and Controllable Image Generation
Young Kyung Kim,Oded Schlesinger,Yuzhou Zhao,J. Matias Di Martino,Guillermo Sapiro
Main category: cs.CV
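TL;DR: Proposes the Chain-of-Image Generation (CoIG) framework, which recasts image generation as a human-like, monitorable sequential semantic process, improving the interpretability and controllability of generation.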
Details
Motivation: Existing image generation models have opaque generation processes that are hard to monitor and intervene in, and they lack human-like generative logic, which limits their reliability, safety, and interpretability.
Method: A large language model (LLM) decomposes a complex text prompt into simple step-by-step instructions that guide the image generation model to generate and edit the image in sequence, with each step focusing on a single semantic entity. Two new metrics, CoIG Readability and Causal Relevance, assess the readability and causal impact of intermediate steps.
Result: CoIG substantially improves the quantitative monitorability of the generation process, achieves compositional robustness comparable to mainstream baselines, and effectively mitigates entity collapse.
Conclusion: Through a human-like stepwise generation mechanism, CoIG makes image generation more transparent and controllable; it is model-agnostic and can be integrated into existing generative models.
Abstract: While state-of-the-art image generation models achieve remarkable visual quality, their internal generative processes remain a "black box." This opacity limits human observation and intervention, and poses a barrier to ensuring model reliability, safety, and control. Furthermore, their non-human-like workflows make them difficult for human observers to interpret. To address this, we introduce the Chain-of-Image Generation (CoIG) framework, which reframes image generation as a sequential, semantic process analogous to how humans create art. Similar to the advantages in monitorability and performance that Chain-of-Thought (CoT) brought to large language models (LLMs), CoIG can produce equivalent benefits in text-to-image generation. CoIG utilizes an LLM to decompose a complex prompt into a sequence of simple, step-by-step instructions. The image generation model then executes this plan by progressively generating and editing the image. Each step focuses on a single semantic entity, enabling direct monitoring. We formally assess this property using two novel metrics: CoIG Readability, which evaluates the clarity of each intermediate step via its corresponding output; and Causal Relevance, which quantifies the impact of each procedural step on the final generated image. We further show that our framework mitigates entity collapse by decomposing the complex generation task into simple subproblems, analogous to the procedural reasoning employed by CoT. Our experimental results indicate that CoIG substantially enhances quantitative monitorability while achieving competitive compositional robustness compared to established baseline models.
The framework is model-agnostic and can be integrated with any image generation model.
[102] C-DIRA: Computationally Efficient Dynamic ROI Routing and Domain-Invariant Adversarial Learning for Lightweight Driver Behavior Recognition
Keito Inoshita
Main category: cs.CV
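TL;DR: This paper proposes C-DIRA, a lightweight framework for recognizing driver distraction behavior that combines dynamic ROI routing with adversarial learning, substantially reducing computational cost while maintaining high accuracy, with good cross-domain robustness and real-time performance.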
Details
Motivation: Existing lightweight models for driver behavior recognition struggle to balance efficiency and performance and generalize poorly to unseen drivers or changing environments; ROI-based methods are computationally expensive and lack a mechanism for learning domain-invariant features.
Method: The C-DIRA framework uses saliency-driven Top-K ROI pooling and fused classification for local feature extraction, allocates computation adaptively to hard samples via dynamic ROI routing, and introduces pseudo-domain labels with adversarial learning to improve robustness to driver and background variation.
Result: On the State Farm dataset, C-DIRA achieves higher accuracy than prior lightweight models with fewer FLOPs and lower latency, and remains stable under visual degradation such as blur and low light as well as in cross-domain settings.
Conclusion: C-DIRA strikes a good balance among compactness, efficiency, and generalization, making it suitable for real-time driver behavior recognition on edge devices.
Abstract: Driver distraction behavior recognition using in-vehicle cameras demands real-time inference on edge devices. However, lightweight models often fail to capture fine-grained behavioral cues, resulting in reduced performance on unseen drivers or under varying conditions. ROI-based methods also increase computational cost, making it difficult to balance efficiency and accuracy. This work addresses the need for a lightweight architecture that overcomes these constraints. We propose Computationally efficient Dynamic region of Interest Routing and domain-invariant Adversarial learning for lightweight driver behavior recognition (C-DIRA). The framework combines saliency-driven Top-K ROI pooling and fused classification for local feature extraction and integration. Dynamic ROI routing enables selective computation by applying ROI inference only to high difficulty data samples. Moreover, pseudo-domain labeling and adversarial learning are used to learn domain-invariant features robust to driver and background variation. Experiments on the State Farm Distracted Driver Detection Dataset show that C-DIRA maintains high accuracy with significantly fewer FLOPs and lower latency than prior lightweight models. It also demonstrates robustness under visual degradation such as blur and low-light, and stable performance across unseen domains. These results confirm C-DIRA's effectiveness in achieving compactness, efficiency, and generalization.
[103] Repulsor: Accelerating Generative Modeling with a Contrastive Memory Bank
Shaofeng Zhang,Xuanqi Chen,Ning Liao,Haoxiang Zhao,Xiaoxing Wang,Haoru Tan,Sitong Wu,Xiaosong Jia,Qi Fan,Junchi Yan
Main category: cs.CV
Details
Abstract: The dominance of denoising generative models (e.g., diffusion, flow-matching) in visual synthesis is tempered by their substantial training costs and inefficiencies in representation learning. While injecting discriminative representations via auxiliary alignment has proven effective, this approach still faces key limitations: the reliance on external, pre-trained encoders introduces overhead and domain shift. A dispersed-based strategy that encourages strong separation among in-batch latent representations alleviates this specific dependency. To assess the effect of the number of negative samples in generative modeling, we propose Repulsor, a plug-and-play training framework that requires no external encoders. Our method integrates a memory bank mechanism that maintains a large, dynamically updated queue of negative samples across training iterations. This decouples the number of negatives from the mini-batch size, providing abundant and high-quality negatives for a contrastive objective without a multiplicative increase in computational cost. A low-dimensional projection head is used to further minimize memory and bandwidth overhead. Repulsor offers three principal advantages: (1) it is self-contained, eliminating dependency on pretrained vision foundation models and their associated forward-pass overhead; (2) it introduces no additional parameters or computational cost during inference; and (3) it enables substantially faster convergence, achieving superior generative quality more efficiently. On ImageNet-256, Repulsor achieves a state-of-the-art FID of 2.40 within 400k steps, significantly outperforming comparable methods.
[104] Dual-Branch Center-Surrounding Contrast: Rethinking Contrastive Learning for 3D Point Clouds
Shaofeng Zhang,Xuanqi Chen,Xiangdong Zhang,Sitong Wu,Junchi Yan
Main category: cs.CV
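TL;DR: This paper proposes a dual-branch center-surrounding contrastive learning framework (CSCon) for 3D point clouds. It builds dual-branch inputs by masking the center and surrounding regions separately and adds a patch-level contrastive loss, improving the representation of both high-level features and local details and achieving state-of-the-art performance under several downstream protocols.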
Details
Motivation: Existing self-supervised methods for 3D point clouds are mostly based on generative Masked Autoencoders, which struggle to capture high-level discriminative features. Contrastive learning is rarely applied in 3D, and directly applying 2D contrastive methods to 3D data fails to learn local geometric details adequately.
Method: The Dual-Branch Center-Surrounding Contrast (CSCon) framework masks the center and surrounding regions separately to construct center-biased and surrounding-biased dual-branch inputs, and introduces a patch-level contrastive loss to strengthen high-level semantics and local sensitivity.
Result: Under the FULL and ALL protocols, performance is comparable to generative methods; under MLP-LINEAR, MLP-3, and ONLY-NEW, the method reaches state of the art. On MLP-LINEAR in particular, it improves over Point-MAE by 7.9%, 6.7%, and 10.3% on the three ScanObjectNN variants.
Conclusion: Structured contrastive learning in CSCon improves discriminative feature representation for 3D point clouds and significantly outperforms existing generative methods, confirming the potential of contrastive learning for 3D self-supervised learning.
Abstract: Most existing self-supervised learning (SSL) approaches for 3D point clouds are dominated by generative methods based on Masked Autoencoders (MAE). However, these generative methods have been proven to struggle to capture high-level discriminative features effectively, leading to poor performance on linear probing and other downstream tasks. In contrast, contrastive methods excel in discriminative feature representation and generalization ability on image data. Despite this, contrastive learning (CL) in 3D data remains scarce. Besides, simply applying CL methods designed for 2D data to 3D fails to effectively learn 3D local details. To address these challenges, we propose a novel Dual-Branch Center-Surrounding Contrast (CSCon) framework. Specifically, we apply masking to the center and surrounding parts separately, constructing dual-branch inputs with center-biased and surrounding-biased representations to better capture rich geometric information. Meanwhile, we introduce a patch-level contrastive loss to further enhance both high-level information and local sensitivity. Under the FULL and ALL protocols, CSCon achieves performance comparable to generative methods; under the MLP-LINEAR, MLP-3, and ONLY-NEW protocols, our method attains state-of-the-art results, even surpassing cross-modal approaches.
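The center/surrounding masking and patch-level contrastive loss described above could be sketched as follows. This is only one possible reading: the abstract does not say how the center and surrounding parts are defined or which contrastive loss is used. Both helpers are hypothetical. `split_center_surrounding` ranks the points of a patch by distance to its centroid, and `patch_info_nce` is a standard InfoNCE loss in which row i of one branch is the positive for row i of the other.

```python
import numpy as np

def split_center_surrounding(patch, keep_ratio=0.5):
    """Split one local patch (K, 3) into two masked views (assumed scheme).

    center_biased keeps the points nearest the centroid (surrounding masked);
    surrounding_biased keeps the farthest points (center masked).
    """
    centroid = patch.mean(axis=0)
    order = np.argsort(np.linalg.norm(patch - centroid, axis=1))
    k = max(1, int(round(keep_ratio * len(patch))))
    center_biased = patch[order[:k]]
    surrounding_biased = patch[order[-k:]]
    return center_biased, surrounding_biased

def patch_info_nce(za, zb, temperature=0.1):
    """InfoNCE over patch embeddings za, zb of shape (P, D).

    Row i of za and row i of zb form the positive pair; all other
    rows of zb act as negatives for row i of za.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

In a full pipeline each view would pass through a point-cloud encoder before the loss is applied; the encoder is omitted here because the abstract does not describe it.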
In particular, under the MLP-LINEAR protocol, our method outperforms the baseline (Point-MAE) by 7.9%, 6.7%, and 10.3% on the three variants of ScanObjectNN, respectively. The code will be made publicly available.
[105] What really matters for person re-identification? A Mixture-of-Experts Framework for Semantic Attribute Importance
Athena Psalta,Vasileios Tsironis,Konstantinos Karantzalos
Main category: cs.CV
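TL;DR: Proposes MoSAIC-ReID, a framework based on a mixture of LoRA experts that quantifies the importance of pedestrian attributes for re-identification, enabling interpretable ReID analysis.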
Details
Motivation: Existing ReID models perform well but lack interpretability, and it is unclear which high-level semantic attributes they rely on; this work systematically analyzes attribute importance.
Method: A LoRA-based mixture-of-experts architecture assigns one expert to each semantic attribute; an oracle router enables controlled attribution analysis, and generalized linear models together with feature-importance analyses quantify each attribute's contribution.
Result: The method achieves competitive performance on Market-1501 and DukeMTMC and finds that clothing colors and intrinsic characteristics contribute most, while infrequent attributes (e.g., accessories) have little effect.
Conclusion: MoSAIC-ReID offers a principled framework for interpretable person re-identification and clarifies the role of, and requirements for, explicit semantic knowledge in practice.
Abstract: State-of-the-art person re-identification methods achieve impressive accuracy but remain largely opaque, leaving open the question: which high-level semantic attributes do these models actually rely on? We propose MoSAIC-ReID, a Mixture-of-Experts framework that systematically quantifies the importance of pedestrian attributes for re-identification. Our approach uses LoRA-based experts, each linked to a single attribute, and an oracle router that enables controlled attribution analysis. While MoSAIC-ReID achieves competitive performance on Market-1501 and DukeMTMC under the assumption that attribute annotations are available at test time, its primary value lies in providing a large-scale, quantitative study of attribute importance across intrinsic and extrinsic cues. Using generalized linear models, statistical tests, and feature-importance analyses, we reveal which attributes, such as clothing colors and intrinsic characteristics, contribute most strongly, while infrequent cues (e.g. accessories) have limited effect. This work offers a principled framework for interpretable ReID and highlights the requirements for integrating explicit semantic knowledge in practice. Code is available at https://github.com/psaltaath/MoSAIC-ReID
[106] Scale-invariant and View-relational Representation Learning for Full Surround Monocular Depth
Kyumin Hwang,Wonhyeok Choi,Kiljoon Han,Wonjoon Choi,Minwoo Choi,Yongcheon Na,Minwoo Park,Sunghoon Im
Main category: cs.CV
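TL;DR: Proposes a novel knowledge distillation strategy that transfers the depth knowledge of a foundation model to a lightweight full surround monocular depth estimation network, using a hybrid regression framework and cross-view relational modeling to achieve efficient, scale-consistent depth estimation.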
Details
Motivation: Applying existing foundation models to full surround monocular depth estimation incurs high computational cost and makes metric-scale depth hard to estimate, limiting real-time use and accuracy.
Method: Cross-interaction knowledge distillation and view-relational knowledge distillation are combined with a depth binning module in a hybrid regression framework: scale-invariant probability distributions are distilled from the foundation model while the student network is guided to learn metric-scale depth bin centers.
Result: Experiments on DDAD and nuScenes validate the method, which improves the performance-efficiency trade-off over conventional supervised and existing distillation methods and meets real-time requirements.
Conclusion: The method delivers accurate and efficient full surround monocular depth estimation and advances the deployment of foundation models in practical scenarios.
Abstract: Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework combining the knowledge distillation scheme--traditionally used in classification--with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth bin probabilities of a foundation model into the student network while guiding it to infer metric-scale depth bin centers from ground-truth depth. Furthermore, we propose view-relational knowledge distillation, which encodes structural relationships among adjacent camera views and transfers them to enhance cross-view depth consistency. Experiments on DDAD and nuScenes demonstrate the effectiveness of our method compared to conventional supervised methods and existing knowledge distillation approaches. Moreover, our method achieves a favorable trade-off between performance and efficiency, meeting real-time requirements.
[107] SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images
Kaiyu Li,Shengqi Zhang,Yupeng Deng,Zhi Wang,Deyu Meng,Xiangyong Cao
Main category: cs.CV
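TL;DR: This paper presents a preliminary exploration of applying SAM 3, without any training, to open-vocabulary semantic segmentation (OVSS) in remote sensing, proposing a mask fusion strategy and presence-score-based category filtering to improve segmentation accuracy in scenes with dense small targets.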
Details
Motivation: Existing CLIP-based OVSS methods suffer from imprecise localization or complex pipelines in remote sensing scenes and struggle with dense, diverse small targets, so a simpler and more efficient approach is needed to improve open-vocabulary segmentation.
Method: Building on SAM 3's unified promptable segmentation framework, the method fuses masks from the semantic head and the instance head and uses the presence head's score to filter out absent categories, improving land-cover completeness and reducing false detections.
Result: The method is validated on multiple remote sensing datasets; despite using no training, it achieves competitive results, particularly in reducing false positives.
Conclusion: SAM 3 shows great potential for remote sensing OVSS; a simple adaptation yields good performance and offers a viable direction for future work.
Abstract: Most existing methods for training-free Open-Vocabulary Semantic Segmentation (OVSS) are based on CLIP. While these approaches have made progress, they often face challenges in precise localization or require complex pipelines to combine separate modules, especially in remote sensing scenarios where numerous dense and small targets are present. Recently, Segment Anything Model 3 (SAM 3) was proposed, unifying segmentation and recognition in a promptable framework. In this paper, we present a preliminary exploration of applying SAM 3 to the remote sensing OVSS task without any training. First, we implement a mask fusion strategy that combines the outputs from SAM 3's semantic segmentation head and the Transformer decoder (instance head). This allows us to leverage the strengths of both heads for better land coverage. Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by the vast vocabulary sizes and patch-level processing in geospatial scenes. We evaluate our method on extensive remote sensing datasets. Experiments show that this simple adaptation achieves promising performance, demonstrating the potential of SAM 3 for remote sensing OVSS. Our code is released at https://github.com/earth-insights/SegEarth-OV-3.
[108] Mitigating Individual Skin Tone Bias in Skin Lesion Classification through Distribution-Aware Reweighting
Kuniko Paxton,Zeinab Dehghani,Koorosh Aslansefat,Dhavalkumar Thakker,Yiannis Papadopoulos
Main category: cs.CV
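TL;DR: This study proposes a distribution-based framework for evaluating and mitigating individual fairness bias in skin lesion classification. It treats skin tone as a continuous attribute modeled with kernel density estimation and combines distance metrics with a reweighted loss function, improving fair representation of minority skin tones.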
Details
Motivation: Fairness research in machine learning mostly relies on coarse group categories and overlooks individual differences, which can mask biases within subgroups and, in medical imaging, affects performance across diverse skin tones.
Method: Skin tone is treated as a continuous attribute whose distribution is modeled with kernel density estimation (KDE); twelve statistical distance metrics are compared, and a distance-based reweighting (DRW) loss function is proposed to correct the underrepresentation of minority skin tone samples.
Result: Experiments show that categorical reweighting fails to capture individual-level disparities, whereas distribution-based reweighting (particularly with Fidelity Similarity, Wasserstein Distance, Hellinger Metric, and Harmonic Mean Similarity) yields better fairness and performance on both CNN and Transformer models.
Conclusion: The method provides a reliable technical path toward individual fairness in dermatological AI systems and has broad implications for fairness research on other sensitive continuous attributes in medical image analysis.
Abstract: Skin color has historically been a focal point of discrimination, yet fairness research in machine learning for medical imaging often relies on coarse subgroup categories, overlooking individual-level variations. Such group-based approaches risk obscuring biases faced by outliers within subgroups. This study introduces a distribution-based framework for evaluating and mitigating individual fairness in skin lesion classification. We treat skin tone as a continuous attribute rather than a categorical label, and employ kernel density estimation (KDE) to model its distribution. We further compare twelve statistical distance metrics to quantify disparities between skin tone distributions and propose a distance-based reweighting (DRW) loss function to correct underrepresentation in minority tones. Experiments across CNN and Transformer models demonstrate: (i) the limitations of categorical reweighting in capturing individual-level disparities, and (ii) the superior performance of distribution-based reweighting, particularly with Fidelity Similarity (FS), Wasserstein Distance (WD), Hellinger Metric (HM), and Harmonic Mean Similarity (HS). These findings establish a robust methodology for advancing fairness at individual level in dermatological AI systems, and highlight broader implications for sensitive continuous attributes in medical image analysis.
[109] A Scalable Pipeline Combining Procedural 3D Graphics and Guided Diffusion for Photorealistic Synthetic Training Data Generation in White Button Mushroom Segmentation
Artúr I. Károly,Péter Galambos
Main category: cs.CV
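TL;DR: Proposes an automated pipeline that combines Blender 3D rendering with a constrained diffusion model to generate high-quality, annotated synthetic mushroom images, achieving state-of-the-art segmentation performance in real scenes (F1 = 0.859) when trained on synthetic data only.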
Details
Motivation: Industrial mushroom cultivation depends on computer vision for monitoring and automated harvesting, but annotating real data is costly, and existing synthetic data is not realistic enough to generalize to real applications.
Method: 3D scenes are rendered in Blender and combined with a constrained diffusion model to generate photorealistic synthetic mushroom images, retaining full control over scene configuration and annotations while improving realism.
Result: Two synthetic datasets are released, each containing 6,000 images depicting over 250k mushroom instances. Mask R-CNN models trained on them and evaluated zero-shot reach state-of-the-art segmentation on two independent real-world datasets (F1 = 0.859 on M18K).
Conclusion: The method shows that high-accuracy mushroom instance segmentation is possible without real training data, and the framework can be extended to other mushroom species or to agricultural domains such as fruit and leaf detection.
Abstract: Industrial mushroom cultivation increasingly relies on computer vision for monitoring and automated harvesting. However, developing accurate detection and segmentation models requires large, precisely annotated datasets that are costly to produce. Synthetic data provides a scalable alternative, yet often lacks sufficient realism to generalize to real-world scenarios. This paper presents a novel workflow that integrates 3D rendering in Blender with a constrained diffusion model to automatically generate high-quality annotated, photorealistic synthetic images of Agaricus Bisporus mushrooms. This approach preserves full control over 3D scene configuration and annotations while achieving photorealism without the need for specialized computer graphics expertise. We release two synthetic datasets (each containing 6,000 images depicting over 250k mushroom instances) and evaluate Mask R-CNN models trained on them in a zero-shot setting. When tested on two independent real-world datasets (including a newly collected benchmark), our method achieves state-of-the-art segmentation performance (F1 = 0.859 on M18K), despite using only synthetic training data. Although the approach is demonstrated on Agaricus Bisporus mushrooms, the proposed pipeline can be readily adapted to other mushroom species or to other agricultural domains, such as fruit and leaf detection.
[110] Skewness-Guided Pruning of Multimodal Swin Transformers for Federated Skin Lesion Classification on Edge Devices
Kuniko Paxton,Koorosh Aslansefat,Dhavalkumar Thakker,Yiannis Papadopoulos
Main category: cs.CV
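TL;DR: Proposes a skewness-guided pruning method for compressing a multimodal Swin Transformer in a federated learning setting, reducing model size by about 36% while maintaining accuracy.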
Details
Motivation: High-performance medical imaging models are hard to deploy on edge devices, and strict privacy requirements for medical data call for federated learning rather than centralized data processing.
Method: The multimodal Swin Transformer is selectively pruned based on the statistical skewness of the output distributions of its Multi-Head Self-Attention and Multi-Layer Perceptron layers.
Result: Validated in a horizontal federated learning environment, the method achieves about 36% model compression with no loss in accuracy.
Conclusion: The method achieves efficient model compression while preserving privacy and is suited to multimodal medical AI systems on edge devices.
Abstract: In recent years, high-performance computer vision models have achieved remarkable success in medical imaging, with some skin lesion classification systems even surpassing dermatology specialists in diagnostic accuracy. However, such models are computationally intensive and large in size, making them unsuitable for deployment on edge devices. In addition, strict privacy constraints hinder centralized data management, motivating the adoption of Federated Learning (FL). To address these challenges, this study proposes a skewness-guided pruning method that selectively prunes the Multi-Head Self-Attention and Multi-Layer Perceptron layers of a multimodal Swin Transformer based on the statistical skewness of their output distributions. The proposed method was validated in a horizontal FL environment and shown to maintain performance while substantially reducing model complexity. Experiments on the compact Swin Transformer demonstrate approximately 36% model size reduction with no loss in accuracy. These findings highlight the feasibility of achieving efficient model compression and privacy-preserving distributed learning for multimodal medical AI on edge devices.
[111] Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance
Ruihang Chu,Yefei He,Zhekai Chen,Shiwei Zhang,Xiaogang Xu,Bin Xia,Dingdong Wang,Hongwei Yi,Xihui Liu,Hengshuang Zhao,Yu Liu,Yingya Zhang,Yujiu Yang
Main category: cs.CV
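TL;DR: Wan-Move is a simple and scalable framework that represents object motion as dense point trajectories and projects them into latent space, giving precise motion control over video generation models without modifying the architecture or using auxiliary motion encoders.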
Details
Motivation: Existing motion-controllable video generation methods typically offer coarse control granularity and limited scalability, leaving output quality insufficient for practical use. Wan-Move aims to remove these limitations and improve the precision and practicality of motion control.
Method: Object motion is represented as dense point trajectories, which are projected into latent space; first-frame features are propagated along each trajectory to produce an aligned spatiotemporal feature map that serves as the updated latent condition and is fed into an off-the-shelf image-to-video model as motion guidance.
Result: Without changing the model structure, Wan-Move generates high-quality 5-second, 480p videos with motion controllability comparable to Kling 1.5 Pro's commercial Motion Brush, and shows superior motion quality on the self-built MoveBench benchmark and a public dataset.
Conclusion: Wan-Move achieves precise, high-quality motion control with good scalability and practicality, and its released code, models, and benchmark data support further research on motion-controllable video generation.
Abstract: We present Wan-Move, a simple and scalable framework that brings motion control to video generative models. Existing motion-controllable methods typically suffer from coarse control granularity and limited scalability, leaving their outputs insufficient for practical use. We narrow this gap by achieving precise and high-quality motion control. Our core idea is to directly make the original condition features motion-aware for guiding video synthesis. To this end, we first represent object motions with dense point trajectories, allowing fine-grained control over the scene. We then project these trajectories into latent space and propagate the first frame's features along each trajectory, producing an aligned spatiotemporal feature map that tells how each scene element should move. This feature map serves as the updated latent condition, which is naturally integrated into the off-the-shelf image-to-video model, e.g., Wan-I2V-14B, as motion guidance without any architecture change. It removes the need for auxiliary motion encoders and makes fine-tuning base models easily scalable. Through scaled training, Wan-Move generates 5-second, 480p videos whose motion controllability rivals Kling 1.5 Pro's commercial Motion Brush, as indicated by user studies. To support comprehensive evaluation, we further design MoveBench, a rigorously curated benchmark featuring diverse content categories and hybrid-verified annotations. It is distinguished by larger data volume, longer video durations, and high-quality motion annotations.
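The trajectory-guided propagation step described above (copying first-frame latent features along each point trajectory to build a spatiotemporal condition map) might look roughly like the NumPy sketch below. It is an illustration under stated assumptions, not the released Wan-Move code: the function name, the nearest-cell rounding, the zero fill for cells no trajectory reaches, and the optional visibility mask are all hypothetical choices.

```python
import numpy as np

def propagate_first_frame_features(first_feat, tracks, visible=None):
    """Build a (T, C, H, W) condition map from first-frame features.

    first_feat: (C, H, W) latent features of the first frame.
    tracks:     (N, T, 2) trajectories as (x, y) in latent-grid coordinates.
    visible:    optional (N, T) boolean mask; defaults to all visible.
    """
    C, H, W = first_feat.shape
    N, T, _ = tracks.shape
    if visible is None:
        visible = np.ones((N, T), dtype=bool)

    cond = np.zeros((T, C, H, W), dtype=first_feat.dtype)
    cond[0] = first_feat  # frame 0 keeps the full first-frame features

    xy = np.rint(tracks).astype(int)  # snap to the nearest latent cell
    x0, y0 = xy[:, 0, 0], xy[:, 0, 1]
    start_ok = (x0 >= 0) & (x0 < W) & (y0 >= 0) & (y0 < H)

    for t in range(1, T):
        xt, yt = xy[:, t, 0], xy[:, t, 1]
        ok = start_ok & visible[:, t] & (xt >= 0) & (xt < W) & (yt >= 0) & (yt < H)
        # Copy each trajectory's starting feature to its location at time t.
        cond[t][:, yt[ok], xt[ok]] = first_feat[:, y0[ok], x0[ok]]
    return cond
```

When several trajectories land on the same cell, the last one written wins in this sketch; a real implementation would need an explicit rule for such collisions.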
Extensive experiments on MoveBench and the public dataset consistently show Wan-Move's superior motion quality. Code, models, and benchmark data are made publicly available.
[112] Refining Visual Artifacts in Diffusion Models via Explainable AI-based Flaw Activation Maps
Seoyeon Lee,Gwangyeol Yu,Chaewon Kim,Jonghyuk Park
Main category: cs.CV
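TL;DR: Proposes a self-refining diffusion framework based on explainable artificial intelligence (XAI) that identifies and refines artifacts and unrealistic regions in generated images, significantly improving image synthesis quality across multiple tasks.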
Details
Motivation: Despite the success of diffusion models in image synthesis, generated results still contain artifacts and unrealistic regions that degrade quality, motivating a method that actively detects and refines these flaws.
Method: The self-refining diffusion framework uses XAI techniques to build a flaw highlighter that produces flaw activation maps (FAMs); noise is amplified in flawed regions during the forward process, and the reverse process focuses on refining those regions, improving reconstruction quality.
Result: The approach improves FID by up to 27.3% across several diffusion models and is robust and effective on image generation, text-to-image generation, and inpainting.
Conclusion: The framework shows that explainable AI can go beyond model interpretation to actively improve image generation quality; it is broadly applicable and advances the field of image synthesis.
Abstract: Diffusion models have achieved remarkable success in image synthesis. However, addressing artifacts and unrealistic regions remains a critical challenge. We propose self-refining diffusion, a novel framework that enhances image generation quality by detecting these flaws. The framework employs an explainable artificial intelligence (XAI)-based flaw highlighter to produce flaw activation maps (FAMs) that identify artifacts and unrealistic regions. These FAMs improve reconstruction quality by amplifying noise in flawed regions during the forward process and by focusing on these regions during the reverse process. The proposed approach achieves up to a 27.3% improvement in Fréchet inception distance across various diffusion-based models, demonstrating consistently strong performance on diverse datasets. It also shows robust effectiveness across different tasks, including image generation, text-to-image generation, and inpainting. These results demonstrate that explainable AI techniques can extend beyond interpretability to actively contribute to image refinement. The proposed framework offers a versatile and effective approach applicable to various diffusion models and tasks, significantly advancing the field of image synthesis.
[113] LoFA: Learning to Predict Personalized Priors for Fast Adaptation of Visual Generative Models
Yiming Hao,Mutian Xu,Chongjie Ye,Jie Qin,Shunlin Lu,Yipeng Qin,Xiaoguang Han
Main category: cs.CV
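TL;DR: Proposes LoFA, a new framework that quickly predicts personalized priors, enabling efficient personalization of visual generative models without large amounts of data or lengthy optimization.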
Details
Motivation: Existing personalization methods such as LoRA require task-specific data and lengthy optimization, while existing hypernetwork approaches struggle to map fine-grained user prompts to complex LoRA distributions, limiting their practicality.
Method: Building on the finding that structured distribution patterns exist in LoRA parameters, a two-stage hypernetwork first predicts relative distribution patterns that capture key adaptation regions and then uses them to guide the final LoRA weight prediction.
Result: Experiments show that the method consistently predicts high-quality personalized priors within seconds across multiple tasks and user prompts, even outperforming conventional LoRA that requires hours of processing.
Conclusion: LoFA provides an efficient, general solution that markedly improves the speed and practicality of personalizing visual generative models.
Abstract: Personalizing visual generative models to meet specific user needs has gained increasing attention, yet current methods like Low-Rank Adaptation (LoRA) remain impractical due to their demand for task-specific data and lengthy optimization. While a few hypernetwork-based approaches attempt to predict adaptation weights directly, they struggle to map fine-grained user prompts to complex LoRA distributions, limiting their practical applicability. To bridge this gap, we propose LoFA, a general framework that efficiently predicts personalized priors for fast model adaptation. We first identify a key property of LoRA: structured distribution patterns emerge in the relative changes between LoRA and base model parameters. Building on this, we design a two-stage hypernetwork: first predicting relative distribution patterns that capture key adaptation regions, then using these to guide final LoRA weight prediction. Extensive experiments demonstrate that our method consistently predicts high-quality personalized priors within seconds, across multiple tasks and user prompts, even outperforming conventional LoRA that requires hours of processing. Project page: https://jaeger416.github.io/lofa/.
[114] MatteViT: High-Frequency-Aware Document Shadow Removal with Shadow Matte Guidance
Chaewon Kim,Seoyeon Lee,Jonghyuk Park
Main category: cs.CV
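TL;DR: This paper proposes MatteViT, a new document shadow removal framework that combines spatial and frequency-domain information. Using a high-frequency amplification module and a continuous luminance-based shadow matte, it removes shadows while preserving text details and achieves state-of-the-art performance on public datasets.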
Details
Motivation: Shadows on documents obscure or distort fine structures such as text edges, reducing the clarity of digitized documents and hurting downstream recognition, so high-frequency details must be better preserved during shadow removal.
Method: MatteViT (1) introduces a lightweight high-frequency amplification module (HFAM) that decomposes and adaptively amplifies high-frequency components, and (2) uses a continuous luminance-based shadow matte, produced with a custom-built dataset and a matte generator, to provide precise spatial guidance from an early stage.
Result: Experiments on the RDD and Kligler datasets show state-of-the-art shadow removal and high-frequency detail preservation, with clearly improved recognition accuracy in downstream tasks such as OCR.
Conclusion: By combining frequency-domain and spatial guidance, MatteViT addresses detail loss in document shadow removal and is practical and robust for real-world document image restoration.
Abstract: Document shadow removal is essential for enhancing the clarity of digitized documents. Preserving high-frequency details (e.g., text edges and lines) is critical in this process because shadows often obscure or distort fine structures. This paper proposes a matte vision transformer (MatteViT), a novel shadow removal framework that applies spatial and frequency-domain information to eliminate shadows while preserving fine-grained structural details. To effectively retain these details, we employ two preservation strategies. First, our method introduces a lightweight high-frequency amplification module (HFAM) that decomposes and adaptively amplifies high-frequency components. Second, we present a continuous luminance-based shadow matte, generated using a custom-built matte dataset and shadow matte generator, which provides precise spatial guidance from the earliest processing stage. These strategies enable the model to accurately identify fine-grained regions and restore them with high fidelity. Extensive experiments on public benchmarks (RDD and Kligler) demonstrate that MatteViT achieves state-of-the-art performance, providing a robust and practical solution for real-world document shadow removal. Furthermore, the proposed method better preserves text-level details in downstream tasks, such as optical character recognition, improving recognition performance over prior methods.
[115] Training-Free Dual Hyperbolic Adapters for Better Cross-Modal Reasoning
Yi Zhang,Chun-Wun Cheng,Junyi He,Ke Yu,Yushun Tang,Carola-Bibiane Schönlieb,Zhihai He,Angelica I. Aviles-Rivero
Main category: cs.CV
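TL;DR: Proposes Training-free Dual Hyperbolic Adapters (T-DHA), which model the hierarchical semantic structure in vision-language models in hyperbolic space and significantly improve few-shot image recognition and domain generalization.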
Details
Motivation: Existing vision-language models degrade under domain shift, and fine-tuning requires substantial computation, so an efficient, training-free adaptation method is needed.
Method: The hierarchical tree structure of vision-language semantic relationships is modeled in hyperbolic space (the Poincaré ball model) and combined with negative learning, using hyperbolic adapters to strengthen representation and discrimination.
Result: Experiments on multiple datasets show that T-DHA significantly outperforms existing state-of-the-art methods on few-shot image recognition and domain generalization.
Conclusion: Hyperbolic space is better suited to modeling hierarchical semantic relationships in vision-language models, and T-DHA offers an efficient, low-dimensional, training-free adaptation scheme.
Abstract: Recent research in Vision-Language Models (VLMs) has significantly advanced our capabilities in cross-modal reasoning. However, existing methods suffer from performance degradation with domain changes or require substantial computational resources for fine-tuning in new domains. To address this issue, we develop a new adaptation method for large vision-language models, called Training-free Dual Hyperbolic Adapters (T-DHA). We characterize the vision-language relationship between semantic concepts, which typically has a hierarchical tree structure, in the hyperbolic space instead of the traditional Euclidean space. Hyperbolic spaces exhibit exponential volume growth with radius, unlike the polynomial growth in Euclidean space. We find that this unique property is particularly effective for embedding hierarchical data structures using the Poincaré ball model, achieving significantly improved representation and discrimination power. Coupled with negative learning, it provides more accurate and robust classifications with fewer feature dimensions. Our extensive experimental results on various datasets demonstrate that the T-DHA method significantly outperforms existing state-of-the-art methods in few-shot image recognition and domain generalization tasks.
[116] InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Hongyuan Tao,Bencheng Liao,Shaoyu Chen,Haoran Yin,Qian Zhang,Wenyu Liu,Xinggang Wang
Main category: cs.CV
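TL;DR: This paper proposes InfiniteVL, a linear-complexity vision-language model that combines sliding window attention with Gated DeltaNet. A three-stage training strategy delivers strong performance with little data, and the model shows superior inference speed and memory efficiency in long-sequence processing and real-time video understanding.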
Details
Motivation: Existing VLMs based on window or linear attention degrade when the sequence length exceeds the window size or underperform on information-intensive tasks such as OCR; these limitations need to be overcome.
Method: The InfiniteVL architecture combines sliding window attention (SWA) with Gated DeltaNet and is trained with a three-stage strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT.
Result: Using less than 2% of the data required by leading VLMs, InfiniteVL outperforms previous linear-complexity VLMs and matches mainstream Transformer-based VLMs, achieves over 3.6× inference speedup with constant latency and memory footprint, and sustains 24 FPS real-time processing on streaming video.
Conclusion: InfiniteVL overcomes the inherent drawbacks of window and linear attention, providing efficient, scalable multimodal modeling with long-term memory under constrained resources, suited to long-sequence and real-time applications.
Abstract: Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. For achieving competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
[117] Generation is Required for Data-Efficient Perception
Jack Brady,Bernhard Schölkopf,Thomas Kipf,Simon Buchholz,Wieland Brendel
Main category: cs.CV
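TL;DR: This paper examines whether generative and non-generative methods can achieve compositional generalization. It argues that generative methods, by constraining and inverting a decoder, satisfy the necessary inductive biases more effectively, which supports human-level visual perception.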
Details
Motivation: To explore whether generation is necessary for machines to reach human-level visual perception, and to compare generative and non-generative models on compositional generalization.
Method: Under a compositional data-generating process, formalizes the inductive biases required by generative (decoder-based) and non-generative (encoder-based) methods and analyzes their feasibility theoretically; decoder inversion is performed through gradient-based search or generative replay.
Result: Theory shows that the necessary inductive biases are hard to enforce on encoders through regularization or architectural constraints, whereas generative methods can enforce them straightforwardly. Experiments show that generative methods significantly improve compositional generalization without additional data.
Conclusion: Generation, through decoder inversion, effectively introduces the necessary inductive biases and is a key path toward compositional generalization and human-level visual perception.
Abstract: It has been hypothesized that human-level visual perception requires a generative approach in which internal representations result from inverting a decoder. Yet today's most successful vision models are non-generative, relying on an encoder that maps images to representations without decoder inversion. This raises the question of whether generation is, in fact, necessary for machines to achieve human-level visual perception. To address this, we study whether generative and non-generative methods can achieve compositional generalization, a hallmark of human perception. Under a compositional data generating process, we formalize the inductive biases required to guarantee compositional generalization in decoder-based (generative) and encoder-based (non-generative) methods. We then show theoretically that enforcing these inductive biases on encoders is generally infeasible using regularization or architectural constraints. In contrast, for generative methods, the inductive biases can be enforced straightforwardly, thereby enabling compositional generalization by constraining a decoder and inverting it. We highlight how this inversion can be performed efficiently, either online through gradient-based search or offline through generative replay. We examine the empirical implications of our theory by training a range of generative and non-generative methods on photorealistic image datasets. We find that, without the necessary inductive biases, non-generative methods often fail to generalize compositionally and require large-scale pretraining or added supervision to improve generalization. By comparison, generative methods yield significant improvements in compositional generalization, without requiring additional data, by leveraging suitable inductive biases on a decoder along with search and replay.
[118] Tri-Bench: Stress-Testing VLM Reliability on Spatial Reasoning under Camera Tilt and Object Interference
Amit Bendkhale
Main category: cs.CV
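TL;DR: Proposes Tri-Bench, a compact benchmark for evaluating the verifiability and controllability of vision-language models on planar triangle geometric reasoning. It reveals that existing VLMs perform poorly under camera pose changes and on geometric class recognition.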
Details
Motivation: Vision-language models are often unstable under realistic scene changes and lack reliable geometric reasoning, which undermines their trustworthy deployment in agentic AI.
Method: Builds the Tri-Bench benchmark with six planar-triangle tasks over binary and continuous targets, controlling camera pose (planar vs. tilted) and scene distractors (10 everyday objects). A single fixed prompt describes a surrounding square border as a frame of reference, which enables correct answers via homography and so tests verifiability.
Result: Four mainstream VLMs average about 69% accuracy against 3D ground truth and slightly higher (72%) against 2D projections. Accuracy on minority classes (equilateral, isosceles, and right-angled triangles) falls to about 0%. Camera tilt degrades performance by 4.1%, and object interference has no significant effect. The models ignore the frame-of-reference hint in the prompt and rely on 2D image cues.
Conclusion: Even with an explicit spatial reference in the prompt, current VLMs cannot reliably perform relative geometric reasoning. They depend mainly on image-plane features and lack an understanding of 3D geometric structure and semantic shape classes, which limits their reliability in applications that require precise control and verification.
Abstract: Verifiable geometric reasoning is a critical component for trustworthy and controllable agentic AI. Despite impressive capabilities, Vision-Language Models (VLMs) often fail under realistic scene changes. We present Tri-Bench, a compact benchmark of planar triangle problems that isolates relative geometric reasoning while stressing two deployment-critical factors: camera pose (planar vs. tilted) and scene context via object interference (10 everyday objects). To test verifiability and control, we evaluate four recent VLMs using a single, fixed prompt whose guardrail explicitly describes a surrounding square border, enabling correct answers via homography. We evaluate six simple tasks over binary and continuous targets, and observe that the overall accuracy with respect to 3D ground truth is modest, ~69% on average (best ~75%, worst ~64%). The same responses align even more closely with 2D projections in the image plane, where mean accuracy is ~72%. All four VLMs consistently fail, with accuracy falling to ~0%, on recognizing minority shape classes (equilateral, isosceles, right-angled triangles). Additionally, overall VLM accuracy degrades by ~4.1% under camera tilt. This demonstrates that models fail to correctly utilize the explicit frame-of-reference hint provided in the prompt and default to 2D image plane cues. Finally, we find that object interference has no significant effect on VLM accuracy.
[119] Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning
Jing Jie Tan,Anissa Mokraoui,Ban-Hoe Kwan,Danny Wee-Kiat Ng,Yan-Chai Hum
Main category: cs.CV
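TL;DR: This paper proposes SOLI, a lightweight image captioning method for low-resolution images in resource-constrained settings. It uses a Siamese network to optimize latent embeddings, improving efficiency and accuracy.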
Details
Motivation: Low-resolution images are challenging for image captioning, and existing approaches based on large models such as transformers are computationally expensive, making them hard to train and deploy in resource-constrained environments.
Method: Proposes SOLI, which uses a Siamese network architecture with a dual-pathway neural network structure to optimize the latent embeddings of low-resolution images, reducing computational overhead while preserving performance.
Result: SOLI reduces computational resource consumption while improving the accuracy and efficiency of low-resolution image captioning, making it suitable for resource-constrained scenarios.
Conclusion: SOLI offers an efficient, lightweight solution for low-resolution image captioning that balances model performance against computational cost.
Abstract: Image captioning is essential in many fields including assisting visually impaired individuals, improving content management systems, and enhancing human-computer interaction. However, a recent challenge in this domain is dealing with low-resolution images (LRI). While performance can be improved by using larger models like transformers for encoding, these models are typically heavyweight, demanding significant computational resources and memory, leading to challenges in retraining. To address this, the proposed SOLI (Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning) approach presents a solution specifically designed for lightweight, low-resolution image captioning. It employs a Siamese network architecture to optimize latent embeddings, enhancing the efficiency and accuracy of the image-to-text translation process. By focusing on a dual-pathway neural network structure, SOLI minimizes computational overhead without sacrificing performance, making it an ideal choice for training in resource-constrained scenarios.
[120] SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing
Aysim Toker,Andreea-Maria Oncescu,Roy Miles,Ismail Elezi,Jiankang Deng
Main category: cs.CV
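TL;DR: Proposes a vision-language model with a structured localization mechanism that reasons jointly over language and spatial information through dedicated control tokens. It significantly improves visual grounding in satellite imagery, with a 24.8% relative improvement over previous methods.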
Details
Motivation: Existing vision-language models have limited ability to localize objects precisely in satellite imagery and do not integrate spatial information effectively.
Method: Starting from a pretrained VLM, finetunes it on a diverse set of instruction-following tasks while interfacing a dedicated grounding module through specialized control tokens, enabling joint reasoning over language and spatial information.
Result: Achieves state-of-the-art performance on several remote sensing benchmarks, including a 24.8% relative improvement over previous methods on visual grounding.
Conclusion: Integrating structured spatial reasoning into VLMs significantly improves localization in complex satellite scenes and supports more reliable real-world satellite data analysis.
Abstract: Vision-language models (VLMs) are emerging as powerful generalist tools for remote sensing, capable of integrating information across diverse tasks and enabling flexible, instruction-based interactions via a chat interface. In this work, we enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism. Our approach involves finetuning a pretrained VLM on a diverse set of instruction-following tasks, while interfacing a dedicated grounding module through specialized control tokens for localization. This method facilitates joint reasoning over both language and spatial information, significantly enhancing the model's ability to precisely localize objects in complex satellite scenes. We evaluate our framework on several remote sensing benchmarks, consistently improving the state-of-the-art, including a 24.8% relative improvement over previous methods on visual grounding. Our results highlight the benefits of integrating structured spatial reasoning into VLMs, paving the way for more reliable real-world satellite data analysis.
[121] Accelerated Rotation-Invariant Convolution for UAV Image Segmentation
Manduhu Manduhu,Alexander Dow,Gerard Dooly,James Riordan
Main category: cs.CV
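TL;DR: This paper proposes a GPU-optimized rotation-invariant convolution framework. By eliminating the traditional im2col step and exploiting structured data sharing among rotated filters, it substantially reduces the computational redundancy and memory overhead of multi-orientation convolution while maintaining high accuracy.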
Details
Motivation: Conventional convolutional networks such as U-Net lack rotation invariance, which degrades segmentation of UAV aerial imagery, especially when targets have arbitrary orientations. Existing approaches to rotation invariance are computationally expensive and memory-intensive.
Method: Proposes a GPU-optimized rotation-invariant convolution framework that eliminates the im2col operation, exploits data sharing among symmetrically rotated filters to reduce memory access and computational redundancy, and generalizes to convolution with arbitrary rotation angles.
Result: Across benchmarks, achieves 20-55% faster training and 15-45% lower energy consumption than cuDNN. In the eight-orientation setting, it reaches up to 45% speedup and 41% energy savings on 256×256 inputs, and 32% speedup and 23% energy savings on 1024×1024 inputs. Integrated into U-Net, it improves segmentation accuracy by up to 6% over the baseline.
Conclusion: The method substantially improves speed and energy efficiency while maintaining accuracy comparable to existing rotation-invariant CNNs, providing an efficient and practical approach to rotation-invariant semantic segmentation.
Abstract: Rotation invariance is essential for precise, object-level segmentation in UAV aerial imagery, where targets can have arbitrary orientations and exhibit fine-scale details. Conventional segmentation architectures like U-Net rely on convolution operators that are not rotation-invariant, leading to degraded segmentation accuracy across varying viewpoints. Rotation invariance can be achieved by expanding the filter bank across multiple orientations; however, this will significantly increase computational cost and memory traffic. In this paper, we introduce a GPU-optimized rotation-invariant convolution framework that eliminates the traditional data-lowering (im2col) step required for matrix-multiplication-based convolution. By exploiting structured data sharing among symmetrically rotated filters, our method achieves multi-orientation convolution with greatly reduced memory traffic and computational redundancy. We further generalize the approach to accelerate convolution with arbitrary (non-symmetric) rotation angles. Across extensive benchmarks, the proposed convolution achieves 20--55% faster training and 15--45% lower energy consumption than cuDNN, while maintaining accuracy comparable to state-of-the-art rotation-invariant methods. In the eight-orientation setting, our approach achieves up to 45% speedup and 41% energy savings on 256×256 inputs, and 32% speedup and 23% lower energy usage on 1024×1024 inputs. Integrated into a U-Net segmentation model, the framework yields up to 6% improvement in accuracy over the non-rotation-aware baseline. These results demonstrate that the proposed method provides an effective and highly efficient alternative to existing rotation-invariant CNN frameworks.
[122] No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
Damiano Marsili,Georgia Gkioxari
Main category: cs.CV
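TL;DR: Proposes an annotation-free training framework for visual reasoning that uses AI-powered verifiers (an LLM and a VLM) with reinforcement learning and automated hard-negative mining to improve reasoning and visual grounding. It outperforms existing open-source and proprietary models on a range of spatial reasoning tasks.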
Details
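Motivation: Existing visual reasoning methods either depend on large amounts of annotated data or suffer from flawed logic and erroneous grounding, and struggle to combine precise object grounding with an understanding of complex spatial relationships. A framework is needed that requires no human annotation and improves both reasoning and grounding.
Method: Proposes a dual-verifier framework: an LLM verifier refines language reasoning through reinforcement learning, and a VLM verifier strengthens visual grounding through automated hard-negative mining, with no ground-truth labels. The design combines the query-decomposition ability of language reasoning models with vision specialist models improved by performant VLM critics.
Result: Across diverse spatial reasoning tasks, the method surpasses existing open-source and proprietary models in visual reasoning, and with the improved visual grounding model it further outperforms recent text-only visual reasoning methods, all without human annotation.
Conclusion: The work demonstrates that annotation-free training of visual reasoners is feasible. Combining LLM and VLM verifiers improves both language reasoning and visual grounding in complex spatial reasoning, suggesting a new direction for vision-language learning.
Abstract: Visual reasoning is challenging, requiring both precise object grounding and understanding complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches, which use pre-trained models and avoid training, but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong vision specialist models improved via performant VLM critics. We evaluate our approach across diverse spatial reasoning tasks, and show that our method improves visual reasoning and surpasses open-source and proprietary models, while with our improved visual grounding model we further outperform recent text-only visual reasoning methods. Project webpage: https://glab-caltech.github.io/valor/
[123] UniLayDiff: A Unified Diffusion Transformer for Content-Aware Layout Generation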
Zeyang Liu,Le Wang,Sanping Zhou,Yuxuan Wu,Xiaolong Sun,Gang Hua,Haoxiang Li
Main category: cs.CV
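TL;DR: This paper proposes UniLayDiff, the first unified diffusion Transformer model for a range of content-aware layout generation tasks. By treating layout constraints as a distinct modality and combining a multi-modal diffusion framework with LoRA fine-tuning, it supports diverse conditional generation end to end and achieves state-of-the-art performance.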
Details
Motivation: Existing methods cannot handle the diverse layout generation sub-tasks in a unified way. They typically require separate models for different conditions or support only a subset of tasks, and so fall short of a truly unified, efficient solution.
Method: Proposes UniLayDiff, a unified framework based on a Multi-Modal Diffusion Transformer that takes layout constraints as a distinct input modality. Relation constraints are integrated through pretraining followed by LoRA fine-tuning, enabling end-to-end joint modeling.
Result: Achieves state-of-the-art performance on tasks ranging from unconditional to various conditional layout generation, and is the first to model the full range of content-aware layout tasks in a unified way.
Conclusion: UniLayDiff is the first single model to handle all types of content-aware layout generation tasks. It integrates the image, layout elements, and diverse constraints, improving generation quality and generalization while keeping a unified architecture.
Abstract: Content-aware layout generation is a critical task in graphic design automation, focused on creating visually appealing arrangements of elements that seamlessly blend with a given background image. The variety of real-world applications makes it highly challenging to develop a single model capable of unifying the diverse range of input-constrained generation sub-tasks, such as those conditioned by element types, sizes, or their relationships. Current methods either address only a subset of these tasks or necessitate separate model parameters for different conditions, failing to offer a truly unified solution. In this paper, we propose UniLayDiff, a Unified Diffusion Transformer that, for the first time, addresses various content-aware layout generation tasks with a single, end-to-end trainable model. Specifically, we treat layout constraints as a distinct modality and employ a Multi-Modal Diffusion Transformer framework to capture the complex interplay between the background image, layout elements, and diverse constraints. Moreover, we integrate relation constraints through fine-tuning the model with LoRA after pretraining the model on other tasks. Such a schema not only achieves unified conditional generation but also enhances overall layout quality. Extensive experiments demonstrate that UniLayDiff achieves state-of-the-art performance across tasks ranging from unconditional to various conditional generation and, to the best of our knowledge, is the first model to unify the full range of content-aware layout generation tasks.
[124] Self-Evolving 3D Scene Generation from a Single Image
Kaizhi Zheng,Yue Fan,Jing Gu,Zishuo Xu,Xuehai He,Xin Eric Wang
Main category: cs.CV
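TL;DR: EvoScene is a training-free, self-evolving framework that combines the strengths of 3D generation models and video generation models to progressively reconstruct complete 3D scenes from a single image. It outperforms existing methods in geometric stability, texture consistency, and unseen-region completion.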
Details
Motivation: Existing single-image-to-3D methods are trained in an object-centric way, so they generalize poorly to complex, large-scale scenes and are limited in structural and texture fidelity.
Method: Proposes the EvoScene framework with three iterative stages: Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation. It alternates between the 2D and 3D domains to refine structure and appearance.
Result: Experiments on diverse scenes show that EvoScene outperforms strong baselines in geometric stability, view-consistent textures, and unseen-region completion, and produces 3D meshes that are ready for practical use.
Conclusion: By combining the strengths of multiple models, EvoScene achieves high-quality, texture-consistent, and structurally complete single-image-to-3D scene reconstruction with good practicality and generalization.
Abstract: Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is combining the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages--Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation--EvoScene alternates between 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.
[125] LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception
Simon de Moreau,Andrei Bursuc,Hafid El-Idrissi,Fabien Moutarde
Main category: cs.CV
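TL;DR: Proposes LiDAS, a closed-loop active illumination system that dynamically optimizes the light distribution to improve nighttime visual perception. It significantly improves detection and segmentation accuracy without retraining and reduces energy use.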
Details
Motivation: Nighttime environments are challenging for camera-based perception, and existing methods rely passively on scene lighting, so they cannot cope effectively with insufficient illumination.
Method: Combines off-the-shelf visual perception models with high-definition headlights to build LiDAS, a closed-loop active illumination system. It is trained on synthetic data and dynamically predicts an optimal illumination field, concentrating light on object regions instead of lighting the scene uniformly.
Result: Deployed zero-shot in real-world driving scenarios, LiDAS improves mAP50 by +18.7% and mIoU by +5.0% over standard low-beam at equal power, and can maintain performance while reducing energy use by 40%.
Conclusion: LiDAS turns ordinary headlights into active vision actuators, enabling low-cost, robust nighttime perception, and it complements domain-generalization methods to further strengthen robustness.
Abstract: Nighttime environments pose significant challenges for camera-based perception, as existing methods passively rely on the scene lighting. We introduce Lighting-driven Dynamic Active Sensing (LiDAS), a closed-loop active illumination system that combines off-the-shelf visual perception models with high-definition headlights. Rather than uniformly brightening the scene, LiDAS dynamically predicts an optimal illumination field that maximizes downstream perception performance, i.e., decreasing light on empty areas to reallocate it to object regions. LiDAS enables zero-shot nighttime generalization of daytime-trained models through adaptive illumination control. Trained on synthetic data and deployed zero-shot in real-world closed-loop driving scenarios, LiDAS achieves +18.7% mAP50 and +5.0% mIoU over standard low-beam at equal power. It maintains performance while reducing energy use by 40%. LiDAS complements domain-generalization methods, further strengthening robustness without retraining. By turning readily available headlights into active vision actuators, LiDAS offers a cost-effective solution to robust nighttime perception.
[126] Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration
Jin Hyeon Kim,Paul Hyunbin Cho,Claire Kim,Jaewon Min,Jaeeun Lee,Jihye Park,Yeji Choi,Seungryong Kim
Main category: cs.CV
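TL;DR: This paper proposes UniT, a unified text restoration framework that combines a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM). By iteratively using explicit textual guidance and intermediate OCR predictions during denoising, it restores degraded text, suppresses text hallucinations, and achieves state-of-the-art performance on the SA-Text and Real-Text benchmarks.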
Details
Motivation: Existing diffusion models tend to produce text hallucinations in text-aware image restoration because they lack explicit linguistic knowledge, which makes it hard to recover degraded text accurately.
Method: UniT combines a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) iteratively. The VLM extracts textual content from the degraded image to provide linguistic guidance; the TSM, operating on diffusion features, produces intermediate OCR results at each denoising step so that the VLM can progressively refine its guidance; and the DiT uses these multimodal cues for high-quality text restoration.
Result: On the SA-Text and Real-Text benchmarks, UniT substantially reduces text hallucinations and achieves state-of-the-art end-to-end F1 scores, confirming its effectiveness for high-fidelity text restoration.
Conclusion: By fusing vision, language, and text recognition modules, UniT introduces explicit, iteratively refined text supervision into the diffusion model and mitigates hallucination in text-aware image restoration.
Abstract: Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploits these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance on the TAIR task.
[127] Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
Chuhan Zhang,Guillaume Le Moing,Skanda Koppula,Ignacio Rocco,Liliane Momeni,Junyu Xie,Shuyang Sun,Rahul Sukthankar,Joëlle K Barral,Raia Hadsell,Zoubin Ghahramani,Andrew Zisserman,Junlin Zhang,Mehdi SM Sajjadi
Main category: cs.CV
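TL;DR: This paper proposes D4RT, a feedforward model built on a unified Transformer architecture that efficiently and jointly infers depth, spatio-temporal correspondence, and camera parameters of dynamic scenes from a single video. Its core innovation is a new querying mechanism that avoids the high computational cost of dense per-frame decoding and enables lightweight, scalable 4D reconstruction through flexible 3D point queries.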
Details
Motivation: Reconstructing the complex geometry and motion of dynamic scenes remains challenging in computer vision. Existing methods often rely on multiple task-specific decoders or computationally expensive dense decoding, which limits efficiency and scalability.
Method: Proposes the D4RT model, which uses a unified Transformer architecture and a novel querying mechanism that lets the model independently and flexibly probe the 3D position of any point in space and time, jointly estimating depth, correspondence, and camera parameters.
Result: D4RT achieves state-of-the-art performance across a range of 4D reconstruction tasks, outperforming previous methods, with efficient training and inference.
Conclusion: Through a simple and efficient architecture, D4RT maintains high performance while reducing computational cost, offering a scalable new approach to 4D reconstruction of dynamic scenes.
Abstract: Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.
[128] Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment
Youming Deng,Songyou Peng,Junyi Zhang,Kathryn Heal,Tiancheng Sun,John Flynn,Steve Marschner,Lucy Chai
Main category: cs.CV
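TL;DR: This paper proposes Selfi, a self-improving 3D reconstruction pipeline based on feature alignment that turns a VGGT backbone into a high-fidelity 3D reconstruction engine, achieving state-of-the-art performance in novel view synthesis and camera pose estimation.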
Details
Motivation: Methods built on vision foundation models such as VGGT learn 3D knowledge implicitly from uncalibrated images, but their features lack explicit multi-view geometric consistency, which limits performance in novel view synthesis and pose estimation.
Method: Proposes the Selfi framework, which uses VGGT's own outputs as pseudo-ground-truth and trains a lightweight feature adapter with a reprojection-based consistency loss. The adapter distills VGGT outputs into a new geometrically aligned feature space that captures spatial proximity in 3D.
Result: Achieves state-of-the-art performance on novel view synthesis (NVS) and camera pose estimation, confirming that feature alignment benefits downstream 3D reasoning tasks.
Conclusion: Feature alignment is a key step for improving vision foundation models on 3D reconstruction, and Selfi provides an efficient, self-improving approach that requires no additional annotation.
Abstract: Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach -- 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically-aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.
[129] Astra: General Interactive World Model with Autoregressive Denoising
Yixuan Zhu,Jiaqi Feng,Wenzhao Zheng,Yuan Gao,Xin Tao,Pengfei Wan,Jie Zhou,Jiwen Lu
Main category: cs.CV
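TL;DR: Astra is an interactive, general-purpose world model that uses a diffusion transformer for accurate long-horizon future prediction. It supports multiple action modalities and shows higher fidelity, temporal consistency, and action alignment on real-world tasks.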