cs.CL

[1] ASR Under the Stethoscope: Evaluating Biases in Clinical Speech Recognition across Indian Languages

Subham Kumar,Prakrithi Shivaprakash,Abhishek Manoharan,Astut Kurariya,Diptadhi Mukherjee,Lekhansh Shukla,Animesh Mukherjee,Prabhat Chand,Pratima Murthy

Main category: cs.CL

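TL;DR: This study presents the first systematic evaluation of automatic speech recognition (ASR) in multilingual Indian clinical settings, covering Kannada, Hindi, and Indian English. It reveals substantial performance differences across models by language, speaker role, and gender, and stresses that medical ASR systems need to be more culturally and demographically inclusive.

Motivation: To assess how reliable current ASR systems are for clinical transcription in India's multilingual, demographically diverse context, with particular attention to fairness between patients and clinicians, across genders, and across intersectional groups.

Method: Using real-world clinical interview data, the study compares leading ASR models, including Indic Whisper, Whisper, Sarvam, and Google Speech-to-Text, evaluating transcription accuracy across languages, speaker roles, and demographic subgroups and analyzing error patterns.

Result: Performance varies widely across languages; some models perform well on Indian English but fail on code-mixed or vernacular speech. Systematic performance gaps tied to speaker role (patient vs. clinician) and gender are also observed.

Conclusion: Existing ASR systems show significant fairness and accuracy problems in India's diverse healthcare settings; more culturally and demographically representative ASR technology is needed for fair and reliable clinical use.

Abstract: Automatic Speech Recognition (ASR) is increasingly used to document clinical encounters, yet its reliability in multilingual and demographically diverse Indian healthcare contexts remains largely unknown. In this study, we conduct the first systematic audit of ASR performance on real-world clinical interview data spanning Kannada, Hindi, and Indian English, comparing leading models including Indic Whisper, Whisper, Sarvam, Google Speech-to-Text, Gemma3n, Omnilingual, Vaani, and Gemini. We evaluate transcription accuracy across languages, speakers, and demographic subgroups, with a particular focus on error patterns affecting patients vs. clinicians and gender-based or intersectional disparities. Our results reveal substantial variability across models and languages, with some systems performing competitively on Indian English but failing on code-mixed or vernacular speech. We also uncover systematic performance gaps tied to speaker role and gender, raising concerns about equitable deployment in clinical settings. By providing a comprehensive multilingual benchmark and fairness analysis, our work highlights the need for culturally and demographically inclusive ASR development for the healthcare ecosystem in India.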
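The subgroup comparison described in this abstract rests on an error metric computed separately per group. The abstract does not name the metric; the sketch below assumes word error rate (WER), the usual ASR metric, and uses hypothetical group labels and toy transcripts rather than anything from the paper.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis) strings.

    Returns corpus-level WER per group: total edits / total reference words.
    """
    edits = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        edits[group] += edit_distance(ref_tokens, hyp_tokens)
        words[group] += len(ref_tokens)
    return {g: edits[g] / words[g] for g in words if words[g] > 0}


# Toy data; the group names are illustrative only.
samples = [
    ("clinician", "take one tablet daily", "take one tablet daily"),
    ("patient", "i feel pain in my chest", "i feel pen in chest"),
]
group_wer = wer_by_group(samples)
print(group_wer)
```

A gap between the per-group values is the kind of disparity the audit looks for; a real analysis would also need text normalization and confidence intervals.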

[2] Benchmarking Automatic Speech Recognition Models for African Languages

Alvin Nahabwe,Sulaiman Kagumire,Denis Musinguzi,Bruno Beijuka,Jonah Mubuuke Kyagaba,Peter Nabende,Andrew Katumba,Joyce Nakatumba-Nabende

Main category: cs.CL

TL;DR: This study systematically evaluates four state-of-the-art ASR models across 13 African languages, examining how model choice, data scale, and decoding strategy affect low-resource languages, and showing where each model is strong at different data sizes and when an external language model helps.

Motivation: ASR for African languages is limited by scarce labeled data and a lack of systematic guidance on model selection, data scaling, and decoding strategies; mainstream pre-trained models need to be compared within a unified framework.

Method: Fine-tune Whisper, XLS-R, MMS, and W2v-BERT on 13 African languages using 1 to 400 hours of labeled data, and analyze performance differences across data scales and with external language model decoding.

Result: MMS and W2v-BERT are more data-efficient in very low-resource settings, XLS-R scales better as data grows, and Whisper does well in mid-resource conditions. The benefit of an external language model depends on how well the acoustic and text resources align; in some cases it plateaus or introduces errors.

Conclusion: Model choice should weigh pre-training coverage, architecture, data domain, and resource availability; the study offers practical guidance for building ASR systems for low-resource languages.

Abstract: Automatic speech recognition (ASR) for African languages remains constrained by limited labeled data and the lack of systematic guidance on model selection, data scaling, and decoding strategies. Large pre-trained systems such as Whisper, XLS-R, MMS, and W2v-BERT have expanded access to ASR technology, but their comparative behavior in African low-resource contexts has not been studied in a unified and systematic way. In this work, we benchmark four state-of-the-art ASR models across 13 African languages, fine-tuning them on progressively larger subsets of transcribed data ranging from 1 to 400 hours. Beyond reporting error rates, we provide new insights into why models behave differently under varying conditions. We show that MMS and W2v-BERT are more data-efficient in very low-resource regimes, XLS-R scales more effectively as additional data becomes available, and Whisper demonstrates advantages in mid-resource conditions. We also analyze where external language model decoding yields improvements and identify cases where it plateaus or introduces additional errors, depending on the alignment between acoustic and text resources. By highlighting the interaction between pre-training coverage, model architecture, dataset domain, and resource availability, this study offers practical insights into the design of ASR systems for underrepresented languages.

[3] MedBioRAG: Semantic Search and Retrieval-Augmented Generation with Large Language Models for Medical and Biological QA

Seonok Kim

Main category: cs.CL

TL;DR: MedBioRAG is a retrieval-augmented model that combines semantic and lexical search, document retrieval, and supervised fine-tuning. It substantially improves biomedical QA performance, outperforming prior state-of-the-art models and GPT-4o on several benchmarks.

Motivation: To improve the accuracy and context awareness of large language models on complex biomedical QA tasks, and to address the weak consistency between retrieval and generation in traditional approaches.

Method: MedBioRAG fuses semantic and lexical retrieval for efficient document retrieval and ranking, and uses supervised fine-tuning to improve the generated answers.

Result: On NFCorpus, TREC-COVID, MedQA, PubMedQA, and BioASQ, MedBioRAG outperforms prior state-of-the-art models and GPT-4o on NDCG, MRR, accuracy, and ROUGE.

Conclusion: Combining semantic-search-driven retrieval with LLM fine-tuning effectively improves biomedical QA systems and has broad application potential.

Abstract: Recent advancements in retrieval-augmented generation (RAG) have significantly enhanced the ability of large language models (LLMs) to perform complex question-answering (QA) tasks. In this paper, we introduce MedBioRAG, a retrieval-augmented model designed to improve biomedical QA performance through a combination of semantic and lexical search, document retrieval, and supervised fine-tuning. MedBioRAG efficiently retrieves and ranks relevant biomedical documents, enabling precise and context-aware response generation. We evaluate MedBioRAG across text retrieval, close-ended QA, and long-form QA tasks using benchmark datasets such as NFCorpus, TREC-COVID, MedQA, PubMedQA, and BioASQ. Experimental results demonstrate that MedBioRAG outperforms previous state-of-the-art (SoTA) models and the GPT-4o base model in all evaluated tasks. Notably, our approach improves NDCG and MRR scores for document retrieval, while achieving higher accuracy in close-ended QA and higher ROUGE scores in long-form QA. Our findings highlight the effectiveness of semantic search-based retrieval and LLM fine-tuning in biomedical applications.
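The abstract reports gains in NDCG and MRR. As a reminder of what these retrieval metrics measure, here is a minimal sketch using the standard textbook definitions (linear gain for NDCG). The document IDs and relevance grades are made up for illustration, and the paper's exact evaluation setup may differ.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids, gains, k):
    """gains: dict mapping doc_id -> graded relevance (missing ids count as 0)."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(i + 1)
              for i, doc_id in enumerate(ranked_ids[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical queries: the first hits at rank 2, the second at rank 1.
runs = [(["a", "b", "c"], {"b"}), (["x", "y"], {"x"})]
print(mean_reciprocal_rank(runs))                    # 0.75
print(ndcg_at_k(["a", "b", "c"], {"b": 1}, k=3))     # 1 / log2(3)
```

MRR only rewards the first relevant hit, whereas NDCG credits every relevant document with a discount that grows with rank, which is why retrieval papers often report both.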

[4] KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering

Xin Sun,Zhongqi Chen,Xing Zheng,Qiang Liu,Shu Wu,Bowen Song,Zilei Wang,Weiqiang Wang,Liang Wang

Main category: cs.CL

TL;DR: This paper proposes KBQA-R1, a framework that uses reinforcement learning to shift knowledge base question answering from text imitation to interaction optimization, using execution feedback to improve reasoning accuracy.

Motivation: Existing KBQA methods either generate hallucinated queries or rely too heavily on templates, and lack a real understanding of the knowledge graph environment.

Method: KBQA is modeled as a multi-turn decision process and trained with reinforcement learning (GRPO) and the proposed Referenced Rejection Sampling (RRS), using execution feedback to refine the policy.

Result: State-of-the-art performance on WebQSP, GrailQA, and GraphQuestions.

Conclusion: KBQA-R1 strengthens the verifiable execution ability of LLMs in KBQA and addresses the twin failures of hallucination and rigid reasoning.

Abstract: Knowledge Base Question Answering (KBQA) challenges models to bridge the gap between natural language and strict knowledge graph schemas by generating executable logical forms. While Large Language Models (LLMs) have advanced this field, current approaches often struggle with a dichotomy of failure: they either generate hallucinated queries without verifying schema existence or exhibit rigid, template-based reasoning that mimics synthesized traces without true comprehension of the environment. To address these limitations, we present KBQA-R1, a framework that shifts the paradigm from text imitation to interaction optimization via Reinforcement Learning. Treating KBQA as a multi-turn decision process, our model learns to navigate the knowledge base using a list of actions, leveraging Group Relative Policy Optimization (GRPO) to refine its strategies based on concrete execution feedback rather than static supervision. Furthermore, we introduce Referenced Rejection Sampling (RRS), a data synthesis method that resolves cold-start challenges by strictly aligning reasoning traces with ground-truth action sequences. Extensive experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance, effectively grounding LLM reasoning in verifiable execution.

[5] PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data

Pawel Batorski,Paul Swoboda

Main category: cs.CL

TL;DR: Proposes a fast automatic prompt construction algorithm that uses Monte Carlo Shapley estimation to select a small set of effective few-shot examples. It improves LLM performance on tasks such as text simplification and GSM8K and outperforms existing automatic prompting methods under a limited compute budget.

Motivation: LLMs are sensitive to prompt design, but handcrafting effective prompts is difficult and time-consuming; an automated method is needed to reduce reliance on intricate manual design of few-shot examples.

Method: An automatic prompt construction algorithm that scores example utility with Monte Carlo Shapley values, iteratively replaces, drops, or keeps few-shot examples, and speeds up evaluation with subsampling and a replay buffer.

Result: Outperforms existing automatic prompting methods on text simplification and GSM8K and obtains the second-best results on classification and summarization; with a larger compute budget it reaches the state of the art among automatic prompting methods on several tasks.

Conclusion: Carefully constructed few-shot examples, rather than exhaustive instruction search, are the main lever for fast and data-efficient prompt engineering.

Abstract: LLMs are highly sensitive to prompt design, but handcrafting effective prompts is difficult and often requires intricate crafting of few-shot examples. We propose a fast automatic prompt construction algorithm that augments human instructions by generating a small set of few-shot examples. Our method iteratively replaces/drops/keeps few-shot examples using Monte Carlo Shapley estimation of example utility. To speed up evaluation, we use aggressive subsampling and a replay buffer. Our method can be run using different compute time budgets. On a limited budget, we outperform existing automatic prompting methods on text simplification and GSM8K and obtain second-best results on classification and summarization. With an extended, but still modest compute budget we set a new state of the art among automatic prompting methods on classification, simplification and GSM8K. Our results show that carefully constructed examples, rather than exhaustive instruction search, are the dominant lever for fast and data-efficient prompt engineering. Our code is available at https://github.com/Batorskq/PIAST.
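The core scoring idea named in the abstract, Monte Carlo Shapley estimation of example utility, can be sketched with generic permutation sampling. This is not the authors' implementation (see their repository for that): `toy_utility` and `example_weights` are hypothetical stand-ins for evaluating a prompt on a validation subsample, and the replace/drop/keep loop and replay buffer are omitted.

```python
import random


def monte_carlo_shapley(items, utility, num_permutations=200, seed=0):
    """Estimate each item's Shapley value by averaging its marginal
    contribution over randomly sampled orderings of the items."""
    rng = random.Random(seed)
    totals = {item: 0.0 for item in items}
    for _ in range(num_permutations):
        order = list(items)
        rng.shuffle(order)
        subset = []
        prev_score = utility(subset)
        for item in order:
            subset.append(item)
            score = utility(subset)
            totals[item] += score - prev_score  # marginal contribution
            prev_score = score
    return {item: total / num_permutations for item, total in totals.items()}


# Hypothetical additive utility: each example adds a fixed amount of accuracy.
example_weights = {"ex_a": 0.3, "ex_b": 0.1, "ex_c": 0.0}


def toy_utility(subset):
    return sum(example_weights[e] for e in subset)


estimates = monte_carlo_shapley(list(example_weights), toy_utility, num_permutations=50)
print(estimates)
```

With an additive utility the estimate recovers each example's own weight, which makes the sketch easy to check. In a real prompt-selection setting the utility is noisy and non-additive, so low-valued examples (like `ex_c` here) are the natural candidates to drop or replace.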

[6] MultiScript30k: Leveraging Multilingual Embeddings to Extend Cross Script Parallel Data

Christopher Driggers-Ellis,Detravious Brinkley,Ray Chen,Aashish Dhawan,Daisy Zhe Wang,Christan Grant

Main category: cs.CL

TL;DR: This paper proposes MultiScript30k, an extension of the existing Multi30k dataset that supports a more diverse set of languages and scripts to advance multimodal machine translation research.

Motivation: The original Multi30k dataset covers only four European languages, which limits research on multilingual multimodal translation; a dataset covering more languages and scripts is needed.

Method: Use the NLLB200-3.3B model to translate the English portion of Multi30k into Arabic, Spanish, Ukrainian, Simplified Chinese, and Traditional Chinese, producing the new MultiScript30k dataset.

Result: The new dataset contains over 30,000 sentences. For every language except Traditional Chinese it shows cosine similarity above 0.8 and symmetric KL divergence below 0.000251; COMETKiwi scores give mixed results relative to prior work.

Conclusion: MultiScript30k is a valuable resource for global languages written in non-Latin scripts and helps broaden the diversity of multimodal machine translation research.

Abstract: Multi30k is frequently cited in the multimodal machine translation (MMT) literature, offering parallel text data for training and fine-tuning deep learning models. However, it is limited to four languages: Czech, English, French, and German. This restriction has led many researchers to focus their investigations only on these languages. As a result, MMT research on diverse languages has been stalled because the official Multi30k dataset only represents European languages in Latin scripts. Previous efforts to extend Multi30k exist, but the list of supported languages, represented language families, and scripts is still very short. To address these issues, we propose MultiScript30k, a new Multi30k dataset extension for global languages in various scripts, created by translating the English version of Multi30k (Multi30k-En) using NLLB200-3.3B. The dataset consists of over 30,000 sentences and provides translations of all sentences in Multi30k-En into Ar, Es, Uk, Zh_Hans and Zh_Hant. Similarity analysis shows that the Multi30k extension consistently achieves greater than 0.8 cosine similarity and symmetric KL divergence less than 0.000251 for all languages supported except Zh_Hant, which is comparable to the previous Multi30k extensions ArEnMulti30k and Multi30k-Uk. COMETKiwi scores reveal mixed assessments of MultiScript30k as a translation of Multi30k-En in comparison to the related work. ArEnMulti30k scores nearly equal MultiScript30k-Ar, but Multi30k-Uk scores 6.4% greater than MultiScript30k-Uk per split.
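The two similarity measures quoted in the abstract can be illustrated as follows. The abstract does not say how the embeddings are turned into distributions or which symmetrization of KL divergence is used; this sketch assumes the common sum form, KL(p‖q) + KL(q‖p), and operates on toy vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against zeros in q."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Assumed symmetrization: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Toy sentence embeddings and toy probability distributions.
src_embedding = [0.2, 0.7, 0.1]
tgt_embedding = [0.25, 0.65, 0.1]
print(cosine_similarity(src_embedding, tgt_embedding))
print(symmetric_kl([0.5, 0.3, 0.2], [0.45, 0.35, 0.2]))
```

High cosine similarity together with near-zero symmetric KL indicates that a translated sentence sits close to its source in the multilingual embedding space, which is how the abstract's thresholds should be read.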

[7] Applying NLP to iMessages: Understanding Topic Avoidance, Responsiveness, and Sentiment

Alan Gerber,Sam Cooperman

Main category: cs.CL

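TL;DR: This paper explores how the local message data file that Apple iMessage stores on Mac can be used to build a text message analyzer that answers five main research questions on topic modeling, response times, reluctance scoring, and sentiment analysis, and shows its potential for future analysis of iMessage data.

Motivation: As society relies more on short-form electronic communication, users often overlook how their messaging data may be collected and used. iMessage on Mac provides a local file containing all messages and metadata, which gives users a chance to explore their own data and raises the question of what that data can do for them.

Method: Build an iMessage text message analyzer on top of the locally stored message file, perform exploratory data analysis, and pose and answer five research questions on topic modeling, response times, reluctance scoring, and sentiment analysis.

Result: The authors built an analyzer that parses iMessage data and applied it to real message data, demonstrating its use for topic modeling, identifying response-time patterns, assessing reluctance in communication, and judging sentiment.

Conclusion: The analyzer helps users better understand their own communication behavior and offers a methodological basis and practical example for future behavioral research on iMessage data.

Abstract: What is your messaging data used for? While many users do not often think about the information companies can gather based on their messaging platform of choice, it is nonetheless important to consider as society increasingly relies on short-form electronic communication. While most companies keep their data closely guarded, inaccessible to users or potential hackers, Apple has opened a door to their walled-garden ecosystem, providing iMessage users on Mac with one file storing all their messages and attached metadata. With knowledge of this locally stored file, the question now becomes: What can our data do for us? In the creation of our iMessage text message analyzer, we set out to answer five main research questions focusing on topic modeling, response times, reluctance scoring, and sentiment analysis. This paper uses our exploratory data to show how these questions can be answered using our analyzer and its potential in future studies on iMessage data.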

[8] Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution

Jonathan Kamp,Roos Bakker,Dominique Blok

Main category: cs.CL

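TL;DR: This paper proposes a model- and method-agnostic framework with three evaluation metrics to systematically analyze the lexical and position biases of feature attribution methods on Transformer models, and finds a structural imbalance between the two types of bias.

Motivation: Different feature attribution methods can give very different explanations for the same input, which undermines user trust. Understanding language models and data better requires a closer look at the biases behind these methods.

Method: A model- and method-agnostic evaluation framework with three metrics measures the lexical and position bias of feature attribution methods. Two Transformer models are evaluated systematically, first on a pseudo-random classification task with artificial data and then on a causal relation detection task with natural data.

Result: Lexical and position biases are structurally unbalanced across models: a model that scores high on one type scores low on the other. Methods that produce anomalous explanations are also more likely to be biased themselves.

Conclusion: The biases of feature attribution methods need systematic evaluation; the proposed framework helps expose the latent bias patterns of different methods and improves the reliability and trustworthiness of explanations.

Abstract: Good quality explanations strengthen the understanding of language models and data. Feature attribution methods, such as Integrated Gradients, are a type of post-hoc explainer that can provide token-level insights. However, explanations on the same input may vary greatly due to underlying biases of different methods. Users may be aware of this issue and mistrust their utility, while unaware users may trust them inadequately. In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both the lexical and position bias (what and where in the input) for two transformers; first, in a controlled, pseudo-random classification task on artificial data; then, in a semi-controlled causal relation detection task on natural data. We find that lexical and position biases are structurally unbalanced in our model comparison, with models that score high on one type scoring low on the other. We also find signs that methods producing anomalous explanations are more likely to be biased themselves.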

[9] FIBER: A Multilingual Evaluation Resource for Factual Inference Bias

Evren Ayberk Munis,Deniz Yılmaz,Arianna Muti,Çağrı Toraman

Main category: cs.CL

TL;DR: This paper presents FIBER, a multilingual benchmark for evaluating the factual knowledge of large language models in single- and multi-entity settings, covering English, Italian, and Turkish. The prompt language is found to influence model output, revealing language-related inference bias, and models perform worse on multi-entity questions.

Motivation: Existing factual knowledge benchmarks focus mainly on single-entity, monolingual data and do not systematically evaluate large language models in multilingual, multi-entity settings.

Method: Build the multilingual FIBER benchmark with sentence completion, question-answering, and object-count tasks in English, Italian, and Turkish, and evaluate models of different sizes on single- versus multi-entity questions and on inference bias induced by the prompt language.

Result: The prompt language influences model output, especially for entities associated with the corresponding country. 31% of topics have a bias score above 0.5, and Turkish prompts show higher bias than Italian in 83% of topics. Models perform worse on multi-entity questions. English yields the best performance, with Turkish and Italian lower, and larger models (such as Llama-3.1-8B) outperform smaller ones.

Conclusion: The factual knowledge performance of large language models depends on prompt language, number of entities, language, and model size; there is a clear language-dependent inference bias, and multi-entity reasoning remains a challenge.

Abstract: Large language models are widely used across domains, yet there are concerns about their factual reliability and biases. Factual knowledge probing offers a systematic means to evaluate these aspects. Most existing benchmarks focus on single-entity facts and monolingual data. We therefore present FIBER, a multilingual benchmark for evaluating factual knowledge in single- and multi-entity settings. The dataset includes sentence completion, question-answering, and object-count prediction tasks in English, Italian, and Turkish. Using FIBER, we examine whether the prompt language induces inference bias in entity selection and how large language models perform on multi-entity versus single-entity questions. The results indicate that the language of the prompt can influence the model's generated output, particularly for entities associated with the country corresponding to that language. However, this effect varies across different topics such that 31% of the topics exhibit a factual inference bias score greater than 0.5. Moreover, the level of bias differs across languages such that Turkish prompts show higher bias compared to Italian in 83% of the topics, suggesting a language-dependent pattern. Our findings also show that models face greater difficulty when handling multi-entity questions than single-entity questions. Model performance differs across both languages and model sizes. The highest mean average precision is achieved in English, while Turkish and Italian lead to noticeably lower scores. Larger models, including Llama-3.1-8B and Qwen-2.5-7B, show consistently better performance than smaller 3B-4B models.

[10] SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing

Luca Foppiano,Sotaro Takeshita,Pedro Ortiz Suarez,Ekaterina Borisova,Raia Abu Ahmad,Malte Ostendorff,Fabio Barth,Julian Moreno-Schneider,Georg Rehm

Main category: cs.CL

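TL;DR: SciLaD is a large-scale scientific language dataset built with open-source frameworks and public data. It contains over 10 million English papers and more than 35 million multilingual papers, and is released together with an extensible generation pipeline and a pre-trained model that validate its quality and usefulness.

Motivation: Advancing reproducibility, transparency, and further research in scientific language processing requires a fully open, large-scale, high-quality dataset of scientific text.

Method: Build SciLaD from open-source tools and public data sources, with a curated English split and a multilingual, unfiltered TEI XML split, and develop an extensible data generation pipeline. A RoBERTa model is then pre-trained on the dataset and evaluated on several benchmarks.

Result: The dataset contains over 10 million English and more than 35 million multilingual scientific publications. The pre-trained RoBERTa model performs comparably to other scientific language models of similar size on multiple benchmarks, validating the dataset's quality and usefulness.

Conclusion: SciLaD shows what open-source tools can achieve in large-scale scientific data curation and provides a high-quality, openly accessible resource for natural scientific language processing.

Abstract: SciLaD is a novel, large-scale dataset of scientific language constructed entirely using open-source frameworks and publicly available data sources. It comprises a curated English split containing over 10 million scientific publications and a multilingual, unfiltered TEI XML split including more than 35 million publications. We also publish the extensible pipeline for generating SciLaD. The dataset construction and processing workflow demonstrates how open-source tools can enable large-scale scientific data curation while maintaining high data quality. Finally, we pre-train a RoBERTa model on our dataset and evaluate it across a comprehensive set of benchmarks, achieving performance comparable to other scientific language models of similar size, validating the quality and utility of SciLaD. We publish the dataset and evaluation pipeline to promote reproducibility, transparency, and further research in natural scientific language processing and understanding, including scholarly document processing.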

Di Wu,Ruiyu Fang,Liting Jiang,Shuangyong Song,Xiaomeng Huang,Shiquan Wang,Zhongqiu Li,Lingling Shi,Mengjiao Bao,Yongxiang Li,Hao Huang

Main category: cs.CL

TL;DR: This paper surveys recent advances in multi-intent spoken language understanding (SLU), reviewing existing work from the perspectives of decoding paradigms and modeling approaches, comparing representative models, and discussing current challenges and future research directions.

Motivation: A comprehensive, systematic review of multi-intent SLU is lacking, and existing work needs to be organized to support follow-up research.

Method: Classify and review existing studies along two dimensions, decoding paradigms and modeling approaches, then compare the performance of representative models and analyze their strengths and weaknesses.

Result: The survey summarizes the main technical approaches and representative results in multi-intent SLU and shows how different methods differ in performance and applicable scenarios.

Conclusion: The survey is a useful reference for multi-intent SLU research, clarifying current challenges and pointing to promising future directions.

Abstract: Multi-intent spoken language understanding (SLU) involves two tasks: multiple intent detection and slot filling, which jointly handle utterances containing more than one intent. Owing to this characteristic, which closely reflects real-world applications, the task has attracted increasing research attention, and substantial progress has been achieved. However, there remains a lack of a comprehensive and systematic review of existing studies on multi-intent SLU. To this end, this paper presents a survey of recent advances in multi-intent SLU. We provide an in-depth overview of previous research from two perspectives: decoding paradigms and modeling approaches. On this basis, we further compare the performance of representative models and analyze their strengths and limitations. Finally, we discuss the current challenges and outline promising directions for future research. We hope this survey will offer valuable insights and serve as a useful reference for advancing research in multi-intent SLU.

[12] Leveraging LLMs for Title and Abstract Screening for Systematic Review: A Cost-Effective Dynamic Few-Shot Learning Approach

Yun-Chung Liu,Rui Yang,Jonathan Chong Kai Liew,Ziran Yin,Henry Foote,Christopher J. Lindsell,Chuan Hong

Main category: cs.CL

TL;DR: Proposes a two-stage dynamic few-shot learning (DFSL) approach that uses large language models to improve the efficiency and performance of literature screening in systematic reviews while balancing accuracy against computational cost.

Motivation: Literature screening for systematic reviews is slow and labor-intensive, and title and abstract screening in particular has become a major bottleneck; more automation is needed to reduce the manual burden.

Method: A two-stage DFSL approach: a low-cost LLM performs the initial screening, and low-confidence instances are then re-evaluated by a high-performance LLM, improving overall screening performance while controlling computational cost.

Result: Evaluated on 10 systematic reviews, the approach shows strong generalizability and cost-effectiveness, with potential to reduce the manual screening workload.

Conclusion: DFSL balances accuracy and cost effectively and has the potential to be widely applied to accelerate systematic reviews in practice.

Abstract: Systematic reviews are a key component of evidence-based medicine, playing a critical role in synthesizing existing research evidence and guiding clinical decisions. However, with the rapid growth of research publications, conducting systematic reviews has become increasingly burdensome, with title and abstract screening being one of the most time-consuming and resource-intensive steps. To mitigate this issue, we designed a two-stage dynamic few-shot learning (DFSL) approach aimed at improving the efficiency and performance of large language models (LLMs) in the title and abstract screening task. Specifically, this approach first uses a low-cost LLM for initial screening, then re-evaluates low-confidence instances using a high-performance LLM, thereby enhancing screening performance while controlling computational costs. We evaluated this approach across 10 systematic reviews, and the results demonstrate its strong generalizability and cost-effectiveness, with potential to reduce manual screening burden and accelerate the systematic review process in practical applications.
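The two-stage routing described in the abstract (cheap model first, escalate low-confidence cases to a stronger model) can be sketched as a confidence-threshold cascade. Everything concrete here is an assumption for illustration: the keyword-based `cheap_model` and `strong_model` stubs stand in for LLM calls, the 0.8 threshold is arbitrary, and the dynamic few-shot example selection is not modeled.

```python
def cascade_screen(records, cheap_model, strong_model, threshold=0.8):
    """Screen records with cheap_model; re-evaluate any record whose
    confidence falls below threshold using strong_model.

    Each model maps a record to a (label, confidence) pair.
    Returns the final labels and the number of escalated records.
    """
    decisions = []
    escalated = 0
    for record in records:
        label, confidence = cheap_model(record)
        if confidence < threshold:
            label, _ = strong_model(record)  # second-stage re-evaluation
            escalated += 1
        decisions.append(label)
    return decisions, escalated


# Hypothetical stand-ins for the low-cost and high-performance LLMs.
def cheap_model(text):
    if "randomized" in text:
        return "include", 0.9
    if "editorial" in text:
        return "exclude", 0.9
    return "exclude", 0.5  # unsure: low confidence


def strong_model(text):
    return ("include", 0.95) if "trial" in text else ("exclude", 0.95)


records = [
    "randomized trial of drug A",
    "editorial on policy",
    "pilot trial in adults",
]
decisions, escalated = cascade_screen(records, cheap_model, strong_model)
print(decisions, escalated)
```

Only the third record is escalated, so the expensive model is called once for three records. The threshold controls the cost and accuracy trade-off: raising it sends more records to the second stage.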

[13] When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents

Mrinal Rawat,Arkajyoti Chakraborty,Neha Gupta,Roberto Pieraccini

Main category: cs.CL

TL;DR: Proposes a reinforcement learning (RL) approach that uses Group Relative Policy Optimization (GRPO) to let large language models learn reasoning strategies directly from task outcomes, improving reasoning quality and tool-invocation precision and achieving a 40% gain over the Qwen3-1.7B base model.

Motivation: Supervised fine-tuning (SFT) generalizes poorly under distribution shift, and high-quality reasoning annotations are costly and hard to scale; meanwhile, recent work shows that reasoning improves generalization and reliability.

Method: An RL framework using GRPO with a reward built from tool accuracy and answer correctness, so that the LLM generates reasoning steps that guide both tool invocation and answer generation.

Result: A 1.5% relative improvement over the SFT model trained without explicit reasoning and a 40% gain over the base Qwen3-1.7B model, with better reasoning quality and tool-invocation precision.

Conclusion: Unifying reasoning and action learning through RL improves the capability and generalization of conversational agents and shows promise for building stronger agents.

Abstract: Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks. However, SFT can have difficulty generalizing when the underlying data distribution changes, even when the new data does not fall completely outside the training domain. Recent reasoning-focused models such as o1 and R1 have demonstrated consistent gains over their non-reasoning counterparts, highlighting the importance of reasoning for improved generalization and reliability. However, collecting high-quality reasoning traces for SFT remains challenging: annotations are costly, subjective, and difficult to scale. To address this limitation, we leverage Reinforcement Learning (RL) to enable models to learn reasoning strategies directly from task outcomes. We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools (e.g., function calls) and the final answer generation for conversational agents. Our method employs Group Relative Policy Optimization (GRPO) with rewards designed around tool accuracy and answer correctness, allowing the model to iteratively refine its reasoning and actions. Experimental results demonstrate that our approach improves both the quality of reasoning and the precision of tool invocations, achieving a 1.5% relative improvement over the SFT model (trained without explicit thinking) and a 40% gain over the vanilla Qwen3-1.7B base model. These findings demonstrate the promise of unifying reasoning and action learning through RL to build more capable and generalizable conversational agents.
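GRPO scores each sampled response relative to the other responses drawn for the same prompt, which removes the need for a learned value function. The sketch below shows only that group-relative normalization step. The `combined_reward` function and its equal weighting of tool accuracy and answer correctness are hypothetical, since the abstract does not give the reward formula, and implementations differ on details such as population versus sample standard deviation.

```python
import statistics


def combined_reward(tool_correct, answer_correct, tool_weight=0.5):
    """Hypothetical reward mixing tool accuracy and answer correctness."""
    return tool_weight * float(tool_correct) + (1.0 - tool_weight) * float(answer_correct)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses for one prompt: (tool call correct?, answer correct?)
outcomes = [(True, True), (False, False), (True, True), (False, False)]
rewards = [combined_reward(t, a) for t, a in outcomes]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Responses that beat their group's average get a positive advantage and are reinforced; if every response in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.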

[14] AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

Kuan-Wei Lu,Ding-Yong Hong,Pangfeng Liu

Main category: cs.CL

TL;DR: This paper proposes AdaSD, a hyperparameter-free adaptive speculative decoding method that dynamically adjusts generation length and acceptance criteria, achieving up to 49% speedup over standard speculative decoding while keeping accuracy loss under 2%.

Motivation: LLM inference is slow, and existing speculative decoding methods depend on additional training or complex tuning; a plug-and-play adaptive scheme is missing.

Method: AdaSD introduces two adaptive thresholds, updated in real time from token entropy and Jensen-Shannon distance, that control when candidate token generation stops and whether tokens are accepted, with no preset hyperparameters or model fine-tuning.

Result: Experiments on several benchmark datasets show that AdaSD is up to 49% faster than standard speculative decoding with accuracy loss below 2%.

Conclusion: AdaSD is an efficient adaptive speculative decoding method that needs neither hyperparameters nor training, works with off-the-shelf models, and improves LLM inference efficiency in practice.

Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft model to predict candidate tokens, which are then verified by a larger target model. However, existing approaches often require additional training, extensive hyperparameter tuning, or prior analysis of models and tasks before deployment. In this paper, we propose Adaptive Speculative Decoding (AdaSD), a hyperparameter-free decoding scheme that dynamically adjusts generation length and acceptance criteria during inference. AdaSD introduces two adaptive thresholds: one to determine when to stop candidate token generation and another to decide token acceptance, both updated in real time based on token entropy and Jensen-Shannon distance. This approach eliminates the need for pre-analysis or fine-tuning and is compatible with off-the-shelf models. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 49% speedup over standard speculative decoding while limiting accuracy degradation to under 2%, making it a practical solution for efficient and adaptive LLM inference.
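The two signals the abstract names, token entropy and Jensen-Shannon distance, are standard quantities over next-token distributions. The sketch below computes them for toy distributions (base-2 logarithms, so the JS distance lies in [0, 1]). How AdaSD turns these signals into its two adaptive thresholds is not specified in the abstract and is not reproduced here.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))  # clamp tiny negative rounding error


# Toy next-token distributions from a draft model and a target model.
draft_probs = [0.7, 0.2, 0.1]
target_probs = [0.6, 0.3, 0.1]
print(entropy(draft_probs))                    # draft model uncertainty
print(js_distance(draft_probs, target_probs))  # draft/target disagreement
```

Intuitively, high draft entropy suggests the draft model is unsure and drafting should stop, while a small JS distance suggests the draft and target models agree closely enough for a token to be accepted.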

[15] CIP: A Plug-and-Play Causal Prompting Framework for Mitigating Hallucinations under Long-Context Noise

Qingsen Ma,Dianyun Wang,Ran Jing,Yujun Sun,Zhenbo Xu

Main category: cs.CL

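TL;DR: This paper proposes CIP, a lightweight, plug-and-play causal prompting framework that injects a causal relation sequence at the input stage to suppress LLM hallucinations in long, noisy contexts, improving reasoning quality, factual accuracy, and response efficiency.

Motivation: Large language models tend to hallucinate when processing long, noisy retrieval contexts because they rely on spurious correlations rather than genuine causal relationships. A method that strengthens causal reasoning is needed to reduce hallucinations.

Method: CIP constructs a causal relation sequence among entities, actions, and events and injects it into the prompt. Causal intervention and counterfactual reasoning suppress non-causal reasoning paths and steer the model toward causally relevant evidence.

Result: Experiments on seven mainstream models, including GPT-4o, Gemini 2.0 Flash, and Llama 3.1, show a 2.6-point improvement in Attributable Rate, a 0.38 improvement in Causal Consistency Score, and a fourfold increase in effective information density. API-level profiling shows faster contextual understanding and up to 55.1% lower end-to-end response latency.

Conclusion: CIP improves the reasoning quality, reliability, and efficiency of large language models, suggesting that causal reasoning may be a promising paradigm for improving explainability, stability, and efficiency.

Abstract: Large language models often hallucinate when processing long and noisy retrieval contexts because they rely on spurious correlations rather than genuine causal relationships. We propose CIP, a lightweight and plug-and-play causal prompting framework that mitigates hallucinations at the input stage. CIP constructs a causal relation sequence among entities, actions, and events and injects it into the prompt to guide reasoning toward causally relevant evidence. Through causal intervention and counterfactual reasoning, CIP suppresses non-causal reasoning paths, improving factual grounding and interpretability. Experiments across seven mainstream language models, including GPT-4o, Gemini 2.0 Flash, and Llama 3.1, show that CIP consistently enhances reasoning quality and reliability, achieving 2.6 points improvement in Attributable Rate, 0.38 improvement in Causal Consistency Score, and a fourfold increase in effective information density. API-level profiling further shows that CIP accelerates contextual understanding and reduces end-to-end response latency by up to 55.1 percent. These results suggest that causal reasoning may serve as a promising paradigm for improving the explainability, stability, and efficiency of large language models.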

Shogo Fujita,Yuji Naraki,Yiqing Zhu,Shinsuke Mori

Main category: cs.CL

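TL;DR: This paper introduces LegalRikai: Open Benchmark, an open benchmark created by legal professionals under an attorney's supervision that emulates Japanese corporate legal practice. It comprises four complex tasks and 100 samples requiring long-form, structured outputs. Human and automated evaluations of leading LLMs, including GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1, show that abstract instructions tend to trigger unnecessary modifications, exposing weaknesses in document-level editing. Automated evaluation agrees closely with human judgment on criteria with clear linguistic grounding and can serve as a screening tool when expert availability is limited, although assessing structural consistency remains difficult. The paper also proposes a dataset evaluation framework to encourage more practice-oriented research in the legal domain.

Motivation: Existing legal benchmarks mostly target short-text tasks and do not reflect the complex document-level processing required in real legal practice. A practice-oriented benchmark is needed that is closer to real use and can evaluate structured long-form generation and editing.

Method: Design the LegalRikai benchmark with four complex tasks and 100 samples requiring long-form structured outputs, created by legal professionals under an attorney's supervision. Evaluate several leading LLMs (such as GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1) with both human and automated evaluation, and analyze how the two agree across different criteria.

Result: Human evaluation shows that models tend to make unnecessary modifications when given abstract instructions, exposing weaknesses in document-level editing. Automated evaluation agrees closely with human judgment on criteria with clear linguistic grounding but still struggles to assess structural consistency. The results indicate that automated evaluation is an effective screening tool when expert resources are scarce.

Conclusion: LegalRikai is a high-quality benchmark that is closer to real legal practice. It exposes the limitations of current models on complex legal tasks, confirms that automated evaluation is effective under specific conditions, and proposes a dataset evaluation framework to push legal AI research in a more practice-oriented direction.

Abstract: This paper introduces LegalRikai: Open Benchmark, a new benchmark comprising four complex tasks that emulate Japanese corporate legal practices. The benchmark was created by legal professionals under the supervision of an attorney. This benchmark has 100 samples that require long-form, structured outputs, and we evaluated them against multiple practical criteria. We conducted both human and automated evaluations using leading LLMs, including GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1. Our human evaluation revealed that abstract instructions prompted unnecessary modifications, highlighting model weaknesses in document-level editing that were missed by conventional short-text tasks. Furthermore, our analysis reveals that automated evaluation aligns well with human judgment on criteria with clear linguistic grounding, whereas assessing structural consistency remains a challenge. The result demonstrates the utility of automated evaluation as a screening tool when expert availability is limited. We propose a dataset evaluation framework to promote more practice-oriented research in the legal domain.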

[17] Unifying Dynamic Tool Creation and Cross-Task Experience Sharing through Cognitive Memory Architecture

Jiarun Liu,Shiyue Xu,Yang Li,Shangkun Liu,Yongli Yu,Peng Cao

Main category: cs.CL

TL;DR: This paper proposes SMITH, a unified cognitive architecture that combines dynamic tool creation with cross-task experience sharing through hierarchical memory organization, substantially improving the ability of LLM agents to adapt to new tasks.

Motivation: Existing methods are limited in tool availability and experience reuse, so agents adapt poorly to new tasks: they either rely on predefined tools with limited coverage or build tools from scratch without using past experience.

Method: SMITH organizes agent memory into procedural, semantic, and episodic components. Tool creation is formalized as iterative code generation in a sandbox environment, and experience sharing is done through episodic memory retrieval with semantic similarity matching. A curriculum learning strategy based on agent-ensemble difficulty re-estimation is also used.

Result: On the GAIA benchmark, SMITH reaches 81.8% Pass@1 accuracy, outperforming state-of-the-art baselines such as Alita (75.2%) and Memento (70.9%).

Conclusion: By systematically integrating tool creation with experience accumulation, SMITH lays a foundation for building truly adaptive agents that keep evolving.

Abstract: Large Language Model agents face fundamental challenges in adapting to novel tasks due to limitations in tool availability and experience reuse. Existing approaches either rely on predefined tools with limited coverage or build tools from scratch without leveraging past experiences, leading to inefficient exploration and suboptimal performance. We introduce SMITH (Shared Memory Integrated Tool Hub), a unified cognitive architecture that seamlessly integrates dynamic tool creation with cross-task experience sharing through hierarchical memory organization. SMITH organizes agent memory into procedural, semantic, and episodic components, enabling systematic capability expansion while preserving successful execution patterns. Our approach formalizes tool creation as iterative code generation within controlled sandbox environments and experience sharing through episodic memory retrieval with semantic similarity matching. We further propose a curriculum learning strategy based on agent-ensemble difficulty re-estimation. Extensive experiments on the GAIA benchmark demonstrate SMITH's effectiveness, achieving 81.8% Pass@1 accuracy and outperforming state-of-the-art baselines including Alita (75.2%) and Memento (70.9%). Our work establishes a foundation for building truly adaptive agents that continuously evolve their capabilities through principled integration of tool creation and experience accumulation.

[18] qa-FLoRA: Data-free query-adaptive Fusion of LoRAs for LLMs

Shreya Shukla,Aditya Sriram,Milinda Kuppur Narayanaswamy,Hiteshi Jain

Main category: cs.CL

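TL;DR: Proposes qa-FLoRA, a query-adaptive LoRA fusion method that needs neither training nor data: it dynamically computes layer-level fusion weights by measuring the distributional divergence between the base model and each adapter, and it clearly outperforms static and training-free baselines on multi-domain composite tasks.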

Details Motivation: Existing LoRA fusion methods rely on static weights or require training on large amounts of labeled data, so they handle complex multi-domain composite queries poorly, which limits their use in practice. Method: qa-FLoRA dynamically generates per-layer fusion weights by measuring the distributional divergence between the base model and each LoRA adapter, with no training or extra data, yielding query-adaptive LoRA fusion. Result: Across nine multilingual composite tasks spanning mathematics, coding, and medicine, it improves on static fusion by about 5% with LLaMA-2 and about 6% with LLaMA-3, beats training-free baselines by 7%-10%, and significantly narrows the gap to supervised methods. Conclusion: qa-FLoRA is an efficient, general, and interpretable LoRA fusion framework that achieves robust multi-domain adaptation without training, offering a new approach to deploying large models on composite tasks. Abstract: The deployment of large language models for specialized tasks often requires domain-specific parameter-efficient finetuning through Low-Rank Adaptation (LoRA) modules. However, effectively fusing these adapters to handle complex, multi-domain composite queries remains a critical challenge. Existing LoRA fusion approaches either use static weights, which assign equal relevance to each participating LoRA, or require data-intensive supervised training for every possible LoRA combination to obtain respective optimal fusion weights. We propose qa-FLoRA, a novel query-adaptive data-and-training-free method for LoRA fusion that dynamically computes layer-level fusion weights by measuring distributional divergence between the base model and respective adapters. Our approach eliminates the need for composite training data or domain-representative samples, making it readily applicable to existing adapter collections. Extensive experiments across nine multilingual composite tasks spanning mathematics, coding, and medical domains show that qa-FLoRA outperforms static fusion by ~5% with LLaMA-2 and ~6% with LLaMA-3, and the training-free baselines by ~7% with LLaMA-2 and ~10% with LLaMA-3, while significantly closing the gap with supervised baselines. Further, layer-level analysis of our fusion weights reveals interpretable fusion patterns, demonstrating the effectiveness of our approach for robust multi-domain adaptation.
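To make the idea of divergence-based fusion weights concrete, here is a rough sketch for a single layer. It assumes KL divergence as the divergence measure and simple proportional normalization; the abstract specifies neither, so both choices, along with the function names and toy distributions, are assumptions rather than the paper's method.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def fusion_weights(base_dist, adapter_dists):
    """Weight each adapter by how far it moves the distribution away from the base.

    Assumption: a larger divergence means the adapter is more relevant to the query.
    """
    divs = {name: kl_divergence(dist, base_dist) for name, dist in adapter_dists.items()}
    total = sum(divs.values())
    if total == 0:
        # No adapter changes the output: fall back to uniform (static) weights.
        return {name: 1.0 / len(divs) for name in divs}
    return {name: value / total for name, value in divs.items()}

# Toy example: the "math" adapter shifts the distribution much more than "code".
weights = fusion_weights(
    base_dist=[0.5, 0.5],
    adapter_dists={"math": [0.9, 0.1], "code": [0.55, 0.45]},
)
print(weights)
```

Repeating this per layer and per query would give layer-level, query-adaptive weights without any training data.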

Tomáš Koref,Lena Held,Mahammad Namazov,Harun Kumru,Yassine Thlija,Christoph Burchard,Ivan Habernal

Main category: cs.CL

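TL;DR: This study develops automated NLP methods to analyze judicial reasoning in decisions of the Czech Supreme Courts and challenges established views about judicial formalism in Central and Eastern Europe. It builds the MADON dataset of 272 decisions with 9,183 annotated paragraphs and uses transformer models adapted to the Czech legal domain to detect argumentative paragraphs, classify legal argument types, and judge how formalistic a decision is, all with high accuracy. The proposed three-stage hybrid pipeline lowers computational cost while improving explainability, and the approach is replicable across jurisdictions.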

Details Motivation: Judicial decisions must be justified, but systematically analyzing judicial reasoning at scale remains difficult. Central and Eastern Europe is widely believed to practice formalistic judging, yet empirical support is lacking, so the claim needs to be tested quantitatively. Method: Build the MADON dataset of 272 Czech Supreme Court decisions, with expert annotations of 9,183 paragraphs for eight argument types plus holistic formalism labels; adapt transformer LLMs to the Czech legal domain by continued pretraining on a corpus of 300k Czech court decisions; compare methods for handling dataset imbalance (such as asymmetric loss and class weighting); propose a three-stage pipeline combining ModernBERT, Llama 3.1, and traditional feature-based machine learning. Result: The best models reach 82.6% macro-F1 on argumentative paragraph detection, 77.5% macro-F1 on classifying traditional types of legal argument, and 83.2% macro-F1 on formalism classification of decisions; the three-stage pipeline keeps performance high while reducing computational cost and increasing explainability; the empirical results challenge the prevailing narrative of pervasive formalism in CEE courts. Conclusion: Legal argument mining can reliably support judicial philosophy classification and shows broad potential for other tasks in computational legal studies; the methodology, datasets, models, and code are publicly available and reproducible, offering a practical path for cross-jurisdiction research. Abstract: Courts must justify their decisions, but systematically analyzing judicial reasoning at scale remains difficult. This study refutes claims about formalistic judging in Central and Eastern Europe (CEE) by developing automated methods to detect and classify judicial reasoning in Czech Supreme Courts' decisions using state-of-the-art natural language processing methods. We create the MADON dataset of 272 decisions from two Czech Supreme Courts with expert annotations of 9,183 paragraphs with eight argument types and holistic formalism labels for supervised training and evaluation. Using a corpus of 300k Czech court decisions, we adapt transformer LLMs for Czech legal domain by continued pretraining and experiment with methods to address dataset imbalance including asymmetric loss and class weighting. The best models successfully detect argumentative paragraphs (82.6% macro-F1), classify traditional types of legal argument (77.5% macro-F1), and classify decisions as formalistic/non-formalistic (83.2% macro-F1). Our three-stage pipeline combining ModernBERT, Llama 3.1, and traditional feature-based machine learning achieves promising results for decision classification while reducing computational costs and increasing explainability. Empirically, we challenge prevailing narratives about CEE formalism. This work shows that legal argument mining enables reliable judicial philosophy classification and shows the potential of legal argument mining for other important tasks in computational legal studies. Our methodology is easily replicable across jurisdictions, and our entire pipeline, datasets, guidelines, models, and source codes are available at https://github.com/trusthlt/madon.
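All three headline results are reported as macro-F1, which averages per-class F1 without weighting by class frequency and is therefore informative on imbalanced label sets like this one. A small, dependency-free sketch of the standard definition follows; the label strings and toy predictions are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy decision-level labels (hypothetical).
y_true = ["formalistic", "formalistic", "non-formalistic", "non-formalistic"]
y_pred = ["formalistic", "non-formalistic", "non-formalistic", "non-formalistic"]
print(round(macro_f1(y_true, y_pred), 4))
```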

[20] Improving Translation Quality by Selecting Better Data for LLM Fine-Tuning: A Comparative Analysis

Felipe Ribeiro Fujita de Mello,Hideyuki Takada

Main category: cs.CL

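TL;DR: Studies how data selection affects machine translation fine-tuning of open LLMs, finding that semantic selectors outperform lexical and geometry-based heuristics and that model performance changes substantially even when the selected data differ by less than 3%.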

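Details Motivation: Explore how different data selection strategies affect machine translation fine-tuning, in order to improve the performance of open LLMs. Method: Using Japanese-English corpora, compare five selectors under controlled training conditions: TF-IDF, COMET Kiwi, QuRate, FD-Score, and random selection. Result: Semantic selectors outperform lexical and geometry-based heuristics; even when the selected data differ by less than 3%, the effect on model performance is substantial. Conclusion: Fine-tuning is highly sensitive to data quality, and selecting data at the semantic level is more effective for improving translation performance. Abstract: We investigated the impact of data selection on machine translation fine-tuning for open LLMs. Using Japanese-English corpora, we compare five selectors: TF-IDF, COMET Kiwi, QuRate, FD-Score, and random selection, under controlled training conditions. We observed that semantic selectors consistently outperform lexical and geometry-based heuristics, and that even when the selected data differ by less than 3%, the impact on model performance is substantial, underscoring the sensitivity of fine-tuning to data quality.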

[21] Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction

Galann Pennec,Zhengyuan Liu,Nicholas Asher,Philippe Muller,Nancy F. Chen

Main category: cs.CL

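TL;DR: Proposes a clip selection method based on a lightweight video captioning model and a large language model for building efficient multimodal video summaries.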

Details Motivation: Existing vision-language models easily lose important information when processing long videos, and effective tools for low-cost analysis of long video content are lacking. Method: Split the video into short clips, generate a compact visual description of each clip with a lightweight video captioning model, and have a large language model select the K clips carrying the most important information to form a multimodal summary. Result: On the MovieSum dataset, the reference clips (less than 6% of the movie's length) are enough to support a complete multimodal summary, and the proposed selection performs close to these human-annotation-derived reference clips and clearly better than random selection, while keeping computational cost low. Conclusion: The method efficiently identifies key clips in long videos and greatly reduces computational overhead while preserving summary quality, making it suitable for low-cost, efficient long-video understanding. Abstract: Vision-Language Models (VLMs) are able to process increasingly longer videos. Yet, important visual information is easily lost throughout the entire context and missed by VLMs. Also, it is important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.

[22] CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare

Akash Ghosh,Srivarshinee Sridhar,Raghav Kaushik Ravi,Muhsin Muhsin,Sriparna Saha,Chirag Agarwal

Main category: cs.CL

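TL;DR: This paper presents CLINIC, a comprehensive benchmark for evaluating the trustworthiness of healthcare language models in multilingual settings. It covers 15 languages and five trustworthiness dimensions and exposes shortcomings of existing models in factual accuracy, fairness, safety, and related areas.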

Details Motivation: Existing language models are trained mainly on high-resource languages and perform poorly in healthcare scenarios in mid- and low-resource languages; their trustworthiness has not been systematically evaluated, which limits their use in global healthcare. Method: Propose the multilingual CLINIC benchmark, which systematically evaluates language models along five dimensions (truthfulness, fairness, safety, robustness, and privacy) through 18 tasks covering 15 languages and a range of healthcare topics. Result: Experiments show that existing language models have deficiencies in factual correctness, are biased across groups, and are vulnerable to privacy leakage and adversarial attacks. Conclusion: CLINIC provides a basis for improving the reliability and safety of healthcare language models in multilingual settings and helps advance their broad use in global healthcare. Abstract: Integrating language models (LMs) in healthcare systems holds great promise for improving medical workflows and decision-making. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained in high-resource languages, making them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages, posing significant challenges for deploying them in global healthcare contexts where linguistic diversity is key. In this work, we present CLINIC, a Comprehensive Multilingual Benchmark to evaluate the trustworthiness of language models in healthcare. CLINIC systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all the major continents), and encompassing a wide array of critical healthcare topics like disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, CLINIC lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.

[23] Mistake Notebook Learning: Selective Batch-Wise Context Optimization for In-Context Learning

Xuanbo Su,Yingfang Zhang,Hao Luo,Xiaoteng Liu,Leo Huang

Main category: cs.CL

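TL;DR: Mistake Notebook Learning (MNL) is a training-free framework that stores generalizable error-correction patterns through batch-wise error abstraction and a dynamic knowledge base; it clearly outperforms existing training-free methods on several complex reasoning tasks and approaches supervised fine-tuning performance.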

Details Motivation: Existing ways of adapting large models to tasks have drawbacks: fine-tuning is computationally heavy and suffers from catastrophic forgetting, while in-context learning has low robustness and learns poorly from mistakes, so a more efficient training-free adaptation mechanism that can keep improving is needed. Method: Propose Mistake Notebook Learning (MNL), which extracts abstract error patterns from multiple failed instances in a batch, stores them in a dynamic notebook, and uses hold-out validation to retain only guidance that outperforms the baseline, ensuring monotonic improvement. Result: On benchmarks including GSM8K, Spider, AIME, and KaggleDBQA, MNL clearly outperforms other training-free methods; on KaggleDBQA with Qwen3-8B it reaches 28% accuracy (a 47% relative gain), and it approaches supervised fine-tuning (93.9% vs 94.3% on GSM8K). Conclusion: MNL is an effective training-free adaptation framework that learns systematically from batches of errors and keeps improving, offering a strong and practical new paradigm for complex reasoning tasks. Abstract: Large language models (LLMs) adapt to tasks via gradient fine-tuning (heavy computation, catastrophic forgetting) or In-Context Learning (ICL: low robustness, poor mistake learning). To fix this, we introduce Mistake Notebook Learning (MNL), a training-free framework with a persistent knowledge base of abstracted error patterns. Unlike prior instance/single-trajectory memory methods, MNL uses batch-wise error abstraction: it extracts generalizable guidance from multiple failures, stores insights in a dynamic notebook, and retains only baseline-outperforming guidance via hold-out validation (ensuring monotonic improvement). We show MNL nearly matches Supervised Fine-Tuning (93.9% vs 94.3% on GSM8K) and outperforms training-free alternatives on GSM8K, Spider, AIME, and KaggleDBQA. On KaggleDBQA (Qwen3-8B), MNL hits 28% accuracy (47% relative gain), outperforming Memento (15.1%) and Training-Free GRPO (22.1) - proving it's a strong training-free alternative for complex reasoning.
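The hold-out validation step is what makes the notebook's quality non-decreasing: a candidate piece of guidance is kept only if it beats the current score. The sketch below shows one way such a gate could look, assuming a greedy one-at-a-time acceptance rule and a hypothetical `evaluate` callable that returns hold-out accuracy for a given notebook; neither detail comes from the paper.

```python
def update_notebook(notebook, candidates, evaluate):
    """Greedily add guidance entries that improve hold-out accuracy.

    `evaluate(entries)` is a hypothetical callable returning accuracy on a
    held-out set when the model is prompted with `entries`.
    """
    best = evaluate(notebook)
    for guidance in candidates:
        trial = notebook + [guidance]
        score = evaluate(trial)
        if score > best:
            # Keep the entry only when it strictly beats the current baseline.
            notebook, best = trial, score
    return notebook, best

# Toy stand-in for a real hold-out evaluation (made-up effect sizes).
effects = {"check units before answering": 0.05, "re-read the question": -0.02}

def toy_evaluate(entries):
    return 0.5 + sum(effects.get(entry, 0.0) for entry in entries)

kept, accuracy = update_notebook([], list(effects), toy_evaluate)
print(kept, accuracy)
```

Because rejected entries never enter the notebook, the hold-out score after each batch can only stay the same or rise.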

[24] Building Patient Journeys in Hebrew: A Language Model for Clinical Timeline Extraction

Kai Golan Hashiloni,Brenda Kasabe Nokai,Michal Shevach,Esthy Shemesh,Ronit Bartin,Anna Bergrin,Liran Harel,Nachum Dershowitz,Liat Nadai Arad,Kfir Bar

Main category: cs.CL

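TL;DR: Proposes a Hebrew medical language model based on DictaBERT 2.0, continually pre-trained on over five million de-identified hospital records, for extracting structured clinical timelines from electronic health records and building patient journeys.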

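Details Motivation: Extracting structured clinical timelines from Hebrew electronic health records to build complete patient journeys requires better language and domain adaptation than existing models provide. Method: Continue pre-training DictaBERT 2.0 on over five million de-identified hospital records and adapt its vocabulary; build two new annotated datasets (one from internal medicine and emergency departments, one from oncology) to evaluate extraction of temporal relations between events. Result: The model performs strongly on both new temporal-relation datasets; vocabulary adaptation improves token efficiency, and de-identification does not hurt downstream performance. Conclusion: The proposed Hebrew medical language model extracts clinical timelines efficiently and supports privacy-conscious medical NLP research; it is available to the research community under ethical restrictions. Abstract: We present a new Hebrew medical language model designed to extract structured clinical timelines from electronic health records, enabling the construction of patient journeys. Our model is based on DictaBERT 2.0 and continually pre-trained on over five million de-identified hospital records. To evaluate its effectiveness, we introduce two new datasets -- one from internal medicine and emergency departments, and another from oncology -- annotated for event temporal relations. Our results show that our model achieves strong performance on both datasets. We also find that vocabulary adaptation improves token efficiency and that de-identification does not compromise downstream performance, supporting privacy-conscious model development. The model is made available for research use under ethical restrictions.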

[25] Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs

Mohor Banerjee,Nadya Yuki Wangsajaya,Syed Ali Redha Alsagoff,Min Sen Tan,Zachary Choy Kit Chun,Alvin Chan Guo Wei

Main category: cs.CL

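TL;DR: Examines how three hallucination-reduction techniques for LLMs (CoVe, DoLa, RAG) affect creativity and finds opposing effects on divergent thinking: CoVe enhances it, DoLa suppresses it, and RAG has little effect, which gives guidance on trading off accuracy against creativity in scientific applications.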

Details Motivation: LLMs are highly capable but produce incorrect information (hallucinations), and the effect of existing hallucination-reduction methods on creative generation is unclear; this matters most in AI-assisted scientific research, which needs both factual accuracy and creativity. Method: Evaluate three hallucination-reduction techniques, Chain of Verification (CoVe), Decoding by Contrasting Layers (DoLa), and Retrieval-Augmented Generation (RAG), on several models of different sizes (LLaMA, Qwen, Mistral) using two creativity benchmarks (NeoCoder, CS4). Result: CoVe improves divergent thinking, DoLa suppresses creativity, and RAG has minimal impact on creativity. The results are consistent across model families and scales. Conclusion: Different hallucination-reduction techniques affect LLM creativity in markedly different ways, so choosing a method means weighing factual accuracy against creative exploration in scientific applications. Abstract: Large Language Models (LLMs) exhibit remarkable capabilities in natural language understanding and reasoning, but suffer from hallucination: the generation of factually incorrect content. While numerous methods have been developed to reduce hallucinations, their impact on creative generations remains unexplored. This gap is particularly critical for AI-assisted scientific discovery, which requires both factual accuracy and creative hypothesis generation. We investigate how three hallucination-reduction techniques: Chain of Verification (CoVe), Decoding by Contrasting Layers (DoLa), and Retrieval-Augmented Generation (RAG), affect creativity in LLMs. Evaluating multiple model families (LLaMA, Qwen, Mistral) at varying scales (1B - 70B parameters) on two creativity benchmarks (NeoCoder and CS4), we find that these methods have opposing effects on divergent creativity. CoVe enhances divergent thinking, DoLa suppresses it, and RAG shows minimal impact. Our findings provide guidance for selecting appropriate hallucination-reduction methods in scientific applications, where the balance between factual accuracy and creative exploration is crucial.

[26] Extending a Parliamentary Corpus with MPs' Tweets: Automatic Annotation and Evaluation Using MultiParTweet

Mevlüt Bagci,Ali Abusaleh,Daniel Baumartz,Giueseppe Abrami,Maxim Konca,Alexander Mehler

Main category: cs.CL

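TL;DR: This paper presents MultiParTweet, a multilingual tweet corpus linked to the German parliamentary corpus GerParCor, supporting comparative analysis of social media discourse and parliamentary debates. The corpus contains nearly 40,000 tweets with rich automatic text and media annotations, validated against manual annotation. The paper also provides the data collection tool TTLABTweetCrawler and shows that the annotation models can predict one another, with the vision-language model aligning best with human interpretation.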

Details Motivation: Enable cross-platform comparison of what politicians say on social media and in parliament, and address the shortcomings of existing corpora in multilingual coverage, multimodality, and automatic annotation. Method: Build the MultiParTweet corpus and link it to GerParCor; annotate emotion, sentiment, and topic with nine text-based models and one vision-language model (VLM); assess automatic annotation quality against a manually annotated subset; use the TTLABTweetCrawler tool for reproducible data collection; analyze how well each model's output can be predicted from the others. Result: MultiParTweet contains 39,546 tweets and 19,056 media items; automatic annotations agree well with manual annotations; model outputs are mutually predictable; human annotators prefer the VLM-generated annotations, suggesting that multimodal representations are closer to human interpretation. Conclusion: MultiParTweet is a human-validated resource for political communication research that integrates multimodal automatic annotations and supports cross-platform comparison; the accompanying TTLABTweetCrawler tool makes the data reproducible; mutual predictability and the advantage of the VLM point to the potential of multimodal methods in social media analysis. Abstract: Social media serves as a critical medium in modern politics because it both reflects politicians' ideologies and facilitates communication with younger generations. We present MultiParTweet, a multilingual tweet corpus from X that connects politicians' social media discourse with the German political corpus GerParCor, thereby enabling comparative analyses between online communication and parliamentary debates. MultiParTweet contains 39 546 tweets, including 19 056 media items. Furthermore, we enriched the annotation with nine text-based models and one vision-language model (VLM) to annotate MultiParTweet with emotion, sentiment, and topic annotations. Moreover, the automated annotations are evaluated against a manually annotated subset. MultiParTweet can be reconstructed using our tool, TTLABTweetCrawler, which provides a framework for collecting data from X. As a methodological demonstration, we examine whether the models can predict each other using the outputs of the remaining models. In summary, we provide MultiParTweet, a resource integrating automatic text and media-based annotations validated with human annotations, and TTLABTweetCrawler, a general-purpose X data collection tool. Our analysis shows that the models are mutually predictable. In addition, VLM-based annotations were preferred by human annotators, suggesting that multimodal representations align more with human interpretation.

[27] Visualizing token importance for black-box language models

Paulius Rauba,Qiyao Wei,Mihaela van der Schaar

Main category: cs.CL

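TL;DR: Proposes Distribution-Based Sensitivity Analysis (DBSA), a lightweight, model-agnostic method that evaluates how sensitive a black-box LLM's output is to each input token and supports quick visual exploration.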

Details Motivation: Existing LLM auditing methods mostly target isolated behaviors (such as bias detection) and do not give a systematic picture of token-level input dependence; high-stakes domains in particular need interpretability tools to ensure reliable deployment. Method: Propose Distribution-Based Sensitivity Analysis (DBSA), which analyzes how sensitive a black-box LLM's output is to each input token without distributional assumptions, supporting model-agnostic, lightweight sensitivity evaluation and visualization. Result: Without access to model internals, DBSA identifies an LLM's sensitivity to specific input tokens and surfaces issues that conventional interpretability methods may overlook. Conclusion: DBSA offers a practical, plug-and-play auditing tool for black-box LLMs that helps improve reliability and transparency in high-stakes applications. Abstract: We consider the problem of auditing black-box large language models (LLMs) to ensure they behave reliably when deployed in production settings, particularly in high-stakes domains such as legal, medical, and regulatory compliance. Existing approaches for LLM auditing often focus on isolated aspects of model behavior, such as detecting specific biases or evaluating fairness. We are interested in a more general question -- can we understand how the outputs of black-box LLMs depend on each input token? There is a critical need to have such tools in real-world applications that rely on inaccessible API endpoints to language models. However, this is a highly non-trivial problem, as LLMs are stochastic functions (i.e. two outputs will be different by chance), while computing prompt-level gradients to approximate input sensitivity is infeasible. To address this, we propose Distribution-Based Sensitivity Analysis (DBSA), a lightweight model-agnostic procedure to evaluate the sensitivity of the output of a language model for each input token, without making any distributional assumptions about the LLM. DBSA is developed as a practical tool for practitioners, enabling quick, plug-and-play visual exploration of LLMs' reliance on specific input tokens. Through illustrative examples, we demonstrate how DBSA can enable users to inspect LLM inputs and find sensitivities that may be overlooked by existing LLM interpretability methods.

[28] Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols

Björn Deiseroth,Max Henning Höth,Kristian Kersting,Letitia Parcalabescu

Main category: cs.CL

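TL;DR: Proposes a training framework based on the Merlin-Arthur protocol that treats a RAG system as an interactive proof system; by introducing adversarial evidence and explainable-AI methods, it improves the LLM's trustworthiness, reject behavior, and reliance on evidence in retrieval-augmented generation, substantially reducing hallucinations and improving retrieval quality.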

Details Motivation: Current RAG systems treat retrieved results as a weak heuristic rather than verifiable evidence, so LLMs hallucinate, rely on spurious evidence, or answer without sufficient support. A more reliable mechanism is needed so that the generator can tell valid information from misleading information and reason from evidence. Method: Build an interactive training framework modeled on the Merlin-Arthur (M/A) protocol: Merlin provides helpful evidence, Morgana injects adversarial, misleading context, and Arthur (the generator LLM) learns to judge within this mixed context; a linear-time XAI method identifies the evidence spans most influential to the prediction, which are then used to shape Arthur's responses; hard positives and negatives are generated automatically to improve the retriever. Result: Across three RAG datasets and two model families of varying sizes, M/A-trained LLMs show stronger groundedness, completeness, soundness, and reject behavior, with hallucinations substantially reduced; the retriever's recall and MRR also improve thanks to the automatically generated hard examples; the Explained Information Fraction (EIF) metric is introduced to quantify explanation fidelity separately from baseline errors. Conclusion: Autonomous interactive-proof-style supervision offers a principled and practical path to reliable RAG systems, turning retrieved documents from suggestions into verifiable evidence and enabling more trustworthy reasoning and generation. Abstract: Retrieval-augmented generation (RAG) models rely on retrieved evidence to guide large language model (LLM) generators, yet current systems treat retrieval as a weak heuristic rather than verifiable evidence. As a result, LLMs answer without support, hallucinate under incomplete or misleading context, and rely on spurious evidence. We introduce a training framework that treats the entire RAG pipeline -- both the retriever and the generator -- as an interactive proof system via an adaptation of the Merlin-Arthur (M/A) protocol. Arthur (the generator LLM) trains on questions of unknown provenance: Merlin provides helpful evidence, while Morgana injects adversarial, misleading context. Both use a linear-time XAI method to identify and modify the evidence most influential to Arthur. Consequently, Arthur learns to (i) answer when the context supports the answer, (ii) reject when evidence is insufficient, and (iii) rely on the specific context spans that truly ground the answer. We further introduce a rigorous evaluation framework to disentangle explanation fidelity from baseline predictive errors. This allows us to introduce and measure the Explained Information Fraction (EIF), which normalizes M/A certified mutual-information guarantees relative to model capacity and imperfect benchmarks. Across three RAG datasets and two model families of varying sizes, M/A-trained LLMs show improved groundedness, completeness, soundness, and reject behavior, as well as reduced hallucinations -- without needing manually annotated unanswerable questions. The retriever likewise improves recall and MRR through automatically generated M/A hard positives and negatives. Our results demonstrate that autonomous interactive-proof-style supervision provides a principled and practical path toward reliable RAG systems that treat retrieved documents not as suggestions, but as verifiable evidence.
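The retriever gains are reported in recall and MRR. For readers unfamiliar with these metrics, the following sketch implements their standard definitions; the document IDs and rankings are toy values, and the cutoff `k` used in the paper is not stated in the abstract.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_reciprocal_rank(rankings, relevants):
    """Average of 1/rank of the first relevant document per query (0 if none)."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(rankings, relevants):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Two toy queries with hypothetical document IDs.
rankings = [["d3", "d1", "d2"], ["d5", "d4"]]
relevants = [{"d1"}, {"d5"}]
print(mean_reciprocal_rank(rankings, relevants))
print(recall_at_k(["d3", "d1", "d2"], {"d1", "d2"}, k=2))
```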

[29] Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling

Keerthana Murugaraj,Salima Lamsiyah,Marten During,Martin Theobald

Main category: cs.CL

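TL;DR: This study uses BERTopic to analyze public discourse on nuclear power and nuclear safety in newspapers from 1955 to 2018, overcoming the limitations of traditional topic models on historical texts.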

Details Motivation: Traditional topic models such as LDA struggle to capture the complex, evolving themes in historical texts and must also cope with OCR noise and very large volumes of text. Method: Apply BERTopic, a neural topic-modeling approach based on transformer embeddings, to extract and classify topics from newspaper articles published between 1955 and 2018, and analyze how the topics evolve over time. Result: The study identifies co-occurrence patterns between nuclear power and nuclear weapons themes and how their importance shifts over time, revealing long-term trends in public discourse. Conclusion: BERTopic scales well and is sensitive to context, offers richer insight into historical texts than traditional methods, and can advance research in history and the social sciences. Abstract: Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic. This neural topic-modeling approach leverages transformer-based embeddings to extract and classify topics, which, despite its growing popularity, still remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and their shifts in topic importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work.

[30] Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks

Sergey Pankratov,Dan Alistarh

Main category: cs.CL

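TL;DR: By drawing a parallel between speculative generation and branching random walks, this paper establishes tight lower bounds on the runtime of any deterministic speculative generation algorithm, revealing the theoretical limits of parallel token generation, and confirms the tightness of the bounds experimentally.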

Details Motivation: Speculative generation accelerates LLM inference, but the fundamental limits on the achievable speedup are not well understood, so theoretical lower bounds are needed to guide system design. Method: Draw a parallel between the token generation process and branching random walks, analyze the optimal draft tree selection problem, and derive a theoretical upper bound on the expected number of successfully predicted tokens. Result: Under basic assumptions, the expected number of tokens successfully predicted per speculative iteration satisfies $\mathbb{E}[X] \leq (\mu + \mu_{(2)})\log(P)/\mu^2 + O(1)$, where $P$ is the verifier's capacity, $\mu$ is the expected entropy of the output distribution, and $\mu_{(2)}$ is the expected second log-moment. Conclusion: The result gives the first tight lower bounds for speculative generation algorithms, provides a theoretical basis for designing future speculative decoding systems, and is validated by experiments on Llama models. Abstract: Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable speedup remain poorly understood. In this work, we establish the first ``tight'' lower bounds on the runtime of any deterministic speculative generation algorithm. This is achieved by drawing a parallel between the token generation process and branching random walks, which allows us to analyze the optimal draft tree selection problem. We prove, under basic assumptions, that the expected number of tokens successfully predicted per speculative iteration is bounded as $\mathbb{E}[X] \leq (\mu + \mu_{(2)})\log(P)/\mu^2 + O(1)$, where $P$ is the verifier's capacity, $\mu$ is the expected entropy of the verifier's output distribution, and $\mu_{(2)}$ is the expected second log-moment. This result provides new insights into the limits of parallel token generation, and could guide the design of future speculative decoding systems. Empirical evaluations on Llama models validate our theoretical predictions, confirming the tightness of our bounds in practical settings.
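To get a feel for how the bound scales, the sketch below evaluates its leading term for given values of the entropy, the second log-moment, and the verifier capacity. It assumes natural logarithms and, in the helper, that the second log-moment of a single distribution is the sum of p·(log p)² over tokens; the paper's exact conventions and the size of the O(1) remainder are not given in the abstract, so treat the numbers as illustrative only.

```python
import math

def entropy_moments(probs):
    """Return (entropy, second log-moment) of one next-token distribution, in nats.

    Assumption: second log-moment = sum_i p_i * (log p_i)^2.
    """
    support = [p for p in probs if p > 0]
    entropy = sum(-p * math.log(p) for p in support)
    second = sum(p * math.log(p) ** 2 for p in support)
    return entropy, second

def bound_leading_term(mu, mu2, capacity):
    """Leading term (mu + mu2) * log(P) / mu^2 of the bound; O(1) is omitted."""
    if mu <= 0 or capacity < 1:
        raise ValueError("mu must be positive and capacity at least 1")
    return (mu + mu2) * math.log(capacity) / mu ** 2

# Example: a uniform distribution over 4 tokens and a verifier capacity of 64.
mu, mu2 = entropy_moments([0.25, 0.25, 0.25, 0.25])
print(bound_leading_term(mu, mu2, capacity=64))
```

The logarithmic dependence on the capacity is the notable part: doubling the number of tokens verified in parallel adds only a constant to the leading term.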

[31] SUMFORU: An LLM-Based Review Summarization Framework for Personalized Purchase Decision Support

Yuming Feng,Xinrui Jiang

Main category: cs.CL

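TL;DR: This paper proposes SUMFORU, a steerable review summarization framework that combines user personas with a two-stage alignment procedure (supervised fine-tuning via asymmetric knowledge distillation, then reinforcement learning with AI feedback) to produce personalized product review summaries with better consistency, grounding, and alignment with user preferences.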

Details Motivation: Existing LLM-based review summarizers are too generic to produce summaries tailored to individual preferences, and online review data are noisy, which slows decision-making. A method that customizes summaries to user personas is therefore needed. Method: Propose the SUMFORU framework: first build a high-quality data pipeline from the Amazon 2023 Review Dataset, then apply a two-stage alignment strategy. Stage one is persona-aware Supervised Fine-Tuning (SFT) via asymmetric knowledge distillation; stage two is Reinforcement Learning with AI Feedback (RLAIF) using a preference estimator to capture fine-grained personalization signals. Result: Under rule-based, LLM-based, and human evaluation, SUMFORU outperforms existing methods in consistency, grounding, and preference alignment, and generalizes well to unseen product categories. Conclusion: By introducing user personas and two-stage alignment training, SUMFORU produces review summaries that better match individual preferences and shows the promise of steerable pluralistic alignment for next-generation personalized decision-support systems. Abstract: Online product reviews contain rich but noisy signals that overwhelm users and hinder effective decision-making. Existing LLM-based summarizers remain generic and fail to account for individual preferences, limiting their practical utility. We propose SUMFORU, a steerable review summarization framework that aligns outputs with explicit user personas to support personalized purchase decisions. Our approach integrates a high-quality data pipeline built from the Amazon 2023 Review Dataset with a two-stage alignment procedure: (1) persona-aware Supervised Fine-Tuning (SFT) via asymmetric knowledge distillation, and (2) Reinforcement Learning with AI Feedback (RLAIF) using a preference estimator to capture fine-grained, persona-relevant signals. We evaluate the model across rule-based, LLM-based, and human-centered metrics, demonstrating consistent improvements in consistency, grounding, and preference alignment. Our framework achieves the highest performance across all evaluation settings and generalizes effectively to unseen product categories. Our results highlight the promise of steerable pluralistic alignment for building next-generation personalized decision-support systems.

cs.CV [Back]

[32] Leveraging Text Guidance for Enhancing Demographic Fairness in Gender Classification

Anoop Krishnan

Main category: cs.CV

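TL;DR: This paper proposes a text-guided multimodal approach that uses image-text matching and fusion strategies to improve fairness and accuracy in facial gender classification without requiring demographic labels.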

Details Motivation: Address the racial and gender bias common in facial gender classification, and in particular achieve fairer AI decisions when explicit demographic labels are unavailable. Method: Use two strategies, Image Text Matching (ITM) guidance and image-text fusion, which exploit semantic information from image captions during training to strengthen cross-modal representations. Result: Experiments on benchmark datasets show that the proposed methods reduce bias across gender and racial groups, improve overall classification accuracy, and offer good interpretability. Conclusion: Text guidance is an effective, application-agnostic way to mitigate demographic bias in facial analysis and to move toward fairer AI systems. Abstract: In the quest for fairness in artificial intelligence, novel approaches to enhance it in facial image based gender classification algorithms using text guided methodologies are presented. The core methodology involves leveraging semantic information from image captions during model training to improve generalization capabilities. Two key strategies are presented: Image Text Matching (ITM) guidance and Image Text fusion. ITM guidance trains the model to discern fine grained alignments between images and texts to obtain enhanced multimodal representations. Image text fusion combines both modalities into comprehensive representations for improved fairness. Extensive experiments conducted on benchmark datasets demonstrate these approaches effectively mitigate bias and improve accuracy across gender and racial groups compared to existing methods. Additionally, the unique integration of textual guidance underscores an interpretable and intuitive training paradigm for computer vision systems. By scrutinizing the extent to which semantic information reduces disparities, this research offers valuable insights into cultivating more equitable facial analysis algorithms. The proposed methodologies contribute to addressing the pivotal challenge of demographic bias in gender classification from facial images. Furthermore, this technique operates in the absence of demographic labels and is application agnostic.

[33] SoccerMaster: A Vision Foundation Model for Soccer Understanding

Haolin Yang,Jiayuan Rao,Haoning Wu,Weidi Xie

Main category: cs.CV

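TL;DR: This paper presents SoccerMaster, the first unified foundation model for soccer visual understanding. A multi-task pretraining framework integrates tasks ranging from fine-grained perception to semantic reasoning, a large-scale dataset called SoccerFactory is built for pretraining, and experiments show that the model outperforms specialized models on multiple downstream tasks.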

Details Motivation: Existing soccer understanding research usually relies on isolated, task-specific expert models and lacks a general framework that handles diverse tasks in a unified way, which limits generalization and practical value. Method: Propose SoccerMaster, trained with a supervised multi-task pretraining framework; develop an automated data annotation pipeline and integrate several existing soccer video datasets to build SoccerFactory, a large-scale pretraining dataset. Result: On multiple downstream tasks (such as athlete detection and event classification), SoccerMaster consistently outperforms task-specific expert models, showing stronger generalization. Conclusion: As the first vision foundation model dedicated to soccer, SoccerMaster validates the unified modeling paradigm and provides new infrastructure and research directions for sports analytics. Abstract: Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. Unlike prior works that typically rely on isolated, task-specific expert models, this work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception (e.g., athlete detection) to semantic reasoning (e.g., event classification). Specifically, our contributions are threefold: (i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse understanding tasks within a single framework via supervised multi-task pretraining; (ii) we develop an automated data curation pipeline to generate scalable spatial annotations, and integrate them with various existing soccer video datasets to construct SoccerFactory, a comprehensive pretraining data resource; and (iii) we conduct extensive evaluations demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, highlighting its breadth and superiority. The data, code, and model will be publicly available.

[34] Weakly Supervised Tuberculosis Localization in Chest X-rays through Knowledge Distillation

Marshal Ashif Shawkat,Moidul Hasan,Taufiq Hasan

Main category: cs.CV

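TL;DR: This study uses knowledge distillation to train CNN models that rely less on spurious correlations in tuberculosis (TB) image classification and that localize TB-related abnormalities without bounding-box annotations, reaching an mIOU of 0.2428 on the TBX11k dataset, with the student model outperforming the teacher.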

Details Motivation: Resource-limited regions lack experts to interpret images, existing machine learning models tend to rely on spurious correlations and generalize poorly, and high-quality medical image annotation is expensive, so a low-cost, robust TB screening method is needed. Method: Train CNN models on the TBX11k dataset with a teacher-student framework based on the ResNet50 architecture, using knowledge distillation to reduce reliance on spurious features and to localize lesion regions without additional bounding-box annotations. Result: The method achieves an mIOU score of 0.2428 on TBX11k, and the student model consistently outperforms the teacher across several metrics, showing greater robustness and better generalization. Conclusion: The proposed knowledge distillation approach improves robustness and interpretability in TB image recognition, reduces dependence on expert annotation, and has potential for deployment in diverse clinical settings. Abstract: Tuberculosis (TB) remains one of the leading causes of mortality worldwide, particularly in resource-limited countries. Chest X-ray (CXR) imaging serves as an accessible and cost-effective diagnostic tool but requires expert interpretation, which is often unavailable. Although machine learning models have shown high performance in TB classification, they often depend on spurious correlations and fail to generalize. Besides, building large datasets featuring high-quality annotations for medical images demands substantial resources and input from domain specialists, and typically involves several annotators reaching agreement, which results in enormous financial and logistical expenses. This study repurposes knowledge distillation to train CNN models that reduce spurious correlations and localize TB-related abnormalities without requiring bounding-box annotations. By leveraging a teacher-student framework with the ResNet50 architecture, the proposed method, trained on the TBX11k dataset, achieves an mIOU score of 0.2428. Experimental results further reveal that the student model consistently outperforms the teacher, underscoring improved robustness and potential for broader clinical deployment in diverse settings.
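The localization result is reported as mIOU, the mean intersection-over-union between predicted and ground-truth regions. The sketch below computes it for axis-aligned boxes given as `(x1, y1, x2, y2)`; how the paper derives predicted regions (for example by thresholding activation maps) is not described in the abstract, so the box representation and the toy coordinates here are assumptions.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred_boxes, gt_boxes):
    """Average IoU over paired predicted and ground-truth boxes."""
    ious = [box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(ious) / len(ious) if ious else 0.0

# Toy example: one partial overlap and one exact match.
preds = [(0, 0, 2, 2), (0, 0, 1, 1)]
truths = [(1, 1, 3, 3), (0, 0, 1, 1)]
print(mean_iou(preds, truths))
```

An mIOU of 0.2428 is low by fully supervised detection standards, which is expected when no box annotations are used during training.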

[35] Synthetic Vasculature and Pathology Enhance Vision-Language Model Reasoning

Chenjun Li,Cheng Wan,Laurin Lux,Alexander Berger,Richard B. Rosen,Martin J. Menten,Johannes C. Paetzold

Main category: cs.CV

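TL;DR: Proposes SVR, a framework that synthesizes retinal vasculature images with matching fine-grained pathology descriptions, builds the 100,000-pair OCTA-100K-SVR dataset, and yields a vision-language model with strong zero-shot classification performance and clinical explanation ability on real OCTA images.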

Details Motivation: Medical domains such as OCTA image analysis lack large, precisely paired image-text datasets for training vision-language models with reasoning ability, which limits their use in clinical diagnosis. Method: The authors propose the Synthetic Vasculature Reasoning (SVR) framework, which controllably synthesizes retinal vasculature images with diabetic retinopathy features (such as capillary dropout and microaneurysms) and automatically generates matching fine-grained reasoning text, producing the OCTA-100K-SVR dataset used to train a general-purpose VLM (Qwen3-VL-8b). Result: On real OCTA images, the trained model reaches a zero-shot balanced classification accuracy of 89.67%, exceeding supervised baselines; expert evaluation shows that it significantly improves explanation quality and pathology localization on clinical data. Conclusion: SVR effectively mitigates the scarcity of high-quality image-text data in specialized medical domains and offers a practical path to training interpretable medical vision-language models. Abstract: Vision-Language Models (VLMs) offer a promising path toward interpretable medical diagnosis by allowing users to ask about clinical explanations alongside predictions and across different modalities. However, training VLMs for detailed reasoning requires large-scale image-text datasets. In many specialized domains, for example in reading Optical Coherence Tomography Angiography (OCTA) images, such precise text with grounded description of pathologies is scarce or even non-existent. To overcome this bottleneck, we introduce Synthetic Vasculature Reasoning (SVR), a framework that controllably synthesizes images and corresponding text: realistic retinal vasculature with Diabetic Retinopathy (DR) features (capillary dropout, microaneurysms, neovascularization, and tortuosity), together with automatically generated granular reasoning texts. Based on this we curate OCTA-100K-SVR, an OCTA image-reasoning dataset with 100,000 pairs. Our experiments show that a general-purpose VLM (Qwen3-VL-8b) trained on the dataset achieves a zero-shot balanced classification accuracy of 89.67% on real OCTA images, outperforming supervised baselines. Through human expert evaluation we also demonstrate that it significantly enhances explanation quality and pathology localization on clinical data.
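
Since the headline number is a balanced accuracy, a short sketch of how that metric differs from plain accuracy may help. The labels below are made up for illustration and are not data from the paper; `balanced_accuracy` is a hypothetical helper that averages per-class recall.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally
    regardless of how many samples it has."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy, imbalanced example (illustrative labels only).
y_true = ["DR"] * 2 + ["healthy"] * 8
y_pred = ["DR", "healthy"] + ["healthy"] * 8
score = balanced_accuracy(y_true, y_pred)
```

In this toy case nine of ten predictions are right, yet the minority class is recalled only half the time, so the balanced score is noticeably lower than plain accuracy. That is why the metric suits imbalanced clinical data.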

[36] VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation

Felix O'Mahony,Roberto Cipolla,Ayush Tewari

Main category: cs.CV

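TL;DR: Proposes VDAWorld, a framework in which a vision-language model (VLM) builds a simulatable abstract scene representation and pairs it with adaptive physics simulation to produce high-quality, logically consistent predictions of dynamic scenes.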

Details Motivation: Generative video models often violate physical and logical rules, lack interactivity, and are poorly suited to building structured, queryable worlds, so a more transparent and controllable world-modeling paradigm is needed. Method: An image-caption pair is distilled into a tractable abstract representation; the VLM acts as an agent that selects vision tools to build a 2D or 3D scene and matches it with a suitable physics simulator to roll out the dynamics, thereby inferring latent dynamics from a static scene. Result: Experiments show that VDAWorld produces high-quality simulations across a variety of dynamic scenarios, with good adaptability and simulation accuracy. Conclusion: By combining intelligent abstraction with adaptive simulation, VDAWorld offers a more structured, interpretable, and physically plausible route to world modeling. Abstract: Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image-caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and accordingly chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. VDAWorld can then infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing high-quality simulations across a wide range of dynamic scenarios.

[37] E-CHUM: Event-based Cameras for Human Detection and Urban Monitoring

Jack Brady,Andrew Dailey,Kristen Schang,Zo Vic Shong

Main category: cs.CV

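TL;DR: This paper surveys the use of event-based cameras in the study of urban dynamics, discussing their advantages, challenges, and combination with machine learning, and points to the potential of multi-sensor fusion.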

Details Motivation: Traditional urban monitoring methods have limitations, and more efficient, privacy-preserving technology is needed to understand urban dynamics. Method: The survey analyzes the characteristics of event-based cameras, their application scenarios, their fusion with other sensors, and associated machine learning techniques. Result: Event-based cameras can operate in low light and help preserve privacy, and multi-sensor fusion can improve their performance. Conclusion: Event-based cameras are a promising tool for capturing information about urban dynamics, and combining them with other sensors and machine learning can advance the field. Abstract: Understanding human movement and city dynamics has always been challenging. From traditional methods of manually observing the city's inhabitants, to using cameras, to now using sensors and more complex technology, the field of urban monitoring has evolved greatly. Still, there is more that can be done to unlock better practices for understanding city dynamics. This paper surveys how the study of urban dynamics has evolved, with a particular focus on event-based cameras. Event-based cameras capture changes in light intensity instead of the RGB values that traditional cameras do. They offer unique abilities, like the ability to work in low light, that can make them advantageous compared to other sensors. Through an analysis of event-based cameras, their applications, their advantages and challenges, and machine learning applications, we propose event-based cameras as a medium for capturing information to study urban dynamics. They offer the ability to capture important information while maintaining privacy. We also suggest multi-sensor fusion of event-based cameras and other sensors in the study of urban dynamics. Combining event-based cameras with infrared, event-LiDAR, or vibration sensing has the potential to enhance the ability of event-based cameras and overcome the challenges that event-based cameras have.

[38] Vision-Language Models for Infrared Industrial Sensing in Additive Manufacturing Scene Description

Nazanin Mahjourian,Vinh Nguyen

Main category: cs.CV

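TL;DR: This paper presents VLM-IRIS, a zero-shot framework that applies vision-language models (VLMs) to infrared industrial sensing; by preprocessing infrared images into an RGB-compatible format, it detects workpiece presence on a 3D printer bed without any retraining.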

Details Motivation: Manufacturing environments are often dark or enclosed inside machines, where conventional vision systems struggle and infrared cameras have an advantage; however, existing vision-language models are trained only on RGB data and cannot interpret infrared data, so a zero-shot method that adapts directly to infrared input is needed. Method: The VLM-IRIS framework converts infrared images captured by a FLIR Boson sensor into a magma color representation and feeds them to a CLIP ViT-B/32 encoder; centroid prompt ensembling is used to improve classification, and the whole process requires no fine-tuning or retraining. Result: On workpiece presence detection on a 3D printer bed, VLM-IRIS achieves high-accuracy zero-shot recognition of thermal images without any labeled data or model retraining. Conclusion: VLM-IRIS extends existing vision-language models to infrared thermal imaging, showing that simple preprocessing and prompt engineering are enough for CLIP-style models to interpret infrared images, which suits label-free industrial monitoring. Abstract: Many manufacturing environments operate in low-light conditions or within enclosed machines where conventional vision systems struggle. Infrared cameras provide complementary advantages in such environments. Simultaneously, supervised AI systems require large labeled datasets, which makes zero-shot learning frameworks more practical for applications including infrared cameras. Recent advances in vision-language foundation models (VLMs) offer a new path in zero-shot predictions from paired image-text representations. However, current VLMs cannot understand infrared camera data since they are trained on RGB data. This work introduces VLM-IRIS (Vision-Language Models for InfraRed Industrial Sensing), a zero-shot framework that adapts VLMs to infrared data by preprocessing infrared images captured by a FLIR Boson sensor into RGB-compatible inputs suitable for CLIP-based encoders. We demonstrate zero-shot workpiece presence detection on a 3D printer bed where temperature differences between the build plate and workpieces make the task well-suited for thermal imaging. VLM-IRIS converts the infrared images to magma representation and applies centroid prompt ensembling with a CLIP ViT-B/32 encoder to achieve high accuracy on infrared images without any model retraining. These findings demonstrate that the proposed improvements to VLMs can be effectively extended to thermal applications for label-free monitoring.
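
The abstract names centroid prompt ensembling but does not spell out the procedure. One common reading, sketched here as an assumption, is to embed several text prompts per class, average the unit-normalized embeddings into one centroid per class, and pick the class whose centroid has the highest cosine similarity with the image embedding. The two-dimensional vectors and class names below are toy stand-ins for real CLIP embeddings.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def centroid(prompt_embeddings):
    """Average unit-normalized prompt embeddings, then renormalize."""
    unit = [normalize(e) for e in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)

def classify(image_embedding, class_prompts):
    """Return the best class and the cosine score for every class."""
    img = normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(img, centroid(embs)))
        for name, embs in class_prompts.items()
    }
    return max(scores, key=scores.get), scores

# Toy embeddings standing in for encoded prompts (hypothetical values).
class_prompts = {
    "workpiece present": [[1.0, 0.1], [0.9, 0.2]],
    "empty bed": [[0.1, 1.0], [0.2, 0.8]],
}
label, scores = classify([0.8, 0.3], class_prompts)
```

Averaging several phrasings per class tends to make the decision less sensitive to the wording of any single prompt, which matters when the input domain (thermal images) is far from the encoder's training data.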

[39] VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction

Weitai Kang,Jason Kuen,Mengwei Ren,Zijun Wei,Yan Yan,Kangning Liu

Main category: cs.CV

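TL;DR: This paper proposes VGent, a modular encoder-decoder architecture for multi-target visual grounding that separates high-level reasoning from bounding-box prediction, substantially improving performance while keeping inference fast.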

Details Motivation: Existing visual grounding models either decode auto-regressively, which is slow and prone to hallucination, or re-align an LLM in a way that damages its reasoning ability, so a method is needed that exploits the reasoning power of MLLMs while avoiding these drawbacks. Method: VGent uses a frozen MLLM as the encoder for reasoning, while the decoder takes high-quality boxes proposed by detectors as queries and selects target boxes through cross-attention; it also introduces QuadThinker (RL-based training), mask-aware labels, and global target recognition. Result: On multi-target visual grounding benchmarks, VGent improves F1 by +20.6% over prior methods and, under visual reference challenges, improves gIoU by +8.2% and cIoU by +5.8%, while keeping inference latency fast and constant. Conclusion: Through its modular design, VGent combines the reasoning ability of MLLMs with the strengths of object detection, avoids the pitfalls of auto-regressive decoding, and improves both performance and efficiency. Abstract: Current visual grounding models are either based on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and risks hallucinations, or on re-aligning an LLM with vision features to learn new special or object tokens for grounding, which may undermine the LLM's pretrained reasoning ability. In contrast, we propose VGent, a modular encoder-decoder architecture that explicitly disentangles high-level reasoning and low-level bounding box prediction. Specifically, a frozen MLLM serves as the encoder to provide untouched powerful reasoning capabilities, while a decoder takes high-quality boxes proposed by detectors as queries and selects target box(es) via cross-attending to the encoder's hidden states. This design fully leverages advances in both object detection and MLLM, avoids the pitfalls of auto-regressive decoding, and enables fast inference. Moreover, it supports modular upgrades of both the encoder and decoder to benefit the whole system: we introduce (i) QuadThinker, an RL-based training paradigm for enhancing multi-target reasoning ability of the encoder; (ii) mask-aware label for resolving detection-segmentation ambiguity; and (iii) global target recognition to improve the recognition of all the targets which benefits the selection among augmented proposals.
Experiments on multi-target visual grounding benchmarks show that VGent achieves a new state-of-the-art with +20.6% F1 improvement over prior methods, and further boosts gIoU by +8.2% and cIoU by +5.8% under visual reference challenges, while maintaining constant, fast inference latency.

[40] Information-driven Fusion of Pathology Foundation Models for Enhanced Disease Characterization

Brennan Flannery,Thomas DeSilvio,Jane Nguyen,Satish E. Viswanath

Main category: cs.CV

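TL;DR: This study proposes a correlation-guided intelligent fusion strategy for integrating multiple pathology foundation models (FMs), which outperforms single models and naive fusion on grading and staging tasks for kidney, prostate, and rectal cancer.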

Details Motivation: Although many foundation models perform well on pathology tasks, the complementarity and redundancy of their embedding spaces and the biological interpretation of their features remain poorly understood, so an effective way to integrate information from multiple models is needed to improve performance and interpretability. Method: Three fusion schemes (majority voting, feature concatenation, and correlation-guided intelligent fusion) are applied to tile-level and slide-level pathology foundation models and evaluated on three cancer datasets using patient-stratified cross-validation. Result: Intelligent fusion outperforms the best single model and naive fusion in all three cancers; global similarity analysis shows that the embedding spaces are highly aligned while local neighborhood agreement is lower, indicating complementary fine-grained information; attention maps show that intelligent fusion focuses more on tumor regions. Conclusion: Correlation-guided intelligent fusion yields compact, task-tailored representations that improve predictive performance and interpretability in downstream computational pathology tasks. Abstract: Foundation models (FMs) have demonstrated strong performance across diverse pathology tasks. While there are similarities in the pre-training objectives of FMs, there is still limited understanding of their complementarity, redundancy in embedding spaces, or biological interpretation of features. In this study, we propose an information-driven, intelligent fusion strategy for integrating multiple pathology FMs into a unified representation and systematically evaluate its performance for cancer grading and staging across three distinct diseases. Diagnostic H&E whole-slide images from kidney (519 slides), prostate (490 slides), and rectal (200 slides) cancers were dichotomized into low versus high grade or stage. Both tile-level FMs (Conch v1.5, MUSK, Virchow2, H-Optimus1, Prov-Gigapath) and slide-level FMs (TITAN, CHIEF, MADELEINE) were considered to train downstream classifiers. We then evaluated three FM fusion schemes at both tile and slide levels: majority-vote ensembling, naive feature concatenation, and intelligent fusion based on correlation-guided pruning of redundant features. Under patient-stratified cross-validation with hold-out testing, intelligent fusion of tile-level embeddings yielded consistent gains in classification performance across all three cancers compared with the best single FMs and naive fusion. Global similarity metrics revealed substantial alignment of FM embedding spaces, contrasted by lower local neighborhood agreement, indicating complementary fine-grained information across FMs.
Attention maps showed that intelligent fusion yielded concentrated attention on tumor regions while reducing spurious focus on benign regions. Our findings suggest that intelligent, correlation-guided fusion of pathology FMs can yield compact, task-tailored representations that enhance both predictive performance and interpretability in downstream computational pathology tasks.
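
The abstract describes the intelligent fusion only as correlation-guided pruning of redundant features, without giving the exact rule. As one plausible reading, the sketch below concatenates feature columns from several models and greedily keeps a column only if its absolute Pearson correlation with every already-kept column is below a threshold. The threshold of 0.9 and the greedy order are assumptions, not details from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)

def prune_correlated(columns, threshold=0.9):
    """Return indices of columns kept after greedy redundancy pruning."""
    kept = []
    for idx, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(idx)
    return kept

# Toy concatenated features: column 1 nearly duplicates column 0.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],
    [4.0, 1.0, 3.0, 2.0],
]
kept = prune_correlated(columns)
```

Compared with naive concatenation, a rule of this kind shrinks the fused representation while retaining columns that carry information not already present, which is consistent with the compact representations the abstract reports.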

[41] Learning from a Generative Oracle: Domain Adaptation for Restoration

Yuyang Hu,Mojtaba Sahraee-Ardakan,Arpit Bansal,Kangfu Mei,Christian Qi,Peyman Milanfar,Mauricio Delbracio

Main category: cs.CV

TL;DR: Proposes LEGO, a framework that uses a generative prior model to adapt image restoration models to new domains without paired data.

Details Motivation: Pre-trained image restoration models perform poorly on real-world, out-of-distribution degradations, and no ground truth is available for conventional adaptation. Method: A three-stage framework: (1) obtain initial restorations from the pre-trained model; (2) refine them with a frozen, large-scale generative model to serve as pseudo-ground-truths; (3) fine-tune with mixed supervision that combines the original data with the pseudo-pairs. Result: Performance improves significantly on multiple real-world benchmarks, effectively narrowing the domain gap. Conclusion: LEGO enables efficient post-training domain adaptation without paired data or architectural changes, improving generalization while preserving the original model's robustness. Abstract: Pre-trained image restoration models often fail on real-world, out-of-distribution degradations due to significant domain gaps. Adapting to these unseen domains is challenging, as out-of-distribution data lacks ground truth, and traditional adaptation methods often require complex architectural changes. We propose LEGO (Learning from a Generative Oracle), a practical three-stage framework for post-training domain adaptation without paired data. LEGO converts this unsupervised challenge into a tractable pseudo-supervised one. First, we obtain initial restorations from the pre-trained model. Second, we leverage a frozen, large-scale generative oracle to refine these estimates into high-quality pseudo-ground-truths. Third, we fine-tune the original model using a mixed-supervision strategy combining in-distribution data with these new pseudo-pairs. This approach adapts the model to the new distribution without sacrificing its original robustness or requiring architectural modifications. Experiments demonstrate that LEGO effectively bridges the domain gap, significantly improving performance on diverse real-world benchmarks.

[42] Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

Bowen Wen,Shaurya Dewan,Stan Birchfield

Main category: cs.CV

TL;DR: Proposes Fast-FoundationStereo, the first stereo model to combine strong zero-shot generalization with real-time frame rates.

Details Motivation: Existing stereo foundation models generalize well zero-shot but are too computationally expensive for real-time use, while efficient models sacrifice robustness and depend on costly domain-specific fine-tuning; a balance between speed and generalization is needed. Method: A divide-and-conquer acceleration strategy: (1) knowledge distillation to compress the hybrid backbone; (2) blockwise neural architecture search to optimize the cost filtering design automatically under latency constraints; (3) structured pruning to reduce redundancy in the iterative refinement module. In addition, 1.4M in-the-wild pseudo-labeled stereo pairs are curated to support training. Result: The model runs more than 10x faster than FoundationStereo with zero-shot accuracy close to the original, reaching state-of-the-art performance among real-time methods. Conclusion: Fast-FoundationStereo combines real-time speed with strong zero-shot generalization, offering a practical way to deploy high-performance stereo matching efficiently. Abstract: Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rates. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: https://nvlabs.github.io/Fast-FoundationStereo/

[43] Learning complete and explainable visual representations from itemized text supervision

Yiwei Lyu,Chenhui Zhao,Soumyanil Banerjee,Shixuan Liu,Akshay Rao,Akhil Kondepudi,Honglak Lee,Todd C. Hollon

Main category: cs.CV

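TL;DR: Proposes ItemizedCLIP, a framework for learning complete and explainable visual representations from itemized text supervision, achieving better zero-shot performance and fine-grained interpretability than baselines across several domains.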

Details Motivation: Existing language-supervision methods for vision models struggle in non-object-centric domains where several semantically independent text items describe the same image, and they do not model item independence or representation completeness. Method: ItemizedCLIP uses a cross-attention module to produce visual embeddings conditioned on each text item, together with tailored objectives that jointly optimize item independence and representation completeness. Result: On brain MRI, head CT, chest CT, remote sensing, and one synthetic dataset, ItemizedCLIP significantly outperforms baselines on zero-shot classification and retrieval and shows stronger fine-grained interpretability. Conclusion: ItemizedCLIP makes effective use of itemized text supervision to learn visual representations that are semantically grounded, item-differentiable, complete, and visually interpretable, and it applies to a range of specialized visual domains. Abstract: Training vision models with language supervision enables general and transferable representations. However, many visual domains, especially non-object-centric domains such as medical imaging and remote sensing, contain itemized text annotations: multiple text items describing distinct and semantically independent findings within a single image. Such supervision differs from standard multi-caption supervision, where captions are redundant or highly overlapping. Here, we introduce ItemizedCLIP, a framework for learning complete and explainable visual representations from itemized text supervision. ItemizedCLIP employs a cross-attention module to produce text item-conditioned visual embeddings and a set of tailored objectives that jointly enforce item independence (distinct regions for distinct items) and representation completeness (coverage of all items). Across four domains with naturally itemized text supervision (brain MRI, head CT, chest CT, remote sensing) and one additional synthetically itemized dataset, ItemizedCLIP achieves substantial improvements in zero-shot performance and fine-grained interpretability over baselines. The resulting ItemizedCLIP representations are semantically grounded, item-differentiable, complete, and visually interpretable. Our code is available at https://github.com/MLNeurosurg/ItemizedCLIP.

[44] Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context

Anatole Jacquin de Margerie,Alexis Roger,Irina Rish

Main category: cs.CV

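TL;DR: This paper reproduces and analyzes the Monkey vision-language model (VLM), confirms that image tiling effectively recovers local detail, and further studies the effect of global context, finding that results depend strongly on task type and tile granularity.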

Details Motivation: Complex multimodal models often lack transparent implementation details and accessible training infrastructure; this work aims to improve reproducibility and transparency by reproducing the Monkey VLM. Method: The original model is reproduced from open checkpoints, the training pipeline is reimplemented, and the effects of the tiling strategy and of global context on high-resolution image understanding are evaluated systematically. Result: Tiling is confirmed to recover local detail, and adding global context improves performance; however, the effects depend strongly on task type and tile granularity, and some results deviate from the original. Conclusion: Image tiling helps high-resolution visual understanding, but its effectiveness depends on task requirements and granularity, and it should be combined with global context for more robust results. Abstract: Reproducibility remains a cornerstone of scientific progress, yet complex multimodal models often lack transparent implementation details and accessible training infrastructure. In this work, we present a detailed reproduction and critical analysis of the Monkey Vision-Language Model (VLM) (Li et al. 2023b) published at CVPR 2024, a recent approach to high-resolution image understanding via image tiling. The original paper proposed splitting large images into tiles to recover fine-grained visual details while maintaining computational efficiency. Our study replicates this strategy using open checkpoints and reimplements the training pipeline. We confirm the key finding of the original Monkey VLM work, namely that tiling effectively recovers local details. We then extend this work further by investigating the effect of including the global context, which provides practical insights for future high-resolution multimodal modeling. However, we also report deviations in the results, with the magnitude of these effects depending heavily on task type and tile granularity.
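
To make the tiling idea concrete, here is a minimal sketch that computes crop boxes for non-overlapping tiles and optionally appends one whole-image view for global context. The tile size of 448 pixels and the helper names are illustrative assumptions; the abstract does not give the exact tiling parameters.

```python
def tile_boxes(width, height, tile_size):
    """Return (left, top, right, bottom) boxes covering the image.

    Tiles on the right and bottom edges are clipped to the image bounds.
    """
    boxes = []
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            right = min(left + tile_size, width)
            bottom = min(top + tile_size, height)
            boxes.append((left, top, right, bottom))
    return boxes

def build_views(width, height, tile_size, include_global=True):
    """List the crops an encoder would see: local tiles plus, optionally,
    the full image (to be downscaled) as a global-context view."""
    views = [("tile", box) for box in tile_boxes(width, height, tile_size)]
    if include_global:
        views.append(("global", (0, 0, width, height)))
    return views

views = build_views(896, 1344, 448)
```

Smaller tiles give the encoder more local detail but multiply the number of views, which is one reason the reported effects vary with tile granularity.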

[45] Lightweight 3D Gaussian Splatting Compression via Video Codec

Qi Yang,Geert Van Der Auwera,Zhu Li

Main category: cs.CV

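TL;DR: This paper proposes LGSCV, a lightweight video-codec-based compression method for 3D Gaussian Splatting that combines a two-stage Morton scan, MiniPLAS sorting, and dimensionality reduction of spherical harmonics to improve coding efficiency and compression performance while preserving quality.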

Details Motivation: Existing PLAS-based 3D Gaussian Splatting compression methods are computationally complex and slow, which makes them hard to use on lightweight devices, so a more efficient, low-complexity scheme is needed. Method: A two-stage Morton scan generates blockwise 2D maps suited to video codecs; PCA reduces the dimensionality of the spherical harmonics, and a lightweight MiniPLAS sorts primitives within blocks, improving rate-distortion performance and guiding the choice of coding block size. Result: On the MPEG dataset, the method achieves more than 20% rate-distortion gain over existing methods, reduces 2D map generation time to about 1 second, and cuts encoding time by 50%. Conclusion: LGSCV greatly improves efficiency while preserving compression quality, suits resource-constrained devices, and offers a practical option for real-time transmission and deployment of 3D Gaussian Splatting. Abstract: Current video-based GS compression methods rely on using Parallel Linear Assignment Sorting (PLAS) to convert 3D GS into smooth 2D maps, which is computationally expensive and time-consuming, limiting the application of GS on lightweight devices. In this paper, we propose a Lightweight 3D Gaussian Splatting (GS) Compression method based on Video codec (LGSCV). First, a two-stage Morton scan is proposed to generate blockwise 2D maps that are friendly for canonical video codecs in which the coding units (CU) are square blocks. A 3D Morton scan is used to permute GS primitives, followed by a 2D Morton scan to map the ordered GS primitives to 2D maps in a blockwise style. However, although the blockwise 2D maps report close performance to the PLAS map in high-bitrate regions, they show a quality collapse at medium-to-low bitrates. Therefore, a principal component analysis (PCA) is used to reduce the dimensionality of spherical harmonics (SH), and a MiniPLAS, which is flexible and fast, is designed to permute the primitives within certain block sizes. Incorporating SH PCA and MiniPLAS leads to a significant gain in rate-distortion (RD) performance, especially at medium and low bitrates. MiniPLAS can also guide the setting of the codec CU size configuration and significantly reduce encoding time. Experimental results on the MPEG dataset demonstrate that the proposed LGSCV achieves over 20% RD gain compared with state-of-the-art methods, while reducing 2D map generation time to approximately 1 second and cutting encoding time by 50%. The code is available at https://github.com/Qi-Yangsjtu/LGSCV .
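
The two-stage Morton scan can be illustrated with plain bit interleaving. This is a sketch of the general Z-order technique under stated assumptions, not the authors' implementation: positions are assumed to be already quantized to non-negative integers, and the helper names are hypothetical. Stage one sorts primitives by their 3D Morton code; stage two places the primitive of rank r at the pixel obtained by de-interleaving r, so consecutive ranks fill square blocks.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Z-order code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def morton2d_decode(index, bits=16):
    """De-interleave a rank into (column, row) on the 2D map."""
    col = row = 0
    for i in range(bits):
        col |= ((index >> (2 * i)) & 1) << i
        row |= ((index >> (2 * i + 1)) & 1) << i
    return col, row

def layout(points, bits=10):
    """Map each 2D pixel to the index of the primitive stored there."""
    order = sorted(range(len(points)),
                   key=lambda i: morton3d(*points[i], bits=bits))
    return {morton2d_decode(rank): idx for rank, idx in enumerate(order)}

# Four quantized positions (toy values); they fill one 2x2 block.
points = [(1, 1, 1), (0, 0, 0), (1, 0, 0), (0, 1, 0)]
pixel_to_primitive = layout(points)
```

Because every aligned run of 4^k consecutive ranks lands in one square block of side 2^k, spatially close primitives end up in the same block, which matches the square coding units of conventional video codecs.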

[46] Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization

Anh-Kiet Duong,Petra Gomez-Krämer

Main category: cs.CV

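TL;DR: This paper presents a solution to the ICCV 2025 BinEgo-360 Challenge on temporal action localization (TAL) in multi-perspective, multi-modal video; it extends TSM with a background class and a multi-task learning framework, and took first place in the competition using a weighted ensemble strategy.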

Details Motivation: Precise temporal action localization in multi-perspective, multi-modal video is challenged by viewpoint diversity and modality fusion, and requires effective use of context and improved robustness. Method: The approach extends the Temporal Shift Module (TSM) by introducing a background class and classifying fixed-length, non-overlapping intervals; multi-task learning jointly optimizes scene classification and TAL; a weighted ensemble of several models improves prediction consistency. Result: The method ranked first in both the initial and extended rounds of the BinEgo-360 Challenge, supporting the combination of multi-task learning, an efficient backbone, and ensemble learning. Conclusion: An extended TSM framework that combines multi-task learning, background modeling, and model ensembling performs well on multi-perspective, multi-modal TAL and shows good robustness and application potential. Abstract: We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
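
Classifying fixed-length, non-overlapping intervals with an extra background class turns localization into a post-processing step: consecutive intervals with the same action label are merged into one segment, and background intervals are dropped. The sketch below shows that step only; the interval length, label names, and function name are illustrative assumptions rather than details from the paper.

```python
def intervals_to_segments(labels, interval_seconds, background="background"):
    """Merge runs of identical per-interval labels into
    (label, start_time, end_time) segments, skipping background."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the list or on a label change.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != background:
                segments.append((labels[start],
                                 start * interval_seconds,
                                 i * interval_seconds))
            start = i
    return segments

# Hypothetical per-interval predictions for a 10-second clip.
labels = ["background", "cook", "cook", "background", "eat"]
segments = intervals_to_segments(labels, interval_seconds=2.0)
```

The interval length bounds the temporal precision of the resulting boundaries, so shorter intervals give finer segments at the cost of more classifier calls.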

[47] CADKnitter: Compositional CAD Generation from Text and Geometry Guidance

Tri Le,Khang Nguyen,Baoru Huang,Tung D. Ta,Anh Nguyen

Main category: cs.CV

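TL;DR: This paper proposes CADKnitter, a compositional CAD generation framework with a geometry-guided diffusion sampling strategy that generates a complementary CAD part satisfying the geometric constraints of a given CAD model and the semantic constraints of a text prompt, and builds the KnitCAD dataset of over 310,000 samples.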

Details Motivation: Existing single-part CAD generation methods do not meet real-world needs, where multiple parts must be assembled under semantic and geometric constraints, so a compositional CAD generation method that satisfies both kinds of constraint is needed. Method: The CADKnitter framework uses a geometry-guided diffusion sampling strategy that combines the geometric constraints of the input CAD model with the semantic constraints of the text prompt during generation; the KnitCAD dataset is built to support training and evaluation. Result: In extensive experiments, CADKnitter clearly outperforms state-of-the-art baselines and generates complementary CAD parts that meet assembly requirements. Conclusion: CADKnitter provides an effective approach to functional, editable compositional CAD generation and moves 3D generation closer to practical CAD design. Abstract: Crafting computer-aided design (CAD) models has long been a painstaking and time-intensive task, demanding both precision and expertise from designers. With the emergence of 3D generation, this task has undergone a transformative impact, shifting not only from visual fidelity to functional utility but also enabling editable CAD designs. Prior works have achieved early success in single-part CAD generation, which is not well-suited for real-world applications, as multiple parts need to be assembled under semantic and geometric constraints. In this paper, we propose CADKnitter, a compositional CAD generation framework with a geometry-guided diffusion sampling strategy. CADKnitter is able to generate a complementary CAD part that follows both the geometric constraints of the given CAD model and the semantic constraints of the desired design text prompt. We also curate a dataset, called KnitCAD, containing over 310,000 samples of CAD models, along with textual prompts and assembly metadata that provide semantic and geometric constraints. Extensive experiments demonstrate that our proposed method outperforms other state-of-the-art baselines by a clear margin.

[48] AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path

Zhengyang Yu,Akio Hayakawa,Masato Ishii,Qingtao Yu,Takashi Shibuya,Jing Zhang,Yuki Mitsufuji

Main category: cs.CV

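TL;DR: This paper proposes AutoRefiner, a feedforward noise refiner designed for autoregressive video diffusion models (AR-VDMs), which improves generation quality without updating model parameters through pathwise noise refinement and a reflective KV-cache.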

Details Motivation: Existing inference-time alignment methods work poorly for AR-VDMs, and directly transferring T2I noise refiners to AR-VDMs fails, so an efficient noise refinement method designed specifically for AR-VDMs is needed. Method: AutoRefiner has two key designs: pathwise noise refinement, which adjusts noise step by step along the denoising path, and a reflective KV-cache, which maintains consistency and efficiency during autoregressive generation. Result: Experiments show that AutoRefiner works as an efficient plug-in that improves the sample fidelity of AR-VDMs, outperforming existing optimization- or search-based methods and producing higher-quality output at low computational cost. Conclusion: AutoRefiner offers a practical and efficient inference-time optimization scheme for AR-VDMs and supports real-time and interactive video generation applications. Abstract: Autoregressive video diffusion models (AR-VDMs) show strong promise as scalable alternatives to bidirectional VDMs, enabling real-time and interactive applications. Yet there remains room for improvement in their sample fidelity. A promising solution is inference-time alignment, which optimizes the noise space to improve sample fidelity without updating model parameters. Yet, optimization- or search-based methods are computationally impractical for AR-VDMs. Recent text-to-image (T2I) works address this via feedforward noise refiners that modulate sampled noises in a single forward pass. Can such noise refiners be extended to AR-VDMs? We identify the failure of naively extending T2I noise refiners to AR-VDMs and propose AutoRefiner, a noise refiner tailored for AR-VDMs, with two key designs: pathwise noise refinement and a reflective KV-cache. Experiments demonstrate that AutoRefiner serves as an efficient plug-in for AR-VDMs, effectively enhancing sample fidelity by refining noise along stochastic denoising paths.

[49] SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection

Tianye Qi,Weihao Li,Nick Barnes

Main category: cs.CV

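TL;DR: This paper introduces SmokeBench, a four-task benchmark for evaluating how well multimodal large language models recognize and localize wildfire smoke in images; existing models perform poorly at early-stage smoke localization, underscoring the need for better methods in this safety-critical application.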

Details Motivation: Wildfire smoke is transparent, amorphous, and often confused with clouds, which makes early detection very difficult, so the performance of current multimodal large language models in this setting needs to be assessed. Method: The SmokeBench benchmark comprises four tasks (smoke classification, tile-based localization, grid-based localization, and smoke detection), on which several mainstream MLLMs are evaluated systematically. Result: Some models can judge the presence of smoke reasonably well when it covers a large area, but all models perform poorly at precise localization, especially in the early stages; further analysis shows that smoke volume correlates strongly with model performance, while contrast has a smaller effect. Conclusion: Current multimodal large language models have significant limitations for safety-critical wildfire monitoring, and new methods that improve early-stage smoke localization are needed. Abstract: Wildfire smoke is transparent, amorphous, and often visually confounded with clouds, making early-stage detection particularly challenging. In this work, we introduce a benchmark, called SmokeBench, to evaluate the ability of multimodal large language models (MLLMs) to recognize and localize wildfire smoke in images. The benchmark consists of four tasks: (1) smoke classification, (2) tile-based smoke localization, (3) grid-based smoke localization, and (4) smoke detection. We evaluate several MLLMs, including Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro. Our results show that while some models can classify the presence of smoke when it covers a large area, all models struggle with accurate localization, especially in the early stages. Further analysis reveals that smoke volume is strongly correlated with model performance, whereas contrast plays a comparatively minor role. These findings highlight critical limitations of current MLLMs for safety-critical wildfire monitoring and underscore the need for methods that improve early-stage smoke localization.

[50] VFMF: World Modeling by Forecasting Vision Foundation Model Features

Gabrijel Boduljak,Yushi Lan,Christian Rupprecht,Andrea Vedaldi

Main category: cs.CV

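TL;DR: This paper proposes generative forecasting in the feature space of vision foundation models (VFMs), using autoregressive flow matching to model uncertainty over future states; it avoids the averaging over multiple futures inherent in deterministic regression and yields sharper, more accurate predictions across several interpretable output modalities.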

Details Motivation: Existing forecasting methods based on pixels or VFM features are either computationally expensive or ignore uncertainty about the future, so an efficient world-modeling approach that also captures multiple plausible futures is needed. Method: A generative forecasting framework is introduced in VFM feature space using autoregressive flow matching; VFM features are first encoded into a compact latent space suited to diffusion-style generation, and the generated latents are then decoded into several output modalities (such as semantic segmentation, depth, surface normals, and RGB images). Result: With matched architecture and compute, the method produces sharper and more accurate predictions than deterministic regression across all tested modalities; its latent space preserves information better than conventional dimensionality reduction such as PCA and is useful for both forecasting and image generation. Conclusion: Stochastic conditional generation in VFM feature space is a promising and scalable paradigm for building world models that balances accuracy, efficiency, and multi-modal output. Abstract: Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation, we introduce a generative forecaster that performs autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used PCA-based alternatives, both for forecasting and other applications, such as image generation. Our latent predictions can be easily decoded into multiple useful and interpretable output modalities: semantic segmentation, depth, surface normals, and even RGB. With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities.
Our results suggest that stochastic conditional generation of VFM features offers a promising and scalable foundation for future world models.

[51] FutureX: Enhance End-to-End Autonomous Driving via Latent Chain-of-Thought World Model

Hongbin Lin,Yiming Yang,Yifan Zhang,Chaoda Zheng,Jie Feng,Sheng Wang,Zhennan Wang,Shijia Chen,Boyang Wang,Yu Zhang,Xianming Liu,Shuguang Cui,Zhen Li

Main category: cs.CV

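TL;DR: This paper proposes FutureX, a Chain of Thought (CoT)-driven pipeline for end-to-end autonomous driving planning that improves decision quality in complex, dynamic environments through latent-space reasoning about future scenes and trajectory refinement.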

Details Motivation: Conventional end-to-end planners decide from the current scene alone, which can yield suboptimal behavior in highly dynamic traffic, and they ignore how the ego vehicle's actions affect the future scene, leading to unreasonable plans. Method: FutureX introduces an Auto-think Switch that decides whether deeper reasoning is needed; in Thinking mode, a Latent World Model performs a CoT-guided rollout of future scenes and a Summarizer Module refines the trajectory, while simple scenes are handled in an efficient Instant mode that outputs the plan directly. Result: Experiments show that FutureX significantly improves existing methods, for example raising the PDMS of TransFuser on NAVSIM by 6.2, while reducing collisions without sacrificing efficiency. Conclusion: By adding latent reasoning about future scenes and a CoT mechanism, FutureX makes end-to-end planners more adaptive and safer in complex, dynamic environments, balancing quality and efficiency. Abstract: In autonomous driving, end-to-end planners learn scene representations from raw sensor data and utilize them to generate a motion plan or control actions. However, exclusive reliance on the current scene for motion planning may result in suboptimal responses in highly dynamic traffic environments where ego actions further alter the future scene. To model the evolution of future scenes, we leverage the World Model to represent how the ego vehicle and its environment interact and change over time, which entails complex reasoning. The Chain of Thought (CoT) offers a promising solution by forecasting a sequence of future thoughts that subsequently guide trajectory refinement. In this paper, we propose FutureX, a CoT-driven pipeline that enhances end-to-end planners to perform complex motion planning via future scene latent reasoning and trajectory refinement. Specifically, the Auto-think Switch examines the current scene and decides whether additional reasoning is required to yield a higher-quality motion plan. Once FutureX enters the Thinking mode, the Latent World Model conducts a CoT-guided rollout to predict future scene representation, enabling the Summarizer Module to further refine the motion plan. Otherwise, FutureX operates in an Instant mode to generate motion plans in a forward pass for relatively simple scenes.
Extensive experiments demonstrate that FutureX enhances existing methods by producing more rational motion plans and fewer collisions without compromising efficiency, thereby achieving substantial overall performance gains, e.g., 6.2 PDMS improvement for TransFuser on NAVSIM. Code will be released.
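The two-mode control flow described for FutureX can be sketched roughly as follows. This is a minimal illustration only: the class name `SwitchPlanner`, the callables `base_planner`, `complexity`, `rollout`, and `refine`, and the fixed `threshold` are hypothetical stand-ins for the Auto-think Switch, Latent World Model, and Summarizer Module, not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Trajectory = List[float]
Latent = Sequence[float]


@dataclass
class SwitchPlanner:
    # All four callables are hypothetical placeholders for learned modules.
    base_planner: Callable[[Latent], Trajectory]        # plain end-to-end planner
    complexity: Callable[[Latent], float]               # stand-in for the Auto-think Switch
    rollout: Callable[[Latent, Trajectory], Latent]     # stand-in for the Latent World Model
    refine: Callable[[Trajectory, List[Latent]], Trajectory]  # stand-in for the Summarizer
    threshold: float = 0.5   # assumed decision threshold
    steps: int = 3           # assumed number of rollout steps

    def plan(self, scene: Latent) -> Tuple[str, Trajectory]:
        trajectory = self.base_planner(scene)
        # Instant mode: simple scenes return the single forward-pass plan.
        if self.complexity(scene) < self.threshold:
            return "instant", trajectory
        # Thinking mode: roll future latents forward, then refine once.
        futures: List[Latent] = []
        latent = scene
        for _ in range(self.steps):
            latent = self.rollout(latent, trajectory)
            futures.append(latent)
        return "thinking", self.refine(trajectory, futures)
```

The point of the switch is that the extra rollout cost is paid only when the scene is judged hard, which is how the abstract motivates keeping efficiency unchanged on simple scenes.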

[52] REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation

Haotian Wang,Yuzhe Weng,Xinyi Yu,Jun Du,Haoran Xu,Xiaoyan Wu,Shan He,Bing Yin,Cong Liu,Qingfeng Liu

Main category: cs.CV

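TL;DR: This paper proposes REST, the first diffusion-based framework for real-time, end-to-end streaming audio-driven talking head generation. Using a compact video latent space, an ID-Context Cache mechanism, and an asynchronous streaming distillation strategy, it achieves efficient, coherent real-time generation and outperforms existing methods in both speed and overall performance.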

Details Motivation: Diffusion models have advanced talking head generation, but slow inference and the non-autoregressive paradigm limit practical use; a solution that generates continuously in real time is needed. Method: REST (1) learns a compact video latent space through high spatiotemporal VAE compression; (2) introduces an ID-Context Cache mechanism, combining ID-Sink and Context-Cache, for key-value caching that maintains temporal consistency and identity coherence; and (3) uses an Asynchronous Streaming Distillation (ASD) training strategy in which a non-streaming teacher guides the streaming student to reduce error accumulation. Result: REST outperforms current state-of-the-art methods in generation speed and overall performance, supports real-time end-to-end streaming generation, and maintains identity consistency and temporal coherence over long generations. Conclusion: REST bridges the gap between autoregressive and diffusion-based approaches to talking head generation, offering an efficient, high-quality solution for applications that require real-time generation. Abstract: Diffusion models have significantly advanced the field of talking head generation. However, the slow inference speeds and non-autoregressive paradigms severely constrain the application of diffusion-based THG models. In this study, we propose REST, the first diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework. To support real-time end-to-end generation, a compact video latent space is first learned through high spatiotemporal VAE compression. Additionally, to enable autoregressive streaming within the compact video latent space, we introduce an ID-Context Cache mechanism, which integrates ID-Sink and Context-Cache principles into key-value caching for maintaining temporal consistency and identity coherence during long-time streaming generation. Furthermore, an Asynchronous Streaming Distillation (ASD) training strategy is proposed to mitigate error accumulation in autoregressive generation and enhance temporal consistency, which leverages a non-streaming teacher with an asynchronous noise schedule to supervise the training of the streaming student model. REST bridges the gap between autoregressive and diffusion-based approaches, demonstrating substantial value for applications requiring real-time talking head generation. Experimental results demonstrate that REST outperforms state-of-the-art methods in both generation speed and overall performance.
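One plausible reading of a cache that combines a permanent "sink" with a rolling context is sketched below. The abstract does not specify how ID-Sink and Context-Cache are implemented, so the class `SinkWindowCache` and its eviction rule (keep the first few entries forever, keep only the most recent ones otherwise) are assumptions for illustration.

```python
from collections import deque


class SinkWindowCache:
    """Toy key-value cache: pinned sink entries plus a sliding window."""

    def __init__(self, num_sink, window):
        self.num_sink = num_sink             # entries pinned for the whole stream
        self.sink = []                       # e.g. identity-bearing entries (assumed)
        self.recent = deque(maxlen=window)   # oldest context is evicted automatically

    def append(self, entry):
        # Fill the sink first; everything after that goes through the window.
        if len(self.sink) < self.num_sink:
            self.sink.append(entry)
        else:
            self.recent.append(entry)

    def entries(self):
        # What attention would see at the current step.
        return self.sink + list(self.recent)


cache = SinkWindowCache(num_sink=2, window=3)
for step in range(10):
    cache.append(step)
print(cache.entries())  # [0, 1, 7, 8, 9]
```

The cache size stays bounded no matter how long the stream runs, which is the property a real-time streaming generator needs.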

[53] RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing

Wentang Chen,Shougao Zhang,Yiman Zhang,Tianhao Zhou,Ruihui Li

Main category: cs.CV

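TL;DR: This paper proposes RoomPilot, a unified framework that parses multimodal inputs (such as text descriptions or CAD floor plans) into an Indoor Domain-Specific Language (IDSL) to generate controllable, highly interactive indoor scenes.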

Details Motivation: Existing methods are limited in the range of input modalities they accept or in controllability, making it hard to generate high-quality indoor scenes that are both functional and interactive. Method: A framework built on IDSL as a shared semantic representation generates structured indoor scenes from multimodal inputs such as text or CAD floor plans, and uses a dataset of interaction-annotated assets to make scenes functionally realistic. Result: Experiments show that RoomPilot performs strongly in multimodal understanding, fine-grained control over generation, physical consistency, and visual fidelity. Conclusion: RoomPilot advances general-purpose controllable 3D indoor scene generation, offering a better solution for applications such as games, architectural visualization, and embodied AI training. Abstract: Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multi-modal inputs (textual descriptions or CAD floor plans) into an Indoor Domain-Specific Language (IDSL) for indoor structured scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments further validate its strong multi-modal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.

[54] WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering

Yuxuan Han,Xin Ming,Tianxiao Li,Zhuofan Shen,Qixuan Zhang,Lan Xu,Feng Xu

Main category: cs.CV

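TL;DR: This paper proposes WildCap, a method for high-quality facial appearance capture from smartphone video recorded under natural lighting. It combines data-driven and model-based inverse rendering and introduces a new texel grid lighting model to separate complex lighting from reflectance, substantially improving facial appearance reconstruction in uncontrolled environments.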

Details Motivation: Existing methods rely on controllable lighting, which raises capture cost and limits practicality. A method for high-quality facial appearance capture in natural environments would lower equipment requirements and improve usability. Method: WildCap uses a hybrid inverse rendering framework: the data-driven SwitchLight first converts in-the-wild images into more constrained conditions, followed by model-based inverse rendering. A texel grid lighting model explains non-physical artifacts, local lighting and albedo are optimized jointly, and a diffusion prior resolves scale ambiguity. Result: Under the same capture setup, the method clearly outperforms prior methods and greatly narrows the quality gap between in-the-wild and controlled capture. Conclusion: WildCap achieves high-quality facial appearance capture under natural lighting, overcoming complex lighting and non-physical artifacts, and advances low-cost, highly usable face capture. Abstract: Existing methods achieve high-quality facial appearance capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial appearance capture from a smartphone video recorded in the wild. To disentangle high-quality reflectance from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. Specifically, we first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering. However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local physical lighting. During optimization, we jointly sample a diffusion prior for reflectance maps and optimize the lighting, effectively resolving scale ambiguity between local lights and albedo. Our method achieves significantly better results than prior art in the same capture setup, closing the quality gap between in-the-wild and controllable recordings by a large margin. Our code will be released at https://yxuhan.github.io/WildCap/index.html.

[55] Cross-modal Prompting for Balanced Incomplete Multi-modal Emotion Recognition

Wen-Jue He,Xiaofeng Zhu,Zheng Zhang

Main category: cs.CV

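TL;DR: Proposes a new Cross-modal Prompting (ComP) method for incomplete multi-modal emotion recognition that improves recognition accuracy by enhancing modality-specific features and dynamically re-weighting modality outputs.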

Details Motivation: To address the performance gap and modality under-optimization problems in incomplete multi-modal data, where missing data further limits the effectiveness of multi-modal learning. Method: ComP comprises a progressive prompt generation module, a dynamic gradient modulator, and a cross-modal knowledge propagation mechanism, plus a coordinator that dynamically re-weights modality outputs. Result: Compared with 7 SOTA methods on 4 datasets, ComP shows superior performance across different missing rates, validating its effectiveness. Conclusion: ComP improves the accuracy and robustness of incomplete multi-modal emotion recognition; its prompting mechanism and dynamic balancing strategy strengthen cross-modal consistency and discriminative power. Abstract: Incomplete multi-modal emotion recognition (IMER) aims at understanding human intentions and sentiments by comprehensively exploring the partially observed multi-source data. Although the multi-modal data is expected to provide more abundant information, the performance gap and modality under-optimization problem hinder effective multi-modal learning in practice, and are exacerbated in the confrontation of the missing data. To address this issue, we devise a novel Cross-modal Prompting (ComP) method, which emphasizes coherent information by enhancing modality-specific features and improves the overall recognition accuracy by boosting each modality's performance. Specifically, a progressive prompt generation module with a dynamic gradient modulator is proposed to produce concise and consistent modality semantic cues. Meanwhile, cross-modal knowledge propagation selectively amplifies the consistent information in modality features with the delivered prompts to enhance the discrimination of the modality-specific output. Additionally, a coordinator is designed to dynamically re-weight the modality outputs as a complement to the balance strategy to improve the model's efficacy. Extensive experiments on 4 datasets with 7 SOTA methods under different missing rates validate the effectiveness of our proposed method.

[56] PersonaLive! Expressive Portrait Image Animation for Live Streaming

Zhiyuan Li,Chi-Man Pun,Chen Fang,Jue Wang,Xiaodong Cun

Main category: cs.CV

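TL;DR: This paper proposes PersonaLive, a diffusion-based framework for real-time portrait animation whose multi-stage training strategy substantially improves inference efficiency and generation stability, making it suitable for live streaming.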

Details Motivation: Existing diffusion-based portrait animation models focus on visual quality and expression realism but neglect generation latency and real-time performance, which limits their use in real-time settings such as live streaming. Method: Hybrid implicit signals (implicit facial representations and 3D implicit keypoints) provide image-level motion control; a fewer-step appearance distillation strategy reduces redundancy in the denoising process; and an autoregressive micro-chunk streaming generation paradigm, combined with a sliding training strategy and a historical keyframe mechanism, enables low-latency, stable long-sequence video generation. Result: Experiments show that PersonaLive maintains high generation quality while achieving a 7-22x speedup over prior diffusion models, substantially reducing generation latency. Conclusion: PersonaLive provides an efficient, stable solution for diffusion-based real-time portrait animation and broadens its potential for real-time interactive applications. Abstract: Current diffusion-based portrait animation models predominantly focus on enhancing visual quality and expression realism, while overlooking generation latency and real-time performance, which restricts their application range in the live streaming scenario. We propose PersonaLive, a novel diffusion-based framework towards streaming real-time portrait animation with multi-stage training recipes. Specifically, we first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Then, a fewer-step appearance distillation strategy is proposed to eliminate appearance redundancy in the denoising process, greatly improving inference efficiency. Finally, we introduce an autoregressive micro-chunk streaming generation paradigm equipped with a sliding training strategy and a historical keyframe mechanism to enable low-latency and stable long-term video generation. Extensive experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to 7-22x speedup over prior diffusion-based portrait animation models.

[57] Do We Need Reformer for Vision? An Experimental Comparison with Vision Transformers

Ali El Bellaj,Mohammed-Amine Cheddadi,Rhassan Berber

Main category: cs.CV

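TL;DR: This paper studies the Reformer architecture as a vision backbone, using locality-sensitive hashing (LSH) attention to reduce the theoretical time complexity of self-attention from O(n²) to O(n log n), and evaluates it on CIFAR-10, ImageNet-100, and a high-resolution medical imaging dataset. Although the Reformer is more accurate than the ViT baseline on CIFAR-10, ViT consistently outperforms it in practical efficiency and end-to-end computation time in the larger, higher-resolution settings, indicating that the practical benefit of LSH attention only appears at sequence lengths longer than those produced by typical high-resolution images.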

Details Motivation: In a standard Vision Transformer (ViT), the cost of global self-attention grows with the square of the sequence length, making it too expensive for high-resolution images and resource-constrained settings, so a more computationally efficient alternative architecture is needed. Method: The Reformer architecture is used as a vision backbone, combining patch-based tokenization with locality-sensitive hashing (LSH) attention to approximate global self-attention and reduce the computational complexity to O(n log n). Result: The Reformer is more accurate than the ViT baseline on CIFAR-10, but on ImageNet-100 and the high-resolution medical imaging dataset ViT is better in practical efficiency and end-to-end computation time. Conclusion: Although the Reformer has lower theoretical complexity, its practical computational advantage is not evident in typical current high-resolution image tasks, which suggests that LSH attention needs longer sequences to pay off. Abstract: Transformers have recently demonstrated strong performance in computer vision, with Vision Transformers (ViTs) leveraging self-attention to capture both low-level and high-level image features. However, standard ViTs remain computationally expensive, since global self-attention scales quadratically with the number of tokens, which limits their practicality for high-resolution inputs and resource-constrained settings. In this work, we investigate the Reformer architecture as an alternative vision backbone. By combining patch-based tokenization with locality-sensitive hashing (LSH) attention, our model approximates global self-attention while reducing its theoretical time complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$ in the sequence length $n$. We evaluate the proposed Reformer-based vision model on CIFAR-10 to assess its behavior on small-scale datasets, on ImageNet-100 to study its accuracy-efficiency trade-off in a more realistic setting, and on a high-resolution medical imaging dataset to evaluate the model under longer token sequences. While the Reformer achieves higher accuracy on CIFAR-10 compared to our ViT-style baseline, the ViT model consistently outperforms the Reformer in our experiments in terms of practical efficiency and end-to-end computation time across the larger and higher-resolution settings. These results suggest that, despite the theoretical advantages of LSH-based attention, meaningful computation gains require sequence lengths substantially longer than those produced by typical high-resolution images.
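To see why typical image inputs may be too short for the asymptotic gain to matter, it helps to compute the token count of a patch grid and the idealized ratio between n² and n log n. The sketch below ignores all constant factors (hashing rounds, bucket sorting, memory traffic), which are exactly what the paper's timing results suggest dominate in practice; the image and patch sizes are illustrative choices, not values taken from the paper.

```python
import math


def num_tokens(image_size, patch_size):
    """Number of patch tokens for a square image split into square patches."""
    per_side = image_size // patch_size
    return per_side * per_side


def idealized_ratio(n):
    """n^2 divided by n*log2(n): the best-case speedup, constants ignored."""
    return (n * n) / (n * math.log2(n))


# Illustrative settings (assumed, not from the paper).
for image_size, patch_size in [(32, 4), (224, 16), (1024, 16)]:
    n = num_tokens(image_size, patch_size)
    print(image_size, patch_size, n, round(idealized_ratio(n), 1))
```

For a 224-pixel image with 16-pixel patches there are only 196 tokens, so even the idealized ratio is modest, and any constant-factor overhead in LSH attention can erase it.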

[58] Evaluating the Efficacy of Sentinel-2 versus Aerial Imagery in Serrated Tussock Classification

Rezwana Sultana,Manzur Murshed,Kathryn Sheffield,Singarayer Florentine,Tsz-Kwan Lee,Shyh Wei Teng

Main category: cs.CV

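TL;DR: This study evaluates multi-temporal Sentinel-2 satellite imagery for landscape-scale monitoring of the invasive grass serrated tussock (Nassella trichotoma). Using its spectral resolution and phenological information with a random forest classifier, it achieves classification accuracy comparable to, and slightly better than, aerial imagery.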

Details Motivation: Ground surveys do not scale to large areas and aerial imagery is costly, so an economical and scalable method is needed for large-scale monitoring of invasive species. Method: Eleven model combinations of spectral bands, texture features, vegetation indices, and seasonal data are built from multi-temporal Sentinel-2 imagery, classified with a random forest classifier, and compared with aerial imagery results. Result: The best Sentinel-2 model (M76*) reaches an overall accuracy of 68% and a Kappa of 0.55, slightly better than the best aerial imagery model (OA 67%, OK 0.52). Conclusion: Satellite remote sensing models enhanced with multi-seasonal features are feasible and promising for large-scale invasive species classification, offering a more cost-effective and scalable monitoring option. Abstract: Invasive species pose major global threats to ecosystems and agriculture. Serrated tussock (Nassella trichotoma) is a highly competitive invasive grass species that disrupts native grasslands, reduces pasture productivity, and increases land management costs. In Victoria, Australia, it presents a major challenge due to its aggressive spread and ecological impact. While current ground surveys and subsequent management practices are effective at small scales, they are not feasible for landscape-scale monitoring. Although aerial imagery offers high spatial resolution suitable for detailed classification, its high cost limits scalability. Satellite-based remote sensing provides a more cost-effective and scalable alternative, though often with lower spatial resolution. This study evaluates whether multi-temporal Sentinel-2 imagery, despite its lower spatial resolution, can provide a comparable and cost-effective alternative for landscape-scale monitoring of serrated tussock by leveraging its higher spectral resolution and seasonal phenological information. A total of eleven models have been developed using various combinations of spectral bands, texture features, vegetation indices, and seasonal data. Using a random forest classifier, the best-performing Sentinel-2 model (M76*) has achieved an Overall Accuracy (OA) of 68% and an Overall Kappa (OK) of 0.55, slightly outperforming the best-performing aerial imaging model's OA of 67% and OK of 0.52 on the same dataset. These findings highlight the potential of multi-seasonal feature-enhanced satellite-based models for scalable invasive species classification.
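The two reported metrics can be computed from a confusion matrix as sketched below. This assumes that "Overall Kappa" refers to Cohen's kappa, which the abstract does not state explicitly, and the example matrix is made up for illustration rather than taken from the study.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def cohen_kappa(cm):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    total = sum(sum(row) for row in cm)
    p_o = overall_accuracy(cm)
    # Chance agreement from the row (true) and column (predicted) marginals.
    p_e = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(len(cm))
    ) / (total * total)
    return (p_o - p_e) / (1.0 - p_e)


# Made-up 2-class matrix: rows are true classes, columns are predictions.
example = [[40, 10], [10, 40]]
print(overall_accuracy(example), round(cohen_kappa(example), 3))
```

Kappa sits well below accuracy whenever chance agreement is high, which is why a result such as 68% OA alongside 0.55 kappa is reported as a pair rather than as accuracy alone.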

[59] FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion

Xiangyang Luo,Qingyu Li,Xiaokun Liu,Wenyu Qin,Miao Yang,Meng Wang,Pengfei Wan,Di Zhang,Kun Gai,Shao-Lun Huang

Main category: cs.CV

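TL;DR: This paper proposes FilmWeaver, a new framework for generating consistent multi-shot videos of arbitrary length. By decoupling inter-shot consistency from intra-shot coherence and using a dual-level cache mechanism, it achieves clear gains in character, scene, and motion continuity.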

Details Motivation: Existing video generation models do well on single shots but struggle to keep characters and backgrounds consistent across multiple shots and cannot flexibly generate videos of arbitrary length and shot count, so a new method that addresses multi-shot consistency and scalability is needed. Method: FilmWeaver uses an autoregressive diffusion model for arbitrary-length video generation and a dual-level cache mechanism: a shot memory caches keyframes from preceding shots to maintain cross-shot consistency, and a temporal memory keeps a history of frames from the current shot to ensure coherent motion within the shot. It supports multi-round user interaction and tasks such as multi-concept injection and video extension. Result: Experiments show that the method outperforms existing methods on both consistency and visual quality metrics; a high-quality multi-shot video dataset was built to support training; and the method stands out in multi-shot generation, character preservation, and narrative coherence. Conclusion: Through its decoupled design and dual-level memory mechanism, FilmWeaver addresses consistency and flexibility in multi-shot video generation and advances controllable, coherent, narrative-driven long video generation. Abstract: Current video generation models perform well at single-shot synthesis but struggle with multi-shot videos, facing critical challenges in maintaining character and background consistency across shots and flexibly generating videos of arbitrary length and shot count. To address these limitations, we introduce FilmWeaver, a novel framework designed to generate consistent, multi-shot videos of arbitrary length. First, it employs an autoregressive diffusion paradigm to achieve arbitrary-length video generation. To address the challenge of consistency, our key insight is to decouple the problem into inter-shot consistency and intra-shot coherence. We achieve this through a dual-level cache mechanism: a shot memory caches keyframes from preceding shots to maintain character and scene identity, while a temporal memory retains a history of frames from the current shot to ensure smooth, continuous motion. The proposed framework allows for flexible, multi-round user interaction to create multi-shot videos. Furthermore, due to this decoupled design, our method demonstrates high versatility by supporting downstream tasks such as multi-concept injection and video extension. To facilitate the training of our consistency-aware method, we also developed a comprehensive pipeline to construct a high-quality multi-shot video dataset. Extensive experimental results demonstrate that our method surpasses existing approaches on metrics for both consistency and aesthetic quality, opening up new possibilities for creating more consistent, controllable, and narrative-driven video content. Project Page: https://filmweaver.github.io
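The dual-level cache can be pictured as two bounded buffers with different lifetimes. The sketch below is a data-structure illustration only: the class `DualLevelCache`, its capacity parameters, and the rule that the temporal memory is cleared at a shot boundary are assumptions, since the abstract does not give these details.

```python
from collections import deque


class DualLevelCache:
    """Toy version of a shot memory plus a per-shot temporal memory."""

    def __init__(self, max_shots, max_frames):
        self.shot_memory = deque(maxlen=max_shots)       # keyframes of earlier shots
        self.temporal_memory = deque(maxlen=max_frames)  # recent frames of this shot

    def add_frame(self, frame):
        # Called for every generated frame in the current shot.
        self.temporal_memory.append(frame)

    def end_shot(self, keyframe):
        # Keep one keyframe for identity, then start the next shot fresh
        # (clearing here is an assumed policy).
        self.shot_memory.append(keyframe)
        self.temporal_memory.clear()

    def context(self):
        # Conditioning passed to the generator: (inter-shot, intra-shot).
        return list(self.shot_memory), list(self.temporal_memory)
```

Separating the buffers mirrors the decoupling in the abstract: the first carries character and scene identity across cuts, the second carries motion continuity within a shot.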

[60] HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning

Yiqing Yang,Kin-Man Lam

Main category: cs.CV

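TL;DR: Proposes an end-to-end, task-adaptive framework for video keyframe selection that substantially improves video understanding performance through implicit query generation, continuous set-level optimization, and student-teacher mutual learning.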

Details Motivation: Traditional keyframe selection scores frames independently, which leads to temporal clustering and visual redundancy, and reliance on static pseudo labels prevents the supervisory signal from adapting dynamically to the task objective. Method: A Chain-of-Thought approach guides a small language model to generate task-specific implicit query vectors, which are combined with multimodal features for dynamic scoring; a continuous set-level objective covering relevance, coverage, and redundancy is optimized differentiably via Gumbel-Softmax; and a student-teacher mutual learning mechanism aligns frame importance distributions with KL divergence and is trained end to end together with a cross-entropy loss. Result: The method significantly outperforms existing methods on several benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA. Conclusion: The framework addresses local redundancy and global optimization in keyframe selection, enables task-adaptive end-to-end training, and improves overall video understanding performance. Abstract: Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates relevance, coverage, and redundancy, enabling differentiable optimization via Gumbel-Softmax to select optimal frame combinations at the set level. Finally, student-teacher mutual learning is employed, where the student selector (SLM) and teacher reasoner (MLLM) are trained to align their frame importance distributions via KL divergence. Combined with cross-entropy loss, this enables end-to-end optimization, eliminating reliance on static pseudo labels. Experiments across various benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, demonstrate that our method significantly outperforms existing approaches.
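Two ingredients of the method lend themselves to a small illustration: a Gumbel-Softmax sample, which turns frame scores into a soft selection, and a set-level score that rewards relevance while penalizing redundancy between selected frames. The sketch below uses plain Python with no gradients, and the function `set_score`, its weighting `lam`, and the omission of a coverage term are simplifying assumptions; the paper's actual objective is not spelled out in the abstract.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Soft one-hot sample; a lower tau gives a sharper, more discrete result."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def set_score(weights, relevance, similarity, lam):
    """Toy set-level objective: weighted relevance minus pairwise redundancy."""
    n = len(weights)
    rel = sum(w * r for w, r in zip(weights, relevance))
    red = sum(
        weights[i] * weights[j] * similarity[i][j]
        for i in range(n)
        for j in range(i + 1, n)
    )
    return rel - lam * red
```

Because redundancy is a pairwise term over the whole selection, two near-duplicate frames are penalized together, which independent top-K scoring cannot express.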

[61] RcAE: Recursive Reconstruction Framework for Unsupervised Industrial Anomaly Detection

Rongcheng Wu,Hao Zhu,Shiying Zhang,Mingzhe Wang,Zhidong Li,Hui Li,Jianlong Zhou,Jiangtao Cui,Fang Chen,Pingyang Sun,Qiyu Liao,Ye Lin

Main category: cs.CV

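TL;DR: Proposes an unsupervised industrial anomaly detection method based on a recursive autoencoder (RcAE) that iteratively reconstructs the input to progressively suppress anomalies and refine normal structures. Combined with a Cross Recursion Detection (CRD) module and a Detail Preservation Network (DPN), it matches diffusion models in performance with fewer parameters and faster inference.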

Details Motivation: Traditional autoencoders decode in a single pass and struggle with anomalies of varying severity and scale, which leads to incomplete anomaly suppression and loss of detail and falls short of the accuracy and detail recovery that unsupervised industrial anomaly detection requires. Method: A recursive autoencoder (RcAE) reconstructs iteratively for progressive anomaly suppression; a Cross Recursion Detection (CRD) module detects anomalies from inconsistencies across reconstruction steps; and a Detail Preservation Network (DPN) recovers high-frequency texture information. Result: The method significantly outperforms existing non-diffusion methods and performs on par with recent diffusion models while using only 10% of their parameters and running substantially faster at inference. Conclusion: With its recursive reconstruction mechanism and the CRD and DPN modules, RcAE delivers efficient, accurate unsupervised anomaly detection that is both high-performing and practical for real industrial settings. Abstract: Unsupervised industrial anomaly detection requires accurately identifying defects without labeled data. Traditional autoencoder-based methods often struggle with incomplete anomaly suppression and loss of fine details, as their single-pass decoding fails to effectively handle anomalies with varying severity and scale. We propose a recursive architecture for autoencoder (RcAE), which performs reconstruction iteratively to progressively suppress anomalies while refining normal structures. Unlike traditional single-pass models, this recursive design naturally produces a sequence of reconstructions, progressively exposing suppressed abnormal patterns. To leverage this reconstruction dynamics, we introduce a Cross Recursion Detection (CRD) module that tracks inconsistencies across recursion steps, enhancing detection of both subtle and large-scale anomalies. Additionally, we incorporate a Detail Preservation Network (DPN) to recover high-frequency textures typically lost during reconstruction. Extensive experiments demonstrate that our method significantly outperforms existing non-diffusion methods, and achieves performance on par with recent diffusion models with only 10% of their parameters and offering substantially faster inference. These results highlight the practicality and efficiency of our approach for real-world applications.
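The idea of reading anomalies off the changes between recursion steps can be illustrated with a toy example. In the sketch below the "autoencoder" is a hand-written function that pulls every value halfway toward a normal level of 1.0, and the anomaly score is simply the accumulated absolute change per position; both are assumptions standing in for the learned RcAE and CRD modules.

```python
def recursive_reconstruct(x, autoencoder, steps):
    """Feed each reconstruction back into the model and keep every output."""
    outputs = []
    current = x
    for _ in range(steps):
        current = autoencoder(current)
        outputs.append(current)
    return outputs


def cross_recursion_score(x, outputs):
    """Per-position sum of absolute changes between consecutive steps."""
    chain = [x] + outputs
    score = [0.0] * len(x)
    for previous, following in zip(chain, chain[1:]):
        for i, (a, b) in enumerate(zip(previous, following)):
            score[i] += abs(a - b)
    return score


def toy_autoencoder(values):
    # Hypothetical model: moves every value halfway toward the normal level 1.0.
    return [1.0 + 0.5 * (v - 1.0) for v in values]


signal = [1.0, 1.0, 5.0]  # the last position is the "defect"
steps = recursive_reconstruct(signal, toy_autoencoder, steps=2)
print(cross_recursion_score(signal, steps))  # [0.0, 0.0, 3.0]
```

Normal positions are fixed points of the reconstruction and accumulate no change, while the anomalous position keeps moving at every step, so the inconsistency across steps localizes the defect.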

[62] DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry

Zhenyang Cai,Jiaming Zhang,Junjie Zhao,Ziyi Zeng,Yanchao Li,Jingyi Liang,Junying Chen,Yunjin Yang,Jiajun You,Shuzhi Deng,Tongfei Wang,Wanting Chen,Chunxiu Hao,Ruiqi Xie,Zhenwei Wen,Xiangyi Feng,Zou Ting,Jin Zou Lin,Jianquan Li,Guangjun Yu,Liangyi Chen,Junwen Wang,Shan Jiang,Benyou Wang

Main category: cs.CV

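TL;DR: DentalGPT is a multimodal large language model specialized for dentistry. Through high-quality domain knowledge injection and reinforcement learning, it performs strongly on disease classification and dental visual question answering, outperforming existing mainstream models despite having only 7B parameters.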

Details Motivation: Existing multimodal large language models fall short in capturing fine-grained visual details in dental images and in precise diagnostic reasoning, which limits their use in automated oral healthcare. Method: The largest annotated dental multimodal dataset to date (over 120k images paired with detailed descriptions) is built, and DentalGPT is trained on it in stages with supervised fine-tuning and reinforcement learning to strengthen its understanding of dental visual features and its multimodal reasoning. Result: On intraoral and panoramic image benchmarks and the dental subsets of medical VQA benchmarks, DentalGPT surpasses many advanced MLLMs in disease classification and dental VQA. Conclusion: High-quality dental data combined with a staged adaptation strategy is an effective path to capable, specialized dental multimodal models. Abstract: Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features, making it the multimodal dataset with the most extensive collection of dental images to date. Training on this dataset significantly enhances the MLLM's visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.

[63] Autoregressive Video Autoencoder with Decoupled Temporal and Spatial Context

Cuifeng Shen,Lumin Xu,Xingguo Zhu,Gengdai Liu

Main category: cs.CV

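TL;DR: Proposes ARVAE, a new video autoencoder that decouples temporal and spatial representations in an autoregressive manner, substantially improving video reconstruction quality while keeping compression efficiency high.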

Details Motivation: Existing video autoencoders often entangle spatial and temporal information, which weakens their modeling of temporal consistency and hurts reconstruction performance. Method: ARVAE uses an autoregressive framework in which the encoder extracts motion information (temporal) and newly emerged content (spatial) from the current and previous frames. It introduces a temporal-spatial decoupled representation that combines a downsampled flow field with spatial relative compensation, and it is optimized with a multi-stage training strategy. Result: ARVAE achieves superior reconstruction quality with extremely lightweight models and small-scale data, and shows good potential for downstream use in video generation tasks. Conclusion: With its temporally and spatially decoupled autoregressive architecture, ARVAE separates and exploits temporal and spatial information effectively, improving the efficiency and quality of video compression and reconstruction. Abstract: Video autoencoders compress videos into compact latent representations for efficient reconstruction, playing a vital role in enhancing the quality and efficiency of video generation. However, existing video autoencoders often entangle spatial and temporal information, limiting their ability to capture temporal consistency and leading to suboptimal performance. To address this, we propose Autoregressive Video Autoencoder (ARVAE), which compresses and reconstructs each frame conditioned on its predecessor in an autoregressive manner, allowing flexible processing of videos with arbitrary lengths. ARVAE introduces a temporal-spatial decoupled representation that combines downsampled flow field for temporal coherence with spatial relative compensation for newly emerged content, achieving high compression efficiency without information loss. Specifically, the encoder compresses the current and previous frames into the temporal motion and spatial supplement, while the decoder reconstructs the original frame from the latent representations given the preceding frame. A multi-stage training strategy is employed to progressively optimize the model. Extensive experiments demonstrate that ARVAE achieves superior reconstruction quality with extremely lightweight models and small-scale training data. Moreover, evaluations on video generation tasks highlight its strong potential for downstream applications.

[64] Few-Shot VLM-Based G-Code and HMI Verification in CNC Machining

Yasaman Hashem Pour,Nazanin Mahjourian,Vinh Nguyen

Main category: cs.CV

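TL;DR: This paper proposes a few-shot approach based on a vision-language model (VLM) that verifies G-code and the CNC machine's Human-Machine Interface (HMI) display together, improving error detection for manually written G-code.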

Details Motivation: Existing large language models (LLMs) cannot use visual information from the HMI when verifying G-code, yet the HMI is essential in CNC machining for displaying status and errors, so a verification method that combines the visual and text modalities is needed. Method: A few-shot VLM approach takes paired G-code text and HMI screenshots as input and designs its prompts around a structured JSON schema and examples containing both correct and erroneous cases, jointly verifying the consistency and safety of the G-code and the HMI. Result: Compared with a zero-shot VLM, the approach improves detection of HMI errors and G-code discrepancies across multiple error scenarios, particularly in per-slot accuracy. Conclusion: The proposed few-shot VLM framework supports comprehensive debugging of manually written G-code and is suitable for code verification in CNC training environments. Abstract: Manual generation of G-code is important for learning the operation of CNC machines. Prior work in G-code verification uses Large-Language Models (LLMs), which primarily examine errors in the written programming. However, CNC machining requires extensive use and knowledge of the Human-Machine Interface (HMI), which displays machine status and errors. LLMs currently lack the capability to leverage knowledge of HMIs due to their inability to access the vision modality. This paper proposes a few-shot VLM-based verification approach that simultaneously evaluates the G-code and the HMI display for errors and safety status. The input dataset includes paired G-code text and associated HMI screenshots from a 15-slant-PRO lathe, including both correct and error-prone cases. To enable few-shot learning, the VLM is provided with a structured JSON schema based on prior heuristic knowledge. After determining the prompts, instances of G-code and HMI that either contain errors or are error free are used as few-shot examples to guide the VLM. The model was then evaluated in comparison to a zero-shot VLM through multiple scenarios of incorrect G-code and HMI errors with respect to per-slot accuracy. The VLM showed that few-shot prompting led to overall enhancement of detecting HMI errors and discrepancies with the G-code for more comprehensive debugging. Therefore, the proposed framework was demonstrated to be suitable for verification of manually generated G-code that is typically developed in CNC training.
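A few-shot prompt built around a JSON schema, together with a per-slot accuracy measure, might look like the sketch below. Everything specific here is hypothetical: the slot names in `SCHEMA`, the prompt wording, and the definition of per-slot accuracy as the fraction of schema fields that match the reference are assumptions, and the HMI screenshot that a real VLM call would attach is left out.

```python
import json

# Hypothetical slots; the paper's actual schema is not given in the abstract.
SCHEMA = {"gcode_ok": "bool", "hmi_ok": "bool", "consistent": "bool", "notes": "str"}


def build_prompt(examples, gcode):
    """Assemble the text part of a few-shot prompt (images omitted in this sketch)."""
    parts = ["Return JSON with keys: " + json.dumps(SCHEMA)]
    for example in examples:
        parts.append("G-code:\n" + example["gcode"])
        parts.append("Answer: " + json.dumps(example["answer"]))
    parts.append("G-code:\n" + gcode)
    parts.append("Answer:")
    return "\n\n".join(parts)


def per_slot_accuracy(predicted, reference):
    """Fraction of reference slots whose predicted value matches exactly."""
    slots = list(reference)
    matches = sum(predicted.get(slot) == reference[slot] for slot in slots)
    return matches / len(slots)
```

Scoring each slot separately shows which kind of error the model misses (G-code, HMI, or their mismatch) instead of collapsing everything into one pass or fail verdict.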

[65] MultiEgo: A Multi-View Egocentric Video Dataset for 4D Scene Reconstruction

Bate Li,Houqiang Zhong,Zhengxue Cheng,Qiang Hu,Qiang Wang,Li Song,Wenjun Zhang

Main category: cs.CV

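TL;DR: This paper presents MultiEgo, the first multi-view egocentric dataset for 4D dynamic scene reconstruction. It covers five social interaction scenes, each with five egocentric videos captured by participants wearing AR glasses, with sub-millisecond temporal synchronization and accurate pose annotations, and its utility and effectiveness are validated on free-viewpoint video applications.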

Details Motivation: Existing reconstruction datasets focus on static multi-view or single egocentric view setups; the lack of multi-view egocentric datasets for dynamic scene reconstruction limits progress in the field. Method: A hardware-based data acquisition system and processing pipeline achieve sub-millisecond temporal synchronization across views and provide accurate pose annotations, yielding the MultiEgo dataset of five canonical social interaction scenes. Result: Experimental validation shows that the dataset is practical and effective for free-viewpoint video (FVV) applications and supports high-quality dynamic scene reconstruction. Conclusion: MultiEgo is a foundational resource for multi-view egocentric dynamic scene reconstruction research and may advance related applications such as holographic documentation of social interactions. Abstract: Multi-view egocentric dynamic scene reconstruction holds significant research value for applications in holographic documentation of social interactions. However, existing reconstruction datasets focus on static multi-view or single-egocentric view setups, lacking multi-view egocentric datasets for dynamic scene reconstruction. Therefore, we present MultiEgo, the first multi-view egocentric dataset for 4D dynamic scene reconstruction. The dataset comprises five canonical social interaction scenes: meetings, performances, and a presentation. Each scene provides five authentic egocentric videos captured by participants wearing AR glasses. We design a hardware-based data acquisition system and processing pipeline, achieving sub-millisecond temporal synchronization across views, coupled with accurate pose annotations. Experiment validation demonstrates the practical utility and effectiveness of our dataset for free-viewpoint video (FVV) applications, establishing MultiEgo as a foundational resource for advancing multi-view egocentric dynamic scene reconstruction research.

[66] SATMapTR: Satellite Image Enhanced Online HD Map Construction

Bingyuan Huang,Guanyi Zhao,Qian Xu,Yang Lou,Yung-Hui Li,Jianping Wang

Main category: cs.CV

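TL;DR: This paper proposes SATMapTR, a new online map construction model that fuses satellite images with onboard sensor data to counter the loss of accuracy and robustness that low-quality input causes in real-time HD map construction.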

Details Motivation: Onboard sensors have limited capability and are often occluded, so input data quality is low and real-time HD map construction suffers. Satellite images offer a wide-area view, but their bird's-eye view is often degraded by shadows and occlusion from buildings, and existing methods struggle to fuse these sources effectively. Method: SATMapTR has two key modules: (1) a gated feature refinement module that adaptively filters satellite image features by combining high-level semantics with low-level structural cues; and (2) a geometry-aware fusion module that fuses satellite and BEV features consistently at the grid level. Result: On the nuScenes dataset, SATMapTR reaches an mAP of 73.8, up to 14.2 mAP higher than state-of-the-art satellite-enhanced models; its mAP degrades less under adverse weather and sensor failures, and it achieves nearly 3 times higher mAP at extended perception ranges. Conclusion: With effective satellite feature extraction and a geometry-aligned fusion strategy, SATMapTR substantially improves the accuracy and robustness of online map construction in complex environments, providing a more scalable solution for autonomous driving. Abstract: High-definition (HD) maps are evolving from pre-annotated to real-time construction to better support autonomous driving in diverse scenarios. However, this process is hindered by low-quality input data caused by the limited capability of onboard sensors and frequent occlusions, leading to incomplete, noisy, or missing data, and thus reduced mapping accuracy and robustness. Recent efforts have introduced satellite images as auxiliary input, offering a stable, wide-area view to complement the limited ego perspective. However, satellite images in Bird's Eye View are often degraded by shadows and occlusions from vegetation and buildings. Prior methods using basic feature extraction and fusion remain ineffective. To address these challenges, we propose SATMapTR, a novel online map construction model that effectively fuses satellite images through two key components: (1) a gated feature refinement module that adaptively filters satellite image features by integrating high-level semantics with low-level structural cues to extract high signal-to-noise ratio map-relevant representations; and (2) a geometry-aware fusion module that consistently fuses satellite and BEV features at a grid-to-grid level, minimizing interference from irrelevant regions and low-quality inputs. Experimental results on the nuScenes dataset show that SATMapTR achieves the highest mean average precision (mAP) of 73.8, outperforming state-of-the-art satellite-enhanced models by up to 14.2 mAP. It also shows lower mAP degradation under adverse weather and sensor failures, and achieves nearly 3 times higher mAP at extended perception ranges.
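A gate that filters one feature map before a cell-by-cell merge with another can be sketched in a few lines. The function `gated_fuse`, the sigmoid gate, and the additive merge are assumptions chosen for illustration; the abstract does not state how the gated refinement or the grid-to-grid fusion are actually computed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(bev, satellite, gate_logits):
    """Add gated satellite features to BEV features, cell by cell.

    All three inputs are same-shaped 2D lists; gate_logits would come from
    a learned module in practice (here they are given directly).
    """
    fused = []
    for bev_row, sat_row, gate_row in zip(bev, satellite, gate_logits):
        fused.append(
            [b + sigmoid(g) * s for b, s, g in zip(bev_row, sat_row, gate_row)]
        )
    return fused


bev = [[1.0, 1.0]]
satellite = [[2.0, 2.0]]
gates = [[0.0, -50.0]]  # second cell is treated as unreliable (e.g. shadowed)
print(gated_fuse(bev, satellite, gates))
```

A strongly negative gate leaves the BEV value almost unchanged, which is one way a model could ignore satellite cells degraded by shadows or occlusion.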

[67] KeyframeFace: From Text to Expressive Facial Keyframes

Jingchao Wu,Zejian Kang,Haibo Liu,Yuanchen Fei,Xiangru Huang

Main category: cs.CV

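TL;DR: This paper introduces KeyframeFace, a large-scale multimodal dataset for text-to-facial-animation generation, and designs an interpretable facial motion synthesis framework built on large language model (LLM) priors.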

Details Motivation: Existing datasets and methods mainly focus on speech-driven animation or unstructured expression sequences and lack the semantic grounding and temporal structure needed to generate expressive human performances. Method: The KeyframeFace dataset contains 2,100 expressive scripts with per-frame ARKit coefficients, manually defined keyframes, and multi-perspective annotations produced with LLMs and multimodal LLMs; the paper also proposes the first text-to-animation framework that uses LLM priors for interpretable facial motion synthesis. Result: KeyframeFace provides rich semantic and temporal-structure annotations, and the framework combines the semantic understanding of LLMs with the interpretability of ARKit parameters to produce high-fidelity expressive animation. Conclusion: Together, the KeyframeFace dataset and the LLM-based framework establish a new foundation for interpretable, keyframe-guided, context-aware text-to-animation generation. Abstract: Generating dynamic 3D facial animation from natural language requires understanding both temporally structured semantics and fine-grained expression changes. Existing datasets and methods mainly focus on speech-driven animation or unstructured expression sequences and therefore lack the semantic grounding and temporal structures needed for expressive human performance generation. In this work, we introduce KeyframeFace, a large-scale multimodal dataset designed for text-to-animation research through keyframe-level supervision. KeyframeFace provides 2,100 expressive scripts paired with monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations based on ARKit coefficients and images via Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Beyond the dataset, we propose the first text-to-animation framework that explicitly leverages LLM priors for interpretable facial motion synthesis. This design aligns the semantic understanding capabilities of LLMs with the interpretable structure of ARKit's coefficients, enabling high-fidelity expressive animation. KeyframeFace and our LLM-based framework together establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation. Code and data are available at https://github.com/wjc12345123/KeyframeFace.

[68] MLLM Machine Unlearning via Visual Knowledge Distillation

Yuhang Wang,Zhenxing Niu,Haoxuan Ji,Guangyu He,Haichang Gao,Gang Hua

Main category: cs.CV

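TL;DR: Proposes a Visual Knowledge Distillation (VKD) approach for multimodal large language models (MLLMs) that disentangles visual and textual knowledge, selectively erasing visual knowledge while preserving textual knowledge; it fine-tunes only the visual components to improve unlearning effectiveness and efficiency, and is the first to evaluate robustness against relearning attacks.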

Details Motivation: Existing machine unlearning methods mainly target LLMs, while unlearning for MLLMs is still at an early stage and lacks an effective mechanism for separating visual from textual knowledge. Method: Disentangles the visual and textual knowledge in an MLLM and introduces a Visual Knowledge Distillation (VKD) scheme supervised by intermediate visual representations, fine-tuning only the visual components to achieve selective unlearning. Result: Outperforms state-of-the-art methods in unlearning effectiveness and model utility while being more efficient; provides the first evaluation of MLLM unlearning robustness against relearning attacks. Conclusion: The proposed VKD method selectively unlearns visual knowledge in MLLMs efficiently and effectively while preserving the model's original capabilities, advancing MLLM safety and privacy protection. Abstract: Recently, machine unlearning approaches have been proposed to remove sensitive information from well-trained large models. However, most existing methods are tailored for LLMs, while MLLM-oriented unlearning remains at its early stage. Inspired by recent studies exploring the internal mechanisms of MLLMs, we propose to disentangle the visual and textual knowledge embedded within MLLMs and introduce a dedicated approach to selectively erase target visual knowledge while preserving textual knowledge. Unlike previous unlearning methods that rely on output-level supervision, our approach introduces a Visual Knowledge Distillation (VKD) scheme, which leverages intermediate visual representations within the MLLM as supervision signals. This design substantially enhances both unlearning effectiveness and model utility. Moreover, since our method only fine-tunes the visual components of the MLLM, it offers significant efficiency advantages. Extensive experiments demonstrate that our approach outperforms state-of-the-art unlearning methods in terms of both effectiveness and efficiency. Moreover, we are the first to evaluate the robustness of MLLM unlearning against relearning attacks.

[69] Physics-Informed Video Flare Synthesis and Removal Leveraging Motion Independence between Flare and Scene

Junqiao Wang,Yuanfei Huang,Hua Huang

Main category: cs.CV

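TL;DR: This paper proposes a physics-based dynamic lens flare synthesis method and a video flare removal network that models spatiotemporal dependencies with attention and a Mamba-based component; it also builds the first video flare dataset and improves video flare removal performance.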

Details Motivation: Existing work focuses on image flare removal, while the spatiotemporal characteristics of video flare remain underexplored; the mutually independent motion of flare, light sources, and scene content tends to cause flicker and artifacts during restoration. Method: Proposes a physics-informed dynamic flare synthesis pipeline that simulates light source motion with optical flow, and designs a flare removal network with an attention module and a Mamba-based temporal modeling component to capture long-range spatiotemporal dependencies and avoid the temporal aliasing caused by multi-frame alignment. Result: Outperforms existing video restoration and image flare removal methods on both synthetic and real videos, removing dynamic flares while preserving light source integrity and the spatiotemporal consistency of the scene. Conclusion: With an alignment-free, motion-independent spatiotemporal representation, the method markedly improves video flare removal and offers a new direction and a usable dataset for future work. Abstract: Lens flare is a degradation phenomenon caused by strong light sources. Existing research on flare removal has mainly focused on images, while the spatiotemporal characteristics of video flare remain largely unexplored. Video flare synthesis and removal pose significantly greater challenges than in images, owing to the complex and mutually independent motion of flare, light sources, and scene content. This motion independence further affects restoration performance, often resulting in flicker and artifacts. To address this issue, we propose a physics-informed dynamic flare synthesis pipeline, which simulates light source motion using optical flow and models the temporal behaviors of both scattering and reflective flares. Meanwhile, we design a video flare removal network that employs an attention module to spatially suppress flare regions and incorporates a Mamba-based temporal modeling component to capture long range spatio-temporal dependencies. This motion-independent spatiotemporal representation effectively eliminates the need for multi-frame alignment, alleviating temporal aliasing between flares and scene content and thereby improving video flare removal performance. Building upon this, we construct the first video flare dataset to comprehensively evaluate our method, which includes a large set of synthetic paired videos and additional real-world videos collected from the Internet to assess generalization capability.
Extensive experiments demonstrate that our method consistently outperforms existing video-based restoration and image-based flare removal methods on both real and synthetic videos, effectively removing dynamic flares while preserving light source integrity and maintaining spatiotemporal consistency of scene.

[70] FreqDINO: Frequency-Guided Adaptation for Generalized Boundary-Aware Ultrasound Image Segmentation

Yixuan Zhang,Qing Xu,Yue Li,Xiangjian He,Qian Zhang,Mainul Haque,Rong Qu,Wenting Duan,Zhen Chen

Main category: cs.CV

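TL;DR: Proposes FreqDINO, a frequency-guided segmentation framework that strengthens boundary perception and structural consistency in ultrasound image segmentation.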

Details Motivation: DINOv3 is pre-trained on natural images and is insensitive to the boundary degradation specific to ultrasound images, which limits its segmentation performance on ultrasound. Method: Designs a Multi-scale Frequency Extraction and Alignment (MFEA) strategy that separates low-frequency structures from multi-scale high-frequency boundary details and aligns them via learnable attention; introduces a Frequency-Guided Boundary Refinement (FGBR) module that extracts boundary prototypes from high-frequency components and refines spatial features; and designs a Multi-task Boundary-Guided Decoder (MBGD) to ensure spatial coherence between boundary and semantic predictions. Result: Extensive experiments show that FreqDINO outperforms state-of-the-art methods and generalizes well. Conclusion: FreqDINO improves the accuracy and robustness of ultrasound image segmentation, particularly in the presence of speckle noise and imaging artifacts. Abstract: Ultrasound image segmentation is pivotal for clinical diagnosis, yet challenged by speckle noise and imaging artifacts. Recently, DINOv3 has shown remarkable promise in medical image segmentation with its powerful representation capabilities. However, DINOv3, pre-trained on natural images, lacks sensitivity to ultrasound-specific boundary degradation. To address this limitation, we propose FreqDINO, a frequency-guided segmentation framework that enhances boundary perception and structural consistency. Specifically, we devise a Multi-scale Frequency Extraction and Alignment (MFEA) strategy to separate low-frequency structures and multi-scale high-frequency boundary details, and align them via learnable attention. We also introduce a Frequency-Guided Boundary Refinement (FGBR) module that extracts boundary prototypes from high-frequency components and refines spatial features. Furthermore, we design a Multi-task Boundary-Guided Decoder (MBGD) to ensure spatial coherence between boundary and semantic predictions. Extensive experiments demonstrate that FreqDINO surpasses state-of-the-art methods and achieves remarkable generalization capability. The code is at https://github.com/MingLang-FD/FreqDINO.

[71] UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models

Hewen Pan,Cong Wei,Dashuang Liang,Zepeng Huang,Pengfei Gao,Ziqi Zhou,Lulu Xue,Pengfei Yan,Xiaoming Wei,Minghui Li,Shengshan Hu

Main category: cs.CV

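TL;DR: This paper presents UFVideo, the first Video LLM with unified multi-grained cooperative understanding, which handles video understanding flexibly across global, pixel, and temporal scales and produces multiple task outputs through a unified visual-language alignment mechanism.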

Details Motivation: Existing Video LLMs are limited to specific tasks and lack comprehensive, multi-grained video perception, so a model that handles video understanding at different scales in a unified way is needed. Method: Designs a unified visual-language guided alignment mechanism that lets a single model dynamically encode the visual and text inputs of different tasks and generate textual responses, temporal localization, or grounded masks; also constructs UFVideo-Bench to evaluate multi-grained understanding. Result: Outperforms GPT-4o on UFVideo-Bench, which comprises three collaborative tasks, and is validated on 9 public benchmarks covering a variety of common video understanding tasks. Conclusion: UFVideo achieves unified multi-grained video understanding across scales, showing greater flexibility and broad applicability and pointing to a new direction for future Video LLMs. Abstract: With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed to perform holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks, failing to achieve a comprehensive and multi-grained video perception. To bridge this gap, we introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates the textual response, temporal localization, or grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct the UFVideo-Bench consisting of three distinct collaborative tasks within the scales, which demonstrates UFVideo's flexibility and advantages over GPT-4o. Furthermore, we validate the effectiveness of our model across 9 public benchmarks covering various common video understanding tasks, providing valuable insights for future Video LLMs.

[72] Task-Specific Distance Correlation Matching for Few-Shot Action Recognition

Fei Long,Yao Zhang,Jiaming Lv,Jiangtao Xie,Peihua Li

Main category: cs.CV

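TL;DR: This paper proposes TS-FSAR, a framework that addresses two key problems in few-shot action recognition: existing methods ignore nonlinear relationships and task-specific information when modeling inter-frame dependencies, and the side layers introduced for CLIP fine-tuning are hard to optimize with limited data. The framework has three parts: a visual Ladder Side Network (LSN) for efficient fine-tuning, a task-specific matching metric based on α-distance correlation (TS-DCM), and a guiding module that uses the adapted CLIP (GLAC) to stabilize training. Experiments show that TS-FSAR outperforms existing methods on five benchmarks.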

Details Motivation: Existing few-shot action recognition methods rely only on cosine similarity for set matching, which cannot capture nonlinear inter-frame relationships and lacks task awareness; in addition, side-layer-based CLIP fine-tuning is hard to optimize with few samples. A more robust, learnable, and task-adaptive approach is therefore needed. Method: Proposes the TS-FSAR framework: (1) a Ladder Side Network (LSN) for efficient CLIP fine-tuning; (2) Task-Specific Distance Correlation Matching (TS-DCM), which uses α-distance correlation to model linear and nonlinear inter-frame dependencies and introduces a task prototype for task-aware matching; (3) a GLAC module that regularizes LSN training with the frozen adapted CLIP, improving α-distance correlation estimation under limited supervision. Result: Extensive experiments on five widely used few-shot action recognition benchmarks show that TS-FSAR clearly outperforms existing state-of-the-art methods, confirming its effectiveness and generalization. Conclusion: By introducing nonlinear dependency modeling, a task-specific matching mechanism, and a regularized training strategy based on a frozen backbone, TS-FSAR addresses key challenges in few-shot action recognition and sets a new level of performance. Abstract: Few-shot action recognition (FSAR) has recently made notable progress through set matching and efficient adaptation of large-scale pre-trained models. However, two key limitations persist. First, existing set matching metrics typically rely on cosine similarity to measure inter-frame linear dependencies and then perform matching with only instance-level information, thus failing to capture more complex patterns such as nonlinear relationships and overlooking task-specific cues. Second, for efficient adaptation of CLIP to FSAR, recent work performing fine-tuning via skip-fusion layers (which we refer to as side layers) has significantly reduced memory cost. However, the newly introduced side layers are often difficult to optimize under limited data conditions. To address these limitations, we propose TS-FSAR, a framework comprising three components: (1) a visual Ladder Side Network (LSN) for efficient CLIP fine-tuning; (2) a metric called Task-Specific Distance Correlation Matching (TS-DCM), which uses $α$-distance correlation to model both linear and nonlinear inter-frame dependencies and leverages a task prototype to enable task-specific matching; and (3) a Guiding LSN with Adapted CLIP (GLAC) module, which regularizes LSN using the adapted frozen CLIP to improve training for better $α$-distance correlation estimation under limited supervision. Extensive experiments on five widely-used benchmarks demonstrate that our TS-FSAR yields superior performance compared to prior state-of-the-art methods.
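To make the α-distance correlation mentioned in the TS-FSAR abstract concrete, here is a minimal sketch of the standard sample distance correlation with an exponent `alpha` applied to pairwise Euclidean distances. It is an illustration under assumptions, not the authors' TS-DCM: treating the rows of `x` and `y` as paired frame features is a choice made here, and the task prototype and matching steps are not covered. The function names are hypothetical.

```python
import math


def pairwise_dist(rows, alpha):
    """Pairwise Euclidean distances raised to the power alpha."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) ** alpha for j in range(n)]
            for i in range(n)]


def double_center(d):
    """Subtract row and column means and add back the grand mean."""
    n = len(d)
    row = [sum(r) / n for r in d]
    col = [sum(d[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[d[i][j] - row[i] - col[j] + grand for j in range(n)]
            for i in range(n)]


def mean_product(a, b):
    """Mean of the element-wise product of two n x n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2


def distance_correlation(x, y, alpha=1.0):
    """Sample distance correlation between paired samples x and y."""
    a = double_center(pairwise_dist(x, alpha))
    b = double_center(pairwise_dist(y, alpha))
    denom = math.sqrt(mean_product(a, a) * mean_product(b, b))
    if denom == 0:
        return 0.0
    return math.sqrt(max(mean_product(a, b), 0.0) / denom)


# y = x**2 has zero Pearson correlation with x on this symmetric sample,
# yet the distance correlation is clearly positive.
x = [[-2.0], [-1.0], [0.0], [1.0], [2.0]]
y = [[4.0], [1.0], [0.0], [1.0], [4.0]]
print(distance_correlation(x, y))
```

The example at the bottom shows why such a measure is attractive for matching: a purely nonlinear dependency that a linear measure scores as zero still produces a nonzero value.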

[73] Surveillance Video-Based Traffic Accident Detection Using Transformer Architecture

Tanu Singh,Pranamesh Chakraborty,Long T. Truong

Main category: cs.CV

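TL;DR: This paper proposes a Transformer-based traffic accident detection model that fuses RGB and optical flow features to improve dynamic scene understanding; on a newly curated, diverse dataset it reaches 88.3% accuracy, and the results are compared against mainstream vision language models.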

Details Motivation: Traditional computer vision methods for traffic accident detection have weak spatiotemporal understanding and poor cross-domain generalization, and existing studies do not integrate motion cues effectively, which limits detection in dynamic scenes. Method: Curates a comprehensive, balanced dataset; uses convolutional layers to extract local correlations within a frame and a Transformer to capture temporal dependencies among features; and strengthens dynamic representation by fusing RGB with motion cues such as optical flow. Result: Concatenating RGB features with optical flow features gives the best result, an accuracy of 88.3%; the results are also compared with vision language models such as GPT, Gemini, and LLaVA-NeXT-Video. Conclusion: The proposed Transformer-based model with fused motion cues improves the accuracy and robustness of traffic accident detection, confirming the importance of joint spatiotemporal modeling and motion information. Abstract: Road traffic accidents represent a leading cause of mortality globally, with incidence rates rising due to increasing population, urbanization, and motorization. Rising accident rates raise concerns about traffic surveillance effectiveness. Traditional computer vision methods for accident detection struggle with limited spatiotemporal understanding and poor cross-domain generalization. Recent advances in transformer architectures excel at modeling global spatial-temporal dependencies and parallel computation. However, applying these models to automated traffic accident detection is limited by small, non-diverse datasets, hindering the development of robust, generalizable systems. To address this gap, we curated a comprehensive and balanced dataset that captures a wide spectrum of traffic environments, accident types, and contextual variations. Utilizing the curated dataset, we propose an accident detection model based on a transformer architecture using pre-extracted spatial video features. The architecture employs convolutional layers to extract local correlations across diverse patterns within a frame, while leveraging transformers to capture sequential-temporal dependencies among the retrieved features. Moreover, most existing studies neglect the integration of motion cues, which are essential for understanding dynamic scenes, especially during accidents. These approaches typically rely on static features or coarse temporal information. In this study, multiple methods for incorporating motion cues were evaluated to identify the most effective strategy.
Among the tested input approaches, concatenating RGB features with optical flow achieved the highest accuracy at 88.3%. The results were further compared with vision language models (VLM) such as GPT, Gemini, and LLaVA-NeXT-Video to assess the effectiveness of the proposed method.

[74] A Multi-Mode Structured Light 3D Imaging System with Multi-Source Information Fusion for Underwater Pipeline Detection

Qinghan Hu,Haijiang Zhu,Na Sun,Lei Chen,Zhengqiang Fan,Zhiqing Li

Main category: cs.CV

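TL;DR: This paper proposes UW-SLD, a multi-mode underwater structured light 3D imaging system based on multi-source information fusion for accurate and robust pipeline defect detection; it combines fast distortion correction, factor graph optimization, adaptive filtering, and an edge-detection-enhanced point cloud registration algorithm to achieve stable pose estimation and high-fidelity defect reconstruction in complex underwater environments.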

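Details Motivation: Underwater pipelines corrode easily, manual inspection is inefficient and risky, and existing imaging techniques struggle to be both accurate and robust in complex underwater environments, so an intelligent imaging system that detects pipeline defects accurately, adaptively, and in real time is needed. Method: Proposes the UW-SLD system: uses fast distortion correction (FDC) to rectify images; applies factor graph optimization for extrinsic calibration between the structured light and acoustic sensors; designs a multi-mode 3D imaging strategy to adapt to pipeline geometry; combines multi-source information fusion with an adaptive extended Kalman filter (AEKF) to stabilize pose estimation; and proposes an edge detection-based ICP (ED-ICP) algorithm that fuses an edge detection network with enhanced point cloud registration for high-fidelity defect reconstruction. Result: Experiments show strong performance across operation modes, velocities, and depths: accurate recovery of spatial detail, stable pose estimation with low error, and clearly improved registration robustness and reconstruction fidelity from ED-ICP, giving the overall system strong adaptability and disturbance rejection. Conclusion: Through multi-mode imaging and multi-source information fusion, the UW-SLD system addresses distortion, difficult calibration, and environmental disturbance in underwater pipeline inspection, providing a reliable technical basis for autonomous, high-accuracy defect detection. Abstract: Underwater pipelines are highly susceptible to corrosion, which not only shortens their service life but also poses significant safety risks. Compared with manual inspection, the intelligent real-time imaging system for underwater pipeline detection has become a more reliable and practical solution. Among various underwater imaging techniques, structured light 3D imaging can restore sufficient spatial detail for precise defect characterization. Therefore, this paper develops a multi-mode underwater structured light 3D imaging system for pipeline detection (UW-SLD system) based on multi-source information fusion. First, a rapid distortion correction (FDC) method is employed for efficient underwater image rectification. To overcome the challenges of extrinsic calibration among underwater sensors, a factor graph-based parameter optimization method is proposed to estimate the transformation matrix between the structured light and acoustic sensors. Furthermore, a multi-mode 3D imaging strategy is introduced to adapt to the geometric variability of underwater pipelines. Given the presence of numerous disturbances in underwater environments, a multi-source information fusion strategy and an adaptive extended Kalman filter (AEKF) are designed to ensure stable pose estimation and high-accuracy measurements. In particular, an edge detection-based ICP (ED-ICP) algorithm is proposed.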
This algorithm integrates pipeline edge detection network with enhanced point cloud registration to achieve robust and high-fidelity reconstruction of defect structures even under variable motion conditions. Extensive experiments are conducted under different operation modes, velocities, and depths. The results demonstrate that the developed system achieves superior accuracy, adaptability and robustness, providing a solid foundation for autonomous underwater pipeline detection.

[75] Prior-Enhanced Gaussian Splatting for Dynamic Scene Reconstruction from Casual Video

Meng-Li Shih,Ying-Huan Chen,Yu-Lun Liu,Brian Curless

Main category: cs.CV

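TL;DR: Proposes a fully automatic pipeline for reconstructing dynamic scenes from monocular RGB video by enhancing the priors behind Dynamic Gaussian Splatting; video segmentation combined with epipolar-error maps yields precise object masks that improve depth consistency and 2D track reliability, clearly outperforming prior methods.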

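Details Motivation: Existing monocular dynamic scene reconstruction methods suffer from geometric ambiguity and inconsistent motion in complex dynamic scenes, and need stronger priors to improve reconstruction quality. Method: Generates precise object-level masks from video segmentation and epipolar-error maps; refines depth maps with a mask-guided depth loss; produces reliable 2D tracks through skeleton-based sampling and mask-guided re-identification; and embeds the refined priors in reconstruction through a virtual-view depth loss and a scaffold-projection loss. Result: The method achieves sharper geometric detail and more coherent motion in dynamic scene reconstruction, with renderings visibly better than previous monocular methods. Conclusion: With a multi-stage prior enhancement strategy, the method improves the quality of monocular dynamic scene reconstruction and leads in geometric accuracy and motion consistency. Abstract: We introduce a fully automatic pipeline for dynamic scene reconstruction from casually captured monocular RGB videos. Rather than designing a new scene representation, we enhance the priors that drive Dynamic Gaussian Splatting. Video segmentation combined with epipolar-error maps yields object-level masks that closely follow thin structures; these masks (i) guide an object-depth loss that sharpens the consistent video depth, and (ii) support skeleton-based sampling plus mask-guided re-identification to produce reliable, comprehensive 2-D tracks. Two additional objectives embed the refined priors in the reconstruction stage: a virtual-view depth loss removes floaters, and a scaffold-projection loss ties motion nodes to the tracks, preserving fine geometry and coherent motion. The resulting system surpasses previous monocular dynamic scene reconstruction methods and delivers visibly superior renderings.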

[76] Reliable Detection of Minute Targets in High-Resolution Aerial Imagery across Temporal Shifts

Mohammad Sadegh Gholizadeh,Amir Arsalan Rezapour,Hamidreza Shayegh,Ehsan Pazouki

Main category: cs.CV

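TL;DR: This paper proposes a UAV-based rice seedling detection method built on Faster R-CNN and transfer learning; by curating a high-quality dataset and validating generalization on images from different time periods, it demonstrates the method's effectiveness and robustness for small-object detection in complex agricultural environments.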

Details Motivation: Because targets are small and the environment varies widely, efficient crop detection with UAVs remains challenging in precision agriculture, and detecting rice seedlings in paddy fields in particular needs a solution. Method: Uses a Faster R-CNN architecture initialized via transfer learning and curates a large UAV remote sensing image dataset for training and evaluation, targeting the difficulty of detecting minute objects in high-resolution imagery. Result: The method converges quickly and shows stable detection performance on three test sets acquired at different times, confirming its robustness to changing imaging conditions. Conclusion: Transfer learning clearly improves the detection of small-scale crops in complex agricultural scenes, and the method generalizes well across domains, which supports wider use of UAVs in precision agriculture. Abstract: Efficient crop detection via Unmanned Aerial Vehicles is critical for scaling precision agriculture, yet it remains challenging due to the small scale of targets and environmental variability. This paper addresses the detection of rice seedlings in paddy fields by leveraging a Faster R-CNN architecture initialized via transfer learning. To overcome the specific difficulties of detecting minute objects in high-resolution aerial imagery, we curate a significant UAV dataset for training and rigorously evaluate the model's generalization capabilities. Specifically, we validate performance across three distinct test sets acquired at different temporal intervals, thereby assessing robustness against varying imaging conditions. Our empirical results demonstrate that transfer learning not only facilitates the rapid convergence of object detection models in agricultural contexts but also yields consistent performance despite domain shifts in image acquisition.

[77] Assisted Refinement Network Based on Channel Information Interaction for Camouflaged and Salient Object Detection

Kuan Wang,Yanjun Qin,Mengge Lu,Liejun Wang,Xiaoming Tao

Main category: cs.CV

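TL;DR: This paper proposes a new method for camouflaged object detection (COD) that uses a Channel Information Interaction Module (CIIM) and a prior-guided collaborative decoding architecture to improve feature expressiveness and the joint modeling of boundaries and regions; it achieves state-of-the-art performance on several benchmarks and transfers well to a range of downstream tasks.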

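Details Motivation: Current mainstream methods have two key problems in the decoding stage: insufficient cross-channel information interaction within same-layer features, which limits feature expressiveness, and an inability to jointly model boundary and region information, which makes it hard to reconstruct complete object regions and sharp boundaries. Method: Proposes the Channel Information Interaction Module (CIIM), which introduces a horizontal-vertical integration mechanism in the channel dimension to strengthen cross-channel interaction; designs a prior-guided collaborative decoding architecture in which Boundary Extraction (BE) and Region Extraction (RE) modules generate boundary priors and localization maps, and hybrid attention collaboratively calibrates the decoded features; and adds a Multi-scale Enhancement (MSE) module to enrich contextual feature representations. Result: Extensive experiments on four COD benchmark datasets confirm the method's effectiveness and state-of-the-art performance; the model also adapts well when transferred to salient object detection (SOD) and other downstream tasks such as polyp segmentation, transparent object detection, and industrial and road defect detection. Conclusion: By strengthening cross-channel information interaction and boundary-region collaborative modeling, the proposed ARNet-v2 clearly improves camouflaged object detection and shows good generality and transferability, offering an effective solution for related vision tasks. Abstract: Camouflaged Object Detection (COD) stands as a significant challenge in computer vision, dedicated to identifying and segmenting objects visually highly integrated with their backgrounds. Current mainstream methods have made progress in cross-layer feature fusion, but two critical issues persist during the decoding stage. The first is insufficient cross-channel information interaction within the same-layer features, limiting feature expressiveness. The second is the inability to effectively co-model boundary and region information, making it difficult to accurately reconstruct complete regions and sharp boundaries of objects. To address the first issue, we propose the Channel Information Interaction Module (CIIM), which introduces a horizontal-vertical integration mechanism in the channel dimension. This module performs feature reorganization and interaction across channels to effectively capture complementary cross-channel information. To address the second issue, we construct a collaborative decoding architecture guided by prior knowledge. This architecture generates boundary priors and object localization maps through Boundary Extraction (BE) and Region Extraction (RE) modules, then employs hybrid attention to collaboratively calibrate decoded features, effectively overcoming semantic ambiguity and imprecise boundaries. Additionally, the Multi-scale Enhancement (MSE) module enriches contextual feature representations.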
Extensive experiments on four COD benchmark datasets validate the effectiveness and state-of-the-art performance of the proposed model. We further transferred our model to the Salient Object Detection (SOD) task and demonstrated its adaptability across downstream tasks, including polyp segmentation, transparent object detection, and industrial and road defect detection. Code and experimental results are publicly available at: https://github.com/akuan1234/ARNet-v2.

[78] Out-of-Distribution Segmentation via Wasserstein-Based Evidential Uncertainty

Arnold Brosch,Abdelrahman Eldesokey,Michael Felsberg,Kira Maag

Main category: cs.CV

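TL;DR: Proposes an evidential segmentation framework based on a Wasserstein loss, combined with KL regularization and a Dice structural consistency term, that improves OOD segmentation in open-world scenarios.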

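Details Motivation: Deep neural networks perform well in semantic segmentation but are restricted to predefined classes, so they struggle to recognize out-of-distribution (OOD) objects in the open world, which undermines safety-critical applications. Method: Uses a Wasserstein loss to capture distributional distances while respecting the geometry of the probability simplex, and combines it with Kullback-Leibler regularization and a Dice structural consistency term to build an evidential segmentation framework. Result: Outperforms uncertainty-based approaches on OOD segmentation, recognizing and segmenting OOD objects more reliably. Conclusion: By modeling distances between distributions and adding structural constraints, the method improves the robustness and safety of semantic segmentation in open environments. Abstract: Deep neural networks achieve superior performance in semantic segmentation, but are limited to a predefined set of classes, which leads to failures when they encounter unknown objects in open-world scenarios. Recognizing and segmenting these out-of-distribution (OOD) objects is crucial for safety-critical applications such as automated driving. In this work, we present an evidential segmentation framework using a Wasserstein loss, which captures distributional distances while respecting the probability simplex geometry. Combined with Kullback-Leibler regularization and Dice structural consistency terms, our approach leads to improved OOD segmentation performance compared to uncertainty-based approaches.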
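The abstract names a Dice structural consistency term without giving its form. As a rough illustration only, the sketch below implements the commonly used soft Dice loss on flattened per-pixel probabilities; whether the paper uses exactly this form, and how it is weighted against the Wasserstein and KL terms, is not stated in the abstract. The function name and the smoothing constant `eps` are assumptions.

```python
def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P.T| / (|P| + |T|), with smoothing eps.

    pred:   per-pixel foreground probabilities in [0, 1]
    target: per-pixel ground-truth labels (0 or 1)
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


# Perfect overlap gives a loss near 0, disjoint masks a loss near 1.
print(soft_dice_loss([1.0, 0.0, 1.0], [1, 0, 1]))
print(soft_dice_loss([1.0, 0.0], [0, 1]))
```

Because the loss depends on region overlap rather than on each pixel independently, it penalizes fragmented predictions, which is the usual motivation for adding a Dice term alongside a per-pixel objective.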

[79] The N-Body Problem: Parallel Execution from Single-Person Egocentric Video

Zhifan Zhu,Yifei Huang,Yoichi Sato,Dima Damen

Main category: cs.CV

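TL;DR: This paper introduces the "N-Body Problem": inferring from a single egocentric video how multiple people could perform the same tasks in parallel, maximizing speed-up while avoiding physical conflicts. The authors formalize the problem, propose evaluation metrics, and use structured prompting to guide a vision-language model to reason about the 3D environment, object usage, and temporal dependencies; experiments show that the method substantially raises task coverage and lowers conflict rates.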

Details Motivation: To learn from a single-viewpoint video whether and how efficiently several people could perform tasks in parallel, and to fix the unrealistic assignments that arise when spatial, object, and causal constraints are ignored. Method: Defines the N-Body Problem and its evaluation suite, and designs a structured prompting strategy in which a vision-language model (VLM) reasons about the 3D environment, object usage, and temporal dependencies to produce a feasible parallel execution plan. Result: On 100 videos from EPIC-Kitchens and HD-EPIC with N = 2, the method raises action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, and cuts collisions, object conflicts, and causal conflicts by 55%, 45%, and 55% respectively. Conclusion: Structured prompting of a VLM can address the N-Body Problem effectively, parallelizing tasks efficiently while keeping them physically feasible, and offers a new angle on multi-agent collaboration. Abstract: Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how N individuals can hypothetically perform the same set of tasks observed in this video. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios like two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts and causal constraints). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 boosts action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while simultaneously slashing collision rates, object and causal conflicts by 55%, 45% and 55% respectively.
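The abstract lists speed-up and object conflicts among its metrics but does not define them. The sketch below shows one plausible reading, purely as an assumption: speed-up is the total sequential duration divided by the makespan of the parallel schedule, and an object conflict is a pair of time-overlapping segments assigned to different people that use the same object. The segment dictionary layout and function names are hypothetical.

```python
from itertools import combinations


def speed_up(segments):
    """Sequential duration divided by the parallel makespan (assumed definition)."""
    sequential = sum(s["end"] - s["start"] for s in segments)
    makespan = max(s["end"] for s in segments) - min(s["start"] for s in segments)
    return sequential / makespan


def object_conflicts(segments):
    """Count overlapping segment pairs of different people sharing an object."""
    count = 0
    for a, b in combinations(segments, 2):
        different_people = a["person"] != b["person"]
        overlap = a["start"] < b["end"] and b["start"] < a["end"]
        shared = set(a["objects"]) & set(b["objects"])
        if different_people and overlap and shared:
            count += 1
    return count


# Hypothetical two-person schedule; times are positions in the parallel plan.
schedule = [
    {"person": 0, "start": 0, "end": 10, "objects": ["knife"]},
    {"person": 1, "start": 0, "end": 6, "objects": ["pan"]},
    {"person": 1, "start": 6, "end": 10, "objects": ["knife"]},
]
print(speed_up(schedule), object_conflicts(schedule))
```

In this toy schedule the plan halves the total time but has both people holding the knife between t = 6 and t = 10, which is the kind of infeasible assignment the feasibility metrics are meant to expose.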

[80] FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing

Yilei Jiang,Zhen Wang,Yanghao Wang,Jun Yu,Yueting Zhuang,Jun Xiao,Long Chen

Main category: cs.CV

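TL;DR: This paper proposes FlowDC, a method for balancing semantic alignment and source consistency in complex text-guided image editing. It decouples the edit into multiple sub-editing effects applied in parallel and decays the velocity component orthogonal to the editing displacement, improving edit quality. The authors also build the Complex-PIE-Bench benchmark for evaluating complex edits, and experiments show that FlowDC outperforms existing methods on several metrics.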

Details Motivation: Existing complex text-guided image editing methods are limited by long-text following or by cumulative inconsistency when handling multiple editing targets, and struggle to keep both semantic accuracy and source image structure, so a more effective framework for complex editing is needed. Method: Proposes FlowDC, which decouples a complex edit into multiple sub-editing effects and superposes them in parallel during editing; it also decomposes the velocity field and decays the component orthogonal to the editing displacement to better preserve source structure. In addition, Complex-PIE-Bench is built as a benchmark for complex editing. Result: On two benchmarks FlowDC outperforms existing methods, achieving a better balance between semantic alignment and source consistency in complex editing; ablations confirm the contribution of each module. Conclusion: Through parallel decoupled editing paths and velocity decomposition, FlowDC addresses key challenges in complex text-guided image editing, performing precise multi-target edits while preserving source structure. Abstract: With the surge of pre-trained text-to-image flow matching models, text-based image editing performance has gained remarkable improvement, especially for simple editing that only contains a single editing target. To satisfy the exploding editing requirements, complex editing, which contains multiple editing targets, poses a more challenging task. However, current complex editing solutions, single-round and multi-round editing, are limited by long text following and cumulative inconsistency, respectively. Thus, they struggle to strike a balance between semantic alignment and source consistency. In this paper, we propose FlowDC, which decouples the complex editing into multiple sub-editing effects and superposes them in parallel during the editing process. Meanwhile, we observed that the velocity component that is orthogonal to the editing displacement harms source structure preservation. Thus, we decompose the velocity and decay the orthogonal part for better source consistency. To evaluate the effectiveness of complex editing settings, we construct a complex editing benchmark: Complex-PIE-Bench. On two benchmarks, FlowDC shows superior results compared with existing methods. We also detail the ablations of our module designs.
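The decompose-and-decay step described in the FlowDC abstract can be pictured with a plain vector projection. The sketch below is an illustration under assumptions, not the authors' implementation: it splits a velocity vector into the part parallel to an editing displacement and the part orthogonal to it, then scales the orthogonal part by a fixed `decay` factor. The constant factor and the function name are hypothetical; the paper's actual decay schedule is not given in the abstract.

```python
def decay_orthogonal(velocity, displacement, decay=0.5):
    """Keep the velocity component along `displacement`, shrink the rest."""
    dot_vd = sum(v * d for v, d in zip(velocity, displacement))
    dot_dd = sum(d * d for d in displacement)
    if dot_dd == 0:
        # No editing direction is defined, so leave the velocity unchanged.
        return list(velocity)
    scale = dot_vd / dot_dd
    parallel = [scale * d for d in displacement]
    orthogonal = [v - p for v, p in zip(velocity, parallel)]
    return [p + decay * o for p, o in zip(parallel, orthogonal)]


# The component along the displacement (3) is kept, the orthogonal one (4) is halved.
print(decay_orthogonal([3.0, 4.0], [1.0, 0.0], decay=0.5))  # [3.0, 2.0]
```

Setting `decay=1.0` recovers the original velocity, while `decay=0.0` keeps only motion along the editing direction.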

[81] Collaborative Reconstruction and Repair for Multi-class Industrial Anomaly Detection

Qishan Wang,Haofeng Wang,Shuyong Gao,Jia Guo,Li Xiong,Jiaqi Li,Dengxuan Bai,Wenqiang Zhang

Main category: cs.CV

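TL;DR: Proposes Collaborative Reconstruction and Repair (CRR), a unified framework for multi-class industrial anomaly detection that turns reconstruction into repair to mitigate the identity mapping problem, and achieves state-of-the-art performance on several datasets.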

Details Motivation: To address the detection failures that the identity mapping problem causes in reconstruction networks for multi-class anomaly detection, and the high memory cost and poor generalization of building a separate model per class. Method: Designs the CRR framework, which optimizes the decoder to reconstruct normal samples and repair synthesized anomalies; applies feature-level random masking to retain more local information; and trains a segmentation network supervised by synthetic anomaly masks to improve localization accuracy. Result: Experiments show that CRR mitigates the identity mapping problem and outperforms existing methods on several industrial anomaly detection datasets. Conclusion: CRR offers an efficient, general solution for multi-class industrial anomaly detection and clearly improves anomaly localization and detection. Abstract: Industrial anomaly detection is a challenging open-set task that aims to identify unknown anomalous patterns deviating from normal data distribution. To avoid the significant memory consumption and limited generalizability brought by building separate models per class, we focus on developing a unified framework for multi-class anomaly detection. However, under this challenging setting, conventional reconstruction-based networks often suffer from an identity mapping problem, where they directly replicate input features regardless of whether they are normal or anomalous, resulting in detection failures. To address this issue, this study proposes a novel framework termed Collaborative Reconstruction and Repair (CRR), which transforms reconstruction into repair. First, we optimize the decoder to reconstruct normal samples while repairing synthesized anomalies. Consequently, it generates distinct representations for anomalous regions and similar representations for normal areas compared to the encoder's output. Second, we implement feature-level random masking to ensure that the representations from the decoder contain sufficient local information. Finally, to minimize detection errors arising from the discrepancies between feature representations from the encoder and decoder, we train a segmentation network supervised by synthetic anomaly masks, thereby enhancing localization performance. Extensive experiments on industrial datasets demonstrate that CRR effectively mitigates the identity mapping issue and achieves state-of-the-art performance in multi-class industrial anomaly detection.

[82] JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion

Chaochao Li,Ruikui Wang,Liangbo Zhou,Jinheng Feng,Huaishao Luo,Huan Zhang,Youzheng Wu,Xiaodong He

Main category: cs.CV

TL;DR: Proposes JoyAvatar, a DiT-based audio-driven autoregressive model that supports real-time inference and infinite-length video generation.

Details Motivation: Existing audio-driven avatar generation methods are limited by high computational overhead and an inability to generate long videos, and autoregressive methods suffer from error accumulation and quality degradation. Method: Proposes three techniques: Progressive Step Bootstrapping (PSB) to stabilize generation of the initial frames; Motion Condition Injection (MCI) to strengthen temporal coherence; and Unbounded RoPE via Cache-Resetting (URCR) to enable infinite-length generation. Result: The 1.3B-parameter causal model runs at 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization. Conclusion: JoyAvatar addresses error accumulation and computational cost in long-duration generation, enabling high-quality, infinite-length audio-driven avatar generation. Abstract: Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by limitations such as high computational overhead and the inability to synthesize long-duration videos. Autoregressive methods address this problem by applying block-wise autoregressive diffusion methods. However, these methods suffer from the problem of error accumulation and quality degradation. To address this, we propose JoyAvatar, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), enhancing temporal coherence by injecting noise-corrupted previous frames as motion condition; and (3) Unbounded RoPE via Cache-Resetting (URCR), enabling infinite-length generation through dynamic positional encoding. Our 1.3B-parameter causal model achieves 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization.

[83] Flowception: Temporally Expansive Flow Matching for Video Generation

Tariq Berrada Ifriqi,John Nguyen,Karteek Alahari,Jakob Verbeek,Ricky T. Q. Chen

Main category: cs.CV

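TL;DR: Flowception is a novel non-autoregressive, variable-length video generation framework that generates video along a probability path interleaving discrete frame insertions with continuous frame denoising, reducing error accumulation and lowering training compute.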

Details Motivation: Existing autoregressive methods accumulate errors, while full-sequence flow models are computationally expensive and handle variable-length video poorly, so a more efficient and flexible video generation framework is needed. Method: Flowception works non-autoregressively, alternating discrete frame insertions with continuous frame denoising along a probability path; it reduces training FLOPs, supports local attention, and learns video length jointly with content. Result: Improves FVD and VBench metrics over autoregressive and full-sequence baselines, and integrates tasks such as image-to-video generation and video interpolation seamlessly. Conclusion: Flowception offers an efficient, flexible approach to video generation that balances performance with multi-task compatibility and suits long and variable-length video generation. Abstract: We present Flowception, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation/drift as the frame insertion mechanism during sampling serves as an efficient compression mechanism to handle long-term context. Compared to full-sequence flows, our method reduces FLOPs for training three-fold, while also being more amenable to local attention variants, and allowing to learn the length of videos jointly with their content. Quantitative experimental results show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated with qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation.

[84] YawDD+: Frame-level Annotations for Accurate Yawn Prediction

Ahmed Mujtaba,Gleb Radchenko,Marc Masana,Radu Prodan

Main category: cs.CV

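TL;DR: This paper proposes a semi-automated labeling pipeline with human verification that improves the quality of the driver yawning detection dataset, yielding YawDD+. Training MNasNet and YOLOv11 on the improved data raises frame-level accuracy and mAP and enables real-time fatigue monitoring on edge devices.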

Details Motivation: Existing machine learning approaches rely on video-level annotations whose coarse temporal labels introduce systematic noise and degrade fatigue detection, so more precise annotations are needed. Method: A semi-automated labeling pipeline with human-in-the-loop verification is applied to YawDD to produce the frame-level annotated YawDD+ dataset, which is then used to train an MNasNet classifier and a YOLOv11 detector. Result: Models trained on YawDD+ improve frame accuracy by up to 6% and mAP by 5% over video-level supervision, reaching 99.34% classification accuracy and 95.69% detection mAP, with up to 59.8 FPS real-time inference on an NVIDIA Jetson Nano. Conclusion: Higher-quality annotations substantially improve driver fatigue detection and allow real-time monitoring on edge devices without server-side computation, underlining the value of data quality. Abstract: Driver fatigue remains a leading cause of road accidents, with 24% of crashes involving drowsy drivers. While yawning serves as an early behavioral indicator of fatigue, existing machine learning approaches face significant challenges due to video-annotated datasets that introduce systematic noise from coarse temporal annotations. We develop a semi-automated labeling pipeline with human-in-the-loop verification, which we apply to YawDD, enabling more accurate model training. Training the established MNasNet classifier and YOLOv11 detector architectures on YawDD+ improves frame accuracy by up to 6% and mAP by 5% over video-level supervision, achieving 99.34% classification accuracy and 95.69% detection mAP. The resulting approach delivers up to 59.8 FPS on edge AI hardware (NVIDIA Jetson Nano), confirming that enhanced data quality alone supports on-device yawning monitoring without server-side computation.

[85] Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation

Jingmin Zhu,Anqi Zhu,Hossein Rahmani,Jun Liu,Mohammed Bennamoun,Qiuhong Ke

Main category: cs.CV

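TL;DR: Skeleton-Cache is a training-free test-time adaptation framework for skeleton-based zero-shot action recognition. It combines global and fine-grained local descriptors and uses large language models for semantic guidance, improving generalization to unseen actions.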

Details Motivation: 为了提高在推理过程中对未见过的动作的模型泛化能力,解决现有方法在零样本动作识别中的局限性。 Method: 将推理过程重构为轻量级检索过程,使用非参数缓存存储结构化骨架表示,并结合大语言模型为不同类别分配重要性权重以融合预测结果。 Result: 在NTU RGB+D 60/120和PKU-MMD II数据集上的实验表明,Skeleton-Cache在零样本和广义零样本设置下均能持续提升多种骨干网络的性能。 Conclusion: Skeleton-Cache通过结合结构化描述符与大语言模型引导的语义先验,在无需额外训练的情况下有效提升了模型对未见动作的识别能力。 Abstract: We introduce Skeleton-Cache, the first training-free test-time adaptation framework for skeleton-based zero-shot action recognition (SZAR), aimed at improving model generalization to unseen actions during inference. Skeleton-Cache reformulates inference as a lightweight retrieval process over a non-parametric cache that stores structured skeleton representations, combining both global and fine-grained local descriptors. To guide the fusion of descriptor-wise predictions, we leverage the semantic reasoning capabilities of large language models (LLMs) to assign class-specific importance weights. By integrating these structured descriptors with LLM-guided semantic priors, Skeleton-Cache dynamically adapts to unseen actions without any additional training or access to training data. Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. The code is publicly available at https://github.com/Alchemist0754/Skeleton-Cache.
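The retrieval-and-fusion idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the cache layout, the max-similarity scoring rule, the descriptor names (`global`, `hand`), and the weight table standing in for LLM-assigned importance are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cache_scores(query, cache):
    """Score each class by its best-matching cached feature (assumed rule)."""
    return {label: max(cosine(query, f) for f in feats)
            for label, feats in cache.items()}


def fuse(queries, caches, weights):
    """Fuse per-descriptor scores with class-specific weights.

    queries: {descriptor: feature}
    caches:  {descriptor: {label: [features]}}
    weights: {label: {descriptor: weight}}  (hypothetical LLM-provided priors)
    """
    fused = {}
    for desc, q in queries.items():
        for label, s in cache_scores(q, caches[desc]).items():
            fused[label] = fused.get(label, 0.0) + weights[label][desc] * s
    return max(fused, key=fused.get), fused


# Toy example with two descriptors and two classes.
queries = {"global": [1.0, 0.0], "hand": [0.0, 1.0]}
caches = {
    "global": {"wave": [[1.0, 0.0]], "kick": [[0.0, 1.0]]},
    "hand": {"wave": [[0.0, 1.0]], "kick": [[1.0, 0.0]]},
}
weights = {"wave": {"global": 0.3, "hand": 0.7},
           "kick": {"global": 0.5, "hand": 0.5}}
label, scores = fuse(queries, caches, weights)
print(label)
```

Because nothing is trained, adapting to a new class in this sketch only requires adding entries to the cache and a row to the weight table.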

[86] Exploring MLLM-Diffusion Information Transfer with MetaCanvas

Han Lin,Xichen Pan,Ziqi Huang,Ji Hou,Jialiang Wang,Weifeng Chen,Zecheng He,Felix Juefei-Xu,Junzhe Sun,Zhipeng Fan,Ali Thabet,Mohit Bansal,Chu Wang

Main category: cs.CV

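TL;DR: MetaCanvas is a lightweight framework that lets multimodal large language models (MLLMs) reason and plan directly in latent space, giving precise, structured control over image and video generation.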

Details Motivation: 当前多模态大语言模型在视觉理解方面能力强,但在生成任务中仅被用作文本编码器,未能充分利用其推理能力,导致理解和生成之间的能力差距。 Method: 提出 MetaCanvas 框架,让 MLLMs 在空间和时空潜在空间中进行推理与规划,并紧密连接扩散生成模型;在三种不同的扩散模型基础上进行实现,并在六项任务上进行评估。 Result: MetaCanvas 在文本到图像、图文到视频、编辑和上下文内视频生成等任务上均优于全局条件基线方法,展现出更强的布局控制、属性绑定和推理能力。 Conclusion: 将 MLLMs 视为潜在空间中的规划器是一种缩小多模态理解与生成之间差距的有前景方向。 Abstract: Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We empirically implement MetaCanvas on three different diffusion backbones and evaluate it across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation.

[87] DOS: Distilling Observable Softmaps of Zipfian Prototypes for Self-Supervised Point Representation

Mohamed Abdelsamad,Michael Ulrich,Bin Yang,Miao Zhang,Yakov Miron,Abhinav Valada

Main category: cs.CV

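TL;DR: DOS learns 3D point cloud representations by self-distilling semantic relevance softmaps at observable points. It combines Zipfian prototypes with a Zipf-Sinkhorn algorithm to handle semantic imbalance without supervision and achieves state-of-the-art results on multiple benchmarks.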

Details Motivation: 现有3D点云自监督学习面临不规则几何、重建捷径和语义分布不平衡等挑战,尤其是掩码区域信息泄漏和离散分配监督不足的问题。 Method: 提出DOS框架,仅在未被掩码的可观测点进行语义相关软图的自蒸馏;引入Zipfian原型并设计Zipf-Sinkhorn算法,施加幂律先验以平衡原型使用,并调节目标软图锐度。 Result: 在nuScenes、Waymo、SemanticKITTI、ScanNet和ScanNet200等多个基准上,DOS在语义分割和3D目标检测任务中均超越当前最先进方法,且无需额外数据或标注。 Conclusion: 基于可观测点的软图自蒸馏是一种可扩展且有效的学习鲁棒3D表示范式。 Abstract: Recent advances in self-supervised learning (SSL) have shown tremendous potential for learning 3D point cloud representations without human annotations. However, SSL for 3D point clouds still faces critical challenges due to irregular geometry, shortcut-prone reconstruction, and unbalanced semantics distribution. In this work, we propose DOS (Distilling Observable Softmaps), a novel SSL framework that self-distills semantic relevance softmaps only at observable (unmasked) points. This strategy prevents information leakage from masked regions and provides richer supervision than discrete token-to-prototype assignments. To address the challenge of unbalanced semantics in an unsupervised setting, we introduce Zipfian prototypes and incorporate them using a modified Sinkhorn-Knopp algorithm, Zipf-Sinkhorn, which enforces a power-law prior over prototype usage and modulates the sharpness of the target softmap during training. DOS outperforms current state-of-the-art methods on semantic segmentation and 3D object detection across multiple benchmarks, including nuScenes, Waymo, SemanticKITTI, ScanNet, and ScanNet200, without relying on extra data or annotations. Our results demonstrate that observable-point softmaps distillation offers a scalable and effective paradigm for learning robust 3D representations.
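To make the idea of a power-law prior over prototype usage concrete, the sketch below runs a generic Sinkhorn-Knopp iteration whose column marginal follows a Zipf distribution instead of the usual uniform one. It is an illustration under assumptions, not the paper's Zipf-Sinkhorn: the exponent `s`, the temperature `eps` (used here as the sharpness knob), and the iteration count are hypothetical choices.

```python
import math


def zipf_marginal(k, s=1.0):
    """Target prototype usage p_j proportional to 1 / (j + 1)^s."""
    w = [1.0 / (j + 1) ** s for j in range(k)]
    z = sum(w)
    return [x / z for x in w]


def sinkhorn(scores, col_marginal, eps=0.5, iters=100):
    """Soft assignments whose rows sum to 1 and whose average column
    mass approaches col_marginal.

    scores: n x k list of point-to-prototype similarities.
    eps:    temperature; smaller values give sharper assignments.
    """
    n, k = len(scores), len(scores[0])
    m = max(max(row) for row in scores)  # subtract the max for stability
    q = [[math.exp((v - m) / eps) for v in row] for row in scores]
    for _ in range(iters):
        # Rescale columns to match the prototype marginal.
        for j in range(k):
            c = sum(q[i][j] for i in range(n))
            for i in range(n):
                q[i][j] *= col_marginal[j] / c
        # Rescale rows so that each point carries mass 1 / n.
        for i in range(n):
            r = sum(q[i])
            for j in range(k):
                q[i][j] *= (1.0 / n) / r
    return [[v * n for v in row] for row in q]


scores = [[0.9, 0.1, 0.2],
          [0.8, 0.3, 0.1],
          [0.2, 0.7, 0.4],
          [0.1, 0.2, 0.9]]
targets = sinkhorn(scores, zipf_marginal(3))
for row in targets:
    print([round(v, 3) for v in row])
```

With a uniform marginal this reduces to the standard balanced assignment used in clustering-based self-supervised methods; the Zipf marginal instead lets a few prototypes absorb most points, which matches long-tailed class frequencies better.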

[88] CADMorph: Geometry-Driven Parametric CAD Editing via a Plan-Generate-Verify Loop

Weijian Ma,Shizhao Sun,Ruiyu Wang,Jiang Bian

Main category: cs.CV

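TL;DR: This paper presents CADMorph, an iterative plan-generate-verify framework built on pretrained foundation models that addresses structure preservation, semantic validity, and shape fidelity in geometry-driven parametric CAD editing.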

Details Motivation: During iterative design, changes to the geometric shape require synchronized edits to the underlying parametric sequence, but existing methods struggle to satisfy structure preservation, semantic validity, and high shape fidelity at the same time, and are limited by scarce editing data triplets. Method: CADMorph works in three stages: (1) planning, where cross-attention maps from the P2S diffusion model locate the segments to modify and produce editing masks; (2) generation, where the MPP model infills the masks with semantically valid parameter edits; (3) verification, where the P2S model embeds each candidate sequence in shape-latent space, compares it with the target shape, and selects the best one. Result: CADMorph outperforms GPT-4o and specialized CAD baselines, supports applications such as iterative editing and reverse-engineering enhancement, and does not need scarce triplet data for training. Conclusion: By jointly exploiting the geometric awareness and design knowledge of pretrained models, CADMorph addresses the core challenges of geometry-driven CAD editing and is practical and extensible. Abstract: A Computer-Aided Design (CAD) model encodes an object in two coupled forms: a parametric construction sequence and its resulting visible geometric shape. During iterative design, adjustments to the geometric shape inevitably require synchronized edits to the underlying parametric sequence, called geometry-driven parametric CAD editing. The task calls for 1) preserving the original sequence's structure, 2) ensuring each edit's semantic validity, and 3) maintaining high shape fidelity to the target shape, all under scarce editing data triplets. We present CADMorph, an iterative plan-generate-verify framework that orchestrates pretrained domain-specific foundation models during inference: a parameter-to-shape (P2S) latent diffusion model and a masked-parameter-prediction (MPP) model. In the planning stage, cross-attention maps from the P2S model pinpoint the segments that need modification and offer editing masks. The MPP model then infills these masks with semantically valid edits in the generation stage. During verification, the P2S model embeds each candidate sequence in shape-latent space, measures its distance to the target shape, and selects the closest one. The three stages leverage the inherent geometric consciousness and design knowledge in pretrained priors, and thus tackle structure preservation, semantic validity, and shape fidelity respectively. Besides, both P2S and MPP models are trained without triplet data, bypassing the data-scarcity bottleneck. CADMorph surpasses GPT-4o and specialized CAD baselines, and supports downstream applications such as iterative editing and reverse-engineering enhancement.
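The control flow of a plan-generate-verify loop can be sketched as follows. The three callables are hypothetical stand-ins: `plan` plays the role of attention-based mask selection, `generate` that of masked-parameter infilling, and `embed` that of the shape-latent encoder. The toy versions below operate on plain number lists and are assumptions for illustration only, not the paper's models.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two latent vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def plan_generate_verify(sequence, target_latent, plan, generate, embed, rounds=3):
    """Iteratively edit `sequence` so that its embedding approaches the target."""
    best = list(sequence)
    best_dist = squared_distance(embed(best), target_latent)
    for _ in range(rounds):
        mask = plan(best, target_latent)       # which positions to edit
        for cand in generate(best, mask):      # candidate edits for the mask
            d = squared_distance(embed(cand), target_latent)
            if d < best_dist:                  # verify: keep the closest candidate
                best, best_dist = list(cand), d
    return best, best_dist


# Toy stand-ins (hypothetical): the "latent" is simply the parameter list.
def toy_embed(seq):
    return [float(v) for v in seq]


def toy_plan(seq, target):
    errs = [abs(e - t) for e, t in zip(toy_embed(seq), target)]
    return [errs.index(max(errs))]             # mask the worst-matching parameter


def toy_generate(seq, mask):
    candidates = []
    for delta in (-1, 1):
        cand = list(seq)
        for i in mask:
            cand[i] += delta
        candidates.append(cand)
    return candidates


edited, dist = plan_generate_verify([1, 5, 3], [1.0, 3.0, 3.0],
                                    toy_plan, toy_generate, toy_embed)
print(edited, dist)
```

Only masked positions are ever changed, which is how a loop of this shape preserves the rest of the original sequence.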

[89] VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing

Emanuel Sánchez Aimar,Gulnaz Zhambulova,Fahad Shahbaz Khan,Yonghao Xu,Michael Felsberg

Main category: cs.CV

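TL;DR: This paper proposes VLM2GeoVec, a single-encoder vision-language model trained contrastively to embed images, text, bounding boxes, and geographic coordinates in one shared vector space. It unifies cross-modal retrieval with region-level spatial reasoning for remote sensing and clearly outperforms existing methods on the newly introduced RSMEB benchmark.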

Details Motivation: Remote-sensing image analysis is split between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot handle interleaved modalities, and generative assistants, which support region-level understanding but lack scalable retrieval. A unified model is needed that supports both efficient retrieval and fine-grained spatial reasoning. Method: VLM2GeoVec uses a single-encoder architecture that jointly encodes interleaved inputs (images, text, bounding boxes, and geographic coordinates) into a unified vector space and is trained end-to-end with a contrastive loss, avoiding multi-stage pipelines and task-specific modules. The authors also build RSMEB, a new benchmark covering six core remote-sensing embedding tasks. Result: On RSMEB, VLM2GeoVec reaches 26.6% P@1 on region-caption retrieval (25 pp above dual-encoder baselines), 32.5% P@1 on referring-expression retrieval (+19 pp), and 17.8% P@1 on semantic geo-localization retrieval (more than 3x the prior best), while matching or exceeding specialized models on conventional tasks such as scene classification and cross-modal retrieval. Conclusion: VLM2GeoVec unifies scalable cross-modal retrieval with region-level spatial reasoning, offering a compact and efficient solution for multimodal analysis in remote sensing. Abstract: Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose $\textbf{VLM2GeoVec}$, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce $\textbf{RSMEB}$, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual-question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, it achieves $\textbf{26.6\%}$ P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), $\textbf{32.5\%}$ P@1 on referring-expression retrieval (+19 pp), and $\textbf{17.8\%}$ P@1 on semantic geo-localization retrieval (over $3\times$ prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
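The P@1 figures above are precision-at-1 scores for embedding retrieval. A minimal sketch of how such a score is commonly computed is shown below, assuming cosine similarity as the ranking function and exactly one correct candidate per query; the benchmark's actual protocol may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def precision_at_1(query_vecs, cand_vecs, gt_index):
    """Fraction of queries whose top-ranked candidate is the ground truth.

    gt_index[i] is the index in cand_vecs of the correct match for query i.
    """
    hits = 0
    for q, gt in zip(query_vecs, gt_index):
        sims = [cosine(q, c) for c in cand_vecs]
        top = max(range(len(sims)), key=sims.__getitem__)
        hits += int(top == gt)
    return hits / len(query_vecs)


# Toy example: three queries, two candidates; the third query is a miss.
queries = [[1.0, 0.0], [0.0, 1.0], [0.2, 1.0]]
candidates = [[1.0, 0.1], [0.1, 1.0]]
print(precision_at_1(queries, candidates, [0, 1, 0]))
```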

[90] TSkel-Mamba: Temporal Dynamic Modeling via State Space Model for Human Skeleton-based Action Recognition

Yanan Liu,Jun Liu,Hao Zhang,Dan Xu,Hossein Rahmani,Mohammed Bennamoun,Qiuhong Ke

Main category: cs.CV

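TL;DR: TSkel-Mamba is a hybrid Transformer-Mamba framework for skeleton-based action recognition. A multi-scale temporal interaction module strengthens cross-channel modeling, yielding efficient, state-of-the-art performance on several datasets.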

Details Motivation: Mamba models 1D temporal sequences well, but it processes each channel independently, which limits its ability to capture inter-channel dependencies and makes it a poor fit for skeleton-based action recognition out of the box. Method: TSkel-Mamba is a hybrid framework in which a Spatial Transformer learns spatial features and Mamba models temporal dynamics. A Temporal Dynamic Modeling (TDM) block contains a Multi-scale Temporal Interaction (MTI) module that uses multi-scale Cycle operators to capture cross-channel temporal interactions. Result: In extensive experiments on NTU-RGB+D 60, NTU-RGB+D 120, NW-UCLA, and UAV-Human, TSkel-Mamba achieves state-of-the-art performance while keeping inference time low. Conclusion: TSkel-Mamba combines the spatial modeling strength of the Transformer with the temporal modeling efficiency of Mamba, and the TDM block improves the capture of cross-channel temporal dependencies, giving an efficient and strong solution for skeleton-based action recognition. Abstract: Skeleton-based action recognition has garnered significant attention in the computer vision community. Inspired by the recent success of the selective state-space model (SSM) Mamba in modeling 1D temporal sequences, we propose TSkel-Mamba, a hybrid Transformer-Mamba framework that effectively captures both spatial and temporal dynamics. In particular, our approach leverages Spatial Transformer for spatial feature learning while utilizing Mamba for temporal modeling. Mamba, however, employs separate SSM blocks for individual channels, which inherently limits its ability to model inter-channel dependencies. To better adapt Mamba for skeleton data and enhance Mamba's ability to model temporal dependencies, we introduce a Temporal Dynamic Modeling (TDM) block, which is a versatile plug-and-play component that integrates a novel Multi-scale Temporal Interaction (MTI) module. The MTI module employs multi-scale Cycle operators to capture cross-channel temporal interactions, a critical factor in action recognition. Extensive experiments on NTU-RGB+D 60, NTU-RGB+D 120, NW-UCLA and UAV-Human datasets demonstrate that TSkel-Mamba achieves state-of-the-art performance while maintaining low inference time, making it both efficient and highly effective.

[91] SSA3D: Text-Conditioned Assisted Self-Supervised Framework for Automatic Dental Abutment Design

Mianjie Zheng,Xinquan Yang,Along He,Xuguang Li,Feilie Zhong,Xuefen Liu,Kun Tang,Zhicheng Zhang,Linlin Shen

Main category: cs.CV

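TL;DR: This paper proposes SS$A^3$D, a self-supervised assisted framework for automatic abutment design. Its dual-branch structure and text-conditioned prompt module substantially improve the accuracy and efficiency of automated dental implant abutment design.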

Details Motivation: 由于缺乏大规模标注数据集,基于AI的自动基台设计研究受限;传统自监督学习需预训练和微调,计算成本高、耗时长。 Method: 采用双分支架构:重建分支通过恢复被掩码的口扫数据学习结构信息,并将知识迁移至回归分支;回归分支在监督学习下直接预测基台参数,省去预训练与微调过程;引入文本条件提示(TCP)模块融合临床信息以引导网络关注关键区域并约束预测。 Result: 实验表明,SS$A^3$D比传统自监督方法节省一半训练时间且精度更高,在所收集的数据集上达到最优性能。 Conclusion: SS$A^3$D有效解决了数据稀缺和训练效率低的问题,显著提升了自动化基台设计的准确性和效率,具有良好的临床应用前景。 Abstract: Abutment design is a critical step in dental implant restoration. However, manual design involves tedious measurement and fitting, and research on automating this process with AI is limited, due to the unavailability of large annotated datasets. Although self-supervised learning (SSL) can alleviate data scarcity, its need for pre-training and fine-tuning results in high computational costs and long training times. In this paper, we propose a Self-supervised assisted automatic abutment design framework (SS$A^3$D), which employs a dual-branch architecture with a reconstruction branch and a regression branch. The reconstruction branch learns to restore masked intraoral scan data and transfers the learned structural information to the regression branch. The regression branch then predicts the abutment parameters under supervised learning, which eliminates the separate pre-training and fine-tuning process. We also design a Text-Conditioned Prompt (TCP) module to incorporate clinical information (such as implant location, system, and series) into SS$A^3$D. This guides the network to focus on relevant regions and constrains the parameter predictions. Extensive experiments on a collected dataset show that SS$A^3$D saves half of the training time and achieves higher accuracy than traditional SSL methods. It also achieves state-of-the-art performance compared to other methods, significantly improving the accuracy and efficiency of automated abutment design.

[92] On Geometric Understanding and Learned Data Priors in VGGT

Jelena Bratulić,Sudhanshu Mittal,Thomas Brox,Christian Rupprecht

Main category: cs.CV

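TL;DR: This paper investigates whether the Visual Geometry Grounded Transformer (VGGT) implicitly learns geometric structure despite being trained without explicit geometric constraints. It finds that VGGT performs correspondence matching in its global attention layers and encodes epipolar geometry, while also relying on learned data priors.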

Details Motivation: 探究VGGT这类3D基础模型在单步前馈中是否真正理解几何结构,还是仅依赖外观先验。 Method: 通过探测中间特征、分析注意力模式、进行干预实验,并使用空间掩码和扰动测试其对遮挡、外观变化等的鲁棒性。 Result: 发现VGGT在全局注意力层隐式执行对应匹配并编码极线几何,且其性能依赖于学习到的数据先验。 Conclusion: VGGT虽无显式几何监督,但仍能内部建模几何结构,同时结合数据驱动先验实现鲁棒场景理解。 Abstract: The Visual Geometry Grounded Transformer (VGGT) is a 3D foundation model that infers camera geometry and scene structure in a single feed-forward pass. Trained in a supervised, single-step fashion on large datasets, VGGT raises a key question: does it build upon geometric concepts like traditional multi-view methods, or does it rely primarily on learned appearance-based data-driven priors? In this work, we conduct a systematic analysis of VGGT's internal mechanisms to uncover whether geometric understanding emerges within its representations. By probing intermediate features, analyzing attention patterns, and performing interventions, we examine how the model implements its functionality. Our findings reveal that VGGT implicitly performs correspondence matching within its global attention layers and encodes epipolar geometry, despite being trained without explicit geometric constraints. We further investigate VGGT's dependence on its learned data priors. Using spatial input masking and perturbation experiments, we assess its robustness to occlusions, appearance variations, and camera configurations, comparing it with classical multi-stage pipelines. Together, these insights highlight how VGGT internalizes geometric structure while using learned data-driven priors.

[93] Reconstruction as a Bridge for Event-Based Visual Question Answering

Hanyue Lou,Jiayi Zhou,Yang Zhang,Boyu Li,Yi Wang,Guangnan Ye,Boxin Shi

Main category: cs.CV

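TL;DR: The paper proposes two reconstruction-based methods, FRT and ART, that connect event camera data to multimodal large language models, and introduces EvQA, the first real-world benchmark for event-based question answering. Experiments show strong performance under challenging visual conditions.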

Details Motivation: A key challenge is making event data compatible with frame-based multimodal large language models while preserving its unique advantages. Method: Two methods are proposed, Frame-based Reconstruction and Tokenization (FRT) and Adaptive Reconstruction and Tokenization (ART). ART exploits event sparsity for efficient reconstruction, and reconstructed images serve as the bridge between event data and MLLMs. Result: The methods achieve state-of-the-art performance on the new EvQA benchmark, confirming their effectiveness for event-driven multimodal understanding. Conclusion: Reconstruction is an effective bridge between event data and multimodal large language models, and the proposed methods show the potential of event-based vision for complex scene understanding. Abstract: Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.

[94] Super-Resolved Canopy Height Mapping from Sentinel-2 Time Series Using LiDAR HD Reference Data across Metropolitan France

Ekaterina Kalinicheva,Florian Helen,Stéphane Mermoz,Florian Mouret,Milena Planells

Main category: cs.CV

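TL;DR: This paper proposes THREASURE-Net, an end-to-end deep learning framework for tree height regression and super-resolution from Sentinel-2 time series and LiDAR-derived height data. It produces high-precision annual canopy height maps without pretrained models or high-resolution optical imagery.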

Details Motivation: 细尺度森林监测对理解树冠结构及其动态至关重要,而现有方法在成本和精度上存在局限,因此需要一种可扩展且经济高效的方法来利用免费卫星数据实现精确监测。 Method: 提出THREASURE-Net模型,利用Sentinel-2时间序列数据和LiDAR HD数据提取的多分辨率参考高度信息进行训练,通过端到端方式直接学习树高特征,不依赖预训练模型或高分辨率参考影像,实现2.5 m、5 m和10 m分辨率的年树高制图。 Result: 模型在2.5 m、5 m和10 m分辨率下的平均绝对误差分别为2.62 m、2.72 m和2.88 m,性能优于现有的基于Sentinel数据的最先进方法,并与基于高分辨率影像的方法相当。 Conclusion: THREASURE-Net能够仅利用免费卫星数据实现温带森林结构的可扩展、低成本精细监测,具有广泛的应用潜力。 Abstract: Fine-scale forest monitoring is essential for understanding canopy structure and its dynamics, which are key indicators of carbon stocks, biodiversity, and forest health. Deep learning is particularly effective for this task, as it integrates spectral, temporal, and spatial signals that jointly reflect the canopy structure. To address this need, we introduce THREASURE-Net, a novel end-to-end framework for Tree Height Regression And Super-Resolution. The model is trained on Sentinel-2 time series using reference height metrics derived from LiDAR HD data at multiple spatial resolutions over Metropolitan France to produce annual height maps. We evaluate three model variants, producing tree-height predictions at 2.5 m, 5 m, and 10 m resolution. THREASURE-Net does not rely on any pretrained model nor on reference very high resolution optical imagery to train its super-resolution module; instead, it learns solely from LiDAR-derived height information. Our approach outperforms existing state-of-the-art methods based on Sentinel data and is competitive with methods based on very high resolution imagery. It can be deployed to generate high-precision annual canopy-height maps, achieving mean absolute errors of 2.62 m, 2.72 m, and 2.88 m at 2.5 m, 5 m, and 10 m resolution, respectively. These results highlight the potential of THREASURE-Net for scalable and cost-effective structural monitoring of temperate forests using only freely available satellite data. The source code for THREASURE-Net is available at: https://github.com/Global-Earth-Observation/threasure-net.

[95] Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models

Hossein Shahabadi,Niki Sepasian,Arash Marioriyad,Ali Sharifi-Zarchi,Mahdieh Soleymani Baghshah

Main category: cs.CV

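TL;DR: This study systematically evaluates six text-to-image (T2I) models on compositional alignment, covering objects, attributes, and spatial relations. Infinity-8B performs best overall, Infinity-2B outperforms larger diffusion models in several categories, and SDXL and PixArt-α show clear weaknesses on attribute and spatial tasks. It is the first comprehensive comparison of Visual Autoregressive (VAR) and diffusion models and provides unified baselines for future T2I development.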

Details Motivation: 现代T2I模型在实现文本描述与生成图像之间的组合对齐(如对象、属性、空间关系)方面仍面临挑战。尽管基于扩散的模型已被广泛研究,新兴的视觉自回归(VAR)模型的组合行为尚未充分探索。因此,亟需系统性评估不同架构在复杂语义理解任务中的表现差异。 Method: 研究在完整的T2I-CompBench++和GenEval两个评测套件上,对六种不同的T2I模型(SDXL、PixArt-α、Flux-Dev、Flux-Schnell、Infinity-2B、Infinity-8B)进行了基准测试,评估其在颜色与属性绑定、空间关系、数量理解以及多对象复杂提示下的组合对齐能力。 Result: 实验结果表明,Infinity-8B在整体组合对齐性能上表现最强;Infinity-2B虽规模较小,但在多个类别中表现媲美甚至超过更大的扩散模型;而SDXL和PixArt-α在属性敏感和空间关系任务中持续表现较弱。 Conclusion: 该研究首次系统比较了VAR与扩散模型在组合对齐方面的表现,揭示了VAR模型在效率与性能间的良好权衡,并建立了未来T2I模型发展的统一基准。 Abstract: Achieving compositional alignment between textual descriptions and generated images - covering objects, attributes, and spatial relationships - remains a core challenge for modern text-to-image (T2I) models. Although diffusion-based architectures have been widely studied, the compositional behavior of emerging Visual Autoregressive (VAR) models is still largely unexamined. We benchmark six diverse T2I systems - SDXL, PixArt-$α$, Flux-Dev, Flux-Schnell, Infinity-2B, and Infinity-8B - across the full T2I-CompBench++ and GenEval suites, evaluating alignment in color and attribute binding, spatial relations, numeracy, and complex multi-object prompts. Across both benchmarks, Infinity-8B achieves the strongest overall compositional alignment, while Infinity-2B also matches or exceeds larger diffusion models in several categories, highlighting favorable efficiency-performance trade-offs. In contrast, SDXL and PixArt-$α$ show persistent weaknesses in attribute-sensitive and spatial tasks. These results provide the first systematic comparison of VAR and diffusion approaches to compositional alignment and establish unified baselines for the future development of the T2I model.

[96] SSL-MedSAM2: A Semi-supervised Medical Image Segmentation Framework Powered by Few-shot Learning of SAM2

Zhendi Gong,Xin Chen

Main category: cs.CV

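TL;DR: This paper proposes SSL-MedSAM2, a semi-supervised medical image segmentation framework that combines a training-free few-shot branch with an iterative fully-supervised branch and substantially improves performance when annotated data is limited.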

Details Motivation: Existing deep learning models depend on large annotated datasets for fully-supervised learning, but medical image annotation is time-consuming and expensive, which limits clinical use, so methods that reduce labeling cost are needed. Method: SSL-MedSAM2 uses the pretrained foundation model SAM2 to build a training-free few-shot branch (TFFS-MedSAM2) that generates pseudo labels, and an iterative fully-supervised branch based on nnUNet (FSL-nnUNet) that refines them. Result: On the MICCAI2025 CARE-LiSeg challenge dataset, the average Dice scores on the GED4 and T1 MRI test sets are 0.9710 and 0.9648, with Hausdorff distances of 20.07 and 21.97, outperforming other methods. Conclusion: SSL-MedSAM2 reduces the dependence on annotated data, performs well on liver segmentation, and shows promise for clinical application. Abstract: Despite the success of deep learning based models in medical image segmentation, most state-of-the-art (SOTA) methods perform fully-supervised learning, which commonly relies on large-scale annotated training datasets. However, medical image annotation is highly time-consuming, hindering its clinical applications. Semi-supervised learning (SSL) has emerged as an appealing strategy for training with limited annotations, largely reducing the labelling cost. We propose a novel SSL framework SSL-MedSAM2, which contains a training-free few-shot learning branch TFFS-MedSAM2 based on the pretrained large foundation model Segment Anything Model 2 (SAM2) for pseudo label generation, and an iterative fully-supervised learning branch FSL-nnUNet based on nnUNet for pseudo label refinement. The results on MICCAI2025 challenge CARE-LiSeg (Liver Segmentation) demonstrate an outstanding performance of SSL-MedSAM2 among other methods. The average Dice scores on the test set in GED4 and T1 MRI are 0.9710 and 0.9648 respectively, and the Hausdorff distances are 20.07 and 21.97 respectively. The code is available via https://github.com/naisops/SSL-MedSAM2/tree/main.
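For readers unfamiliar with the Dice score reported above, here is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), on flattened binary masks. Treating two empty masks as a perfect match is an assumption of this sketch; challenge evaluation code may handle that case differently.

```python
def dice_score(pred, target):
    """Dice overlap between two flat binary masks (values 0 or 1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# One overlapping voxel out of two predicted and two labeled voxels.
print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))
```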

[97] 3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation

Zhiguo Lu,Jianwen Lou,Mingjun Ma,Hairong Jin,Youyi Zheng,Kun Zhou

Main category: cs.CV

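TL;DR: 3DTeethSAM adapts SAM2 to 3D teeth segmentation through 2D-3D projection and lightweight learnable modules, reaching 91.90% IoU on the 3DTeethSeg benchmark, a new state of the art.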

Details Motivation: 3D牙齿分割在数字牙科中至关重要,但因真实牙列复杂,现有方法难以兼顾精度与效率,需利用强大基础模型并解决其在3D场景下的提示依赖与类别无关缺陷。 Method: 将SAM2应用于多视角渲染的2D牙齿图像,通过2D-3D投影重建分割结果;引入可学习的提示嵌入生成器、掩码优化器和分类器,并在图像编码器中加入可变形全局注意力插件(DGAP)以提升精度与训练速度。 Result: 在3DTeethSeg基准上实现了91.90%的IoU,显著优于现有方法,分割精度和训练效率均得到提升。 Conclusion: 3DTeethSAM有效适配SAM2至3D牙齿分割任务,结合轻量模块和DGAP插件,在高分辨率数据上达到SOTA性能,推动了数字牙科自动化发展。 Abstract: 3D teeth segmentation, involving the localization of tooth instances and their semantic categorization in 3D dental models, is a critical yet challenging task in digital dentistry due to the complexity of real-world dentition. In this paper, we propose 3DTeethSAM, an adaptation of the Segment Anything Model 2 (SAM2) for 3D teeth segmentation. SAM2 is a pretrained foundation model for image and video segmentation, demonstrating a strong backbone in various downstream scenarios. To adapt SAM2 for 3D teeth data, we render images of 3D teeth models from predefined views, apply SAM2 for 2D segmentation, and reconstruct 3D results using 2D-3D projections. Since SAM2's performance depends on input prompts and its initial outputs often have deficiencies, and given its class-agnostic nature, we introduce three light-weight learnable modules: (1) a prompt embedding generator to derive prompt embeddings from image embeddings for accurate mask decoding, (2) a mask refiner to enhance SAM2's initial segmentation results, and (3) a mask classifier to categorize the generated masks. Additionally, we incorporate Deformable Global Attention Plugins (DGAP) into SAM2's image encoder. The DGAP enhances both the segmentation accuracy and the speed of the training process. Our method has been validated on the 3DTeethSeg benchmark, achieving an IoU of 91.90% on high-resolution 3D teeth meshes, establishing a new state-of-the-art in the field.

[98] Multi-temporal Calving Front Segmentation

Marcel Dreier,Nora Gourmelon,Dakota Pyles,Fei Wu,Matthias Braun,Thorsten Seehaus,Andreas Maier,Vincent Christlein

Main category: cs.CV

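TL;DR: The paper proposes processing satellite image time series in parallel and exchanging temporal information between feature maps to improve calving front segmentation, reaching new state-of-the-art performance on the CaFFe dataset.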

Details Motivation: Existing deep learning models struggle to delineate the fronts of marine-terminating glaciers accurately under seasonal conditions such as ice melange and snow cover, so more robust models are needed. Method: Multiple SAR frames of the same glacier are fed to the model in parallel, and temporal information is shared between the corresponding feature maps to stabilize each prediction. The approach is integrated into the current state-of-the-art model Tyrion. Result: On the CaFFe benchmark, the method achieves a Mean Distance Error of 184.4 m and a mean Intersection over Union of 83.6%, a new state of the art. Conclusion: Exchanging temporal information markedly improves the accuracy and stability of calving front extraction under complex seasonal conditions. Abstract: The calving fronts of marine-terminating glaciers undergo constant changes. These changes significantly affect the glacier's mass and dynamics, demanding continuous monitoring. To address this need, deep learning models were developed that can automatically delineate the calving front in Synthetic Aperture Radar imagery. However, these models often struggle to correctly classify areas affected by seasonal conditions such as ice melange or snow-covered surfaces. To address this issue, we propose to process multiple frames from a satellite image time series of the same glacier in parallel and exchange temporal information between the corresponding feature maps to stabilize each prediction. We integrate our approach into the current state-of-the-art architecture Tyrion and accomplish a new state-of-the-art performance on the CaFFe benchmark dataset. In particular, we achieve a Mean Distance Error of 184.4 m and a mean Intersection over Union of 83.6%.

[99] Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis

Valentina Lilova,Toyesh Chakravorty,Julian I. Bibo,Emma Boccaletti,Brandon Li,Lívia Baxová,Cees G. M. Snoek,Mohammadreza Salehi

Main category: cs.CV

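TL;DR: The paper introduces a finetuning-free benchmark that directly evaluates the quality of dense visual features of foundation models across multiple views. It extends the Hummingbird framework to 3D scene understanding and evaluates eight state-of-the-art models on the MVImgNet dataset.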

Details Motivation: 现有3D空间理解评估依赖于下游任务微调,难以分离预训练编码器本身的内在3D推理能力,因此需要一种更直接、免微调的评估方式来衡量其真实性能。 Method: 基于Hummingbird框架,在3D Multi-View ImageNet(MVImgNet)数据集上构建无需微调的上下文3D场景理解基准,通过给定特定角度的对象图像(keys)来分割新视角图像(queries),并根据视角差异分为四个难度等级进行评估。 Result: 评测了8个最先进的基础模型,发现DINO-based编码器在大视角变化下仍具竞争力,而VGGT等3D感知模型需专门的多视图调整才能表现良好。 Conclusion: 所提基准能有效评估基础模型在无微调情况下的密集3D特征质量,揭示了当前模型在跨视角泛化中的优劣,为未来3D感知模型设计提供了指导。 Abstract: Benchmarking 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream finetuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pretrained encoders. In this work, we introduce a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to the 3D Multi-View ImageNet (MVImgNet) dataset. Given a set of images from objects in specific angles (keys), we benchmark the performance of segmenting novel views (queries) and report the scores in 4 categories of easy, medium, hard, and extreme based on the key-query view contrast. We benchmark 8 state-of-the-art foundation models and show DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require dedicated multi-view adjustments. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval .
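The key-query setup in the abstract amounts to labeling query patches by retrieving similar patches from annotated key views, with the encoder kept frozen. The sketch below uses a plain k-nearest-neighbor majority vote over cosine similarity. This is a simplification chosen for illustration and an assumption on our part; the benchmark's actual retrieval and label aggregation may differ.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_segment(query_feats, key_feats, key_labels, k=3):
    """Label each query patch by majority vote among its k most similar key patches."""
    preds = []
    for q in query_feats:
        ranked = sorted(range(len(key_feats)),
                        key=lambda i: -cosine(q, key_feats[i]))
        votes = Counter(key_labels[i] for i in ranked[:k])
        preds.append(votes.most_common(1)[0][0])
    return preds


# Toy patch features: in practice these would come from a frozen encoder.
key_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
key_labels = ["object", "object", "background", "background"]
query_feats = [[1.0, 0.05], [0.05, 1.0]]
print(knn_segment(query_feats, key_feats, key_labels))
```

Since no parameters are trained, segmentation quality in a setup like this depends only on how well the frozen features stay consistent across viewpoints.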

[100] In-Context Learning for Seismic Data Processing

Fabian Fuchs,Mario Ruben Fernandez,Norman Ettrich,Janis Keuper

Main category: cs.CV

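TL;DR: This paper proposes ContextSeisNet, an in-context learning model for seismic demultiple processing. It uses spatially related neighboring common-depth-point gathers and their labels as a support set, learning task-specific behavior at inference time without retraining. On synthetic and field data it shows better lateral consistency and multiple removal than traditional Radon demultiple and a U-Net baseline, with higher data efficiency.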

Details Motivation: Existing deep learning methods for seismic processing give spatially inconsistent results and offer little user control, while traditional methods depend on manual parameter tuning and are sensitive to noise, so a method that is flexible, consistent, and efficient is needed. Method: ContextSeisNet uses an in-context learning framework in which spatially neighboring common-depth-point gathers and their labels serve as conditioning input (the support set). Predictions adapt at inference time without retraining, which improves lateral continuity and user control. Result: On synthetic data, ContextSeisNet outperforms the U-Net baseline and shows stronger spatial coherence. On field data, it shows better lateral consistency than both Radon demultiple and the U-Net, and improved near-offset performance and more complete multiple removal relative to the U-Net, while reaching comparable performance with 90% less training data. Conclusion: ContextSeisNet is a practical approach to seismic demultiple that delivers spatially consistent results and good user control, and it may extend to other seismic processing tasks. Abstract: Seismic processing transforms raw data into subsurface images essential for geophysical applications. Traditional methods face challenges such as noisy data and manual parameter tuning. Recently, deep learning approaches have offered alternative solutions to some of these problems. However, important challenges of existing deep learning approaches are spatially inconsistent results across neighboring seismic gathers and a lack of user control. We address these limitations by introducing ContextSeisNet, an in-context learning model, to seismic demultiple processing. Our approach conditions predictions on a support set of spatially related example pairs: neighboring common-depth point gathers from the same seismic line and their corresponding labels. This allows the model to learn task-specific processing behavior at inference time by observing how similar gathers should be processed, without any retraining. This method provides both flexibility through user-defined examples and improved lateral consistency across seismic lines. On synthetic data, ContextSeisNet outperforms a U-Net baseline quantitatively and demonstrates enhanced spatial coherence between neighboring gathers. On field data, our model achieves superior lateral consistency compared to both traditional Radon demultiple and the U-Net baseline. Relative to the U-Net, ContextSeisNet also delivers improved near-offset performance and more complete multiple removal. Notably, ContextSeisNet achieves comparable field data performance despite being trained on 90% less data, demonstrating substantial data efficiency. These results establish ContextSeisNet as a practical approach for spatially consistent seismic demultiple with potential applicability to other seismic processing tasks.

[101] Using GUI Agent for Electronic Design Automation

Chunyi Li,Longfei Li,Zicheng Zhang,Xiaohong Liu,Min Tang,Weisi Lin,Guangtao Zhai

Main category: cs.CV

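TL;DR: This paper presents the first systematic study of GUI agents for Electronic Design Automation (EDA) workflows. It releases a large-scale dataset, GUI-EDA, and builds a benchmark covering 30+ mainstream GUI agents, showing that existing methods still perform poorly on professional CAD tasks. It also proposes an EDA-specialized metric named EDAgent, equipped with a reflection mechanism, which for the first time outperforms Ph.D. students in Electrical Engineering on industrial CAD software and extends GUI agents to high-value engineering domains.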

Details Motivation: Existing GUI agents are evaluated mainly on office software. On professional CAD software with higher economic value, such as Electronic Design Automation (EDA) tools, they perform poorly and are far from replacing expert engineers, so a systematic study of GUI agents in EDA and its challenges is needed. Method: The authors build GUI-EDA, a large-scale dataset covering 5 CAD tools and 5 physical domains, with 2,000+ high-quality screenshot-answer-action pairs recorded by EDA scientists and engineers during real design work. On top of it they establish a comprehensive benchmark that evaluates 30+ mainstream GUI agents, and they propose EDAgent, an EDA-specialized metric with a reflection mechanism. Result: The benchmark shows that current GUI agents have limited performance on EDA tasks, which remain a major challenge. With its reflection mechanism, EDAgent achieves reliable performance on industrial CAD software and, for the first time, outperforms Ph.D. students in Electrical Engineering. Conclusion: The work extends GUI agents from generic office automation to specialized, high-value engineering design, demonstrates their potential in complex CAD environments, and offers a new route to improving EDA engineers' productivity. Abstract: Graphical User Interface (GUI) agents adopt an end-to-end paradigm that maps a screenshot to an action sequence, thereby automating repetitive tasks in virtual environments. However, existing GUI agents are evaluated almost exclusively on commodity software such as Microsoft Word and Excel. Professional Computer-Aided Design (CAD) suites promise an order-of-magnitude higher economic return, yet remain the weakest performance domain for existing agents and are still far from replacing expert Electronic-Design-Automation (EDA) engineers. We therefore present the first systematic study that deploys GUI agents for EDA workflows. Our contributions are: (1) a large-scale dataset named GUI-EDA, including 5 CAD tools and 5 physical domains, comprising 2,000+ high-quality screenshot-answer-action pairs recorded by EDA scientists and engineers during real-world component design; (2) a comprehensive benchmark that evaluates 30+ mainstream GUI agents, demonstrating that EDA tasks constitute a major, unsolved challenge; and (3) an EDA-specialized metric named EDAgent, equipped with a reflection mechanism that achieves reliable performance on industrial CAD software and, for the first time, outperforms Ph.D. students majored in Electrical Engineering. This work extends GUI agents from generic office automation to specialized, high-value engineering domains and offers a new avenue for advancing EDA productivity. The dataset will be released at: https://github.com/aiben-ch/GUI-EDA.

[102] Embodied Image Compression

Chunyi Li,Rui Qing,Jianbo Zhang,Yuan Tian,Xiangyang Zhu,Zicheng Zhang,Xiaohong Liu,Weisi Lin,Guangtao Zhai

Main category: cs.CV

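TL;DR: This paper introduces, for the first time, the scientific problem of Embodied Image Compression, which targets the communication constraints of Embodied AI in multi-agent systems and aims to keep task execution real-time at low bitrates. The authors build a standardized benchmark, EmbodiedComp, and show experimentally that existing Vision-Language-Action models degrade sharply below the "Embodied bitrate threshold".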

Details Motivation: As machine intelligence advances, the target of compression is shifting from task-specific models to embodied agents operating in real environments. Image compression designed for human perception or for a single task cannot meet the real-time and communication-efficiency needs of embodied agents in multi-agent systems, so compression schemes optimized for embodied agents are needed. Method: The paper formulates the Embodied Image Compression problem and builds EmbodiedComp, a standardized closed-loop benchmark for ultra-low bitrates. Experiments in simulated and real-world settings evaluate how well existing Vision-Language-Action models (VLAs) execute tasks from heavily compressed image input. Result: When images are compressed below the Embodied bitrate threshold, current state-of-the-art VLAs degrade sharply even on simple manipulation tasks and cannot complete them reliably, exposing the limits of existing methods in embodied settings. Conclusion: The study highlights the need for image compression designed specifically for embodied agents. EmbodiedComp is expected to drive compression algorithms for Embodied AI and speed up real-world deployment. Abstract: Image Compression for Machines (ICM) has emerged as a pivotal research direction in the field of visual data compression. However, with the rapid evolution of machine intelligence, the target of compression has shifted from task-specific virtual models to Embodied agents operating in real-world environments. To address the communication constraints of Embodied AI in multi-agent systems and ensure real-time task execution, this paper introduces, for the first time, the scientific problem of Embodied Image Compression. We establish a standardized benchmark, EmbodiedComp, to facilitate systematic evaluation under ultra-low bitrate conditions in a closed-loop setting. Through extensive empirical studies in both simulated and real-world settings, we demonstrate that existing Vision-Language-Action models (VLAs) fail to reliably perform even simple manipulation tasks when compressed below the Embodied bitrate threshold. We anticipate that EmbodiedComp will catalyze the development of domain-specific compression tailored for Embodied agents, thereby accelerating the Embodied AI deployment in the Real-world.

[103] Fast and Explicit: Slice-to-Volume Reconstruction via 3D Gaussian Primitives with Analytic Point Spread Function Modeling

Maik Dannecker,Steven Jia,Nil Stolt-Ansó,Nadine Girard,Guillaume Auzias,François Rousseau,Daniel Rueckert

Main category: cs.CV

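TL;DR: Proposes a fast slice-to-volume reconstruction method built on an explicit Gaussian-primitive representation. A closed-form analytical forward model enables efficient and accurate 3D reconstruction of fetal MRI, running 5-10x faster than existing implicit neural representation methods.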

Details Motivation: Implicit neural representations face a computational bottleneck in medical image reconstruction: approximating the point spread function requires expensive Monte Carlo sampling, which limits real-time clinical use. Method: The high-resolution 3D image is modeled as a field of anisotropic Gaussian primitives. Because Gaussians are closed under convolution, the forward model has a closed-form analytical solution, and the image acquisition process is represented exactly by adding covariances. Result: On neonatal and fetal data the method matches the reconstruction quality of state-of-the-art SVR frameworks while running 5-10x faster, often converging in under 30 seconds. Conclusion: The explicit Gaussian representation removes the computational bottleneck and paves the way for real-time fetal 3D MRI reconstruction in clinical routine. Abstract: Recovering high-fidelity 3D images from sparse or degraded 2D images is a fundamental challenge in medical imaging, with broad applications ranging from 3D ultrasound reconstruction to MRI super-resolution. In the context of fetal MRI, high-resolution 3D reconstruction of the brain from motion-corrupted low-resolution 2D acquisitions is a prerequisite for accurate neurodevelopmental diagnosis. While implicit neural representations (INRs) have recently established state-of-the-art performance in self-supervised slice-to-volume reconstruction (SVR), they suffer from a critical computational bottleneck: accurately modeling the image acquisition physics requires expensive stochastic Monte Carlo sampling to approximate the point spread function (PSF). In this work, we propose a shift from neural network based implicit representations to Gaussian based explicit representations. By parameterizing the HR 3D image volume as a field of anisotropic Gaussian primitives, we leverage the property of Gaussians being closed under convolution and thus derive a closed-form analytical solution for the forward model. This formulation reduces the previously intractable acquisition integral to an exact covariance addition ($\mathbf{\Sigma}_{obs} = \mathbf{\Sigma}_{HR} + \mathbf{\Sigma}_{PSF}$), effectively bypassing the need for compute-intensive stochastic sampling while ensuring exact gradient propagation. We demonstrate that our approach matches the reconstruction quality of self-supervised state-of-the-art SVR frameworks while delivering a 5$\times$--10$\times$ speed-up on neonatal and fetal data.
With convergence often reached in under 30 seconds, our framework paves the way towards translation into clinical routine of real-time fetal 3D MRI. Code will be public at https://github.com/m-dannecker/Gaussian-Primitives-for-Fast-SVR.
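
The covariance-addition identity in the abstract can be checked numerically. The sketch below is not the authors' code: the function names and the two example covariance matrices are assumptions chosen for illustration. It uses the fact that convolving two zero-mean Gaussian densities gives the density of the sum of two independent Gaussian variables, so the sample covariance of that sum should approach the sum of the two covariances.

```python
import numpy as np


def observed_covariance(sigma_hr, sigma_psf):
    """Covariance of a Gaussian primitive after convolution with a Gaussian PSF."""
    return sigma_hr + sigma_psf


def numeric_check(seed=0, n=200_000):
    """Compare the analytic sum with a sampled estimate (illustrative values only)."""
    rng = np.random.default_rng(seed)
    sigma_hr = np.diag([1.0, 0.5, 0.25])   # hypothetical anisotropic primitive
    sigma_psf = np.diag([0.1, 0.1, 2.0])   # hypothetical PSF, wide through-plane
    x = rng.multivariate_normal(np.zeros(3), sigma_hr, size=n)
    e = rng.multivariate_normal(np.zeros(3), sigma_psf, size=n)
    empirical = np.cov((x + e).T)          # covariance of the summed samples
    return empirical, observed_covariance(sigma_hr, sigma_psf)


if __name__ == "__main__":
    empirical, analytic = numeric_check()
    print(np.round(empirical, 2))
    print(analytic)
```

The analytic side is a single matrix addition, which is the reason no stochastic sampling is needed in the forward model; the sampling here exists only to verify the identity.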

[104] FactorPortrait: Controllable Portrait Animation via Disentangled Expression, Pose, and Viewpoint

Jiapeng Tang,Kai Li,Chengxiang Yin,Liuhao Ge,Fei Jiang,Jiu Xu,Matthias Nießner,Christian Häne,Timur Bagautdinov,Egor Zakharov,Peihong Guo

Main category: cs.CV

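TL;DR: FactorPortrait is a video-diffusion method for controllable portrait animation that generates lifelike results from disentangled control signals: facial expression, head movement, and camera viewpoint.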

Details Motivation: Existing methods fall short in disentangling and jointly modeling facial expression, head motion, and viewpoint control, which makes realistic, multi-view-consistent portrait animation difficult. Method: A pre-trained image encoder extracts disentangled facial-expression latents from the driving video, and a proposed expression controller injects them into the video diffusion transformer. Camera and head pose are controlled with Plücker ray maps and normal maps. A large-scale synthetic dataset is built for training. Result: Experiments show the method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency. Conclusion: FactorPortrait delivers high-quality, multi-view controllable portrait animation and advances diffusion-based dynamic portrait synthesis. Abstract: We introduce FactorPortrait, a video diffusion method for controllable portrait animation that enables lifelike synthesis from disentangled control signals of facial expressions, head movement, and camera viewpoints. Given a single portrait image, a driving video, and camera trajectories, our method animates the portrait by transferring facial expressions and head movements from the driving video while simultaneously enabling novel view synthesis from arbitrary viewpoints. We utilize a pre-trained image encoder to extract facial expression latents from the driving video as control signals for animation generation. Such latents implicitly capture nuanced facial expression dynamics with identity and pose information disentangled, and they are efficiently injected into the video diffusion transformer through our proposed expression controller. For camera and head pose control, we employ Plücker ray maps and normal maps rendered from 3D body mesh tracking. To train our model, we curate a large-scale synthetic dataset containing diverse combinations of camera viewpoints, head poses, and facial expression dynamics. Extensive experiments demonstrate that our method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency.

[105] Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation

Luca Cazzola,Ahed Alboody

Main category: cs.CV

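TL;DR: This paper proposes KineMIC, a transfer learning framework for few-shot action synthesis. It uses semantic correspondences in the text-encoding space to guide fine-tuning of a text-to-motion generative model, narrowing the domain gap between generalist text-to-motion models and Human Activity Recognition (HAR).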

Details Motivation: Large annotated motion datasets are expensive to acquire, and existing text-to-motion models focus on artistic motion, so they do not meet HAR's need for kinematically precise, class-discriminative actions. A method that bridges this domain gap is needed. Method: KineMIC uses a kinetic mining strategy based on CLIP text embeddings to establish semantic correspondences between sparse HAR labels and the source T2M data. These correspondences provide soft supervision for fine-tuning a T2M diffusion model, turning it into a specialized few-shot Action-to-Motion generator. Result: With only 10 samples per class, KineMIC generates significantly more coherent motions on a subset of NTU RGB+D 120 and, used as a data augmentation source, improves classification accuracy by 23.1 percentage points. Conclusion: KineMIC addresses the domain gap between generalist T2M models and HAR and shows that semantically guided kinematic distillation is a feasible and effective route to few-shot action generation. Abstract: The acquisition cost for large, annotated motion datasets remains a critical bottleneck for skeletal-based Human Activity Recognition (HAR). Although Text-to-Motion (T2M) generative models offer a compelling, scalable source of synthetic data, their training objectives, which emphasize general artistic motion, and dataset structures fundamentally differ from HAR's requirements for kinematically precise, class-discriminative actions. This disparity creates a significant domain gap, making generalist T2M models ill-equipped for generating motions suitable for HAR classifiers. To address this challenge, we propose KineMIC (Kinetic Mining In Context), a transfer learning framework for few-shot action synthesis. KineMIC adapts a T2M diffusion model to an HAR domain by hypothesizing that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. This process guides fine-tuning, transforming the generalist T2M backbone into a specialized few-shot Action-to-Motion generator. We validate KineMIC using HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, randomly selecting just 10 samples per action class. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers a +23.1% accuracy points improvement.
Animated illustrations and supplementary materials are available at (https://lucazzola.github.io/publications/kinemic).
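
The correspondence step described in the abstract can be sketched as a nearest-neighbor search in a text-embedding space. This is a minimal illustration under stated assumptions, not KineMIC's actual procedure: the function name `mine_correspondences`, the `top_k` parameter, and the toy 2-D vectors are hypothetical, and the embeddings are assumed to have been computed beforehand by a CLIP text encoder.

```python
import numpy as np


def mine_correspondences(label_emb, caption_emb, top_k=2):
    """For each HAR label embedding, return the indices of the top_k most
    similar source-caption embeddings under cosine similarity."""
    a = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    b = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    sim = a @ b.T                                # (num_labels, num_captions)
    idx = np.argsort(-sim, axis=1)[:, :top_k]    # most similar first
    return idx, sim


if __name__ == "__main__":
    # Toy stand-ins for precomputed text embeddings (assumption, not real CLIP output).
    labels = np.array([[1.0, 0.0], [0.0, 1.0]])
    captions = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]])
    idx, sim = mine_correspondences(labels, captions)
    print(idx)
```

The similarity scores in `sim` could then weight the matched source motions as soft supervision during fine-tuning; how the paper actually uses them is not specified in the abstract.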

[106] Cross-modal Context-aware Learning for Visual Prompt Guided Multimodal Image Understanding in Remote Sensing

Xu Zhang,Jiabin Fang,Zhuoming Ding,Jin Yuan,Xuan Liu,Qianjun Zhang,Zhiyong Li

Main category: cs.CV

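TL;DR: Proposes CLV-Net, which uses visual prompts and context-aware modeling to improve multimodal understanding of remote sensing images, producing precise segmentation and captions aligned with user intent.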

Details Motivation: With only simple text prompts, existing methods struggle to focus on the regions a user cares about, and in complex aerial images containing many similar-looking objects they have difficulty recognizing objects accurately and exploiting inter-object relationships. Method: CLV-Net takes a visual prompt (a bounding box) to guide the model. A Context-Aware Mask Decoder models inter-object relationships, and a Semantic and Relationship Alignment module optimizes the output with a Cross-modal Semantic Consistency Loss and a Relationship Consistency Loss. Result: Experiments on two benchmark datasets show that CLV-Net outperforms existing methods, achieves state-of-the-art segmentation and captioning, and captures user intent more accurately. Conclusion: CLV-Net combines visual prompting with contextual relationship modeling to improve the accuracy and intent alignment of multimodal understanding in remote sensing imagery, offering a new approach for vision-language tasks in complex scenes. Abstract: Recent advances in image understanding have enabled methods that leverage large language models for multimodal reasoning in remote sensing. However, existing approaches still struggle to steer models to the user-relevant regions when only simple, generic text prompts are available. Moreover, in large-scale aerial imagery many objects exhibit highly similar visual appearances and carry rich inter-object relationships, which further complicates accurate recognition. To address these challenges, we propose Cross-modal Context-aware Learning for Visual Prompt-Guided Multimodal Image Understanding (CLV-Net). CLV-Net lets users supply a simple visual cue, a bounding box, to indicate a region of interest, and uses that cue to guide the model to generate correlated segmentation masks and captions that faithfully reflect user intent. Central to our design is a Context-Aware Mask Decoder that models and integrates inter-object relationships to strengthen target representations and improve mask quality. In addition, we introduce a Semantic and Relationship Alignment module: a Cross-modal Semantic Consistency Loss enhances fine-grained discrimination among visually similar targets, while a Relationship Consistency Loss enforces alignment between textual relations and visual interactions. Comprehensive experiments on two benchmark datasets show that CLV-Net outperforms existing methods and establishes new state-of-the-art results. The model effectively captures user intent and produces precise, intention-aligned multimodal outputs.

[107] Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection

Qiushi Guo

Main category: cs.CV

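TL;DR: Proposes Depth Copy Paste, a depth-aware data augmentation method that combines semantic, visual, and depth information to generate more realistic and diverse training samples for face detection.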

Details Motivation: Traditional copy-paste augmentation produces unrealistic composites because of inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics, which limits the robustness of face detectors. Method: BLIP and CLIP jointly assess semantic and visual coherence between foreground and background, SAM3 provides precise segmentation, and Depth-Anything extracts the visible, non-occluded person regions. A depth-guided sliding-window mechanism then selects geometrically plausible paste locations. Result: Experiments show that the composites are more realistic and diverse, and that the method yields significant improvements on downstream face detection over traditional copy-paste and depth-free augmentation. Conclusion: By combining multimodal and depth-aware strategies, Depth Copy Paste improves augmentation quality and strengthens face detection in complex scenes. Abstract: Data augmentation is crucial for improving the robustness of face detection systems, especially under challenging conditions such as occlusion, illumination variation, and complex environments. Traditional copy paste augmentation often produces unrealistic composites due to inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. To address these limitations, we propose Depth Copy Paste, a multimodal and depth aware augmentation framework that generates diverse and physically consistent face detection training samples by copying full body person instances and pasting them into semantically compatible scenes. Our approach first employs BLIP and CLIP to jointly assess semantic and visual coherence, enabling automatic retrieval of the most suitable background images for the given foreground person. To ensure high quality foreground masks that preserve facial details, we integrate SAM3 for precise segmentation and Depth-Anything to extract only the non occluded visible person regions, preventing corrupted facial textures from being used in augmentation. For geometric realism, we introduce a depth guided sliding window placement mechanism that searches over the background depth map to identify paste locations with optimal depth continuity and scale alignment. The resulting composites exhibit natural depth relationships and improved visual plausibility.
Extensive experiments show that Depth Copy Paste provides more diverse and realistic training data, leading to significant performance improvements in downstream face detection tasks compared with traditional copy paste and depth free augmentation methods.
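
A depth-guided sliding-window search of the kind the abstract describes might look like the sketch below. The scoring rule is an assumption made for illustration (the abstract does not give one): each window is scored by how far its mean depth is from the foreground person's depth plus a penalty on depth variation inside the window. The function name and all parameters are hypothetical.

```python
import numpy as np


def best_paste_location(depth, win_h, win_w, target_depth,
                        stride=1, smooth_weight=1.0):
    """Slide a window over a background depth map and return the top-left
    corner (row, col) with the lowest score, plus that score.

    score = |mean window depth - target depth| + smooth_weight * depth std
    (an assumed stand-in for "depth continuity and scale alignment").
    """
    best, best_score = None, np.inf
    height, width = depth.shape
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            patch = depth[y:y + win_h, x:x + win_w]
            score = abs(patch.mean() - target_depth) + smooth_weight * patch.std()
            if score < best_score:
                best, best_score = (y, x), score
    return best, best_score


if __name__ == "__main__":
    depth = np.full((6, 6), 10.0)   # far background
    depth[2:4, 3:5] = 2.0           # a flat region near the camera
    print(best_paste_location(depth, 2, 2, target_depth=2.0))
```

An exhaustive scan is fine for a sketch; on full-resolution depth maps a larger `stride` or an integral-image formulation would keep the search cheap.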

[108] Text images processing system using artificial intelligence models

Aya Kaysan Bahjat

Main category: cs.CV

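TL;DR: Presents a text image classification device that recognizes the text in an image and assigns the image to one of four categories: Invoice, Form, Letter, or Report. It supports a gallery mode and a live camera mode, detects text elements with DBNet++, classifies them with a BART model, and shows results in an interface built with Python and PyQt5. It reaches a recognition rate of about 94.62% on the Total-Text dataset.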

Details Motivation: Text image classification under difficult imaging conditions (changing light, random text orientation, curved or partly covered text, low resolution) poses practical challenges, so the goal is a robust and widely applicable classification device. Method: The system has four steps: image acquisition and preprocessing, text element detection with DBNet++, classification of the text content with a BART model, and presentation of results through a user interface written in Python and PyQt5. It supports a gallery mode (reading files from storage devices) and a live mode (camera input). Result: Tested for ten hours on the Total-Text dataset, the system reached a text recognition rate of about 94.62%, supporting its effectiveness under difficult real-world conditions. Conclusion: The device copes with a range of real-world disturbances and classifies text images from mixed sources with high accuracy, giving it practical value and deployment potential. Abstract: This is to present a text image classifier device that identifies textual content in images and then categorizes each image into one of four predefined categories, including Invoice, Form, Letter, or Report. The device supports a gallery mode, in which users browse files on flash disks, hard disk drives, or microSD cards, and a live mode which renders feeds of cameras connected to it. Its design is specifically aimed at addressing pragmatic challenges, such as changing light, random orientation, curvature or partial coverage of text, low resolution, and slightly visible text. The steps of the processing process are divided into four steps: image acquisition and preprocessing, textual elements detection with the help of DBNet++ (Differentiable Binarization Network Plus) model, BART (Bidirectional Auto-Regressive Transformers) model that classifies detected textual elements, and the presentation of the results through a user interface written in Python and PyQt5. All the stages are connected in such a way that they form a smooth workflow. The system achieved a text recognition rate of about 94.62% when tested over ten hours on the mentioned Total-Text dataset, that includes high resolution images, created so as to represent a wide range of problematic conditions. These experimental results support the effectiveness of the suggested methodology to practice, mixed-source text categorization, even in uncontrolled imaging conditions.

[109] EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing

Wei Chow,Linfeng Li,Lingdong Kong,Zefeng Li,Qi Xu,Hang Song,Tian Ye,Xian Wang,Jinbin Bai,Shilin Xu,Xiangtai Li,Junting Pan,Shaoteng Liu,Ran Zhou,Tianshu Yang,Songhua Liu

Main category: cs.CV

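TL;DR: This paper proposes EditMGT, an image editing framework built on Masked Generative Transformers (MGTs). It exploits the localized decoding of MGTs to make precise local edits and avoid the unintended changes to non-target regions that are common with diffusion models. Attention maps localize the edit region, and a region-hold sampling strategy suppresses unrelated modifications. A pre-trained MGT is adapted into an editing model without extra parameters, giving faster and higher-quality editing.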

Details Motivation: In diffusion-based image editing, global denoising causes unintended changes to non-target regions, and there is no native support for local edits. A paradigm with explicit local control is needed to improve editing precision and efficiency. Method: EditMGT localizes edit regions using the MGT's multi-layer cross-attention maps, with a multi-layer attention consolidation scheme for fine-grained localization. Region-hold sampling restricts token flipping in low-attention areas to confine modifications. The model is trained on CrispEdit-2M, a high-resolution dataset built by the authors, and a pre-trained text-to-image MGT is converted into an editing model through attention injection. Result: Across four standard benchmarks, EditMGT achieves similarity performance with fewer than 1B parameters while editing 6 times faster, and it improves style change and style transfer by 3.6% and 17.6%, respectively. Conclusion: MGTs offer a localized modeling advantage over diffusion models for image editing. With attention-guided localization and sampling, EditMGT delivers efficient, precise edits, demonstrating the potential of MGTs for image editing and suggesting a direction for future work. Abstract: Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative localization signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories.
Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similarity performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.
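
The idea behind region-hold sampling can be illustrated with a toy token grid. The sketch below is an assumption-laden simplification, not EditMGT's implementation: consolidation is taken to be a plain average over layers followed by min-max normalization, and the hold rule is a fixed threshold. The function names, the threshold value, and the example arrays are all hypothetical.

```python
import numpy as np


def consolidate_attention(layer_maps, eps=1e-8):
    """Average cross-attention maps over layers (axis 0) and rescale to [0, 1].
    Averaging is an assumed stand-in for the paper's consolidation scheme."""
    mean_map = layer_maps.mean(axis=0)
    return (mean_map - mean_map.min()) / (mean_map.max() - mean_map.min() + eps)


def region_hold_update(current_tokens, proposed_tokens, attention, threshold=0.5):
    """Accept proposed tokens only where attention is high; hold the rest."""
    editable = attention >= threshold
    return np.where(editable, proposed_tokens, current_tokens), editable


if __name__ == "__main__":
    layers = np.array([[[0.0, 2.0], [4.0, 0.0]],
                       [[0.0, 4.0], [4.0, 0.0]]])
    attention = consolidate_attention(layers)
    current = np.ones((2, 2), dtype=int)        # tokens of the source image
    proposed = np.full((2, 2), 9, dtype=int)    # tokens the sampler wants to write
    updated, editable = region_hold_update(current, proposed, attention)
    print(updated)
```

Tokens in low-attention cells keep their source values, which is what confines the edit to the localized region.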

[110] Referring Change Detection in Remote Sensing Imagery

Yilmaz Korkmaz,Jay N. Paranjape,Celso M. de Melo,Vishal M. Patel

Main category: cs.CV

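TL;DR: This paper introduces Referring Change Detection (RCD), which uses natural language prompts to detect specific types of change in remote sensing images, overcoming the fixed classes, limited data, and class imbalance of traditional methods.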

Details Motivation: Traditional change detection cannot target the specific types of change a user cares about, and semantic change detection is tied to fixed class definitions and model architectures, which makes reuse across tasks difficult. A flexible, scalable change detection method is needed. Method: A two-stage framework is proposed. The first stage, RCDNet, is a cross-modal fusion network for referring change detection. The second stage, RCDGen, is a diffusion-based synthetic data pipeline that generates realistic post-change images and matching change maps from the pre-change image alone. Result: Experiments on multiple datasets show that the framework enables scalable, targeted change detection and eases data scarcity and class imbalance. Conclusion: By combining language and visual information, RCD enables flexible, user-specified change detection, and RCDGen lowers the barrier to building large annotated datasets, offering a new paradigm for remote sensing change detection. Abstract: Change detection in remote sensing imagery is essential for applications such as urban planning, environmental monitoring, and disaster management. Traditional change detection methods typically identify all changes between two temporal images without distinguishing the types of transitions, which can lead to results that may not align with specific user needs. Although semantic change detection methods have attempted to address this by categorizing changes into predefined classes, these methods rely on rigid class definitions and fixed model architectures, making it difficult to mix datasets with different label sets or reuse models across tasks, as the output channels are tightly coupled with the number and type of semantic classes. To overcome these limitations, we introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. By integrating language understanding with visual analysis, our approach allows users to specify the exact type of change they are interested in. However, training models for RCD is challenging due to the limited availability of annotated data and severe class imbalance in existing datasets.
To address this, we propose a two-stage framework consisting of (I) RCDNet, a cross-modal fusion network designed for referring change detection, and (II) RCDGen, a diffusion-based synthetic data generation pipeline that produces realistic post-change images and change maps for a specified category using only the pre-change image, without relying on semantic segmentation masks and thereby significantly lowering the barrier to scalable data creation. Experiments across multiple datasets show that our framework enables scalable and targeted change detection. Project website is here: https://yilmazkorkmaz1.github.io/RCD.

[111] Reframing Music-Driven 2D Dance Pose Generation as Multi-Channel Image Generation

Yan Zhang,Han Zou,Lincong Feng,Cong Xie,Ruiqi Yu,Zhenpeng Zhan

Main category: cs.CV

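TL;DR: This paper proposes a new method for generating 2D dance pose sequences from music. Pose sequences are encoded as images and modeled with a DiT architecture. Combined with a time-shared temporal index and a reference-pose conditioning strategy, the method produces high-quality dance that is aligned with the musical rhythm and consistent in subject.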

Details Motivation: Existing music-to-dance methods struggle to generate temporally coherent, rhythm-aligned 2D pose sequences in complex, high-variance in-the-wild settings, so a method that better models high-variance distributions while keeping temporal consistency is needed. Method: 2D pose sequences are encoded as one-hot images, compressed with a pretrained image VAE, and modeled with a DiT-style backbone. A time-shared temporal indexing scheme aligns music tokens with pose latents, and a reference-pose conditioning strategy preserves body proportions and on-screen scale while supporting segment-and-stitch generation of long sequences. Result: On a large in-the-wild dance dataset and the AIST++2D benchmark, the method outperforms representative methods on pose-space and video-space metrics and in human preference tests, and ablations confirm the contribution of each component. Conclusion: Treating music-to-dance generation as image synthesis, together with new temporal-alignment and identity-preserving mechanisms, markedly improves the temporal coherence, rhythm alignment, and visual quality of generated dance. Abstract: Recent pose-to-video models can translate 2D pose sequences into photorealistic, identity-preserving dance videos, so the key challenge is to generate temporally coherent, rhythm-aligned 2D poses from music, especially under complex, high-variance in-the-wild distributions. We address this by reframing music-to-dance generation as a music-token-conditioned multi-channel image synthesis problem: 2D pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone, allowing us to inherit architectural and training advances from modern text-to-image models and better capture high-variance 2D pose distributions. On top of this formulation, we introduce (i) a time-shared temporal indexing scheme that explicitly synchronizes music tokens and pose latents over time and (ii) a reference-pose conditioning strategy that preserves subject-specific body proportions and on-screen scale while enabling long-horizon segment-and-stitch generation. Experiments on a large in-the-wild 2D dance corpus and the calibrated AIST++2D benchmark show consistent improvements over representative music-to-dance methods in pose- and video-space metrics and human preference, and ablations validate the contributions of the representation, temporal indexing, and reference conditioning. See supplementary videos at https://hot-dance.github.io

[112] Weak-to-Strong Generalization Enables Fully Automated De Novo Training of Multi-head Mask-RCNN Model for Segmenting Densely Overlapping Cell Nuclei in Multiplex Whole-slice Brain Images

Lin Bai,Xiaoyang Li,Liqiang Huang,Quynh Nguyen,Hien Van Nguyen,Saurabh Prasad,Dragan Maric,John Redell,Pramod Dash,Badrinath Roysam

Main category: cs.CV

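TL;DR: Proposes a weak-to-strong generalization method for training a multi-head Mask-RCNN with efficient channel attention, enabling segmentation of overlapping cell nuclei without human annotations.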

Details Motivation: Accurately segmenting overlapping cell nuclei in multiplex cyclic immunofluorescence whole-slide images is difficult, and manual annotation is expensive, so an automated solution is needed. Method: A multi-head Mask-RCNN architecture with efficient channel attention is trained through weak-to-strong generalization, which relies on pseudo-label correction and coverage expansion. Self-diagnosis metrics are designed to assess segmentation quality. Result: Without human annotations, the method learns to segment a new class of images from a new instrument or imaging protocol, and it shows a significant improvement over five widely used methods. Conclusion: The method provides high-quality, fully automated nucleus segmentation with good generalization and practical value for production use on large-scale WSI. Abstract: We present a weak to strong generalization methodology for fully automated training of a multi-head extension of the Mask-RCNN method with efficient channel attention for reliable segmentation of overlapping cell nuclei in multiplex cyclic immunofluorescent (IF) whole-slide images (WSI), and present evidence for pseudo-label correction and coverage expansion, the key phenomena underlying weak to strong generalization. This method can learn to segment de novo a new class of images from a new instrument and/or a new imaging protocol without the need for human annotations. We also present metrics for automated self-diagnosis of segmentation quality in production environments, where human visual proofreading of massive WSI images is unaffordable. Our method was benchmarked against five current widely used methods and showed a significant improvement. The code, sample WSI images, and high-resolution segmentation results are provided in open form for community adoption and adaptation.

[113] SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Minglei Shi,Haolin Wang,Borui Zhang,Wenzhao Zheng,Bohan Zeng,Ziyang Yuan,Xiaoshi Wu,Yuanxing Zhang,Huan Yang,Xintao Wang,Pengfei Wan,Kun Gai,Jie Zhou,Jiwen Lu

Main category: cs.CV

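TL;DR: This paper proposes SVG-T2I, a large-scale text-to-image diffusion model trained directly in the representation space of a Visual Foundation Model (VFM). It demonstrates the strong potential of VFM representations for generation, and the project is fully open-sourced.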

Details Motivation: Visual Foundation Models (VFMs) are promising for unifying visual understanding and generation, but training large-scale text-to-image diffusion models directly in the VFM representation space remains largely unexplored. Method: The SVG framework is scaled up into SVG-T2I, which applies a standard text-to-image diffusion training pipeline in the VFM feature space, together with an autoencoder, to generate high-quality images. Result: SVG-T2I reaches 0.75 on GenEval and 85.78 on DPG-Bench, competitive with existing methods, which supports the claim that VFM representations are sufficient for high-quality generation. Conclusion: The VFM representation space has strong generative capacity and supports efficient text-to-image synthesis directly in the VFM feature domain, offering a viable path for representation-driven visual generation. Abstract: Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.

[114] Reducing Domain Gap with Diffusion-Based Domain Adaptation for Cell Counting

Mohammad Dehghanmanshadi,Wallapak Tavanapong

Main category: cs.CV

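TL;DR: This study applies the Inversion-Based Style Transfer (InST) framework to biomedical microscopy images. Using latent-space Adaptive Instance Normalization and stochastic inversion in a diffusion model, it transfers the style of real fluorescence microscopy images onto synthetic images, substantially narrowing the domain gap between synthetic and real data.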

Details Motivation: Traditional domain adaptation struggles to bridge the domain gap when synthetic images lack the complex textures of real samples, which limits deep learning models in label-scarce settings such as cell counting. A way to generate more realistic synthetic images is needed to improve training. Method: The InST framework combines latent-space Adaptive Instance Normalization (AdaIN) with stochastic inversion in a diffusion model to transfer style from real fluorescence microscopy images to synthetic ones while weakly preserving content structure. The resulting data is used to pre-train and fine-tune EfficientNet-B0 models for cell counting. Result: Models trained on InST-synthesized data achieve up to 37% lower MAE than models trained on hard-coded synthetic data and 52% lower MAE than models trained on Cell200-s (from 53.70 to 25.95). They also outperform models trained on real data alone (25.95 vs. 27.74 MAE). Combining the data with lightweight domain adaptation such as DACS with CutMix improves performance further. Conclusion: InST style transfer narrows the domain gap between synthetic and real microscopy images and markedly improves cell counting, offering a scalable solution that reduces reliance on manual labeling. Abstract: Generating realistic synthetic microscopy images is critical for training deep learning models in label-scarce environments, such as cell counting with many cells per image. However, traditional domain adaptation methods often struggle to bridge the domain gap when synthetic images lack the complex textures and visual patterns of real samples. In this work, we adapt the Inversion-Based Style Transfer (InST) framework originally designed for artistic style transfer to biomedical microscopy images. Our method combines latent-space Adaptive Instance Normalization with stochastic inversion in a diffusion model to transfer the style from real fluorescence microscopy images to synthetic ones, while weakly preserving content structure. We evaluate the effectiveness of our InST-based synthetic dataset for downstream cell counting by pre-training and fine-tuning EfficientNet-B0 models on various data sources, including real data, hard-coded synthetic data, and the public Cell200-s dataset. Models trained with our InST-synthesized images achieve up to 37% lower Mean Absolute Error (MAE) compared to models trained on hard-coded synthetic data, and a 52% reduction in MAE compared to models trained on Cell200-s (from 53.70 to 25.95 MAE). Notably, our approach also outperforms models trained on real data alone (25.95 vs. 27.74 MAE). Further improvements are achieved when combining InST-synthesized data with lightweight domain adaptation techniques such as DACS with CutMix.
These findings demonstrate that InST-based style transfer most effectively reduces the domain gap between synthetic and real microscopy data. Our approach offers a scalable path for enhancing cell counting performance while minimizing manual labeling effort. The source code and resources are publicly available at: https://github.com/MohammadDehghan/InST-Microscopy.
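
Adaptive Instance Normalization itself is a standard operation: each channel of the content features is shifted and scaled to match the per-channel mean and standard deviation of the style features. The sketch below applies it to arrays shaped like latents (channels, height, width). It illustrates AdaIN only, not the full InST pipeline with stochastic inversion, and the helper `relative_reduction`, the array shapes, and the random inputs are assumptions added for illustration. The helper also reproduces the reported MAE reduction from the two figures quoted in the abstract.

```python
import numpy as np


def adain(content, style, eps=1e-5):
    """Adaptive Instance Normalization on (C, H, W) arrays: give each content
    channel the mean and standard deviation of the matching style channel."""
    mu_c = content.mean(axis=(1, 2), keepdims=True)
    std_c = content.std(axis=(1, 2), keepdims=True) + eps
    mu_s = style.mean(axis=(1, 2), keepdims=True)
    std_s = style.std(axis=(1, 2), keepdims=True)
    return std_s * (content - mu_c) / std_c + mu_s


def relative_reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    content = rng.normal(size=(4, 8, 8))             # stand-in synthetic-image latent
    style = 3.0 * rng.normal(size=(4, 8, 8)) + 5.0   # stand-in real-image latent
    out = adain(content, style)
    print(out.mean(axis=(1, 2)), style.mean(axis=(1, 2)))
    # MAE figures quoted in the abstract: 53.70 (Cell200-s) vs. 25.95 (InST)
    print(round(relative_reduction(53.70, 25.95), 1))
```

The spatial layout of `content` is untouched; only channel statistics change, which is consistent with style being transferred while content structure is weakly preserved.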

[115] Smudged Fingerprints: A Systematic Evaluation of the Robustness of AI Image Fingerprints

Kai Yao,Marc Juarez

Main category: cs.CV

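TL;DR: This paper presents the first systematic evaluation of the security of model fingerprint detection under adversarial conditions. It defines two attack goals, fingerprint removal and fingerprint forgery, and shows experimentally that existing methods trade robustness against accuracy.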

Details Motivation: Model fingerprint detection is promising for attributing AI-generated images to their source models, but its robustness under adversarial conditions is unclear and needs a systematic security evaluation. Method: The paper formalizes white-box and black-box threat models, implements five attack strategies, and evaluates 14 representative fingerprinting methods across RGB, frequency, and learned-feature domains on 12 state-of-the-art image generators. Result: Removal attacks often succeed more than 80% of the time in white-box settings and more than 50% under constrained black-box access. Forgery is harder, and its success varies across target models. The most accurate methods tend to be the most vulnerable, and no method is both highly robust and highly accurate across all threat models. Conclusion: Existing fingerprinting techniques are generally fragile under adversarial conditions. New methods that balance robustness and accuracy are needed, and the study provides an empirical basis for that work. Abstract: Model fingerprint detection techniques have emerged as a promising approach for attributing AI-generated images to their source models, but their robustness under adversarial conditions remains largely unexplored. We present the first systematic security evaluation of these techniques, formalizing threat models that encompass both white- and black-box access and two attack goals: fingerprint removal, which erases identifying traces to evade attribution, and fingerprint forgery, which seeks to cause misattribution to a target model. We implement five attack strategies and evaluate 14 representative fingerprinting methods across RGB, frequency, and learned-feature domains on 12 state-of-the-art image generators. Our experiments reveal a pronounced gap between clean and adversarial performance. Removal attacks are highly effective, often achieving success rates above 80% in white-box settings and over 50% under constrained black-box access. While forgery is more challenging than removal, its success significantly varies across targeted models. We also identify a utility-robustness trade-off: methods with the highest attribution accuracy are often vulnerable to attacks. Although some techniques exhibit robustness in specific settings, none achieves high robustness and accuracy across all evaluated threat models. These findings highlight the need for techniques balancing robustness and accuracy, and identify the most promising approaches for advancing this goal.

[116] MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator

Peiqing Yang,Shangchen Zhou,Kai Hao,Qingyi Tao

Main category: cs.CV

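TL;DR: This paper proposes a learned Matting Quality Evaluator (MQE) that assesses the semantic and boundary quality of alpha mattes without ground truth. The MQE is used to build VMReal, a large-scale real-world dataset, and is combined with a reference-frame training strategy; the resulting MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks.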

Details Motivation: Existing video matting datasets are limited in scale and realism, and the lack of effective boundary supervision leads to coarse, segmentation-like mattes that lack fine detail. Method: Introduces a learned Matting Quality Evaluator (MQE) that produces a pixel-wise quality evaluation map; uses the MQE both as online feedback during training and as an offline selection module for data curation, which enables the construction of the large-scale real-world dataset VMReal; adds a reference-frame training strategy that draws on long-range frames to handle appearance changes in long videos. Result: VMReal contains 28K clips and 2.4M frames; MatAnyone 2 surpasses prior methods on both synthetic and real-world benchmarks and achieves state-of-the-art performance. Conclusion: By enabling fine-grained quality assessment without ground truth, the MQE improves both training supervision and data quality for video matting, advancing large-scale, high-fidelity video matting. Abstract: Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The MQE scales up video matting in two ways: (1) as an online matting-quality feedback during training to suppress erroneous regions, providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.

[117] Uncertainty-Aware Domain Adaptation for Vitiligo Segmentation in Clinical Photographs

Wentao Jiang,Vamsi Varra,Caitlin Perez-Stable,Harrison Zhu,Meredith Apicella,Nicole Nyamongo

Main category: cs.CV

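TL;DR: Proposes a trustworthy, frequency-aware segmentation framework for quantifying vitiligo extent in clinical photographs. It combines domain-adaptive pre-training, a High-Frequency Spectral Gating module, and an uncertainty estimation mechanism, and outperforms the compared CNN and Transformer baselines in both segmentation performance and clinical reliability.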

Details Motivation: Accurately quantifying vitiligo extent in clinical photographs is essential for longitudinal monitoring of treatment response, but existing methods fall short in handling background noise, capturing subtle textures, and conveying prediction reliability. Method: Uses a ConvNeXt V2-based encoder enhanced with a High-Frequency Spectral Gating (HFSG) module and stem-skip connections to capture subtle textures; combines domain-adaptive pre-training with an ROI-constrained dual-task loss for data-efficient training; generates pixel-wise uncertainty maps with a K-fold ensemble and Test-Time Augmentation (TTA) to improve clinical trust. Result: On an expert-annotated clinical cohort, the framework reaches a Dice score of 85.05% and reduces the 95% Hausdorff Distance from 44.79 px to 29.95 px, with zero catastrophic failures; it also provides interpretable entropy maps that flag ambiguous regions for clinician review. Conclusion: The proposed framework performs well in accuracy, robustness, and interpretability, and offers a reliable standard for automated vitiligo assessment with good prospects for clinical use. Abstract: Accurately quantifying vitiligo extent in routine clinical photographs is crucial for longitudinal monitoring of treatment response. We propose a trustworthy, frequency-aware segmentation framework built on three synergistic pillars: (1) a data-efficient training strategy combining domain-adaptive pre-training on the ISIC 2019 dataset with an ROI-constrained dual-task loss to suppress background noise; (2) an architectural refinement via a ConvNeXt V2-based encoder enhanced with a novel High-Frequency Spectral Gating (HFSG) module and stem-skip connections to capture subtle textures; and (3) a clinical trust mechanism employing K-fold ensemble and Test-Time Augmentation (TTA) to generate pixel-wise uncertainty maps. Extensive validation on an expert-annotated clinical cohort demonstrates superior performance, achieving a Dice score of 85.05% and significantly reducing boundary error (95% Hausdorff Distance improved from 44.79 px to 29.95 px), consistently outperforming strong CNN (ResNet-50 and UNet++) and Transformer (MiT-B5) baselines. Notably, our framework demonstrates high reliability with zero catastrophic failures and provides interpretable entropy maps to identify ambiguous regions for clinician review. Our approach suggests that the proposed framework establishes a robust and reliable standard for automated vitiligo assessment.
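As a rough illustration of two quantities this entry reports, the sketch below computes a Dice score between two binary masks and a per-pixel entropy map from averaged ensemble/TTA probabilities. It is not the authors' implementation: the function names, the flat-list mask representation, and the use of binary Shannon entropy over the mean foreground probability are all assumptions made for illustration.

```python
import math


def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat lists of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli(p) prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def uncertainty_map(member_probs):
    """Average per-pixel foreground probabilities over ensemble/TTA members.

    member_probs: list of flat probability lists, one per member (hypothetical layout).
    Returns (mean_probs, entropy_per_pixel).
    """
    n = len(member_probs)
    mean = [sum(column) / n for column in zip(*member_probs)]
    return mean, [binary_entropy(p) for p in mean]


if __name__ == "__main__":
    print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
    print(uncertainty_map([[1.0, 0.0, 0.5], [1.0, 1.0, 0.5]]))
```

Pixels where the members disagree receive entropy close to 1 bit, which is the kind of region an entropy map would flag for clinician review.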

[118] Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation

Yang Fei,George Stoica,Jingyuan Liu,Qifeng Chen,Ranjay Krishna,Xiaojuan Wang,Benlin Liu

Main category: cs.CV

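TL;DR: This paper proposes a method for distilling structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). The resulting model, SAM2VideoX, uses a bidirectional feature fusion module and a Local Gram Flow loss to generate more realistic, structurally consistent video.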

Details Motivation: Existing video diffusion models struggle to preserve structure when generating objects with complex motion, such as humans and animals; they rely on conditioning with noisy motion representations, which leads to physically implausible transitions. A method that injects reliable motion priors is needed to improve generation quality. Method: Proposes a distillation algorithm that extracts structure-preserving motion priors from an autoregressive video tracking model such as SAM2 and transfers them into the CogVideoX diffusion model; designs a bidirectional feature fusion module to capture global motion structure, and introduces a Local Gram Flow loss to align how local features move together. Result: On VBench, the model scores 95.51%, 2.60 points above REPA (92.91%); FVD drops to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively; human raters prefer its outputs 71.4% of the time. Conclusion: By introducing structure-preserving motion priors from a strong tracking model, SAM2VideoX improves the realism and structural consistency of generated video. Abstract: Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60% on VBench, 21-22% lower FVD, and 71.4% human preference) over prior baselines. Specifically, on VBench, we achieve 95.51%, surpassing REPA (92.91%) by 2.60%, and reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at https://sam2videox.github.io/ .
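The reported gains mix absolute and relative figures, so a short arithmetic check may help when reading them. The sketch below confirms that the VBench gain is an absolute difference in percentage points, and back-computes the baseline FVD values that the stated reductions would imply. The baseline FVDs are not given in the abstract; the values computed here hold only under the assumption that each percentage is a relative reduction from the baseline, and the function names are hypothetical.

```python
def absolute_gain(ours, baseline):
    """Difference in percentage points between two scores."""
    return ours - baseline


def implied_baseline_fvd(ours_fvd, relative_reduction_pct):
    """Baseline FVD implied by a relative reduction.

    Assumes ours = baseline * (1 - reduction / 100); this reading is an
    assumption, not something the abstract states explicitly.
    """
    return ours_fvd / (1.0 - relative_reduction_pct / 100.0)


if __name__ == "__main__":
    # VBench: 95.51 vs. 92.91 for REPA, as reported in the abstract.
    print(round(absolute_gain(95.51, 92.91), 2))
    # FVD 360.57 with stated 21.20% (REPA) and 22.46% (LoRA) improvements.
    print(round(implied_baseline_fvd(360.57, 21.20), 2))
    print(round(implied_baseline_fvd(360.57, 22.46), 2))
```

Under that assumption, the implied baseline FVDs come out near 457.6 for REPA-finetuning and 465.0 for LoRA-finetuning.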

[119] Particulate: Feed-Forward 3D Object Articulation

Ruining Li,Yuxin Yao,Chuanxia Zheng,Christian Rupprecht,Joan Lasenby,Shangzhe Wu,Andrea Vedaldi

Main category: cs.CV

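TL;DR: Proposes Particulate, a feed-forward method that infers an object's articulated structure, including its 3D parts, kinematic structure, and motion constraints, directly from a single static 3D mesh.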

Details Motivation: Existing methods rely on per-object optimization, which is slow, hard to scale, and ill-suited to processing AI-generated 3D assets efficiently. A fast, general method for automatically building articulated 3D models is needed. Method: Designs a transformer network, Part Articulation Transformer, that processes a point cloud of the input mesh and is trained end-to-end to predict the articulated-structure attributes with native multi-joint support; the predictions are then lifted back to the original mesh. Result: Significantly outperforms state-of-the-art approaches on a newly curated benchmark of high-quality assets; inference takes seconds, and the method also handles AI-generated 3D assets, including those produced from a single real or synthetic image. Conclusion: Particulate infers 3D object articulation quickly and accurately, enables extraction of articulated 3D models from a single image when combined with an off-the-shelf image-to-3D generator, and contributes a new benchmark whose evaluation protocol is more consistent with human preferences. Abstract: We present Particulate, a feed-forward approach that, given a single static 3D mesh of an everyday object, directly infers all attributes of the underlying articulated structure, including its 3D parts, kinematic structure, and motion constraints. At its core is a transformer network, Part Articulation Transformer, which processes a point cloud of the input mesh using a flexible and scalable architecture to predict all the aforementioned attributes with native multi-joint support. We train the network end-to-end on a diverse collection of articulated 3D assets from public datasets. During inference, Particulate lifts the network's feed-forward prediction to the input mesh, yielding a fully articulated 3D model in seconds, much faster than prior approaches that require per-object optimization. Particulate can also accurately infer the articulated structure of AI-generated 3D assets, enabling full-fledged extraction of articulated 3D objects from a single (real or synthetic) image when combined with an off-the-shelf image-to-3D generator. We further introduce a new challenging benchmark for 3D articulation estimation curated from high-quality public 3D assets, and redesign the evaluation protocol to be more consistent with human preferences. Quantitative and qualitative results show that Particulate significantly outperforms state-of-the-art approaches.

[120] V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties

Ye Fang,Tong Wu,Valentin Deschaintre,Duygu Ceylan,Iliyan Georgiev,Chun-Hao Paul Huang,Yiwei Hu,Xuelin Chen,Tuanfeng Yang Wang

Main category: cs.CV

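TL;DR: This paper presents V-RGBX, the first end-to-end framework for intrinsic-aware video editing. It unifies video inverse rendering, video synthesis from intrinsic representations, and keyframe-based editing.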

Details Motivation: Large-scale video generation models can produce photorealistic appearance and lighting, but there is no closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance) and supports editing them. Method: Proposes the V-RGBX framework, built around an interleaved conditioning mechanism, which decomposes a video into intrinsic channels, synthesizes video from those representations, and supports keyframe-based editing conditioned on the intrinsic channels. Result: Experiments show that V-RGBX produces temporally consistent, photorealistic videos and propagates keyframe edits across the sequence in a physically plausible manner, supporting applications such as object appearance editing and scene-level relighting. Conclusion: V-RGBX is the first end-to-end framework for intrinsic-aware video editing and surpasses prior methods in editing quality and physical plausibility. Abstract: Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.

[121] Moment-Based 3D Gaussian Splatting: Resolving Volumetric Occlusion with Order-Independent Transmittance

Jan U. Müller,Robin Tim Landsgesell,Leif Van Holland,Patrick Stotko,Reinhard Klein

Main category: cs.CV

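TL;DR: Proposes a high-fidelity transmittance computation method based on statistical moments for rasterization-based rendering of 3D Gaussian Splatting. It avoids ray tracing and per-pixel sample sorting, and improves rendering quality for complex semi-transparent objects.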

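Details Motivation: Existing 3D Gaussian Splatting renderers rely on simplified, order-dependent alpha blending and coarse approximations of the density integral, which makes it hard to render complex, overlapping semi-transparent objects accurately. Method: Building on moment-based order-independent transparency, the method represents the density distribution along each camera ray with a compact, continuous representation based on statistical moments; it analytically derives and computes per-pixel moments from all contributing 3D Gaussians, reconstructs a continuous transmittance function for each ray, and samples that function independently within each Gaussian. Result: Achieves high-fidelity transmittance computation without ray tracing or per-pixel sample sorting, and significantly improves the modeling of light attenuation in complex translucent media as well as overall reconstruction and rendering quality. Conclusion: The method keeps rendering within the rasterization pipeline while substantially improving the physical accuracy and visual fidelity of 3D Gaussian representations in complex transparent scenes. Abstract: The recent success of 3D Gaussian Splatting (3DGS) has reshaped novel view synthesis by enabling fast optimization and real-time rendering of high-quality radiance fields. However, it relies on simplified, order-dependent alpha blending and coarse approximations of the density integral within the rasterizer, thereby limiting its ability to render complex, overlapping semi-transparent objects. In this paper, we extend rasterization-based rendering of 3D Gaussian representations with a novel method for high-fidelity transmittance computation, entirely avoiding the need for ray tracing or per-pixel sample sorting. Building on prior work in moment-based order-independent transparency, our key idea is to characterize the density distribution along each camera ray with a compact and continuous representation based on statistical moments. To this end, we analytically derive and compute a set of per-pixel moments from all contributing 3D Gaussians. From these moments, a continuous transmittance function is reconstructed for each ray, which is then independently sampled within each Gaussian. As a result, our method bridges the gap between rasterization and physical accuracy by modeling light attenuation in complex translucent media, significantly improving overall reconstruction and rendering quality.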
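To make the "order-dependent alpha blending" limitation concrete, the sketch below implements standard front-to-back alpha compositing along one ray and shows that swapping the blend order changes the resulting color, while the total transmittance stays the same. This illustrates only the baseline behavior the paper sets out to improve; it is not the moment-based method, and the scalar colors and function name are assumptions made for illustration.

```python
def composite_front_to_back(samples):
    """Standard alpha compositing along a single ray.

    samples: list of (color, alpha) pairs in the order they are blended,
    with scalar colors for simplicity (hypothetical layout).
    Returns (accumulated_color, remaining_transmittance).
    """
    color = 0.0
    transmittance = 1.0
    for c, a in samples:
        # Each sample contributes in proportion to the light not yet absorbed.
        color += transmittance * a * c
        transmittance *= 1.0 - a
    return color, transmittance


if __name__ == "__main__":
    bright_then_dark = [(1.0, 0.5), (0.0, 0.5)]
    dark_then_bright = list(reversed(bright_then_dark))
    print(composite_front_to_back(bright_then_dark))
    print(composite_front_to_back(dark_then_bright))
```

Because the color depends on the blend order, a rasterizer using this scheme has to sort contributions per pixel or accept errors when primitives overlap; an order-independent transmittance estimate removes that dependency.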