cs.CL

[1] Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability

Hen-Hsen Huang

Main category: cs.CL

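TL;DR: This paper proposes a new research direction for efficient large language models in resource-constrained settings. It favors simplification over added complexity, advocating architecture improvements that need no retraining, lightweight fine-tuning, economical reasoning, and dynamic knowledge management, and it introduces Overhead-Aware Efficiency (OAE) as a new benchmark to make LLM deployment fair, sustainable, and widely accessible.

Details Motivation: Existing efficiency methods such as MoE, speculative decoding, and complex RAG mainly serve hyperscale providers with vast infrastructure. In resource-constrained settings they instead bring excessive overhead, fragility, and wasted resources, so the benefits concentrate among a few tech giants and worsen both technological inequality and carbon emissions. Method: Proposes "robust simplicity" as a new paradigm: retrofitting pretrained model architectures without retraining, developing lightweight fine-tuning that preserves alignment, making long chain-of-thought reasoning more efficient, enabling dynamic knowledge management without heavy RAG pipelines, and adopting Overhead-Aware Efficiency (OAE) as the evaluation standard. Result: Provides a viable path to efficient LLM deployment for non-hyperscale organizations (such as hospitals, schools, and governments), advancing the democratization of LLM technology. Conclusion: The definition of efficiency should be broadened to include adoption cost, sustainability, and fairness; real efficiency should reduce inequality and carbon waste rather than amplify them. Abstract: Large language models (LLMs) have become indispensable, but the most celebrated efficiency methods -- mixture-of-experts (MoE), speculative decoding, and complex retrieval-augmented generation (RAG) -- were built for hyperscale providers with vast infrastructure and elite teams. Outside that context, their benefits collapse into overhead, fragility, and wasted carbon. The result is that a handful of Big Tech companies benefit, while thousands of hospitals, schools, governments, and enterprises are left without viable options. We argue that the next frontier is not greater sophistication at scale, but robust simplicity: efficiency that thrives under modest resources and minimal expertise. We propose a new research agenda: retrofitting pretrained models with more efficient architectures without retraining, inventing lightweight fine-tuning that preserves alignment, making reasoning economical despite long chains of thought, enabling dynamic knowledge management without heavy RAG pipelines, and adopting Overhead-Aware Efficiency (OAE) as a standard benchmark. By redefining efficiency to include adoption cost, sustainability, and fairness, we can democratize LLM deployment -- ensuring that optimization reduces inequality and carbon waste rather than amplifying them.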

[2] Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology

Tcharlies Schmitz

Main category: cs.CL

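TL;DR: This paper proposes Harmonic Token Projection (HTP), a reversible, deterministic framework for generating text embeddings without training, vocabularies, or stochastic parameters. HTP maps each token's Unicode integer representation to a harmonic trajectory, giving a bijective and interpretable mapping from discrete symbols to a continuous vector space, and shows efficient, stable performance on semantic similarity tasks.

Details Motivation: Conventional neural embeddings rely on statistical co-occurrence or optimization, lack interpretability and reversibility, and are computationally costly. The paper explores whether meaningful semantic representations can be generated from deterministic geometric structure alone, as a more transparent and efficient alternative. Method: HTP converts each token's Unicode integer value into a harmonic trajectory, using analytic functions to produce a continuous vector representation. The mapping is bijective and reversible, and semantic similarity is assessed by geometric alignment between vectors, with no training involved. Result: On STS-B and its multilingual extension, HTP reaches a Spearman correlation of ρ = 0.68 in English and stays stable across ten languages, with latency under 1 millisecond per sentence pair and very low computational cost. Conclusion: HTP shows that deterministic geometric structure alone can capture semantic relations, offering an interpretable, efficient, and reversible paradigm for text embeddings that challenges conventional data-driven, statistically learned embeddings. Abstract: This paper introduces the Harmonic Token Projection (HTP), a reversible and deterministic framework for generating text embeddings without training, vocabularies, or stochastic parameters. Unlike neural embeddings that rely on statistical co-occurrence or optimization, HTP encodes each token analytically as a harmonic trajectory derived from its Unicode integer representation, establishing a bijective and interpretable mapping between discrete symbols and continuous vector space. The harmonic formulation provides phase-coherent projections that preserve both structure and reversibility, enabling semantic similarity estimation from purely geometric alignment. Experimental evaluation on the Semantic Textual Similarity Benchmark (STS-B) and its multilingual extension shows that HTP achieves a Spearman correlation of ρ = 0.68 in English, maintaining stable performance across ten languages with negligible computational cost and sub-millisecond latency per sentence pair. This demonstrates that meaningful semantic relations can emerge from deterministic geometry, offering a transparent and efficient alternative to data-driven embeddings. Keywords: Harmonic Token Projection, reversible embedding, deterministic encoding, semantic similarity, multilingual representation.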

[3] A centroid based framework for text classification in itsm environments

Hossein Mohanna,Ali Ait-Bachir

Main category: cs.CL

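TL;DR: Proposes a dual-embedding centroid-based text classification framework for hierarchical classification in IT service management, balancing performance, interpretability, and efficient updates.

Details Motivation: In ITSM systems, support tickets must be assigned to a tree-structured taxonomy, and existing methods fall short on interpretability and update efficiency. Method: Represents each category with dual-embedding (semantic and lexical) centroids and combines the two at inference time through reciprocal rank fusion. Result: On 8,968 tickets across 123 categories, hierarchical F1 reaches 0.731, above the SVM's 0.727, with 5.9 times faster training, up to 152 times faster incremental updates, and an 8.6-8.8 times batch speedup (excluding embedding computation). Conclusion: The method keeps performance competitive while substantially improving training and update efficiency and offering good interpretability, which suits production environments that value maintainability and operational efficiency. Abstract: Text classification with hierarchical taxonomies is a fundamental requirement in IT Service Management (ITSM) systems, where support tickets must be categorized into tree-structured taxonomies. We present a dual-embedding centroid-based classification framework that maintains separate semantic and lexical centroid representations per category, combining them through reciprocal rank fusion at inference time. The framework achieves performance competitive with Support Vector Machines (hierarchical F1: 0.731 vs 0.727) while providing interpretability through centroid representations. Evaluated on 8,968 ITSM tickets across 123 categories, this method achieves 5.9 times faster training and up to 152 times faster incremental updates, with an 8.6-8.8 times speedup across batch sizes (100-1000 samples) when excluding embedding computation. These results make the method suitable for production ITSM environments prioritizing interpretability and operational efficiency.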
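
The fusion step named in the abstract can be illustrated with a short sketch. It assumes the common reciprocal rank fusion formulation, in which each category scores the sum of 1/(k + rank) over the input rankings. The constant k = 60, the function name, and the toy category labels are assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of category labels into one ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a conventional default, assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, label in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) to the label's fused score.
            scores[label] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically.
    return sorted(scores, key=lambda label: (-scores[label], label))


# Hypothetical rankings from the semantic and lexical centroid comparisons.
semantic_ranking = ["network", "hardware", "email"]
lexical_ranking = ["email", "network", "hardware"]

fused = reciprocal_rank_fusion([semantic_ranking, lexical_ranking])
print(fused)
```

Because the fusion uses only ranks, the semantic and lexical similarity scores do not need to share a scale, which is one plausible reason to prefer it over averaging raw similarities.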

[4] PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

Yongfu Xue

Main category: cs.CL

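TL;DR: PIRA is a new training paradigm that improves reward-model data efficiency and mitigates reward overoptimization through preference-based instruction reformulation, multi-task reward aggregation, and averaging of value-head outputs under dropout.

Details Motivation: Conventional discriminative reward models suffer from low data efficiency and are vulnerable to reward overoptimization, so more effective training methods are needed to improve alignment. Method: Proposes the PIRA framework: (1) reformulating question-answer pairs as preference-based instructions; (2) aggregating rewards from different preference tasks to reduce bias; (3) averaging value-head outputs under varying dropout rates to stabilize rewards. Result: Extensive experiments show that PIRA significantly outperforms conventional methods in data efficiency and model robustness. Conclusion: PIRA effectively addresses data efficiency and overoptimization in reward models, offering a more reliable approach to aligning LLMs with human preferences. Abstract: Reward models are crucial for aligning Large Language Models (LLMs) with human preferences but face two representative challenges. First, traditional discriminative reward models usually concatenate questions and responses directly as input, resulting in low data efficiency. Second, reward models are vulnerable to reward overoptimization. We propose PIRA, a training paradigm addressing these issues through three strategies: (1) reformulating question-answer pairs into preference-based instructions for clearer and more explicit task specification, (2) aggregating rewards from diverse preference tasks to reduce bias and improve robustness, and (3) averaging value-head outputs under varying dropout rates to stabilize rewards. Extensive experiments have demonstrated the effectiveness of PIRA.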

Mann Khatri,Mirza Yusuf,Rajiv Ratn Shah,Ponnurangam Kumaraguru

Main category: cs.CL

TL;DR: This paper studies how restructuring legal documents, defining rhetorical roles, and emulating courts' reasoning can improve the zero-shot performance of large language models in the legal domain; experiments show that these methods significantly raise F1 scores.

Details Motivation: Large language models perform well in general domains but are limited in specialized fields such as law, because they lack domain-specific pretraining and struggle with long, complex legal texts. Method: Runs zero-shot experiments on three Indian legal judgment prediction datasets, analyzing model behavior by reorganizing documents by rhetorical role, defining legal terms, and emulating the step-by-step reasoning of courts. Result: Organizing the data or explaining key legal terms significantly improves performance, raising F1 over the baseline by at least about 1.5% and by up to 4.36%. Conclusion: Adding structured information and explanations of domain terminology effectively strengthens LLMs' long-text processing and reasoning on legal tasks. Abstract: Large Language Models (LLMs), trained on extensive datasets from the web, exhibit remarkable general reasoning skills. Despite this, they often struggle in specialized areas like law, mainly because they lack domain-specific pretraining. The legal field presents unique challenges, as legal documents are generally long and intricate, making it hard for models to process the full text efficiently. Previous studies have examined in-context approaches to address the knowledge gap, boosting model performance in new domains without full domain alignment. In our paper, we analyze model behavior on legal tasks by conducting experiments in three areas: (i) reorganizing documents based on rhetorical roles to assess how structured information affects long context processing and model decisions, (ii) defining rhetorical roles to familiarize the model with legal terminology, and (iii) emulating the step-by-step reasoning of courts regarding rhetorical roles to enhance model reasoning. These experiments are conducted in a zero-shot setting across three Indian legal judgment prediction datasets. Our results reveal that organizing data or explaining key legal terms significantly boosts model performance, with a minimum increase of ~1.5% and a maximum improvement of 4.36% in F1 score compared to the baseline.

[6] MindSET: Advancing Mental Health Benchmarking through Large-Scale Social Media Data

Saad Mankarious,Ayah Zirikly,Daniel Wiechmann,Elma Kerz,Edward Kempa,Yu Qiao

Main category: cs.CL

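TL;DR: This paper presents MindSET, a new benchmark dataset for mental health analysis collected from Reddit, containing over 13 million annotated posts across seven mental health conditions, more than twice the size of previous benchmarks. Dataset quality is verified through rigorous cleaning and filtering (language, NSFW content, and duplicates) and a linguistic analysis with LIWC. In diagnosis detection experiments, models trained on MindSET clearly outperform those trained on earlier benchmarks, with up to an 18-point F1 gain for autism detection, showing its potential for research on social media and mental health.

Details Motivation: Existing benchmark datasets for mental health research are outdated, insufficiently cleaned, and poorly equipped for the diversity of social media (such as multilingual and harmful content), which limits progress and calls for a larger, higher-quality, more representative dataset. Method: Builds a new dataset, MindSET, annotated using self-reported diagnoses from Reddit users; applies rigorous preprocessing to the raw data, including language filtering and removal of Not Safe for Work (NSFW) content and duplicate posts; analyzes psycholinguistic features with LIWC; and evaluates the dataset's utility through binary classification experiments with Bag-of-Words (BoW) features and fine-tuned language models. Result: MindSET contains over 13 million annotated posts, more than twice the size of previous benchmarks; the linguistic analysis reveals differences in language use between groups; in diagnosis detection, models trained on MindSET improve F1 by up to 18 points, most notably for autism detection. Conclusion: MindSET is a high-quality, large-scale benchmark for mental health research that supports early identification of mental health risks from social media and deeper analysis of emerging psychological trends. Abstract: Social media data has become a vital resource for studying mental health, offering real-time insights into thoughts, emotions, and behaviors that traditional methods often miss. Progress in this area has been facilitated by benchmark datasets for mental health analysis; however, most existing benchmarks have become outdated due to limited data availability, inadequate cleaning, and the inherently diverse nature of social media content (e.g., multilingual and harmful material). We present a new benchmark dataset, MindSET, curated from Reddit using self-reported diagnoses to address these limitations. The annotated dataset contains over 13M annotated posts across seven mental health conditions, more than twice the size of previous benchmarks. To ensure data quality, we applied rigorous preprocessing steps, including language filtering, and removal of Not Safe for Work (NSFW) and duplicate content. We further performed a linguistic analysis using LIWC to examine psychological term frequencies across the eight groups represented in the dataset. To demonstrate the dataset utility, we conducted binary classification experiments for diagnosis detection using both fine-tuned language models and Bag-of-Words (BoW) features. Models trained on MindSET consistently outperformed those trained on previous benchmarks, achieving up to an 18-point improvement in F1 for Autism detection. Overall, MindSET provides a robust foundation for researchers exploring the intersection of social media and mental health, supporting both early risk detection and deeper analysis of emerging psychological trends.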

[7] Semantics Meet Signals: Dual Codebook Representation Learning for Generative Recommendation

Zheng Hui,Xiaokai Wei,Reza Shirkavand,Chen Wang,Weizhi Zhang,Alejandro Peláez,Michelle Gong

Main category: cs.CL

TL;DR: Proposes FlexCode, a popularity-aware generative recommendation framework that adaptively allocates the token budget between a collaborative filtering codebook and a semantic codebook, improving recommendation accuracy and long-tail robustness.

Details Motivation: Existing generative recommendation methods encode all items with a single uniform codebook, ignoring how popular and long-tail items differ in their collaborative signals and semantic dependence, which leads to inefficient representations and limited generalization. Method: Designs the FlexCode framework with two specialized codebooks (a collaborative filtering codebook and a semantic codebook), a lightweight MoE mechanism that dynamically allocates the token budget, and alignment and smoothness objectives that keep representations consistent across popularity levels. Result: Experiments on public and industrial-scale datasets show that FlexCode consistently outperforms strong baselines, improving both recommendation accuracy and long-tail performance. Conclusion: FlexCode offers a more efficient token representation mechanism for generative recommenders, balancing memorization and generalization and advancing token-based recommendation models. Abstract: Generative recommendation has recently emerged as a powerful paradigm that unifies retrieval and generation, representing items as discrete semantic tokens and enabling flexible sequence modeling with autoregressive models. Despite its success, existing approaches rely on a single, uniform codebook to encode all items, overlooking the inherent imbalance between popular items rich in collaborative signals and long-tail items that depend on semantic understanding. We argue that this uniform treatment limits representational efficiency and hinders generalization. To address this, we introduce FlexCode, a popularity-aware framework that adaptively allocates a fixed token budget between a collaborative filtering (CF) codebook and a semantic codebook. A lightweight MoE dynamically balances CF-specific precision and semantic generalization, while an alignment and smoothness objective maintains coherence across the popularity spectrum. We perform experiments on both public and industrial-scale datasets, showing that FlexCode consistently outperforms strong baselines. FlexCode provides a new mechanism for token representation in generative recommenders, achieving stronger accuracy and tail robustness, and offering a new perspective on balancing memorization and generalization in token-based recommendation models.

[8] Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic

Saleh Almohaimeed,May Alsofyani,Saad Almohaimeed,Mansour Al Ghanim,Liqiang Wang

Main category: cs.CL

TL;DR: This paper introduces Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset, runs 40 experiments with GPT-3.5 and GPT-4.5 combined with several prompt engineering methods, and proposes a new GAT corrector method that improves performance in both zero-shot and in-context learning settings.

Details Motivation: Text-to-SQL research has focused mainly on English and Chinese, with no support for Arabic, so a cross-domain, context-dependent Arabic dataset is needed to advance the task in this language. Method: Builds the Ar-SParC dataset of 3,450 question sequences (10,225 questions in total); runs experiments with two large models, GPT-3.5-turbo and GPT-4.5-turbo, combined with four question representation methods and six in-context learning prompting techniques; proposes the new GAT corrector method and analyzes its effectiveness through an ablation study. Result: The GAT corrector improves execution accuracy (EX) by 1.9% and interaction accuracy (IX) by 1.9% on average in the zero-shot setting, and EX by 1.72% and IX by 0.92% in the in-context learning setting; experiments show that it outperforms the earlier GAT verifier method. Conclusion: Ar-SParC fills the gap for Arabic in cross-domain, context-dependent text-to-SQL, and the proposed GAT corrector effectively improves the accuracy of generated SQL, particularly for Arabic. Abstract: In recent years, the task of cross-domain, context-dependent text-to-SQL has received significant attention. It enables users with no prior knowledge of SQL to have a conversation with databases using natural language. However, most of the available datasets and research have been conducted in English, along with some work in Chinese. To date, no effort has been made to address this task in the Arabic language. In this paper, we introduce Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset. The dataset consists of 3,450 sequences of interrelated questions, each sequence containing an average of approximately three questions, which results in a total of 10,225 questions along with their corresponding SQL queries. We conducted 40 experiments on the Ar-SParC dataset using two large language models, GPT-3.5-turbo and GPT-4.5-turbo, applying 10 different prompt engineering techniques, including four question representation methods and six in-context learning techniques. Furthermore, we developed a novel approach named GAT corrector, which enhanced the performance across all 40 experiments, yielding an average improvement of 1.9% in execution accuracy (EX) and 1.9% in interaction accuracy (IX) under zero-shot settings, and an average increase of 1.72% EX and 0.92% IX under in-context learning settings. Finally, we conducted an ablation study with two more experiments to explain why the GAT corrector outperformed the previous GAT verifier technique, particularly for the Arabic language.

[9] Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes

Matthew W. Kenaston,Umair Ayub,Mihir Parmar,Muhammad Umair Anjum,Syed Arsalan Ahmed Naqvi,Priya Kumar,Samarth Rawal,Aadel A. Chaudhuri,Yousef Zakharia,Elizabeth I. Heath,Tanios S. Bekaii-Saab,Cui Tao,Eliezer M. Van Allen,Ben Zhou,YooJung Choi,Chitta Baral,Irbaz Bin Riaz

Main category: cs.CL

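TL;DR: This study develops a hierarchical taxonomy for identifying GPT-4's reasoning errors on real oncology notes. It finds reasoning errors in 23% of interpretations, chiefly confirmation bias and anchoring bias; these errors are associated with guideline-discordant and potentially harmful clinical recommendations, indicating that a language model's reasoning can be unsafe even when its output is fluent.

Details Motivation: Although large language models perform well on clinical benchmarks, they may reach correct conclusions through faulty reasoning. This flaw poses safety risks for oncology decision support, and conventional accuracy-based evaluation cannot capture it. Method: Uses a retrospective two-cohort design: annotates 600 GPT-4 chain-of-thought reasoning traces on breast and pancreatic cancer cases from the CORAL dataset to build a three-tier taxonomy of reasoning errors, validates the taxonomy's clinical relevance on 822 responses to prostate cancer consult notes, and assesses how well automated evaluators identify errors. Result: Reasoning errors occurred in 23% of interpretations and were the main source of overall errors, with confirmation bias and anchoring bias most common; these errors were associated with guideline-discordant and potentially harmful recommendations, especially in advanced disease management; state-of-the-art LLM-based automated evaluators detected the presence of errors but could not reliably classify error subtypes. Conclusion: Large language models may produce plausible but clinically unsafe recommendations because of reasoning flaws; the proposed taxonomy provides a generalizable framework for evaluating and improving reasoning fidelity before clinical deployment. Abstract: Despite high performance on clinical benchmarks, large language models may reach correct conclusions through faulty reasoning, a failure mode with safety implications for oncology decision support that is not captured by accuracy-based evaluation. In this two-cohort retrospective study, we developed a hierarchical taxonomy of reasoning errors from GPT-4 chain-of-thought responses to real oncology notes and tested its clinical relevance. Using breast and pancreatic cancer notes from the CORAL dataset, we annotated 600 reasoning traces to define a three-tier taxonomy mapping computational failures to cognitive bias frameworks. We validated the taxonomy on 822 responses from prostate cancer consult notes spanning localized through metastatic disease, simulating extraction, analysis, and clinical recommendation tasks. Reasoning errors occurred in 23 percent of interpretations and dominated overall errors, with confirmation bias and anchoring bias most common. Reasoning failures were associated with guideline-discordant and potentially harmful recommendations, particularly in advanced disease management. Automated evaluators using state-of-the-art language models detected error presence but could not reliably classify subtypes. These findings show that large language models may provide fluent but clinically unsafe recommendations when reasoning is flawed. The taxonomy provides a generalizable framework for evaluating and improving reasoning fidelity before clinical deployment.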

[10] Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

Bharadwaj Yadavalli

Main category: cs.CL

TL;DR: This paper proposes Dynamic Template Selection (DTS), which adaptively matches response templates to query complexity and substantially lowers the output-token cost of large language models without harming answer quality.

Details Motivation: Uniform prompting strategies are inefficient across different question types; in particular, verbose responses to simple questions incur high output-token costs, so a more efficient response generation mechanism is needed. Method: Proposes Dynamic Template Selection (DTS), which uses a model such as an MLP or RoBERTa to route each question by complexity and select a suitable response template; routing accuracy is evaluated on the MMLU dataset, and generality is validated on several major LLM platforms. Result: The MLP router reaches 90.5% routing accuracy on held-out test data, slightly above RoBERTa's 89.5%; experiments across three major LLMs (GPT-4, Gemini, Claude) show that DTS generalizes well, with token consumption reduced by 32.6% to 33.9%. Conclusion: DTS enables on-demand response generation, substantially cutting cost while preserving answer quality; it has both theoretical grounding and practical value and suits multi-platform deployment. Abstract: Contemporary large language model deployments typically employ uniform prompting strategies across diverse query types, applying verbose response patterns to both complex analytical tasks and straightforward factual questions. This one-size-fits-all methodology leads to substantial token inefficiency, a concern amplified by the significant cost differential between input and output tokens, with the latter commanding 4-8x higher prices across major providers. We present Dynamic Template Selection (DTS), which adaptively matches response templates to query complexity, achieving significant cost reductions without compromising response quality. We compared two routing approaches: a simple MLP that uses pre-computed embeddings and a more complex fine-tuned RoBERTa transformer. Through comprehensive evaluation on 1,000 MMLU questions, we find that the MLP router achieves 90.5% routing accuracy on held-out test data, marginally exceeding RoBERTa's performance (89.5%) despite utilizing 125M fewer parameters. Notably, our empirical analysis reveals provider-agnostic behavior in template selection: routing decisions generalize effectively across 3 major LLM providers (OpenAI GPT-4, Google Gemini, and Anthropic Claude), as validated through 9,000 production API calls. While routing accuracy remains consistent at 90.5% across providers, observed token reductions vary from 32.6% to 33.9%, reflecting provider-specific generation characteristics. This work contributes several key elements: formal problem formulation with theoretical grounding in machine learning, four algorithms with corresponding complexity analyses, and extensive empirical validation across production systems.
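
A small arithmetic sketch shows how an output-token reduction translates into request cost when output tokens are priced above input tokens. The 4-8x price ratio and the 32.6% reduction are taken from the abstract; the request size (200 input tokens, 500 output tokens), the unit price, and the function names are hypothetical and chosen only for illustration.

```python
def request_cost(input_tokens, output_tokens, output_price_ratio, input_price=1.0):
    """Cost of one request when an output token costs `output_price_ratio`
    times as much as an input token."""
    return input_price * (input_tokens + output_price_ratio * output_tokens)


def cost_saving_fraction(input_tokens, output_tokens, output_price_ratio, output_reduction):
    """Fraction of the request cost saved when the number of output tokens
    shrinks by `output_reduction` (for example 0.326 for 32.6%)."""
    baseline = request_cost(input_tokens, output_tokens, output_price_ratio)
    reduced = request_cost(
        input_tokens, output_tokens * (1.0 - output_reduction), output_price_ratio
    )
    return (baseline - reduced) / baseline


# Hypothetical request: 200 input tokens, 500 output tokens.
saving_at_4x = cost_saving_fraction(200, 500, output_price_ratio=4, output_reduction=0.326)
saving_at_8x = cost_saving_fraction(200, 500, output_price_ratio=8, output_reduction=0.326)
print(f"saving at 4x: {saving_at_4x:.1%}, saving at 8x: {saving_at_8x:.1%}")
```

Under these assumed numbers the cost saving stays slightly below the token reduction because input tokens are unaffected, and it approaches the token reduction as the output price ratio grows.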

[11] LLMs-Powered Accurate Extraction, Querying and Intelligent Management of Literature derived 2D Materials Data

Lijun Shang,Yadong Yu,Wenqiang Kang,Jian Zhou,Dongyue Gao,Pan Xiang,Zhe Liu,Mengyan Dai,Zhonglu Guo,Zhimei Sun

Main category: cs.CL

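TL;DR: This paper reviews the applications of two-dimensional materials in energy storage and conversion, stressing the importance of extracting key information from published papers.

Details Motivation: Because the research literature on 2D materials is scattered, key information such as properties and preparation methods is hard to obtain efficiently, so existing results need systematic analysis. Method: Synthesizes published research papers to summarize the physicochemical and electronic properties of 2D materials and their preparation methods. Result: Summarizes important applications of 2D materials in the energy field and reveals the relationship between their performance and structure. Conclusion: Systematically organizing the literature helps accelerate the development and application of 2D materials. Abstract: Two-dimensional (2D) materials have shown widespread applications in energy storage and conversion owing to their unique physicochemical and electronic properties. Most of the valuable information for the materials, such as their properties and preparation methods, is included in the published research papers. However, due to the dispersion of synthe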

[12] Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

Trung Cuong Dang,David Mohaisen

Main category: cs.CL

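TL;DR: This paper proposes a new framework, multi-prefix memorization, for detecting memorization of training data in large language models. It judges whether a sequence is memorized by the number of distinct prefixes that can trigger its generation, which is more comprehensive and robust than conventional approaches.

Details Motivation: Existing definitions of memorization often fail to capture memorization in aligned models, and most rest on a single extraction path without quantifying memory strength. A more comprehensive and practical definition is needed to assess data leakage risk in large models. Method: Defines multi-prefix memorization: a sequence counts as memorized if an external adversarial search can find enough distinct prefixes that successfully extract it. Systematic experiments measure the number of extractable prefixes for different sequences across several open-source and aligned chat models to validate the framework. Result: Truly memorized sequences can be extracted through significantly more distinct prefixes, whereas non-memorized content is hard to reproduce; the method separates memorized from non-memorized content across a range of models. Conclusion: Multi-prefix memorization provides a more robust, quantifiable framework for assessing memorization and a practical tool for auditing data leakage in large language models. Abstract: Large language models, trained on massive corpora, are prone to verbatim memorization of training data, creating significant privacy and copyright risks. While previous works have proposed various definitions for memorization, many exhibit shortcomings in comprehensively capturing this phenomenon, especially in aligned models. To address this, we introduce a novel framework: multi-prefix memorization. Our core insight is that memorized sequences are deeply encoded and thus retrievable via a significantly larger number of distinct prefixes than non-memorized content. We formalize this by defining a sequence as memorized if an external adversarial search can identify a target count of distinct prefixes that elicit it. This framework shifts the focus from single-path extraction to quantifying the robustness of a memory, measured by the diversity of its retrieval paths. Through experiments on open-source and aligned chat models, we demonstrate that our multi-prefix definition reliably distinguishes memorized from non-memorized data, providing a robust and practical tool for auditing data leakage in LLMs.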
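
The definition in the abstract (a sequence is memorized if a target count of distinct prefixes elicit it) can be written down as a short sketch. Everything concrete here is an assumption for illustration: the adversarial search is abstracted into a precomputed list of candidate prefixes, the model is a toy lookup table standing in for an LLM, and the function names and the "output starts with the target" check are not taken from the paper.

```python
def count_eliciting_prefixes(generate, candidate_prefixes, target):
    """Count distinct candidate prefixes whose continuation starts with `target`.

    generate: callable mapping a prefix string to the model's continuation.
    candidate_prefixes: prefixes proposed by some external search (assumed given).
    """
    eliciting = set()
    for prefix in candidate_prefixes:
        if prefix not in eliciting and generate(prefix).startswith(target):
            eliciting.add(prefix)
    return len(eliciting)


def is_memorized(generate, candidate_prefixes, target, min_prefixes):
    """Multi-prefix criterion: memorized if at least `min_prefixes` distinct
    prefixes elicit the target sequence."""
    return count_eliciting_prefixes(generate, candidate_prefixes, target) >= min_prefixes


# Toy stand-in for a language model: a fixed prefix -> continuation table.
toy_continuations = {
    "The quick brown": " fox jumps",
    "A fast brown": " fox jumps",
    "Hello": " world",
}


def toy_generate(prefix):
    return toy_continuations.get(prefix, "")


candidates = ["The quick brown", "A fast brown", "Hello", "The quick brown"]
n_paths = count_eliciting_prefixes(toy_generate, candidates, " fox jumps")
print(n_paths, is_memorized(toy_generate, candidates, " fox jumps", min_prefixes=2))
```

The duplicate candidate is counted once, so the count reflects distinct retrieval paths rather than the number of attempts.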

[13] SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

Jiaojiao Han,Wujiang Xu,Mingyu Jin,Mengnan Du

Main category: cs.CL

TL;DR: This paper proposes SAGE, an agent-based framework for explaining sparse autoencoder features, which uses an active, iterative explanation process to markedly improve the accuracy and interpretability of explanations of LLM-internal features.

Details Motivation: The internal mechanisms of large language models are opaque; sparse autoencoders help decompose representations, but the features they extract remain hard to explain, so more effective explanation methods are needed. Method: Proposes the SAGE framework, which turns feature explanation into an active, agent-driven process: it systematically generates multiple hypotheses, designs targeted experiments to test them, and iteratively refines explanations based on activation feedback. Result: Experiments on SAE features from a variety of language models show that SAGE significantly outperforms state-of-the-art baselines in generative and predictive accuracy. Conclusion: Through active reasoning and empirical feedback, SAGE improves the explanation of SAE features in LLMs and offers a new paradigm for interpretability research. Abstract: Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE AGentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.

[14] Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

Asad Aali,Muhammad Ahmed Mohsin,Vasiliki Bikia,Arnav Singhvi,Richard Gaus,Suhana Bedi,Hejie Cui,Miguel Fuentes,Alyssa Unell,Yifan Mai,Jordan Cahoon,Michael Pfeffer,Roxana Daneshjou,Sanmi Koyejo,Emily Alsentzer,Percy Liang,Christopher Potts,Nigam H. Shah,Akshay S. Chaudhari

Main category: cs.CL

TL;DR: This paper presents a reproducible framework combining DSPy and HELM that uses structured prompting (in particular, eliciting chain-of-thought reasoning) to evaluate large language models more accurately. It finds that conventional fixed prompts underestimate model performance and distort rankings, and that structured prompting makes evaluation more stable and more useful for decisions.

Details Motivation: Existing evaluation frameworks such as HELM rely on fixed prompts that do not generalize across models, producing inaccurate performance estimates; without an estimate of each model's performance ceiling, deployment decisions may be misled. Method: Builds a reproducible DSPy+HELM integration and uses four structured prompting methods to evaluate four frontier LLMs on seven general- and medical-domain benchmarks, comparing against existing HELM baseline scores. Result: Without structured prompting: (i) HELM underestimates performance by 4% on average; (ii) performance varies more across benchmarks (+2% standard deviation); (iii) leaderboard rankings flip on 3 of 7 benchmarks; in addition, (iv) introducing chain-of-thought reasoning reduces models' sensitivity to prompt design. Conclusion: Structured prompting, especially optimizable chain-of-thought, estimates the performance ceiling of language models more accurately and makes evaluation more stable and decision-useful, an important step toward more reliable benchmarking. Abstract: As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. While frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, they often rely on fixed prompts that fail to generalize across LMs, yielding unrepresentative performance estimates. Unless we estimate each LM's ceiling (maximum achievable via changes to the prompt), we risk underestimating performance. Declarative prompting frameworks, such as DSPy, offer a scalable alternative to manual prompt engineering by crafting structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated across established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs across seven benchmarks (general/medical domain) against existing HELM baseline scores. We find that without structured prompting: (i) HELM underestimates LM performance (by 4% average), (ii) performance estimates vary more across benchmarks (+2% standard deviation), (iii) performance gaps are misrepresented (leaderboard rankings flip on 3/7 benchmarks), and (iv) introducing reasoning (chain-of-thought) reduces LM sensitivity to prompt design (smaller Δ across prompts). To our knowledge, this is the first large-scale benchmarking study to empirically characterize LM behavior across benchmarks and prompting methods, showing that scalable performance ceiling estimation enables more decision-useful benchmarks. We open-source (i) DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).

[15] Length-MAX Tokenizer for Language Models

Dong Dong,Weijie Su

Main category: cs.CL

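TL;DR: This paper proposes Length-MAX, a new tokenizer for language models that minimizes the average number of tokens per character to reduce the tokens needed in training and inference. The method casts length-weighted objective maximization as a graph partitioning problem and designs a greedy approximation algorithm. Experiments show that, compared with BPE, Length-MAX substantially reduces token count, training steps, and inference latency across vocabulary sizes while improving downstream task performance.

Details Motivation: Conventional tokenization methods such as Byte Pair Encoding (BPE) merge symbols mainly by frequency and ignore the length efficiency of the resulting text. This can produce redundant token representations and raise computational cost. The paper aims to improve training and inference efficiency by optimizing average token length rather than frequency alone. Method: Proposes the Length-MAX tokenizer, which turns the goal of minimizing average tokens per character into a graph partitioning problem solved with a greedy approximation algorithm. Vocabularies are built on FineWeb and other multi-domain data, and efficiency and performance are compared with BPE at different vocabulary sizes. Result: Across vocabulary sizes from 10K to 64K, Length-MAX uses 13%-18% fewer tokens than BPE; training GPT-2 models takes 17.2%-18.5% fewer steps, with 12.7%-13.7% lower inference latency and a 16% throughput gain, along with better results on downstream tasks such as LAMBADA and HellaSwag; vocabulary coverage reaches 99.62% and the out-of-vocabulary rate is as low as 0.12%. Conclusion: Optimizing average token length is an effective way to make language models more efficient, reducing compute without sacrificing, and sometimes improving, downstream performance. The Length-MAX tokenizer is ready for practical deployment and can substantially reduce embedding and KV-cache memory. Abstract: We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14-18% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5%, 17.2%, and 18.5% fewer steps, respectively, to reach a fixed validation loss, and 13.7%, 12.7%, and 13.7% lower inference latency, together with a 16% throughput gain at 124M, while consistently improving on downstream tasks including reducing LAMBADA perplexity by 11.7% and enhancing HellaSwag accuracy by 4.3%. Moreover, the Length-MAX tokenizer achieves 99.62% vocabulary coverage and the out-of-vocabulary rate remains low at 0.12% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing -- and often improving -- downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18% at inference.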
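
The quantity the tokenizer targets, average tokens per character, is simple to compute for any tokenization. The sketch below measures it for two hand-written segmentations of the same text and reports the relative token reduction. The segmentations, the sample text, and the function names are made up for illustration; the code does not implement the Length-MAX vocabulary construction or BPE.

```python
def tokens_per_character(token_sequences, texts):
    """Average number of tokens per character over a small corpus."""
    total_tokens = sum(len(tokens) for tokens in token_sequences)
    total_chars = sum(len(text) for text in texts)
    return total_tokens / total_chars


def relative_token_reduction(baseline_tokens, new_tokens):
    """Fraction of tokens saved relative to the baseline tokenization."""
    return (baseline_tokens - new_tokens) / baseline_tokens


text = "low lower lowest"

# Two hypothetical segmentations of the same text.
short_token_split = ["low", " low", "er", " low", "est"]
long_token_split = ["low", " lower", " lowest"]

# Both segmentations must reproduce the original text exactly.
assert "".join(short_token_split) == text
assert "".join(long_token_split) == text

tpc_short = tokens_per_character([short_token_split], [text])
tpc_long = tokens_per_character([long_token_split], [text])
reduction = relative_token_reduction(len(short_token_split), len(long_token_split))
print(tpc_short, tpc_long, reduction)
```

A lower tokens-per-character value means fewer tokens for the same text, which is the mechanism behind the reported reductions in training steps, latency, and KV-cache memory.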

[16] Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Tianxin Wei,Noveen Sachdeva,Benjamin Coleman,Zhankui He,Yuanchen Bei,Xuying Ning,Mengting Ai,Yunzhe Li,Jingrui He,Ed H. Chi,Chi Wang,Shuo Chen,Fernando Pereira,Wang-Cheng Kang,Derek Zhiyuan Cheng

Main category: cs.CL

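TL;DR: This paper presents Evo-Memory, a benchmark and framework for evaluating self-evolving memory in LLM agents over continuous task streams. It emphasizes the dynamic accumulation and reuse of memory and introduces the ExpRAG and ReMem methods to improve experience use and continual learning.

Details Motivation: Existing memory evaluations focus mainly on static conversational settings and overlook the practical need for LLM agents to learn continually and evolve their memory over dynamic task streams, leaving models unable to accumulate and reuse experience effectively. Method: Builds the Evo-Memory streaming benchmark, which organizes datasets into sequential task streams and requires models to search, adapt, and update memory after each interaction; implements more than ten representative memory modules in a unified way and evaluates them on 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets; proposes ExpRAG, a baseline for retrieving and using prior experience, and ReMem, an action-think-memory refine pipeline that integrates reasoning, actions, and memory updates. Result: Experiments show that existing memory modules perform poorly on dynamic task streams, whereas the proposed ReMem promotes memory evolution and experience reuse and significantly improves performance on long-horizon tasks. Conclusion: Evo-Memory fills a gap in evaluating memory evolution for LLM agents in realistic dynamic environments, confirms the importance of continual memory updates, and points to a new direction for building agents capable of long-term learning. Abstract: Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights. This limitation calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.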

[17] Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation

Ali Jahan,Masood Ghayoomi,Annette Hautli-Janisz

Main category: cs.CL

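TL;DR: This paper studies cross-lingual approaches to argument mining for low-resource languages such as Persian, using three training scenarios: zero-shot transfer, English training augmented with LLM-generated data, and a cross-lingual model that combines English and Persian data. The cross-lingual model performs best on the Persian test set (74.8% F1), outperforming the other approaches and showing that it is an effective way to address data scarcity in low-resource languages.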

Details Motivation: Low-resource languages lack annotated data, so conventional argument mining methods are hard to apply and new methods that overcome data scarcity are needed. Method: Three training scenarios are used: (i) zero-shot transfer (training on English data only); (ii) English training augmented with synthetic examples generated by large language models; (iii) a cross-lingual model that combines the original English data with manually translated Persian sentences. Evaluation uses the English Microtext corpus and its parallel Persian translation. Result: The zero-shot transfer model reaches F1 scores of 50.2% on the English test set and 50.7% on the Persian test set; the LLM-augmented model raises performance to 59.2% on English and 69.3% on Persian; the cross-lingual model reaches 74.8% F1 on the Persian test set, the best result. Conclusion: A lightweight cross-lingual blend can clearly outperform more resource-intensive data augmentation pipelines, offering a practical and efficient solution for argument mining in low-resource languages. Abstract: Argument mining is a subfield of natural language processing to identify and extract the argument components, like premises and conclusions, within a text and to recognize the relations between them. It reveals the logical structure of texts to be used in tasks like knowledge extraction. This paper aims at utilizing a cross-lingual approach to argument mining for low-resource languages, by constructing three training scenarios. We examine the models on English, as a high-resource language, and Persian, as a low-resource language. To this end, we evaluate the models based on the English Microtext corpus (Peldszus and Stede, 2015), and its parallel Persian translation. The learning scenarios are as follows: (i) zero-shot transfer, where the model is trained solely with the English data, (ii) English-only training enhanced by synthetic examples generated by Large Language Models (LLMs), and (iii) a cross-lingual model that combines the original English data with manually translated Persian sentences. The zero-shot transfer model attains F1 scores of 50.2% on the English test set and 50.7% on the Persian test set. The LLM-based augmentation model improves the performance up to 59.2% on English and 69.3% on Persian. The cross-lingual model, trained on both languages but evaluated solely on the Persian test set, surpasses the LLM-based variant by achieving an F1 of 74.8%. Results indicate that a lightweight cross-lingual blend can considerably outperform the more resource-intensive augmentation pipelines, and it offers a practical pathway for the argument mining task to overcome data resource shortage on low-resource languages.

[18] Emergence and Localisation of Semantic Role Circuits in LLMs

Nura Aljaafari,Danilo S. Carvalho,André Freitas

Main category: cs.CL

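TL;DR: This paper proposes a method that combines role-cross minimal pairs, temporal emergence analysis, and cross-model comparison, and uses it to reveal highly concentrated circuits for semantic roles in large language models, along with their partial conservation across scales and architectures.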

Details Motivation: Although large language models display semantic competence, how they internally implement abstract semantic structure remains unclear, so a systematic method is needed to characterize these mechanisms. Method: Introduces an integrated approach combining role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to analyze how LLMs implement semantic roles. Result: Finds highly concentrated circuits (28 nodes account for 89-94% of attribution), gradual structural refinement rather than abrupt transitions, and moderate cross-scale conservation (24-59% component overlap) alongside high spectral similarity. Conclusion: LLMs form compact, causally isolated mechanisms for handling abstract semantic structure, and these mechanisms transfer partially across scales and architectures. Abstract: Despite displaying semantic competence, large language models' internal mechanisms that ground abstract semantic structure remain insufficiently characterised. We propose a method integrating role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study how LLMs implement semantic roles. Our analysis uncovers: (i) highly concentrated circuits (89-94% attribution within 28 nodes); (ii) gradual structural refinement rather than phase transitions, with larger models sometimes bypassing localised circuits; and (iii) moderate cross-scale conservation (24-59% component overlap) alongside high spectral similarity. These findings suggest that LLMs form compact, causally isolated mechanisms for abstract semantic structure, and these mechanisms exhibit partial transfer across scales and architectures.
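To make the two headline statistics concrete, the sketch below shows one plausible way to compute "share of attribution within the top-k nodes" and "component overlap" between two circuits. The node names and scores are invented, and treating overlap as Jaccard similarity is an assumption made here; the abstract does not state the exact definition.

```python
def top_k_share(attributions, k):
    """Fraction of total absolute attribution carried by the k largest nodes."""
    magnitudes = sorted((abs(v) for v in attributions.values()), reverse=True)
    total = sum(magnitudes)
    return sum(magnitudes[:k]) / total if total else 0.0


def component_overlap(circuit_a, circuit_b):
    """Jaccard overlap between two sets of circuit components (assumed definition)."""
    a, b = set(circuit_a), set(circuit_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical per-node attribution scores for one model.
scores = {"L3.H1": 0.5, "L5.H2": 0.3, "L7.MLP": 0.15, "L9.H4": 0.05}
share = top_k_share(scores, k=2)

# Hypothetical circuits from a small and a large model.
overlap = component_overlap({"L3.H1", "L5.H2", "L7.MLP"}, {"L5.H2", "L7.MLP", "L9.H4"})
```

Under this reading, the reported 89-94% figure would correspond to `top_k_share(scores, k=28)` on real attribution scores.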

[19] Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs

Reham Omar,Abdelghny Orogat,Ibrahim Abdelaziz,Omij Mangukiya,Panos Kalnis,Essam Mansour

Main category: cs.CL

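TL;DR: Chatty-KG is a modular multi-agent system that combines retrieval-augmented generation with structured execution to deliver efficient and accurate multi-turn question answering over knowledge graphs.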

Details Motivation: Existing methods for multi-turn dialogue struggle with context tracking, lose structural information, incur high latency, and adapt poorly to dynamic knowledge graphs; an approach is needed that keeps dialogue coherent while accessing structured knowledge efficiently. Method: Proposes Chatty-KG, in which task-specialized LLM agents cooperate to translate natural language into executable queries by generating SPARQL, combining RAG-style retrieval with structured execution and supporting entity linking, relation matching, and dialogue state tracking. Result: Experiments on several large and diverse knowledge graphs show that Chatty-KG significantly outperforms state-of-the-art baselines in both single-turn and multi-turn settings, with higher F1 and P@1 scores, low latency, and good LLM compatibility. Conclusion: Chatty-KG combines conversational flexibility with the structural strengths of knowledge graphs, providing a scalable and extensible multi-turn KGQA solution that adapts to evolving knowledge graphs without fine-tuning. Abstract: Conversational Question Answering over Knowledge Graphs (KGs) combines the factual grounding of KG-based QA with the interactive nature of dialogue systems. KGs are widely used in enterprise and domain applications to provide structured, evolving, and reliable knowledge. Large language models (LLMs) enable natural and context-aware conversations, but lack direct access to private and dynamic KGs. Retrieval-augmented generation (RAG) systems can retrieve graph content but often serialize structure, struggle with multi-turn context, and require heavy indexing. Traditional KGQA systems preserve structure but typically support only single-turn QA, incur high latency, and struggle with coreference and context tracking. To address these limitations, we propose Chatty-KG, a modular multi-agent system for conversational QA over KGs. Chatty-KG combines RAG-style retrieval with structured execution by generating SPARQL queries through task-specialized LLM agents. These agents collaborate for contextual interpretation, dialogue tracking, entity and relation linking, and efficient query planning, enabling accurate and low-latency translation of natural questions into executable queries. Experiments on large and diverse KGs show that Chatty-KG significantly outperforms state-of-the-art baselines in both single-turn and multi-turn settings, achieving higher F1 and P@1 scores. Its modular design preserves dialogue coherence and supports evolving KGs without fine-tuning or pre-processing. Evaluations with commercial (e.g., GPT-4o, Gemini-2.0) and open-weight (e.g., Phi-4, Gemma 3) LLMs confirm broad compatibility and stable performance. Overall, Chatty-KG unifies conversational flexibility with structured KG grounding, offering a scalable and extensible approach for reliable multi-turn KGQA.

[20] TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models

Ioana Buhnila,Aman Sinha,Mathieu Constant

Main category: cs.CL

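TL;DR: Using the TrackList analysis pipeline and the newly built RefoMed-EN dataset, this study evaluates LLMs on different types of medical queries. Models perform best on definition-type questions and worst on exemplification-type questions, and they tend to paraphrase high-frequency knowledge more than low-frequency specialist content.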

Details Motivation: To investigate why LLM performance drops on linguistic query types other than definitions (such as exemplifications and paraphrases), and to analyze how pre-training data shapes model outputs. Method: Proposes the TrackList analysis pipeline and builds RefoMed-EN, an English dataset of 6170 annotated medical terms; uses syntactic and semantic similarity metrics, statistical correlations, and embeddings to assess how performance differs between head (high-frequency) and tail (low-frequency) concepts. Result: LLMs perform best on definition-type answers and worst on exemplification-type answers; models paraphrase more on frequent, popular knowledge and less on low-frequency, technical knowledge. Conclusion: LLMs are limited in handling diverse linguistic tasks, especially in expressing low-frequency and specialist knowledge, which reflects the strong influence of the pre-training data distribution on model outputs. Abstract: Large Language Models (LLMs) have proven efficient in giving definition-type answers to user input queries. While for humans giving various types of answers, such as examples and paraphrases, is an easy task, LLMs struggle to provide correct answers for queries other than definition-type ones. In this study, we evaluated this drop in performance using TrackList, a fine-grained linguistic and statistical analysis pipeline to investigate the impact of the pre-training data on LLMs' answers to diverse linguistic queries. We also introduce RefoMed-EN, an English dataset consisting of 6170 human-annotated medical terms alongside their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We studied whether the high frequency of a concept (head) or low frequency (tail) impacts the language model's performance. We evaluated the quality of the LLM's output using syntactic and semantic similarity metrics, statistical correlations and embeddings. Results showed that the LLM's task performance for definition-type questions is the highest, while for the exemplification type it is the lowest. Additionally, we showed that for definition-type questions, large language models are prone to paraphrase more on popular and frequent knowledge and less on tail and technical knowledge, especially in the expert texts.

[21] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels

Anantha Padmanaban Krishna Kumar

Main category: cs.CL

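TL;DR: This paper asks whether in-context learning (ICL) can override the label semantics of a pre-trained model or only refines them. Treating LLMs as prompt-induced classifiers and comparing their behavior under "natural" and "inverted" demonstrations, the author puts forward a semantic anchor view: ICL relies mainly on stable semantic directions formed during pre-training and cannot truly flip label meanings. Experiments across eight classification tasks and eight open-source LLMs show that models fail to achieve semantic override, supporting the existence of semantic anchors.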

Details Motivation: To determine whether ICL genuinely changes a model's understanding of label semantics or merely adjusts on top of pre-trained semantic priors, and thereby to expose fundamental limits of few-shot prompting. Method: Treats LLMs as prompt-induced classifiers and contrasts their behavior under natural and inverted demonstrations; introduces three alignment metrics (truth, prior, and prompt alignment) and a semantic override rate to quantify ICL behavior. Result: With natural demonstrations, ICL improves accuracy while keeping prior alignment high, and most correct predictions coincide with zero-shot behavior. With inverted demonstrations, models cannot build a coherent anti-semantic classifier; gains in prompt alignment come at the cost of accuracy, and the semantic override rate is zero. Conclusion: ICL does not flexibly remap label semantics; it mainly adjusts how inputs project onto stable semantic directions learned during pre-training. At the current scale, ICL alone cannot override pre-trained semantics, and deeper interventions are needed to change semantic representations. Abstract: Can in-context learning (ICL) override pre-trained label semantics, or does it merely refine an existing semantic backbone? We address this question by treating LLMs as prompt-induced classifiers and contrasting their behavior under natural demonstrations (with correct labels) and inverted demonstrations (systematically flipping label meanings). We decompose ICL behavior into three alignment metrics (truth, prior, and prompt alignment) and introduce a semantic override rate, defined as correctness under flipped semantics. Across eight classification tasks and eight open-source LLMs (1-12B parameters), we find consistent evidence for a semantic anchor view. With natural demonstrations, ICL improves accuracy while maintaining strong prior alignment; most correct predictions coincide with zero-shot behavior, even when the prior is weak. With inverted demonstrations, models cannot learn coherent anti-semantic classifiers: prompt alignment increases only by sacrificing accuracy, and semantic override rates remain exactly zero in our few-shot 1-12B setting. Rather than flexibly remapping label meanings, ICL primarily adjusts how inputs project onto stable semantic directions learned during pre-training, clarifying fundamental limits of few-shot prompting and suggesting that overriding label semantics at these scales requires interventions beyond ICL. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/semantic-anchors-icl.
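One way to read the three alignment metrics is as agreement rates between the ICL prediction and, respectively, the gold label, the zero-shot prediction, and the label implied by the demonstrations. The sketch below follows that reading; it is an interpretation of the abstract, not the author's code, and the function names and toy predictions are hypothetical. The semantic override rate is deliberately left out because the abstract does not give enough detail to reproduce it.

```python
def agreement(xs, ys):
    """Fraction of positions where two label sequences agree."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)


def icl_alignment(icl_preds, zero_shot_preds, gold, flip=None):
    """Truth, prior, and prompt alignment under natural (flip=None) or inverted demonstrations."""
    # Under inverted demonstrations the prompt asserts the flipped label.
    prompt_labels = [flip[g] for g in gold] if flip else list(gold)
    return {
        "truth_alignment": agreement(icl_preds, gold),
        "prior_alignment": agreement(icl_preds, zero_shot_preds),
        "prompt_alignment": agreement(icl_preds, prompt_labels),
    }


gold = ["pos", "neg", "pos", "neg"]
zero_shot = ["pos", "neg", "neg", "neg"]

natural = icl_alignment(["pos", "neg", "pos", "neg"], zero_shot, gold)
inverted = icl_alignment(
    ["pos", "pos", "neg", "neg"], zero_shot, gold, flip={"pos": "neg", "neg": "pos"}
)
```

In a binary task with fully flipped labels, every prediction matches either the gold label or its flip, so truth and prompt alignment sum to one in this toy setup. That is consistent with the reported pattern that prompt alignment rises only as accuracy falls.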

[22] Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection

Michael Iskandardinata,William Christian,Derwin Suhartono

Main category: cs.CL

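TL;DR: This paper proposes a retrieval-aware sarcasm detection method that combines external retrieval with the model's own knowledge to strengthen contextual understanding, substantially improving LLM-based sarcasm detection on several datasets.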

Details Motivation: Pre-trained language models and LLMs still struggle in sarcasm detection with words or culture-specific expressions that need extra grounding, so contextual information is needed to improve detection. Method: Building on Pragmatic Metacognitive Prompting (PMP), proposes two context-enrichment strategies: adding non-parametric knowledge through web retrieval, and eliciting the model's own internal knowledge for self-knowledge awareness. Result: On Twitter Indonesia Sarcastic, non-parametric retrieval improves macro-F1 by 9.87%; on SemEval-2018 and MUStARD, self-knowledge retrieval improves macro-F1 by 3.29% and 4.08% respectively. Conclusion: Context is crucial for LLM performance in sarcasm detection, especially for culture-specific slang or terms unknown to the model, and combining retrieval with self-knowledge is an effective strategy. Abstract: Detecting sarcasm remains a challenging task in the areas of Natural Language Processing (NLP) despite recent advances in neural network approaches. Currently, Pre-trained Language Models (PLMs) and Large Language Models (LLMs) are the preferred approach for sarcasm detection. However, the complexity of sarcastic text, combined with linguistic diversity and cultural variation across communities, has made the task more difficult even for PLMs and LLMs. Beyond that, those models also exhibit unreliable detection of words or tokens that require extra grounding for analysis. Building on a state-of-the-art prompting method in LLMs for sarcasm detection called Pragmatic Metacognitive Prompting (PMP), we introduce a retrieval-aware approach that incorporates retrieved contextual information for each target text. Our pipeline explores two complementary ways to provide context: adding non-parametric knowledge using web-based retrieval when the model lacks necessary background, and eliciting the model's own internal knowledge for a self-knowledge awareness strategy. We evaluated our approach with three datasets: Twitter Indonesia Sarcastic, SemEval-2018 Task 3, and MUStARD. Non-parametric retrieval resulted in a significant 9.87% macro-F1 improvement on Twitter Indonesia Sarcastic compared to the original PMP method. Self-knowledge retrieval improves macro-F1 by 3.29% on SemEval and by 4.08% on MUStARD. These findings highlight the importance of context in enhancing LLM performance on the sarcasm detection task, particularly when culturally specific slang, references, or terms unknown to the LLMs are involved. Future work will focus on optimizing the retrieval of relevant contextual information and examining how retrieval quality affects performance. The experiment code is available at: https://github.com/wllchrst/sarcasm-detection_pmp_knowledge-base.

[23] Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

Thura Aung,Eaint Kay Khaing Kyaw,Ye Kyaw Thu,Thazin Myint Oo,Thepchai Supnithi

Main category: cs.CL

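TL;DR: This study explores Kolmogorov-Arnold Networks (KANs) as a replacement for multi-layer perceptron (MLP) classification heads in classification tasks for low-resource languages such as Burmese. Experiments show that KANs are competitive with, or better than, MLPs across a range of embeddings.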

Details Motivation: In low-resource languages, conventional MLP classification heads limit model expressiveness because of their fixed non-linearity and high computational cost, so a more efficient and flexible alternative is needed. Method: Uses three KAN variants (FourierKAN, EfficientKAN, FasterKAN) with embeddings including TF-IDF, fastText, and multilingual Transformers (mBERT, Distil-mBERT), running text classification experiments in which only the classification head is fine-tuned. Result: EfficientKAN with fastText achieves the highest F1 score (0.928); FasterKAN offers the best balance between speed and accuracy; on Transformer embeddings, EfficientKAN matches or slightly outperforms MLP (F1 of 0.917 with mBERT). Conclusion: KAN-based classification heads are a more expressive and efficient alternative to MLPs and are well suited to classification in low-resource languages. Abstract: In low-resource languages like Burmese, classification tasks often fine-tune only the final classification layer, keeping pre-trained encoder weights frozen. While Multi-Layer Perceptrons (MLPs) are commonly used, their fixed non-linearity can limit expressiveness and increase computational cost. This work explores Kolmogorov-Arnold Networks (KANs) as alternative classification heads, evaluating Fourier-based FourierKAN, Spline-based EfficientKAN, and Grid-based FasterKAN across diverse embeddings including TF-IDF, fastText, and multilingual transformers (mBERT, Distil-mBERT). Experimental results show that KAN-based heads are competitive with or superior to MLPs. EfficientKAN with fastText achieved the highest F1-score (0.928), while FasterKAN offered the best trade-off between speed and accuracy. On transformer embeddings, EfficientKAN matched or slightly outperformed MLPs with mBERT (0.917 F1). These findings highlight KANs as expressive, efficient alternatives to MLPs for low-resource language classification.

[24] Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

Bryan E. Tuck,Rakesh M. Verma

Main category: cs.CL

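TL;DR: This study evaluates 28 LLM configurations on 58 word puzzles that require character-level constraint satisfaction. Architectural differences affect performance far more than parameter scale, and models perform poorly on common words with unusual spelling, revealing an over-reliance on statistical regularities and a weak ability to handle orthographic constraints.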

Details Motivation: To examine how well LLMs satisfy hard orthographic constraints in controlled text generation, and to analyze systematically how performance differs across architectures and scales. Method: Tests 28 configurations spanning three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles, reporting F1 scores and using human solver data for difficulty calibration and error analysis. Result: The performance gap due to architecture (F1: 0.761 vs. 0.343) far exceeds the improvement from eightfold parameter scaling (83% gain); high-capacity models improve with a larger thinking budget, while mid-sized models saturate or degrade; models miss common but unusually spelled words such as "data", "poop", and "loll" 89-96% of the time, whereas humans succeed 86-95% of the time. Conclusion: Character-level constraint satisfaction in controlled text generation depends on more than scaling model size or compute; specialized architectural design or training objectives are needed to improve the handling of orthographic regularities. Abstract: Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-architecture evaluation remains limited. We evaluate 28 configurations spanning three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Architectural differences produce substantially larger performance gaps (2.0-2.2x, F1=0.761 vs. 0.343) than parameter scaling within families (83% gain from eightfold scaling), suggesting that constraint satisfaction may require specialized architectural features or training objectives beyond standard language model scaling. Thinking budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade. These patterns are inconsistent with uniform compute benefits. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but consistent calibration (r=0.24-0.38) across all families, yet identify systematic failures on common words with unusual orthography ("data", "poop", "loll": 86-95% human success, 89-96% model miss rate). These failures reveal over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns, suggesting architectural innovations may be required beyond simply scaling parameters or computational budgets.

[25] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

Ye Bhone Lin,Thura Aung,Ye Kyaw Thu,Thazin Myint Oo

Main category: cs.CL

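TL;DR: This paper studies automatic speech recognition (ASR) error correction for low-resource Burmese using sequence-to-sequence Transformer models, and proposes a feature integration strategy that combines IPA and alignment information, substantially improving word- and character-level accuracy.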

Details Motivation: ASR error correction has not been studied for the low-resource language Burmese; this work explores effective correction methods to improve ASR system performance. Method: Uses sequence-to-sequence Transformer models with International Phonetic Alphabet (IPA) and alignment information as input features, evaluating different feature integration strategies on five ASR backbones. Result: The proposed AEC model reduces the average word error rate (WER) of the ASR models from 51.56 to 39.82 (before augmentation) and raises the chrF++ score from 0.5864 to 0.627, showing significant and consistent gains. Conclusion: ASR error correction is effective in low-resource settings, and feature design is critical for improving ASR outputs. Abstract: This paper investigates sequence-to-sequence Transformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IPA and alignment information. To our knowledge, this is the first study addressing ASR error correction specifically for Burmese. We evaluate five ASR backbones and show that our ASR Error Correction (AEC) approaches consistently improve word- and character-level accuracy over baseline outputs. The proposed AEC model, combining IPA and alignment features, reduced the average WER of ASR models from 51.56 to 39.82 before augmentation (and 51.56 to 43.59 after augmentation) and improved chrF++ scores from 0.5864 to 0.627, demonstrating consistent gains over the baseline ASR outputs without AEC. Our results highlight the robustness of AEC and the importance of feature design for improving ASR outputs in low-resource settings.
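For readers less familiar with the headline metric, the following is a minimal sketch of word error rate as word-level edit distance divided by reference length. It assumes the text is already split into whitespace-separated words, which for Burmese would require a separate word segmentation step not shown here; the English example sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Relative WER reduction implied by the reported averages (51.56 -> 39.82).
relative_reduction = 100.0 * (51.56 - 39.82) / 51.56
```

On the reported averages, the drop from 51.56 to 39.82 corresponds to a relative reduction of roughly 22.8%.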

[26] MortgageLLM: Domain-Adaptive Pretraining with Residual Instruction Transfer, Alignment Tuning, and Task-Specific Routing

Manish Jain,Satheesh Kumar Ponnambalam,Salman Faroz,Chandrakanth Lns,Vinay Sharma

Main category: cs.CL

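TL;DR: MortgageLLM is a dual-expert large language model for the mortgage finance domain. It uses the instruction residual technique to achieve domain specialization while preserving instruction-following ability, and significantly outperforms the baseline model.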

Details Motivation: LLMs perform well in general domains, but specialized domains such as mortgage finance require added domain knowledge while preserving instruction following. A single multi-task model suffers from performance trade-offs, so a new approach is needed. Method: Uses a dual-track specialization framework based on LLaMA-3.1-8B to build two expert models, one for conversational Q&A and one for structured tasks (classification and summarization). Applies the instruction residual technique to restore instruction following after domain adaptation, and designs a task routing mechanism in which an expert model itself performs few-shot classification. Result: On domain benchmarks, MortgageLLM (MLM v2) significantly outperforms the baseline: summarization scores 4.58 vs. 3.99, Q&A 4.09 vs. 4.0, and classification 2.6 vs. 1.2; it also leads on BERTScore, reaching 0.77 (summarization), 0.68 (Q&A), and 0.75 (classification). Conclusion: A dual-expert architecture combined with the instruction residual technique resolves the conflict between domain adaptation and instruction following; MortgageLLM performs strongly in mortgage finance, supporting the approach for highly specialized settings. Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities across general domains, yet their application to specialized sectors such as mortgage finance requires domain-specific knowledge augmentation while preserving instruction-following fidelity. We present MortgageLLM, a novel domain-specific large language model that addresses this dual challenge. It is developed using a dual-track specialization framework from a single base model (LLaMA-3.1-8B). We opted for this dual-expert approach as a single multi-task model suffers from performance trade-offs, where optimizing for structured tasks (via SFT) degrades conversational fidelity (via DPO). Our dual-track method solves this by creating two specialists, allowing each to be optimally trained for its distinct capability. Our approach applies the instruction residual technique to restore instruction-following capabilities post-domain adaptation without supervised fine-tuning. We contribute: (1) application of this residual technique to the highly specialized mortgage finance domain; (2) a dual-expert architecture combining a conversational Q&A model and a structured task model for classification and summarization; and (3) an intelligent task routing mechanism using few-shot classification performed by one of the expert models itself. We validate our approach on domain-specific benchmarks, where our final model (MLM v2) significantly outperforms the base LLaMA-3.1-8B-Instruct, achieving an LLM-as-a-Judge summarization score of 4.58 (vs. 3.99), a Q&A score of 4.09 (vs. 4.0), and a classification score of 2.6 (vs. 1.2). On semantic similarity, our model achieved a BERTScore of 0.77 for summarization (vs. 0.74), 0.68 for Q&A (vs. 0.58), and 0.75 for classification (vs. 0.73), substantially outperforming baseline approaches.
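The instruction residual idea, as it is commonly described, is weight arithmetic: take the difference between an instruction-tuned model and its base, then add that difference to a domain-adapted copy of the base. The sketch below illustrates this on toy weight dictionaries. It is an assumption about the general technique, not the MortgageLLM code; the parameter name `layer.w`, the `scale` argument, and all values are made up.

```python
def instruction_residual(instruct_weights, base_weights):
    """Per-parameter difference between an instruction-tuned model and its base."""
    return {
        name: [i - b for i, b in zip(instruct_weights[name], base_weights[name])]
        for name in base_weights
    }


def apply_residual(adapted_weights, residual, scale=1.0):
    """Add the instruction residual to domain-adapted weights (scale is a hypothetical knob)."""
    return {
        name: [w + scale * r for w, r in zip(adapted_weights[name], residual[name])]
        for name in adapted_weights
    }


# Toy flattened weights standing in for real tensors.
base = {"layer.w": [1.0, 2.0]}
instruct = {"layer.w": [1.5, 1.0]}
domain_adapted = {"layer.w": [2.0, 2.0]}  # base after continued pretraining on domain text

residual = instruction_residual(instruct, base)
restored = apply_residual(domain_adapted, residual)
```

Because the operation is a single addition over the weights, it needs no gradient updates, which matches the abstract's claim of restoring instruction following without supervised fine-tuning.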

[27] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

Yuhang Wang,Yanxu Zhu,Dongyuan Lu,Jitao Sang

Main category: cs.CL

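TL;DR: This paper proposes SGASA, a framework that uses synthesized guidelines and adaptive alignment to make reasoning models safer against adversarial jailbreak prompts.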

Details Motivation: Adversarial jailbreak prompts are covert and deceptive, so they easily bypass existing safety mechanisms and lead to harmful content; a safety alignment method that adaptively reinforces defenses is needed. Method: The SGASA framework has two stages: a data pre-synthesis stage that generates safety guidelines and augmented prompts, and an alignment fine-tuning stage that uses supervised fine-tuning (SFT) and direct preference optimization (DPO) to embed these guidelines in the model. Result: Experiments on multiple datasets show that SGASA significantly improves model safety and robustness to harmful adversarial prompts while reducing unnecessary refusals of benign requests. Conclusion: SGASA is an effective, scalable, and adaptive safety alignment method that helps reasoning models reinforce their own defenses and improves overall safety. Abstract: Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Due to the covert and deceptive nature of such prompts, they can often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen models' robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts; and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.

[28] Can Finetuning LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

Steven Wang,Kyle Hunt,Shaojie Tang,Kenneth Joseph

Main category: cs.CL

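TL;DR: This study examines whether fine-tuning large language models (LLMs) on small-scale human survey data lets them simulate human behavior more realistically. Fine-tuning improves response diversity, subgroup alignment, and belief-action coherence, but the fine-tuned models still fail to reproduce the regression coefficients of the original study, indicating that LLM-generated data cannot yet replace human participants in inferential analysis.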

Details Motivation: Whether LLMs can replace human participants in survey and experimental research is contested. Despite earlier explorations, evidence shows that LLMs deviate from real humans in behavioral diversity, alignment with minority subgroups, and belief-action coherence. This study tests whether fine-tuning on a small sample can mitigate these problems. Method: Using an experiment on information disclosure behavior, compares human and LLM-generated responses along several dimensions (distributional divergence, subgroup alignment, belief-action coherence, and recovery of regression coefficients) to assess the effect of fine-tuning LLMs on small human samples. Result: Fine-tuning substantially improves response heterogeneity, subgroup alignment, and belief-action coherence, but none of the fine-tuned models reproduces the regression coefficients of the original study. Conclusion: Although fine-tuning improves an LLM's ability to simulate human behavior, the generated data remain unsuitable for formal statistical inference and therefore cannot fully replace human participants. Abstract: There is ongoing debate about whether large language models (LLMs) can serve as substitutes for human participants in survey and experimental research. While recent work in fields such as marketing and psychology has explored the potential of LLM-based simulation, a growing body of evidence cautions against this practice: LLMs often fail to align with real human behavior, exhibiting limited diversity, systematic misalignment for minority subgroups, insufficient within-group variance, and discrepancies between stated beliefs and actions. This study examines an important and distinct question in this domain: whether fine-tuning on a small subset of human survey data, such as that obtainable from a pilot study, can mitigate these issues and yield realistic simulated outcomes. Using a behavioral experiment on information disclosure, we compare human and LLM-generated responses across multiple dimensions, including distributional divergence, subgroup alignment, belief-action coherence, and the recovery of regression coefficients. We find that fine-tuning on small human samples substantially improves heterogeneity, alignment, and belief-action coherence relative to the base model. However, even the best-performing fine-tuned models fail to reproduce the regression coefficients of the original study, suggesting that LLM-generated data remain unsuitable for replacing human participants in formal inferential analyses.

[29] Developing an Open Conversational Speech Corpus for the Isan Language

Adisai Na-Thalang,Chanakan Wittayasakpan,Kritsadha Phatcharoen,Supakit Buakaw

Main category: cs.CL

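TL;DR: This paper presents the development of the first open conversational speech dataset for the Isan language. The dataset contains natural spoken language, captures authentic linguistic phenomena, and addresses the transcription challenges caused by the lack of a standardized orthography.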

Details Motivation: To advance research on Isan, the most widely spoken regional dialect in Thailand, to address the fact that existing speech corpora are mostly based on read or scripted speech, and to promote inclusive AI development. Method: Establishes practical transcription protocols that balance representational accuracy against computational processing needs, addressing transcription consistency, usability, and linguistic authenticity in the absence of a standardized orthography. Result: Builds and publicly releases the first open Isan speech dataset based on natural conversation, supporting speech research on non-standardized languages. Conclusion: The dataset supports research on underrepresented languages and provides a basis for addressing the linguistic and technical challenges of modeling conversational speech. Abstract: This paper introduces the development of the first open conversational speech dataset for the Isan language, the most widely spoken regional dialect in Thailand. Unlike existing speech corpora that are primarily based on read or scripted speech, this dataset consists of natural speech, thereby capturing authentic linguistic phenomena such as colloquialisms, spontaneous prosody, disfluencies, and frequent code-switching with central Thai. A key challenge in building this resource lies in the lack of a standardized orthography for Isan. Current writing practices vary considerably, due to the different lexical tones between Thai and Isan. This variability complicates the design of transcription guidelines and poses questions regarding consistency, usability, and linguistic authenticity. To address these issues, we establish practical transcription protocols that balance the need for representational accuracy with the requirements of computational processing. By releasing this dataset as an open resource, we aim to contribute to inclusive AI development, support research on underrepresented languages, and provide a basis for addressing the linguistic and technical challenges inherent in modeling conversational speech.

[30] PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

Robert Belanec,Branislav Pecher,Ivan Srba,Maria Bielikova

Main category: cs.CL

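TL;DR: This paper proposes PEFT-Bench, a unified end-to-end benchmark for evaluating parameter-efficient fine-tuning (PEFT) methods on autoregressive LLMs, and introduces the PSCP score, which jointly accounts for trainable parameters, inference speed, and training memory.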

Details Motivation: Although LLMs perform well on many tasks, their size leads to high computational and environmental costs and limits accessibility. Existing evaluations of PEFT methods cover few models and datasets and are hard to reproduce. Method: Builds PEFT-Bench, a unified evaluation framework covering 27 NLP datasets and 6 PEFT methods, and proposes the PSCP metric to assess PEFT methods jointly on trainable parameters, inference speed, and training memory. Result: Enables reproducible evaluation of multiple PEFT methods across a wide range of datasets, with the PSCP metric providing a more complete performance comparison. Conclusion: PEFT-Bench provides a standardized, reproducible evaluation platform for PEFT methods and supports the development and adoption of efficient fine-tuning techniques. Abstract: Despite the state-of-the-art performance of Large Language Models (LLMs) achieved on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility. Parameter-efficient fine-tuning (PEFT) methods address this challenge by reducing the number of trainable parameters while maintaining strong downstream performance. Despite the increased development in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce. To bridge this gap, we introduce PEFT-Bench, a unified end-to-end benchmark for evaluating diverse PEFT methods on autoregressive LLMs. We demonstrate its usage across 27 NLP datasets and 6 PEFT methods. To account for different PEFT training and inference factors, we also introduce the PEFT Soft Score Penalties (PSCP) metric, which takes trainable parameters, inference speed, and training memory usage into account.

[31] Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text

Kai Kugler

Main category: cs.CL

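TL;DR: The first systematic study of Martin's Law (the relationship between word frequency and polysemy) in text generated by neural language models during training. It finds a non-monotonic developmental trajectory with an optimal semantic window, and proposes a new methodology for evaluating emergent linguistic structure in neural language models.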

Details Motivation: To investigate whether neural language models follow Martin's Law, the empirical relationship between word frequency and polysemy found in human language, over the course of training, and to understand how word senses evolve in model-generated text. Method: Uses DBSCAN clustering of contextualized embeddings as an operational definition of word senses, and analyzes four Pythia models of different sizes (70M-1B parameters) across 30 training checkpoints. Result: Martin's Law emerges around checkpoint 100, reaches peak correlation (r > 0.6) at checkpoint $10^4$, and declines by checkpoint $10^5$; small models show catastrophic semantic collapse at late checkpoints, while large models degrade gradually; the frequency-specificity trade-off stays stable throughout (r ≈ -0.3). Conclusion: LLM compliance with linguistic regularities does not increase monotonically with training; it follows a balanced trajectory with an optimal semantic window, which calls for new methods to evaluate the emergence of linguistic structure in models. Abstract: We present the first systematic investigation of Martin's Law - the empirical relationship between word frequency and polysemy - in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin's Law emerges around checkpoint 100, reaches peak correlation (r > 0.6) at checkpoint $10^4$, then degrades by checkpoint $10^5$. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable (r $\approx$ -0.3) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text is not monotonically increasing with training, but instead follows a balanced trajectory with an optimal semantic window. This work establishes a novel methodology for evaluating emergent linguistic structure in neural language models.
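The correlation at the heart of this analysis can be illustrated with a short sketch that relates word frequency to the number of senses per word. Everything here is illustrative: the word statistics are invented, the sense counts are hard-coded rather than derived from DBSCAN clusters, and the use of log frequency with Pearson's r is an assumption rather than a detail given in the abstract.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical per-word statistics: (corpus frequency, number of sense clusters).
word_stats = {
    "run": (9000, 7),
    "bank": (3000, 4),
    "table": (2500, 3),
    "quark": (40, 1),
    "fjord": (15, 1),
}

log_freq = [math.log(freq) for freq, _ in word_stats.values()]
senses = [n_senses for _, n_senses in word_stats.values()]
martin_r = pearson(log_freq, senses)
```

Repeating this computation on text generated at each training checkpoint would give the kind of correlation-over-training curve the abstract describes.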

[32] Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model

Joshua Fonseca Rivera

Main category: cs.CL

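TL;DR: Through fine-tuning, a 7B-parameter language model is trained to reliably detect and report "thoughts" injected at a single token position, reaching 85% accuracy with no false positives and satisfying three of Lindsey's criteria. This shows that direct training can induce a form of introspective behavior and offers a new route to AI transparency.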

Details Motivation: Lindsey (2025) found that language models' awareness of injected activation patterns is unreliable (about 20% success); this paper asks whether introspective ability can be improved by direct training rather than by waiting for it to emerge. Method: Fine-tunes a 7B-parameter language model on transient single-token activation injections, training it to detect and report the injected semantic content, and evaluates generalization to unseen concepts. Result: The model improves from near-complete failure (0.4% accuracy, 6.7% false positive rate) to 85% accuracy with no false positives; it satisfies the accuracy, grounding, and internality criteria and generalizes to some extent to unseen concepts (7.5pp performance gap). Conclusion: At least one component of introspective behavior can be induced directly through training, which addresses Lindsey's open question of whether training can eliminate cross-model differences and suggests that built-in AI transparency mechanisms are feasible. Abstract: Lindsey (2025) investigates introspective awareness in language models through four experiments, finding that models can sometimes detect and identify injected activation patterns -- but unreliably (~20% success in the best model). We focus on the first of these experiments -- self-report of injected "thoughts" -- and ask whether this capability can be directly trained rather than waiting for emergence. Through fine-tuning on transient single-token injections, we transform a 7B parameter model from near-complete failure (0.4% accuracy, 6.7% false positive rate) to reliable detection (85% accuracy on held-out concepts at α=40, 0% false positives). Our model detects fleeting "thoughts" injected at a single token position, retains that information, and reports the semantic content across subsequent generation steps. On this task, our trained model satisfies three of Lindsey's criteria: accuracy (correct identification), grounding (0/60 false positives), and internality (detection precedes verbalization). Generalization to unseen concept vectors (7.5pp gap) demonstrates the model learns a transferable skill rather than memorizing specific vectors, though this does not establish metacognitive representation in Lindsey's sense. These results address an open question raised by Lindsey: whether "training for introspection would help eliminate cross-model differences." We show that at least one component of introspective behavior can be directly induced, offering a pathway to built-in AI transparency.
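A transient single-token injection can be pictured as adding a scaled concept vector to the hidden state at one position and leaving all other positions untouched. The sketch below shows that operation on plain Python lists, together with a false positive rate helper. It is a toy illustration under assumptions: the layer choice, vector values, and function names are hypothetical, and only the strength α=40 and the 0/60 count come from the abstract.

```python
def inject_concept(hidden_states, concept_vector, position, alpha=40.0):
    """Return a copy of hidden_states with alpha * concept_vector added at one token position."""
    injected = [list(state) for state in hidden_states]
    injected[position] = [
        h + alpha * c for h, c in zip(injected[position], concept_vector)
    ]
    return injected


def false_positive_rate(control_reports):
    """Share of no-injection trials in which the model still claimed to detect a thought."""
    return sum(control_reports) / len(control_reports)


# Three token positions with a toy hidden size of 2.
states = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
injected = inject_concept(states, concept_vector=[0.5, -0.5], position=1)

# 60 control trials with no detections, matching the reported 0/60.
fpr = false_positive_rate([False] * 60)
```

Because only one position is modified, the model has to carry the information forward itself in order to report it during later generation steps.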

[33] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

Antonín Jarolím,Martin Fajčík,Lucia Makaiová

Main category: cs.CL

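TL;DR: This paper studies fine-grained evidence extraction for misinformation in Czech and Slovak user comments, builds a two-way annotated dataset created by paid annotators, and evaluates how well several large language models agree with human annotations on this task.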

Details Motivation: Misinformation often spreads in online news comments, so the specific text evidence that supports or refutes a claim must be identified to make fact-checking more accurate and explainable. Method: A new two-way annotated dataset of fine-grained evidence is built, several LLMs (such as llama3.1:8b and gpt-oss-120b) are evaluated on it, and their agreement with human annotations and their error patterns are analyzed. Result: llama3.1:8b produces a high proportion of correct outputs despite its small parameter count, while gpt-oss-120b underperforms. qwen3:14b, deepseek-r1:32b, and gpt-oss:20b balance model size and agreement with human annotations well. LLMs often fail to copy evidence verbatim from the source, which makes their outputs invalid. Conclusion: Current LLMs remain limited on fine-grained evidence extraction, especially in copying source text exactly. Performance does not depend solely on parameter count, and smaller models can perform well. Future work should improve models' faithfulness to the source text. Abstract: Misinformation frequently spreads in user comments under online news articles, highlighting the need for effective methods to detect factually incorrect information. To strongly support or refute claims extracted from such comments, it is necessary to identify relevant documents and pinpoint the exact text spans that justify or contradict each claim. This paper focuses on the latter task -- fine-grained evidence extraction for Czech and Slovak claims. We create a new dataset containing two-way annotated fine-grained evidence created by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations. The results reveal that LLMs often fail to copy evidence verbatim from the source text, leading to invalid outputs. Error-rate analysis shows that the llama3.1:8b model achieves a high proportion of correct outputs despite its relatively small size, while the gpt-oss-120b model underperforms despite having many more parameters. Furthermore, the models qwen3:14b, deepseek-r1:32b, and gpt-oss:20b demonstrate an effective balance between model size and alignment with human annotations.
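The failure mode named above, evidence that is not copied verbatim, can be checked mechanically. The sketch below treats an extracted span as valid only if it is an exact substring of the source. This is a simplified stand-in: the paper's actual validity rules (for example, any whitespace or casing normalization) are not given in the abstract, and the function names and sample strings are hypothetical.

```python
def is_verbatim_span(evidence, source):
    """True if the extracted evidence appears character-for-character in the source."""
    return bool(evidence) and evidence in source


def invalid_rate(extractions, source):
    """Fraction of extracted spans that are not exact substrings of the source."""
    if not extractions:
        return 0.0
    invalid = sum(1 for span in extractions if not is_verbatim_span(span, source))
    return invalid / len(extractions)


# Hypothetical source text and model outputs.
source_text = "The city council approved the budget on Monday."
model_spans = ["approved the budget", "council approved a budget"]
rate = invalid_rate(model_spans, source_text)
```

The second span paraphrases the source ("a budget" instead of "the budget"), so it fails the check even though its meaning is close.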

[34] Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation

Zhifeng Hao,Qibin Song,Ruichu Cai,Boyan Xu

Main category: cs.CL

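TL;DR: Proposes DSR-SQL, a dual-state reasoning framework that improves LLM Text-to-SQL performance on complex databases through context compression and feedback-guided SQL generation.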

Details Motivation: Existing chain-of-thought Text-to-SQL methods perform poorly on complex enterprise databases because of limited context capacity, unreliable schema linking, and weak semantic grounding. Method: A dual-state framework is designed: an adaptive context state compresses the schema and selects key structures, while a generation state builds the SQL step by step through feedback and self-correction. Result: DSR-SQL reaches 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set, without fine-tuning or in-context examples. Conclusion: DSR-SQL addresses incoherent reasoning on complex databases and significantly improves the accuracy and robustness of Text-to-SQL. Abstract: Recent divide-and-conquer reasoning approaches, particularly those based on Chain-of-Thought (CoT), have substantially improved the Text-to-SQL capabilities of Large Language Models (LLMs). However, when applied to complex enterprise databases, such methods struggle to maintain coherent reasoning due to limited context capacity, unreliable schema linking, and weak grounding in database semantics. To overcome these issues, we introduce DSR-SQL, a Dual-State Reasoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. The first constructs a compact, semantically faithful environment by refining large schemas and selecting relevant structures, while the second formalizes SQL synthesis as feedback-guided state transitions, enabling the model to self-correct and align with user intent. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set. Our implementation will be open-sourced at: https://github.com/DMIRLAB-Group/DSR-SQL.

[35] Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning

Kaifeng Hong,Yinglong Zhang,Xiaoying Hong,Xuewen Xia,Xing Xu

Main category: cs.CL

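TL;DR: This paper proposes Odin, a new architecture that injects graph structure into selected Transformer layers through an oriented dual-module mechanism, combining text and graph structure. Odin avoids over-smoothing and is more expressive than pure Transformers and GNNs. Odin reaches state-of-the-art results on multiple text-graph benchmarks, and the lightweight variant, Light Odin, stays competitive at substantially lower computational cost.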

Details Motivation: Existing approaches to text-attributed graphs are limited: GNNs suffer from over-smoothing and hop-dependent diffusion, while Transformers ignore graph topology. A model is needed that combines strong text understanding with structural reasoning. Method: Odin injects graph structure into the Transformer at selected depths through an oriented dual-module mechanism, integrating multi-hop structure at specific layers to give low-, mid-, and high-level structural abstraction aligned with the semantic hierarchy. Aggregation operates on the global [CLS] representation, which avoids over-smoothing. Light Odin is a lightweight variant designed for efficiency. Result: Odin achieves state-of-the-art accuracy on multiple text-rich graph benchmarks, and Light Odin stays competitive at significantly lower computational cost. The theoretical analysis shows that Odin's expressive power strictly contains that of pure Transformers and GNNs. Conclusion: Odin and Light Odin form a unified, hop-free framework for structure-text integration that addresses the inherent problems of GNNs and Transformers on text graphs and offers a new paradigm for text-graph modeling. Abstract: Text-attributed graphs require models to effectively combine strong textual understanding with structurally informed reasoning. Existing approaches either rely on GNNs--limited by over-smoothing and hop-dependent diffusion--or employ Transformers that overlook graph topology and treat nodes as isolated sequences. We propose Odin (Oriented Dual-module INtegration), a new architecture that injects graph structure into Transformers at selected depths through an oriented dual-module mechanism. Unlike message-passing GNNs, Odin does not rely on multi-hop diffusion; instead, multi-hop structures are integrated at specific Transformer layers, yielding low-, mid-, and high-level structural abstraction aligned with the model's semantic hierarchy. Because aggregation operates on the global [CLS] representation, Odin fundamentally avoids over-smoothing and decouples structural abstraction from neighborhood size or graph topology. We further establish that Odin's expressive power strictly contains that of both pure Transformers and GNNs. To make the design efficient in large-scale or low-resource settings, we introduce Light Odin, a lightweight variant that preserves the same layer-aligned structural abstraction for faster training and inference. Experiments on multiple text-rich graph benchmarks show that Odin achieves state-of-the-art accuracy, while Light Odin delivers competitive performance with significantly reduced computational cost. Together, Odin and Light Odin form a unified, hop-free framework for principled structure-text integration. The source code of this model has been released at https://github.com/hongkaifeng/Odin.

[36] A Systematic Study of Model Merging Techniques in Large Language Models

Oğuz Kağan Hitit,Leander Girrbach,Zeynep Akata

Main category: cs.CL

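TL;DR: A large-scale systematic evaluation of six state-of-the-art model merging methods on large language models (LLMs) finds that the simplest method, Task Arithmetic, is the only one that reliably improves performance, while the others often degrade it. Existing merging techniques do not transfer directly to modern LLMs, which motivates LLM-specific merging algorithms and merging-aware fine-tuning.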

Details Motivation: To test whether merging methods that work for small models and classifiers generalize to LLMs, and to measure their applicability and gains there. Method: Six state-of-the-art merging methods (including recent subspace methods) are evaluated across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks, measuring the probability that a merged model outperforms the base model and the relative gain over the best individual checkpoint. Result: Task Arithmetic, the oldest and simplest method, is the only one that reliably yields gains. Other interference-aware and subspace methods typically cause significant performance drops. Conclusion: Current merging techniques do not transfer directly to modern LLMs. LLM-specific merging algorithms and merging-aware fine-tuning methods are needed. Abstract: Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.
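For readers unfamiliar with Task Arithmetic, the usual formulation adds the scaled sum of task vectors (fine-tuned weights minus base weights) back onto the base model. The sketch below shows that update on toy weights stored as plain lists. The dictionary layout, the parameter name `w`, and the `scale` value are illustrative assumptions, and this is not the evaluation code of the paper.

```python
def task_arithmetic(base, checkpoints, scale=1.0):
    """Merge checkpoints as: base + scale * sum_i (checkpoint_i - base).

    base and each checkpoint map parameter names to flat lists of floats.
    """
    merged = {}
    for name, base_w in base.items():
        # Accumulate the task vectors (fine-tuned minus base) for this parameter.
        delta = [0.0] * len(base_w)
        for ckpt in checkpoints:
            for i, (w_ft, w_b) in enumerate(zip(ckpt[name], base_w)):
                delta[i] += w_ft - w_b
        merged[name] = [w_b + scale * d for w_b, d in zip(base_w, delta)]
    return merged


# Toy weights (hypothetical): one parameter tensor with two entries.
base_model = {"w": [1.0, 2.0]}
finetuned = [{"w": [2.0, 2.0]}, {"w": [1.0, 4.0]}]
merged_model = task_arithmetic(base_model, finetuned, scale=0.5)
```

No training step is involved: merging is pure arithmetic on stored weights, which is why the approach is attractive when it works.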

[37] Hierarchical Ranking Neural Network for Long Document Readability Assessment

Yurui Zheng,Yijun Chen,Shaohong Zhang

Main category: cs.CL

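TL;DR: This paper proposes a bidirectional readability assessment mechanism and a pairwise sorting algorithm that improve readability assessment by capturing contextual information and modeling the ordinal relationship between readability labels.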

Details Motivation: Existing deep learning methods for readability assessment often ignore text length and the ordinal relationship between readability labels, which limits their performance. Method: A bidirectional readability assessment mechanism captures contextual information, identifies semantically rich regions, and predicts sentence-level readability, which then assists the document-level prediction. A pairwise sorting algorithm models the ordinal relationship between readability levels through label subtraction. Result: Experiments on Chinese and English datasets show that the model is competitive and outperforms the baseline models. Conclusion: The method improves readability assessment accuracy, particularly for texts of varying length and for preserving label order. Abstract: Readability assessment aims to evaluate the reading difficulty of a text. In recent years, while deep learning technology has been gradually applied to readability assessment, most approaches fail to consider either the length of the text or the ordinal relationship of readability labels. This paper proposes a bidirectional readability assessment mechanism that captures contextual information to identify regions with rich semantic information in the text, thereby predicting the readability level of individual sentences. These sentence-level labels are then used to assist in predicting the overall readability level of the document. Additionally, a pairwise sorting algorithm is introduced to model the ordinal relationship between readability levels through label subtraction. Experimental results on Chinese and English datasets demonstrate that the proposed model achieves competitive performance and outperforms other baseline models.

[38] Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

Lina Conti,Dennis Fucci,Marco Gaido,Matteo Negri,Guillaume Wisniewski,Luisa Bentivogli

Main category: cs.CL

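TL;DR: This study examines how speech translation (ST) models use acoustic cues and training-data patterns to assign gender to speaker-referring terms. It finds that models rely not only on gender associations in the training data but also use first-person pronouns to link speaker characteristics to gender information, drawing on acoustic information distributed across the spectrum rather than on pitch alone.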

Details Motivation: Speech carries information such as the speaker's gender. When translating from a language without grammatical gender (such as English) into one with grammatical gender, acoustic features may lead the model to misgender the speaker. How ST models make these decisions is poorly understood, so the mechanism needs study to reduce modality-specific bias. Method: Three language pairs (en-es/fr/it) are studied, analyzing the interaction between training-data patterns, internal language model (ILM) bias, and acoustic information, with contrastive feature attribution on spectrograms to reveal how models use acoustic features for gender assignment. Result: Models do not simply copy term-specific gender tendencies from the training data but learn a broader pattern of masculine prevalence. Although the ILM has a strong masculine bias, models can override it based on acoustic input. The better-performing model uses first-person pronouns to link gendered terms to the speaker and relies on gender information distributed across the frequency spectrum rather than only in the pitch region. Conclusion: ST models combine language-model bias with acoustic input when translating gendered references and use distributed spectral features for gender assignment, which points to the need to consider multimodal information integration when mitigating gender bias. Abstract: Unlike text, speech conveys information about the speaker, such as gender, through acoustic cues like pitch. This gives rise to modality-specific bias concerns. For example, in speech translation (ST), when translating from languages with notional gender, such as English, into languages where gender-ambiguous terms referring to the speaker are assigned grammatical gender, the speaker's vocal characteristics may play a role in gender assignment. This risks misgendering speakers, whether through masculine defaults or vocal-based assumptions. Yet, how ST models make these decisions remains poorly understood. We investigate the mechanisms ST models use to assign gender to speaker-referring terms across three language pairs (en-es/fr/it), examining how training data patterns, internal language model (ILM) biases, and acoustic information interact. We find that models do not simply replicate term-specific gender associations from training data, but learn broader patterns of masculine prevalence. While the ILM exhibits strong masculine bias, models can override these preferences based on acoustic input. Using contrastive feature attribution on spectrograms, we reveal that the model with higher gender accuracy relies on a previously unknown mechanism: using first-person pronouns to link gendered terms back to the speaker, accessing gender information distributed across the frequency spectrum rather than concentrated in pitch.

[39] Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects

Husne Ara Rubaiyeat,Hasan Mahmud,Md Kamrul Hasan

Main category: cs.CL

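TL;DR: This paper presents the IsharaKhobor dataset and its subsets to advance Bangla Sign Language Translation (BdSLT), addressing data scarcity for this low-resource language, with benchmarking and ablations on vocabulary restriction and canonicalization.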

Details Motivation: Bangla Sign Language is extremely low-resource and lacks a standard sentence-level dataset, which severely limits the development of AI-based assistive tools. A high-quality dataset is needed to advance research. Method: The IsharaKhobor dataset and two subsets (IsharaKhobor_small and IsharaKhobor_canonical_small) are built, benchmarked with landmark-based raw and RQE embeddings, and studied through ablations on vocabulary restriction and canonicalization. Result: IsharaKhobor and its variants are released publicly on Kaggle as a base resource for BdSLT research, and experiments under different settings show the effect of the data-processing choices. Conclusion: IsharaKhobor fills a data gap in Bangla Sign Language Translation and lays groundwork for AI-driven assistive technology for deaf and hard-of-hearing people. Abstract: Bangla Sign Language Translation (BdSLT) has been severely constrained so far as the language itself is very low resource. Standard sentence-level dataset creation for BdSLT is of immense importance for developing AI-based assistive tools for deaf and hard-of-hearing people of the Bangla-speaking community. In this paper, we present a dataset, IsharaKhobor, and two subsets of it for enabling research. We also present the challenges towards developing the dataset and present some way forward by benchmarking with landmark-based raw and RQE embedding. We do some ablation on vocabulary restriction and canonicalization of the same within the dataset, which resulted in two more datasets, IsharaKhobor_small and IsharaKhobor_canonical_small. The dataset is publicly available at: www.kaggle.com/datasets/hasanssl/isharakhobor [1].

[40] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

Minjoon Choi

Main category: cs.CL

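TL;DR: This paper introduces the RoParQ benchmark and the XParaCon metric to measure how consistently large language models answer paraphrased questions, and uses a reasoning-based fine-tuning method to improve models' grasp of semantic invariance.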

Details Motivation: Large language models answer paraphrased questions inconsistently, which suggests reliance on surface patterns rather than semantic understanding. Better evaluation and training strategies are needed to improve robustness. Method: RoParQ is built by generating paraphrases of standard datasets with proprietary models and keeping the examples that elicit inconsistent confidence from a judge model. XParaCon measures consistency through the standard deviation of accuracies across question variants. A reasoning-based, paraphrase-aware supervised fine-tuning (SFT) strategy aligns models. Result: Lightweight models fine-tuned with the proposed SFT strategy become significantly more consistent across paraphrases, reaching levels comparable to much larger pre-trained models. Conclusion: The approach mitigates reliance on superficial memorization and makes semantic understanding more stable and reliable, supporting targeted training as a way to improve robustness. Abstract: Large Language Models (LLMs) often exhibit inconsistent behavior when answering paraphrased questions, suggesting a reliance on surface-level patterns rather than true semantic understanding. To address this limitation, we introduce RoParQ, a benchmark specifically constructed to evaluate cross-paraphrase consistency in closed-book multiple-choice QA. This benchmark is derived from standard datasets by generating paraphrases via proprietary models and selectively retaining examples that elicit inconsistent confidence from a judge model. We further propose XParaCon, a novel evaluation metric that quantifies a model's robustness by measuring the standard deviation of accuracies across question variants. Additionally, we implement a reasoning-based, paraphrase-aware Supervised Fine-Tuning (SFT) strategy designed to align models toward semantic invariance. Our experiments demonstrate that this targeted alignment significantly enhances robustness. Notably, fine-tuned lightweight models achieved consistency levels comparable to much larger pre-trained models. These results highlight the efficacy of our approach in mitigating superficial memorization and fostering more robust, reliable LLMs.
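The core quantity behind XParaCon, as described above, is the standard deviation of accuracies across question variants. The sketch below computes only that quantity, under the assumption that accuracy is taken per variant index over all questions. Any further scaling or transformation in the published metric is not stated in the abstract and is not reproduced here; the function name and toy data are hypothetical.

```python
import statistics


def accuracy_spread(correct):
    """Population standard deviation of per-variant accuracies.

    correct[q][v] is True if the model answered variant v of question q correctly.
    """
    num_variants = len(correct[0])
    per_variant_acc = [
        sum(row[v] for row in correct) / len(correct) for v in range(num_variants)
    ]
    return statistics.pstdev(per_variant_acc)


# Toy results: 4 questions, 3 paraphrase variants each (hypothetical).
answers = [
    [True, True, False],
    [True, False, True],
    [True, True, False],
    [True, False, True],
]
spread = accuracy_spread(answers)
```

A model that is equally accurate on every paraphrase gets a spread of 0; a larger spread means the wording of the question changes how often the model is right.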

[41] Auxiliary Metrics Help Decoding Skill Neurons in the Wild

Yixiu Zhao,Xiaozhi Wang,Zijun Yao,Lei Hou,Juanzi Li

Main category: cs.CL

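TL;DR: This paper proposes a lightweight, general method that identifies neurons encoding specific skills in large language models by correlating neuron activations with auxiliary metrics such as external labels or the model's own confidence, and validates it on several tasks.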

Details Motivation: Large language models are capable but internally opaque, so interpretable methods are needed to understand how they encode and perform specific skills. Method: Building on soft prompt training, the method correlates neuron activations with auxiliary metrics (such as external labels and model confidence) to locate skill-related neurons, without manual token aggregation. Result: Skill-related neurons are identified on open-ended generation, natural language inference, and arithmetic reasoning (BigBench) tasks, and a previously unidentified shortcut in arithmetic reasoning is found. Conclusion: The method reveals task-specific, interpretable neuron behavior in large models, which helps explain their internal mechanisms and uncover potential reasoning shortcuts. Abstract: Large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, yet their internal mechanisms remain largely opaque. In this paper, we introduce a simple, lightweight, and broadly applicable method with a focus on isolating neurons that encode specific skills. Building upon prior work that identified "skill neurons" via soft prompt training on classification tasks, our approach extends the analysis to complex scenarios involving multiple skills. We correlate neuron activations with auxiliary metrics -- such as external labels and the model's own confidence score -- thereby uncovering interpretable and task-specific behaviors without the need for manual token aggregation. We empirically validate our method on tasks spanning open-ended text generation and natural language inference, demonstrating its ability to detect neurons that not only drive known skills but also reveal previously unidentified shortcuts in arithmetic reasoning on BigBench.
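One simple way to read "correlate neuron activations with auxiliary metrics" is to compute a correlation coefficient per neuron and rank neurons by its magnitude. The sketch below does this with Pearson correlation. The choice of Pearson, the ranking rule, and all names and numbers are assumptions for illustration; the abstract does not specify the statistic the authors use.

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant signal: correlation is undefined, treated as 0 here
    return cov / (var_x * var_y) ** 0.5


def rank_neurons(activations, metric):
    """Order neuron indices by |correlation| with the per-example metric, strongest first.

    activations[n] holds neuron n's activation on each example.
    """
    scores = [(abs(pearson(acts, metric)), n) for n, acts in enumerate(activations)]
    return [n for _, n in sorted(scores, reverse=True)]


# Toy data: 3 neurons, 4 examples, metric = model confidence (hypothetical).
neuron_acts = [[1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0], [4.0, 3.0, 1.0, 2.0]]
confidence = [0.1, 0.2, 0.3, 0.4]
ranking = rank_neurons(neuron_acts, confidence)
```

Taking the absolute value keeps neurons that are strongly anti-correlated with the metric, since those can be just as informative as positively correlated ones.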

[42] Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

Dongyang Fan,Diba Hashemi,Sai Praneeth Karimireddy,Martin Jaggi

Main category: cs.CL

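TL;DR: This study examines adding several kinds of metadata (such as fine-grained document-quality indicators) to LLM pretraining to speed up training. It finds that metadata encoded at finer granularity is more effective, and proposes metadata appending and learnable meta-tokens to improve training efficiency.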

Details Motivation: Prior work used only URL metadata to accelerate LLM pretraining. This paper asks whether other metadata types can yield larger gains. Method: Several metadata types (such as document-quality indicators) are prepended or appended to input sequences; appending turns metadata prediction into an auxiliary task, learnable meta-tokens are trained with a masked loss, and probing is used to analyze the latent representations. Result: Fine-grained metadata accelerates pretraining. Metadata appending and learnable meta-tokens recover part of the speedup. Probing shows that metadata induces quality-aware latent structure. Conclusion: Fine-grained, informative metadata improves LLM pretraining efficiency, and metadata appending and learnable meta-tokens are viable strategies, giving practical guidance for efficient pretraining. Abstract: Incorporating metadata in Large Language Model (LLM) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find other types of metadata, such as fine-grained indicators of document quality, that can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.

[43] The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry

Anna Marklová,Ondřej Vinš,Martina Vokáčová,Jiří Milička

Main category: cs.CL

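TL;DR: This study examines how well native speakers can tell Czech AI-generated poems from human-written ones and how they judge them aesthetically. The two are hard to tell apart, and readers' bias against AI authorship affects their aesthetic judgments.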

Details Motivation: To explore how well large language models generate poetry, and how that poetry is received, in a complex language with little training data such as Czech. Method: Czech native speakers guess the authorship of poems and rate them aesthetically. Recognition accuracy and rating tendencies are analyzed, with a logistic regression model used to examine the influencing factors. Result: Participants identified authorship correctly only 45.8% of the time, which is chance level. Poems believed to be AI-generated were rated lower, even though AI poems were in fact rated equally or more favorably. The more a participant liked a poem, the less likely they were to identify its author correctly. Familiarity with poetry and literary background had no effect on recognition. Conclusion: AI can produce human-level poetry in a low-resource, morphologically complex language such as Czech, and readers' beliefs about authorship are closely tied to their aesthetic evaluation. Abstract: Large language models are increasingly capable of producing creative texts, yet most studies on AI-generated poetry focus on English -- a language that dominates training data. In this paper, we examine the perception of AI- and human-written Czech poetry. We ask if Czech native speakers are able to identify it and how they aesthetically judge it. Participants performed at chance level when guessing authorship (45.8% correct on average), indicating that Czech AI-generated poems were largely indistinguishable from human-written ones. Aesthetic evaluations revealed a strong authorship bias: when participants believed a poem was AI-generated, they rated it less favorably, even though AI poems were in fact rated equally or more favorably than human ones on average. The logistic regression model showed that the more people liked a poem, the less likely they were to assign its authorship accurately. Familiarity with poetry or literary background had no effect on recognition accuracy. Our findings show that AI can convincingly produce poetry even in a morphologically complex, low-resource (with respect to the training data of AI models) Slavic language such as Czech. The results suggest that readers' beliefs about authorship and the aesthetic evaluation of the poem are interconnected.

[44] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

Dong Wang,Yang Li,Ansong Ni,Ching-Feng Yeh,Youssef Emad,Xinjie Lei,Liam Robbins,Karthik Padthe,Hu Xu,Xian Li,Asli Celikyilmaz,Ramya Raghavendra,Lifei Huang,Carole-Jean Wu,Shang-Wen Li

Main category: cs.CL

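TL;DR: This paper presents Matrix, a decentralized multi-agent synthesis framework that passes messages through distributed queues, removing the central orchestrator bottleneck and substantially raising data generation throughput.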

Details Motivation: Existing frameworks either depend on a central orchestrator, which limits scalability, or are hardcoded for specific domains, which limits flexibility. Neither meets diverse, large-scale data generation needs. Method: Matrix is a decentralized framework built on Ray that represents control and data flow as serialized messages passed through distributed queues. Lightweight agents advance each task independently, and distributed services handle compute-intensive operations. Result: Across several synthesis scenarios, Matrix achieves 2 to 15 times higher data generation throughput on identical hardware without compromising output quality. Conclusion: Matrix offers a modular, configurable design that supports large-scale concurrent multi-agent data generation workflows across many data generation tasks. Abstract: Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present Matrix, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves 2 to 15 times higher data generation throughput under identical hardware resources, without compromising output quality.

[45] ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

Hongjin Su,Shizhe Diao,Ximing Lu,Mingjie Liu,Jiacheng Xu,Xin Dong,Yonggan Fu,Peter Belcak,Hanrong Ye,Hongxu Yin,Yi Dong,Evelina Bakhturina,Tao Yu,Yejin Choi,Jan Kautz,Pavlo Molchanov

Main category: cs.CL

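TL;DR: ToolOrchestra trains a small orchestration model (Orchestrator) with reinforcement learning to coordinate multiple tools. It outperforms GPT-5 at lower cost on complex tasks such as Humanity's Last Exam, achieving a better balance of efficiency and effectiveness.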

Details Motivation: Large language models are powerful, but solving complex problems such as Humanity's Last Exam remains inefficient and costly, so more efficient tool-coordinated reasoning is needed. Method: ToolOrchestra uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards to train a small 8B orchestrator that manages multiple tools and sub-models. Result: Orchestrator scores 37.1% on HLE, above GPT-5's 35.1%, while being 2.5x more efficient. On tau2-Bench and FRAMES it outperforms GPT-5 by a wide margin at about 30% of the cost, and it generalizes well to unseen tools. Conclusion: Composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, offering a path toward practical, scalable tool-augmented reasoning systems. Abstract: Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.

[46] Revisiting Generalization Across Difficulty Levels: It's Not So Easy

Yeganeh Kordi,Nihal V. Nayak,Max Zuo,Ilana Nguyen,Stephen H. Bach

Main category: cs.CL

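TL;DR: Using the outputs of large language models (LLMs) and Item Response Theory (IRT), this study ranks the examples in six datasets by fine-grained difficulty and systematically evaluates how LLMs generalize across difficulty levels. Training on either easy or hard data rarely gives consistent gains across difficulties, so training and evaluation should cover a range of difficulty levels.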

Details Motivation: Existing work disagrees on how the difficulty of training data (easy or hard) affects performance on test data of different difficulty, and lacks an objective, fine-grained difficulty measure. This paper studies cross-difficulty generalization more objectively and systematically to guide data curation and model evaluation. Method: Examples in six datasets are ranked by difficulty using the outputs of thousands of LLMs together with Item Response Theory (IRT), so the ratings rest entirely on model behavior rather than human judgment. Cross-difficulty generalization is then analyzed across training and test data of different difficulty. Result: Cross-difficulty generalization is often limited. Training only on easy data or only on hard data does not give consistent gains across all difficulty levels, and gains at one level often do not carry over to others. Conclusion: Training and evaluation data should cover a wide range of difficulties. Curating or training on only easy or only hard data is risky and can mislead both capability assessment and data optimization. Abstract: We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs' generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.
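As background on how IRT turns model outputs into difficulty scores, the sketch below shows the simplest IRT variant, the one-parameter (Rasch) model, in which the probability of a correct answer depends only on the gap between a test-taker's ability and an item's difficulty. The abstract does not say which IRT variant the authors fit, so the choice of the Rasch model is an assumption, and the fitting step that estimates abilities and difficulties from responses is omitted. Item ids and values are hypothetical.

```python
import math


def p_correct(ability, difficulty):
    """Rasch (1PL) probability that a model with `ability` solves an item of `difficulty`."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def rank_by_difficulty(item_difficulties):
    """Return item ids ordered from easiest to hardest."""
    return sorted(item_difficulties, key=item_difficulties.get)


# Hypothetical fitted difficulties for three examples.
difficulties = {"a": 1.2, "b": -0.5, "c": 0.3}
ordering = rank_by_difficulty(difficulties)
even_odds = p_correct(ability=0.3, difficulty=0.3)
```

When ability equals difficulty the model predicts a 50% chance of success, which is what anchors the difficulty scale to the population of LLMs rather than to human judgment.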

cs.CV [Back]

[47] Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?

David Amebley,Sayanton Dibbo

Main category: cs.CV

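TL;DR: This paper proposes a neuroscience-inspired topological regularization (tau) framework to strengthen the privacy defenses of multi-modal vision-language models (VLMs) against black-box membership inference attacks (MIA). Experiments show that the method substantially lowers attack success while preserving model utility.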

Details Motivation: As multi-modal models are deployed widely in real-world settings, they face new privacy attack vectors, especially membership inference attacks (MIA). Existing work focuses on unimodal systems, and the robustness of multi-modal models, particularly neuro-inspired ones, to privacy attacks is largely unexplored. Method: A neuroscience-inspired topological regularization (tau) framework is used to build NEURO VLM variants, whose resilience to image-text MIA is evaluated on three VLMs (BLIP, PaliGemma 2, and ViT-GPT2) and three datasets (COCO, CC3M, and NoCaps). Result: With BLIP on COCO, MIA attack success in the NEURO VLM drops by 24% mean ROC-AUC, while generated captions stay similarly close to the reference captions on the MPNet and ROUGE-2 metrics. Results on the other models and datasets are consistent. Conclusion: Neuroscience-inspired topological regularization improves the resilience of multi-modal models to membership inference attacks without significantly sacrificing model performance, offering a feasible path toward safer multi-modal AI systems. Abstract: In the age of agentic AI, the growing deployment of multi-modal models (MMs) has introduced new attack vectors that can leak sensitive training data in MMs, causing privacy leakage. This paper investigates a black-box privacy attack, i.e., membership inference attack (MIA) on multi-modal vision-language models (VLMs). State-of-the-art research analyzes privacy attacks primarily on unimodal AI-ML systems, while recent studies indicate MMs can also be vulnerable to privacy attacks. While researchers have demonstrated that biologically inspired neural network representations can improve unimodal model resilience against adversarial attacks, it remains unexplored whether neuro-inspired MMs are resilient against privacy attacks. In this work, we introduce a systematic neuroscience-inspired topological regularization (tau) framework to analyze MM VLMs' resilience against image-text-based inference privacy attacks. We examine this phenomenon using three VLMs: BLIP, PaliGemma 2, and ViT-GPT2, across three benchmark datasets: COCO, CC3M, and NoCaps. Our experiments compare the resilience of baseline and neuro VLMs (with topological regularization), where the tau > 0 configuration defines the NEURO variant of VLM. Our results on the BLIP model using the COCO dataset illustrate that MIA attack success in NEURO VLMs drops by 24% mean ROC-AUC, while achieving similar model utility (similarities between generated and reference captions) in terms of MPNet and ROUGE-2 metrics. This shows neuro VLMs are comparatively more resilient against privacy attacks, while not significantly compromising model utility. Our extensive evaluation with PaliGemma 2 and ViT-GPT2 models, on two additional datasets: CC3M and NoCaps, further validates the consistency of the findings. This work contributes to the growing understanding of privacy risks in MMs and provides evidence on neuro VLMs' resilience to privacy threats.
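Attack success above is reported as ROC-AUC, which can be read as the probability that the attacker's score for a true training member is higher than its score for a non-member. The sketch below computes that quantity directly from two lists of scores. The scores are hypothetical, and the attack's own scoring function is not described in the abstract and is not modeled here.

```python
def roc_auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count 0.5)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical membership scores from an attacker.
members = [0.9, 0.8]
nonmembers = [0.1, 0.85]
auc = roc_auc(members, nonmembers)
```

An AUC of 0.5 means the attacker does no better than guessing, so a drop in ROC-AUC toward 0.5 corresponds to a less successful attack.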

[48] Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

Inferix Team,Tianyu Feng,Yizeng Han,Jiahao He,Yuanyu He,Xi Lin,Teng Liu,Hanfeng Lu,Jiasheng Tang,Wei Wang,Zhiyuan Wang,Jichao Wu,Mingyang Yang,Yinghao Yu,Zeyu Zhang,Bohan Zhuang

Main category: cs.CV

TL;DR: Inferix是一个专为沉浸式世界合成设计的下一代推理引擎,通过优化半自回归解码实现高效、高质量的长视频生成,支持实时交互和精细评估。

Details Motivation: Existing video generation models struggle to produce long, physically realistic, interactive videos. Standard video diffusion models cannot easily balance efficiency and quality, and current LLM-centric vision foundation models fall short of the dynamic scene modeling that world models need. A dedicated next-generation inference engine for world simulation is therefore needed. Method: Inferix adopts the semi-autoregressive (block-diffusion) decoding paradigm: diffusion generates video tokens within each block while conditioning on previous blocks, with LLM-style KV cache management for efficiency and scalability. It integrates LV-Bench for fine-grained evaluation and supports interactive video streaming and profiling. Result: Inferix yields more coherent and stable long video generation, supports variable-length, efficient, high-quality output with real-time interaction and accurate modeling of world dynamics, and integrates LV-Bench for benchmarking minute-long video generation. Conclusion: As an inference engine built for world models, Inferix supports the shift from current LLM-centric vision models toward world models with stronger perception, understanding, and reasoning, and could serve as a core simulator for embodied AI, agentic AI, and gaming. Abstract: World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in blocks, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.

[49] Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection?

Kun Guo,Yun Shen,Xijun Wang,Chaoqun You,Yun Rui,Tony Q. S. Quek

Main category: cs.CV

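TL;DR: Proposes LTED-Ada, an adaptive video object recognition framework based on deep reinforcement learning that switches between local tracking and edge detection to balance accuracy, delay, and resource overhead.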

Details Motivation: Fast and accurate video object recognition on resource-constrained devices is difficult, and existing hybrid approaches lack an effective policy for scheduling detection versus tracking. Method: Long-term optimization problems are formulated for single-device and multi-device scenarios. LTED-Ada uses deep reinforcement learning to make adaptive decisions, and federated learning enables collaborative training across devices. Result: Hardware-in-the-loop experiments show that LTED-Ada outperforms baseline methods across frame rates and dynamic network conditions, balancing recognition accuracy and delay. Conclusion: By adaptively choosing between detection and tracking, LTED-Ada significantly improves video analytics on resource-constrained devices in edge computing environments, and federated learning further improves its generalization. Abstract: Fast and accurate video object recognition, which relies on frame-by-frame video analytics, remains a challenge for resource-constrained devices such as traffic cameras. Recent advances in mobile edge computing have made it possible to offload computation-intensive object detection to edge servers equipped with high-accuracy neural networks, while lightweight and fast object tracking algorithms run locally on devices. This hybrid approach offers a promising solution but introduces a new challenge: deciding when to perform edge detection versus local tracking. To address this, we formulate two long-term optimization problems for both single-device and multi-device scenarios, taking into account the temporal correlation of consecutive frames and the dynamic conditions of mobile edge networks. Based on the formulation, we propose LTED-Ada in the single-device setting, a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection, according to the frame rate as well as recognition accuracy and delay requirements. In the multi-device setting, we further enhance LTED-Ada using federated learning to enable collaborative policy training across devices, thereby improving its generalization to unseen frame rates and performance requirements. Finally, we conduct extensive hardware-in-the-loop experiments using multiple Raspberry Pi 4B devices and a personal computer as the edge server, demonstrating the superiority of LTED-Ada.

[50] DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

Haibo HU,Lianming Huang,Nan Guan,Chun Jason Xue

Main category: cs.CV

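TL;DR: DeeAD is a training-free, action-guided early-exit framework that speeds up planning in Vision-Language Action (VLA) models by evaluating the physical feasibility of intermediate trajectories, substantially reducing inference latency while preserving planning quality and safety.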

Details Motivation: VLA models unify perception, reasoning, and trajectory generation for autonomous driving, but their deep Transformer stacks cause significant inference latency, which limits real-time use. Method: Proposes the DeeAD framework, which uses lightweight planning priors (such as navigation or low-precision planning) to judge whether an intermediate trajectory is physically feasible and triggers an early exit when the deviation is below 2 m; a multi-hop controller adaptively skips redundant layers based on the rate of change of the scores. Result: On the Bench2Drive benchmark, DeeAD achieves up to 28% transformer-layer sparsity and a 29% latency reduction while preserving planning performance and safety. Conclusion: DeeAD integrates into existing VLA models (such as ORION) without retraining, improves inference efficiency, and offers an efficient option for real-time decision-making in autonomous driving. Abstract: Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., Navigation or Low-precision Planning) within a tolerable deviation (<2m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.
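The early-exit rule described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than DeeAD's implementation: trajectories are lists of 2D waypoints, each layer's intermediate trajectory is assumed to be already decoded, deviation is taken as the maximum point-wise Euclidean distance to the prior, and the multi-hop layer-skipping controller is omitted. The function names `max_deviation` and `early_exit_plan` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def max_deviation(traj: Sequence[Point], prior: Sequence[Point]) -> float:
    """Largest point-wise Euclidean distance (in metres) between two trajectories."""
    return max(math.dist(p, q) for p, q in zip(traj, prior))


def early_exit_plan(
    layer_trajs: List[Sequence[Point]],
    prior: Sequence[Point],
    tol_m: float = 2.0,  # tolerance taken from the "<2m" figure in the abstract
) -> Tuple[int, Sequence[Point]]:
    """Return (layer index, trajectory) for the first layer within tolerance.

    Falls back to the final layer when no intermediate trajectory qualifies.
    """
    for i, traj in enumerate(layer_trajs):
        if max_deviation(traj, prior) < tol_m:
            return i, traj
    return len(layer_trajs) - 1, layer_trajs[-1]


# Hypothetical data: three layers whose trajectories approach the prior.
prior = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
layers = [
    [(0.0, 5.0), (1.0, 5.0), (2.0, 5.0)],  # 5.0 m off: keep going
    [(0.0, 1.5), (1.0, 1.5), (2.0, 1.5)],  # 1.5 m off: exit here
    [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)],
]
exit_layer, plan = early_exit_plan(layers, prior)
```

In a real model the trajectories would be decoded lazily, one layer at a time, so that exiting at layer `i` actually saves the compute of the remaining layers.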

[51] Foundry: Distilling 3D Foundation Models for the Edge

Guillaume Letellier,Siddharth Srivastava,Frédéric Jurie,Gaurav Sharma

Main category: cs.CV

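TL;DR: Proposes Foundation Model Distillation (FMD), a new paradigm for compressing foundation models trained with self-supervised learning so that they can be deployed efficiently while retaining their general-purpose representational power.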

Details Motivation: Large foundation models are hard to deploy on edge devices because of their size and compute cost, and existing compression methods sacrifice generality. Method: Proposes Foundation Model Distillation (FMD), which trains a student model to learn compact SuperTokens by reconstructing the teacher's token-level representations, capturing a compressed basis of its latent space. Result: Implements Foundry, the first FMD method for 3D point clouds; a single distilled model approaches full-model performance on downstream tasks such as classification, part segmentation, and few-shot learning while using significantly fewer tokens and FLOPs. Conclusion: FMD substantially compresses model size while retaining the generality and transferability of foundation models, making them better suited to deployment on resource-constrained devices. Abstract: Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient 'specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach, Foundry, trains a student to learn a compressed set of SuperTokens that reconstruct the teacher's token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks (classification, part segmentation, and few-shot scenarios), approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.

[52] DinoLizer: Learning from the Best for Generative Inpainting Localization

Minh Thong Doi,Jan Butora,Vincent Itier,Jérémie Boulanger,Patrick Bas

Main category: cs.CV

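TL;DR: Proposes DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. The model is pretrained on the B-Free dataset to detect synthetic images and predicts manipulated regions with a linear classification head on the ViT patch embeddings. A sliding-window strategy handles large images, and post-processing refines the binary manipulation masks. Experiments show that DinoLizer outperforms existing state-of-the-art methods across multiple datasets and post-processing conditions, with an average IoU gain of 12%.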

Details Motivation: Existing image manipulation detectors perform poorly at localizing generatively inpainted regions and lack robustness to semantic edits and common post-processing operations, so a stronger and more robust local manipulation detector is needed. Method: Builds on DINOv2, adding a linear classification head on the ViT patch embeddings to predict manipulated regions at a 14×14 patch resolution; uses a sliding-window strategy for large inputs and post-processes the output heatmaps into final binary manipulation masks. The model is pretrained on the B-Free dataset to recognize synthetic images and focuses on semantically altered regions. Result: DinoLizer surpasses existing state-of-the-art methods on multiple generative inpainting datasets with a 12% higher average IoU, and it remains strong after resizing, noise addition, and JPEG compression. The sliding-window and post-processing strategies noticeably improve localization accuracy. The experiments also indicate that DINOv2 offers stronger representations than DINOv3 for this task. Conclusion: DinoLizer exploits the strong visual representations of DINOv2 to localize manipulated regions in generative inpainting with high accuracy and robustness, making it the current state-of-the-art local manipulation detector. Abstract: We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. We add a linear classification head on top of the Vision Transformer's patch embeddings to predict manipulations at a $14\times 14$ patch resolution. The head is trained to focus on semantically altered regions, treating non-semantic edits as part of the original content. Because the ViT accepts only fixed-size inputs, we use a sliding-window strategy to aggregate predictions over larger images; the resulting heatmaps are post-processed to refine the estimated binary manipulation masks. Empirical results show that DinoLizer surpasses state-of-the-art local manipulation detectors on a range of inpainting datasets derived from different generative models. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. On average, DinoLizer achieves a 12\% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing. Our experiments with off-the-shelf DINOv2 demonstrate the strong representational power of Vision Transformers for this task. Finally, extensive ablation studies comparing DINOv2 and its successor, DINOv3, in deepfake localization confirm DinoLizer's superiority. The code will be publicly available upon acceptance of the paper.
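Since the headline result is reported as Intersection-over-Union, a short sketch of how IoU is typically computed for a pair of binary masks may help. This is the generic metric, not code from the paper; the function name `mask_iou` and the convention of returning 1.0 when both masks are empty are assumptions made here.

```python
from typing import Sequence


def mask_iou(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Intersection-over-Union of two binary masks given as nested 0/1 lists."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks agree perfectly.
    return inter / union if union else 1.0


# Toy 2x2 example: one pixel overlaps, three pixels are marked in total.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [1, 0]]
score = mask_iou(pred, target)  # 1 / 3
```

Under this definition a predicted mask is penalized both for missed manipulated pixels and for false alarms, which is why IoU is a common choice for localization tasks.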

[53] CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design

Daeheon Jeong,Seoyeon Byun,Kihoon Son,Dae Hyun Kim,Juho Kim

Main category: cs.CV

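TL;DR: Introduces CANVAS, a new benchmark for evaluating vision-language models (VLMs) on tool-based user interface design; its 598 tasks measure how well VLMs replicate and modify UI designs.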

Details Motivation: VLMs can already operate design software through tool invocation, but no benchmark evaluates their ability to iterate on UI designs inside real design software, so a standardized evaluation framework is needed to reveal their potential and limits. Method: Builds the CANVAS benchmark of 598 tool-based tasks sampled from 3.3K mobile UI designs across 30 function-based categories, split into design replication and design modification tasks that require a VLM to update the UI design step by step through context-aware tool invocations. Result: Experiments show that leading VLMs make more strategic tool invocations, which improves design quality; the study also identifies common error patterns that point to directions for improvement. Conclusion: CANVAS provides an effective benchmark for assessing VLM tool use in realistic design environments, reveals the potential and challenges of current models on UI design tasks, and supports progress toward human-AI collaborative design. Abstract: User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision language models (VLMs) with tool invocation suggest these models can operate design software to edit a UI design through iteration. Understanding and enhancing this capacity is important, as it highlights VLMs' potential to collaborate with designers within conventional software. However, as no existing benchmark evaluates tool-based design performance, the capacity remains unknown. To address this, we introduce CANVAS, a benchmark for VLMs on tool-based user interface design. Our benchmark contains 598 tool-based design tasks paired with ground-truth references sampled from 3.3K mobile UI designs across 30 function-based categories (e.g., onboarding, messaging). In each task, a VLM updates the design step-by-step through context-based tool invocations (e.g., create a rectangle as a button background), linked to design software. Specifically, CANVAS incorporates two task types: (i) design replication evaluates the ability to reproduce a whole UI screen; (ii) design modification evaluates the ability to modify a specific part of an existing screen. Results suggest that leading models exhibit more strategic tool invocations, improving design quality. Furthermore, we identify common error patterns models exhibit, guiding future work in enhancing tool-based design capabilities.

[54] Text-Guided Semantic Image Encoder

Raghuveer Thirukovalluru,Xiaochuang Han,Bhuwan Dhingra,Emily Dinan,Maha Elbayad

Main category: cs.CV

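TL;DR: Proposes the Text-Guided Semantic Image Encoder (TIE), which conditions image encoding on text to improve both the performance and the inference efficiency of vision-language models across multiple tasks.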

Details Motivation: Conventional image encoders are pretrained independently of the text and the downstream task, so they are unaware of the specific query, which limits semantic alignment and efficiency. Method: Designs the Text-Guided Semantic Image Encoder (TIE), which conditions the generation of image representations on the input text query to achieve tighter image-text alignment. Result: Across nine image-to-text benchmarks, TIE improves the average score by +1.5 and +1.3 points at the 1B and 3B model scales, respectively, with gains of up to 6 points on tasks such as DocVQA; it also achieves better performance with only half as many image tiles, notably improving inference efficiency, and generalizes well to generic queries. Conclusion: Text-conditioned training strengthens the image encoder's ability to capture key visual features, improving model performance, efficiency, interpretability, and query-specific grounding. Abstract: Image encoders, a fundamental component of vision-language models (VLMs), are typically pretrained independently before being aligned with a language model. This standard paradigm results in encoders that process images agnostically, without regard to the specific downstream task or text query. To address this limitation, we propose the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on the input text query. VLMs equipped with TIE outperform their conventional counterparts by +1.5 and +1.3 points on average across nine image-to-text benchmarks at the 1B and 3B scales, respectively, with gains reaching up to 6 points on tasks such as DocVQA and InfoVQA. Moreover, TIE-based VLMs attain superior performance while utilizing only half as many image tiles (tokens), resulting in notably improved inference efficiency. TIE also generalizes well with generic queries, indicating that text-conditioned training effectively optimizes the encoder to capture key visual features. Qualitative analysis confirms that TIE consistently attends to query-relevant regions, enhancing both interpretability and query-specific grounding.

[55] One Patch is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues

Sindhuja Penchala,Gavin Money,Gabriel Marques,Samuel Wood,Jessica Kirschman,Travis Atkison,Shahram Rahimi,Noorbakhsh Amiri Golilarz

Main category: cs.CV

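TL;DR: Proposes SMARC, a model that reconstructs and classifies surface materials from a single contiguous patch covering only 10% of the image; it combines a partial convolutional U-Net with a classification head and performs well under extremely sparse visual input.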

Details Motivation: Existing methods rely on dense or full-scene observations and struggle with surface understanding under partial views or constrained environments, so a model that works from sparse visual cues is needed. Method: SMARC pairs a partial convolutional U-Net with a classification head, using partial convolutions for spatial inpainting while simultaneously classifying the material, and requires only a single 10% contiguous image patch as input. Result: On the Touch and Go dataset, SMARC reaches a PSNR of 17.55 dB and a material classification accuracy of 85.10%, outperforming five mainstream models including ViT, MAE, and Swin Transformer. Conclusion: Partial convolution offers clear advantages for spatial reasoning under missing data, and SMARC provides a strong foundation for surface understanding under minimal visual input. Abstract: Understanding material surfaces from sparse visual cues is critical for applications in robotics, simulation, and material perception. However, most existing methods rely on dense or full-scene observations, limiting their effectiveness in constrained or partial-view environments. To address this challenge, we introduce SMARC, a unified model for Surface MAterial Reconstruction and Classification from minimal visual input. Given only a single 10% contiguous patch of the image, SMARC recognizes and reconstructs the full RGB surface while simultaneously classifying the material category. Our architecture combines a Partial Convolutional U-Net with a classification head, enabling both spatial inpainting and semantic understanding under extreme observation sparsity. We compared SMARC against five models including convolutional autoencoders [17], Vision Transformer (ViT) [13], Masked Autoencoder (MAE) [5], Swin Transformer [9], and DETR [2] using the Touch and Go dataset [16] of real-world surface textures. SMARC achieves state-of-the-art results with a PSNR of 17.55 dB and a material classification accuracy of 85.10%. Our findings highlight the advantages of partial convolution in spatial reasoning under missing data and establish a strong foundation for minimal-vision surface understanding.
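To put the reported PSNR in context, recall the standard definition of the metric for a reconstructed image $\hat{I}$ and a reference image $I$ with $N$ pixel values and peak value $\mathrm{MAX}$:

$$
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\bigl(I_i - \hat{I}_i\bigr)^2,
\qquad
\mathrm{PSNR} = 10\,\log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)\ \text{dB}.
$$

As a worked example under the assumption that pixel values are normalized to $[0, 1]$ (so $\mathrm{MAX} = 1$; the abstract does not state the range used), a PSNR of 17.55 dB corresponds to

$$
\mathrm{MSE} = 10^{-17.55/10} \approx 0.0176,
\qquad
\sqrt{\mathrm{MSE}} \approx 0.133,
$$

that is, a root-mean-square error of roughly 13% of the full intensity range when 90% of the image has to be inpainted.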

[56] LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Zuhao Yang,Sudong Wang,Kaichen Zhang,Keming Wu,Sicong Leng,Yifan Zhang,Chengwei Qin,Shijian Lu,Xingxuan Li,Lidong Bing

Main category: cs.CV

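TL;DR: Proposes LongVT, an end-to-end agentic framework that enables "thinking with long videos" through a multimodal chain of tool-augmented thought; it uses the model's own temporal grounding ability for global-to-local reasoning, which mitigates hallucination in long-video understanding.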

Details Motivation: Existing large vision models are prone to hallucination on long videos, especially when evidence is sparse and scattered in time; inspired by how humans watch long videos (skim first, then look closely), a more reliable long-video reasoning framework is needed. Method: Proposes the LongVT framework, which uses interleaved Multimodal Chain-of-Tool-Thought and treats the LMM's own temporal grounding ability as a video cropping tool, iterating from a global skim to local inspection of details; also builds the VideoSIAH dataset for training and evaluation and adopts a three-stage training strategy. Result: LongVT consistently outperforms strong existing baselines on four challenging long-video understanding and reasoning benchmarks; the released VideoSIAH data suite includes 247.9K training samples and 1,280 annotated QA pairs. Conclusion: By imitating human viewing strategies and combining them with tool-augmented reasoning, LongVT markedly improves the accuracy and robustness of large models on long-video understanding and offers an effective way to reduce hallucination. Abstract: Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation.
With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our codes, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .

[57] Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models

Souradeep Dutta,Keshav Bulia,Neena S Nair

Main category: cs.CV

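TL;DR: Presents a lightweight reproduction of the KRISP model with fewer parameters that suits resource-constrained environments, and exposes design flaws and practical problems in the original model.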

Details Motivation: Re-examines whether the industrial-scale KRISP model is usable in resource-constrained settings and explores its potential for deployment on edge devices. Method: Rebuilds a lightweight version through systematic ablation studies, restricts the external knowledge graph domain to prevent AI hallucinations, and validates on synthetic VQA data and the DAQUAR dataset. Result: The reproduced model reaches about 75% of the original model's performance, prevents AI hallucinations, and can run offline on edge devices such as smartphones and AR-VR headsets. Conclusion: Knowledge-enhanced VQA architectures can remain reasonably effective in low-parameter configurations and have potential for use on resource-constrained devices. Abstract: Facebook AI Research introduced KRISP [4], which integrates structured external knowledge into pipelines for vision-language reasoning. Despite its effectiveness, the original model has been developed for industrial-scale training, is computationally demanding, and is tightly connected to a large backbone. In this work, we reexamine KRISP from a different angle and offer a lightweight reproduction with significantly fewer parameters. Even though our replicated model reaches only about 75% of the original's performance, the replication process uncovers a number of design flaws, real-world pitfalls, and implicit problems that were not fully covered in the original paper. We offer insights into the scalability and efficacy of knowledge-enhanced VQA architectures under resource constraints through systematic ablation studies, which include a proof-of-concept on synthetic VQA data and evaluation on the DAQUAR dataset. Our model, configured with a low parameter setup and constrained by the external knowledge graph domain, prevents AI hallucinations and generates outputs solely within that domain. Minimal parameters allow us to function on edge devices like smartphones and AR-VR, further improving offline visual reasoning.

[58] Intriguing Properties of Dynamic Sampling Networks

Dario Morle,Reid Zaffino

Main category: cs.CV

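TL;DR: Proposes a new operator called "warping" that unifies the dynamic sampling methods used in deep learning and analyzes it statistically; the analysis reveals an asymmetry between the forward and backward passes, shows that the operator differs fundamentally from conventional convolution, and examines both the conditions for stable training of dynamic sampling networks and the effects of discretization.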

Details Motivation: Dynamic sampling mechanisms perform well in many vision models but lack a unified framework for theoretical analysis, so a general form is needed to connect and understand these methods. Method: Proposes the "warping" operator as a general form of dynamic sampling, analyzes it statistically by modeling inputs as IID variables and as homogeneous random fields, and introduces a loss landscape visualization based on gradient updates. Result: Reconstructs deformable convolutions, active convolutional units, and spatial transformer networks; finds a distinctive asymmetry between the forward and backward passes; shows that dynamic sampling operators form a class orthogonal to conventional convolutions; identifies the conditions for stable training; and analyzes the statistical impact of discretization. Conclusion: Warping offers a unified theoretical framework for dynamic sampling, exposes how its mechanisms differ fundamentally from conventional convolution, and provides theoretical grounding and practical guidance for designing more stable and efficient dynamic networks. Abstract: Dynamic sampling mechanisms in deep learning architectures have demonstrated utility across many computer vision models, though the theoretical analysis of these structures has not yet been unified. In this paper we connect the various dynamic sampling methods by developing and analyzing a novel operator which generalizes existing methods, which we term "warping". Warping provides a minimal implementation of dynamic sampling which is amenable to analysis, and can be used to reconstruct existing architectures including deformable convolutions, active convolutional units, and spatial transformer networks. Using our formalism, we provide statistical analysis of the operator by modeling the inputs as both IID variables and homogeneous random fields. Extending this analysis, we discover a unique asymmetry between the forward and backward pass of the model training. We demonstrate that these mechanisms represent an entirely different class of orthogonal operators to the traditional translationally invariant operators defined by convolutions. With a combination of theoretical analysis and empirical investigation, we find the conditions necessary to ensure stable training of dynamic sampling networks. In addition, a statistical analysis of discretization effects is presented. Finally, we introduce a novel loss landscape visualization which utilizes gradient update information directly, to better understand learning behavior.

[59] $Δ$-NeRF: Incremental Refinement of Neural Radiance Fields through Residual Control and Knowledge Transfer

Kriti Ghosh,Devjyoti Chakraborty,Lakshmish Ramaswamy,Suchendra M. Bhandarkar,In Kee Kim,Nancy O'Hare,Deepak Mishra

Main category: cs.CV

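TL;DR: Proposes Δ-NeRF, a modular residual framework for incremental NeRF refinement that keeps improving the model without access to past data and suits sequential-observation settings such as satellite remote sensing.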

Details Motivation: Existing NeRFs must be retrained to incorporate new views, which makes them a poor fit for streaming data (such as repeated satellite observations), and they are prone to catastrophic forgetting. Method: Designs a residual controller that injects per-layer corrections into a frozen base NeRF; introduces an uncertainty-aware gating mechanism that adaptively combines base and corrected predictions; uses a view selection strategy to reduce training data; and applies knowledge distillation to compress the model. Result: On satellite imagery, performance is comparable to joint training with 30-42% less training time; PSNR improves by up to 43.5% over fine-tuning and some metrics exceed joint training; the enhanced model can be compressed to 20% of its original size. Conclusion: Δ-NeRF enables efficient, continual incremental NeRF updates, addresses catastrophic forgetting, substantially lowers compute cost, and is suitable for practical deployment in applications such as time-series remote sensing analysis. Abstract: Neural Radiance Fields (NeRFs) have demonstrated remarkable capabilities in 3D reconstruction and novel view synthesis. However, most existing NeRF frameworks require complete retraining when new views are introduced incrementally, limiting their applicability in domains where data arrives sequentially. This limitation is particularly problematic in satellite-based terrain analysis, where regions are repeatedly observed over time. Incremental refinement of NeRFs remains underexplored, and naive approaches suffer from catastrophic forgetting when past data is unavailable. We propose $Δ$-NeRF, a unique modular residual framework for incremental NeRF refinement. $Δ$-NeRF introduces several novel techniques including: (1) a residual controller that injects per-layer corrections into a frozen base NeRF, enabling refinement without access to past data; (2) an uncertainty-aware gating mechanism that prevents overcorrection by adaptively combining base and refined predictions; and (3) a view selection strategy that reduces training data by up to 47\% while maintaining performance. Additionally, we employ knowledge distillation to compress the enhanced model into a compact student network (20\% of original size). Experiments on satellite imagery demonstrate that $Δ$-NeRF achieves performance comparable to joint training while reducing training time by 30-42\%. $Δ$-NeRF consistently outperforms existing baselines, achieving an improvement of up to 43.5\% in PSNR over naive fine-tuning and surpassing joint training on some metrics.

[60] Layer-Aware Video Composition via Split-then-Merge

Ozgur Kara,Yujia Chen,Ming-Hsuan Yang,James M. Rehg,Wen-Sheng Chu,Du Tran

Main category: cs.CV

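TL;DR: Proposes the Split-then-Merge (StM) framework, which decomposes and recomposes unlabeled videos to improve controllability and data efficiency in generative video composition.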

Details Motivation: Addresses the dependence of generative video composition on annotated data or handcrafted rules, as well as its data scarcity problem. Method: Splits a large corpus of unlabeled videos into dynamic foreground and background layers and self-composes them; uses a transformation-aware training pipeline, multi-layer fusion and augmentation, and an identity-preservation loss to achieve controllable composition and foreground fidelity. Result: Outperforms existing state-of-the-art methods on quantitative benchmarks and in human/VLLM-based qualitative evaluations. Conclusion: StM learns complex compositional dynamics effectively and markedly improves the quality and controllability of generated videos. Abstract: We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes a multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods in both quantitative benchmarks and in humans/VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io

[61] SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Md Tanvirul Alam,Saksham Aggarwal,Justin Yang Chae,Nidhi Rastogi

Main category: cs.CV

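TL;DR: Sphinx is a synthetic environment for visual perception and reasoning that generates verifiable puzzles spanning many task types; evaluation shows that current large models perform poorly, while reinforcement learning with verifiable rewards substantially improves performance.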

Details Motivation: Advancing visual perception and multimodal reasoning requires a benchmark environment that supports precise evaluation and covers core cognitive abilities. Method: Proposes the Sphinx environment, which procedurally generates 25 types of puzzles from elements such as motifs, tiles, and charts, and applies reinforcement learning with verifiable rewards (RLVR) to improve model performance. Result: Even the state-of-the-art GPT-5 reaches only 51.1% accuracy on the benchmark, well below human performance, while RLVR substantially improves results on Sphinx and on external visual reasoning tasks. Conclusion: Sphinx provides a scalable, verifiable evaluation platform for visual reasoning, and RLVR is a promising way to improve the reasoning ability of multimodal models. Abstract: We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.

[62] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

Samuele Dell'Erba,Andrew D. Bagdanov

Main category: cs.CV

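TL;DR: Proposes OVI, a training-free and data-free optimization method that replaces the expensive diffusion prior network in text-to-image generation, and introduces two new constraints that improve generation quality; experiments show the method is competitive with existing state-of-the-art priors.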

Details Motivation: Existing text-to-image diffusion models rely on prior networks that are computationally expensive and need large amounts of training data; this work asks whether a trained prior can be avoided altogether. Method: Proposes Optimization-based Visual Inversion (OVI), which initializes a latent representation from random pseudo-tokens and iteratively optimizes it to maximize cosine similarity with the text embedding; Mahalanobis-distance and nearest-neighbor losses are added as regularizing constraints. Result: Experiments on Kandinsky 2.2 show that OVI can replace a conventional prior; the analysis finds that current evaluation benchmarks (such as T2I-CompBench++) are flawed, since using the text embedding alone as the prior already scores highly; the proposed constraints, especially the nearest-neighbor one, perform well in visual fidelity and quantitative metrics. Conclusion: OVI is a promising training-free alternative to learned priors, exposes problems in current evaluation standards, and suggests the direction merits further study. Abstract: Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with the input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
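The core loop of OVI, as described above, is to start from a random latent and push it toward higher cosine similarity with the text embedding. The toy sketch below shows that idea with plain gradient ascent on small vectors. It is an assumption-laden illustration, not the authors' code: real OVI optimizes pseudo-tokens passed through a vision encoder, whereas this sketch optimizes the vector directly, omits the Mahalanobis and nearest-neighbor regularizers, and uses hypothetical names (`cosine`, `cosine_grad`, `invert`) and hyperparameters.

```python
import math
import random
from typing import List


def norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def cosine_grad(z: List[float], t: List[float]) -> List[float]:
    """Gradient of cosine(z, t) with respect to z."""
    nz, nt = norm(z), norm(t)
    dot = sum(x * y for x, y in zip(z, t))
    return [ti / (nz * nt) - dot * zi / (nz ** 3 * nt) for zi, ti in zip(z, t)]


def invert(text_emb: List[float], steps: int = 500, lr: float = 0.5, seed: int = 0) -> List[float]:
    """Gradient-ascent sketch: random init, then maximize cosine similarity."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in text_emb]  # stands in for random pseudo-tokens
    for _ in range(steps):
        grad = cosine_grad(z, text_emb)
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return z


# Hypothetical 8-dimensional "text embedding".
text_emb = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, 0.25]
latent = invert(text_emb)
similarity = cosine(latent, text_emb)
```

Maximizing cosine similarity alone drives the latent toward the direction of the text embedding itself, which is consistent with the abstract's observation that the unconstrained objective needs regularization toward the distribution of realistic images.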

[63] RefTr: Recurrent Refinement of Confluent Trajectories for 3D Vascular Tree Centerline Graphs

Roman Naeem,David Hagerman,Jennifer Alvén,Fredrik Kahl

Main category: cs.CV

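TL;DR: RefTr is a Transformer-based 3D image-to-graph model for generating vascular tree centerlines; by recurrently refining confluent trajectories it achieves high recall and an accurate tree topology.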

Details Motivation: Accurately detecting the centerlines of tubular trees (such as blood vessels and airways) with the correct topology is critical for clinical diagnosis and surgical navigation, and high recall is especially important because missed small branches can lead to serious misdiagnosis. Method: Proposes the RefTr model with a Producer-Refiner architecture based on a Transformer decoder: the Producer generates initial confluent trajectories and the Refiner refines them over multiple iterations; a confluent trajectory representation keeps the tree topology valid, and an efficient non-maximum suppression algorithm merges duplicate branches. Result: Across multiple public centerline datasets, RefTr achieves higher recall than existing methods with comparable precision, faster inference, and 2.4x fewer decoder parameters. Conclusion: RefTr markedly improves recall and efficiency while preserving the correct tree topology, showing potential as a new state-of-the-art framework for vascular tree analysis in 3D medical imaging. Abstract: Tubular trees, such as blood vessels and lung airways, are essential for material transport within the human body. Accurately detecting their centerlines with correct tree topology is critical for clinical tasks such as diagnosis, treatment planning, and surgical navigation. In these applications, maintaining high recall is crucial, as missing small branches can result in fatal mistakes caused by incomplete assessments or undetected abnormalities. We present RefTr, a 3D image-to-graph model for centerline generation of vascular trees via recurrent refinement of confluent trajectories. RefTr uses a Producer-Refiner architecture based on a Transformer decoder, where the Producer proposes a set of initial confluent trajectories that are recurrently refined by the Refiner to produce final trajectories, which form the centerline graph. The confluent trajectory representation enables refinement of complete trajectories while explicitly enforcing a valid tree topology. The recurrent refinement scheme improves precision and reuses the same Refiner block across multiple steps, yielding a 2.4x reduction in decoder parameters compared to previous SOTA. We also introduce an efficient non-maximum suppression algorithm for spatial tree graphs to merge duplicate branches and boost precision. Across multiple public centerline datasets, RefTr achieves superior recall and comparable precision to previous SOTA, while offering faster inference and substantially fewer parameters, demonstrating its potential as a new state-of-the-art framework for vascular tree analysis in 3D medical imaging.

[64] MODEST: Multi-Optics Depth-of-Field Stereo Dataset

Nisarg K. Trivedi,Vinayak A. Belludi,Li-Yun Wang,Pardis Taghavi,Dante Lok

Main category: cs.CV

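TL;DR: Presents the first high-resolution stereo DSLR dataset, with 18000 images that systematically cover many focal lengths and apertures; it targets the realism gap in depth estimation under real optical conditions and supports evaluation of monocular and stereo depth estimation, shallow depth-of-field rendering, and related tasks.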

Details Motivation: Depth estimation research is held back by the lack of large-scale, high-fidelity, real stereo DSLR datasets, so models generalize poorly to real scenes; models trained on synthetic data in particular struggle with the optical complexity of real cameras. Method: Captures 9 complex scenes with two identical DSLR camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), yielding 18000 high-resolution (5472×3648px) stereo images; each optical configuration has its own calibration image set, and the dataset, calibration files, and evaluation code are released. Result: The dataset includes challenging visual elements such as multi-scale optical illusions, reflective surfaces, transparent glass, fine details, and lighting variations; it can be used to assess how geometric and optical effects influence depth estimation, deblurring, 3D reconstruction, and novel view synthesis, and it exposes the limits of current state-of-the-art methods under real optical conditions. Conclusion: The dataset helps bridge the realism gap between synthetic training data and real camera optics and provides a high-fidelity, controllable real-world benchmark for depth estimation and related tasks, supporting generalization to real scenes and reproducible research. Abstract: Reliable depth estimation under real optical conditions remains a core challenge for camera vision in systems such as autonomous robotics and augmented reality. Despite recent progress in depth estimation and depth-of-field rendering, research remains constrained by the lack of large-scale, high-fidelity, real stereo DSLR datasets, limiting real-world generalization and evaluation of models trained on synthetic data as shown extensively in literature. We present the first high-resolution (5472$\times$3648px) stereo DSLR dataset with 18000 images, systematically varying focal length and aperture across complex real scenes and capturing the optical realism and complexity of professional camera systems. For 9 scenes with varying scene complexity, lighting and background, images are captured with two identical camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), spanning 50 optical configurations in 2000 images per scene. This full-range optics coverage enables controlled analysis of geometric and optical effects for monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D scene reconstruction and novel view synthesis. Each focal configuration has a dedicated calibration image set, supporting evaluation of classical and learning based methods for intrinsic and extrinsic calibration.
The dataset features challenging visual elements such as multi-scale optical illusions, reflective surfaces, mirrors, transparent glass walls, fine-grained details, and natural / artificial ambient light variations. This work attempts to bridge the realism gap between synthetic training data and real camera optics, and demonstrates challenges with the current state-of-the-art monocular, stereo depth and depth-of-field methods. We release the dataset, calibration files, and evaluation code to support reproducible research on real-world optical generalization.
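The dataset figures quoted in the abstract are mutually consistent, which a few lines of arithmetic can confirm. The sketch below only restates numbers from the abstract; the variable names are illustrative, and how the 40 images per configuration divide between the two cameras and repeated captures is not stated in the abstract, so no split is assumed.

```python
# Figures taken from the abstract.
num_scenes = 9
num_focal_lengths = 10
num_apertures = 5
images_per_scene = 2000

# Each (focal length, aperture) pair is one optical configuration.
optical_configs = num_focal_lengths * num_apertures          # 50
total_images = num_scenes * images_per_scene                 # 18000
images_per_config_per_scene = images_per_scene // optical_configs  # 40

print(optical_configs, total_images, images_per_config_per_scene)
# prints: 50 18000 40
```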

[65] Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

Sree Bhattacharyya,Yaman Kumar Singla,Sudhir Yarram,Somesh Kumar Singh,Harini S,James Z. Wang

Main category: cs.CV

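TL;DR: Introduces a large-scale unsupervised dataset for modeling memorability signals in visual content, containing more than 82,000 videos with recall descriptions drawn from tip-of-the-tongue (ToT) retrieval queries on platforms such as Reddit, with the aim of advancing visual memorability research.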

Details Motivation: Existing visual memorability datasets are limited by costly human annotation and small scale, and they provide only aggregate scores that miss the fine-grained memory signals present in open-ended recall, so a large-scale dataset rich in natural recall descriptions is needed. Method: Collects tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit to build a large-scale unsupervised dataset of more than 82,000 videos with descriptive recall data, trains a multimodal ToT retrieval model with a contrastive learning strategy, and fine-tunes large vision-language models to generate open-ended memorability descriptions. Result: Large vision-language models fine-tuned on the dataset outperform state-of-the-art models such as GPT-4o at generating memorability descriptions for visual content; the work also delivers the first model capable of multimodal ToT retrieval and shows strong performance on two memorability-related tasks. Conclusion: The study provides the first large-scale unsupervised dataset focused on modeling visual memorability signals, together with corresponding models, opening a new direction for visual memorability research and improving recall generation and ToT retrieval. Abstract: Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that our unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. We also employ a contrastive training strategy to create the first model capable of performing multimodal ToT retrieval. Our dataset and models present a novel direction, facilitating progress in visual content memorability research.

[66] Estimating Fog Parameters from a Sequence of Stereo Images

Yining Ding,João F. C. Mota,Andrew M. Wallace,Sen Wang

Main category: cs.CV

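TL;DR: Proposes a method that dynamically estimates fog parameters from a sequence of stereo foggy images, using joint optimisation to avoid the error accumulation of earlier methods, and builds SDIRF, a first-of-its-kind dataset of real foggy driving scenes, for validation.

Details Motivation: Existing fog parameter estimation methods mostly assume globally homogeneous fog and estimate the parameters sequentially, which propagates errors and copes poorly with the inhomogeneous fog common in real scenes. There is also no real foggy stereo dataset with accurate photometric calibration, which limits the evaluation and improvement of vision systems in fog. Method: A joint optimisation framework uses a sequence of stereo foggy images to estimate all fog parameters (such as atmospheric light and the scattering coefficient) simultaneously, assuming fog is locally rather than globally homogeneous so that real inhomogeneous fog is modelled more accurately. The method is designed as a plug-in module for existing visual SLAM or odometry systems. The authors also build a new dataset, SDIRF, containing high-quality consecutive stereo frames of real foggy scenes and the corresponding clear scenes recorded in overcast weather, the camera's photometric parameters, and a range of visibility conditions. Result: The method outperforms prior methods on both synthetic data and real SDIRF data: it gives more accurate parameter estimates on synthetic data and adapts better to dynamically changing, inhomogeneous fog in real sequences. It can improve the localisation robustness of visual SLAM systems in fog, and SDIRF is an important resource for research on vision in fog. Conclusion: Through joint optimisation and the locally homogeneous fog assumption, the method achieves more accurate and robust dynamic estimation of fog parameters and improves the performance of vision systems in real inhomogeneous fog. The released SDIRF dataset fills a gap in real foggy stereo vision data and advances research on visual perception in fog. Abstract: We propose a method which, given a sequence of stereo foggy images, estimates the parameters of a fog model and updates them dynamically. In contrast with previous approaches, which estimate the parameters sequentially and thus are prone to error propagation, our algorithm estimates all the parameters simultaneously by solving a novel optimisation problem. By assuming that fog is only locally homogeneous, our method effectively handles real-world fog, which is often globally inhomogeneous. The proposed algorithm can be easily used as an add-on module in existing visual Simultaneous Localisation and Mapping (SLAM) or odometry systems in the presence of fog. In order to assess our method, we also created a new dataset, the Stereo Driving In Real Fog (SDIRF), consisting of high-quality, consecutive stereo frames of real, foggy road scenes under a variety of visibility conditions, totalling over 40 minutes and 34k frames. As a first-of-its-kind, SDIRF contains the camera's photometric parameters calibrated in a lab environment, which is a prerequisite for correctly applying the atmospheric scattering model to foggy images. The dataset also includes the counterpart clear data of the same routes recorded in overcast weather, which is useful for companion work in image defogging and depth reconstruction. 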
We conducted extensive experiments using both synthetic foggy data and real foggy sequences from SDIRF to demonstrate the superiority of the proposed algorithm over prior methods. Our method not only produces the most accurate estimates on synthetic data, but also adapts better to real fog. We make our code and SDIRF publicly available to the community (https://github.com/SenseRoboticsLab/estimating-fog-parameters) with the aim of advancing the research on visual perception in fog.
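The abstract refers to the atmospheric scattering model without writing it out. The sketch below shows the standard single-pixel form of that model, I = J·t + A·(1 − t) with transmission t = exp(−β·d), and a naive inversion for the scattering coefficient β. The function names are hypothetical, and the single-pixel inversion is only an illustration; it is not the joint optimisation proposed in the paper.

```python
import math


def foggy_intensity(clear, depth_m, beta, airlight):
    """Standard atmospheric scattering model for one pixel.

    clear:    fog-free intensity J
    depth_m:  scene depth d in metres (e.g. from stereo)
    beta:     scattering coefficient in 1/m
    airlight: atmospheric light A
    """
    transmission = math.exp(-beta * depth_m)
    return clear * transmission + airlight * (1.0 - transmission)


def estimate_beta(observed, clear, depth_m, airlight):
    """Invert the model for beta from a single pixel (illustration only).

    Requires clear != airlight and depth_m > 0.
    """
    transmission = (observed - airlight) / (clear - airlight)
    return -math.log(transmission) / depth_m
```

As depth grows, the observed intensity tends to the atmospheric light, which is why distant objects fade into a uniform grey in fog.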

[67] V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

Jiancheng Pan,Runze Wang,Tianwen Qian,Mohammad Mahdi,Yanwei Fu,Xiangyang Xue,Xiaomeng Huang,Luc Van Gool,Danda Pani Paudel,Yuqian Fu

Main category: cs.CV

TL;DR: This paper presents V^2-SAM, a unified cross-view object correspondence framework that extends SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators, markedly improving object matching under viewpoint and appearance changes.

Details Motivation: Because of large viewpoint and appearance differences, existing segmentation models such as SAM2 are hard to apply directly to cross-view object correspondence, so a method suited to this setting is needed. Method: V^2-SAM comprises two prompt generators. The Cross-View Anchor Prompt Generator (V^2-Anchor), built on DINOv3 features, establishes geometry-aware correspondences and enables coordinate-based prompting of SAM2 for the first time in cross-view scenarios. The Cross-View Visual Prompt Generator (V^2-Visual) aligns ego-exo representations from feature and structural perspectives via a novel visual prompt matcher. A multi-expert design and a Post-hoc Cyclic Consistency Selector (PCCS) adaptively select the most reliable expert. Result: Extensive experiments on Ego-Exo4D, DAVIS-2017, and HANDAL-X validate V^2-SAM, which achieves new state-of-the-art performance. Conclusion: V^2-SAM extends SAM2 to cross-view object correspondence by combining geometric and appearance cues with a cyclic-consistency selection mechanism, performs well in a range of challenging scenarios, and offers new ideas for related research. Abstract: Cross-view object correspondence, exemplified by the representative task of ego-exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V^2-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V^2-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V^2-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego-exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V^2-SAM, achieving new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).

[68] Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation

Taehoon Kim,Henry Gouk,Timothy Hospedales

Main category: cs.CV

TL;DR: Proposes Null-TTA, which performs test-time alignment of diffusion models by optimising the unconditional text embedding, avoiding reward hacking while preserving cross-reward generalisation.

Details Motivation: Existing test-time alignment methods tend to under-optimise or over-optimise (reward hack), so a more stable way to align is needed. Method: Optimise the unconditional text embedding in classifier-free guidance, rather than latent or noise variables, exploiting the semantic structure of the text embedding space. Result: Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation. Conclusion: Semantic-space optimisation is an effective and principled new paradigm for test-time alignment. Abstract: Test-time alignment (TTA) aims to adapt models to specific rewards during inference. However, existing methods tend to either under-optimise or over-optimise (reward hack) the target reward function. We propose Null-Text Test-Time Alignment (Null-TTA), which aligns diffusion models by optimising the unconditional embedding in classifier-free guidance, rather than manipulating latent or noise variables. Due to the structured semantic nature of the text embedding space, this ensures alignment occurs on a semantically coherent manifold and prevents reward hacking (exploiting non-semantic noise patterns to improve the reward). Since the unconditional embedding in classifier-free guidance serves as the anchor for the model's generative distribution, Null-TTA directly steers the model's generative distribution towards the target reward rather than just adjusting the samples, even without updating model parameters. Thanks to these desirable properties, we show that Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation. This establishes semantic-space optimisation as an effective and principled novel paradigm for TTA.
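To see why the unconditional embedding acts as an anchor, it may help to recall how classifier-free guidance combines two noise predictions. The sketch below shows only that standard combination rule on plain lists; `cfg_combine` is a hypothetical name, and the reward-driven optimisation of the null-text embedding described in the abstract is not reproduced here.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    towards the conditional one.

    eps_uncond: noise predicted with the null-text (unconditional) embedding
    eps_cond:   noise predicted with the prompt embedding
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Because every guided prediction is expressed relative to `eps_uncond`, changing the embedding that produces it shifts all guided outputs, which is the lever the abstract describes, while the model weights stay fixed.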

[69] GaINeR: Geometry-Aware Implicit Network Representation

Weronika Jakubowska,Mikołaj Zieliński,Rafał Tobiasz,Krzysztof Byrski,Maciej Zięba,Dominik Belter,Przemysław Spurek

Main category: cs.CV

TL;DR: Proposes GaINeR, a geometry-aware image representation framework that combines trainable Gaussian distributions with a neural implicit representation, supporting continuous representation, interpretable structure, and local editing.

Details Motivation: Traditional implicit neural representations (INRs) lack explicit geometric structure and are hard to edit locally or to integrate with physical simulation, which limits their use in dynamic or interactive settings. Method: GaINeR combines trainable Gaussian distributions with a neural-network-based INR. For a given image coordinate it retrieves the K nearest Gaussians, aggregates distance-weighted embeddings, and predicts the RGB value with a neural network. Result: The design provides continuous image representation, interpretable geometric structure, and flexible local editing, supporting physically aware and interactive image manipulation. Conclusion: GaINeR overcomes the limitations of traditional INRs in geometric structure and interactivity, offering a new direction for dynamic and editable image modeling. Abstract: Implicit Neural Representations (INRs) have become an essential tool for modeling continuous 2D images, enabling high-fidelity reconstruction, super-resolution, and compression. Popular architectures such as SIREN, WIRE, and FINER demonstrate the potential of INR for capturing fine-grained image details. However, traditional INRs often lack explicit geometric structure and have limited capabilities for local editing or integration with physical simulation, restricting their applicability in dynamic or interactive settings. To address these limitations, we propose GaINeR: Geometry-Aware Implicit Network Representation, a novel framework for 2D images that combines trainable Gaussian distributions with a neural network-based INR. For a given image coordinate, the model retrieves the K nearest Gaussians, aggregates distance-weighted embeddings, and predicts the RGB value via a neural network. This design enables continuous image representation, interpretable geometric structure, and flexible local editing, providing a foundation for physically aware and interactive image manipulation. The official implementation of our method is publicly available at https://github.com/WJakubowska/GaINeR.
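The lookup step in the abstract (retrieve the K nearest Gaussians, then aggregate distance-weighted embeddings) can be sketched as follows. This is a minimal illustration under stated assumptions: the isotropic Gaussian kernel with a shared `sigma`, the brute-force neighbour search, and the function name are all hypothetical choices, and the neural network that maps the aggregated embedding to RGB is omitted.

```python
import math


def aggregate_embedding(query, centers, embeddings, k=2, sigma=1.0):
    """Blend the embeddings of the k Gaussians nearest to a 2D query point.

    query:      (x, y) image coordinate
    centers:    list of (x, y) Gaussian means
    embeddings: one feature vector per Gaussian
    sigma:      assumed shared isotropic width used for the weights
    """
    # Squared Euclidean distance from the query to every Gaussian centre.
    dist2 = [sum((q - c) ** 2 for q, c in zip(query, ctr)) for ctr in centers]
    # Indices of the k closest Gaussians (brute force for clarity).
    nearest = sorted(range(len(centers)), key=lambda i: dist2[i])[:k]
    # Closer Gaussians receive larger weights.
    weights = [math.exp(-dist2[i] / (2.0 * sigma ** 2)) for i in nearest]
    total = sum(weights)
    dim = len(embeddings[0])
    return [
        sum(w * embeddings[i][j] for w, i in zip(weights, nearest)) / total
        for j in range(dim)
    ]
```

Because only nearby Gaussians contribute to a pixel, moving or editing one Gaussian changes the image locally, which is consistent with the local-editing property claimed in the abstract.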

[70] A deep learning model to reduce agent dose for contrast-enhanced MRI of the cerebellopontine angle cistern

Yunjie Chen,Rianne A. Weber,Olaf M. Neve,Stephan R. Romeijn,Erik F. Hensen,Jelmer M. Wolterink,Qian Tao,Marius Staring,Berit M. Verbist

Main category: cs.CV

TL;DR: This study evaluates a deep learning model for reducing the contrast agent dose in contrast-enhanced T1-weighted MRI of the cerebellopontine angle. The model improves the image quality of low-dose MRI, so that reliable lesion detection and diagnosis remain possible with only 10%-30% of the standard dose.

Details Motivation: To use deep learning to lower the contrast agent dose needed for contrast-enhanced MRI, reducing patients' risk of side effects while keeping adequate image quality and diagnostic accuracy. Method: A multi-center retrospective study used T1 and contrast-enhanced T1 MRI of vestibular schwannoma patients to simulate low-dose enhanced images at several dose levels, trained deep learning models to restore standard-dose images from the low-dose input, and evaluated image quality and segmentation performance. Result: At 10% input dose, the restored images improved the segmentation metrics (Dice, 95% Hausdorff distance, and average surface distance). The structural similarity index and peak signal-to-noise ratio rose as the input dose increased. The radiologist rated restored images from both 10% and 30% doses as excellent, with the latter considered more informative. Conclusion: The deep learning model improves the image quality of low-dose MRI, making lesion detection and diagnostic characterization possible with only 10%-30% of the standard contrast agent dose, and shows potential for clinical use. Abstract: Objectives: To evaluate a deep learning (DL) model for reducing the agent dose of contrast-enhanced T1-weighted MRI (T1ce) of the cerebellopontine angle (CPA) cistern. Materials and methods: In this multi-center retrospective study, T1 and T1ce of vestibular schwannoma (VS) patients were used to simulate low-dose T1ce with varying reductions of contrast agent dose. DL models were trained to restore standard-dose T1ce from the low-dose simulation. The image quality and segmentation performance of the DL-restored T1ce were evaluated. A head and neck radiologist was asked to rate DL-restored images in multiple aspects, including image quality and diagnostic characterization. Results: 203 MRI studies from 72 VS patients (mean age, 58.51 ± 14.73, 39 men) were evaluated. As the input dose increased, the structural similarity index measure of the restored T1ce increased from 0.639 ± 0.113 to 0.993 ± 0.009, and the peak signal-to-noise ratio increased from 21.6 ± 3.73 dB to 41.4 ± 4.84 dB. At 10% input dose, using DL-restored T1ce for segmentation improved the Dice from 0.673 to 0.734, the 95% Hausdorff distance from 2.38 mm to 2.07 mm, and the average surface distance from 1.00 mm to 0.59 mm. Both DL-restored T1ce from 10% and 30% input doses showed excellent images, with the latter being considered more informative. Conclusion: The DL model improved the image quality of low-dose MRI of the CPA cistern, which makes lesion detection and diagnostic characterization possible with 10% - 30% of the standard dose.
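Two of the metrics quoted above, the Dice coefficient and the peak signal-to-noise ratio, have short textbook definitions. The sketch below computes both on flattened lists so the reported numbers are easier to interpret; the function names and the `max_value` default are assumptions, and this is not the study's evaluation code.

```python
import math


def dice(pred, truth):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks are treated as a perfect match.
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB for intensities scaled to [0, max_value]."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)
```

On this scale a Dice of 1.0 means identical masks, and each 10 dB of PSNR corresponds to a tenfold drop in mean squared error.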

[71] Smooth regularization for efficient video recognition

Gil Goldman,Raja Giryes,Mahadev Satyanarayanan

Main category: cs.CV

TL;DR: Proposes a smooth regularization method based on a Gaussian Random Walk (GRW) that strengthens the temporal inductive bias of video recognition models and markedly improves lightweight models on Kinetics-600.

Details Motivation: Lightweight video models struggle to capture complex temporal dynamics and need a stronger temporal inductive bias that exploits the temporal coherence inherent in video. Method: Changes in the intermediate-layer embeddings of consecutive frames are modeled as a Gaussian Random Walk (GRW), which penalizes abrupt representational shifts and encourages low-acceleration, smooth feature evolution. Result: On Kinetics-600, the MoViNets family improves by 3.8% to 6.1%, and MobileNetV3 and the MoViNets-Stream family improve by 4.9% to 6.4%, each surpassing the prior state of the art under its FLOP or memory constraint. Conclusion: The proposed smooth regularization strengthens the temporal modeling of lightweight video models, improves performance across several architectures, and advances efficient video recognition. Abstract: We propose a smooth regularization technique that instills a strong temporal inductive bias in video recognition models, particularly benefiting lightweight architectures. Our method encourages smoothness in the intermediate-layer embeddings of consecutive frames by modeling their changes as a Gaussian Random Walk (GRW). This penalizes abrupt representational shifts, thereby promoting low-acceleration solutions that better align with the natural temporal coherence inherent in videos. By leveraging this enforced smoothness, lightweight models can more effectively capture complex temporal dynamics. Applied to such models, our technique yields a 3.8% to 6.4% accuracy improvement on Kinetics-600. Notably, the MoViNets model family trained with our smooth regularization improves the current state of the art by 3.8% to 6.1% within their respective FLOP constraints, while MobileNetV3 and the MoViNets-Stream family achieve gains of 4.9% to 6.4% over prior state-of-the-art models with comparable memory footprints. Our code and models are available at https://github.com/gilgoldm/grw-smoothing.
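One way to read "modeling their changes as a Gaussian Random Walk" together with "low-acceleration solutions" is that frame-to-frame embedding changes should themselves vary only by small Gaussian steps, which amounts to penalizing second differences. The sketch below implements that reading as a plain penalty term; it is an assumption about the loss, not the authors' exact formulation, and `smoothness_penalty` is a hypothetical name.

```python
def smoothness_penalty(embeddings):
    """Mean squared second difference of a sequence of frame embeddings.

    embeddings: list of per-frame feature vectors in temporal order.
    A trajectory that moves at constant velocity incurs zero penalty.
    """
    total = 0.0
    triples = zip(embeddings, embeddings[1:], embeddings[2:])
    for prev, cur, nxt in triples:
        # Discrete acceleration: e[t+1] - 2 * e[t] + e[t-1], summed over dimensions.
        total += sum((n - 2.0 * c + p) ** 2 for p, c, n in zip(prev, cur, nxt))
    return total / max(len(embeddings) - 2, 1)
```

In training, a term like this would typically be added to the classification loss with a small weight, so that smoothness shapes the features without dominating the objective.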

[72] Open Vocabulary Compositional Explanations for Neuron Alignment

Biagio La Rosa,Leilani H. Gilpin

Main category: cs.CV

TL;DR: Proposes a framework based on open vocabulary semantic segmentation that generates compositional explanations of neurons for arbitrary concepts in the vision domain, removing the dependence on human-annotated data.

Details Motivation: Existing compositional explanation methods rely on human-annotated data, which restricts them to specific domains and predefined concepts. This paper aims to remove that restriction. Method: The framework has three steps: specifying arbitrary concepts, generating semantic segmentation masks with open vocabulary models, and deriving compositional explanations from those masks. Result: Experiments show that the method outperforms previous methods on quantitative metrics and human interpretability, and supports flexible explanations across tasks and properties. Conclusion: The framework broadens the applicability of compositional explanations, making it possible to probe how neurons encode arbitrary concepts without human annotation. Abstract: Neurons are the fundamental building blocks of deep neural networks, and their interconnections allow AI to achieve unprecedented results. Motivated by the goal of understanding how neurons encode information, compositional explanations leverage logical relationships between concepts to express the spatial alignment between neuron activations and human knowledge. However, these explanations rely on human-annotated datasets, restricting their applicability to specific domains and predefined concepts. This paper addresses this limitation by introducing a framework for the vision domain that allows users to probe neurons for arbitrary concepts and datasets. Specifically, the framework leverages masks generated by open vocabulary semantic segmentation to compute open vocabulary compositional explanations. The proposed framework consists of three steps: specifying arbitrary concepts, generating semantic segmentation masks using open vocabulary models, and deriving compositional explanations from these masks. The paper compares the proposed framework with previous methods for computing compositional explanations both in terms of quantitative metrics and human interpretability, analyzes the differences in explanations when shifting from human-annotated data to model-annotated data, and showcases the additional capabilities provided by the framework in terms of flexibility of the explanations with respect to the tasks and properties of interest.

[73] UruDendro4: A Benchmark Dataset for Automatic Tree-Ring Detection in Cross-Section Images of Pinus taeda L

Henry Marichal,Joaquin Blanco,Diego Passarella,Gregory Randall

Main category: cs.CV

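TL;DR: This paper introduces the UruDendro4 dataset of 102 wood cross-section images of loblolly pine (Pinus taeda L.) with manually annotated growth rings, which supports volumetric modeling of ring growth from samples taken at multiple heights. It also provides a performance baseline for automatic ring detection, in which DeepCS-TRD performs best, and shows that the dataset improves model generalization in ring detection.

Details Motivation: Existing wood cross-section data are scarce and mostly limited to a single height, which makes three-dimensional volumetric modeling of tree rings difficult. The paper aims to build a high-quality public dataset sampled at multiple heights to advance automatic ring detection and growth modeling. Method: The UruDendro4 dataset contains 102 Pinus taeda samples collected at different heights along the stem, each with manually annotated rings. State-of-the-art ring detection methods are benchmarked on it, ablation experiments validate the parameter configuration, and the effect of the dataset on model generalization is evaluated. Result: DeepCS-TRD achieves the highest performance on UruDendro4, with a mean Average Precision of 0.838, a mean Average Recall of 0.782, and an Adapted Rand Error of 0.084. Experiments show that including the dataset in training improves generalization in tree-ring detection. Conclusion: UruDendro4 is a distinctive and valuable dataset that supports multi-height ring analysis and volumetric modeling, fills a gap in available data, and provides a basis for developing automatic ring detection algorithms. Abstract: Tree-ring growth represents the annual wood increment for a tree, and quantifying it allows researchers to assess which silvicultural practices are best suited for each species. Manual measurement of this growth is time-consuming and often imprecise, as it is typically performed along 4 to 8 radial directions on a cross-sectional disc. In recent years, automated algorithms and datasets have emerged to enhance accuracy and automate the delineation of annual rings in cross-sectional images. To address the scarcity of wood cross-section data, we introduce the UruDendro4 dataset, a collection of 102 image samples of Pinus taeda L., each manually annotated with annual growth rings. Unlike existing public datasets, UruDendro4 includes samples extracted at multiple heights along the stem, allowing for the volumetric modeling of annual growth using manually delineated rings. This dataset (images and annotations) allows the development of volumetric models for annual wood estimation based on cross-sectional imagery. Additionally, we provide a performance baseline for automatic ring detection on this dataset using state-of-the-art methods. The highest performance was achieved by the DeepCS-TRD method, with a mean Average Precision of 0.838, a mean Average Recall of 0.782, and an Adapted Rand Error score of 0.084. A series of ablation experiments were conducted to empirically validate the final parameter configuration. 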
Furthermore, we empirically demonstrate that training a learning model including this dataset improves the model's generalization in the tree-ring detection task.

[74] BUSTR: Breast Ultrasound Text Reporting with a Descriptor-Aware Vision-Language Model

Rawa Mohammed,Mina Attin,Bryar Shareef

Main category: cs.CV

TL;DR: Proposes BUSTR, a multitask vision-language framework for automatic breast ultrasound report generation that needs no paired image-report supervision and improves generation quality and clinical efficacy by combining token-level cross-entropy with a representation alignment loss.

Details Motivation: Automatic report generation for breast ultrasound is limited by the lack of paired image-report datasets and by hallucinations from large language models. Method: BUSTR builds reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features, learns descriptor-aware visual representations with a multi-head Swin encoder, and aligns visual and textual tokens through a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss. Result: On two public datasets, BrEaST and BUS-BRA, BUSTR consistently improves standard natural language generation metrics and clinical efficacy metrics, particularly for BI-RADS category and pathology. Conclusion: Without paired image-report data, the descriptor-aware vision model and combined loss improve both the quality and the clinical usefulness of automatic report generation. Abstract: Automated radiology report generation (RRG) for breast ultrasound (BUS) is limited by the lack of paired image-report datasets and the risk of hallucinations from large language models. We propose BUSTR, a multitask vision-language framework that generates BUS reports without requiring paired image-report supervision. BUSTR constructs reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features, learns descriptor-aware visual representations with a multi-head Swin encoder trained using a multitask loss over dataset-specific descriptor sets, and aligns visual and textual tokens via a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss between input and output representations. We evaluate BUSTR on two public BUS datasets, BrEaST and BUS-BRA, which differ in size and available descriptors. Across both datasets, BUSTR consistently improves standard natural language generation metrics and clinical efficacy metrics, particularly for key targets such as BI-RADS category and pathology. Our results show that this descriptor-aware vision model, trained with a combined token-level and alignment loss, improves both automatic report metrics and clinical efficacy without requiring paired image-report data. The source code can be found at https://github.com/AAR-UNLV/BUSTR

[75] Beyond Realism: Learning the Art of Expressive Composition with StickerNet

Haoming Lu,David Kocharian,Humphrey Shi

Main category: cs.CV

TL;DR: This paper defines the expressive image composition task, which aims to reflect how users actually edit on real creative platforms rather than to pursue visual realism. The authors design the StickerNet framework, build a dataset from 1.8 million real user editing actions, and use a two-stage method to predict the composition type and placement parameters of stickers. Experiments show that the method outperforms baselines and more closely matches human creative habits.

Details Motivation: Traditional image composition emphasizes visual realism and semantic plausibility, but many real user edits are artistic, playful, or social rather than realistic. A new formulation of image composition is needed to reflect this expressive intent. Method: StickerNet is a two-stage framework: the first stage determines the composition type, and the second predicts parameters such as opacity, mask, location, and scale. The dataset is built from 1.8 million real editing actions collected on an anonymous online platform and directly reflects placement decisions validated by the user community. Result: User studies and quantitative evaluations show that StickerNet outperforms common baselines at matching human placement behavior, confirming the value of learning from real-world editing patterns. Conclusion: The work introduces a new direction in visual understanding that emphasizes expressiveness and user intent over realism, offering an approach to image composition that is closer to practical use. Abstract: As a widely used operation in image editing workflows, image composition has traditionally been studied with a focus on achieving visual realism and semantic plausibility. However, in practical editing scenarios of the modern content creation landscape, many compositions are not intended to preserve realism. Instead, users of online platforms motivated by gaining community recognition often aim to create content that is more artistic, playful, or socially engaging. Taking inspiration from this observation, we define the expressive composition task, a new formulation of image composition that embraces stylistic diversity and looser placement logic, reflecting how users edit images on real-world creative platforms. To address this underexplored problem, we present StickerNet, a two-stage framework that first determines the composition type, then predicts placement parameters such as opacity, mask, location, and scale accordingly. Unlike prior work that constructs datasets by simulating object placements on real images, we directly build our dataset from 1.8 million editing actions collected on an anonymous online visual creation and editing platform, each reflecting user-community validated placement decisions. This grounding in authentic editing behavior ensures strong alignment between task definition and training supervision. User studies and quantitative evaluations show that StickerNet outperforms common baselines and closely matches human placement behavior, demonstrating the effectiveness of learning from real-world editing patterns despite the inherent ambiguity of the task. 
This work introduces a new direction in visual understanding that emphasizes expressiveness and user intent over realism.

[76] TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

Md Adnan Arefeen,Biplob Debnath,Srimat Chakradhar

Main category: cs.CV

TL;DR: Proposes the TrafficLens algorithm, which uses sequential processing and object-level similarity detection to cut the time needed to convert multi-camera traffic video to text while keeping the information accurate.

Details Motivation: Existing methods convert traffic video data to text inefficiently, which delays timely analysis and use of traffic video. Method: A sequential approach exploits the overlapping coverage areas of cameras, iteratively applies vision-language models (VLMs) with different token limits, feeds earlier outputs as prompts for later cameras, and uses an object-level similarity detector to skip redundant VLM calls. Result: Experiments show that TrafficLens reduces video-to-text conversion time by up to 4× while maintaining information accuracy. Conclusion: TrafficLens offers an efficient, fast, and accurate video analysis solution for multi-camera traffic intersections and improves how traffic video data is used. Abstract: Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (LLMs) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing overlapping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to 4× while maintaining information accuracy.
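The control flow described above (process cameras one after another, pass the previous description forward as a prompt, and skip a camera whose objects are already covered) can be sketched as follows. Everything here is hypothetical scaffolding: the Jaccard similarity on object labels, the `threshold` value, and the `describe` callable standing in for a VLM are assumptions, and the varying token limits mentioned in the abstract are not modeled.

```python
def jaccard(a, b):
    """Overlap between two collections of object labels, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def describe_cameras(camera_objects, describe, threshold=0.8):
    """Describe cameras in order, skipping views that look redundant.

    camera_objects: one list of detected object labels per camera (assumed input)
    describe:       callable(objects, context) -> str, a stand-in for a VLM call
    Returns one description per camera, or None where the call was skipped.
    """
    outputs = []
    seen = []      # object lists that have already been described
    context = ""   # most recent description, reused as the next prompt
    for objects in camera_objects:
        if any(jaccard(objects, prev) >= threshold for prev in seen):
            outputs.append(None)  # redundant view: no VLM invocation
            continue
        context = describe(objects, context)
        outputs.append(context)
        seen.append(objects)
    return outputs
```

The saving comes from the skipped calls: each camera whose content overlaps strongly with an earlier one costs a cheap set comparison instead of a full VLM invocation.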

[77] Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI

Al Amin,Kamrul Hasan,Liang Hong,Sharif Ullah

Main category: cs.CV

TL;DR: Proposes a privacy-preserving federated learning framework that combines Vision Transformers with homomorphic encryption for secure histopathology image classification across medical institutions, encrypting the CLS token to make model aggregation both efficient and secure.

Details Motivation: In conventional federated learning, model gradients are vulnerable to reconstruction attacks, which threatens the privacy of medical data, so a stronger privacy mechanism is needed. Method: The ViT CLS token serves as a compact feature representation. CLS tokens are encrypted with CKKS homomorphic encryption before transmission, and inference and aggregation run directly on ciphertexts. Result: Compared with gradient encryption, the method reduces communication 30-fold. In a three-client setup, conventional gradients are vulnerable to inversion attacks (PSNR: 52.26 dB, SSIM: 0.999), whereas the proposed method prevents such attacks and needs only 326 KB of encrypted data per aggregation round. Global classification accuracy is 96.12% in the unencrypted domain and 90.02% in the encrypted domain. Conclusion: The framework enables efficient cross-institution collaborative learning with strong privacy and suits sensitive medical image analysis. Abstract: Collaborative machine learning across healthcare institutions promises improved diagnostic accuracy by leveraging diverse datasets, yet privacy regulations such as HIPAA prohibit direct patient data sharing. While federated learning (FL) enables decentralized training without raw data exchange, recent studies show that model gradients in conventional FL remain vulnerable to reconstruction attacks, potentially exposing sensitive medical information. This paper presents a privacy-preserving federated learning framework combining Vision Transformers (ViT) with homomorphic encryption (HE) for secure multi-institutional histopathology classification. The approach leverages the ViT CLS token as a compact 768-dimensional feature representation for secure aggregation, encrypting these tokens using CKKS homomorphic encryption before transmission to the server. We demonstrate that encrypting CLS tokens achieves a 30-fold communication reduction compared to gradient encryption while maintaining strong privacy guarantees. Through evaluation on a three-client federated setup for lung cancer histopathology classification, we show that gradients are highly susceptible to model inversion attacks (PSNR: 52.26 dB, SSIM: 0.999, NMI: 0.741), enabling near-perfect image reconstruction. In contrast, the proposed CLS-protected HE approach prevents such attacks while enabling encrypted inference directly on ciphertexts, requiring only 326 KB of encrypted data transmission per aggregation round. The framework achieves 96.12 percent global classification accuracy in the unencrypted domain and 90.02 percent in the encrypted domain.

[78] Inversion-Free Style Transfer with Dual Rectified Flows

Yingying Deng,Xiangyu He,Fan Tang,Weiming Dong,Xucheng Yin

Main category: cs.CV

TL;DR: Proposes an inversion-free style transfer framework based on dual rectified flows that achieves efficient, high-quality image style transfer using only forward passes.

Details Motivation: Existing training-free diffusion-based methods rely on computationally expensive inversion, and inaccurate inversion causes visual distortions, so a more efficient and stable style transfer method is needed. Method: A dual rectified flow framework predicts content and style trajectories in parallel and fuses their velocity fields through dynamic midpoint interpolation. Attention injection guides style integration, and the whole process needs only forward passes with no inversion. Result: The method performs well across diverse style and content combinations, with better content preservation, higher visual fidelity, and faster inference. Conclusion: The method addresses the efficiency and distortion problems of conventional diffusion-based style transfer and offers an efficient, general, inversion-free alternative. Abstract: Style transfer, a pivotal task in image processing, synthesizes visually compelling images by seamlessly blending realistic content with artistic styles, enabling applications in photo editing and creative design. While mainstream training-free diffusion-based methods have greatly advanced style transfer in recent years, their reliance on computationally expensive inversion processes compromises efficiency and introduces visual distortions when inversion is inaccurate. To address these limitations, we propose a novel inversion-free style transfer framework based on dual rectified flows, which tackles the challenge of finding an unknown stylized distribution from two distinct inputs (content and style images), using only forward passes. Our approach predicts content and style trajectories in parallel, then fuses them through a dynamic midpoint interpolation that integrates velocities from both paths while adapting to the evolving stylized image. By jointly modeling the content, style, and stylized distributions, our velocity field design achieves robust fusion and avoids the shortcomings of naive overlays. Attention injection further guides style integration, enhancing visual fidelity, content preservation, and computational efficiency. Extensive experiments demonstrate generalization across diverse styles and content, providing an effective and efficient pipeline for style transfer.

[79] RefOnce: Distilling References into a Prototype Memory for Referring Camouflaged Object Detection

Yu-Huan Wu,Zi-Xuan Zhu,Yan Wang,Liangli Zhen,Deng-Ping Fan

Main category: cs.CV

TL;DR: Proposes a new Ref-COD framework that distills reference information into a class-prototype memory during training, so that a guidance vector can be generated at inference without reference images, giving efficient and simple camouflaged object detection.

Details Motivation: Existing Ref-COD methods depend on reference images at test time, which makes deployment harder and adds latency and data-collection burden. Method: An EMA-updated class-prototype memory is maintained, mixture weights over prototypes are predicted from the query to produce a reference vector, and a bidirectional attention alignment module narrows the representation gap between reference statistics and query features. Result: Experiments on the large-scale R2C7K benchmark show performance competitive with or superior to recent state-of-the-art methods. Conclusion: The method offers a simple and efficient Ref-COD solution that needs no test-time reference images, which helps practical deployment. Abstract: Referring Camouflaged Object Detection (Ref-COD) segments specified camouflaged objects in a scene by leveraging a small set of referring images. Though effective, current systems adopt a dual-branch design that requires reference images at test time, which limits deployability and adds latency and data-collection burden. We introduce a Ref-COD framework that distills references into a class-prototype memory during training and synthesizes a reference vector at inference via a query-conditioned mixture of prototypes. Concretely, we maintain an EMA-updated prototype per category and predict mixture weights from the query to produce a guidance vector without any test-time references. To bridge the representation gap between reference statistics and camouflaged query features, we propose a bidirectional attention alignment module that adapts both the query features and the class representation. Thus, our approach yields a simple, efficient path to Ref-COD without mandatory references. We evaluate the proposed method on the large-scale R2C7K benchmark. Extensive experiments demonstrate competitive or superior performance of the proposed method compared with recent state-of-the-art methods. Code is available at https://github.com/yuhuan-wu/RefOnce.
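The two memory operations named in the abstract, an EMA-updated prototype per category and a query-conditioned mixture of prototypes, can be sketched in a few lines. The dot-product scoring followed by a softmax is an assumption standing in for the learned weight predictor, and the momentum value and function names are hypothetical.

```python
import math


def ema_update(prototype, feature, momentum=0.9):
    """Training time: move a class prototype slightly towards a new reference feature."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(prototype, feature)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guidance_vector(query, prototypes):
    """Inference time: mix stored prototypes with weights derived from the query.

    Dot-product similarity is used here in place of the learned predictor.
    """
    scores = [sum(q * p for q, p in zip(query, proto)) for proto in prototypes]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * proto[j] for w, proto in zip(weights, prototypes)) for j in range(dim)]
```

Since the prototypes are fixed after training, inference needs only the query features, which is what removes the test-time reference branch.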

[80] Wavefront-Constrained Passive Obscured Object Detection

Zhiwen Zheng,Yiwei Ouyang,Zhao Huang,Tao Zhang,Xiaoshuai Zhang,Huiyu Zhou,Wenwen Tang,Shaowei Jiang,Jin Liu,Xingru Huang

Main category: cs.CV

TL;DR: Proposes WavePCNet, a physics-driven network that simulates wavefront propagation to enhance the perception of obscured objects, achieving more accurate and robust non-line-of-sight imaging under low signal-to-noise ratios and in complex media.

Details Motivation: Existing methods ignore complex-amplitude properties when modeling coherent light propagation, tend to produce non-physical solutions at low signal-to-noise ratios, and cope poorly with multiple scattering and medium perturbations, so a highly robust method that better incorporates the physics is needed. Method: WavePCNet introduces the Tri-Phase Wavefront Complex-Propagation Reprojection (TriWCP) module, which uses complex amplitude transfer operators to constrain coherent propagation precisely, together with a momentum memory mechanism that suppresses the accumulation of perturbations. A High-frequency Cross-layer Compensation Enhancement module builds frequency-selective pathways with multi-scale receptive fields and dynamically models structural consistency across layers. Result: Experiments on four physically collected datasets show that WavePCNet outperforms state-of-the-art methods at localizing and segmenting obscured objects, with higher accuracy and robustness. Conclusion: By tightly combining physical models with deep learning, WavePCNet improves non-line-of-sight imaging under complex conditions and strengthens the model's interpretability and stability. Abstract: Accurately localizing and segmenting obscured objects from faint light patterns beyond the field of view is highly challenging due to multiple scattering and medium-induced perturbations. Most existing methods, based on real-valued modeling or local convolutional operations, are inadequate for capturing the underlying physics of coherent light propagation. Moreover, under low signal-to-noise conditions, these methods often converge to non-physical solutions, severely compromising the stability and reliability of the observation. To address these challenges, we propose a novel physics-driven Wavefront Propagating Compensation Network (WavePCNet) to simulate wavefront propagation and enhance the perception of obscured objects. This WavePCNet integrates the Tri-Phase Wavefront Complex-Propagation Reprojection (TriWCP) to incorporate complex amplitude transfer operators to precisely constrain coherent propagation behavior, along with a momentum memory mechanism to effectively suppress the accumulation of perturbations. Additionally, a High-frequency Cross-layer Compensation Enhancement is introduced to construct frequency-selective pathways with multi-scale receptive fields and dynamically model structural consistency across layers, further boosting the model's robustness and interpretability under complex environmental conditions. Extensive experiments conducted on four physically collected datasets demonstrate that WavePCNet consistently outperforms state-of-the-art methods across both accuracy and robustness.

[81] GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision

Yuxiao Xiang,Junchi Chen,Zhenchao Jin,Changtao Miao,Haojie Yuan,Qi Chu,Tao Gong,Nenghai Yu

Main category: cs.CV

TL;DR: This paper presents GuardTrace-VL, a vision-aware safety auditor for multimodal large reasoning models that jointly analyzes image and text content to monitor the full question-answering reasoning chain and detect unsafe content that arises during reasoning.

Details Motivation: Existing multimodal safety guards focus mainly on the input question and the final answer and overlook harmful content in the intermediate reasoning (such as biased inferences or policy-violating use of visual information), which creates deployment risks. A safety auditing method that monitors the whole reasoning chain is therefore needed. Method: GuardTrace-VL jointly analyzes images and text across the full Question-Thinking-Answer (QTA) pipeline. The GuardTrace dataset is generated with diverse prompting strategies and refined through voting and verification by large models and humans. A three-stage progressive training scheme lets the model learn fine-grained safety preferences for different risk levels. Result: On a test set covering in-domain and out-of-domain scenarios, GuardTrace-VL reaches an F1 score of 93.1% on unsafe reasoning detection, a 13.5% improvement over the previous strongest multimodal safety defense methods. Conclusion: GuardTrace-VL identifies potential safety risks in multimodal reasoning, clearly outperforms existing methods, and supports safe and controllable deployment of multimodal large models. Abstract: Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via an MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, the GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The code will be made publicly available.

[82] From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition

Jingxi Chen,Yixiao Zhang,Xiaoye Qian,Zongxia Li,Cornelia Fermuller,Caren Chen,Yiannis Aloimonos

Main category: cs.CV

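TL;DR: Proposes a diffusion-based image layer decomposition method that uses lightweight finetuning and a novel multi-modal context fusion module, is trained on synthetic data, and achieves high-quality object removal and occlusion recovery.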

Details Motivation: Existing single-image layer decomposition methods are limited by data and model capacity and struggle to separate foreground from background while preserving detail. Method: Adapts a diffusion-based inpainting model with lightweight finetuning and adds a multi-modal context fusion module with linear attention complexity to preserve more detail in the latent space. Result: After training on a synthetic dataset, the model shows superior performance on object removal and occlusion recovery. Conclusion: The method offers an effective layer decomposition solution for image editing and creative applications, achieving high-quality results without real annotated data. Abstract: Images can be viewed as layered compositions, foreground objects over background, with potential occlusions. This layered representation enables independent editing of elements, offering greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightweight finetuning. To further preserve detail in the latent space, we introduce a novel multi-modal context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.

[83] Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning

Xiaoxing You,Qiang Huang,Lingyu Li,Chi Zhang,Xiaopeng Liu,Min Zhang,Jun Yu

Main category: cs.CV

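TL;DR: Proposes MERGE, a multimodal entity-aware retrieval-augmented generation framework for news image captioning. By building an entity-centric multimodal knowledge base, it improves information coverage, cross-modal alignment, and visual-entity grounding, and significantly outperforms existing methods on several datasets.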

Details Motivation: Existing news image captioning methods fall short in information coverage, cross-modal alignment, and visual-entity grounding, and struggle to combine context effectively into high-quality captions. Method: Proposes the MERGE framework, which builds an entity-centric multimodal knowledge base (EMKB) integrating textual, visual, and structured knowledge, improves cross-modal alignment with a multistage hypothesis-caption strategy, and strengthens visual-entity matching through image-guided dynamic retrieval. Result: On GoodNews and NYTimes800k, CIDEr improves by 6.84 and 1.16 and named entity recognition F1-score by 4.14 and 2.64; on the unseen Visual News dataset, CIDEr improves by 20.17 and F1-score by 6.22, showing strong robustness and domain adaptability. Conclusion: MERGE addresses the key challenges of news image captioning with superior generation quality and generalization, and is the first entity-aware multimodal framework to support retrieval-augmented generation. Abstract: News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.

[84] MetaRank: Task-Aware Metric Selection for Model Transferability Estimation

Yuhang Liu,Wenjie Zhao,Yunhui Guo

Main category: cs.CV

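TL;DR: This paper proposes MetaRank, a meta-learning framework for selecting model transferability estimation metrics. It embeds textual descriptions of datasets and metrics into a shared semantic space to rank metrics automatically in a task-aware way.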

Details Motivation: The effectiveness of existing Model Transferability Estimation (MTE) methods is highly task-dependent, and there is no mechanism for adaptively choosing the best MTE metric for a given target dataset. Method: Formulates MTE metric selection as a learning-to-rank problem, encodes textual descriptions of datasets and MTE metrics with a pretrained language model, and trains an offline meta-predictor in the shared semantic space, optimized with a listwise objective that prioritizes correctly ranking the top-performing metrics. Result: Experiments on 11 pretrained models and 11 target datasets show that MetaRank effectively selects the best MTE metric for new tasks, significantly outperforming selection strategies that are fixed or driven by average performance. Conclusion: MetaRank enables task-aware, automated MTE metric selection and improves the efficiency and accuracy of source model selection in transfer learning. Abstract: Selecting an appropriate pre-trained source model is a critical, yet computationally expensive, task in transfer learning. Model Transferability Estimation (MTE) methods address this by providing efficient proxy metrics to rank models without full fine-tuning. In practice, the choice of which MTE metric to use is often ad hoc or guided simply by a metric's average historical performance. However, we observe that the effectiveness of MTE metrics is highly task-dependent and no single metric is universally optimal across all target datasets. To address this gap, we introduce MetaRank, a meta-learning framework for automatic, task-aware MTE metric selection. We formulate metric selection as a learning-to-rank problem. Rather than relying on conventional meta-features, MetaRank encodes textual descriptions of both datasets and MTE metrics using a pretrained language model, embedding them into a shared semantic space. A meta-predictor is then trained offline on diverse meta-tasks to learn the intricate relationship between dataset characteristics and metric mechanisms, optimized with a listwise objective that prioritizes correctly ranking the top-performing metrics. During the subsequent online phase, MetaRank efficiently ranks the candidate MTE metrics for a new, unseen target dataset based on its textual description, enabling practitioners to select the most appropriate metric a priori. Extensive experiments across 11 pretrained models and 11 target datasets demonstrate the strong effectiveness of our approach.
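
The abstract does not say which listwise objective MetaRank uses. As a rough illustration of what "listwise" means here, the sketch below uses a ListNet-style top-1 cross-entropy as a stand-in. The metric names and scores are hypothetical placeholders, not values from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_loss(pred_scores, true_scores):
    """ListNet-style top-1 cross-entropy (an assumed stand-in objective).

    true_scores: observed quality of each candidate metric on a meta-task.
    pred_scores: scores produced by the meta-predictor for the same metrics.
    """
    p_true = softmax(true_scores)
    p_pred = softmax(pred_scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))

def rank_metrics(names, pred_scores):
    """Online phase: order candidate metrics by predicted score, best first."""
    order = sorted(zip(names, pred_scores), key=lambda pair: -pair[1])
    return [name for name, _ in order]

# Hypothetical example: three placeholder metrics on one meta-task.
names = ["metric_a", "metric_b", "metric_c"]
true_scores = [0.9, 0.5, 0.1]
good_pred = [2.0, 1.0, 0.0]   # agrees with the true ordering
bad_pred = [0.0, 1.0, 2.0]    # reverses the true ordering
print(listwise_loss(good_pred, true_scores) < listwise_loss(bad_pred, true_scores))
print(rank_metrics(names, good_pred))
```

Because the softmax concentrates mass on the highest-scoring items, this kind of loss penalizes mistakes at the top of the list more than at the bottom, which matches the abstract's stated emphasis on the top-performing metrics.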

[85] Structure-Aware Prototype Guided Trusted Multi-View Classification

Haojian Huang,Jiahao Shi,Zhe Liu,Harold Haodong Chen,Han Fang,Hao Sun,Zhongjiang He

Main category: cs.CV

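TL;DR: Proposes a new multi-view classification framework that uses prototypes to represent each view's neighbor structure, simplifying intra-view relation learning and enabling dynamic inter-view alignment to improve classification efficiency and trustworthiness.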

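Details Motivation: Existing trustworthy multi-view classification methods rely on globally dense neighbor relationships, which are computationally expensive and cannot ensure inter-view consistency; they also aggregate evidence with manually assigned weights, offering no guarantee that multi-view structures are consistent within the class space, which undermines trustworthiness. Method: Introduces prototypes to represent each view's neighbor structure, simplifies the modeling of intra-view dependencies, and uses a dynamic alignment mechanism to learn consistent intra- and inter-view structures, making cross-view consensus discovery more efficient and reliable. Result: Experiments on multiple public multi-view datasets show that the method is better than or comparable to prevalent trustworthy multi-view classification methods in downstream performance and robustness. Conclusion: The proposed prototype-based framework improves the efficiency and consistency of multi-source information fusion, strengthens the trustworthiness of classification results, and offers a new approach to reliable decision-making in complex scenarios. Abstract: Trustworthy multi-view classification (TMVC) addresses the challenge of achieving reliable decision-making in complex scenarios where multi-source information is heterogeneous, inconsistent, or even conflicting. Existing TMVC approaches predominantly rely on globally dense neighbor relationships to model intra-view dependencies, leading to high computational costs and an inability to directly ensure consistency across inter-view relationships. Furthermore, these methods typically aggregate evidence from different views through manually assigned weights, lacking guarantees that the learned multi-view neighbor structures are consistent within the class space, thus undermining the trustworthiness of classification outcomes. To overcome these limitations, we propose a novel TMVC framework that introduces prototypes to represent the neighbor structures of each view. By simplifying the learning of intra-view neighbor relations and enabling dynamic alignment of intra- and inter-view structure, our approach facilitates more efficient and consistent discovery of cross-view consensus. Extensive experiments on multiple public multi-view datasets demonstrate that our method achieves competitive downstream performance and robustness compared to prevalent TMVC methods.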

[86] CameraMaster: Unified Camera Semantic-Parameter Control for Photography Retouching

Qirui Yang,Yang Yang,Ying Zeng,Xiaobin Hu,Bo Li,Huanjing Yue,Jingyu Yang,Peng-Tao Jiang

Main category: cs.CV

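TL;DR: This paper proposes CameraMaster, a unified camera-aware image retouching framework. By decoupling the camera directive and fusing intent with parameter information, it gives precise control over exposure, white balance, zoom, and similar parameters, overcoming the scalability and multi-parameter composition limits of existing methods.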

Details Motivation: Existing text-guided diffusion models struggle to achieve physically consistent and precise camera parameter control in image editing, because they depend either on ambiguous text prompts or on extra trained heads/weights, which leads to poor scalability and difficult multi-parameter composition. Method: Proposes the CameraMaster framework, which explicitly decouples the camera directive; introduces a parameter embedding that modulates both content semantics and the camera directive, and injects the modulated directive into the features via cross-attention; and also injects the directive and parameter embeddings into the time embedding as conditioning and gating signals, giving unified layer-wise control throughout denoising. Result: A large-scale dataset of 78K image-prompt pairs was built; experiments show that CameraMaster responds monotonically and near-linearly to parameter changes, supports seamless multi-parameter composition, and significantly outperforms existing methods. Conclusion: CameraMaster delivers image retouching with high sensitivity, composability, and physical consistency, providing a scalable unified solution for precise camera parameter control. Abstract: Text-guided diffusion models have greatly advanced image editing and generation. However, achieving physically consistent image retouching with precise parameter control (e.g., exposure, white balance, zoom) remains challenging. Existing methods either rely solely on ambiguous and entangled text prompts, which hinders precise camera control, or train separate heads/weights for parameter adjustment, which compromises scalability, multi-parameter composition, and sensitivity to subtle variations. To address these limitations, we propose CameraMaster, a unified camera-aware framework for image retouching. The key idea is to explicitly decouple the camera directive and then coherently integrate two critical information streams: a directive representation that captures the photographer's intent, and a parameter embedding that encodes precise camera settings. CameraMaster first uses the camera parameter embedding to modulate both the camera directive and the content semantics. The modulated directive is then injected into the content features via cross-attention, yielding a strongly camera-sensitive semantic context. In addition, the directive and camera embeddings are injected as conditioning and gating signals into the time embedding, enabling unified, layer-wise modulation throughout the denoising process and enforcing tight semantic-parameter alignment. To train and evaluate CameraMaster, we construct a large-scale dataset of 78K image-prompt pairs annotated with camera parameters. Extensive experiments show that CameraMaster produces monotonic and near-linear responses to parameter variations, supports seamless multi-parameter composition, and significantly outperforms existing methods.

[87] CaptionQA: Is Your Caption as Useful as the Image Itself?

Shijia Yang,Yunong Liu,Bohan Zhai,Ximeng Sun,Zicheng Liu,Emad Barsoum,Manling Li,Chenfeng Xu

Main category: cs.CV

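TL;DR: Proposes CaptionQA, a utility-based benchmark for evaluating image captions that measures caption quality by how well captions support downstream tasks, revealing significant shortcomings in the utility of captions from existing models.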

Details Motivation: Existing caption evaluation methods fail to answer a core question: can a caption replace the image in real downstream tasks? A new utility-based evaluation is therefore needed. Method: Builds CaptionQA, an extensible, domain-dependent benchmark covering 4 domains with 25 top-level categories and 69 subcategories, and creates 33,027 multiple-choice questions that require visual information; an LLM answers the questions from captions alone to measure caption utility. Result: The evaluation shows that multimodal large models with nearly identical scores on traditional image-QA benchmarks can differ by up to 32% in caption utility, indicating that current captions do not effectively preserve the useful information in images. Conclusion: CaptionQA effectively measures caption utility for downstream tasks, exposes major shortcomings of current generative models, and points to directions for improvement. Abstract: Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains--Natural, Document, E-commerce, and Embodied AI--each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models nearly identical on traditional image-QA benchmarks drop by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.
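
The caption-only evaluation protocol can be pictured as a simple accuracy computation. The sketch below is a minimal illustration under stated assumptions, not the released CaptionQA pipeline: the question format, the `caption_utility` function, and the `keyword_answerer` stub (a toy replacement for the downstream LLM) are all hypothetical.

```python
from typing import Callable, Sequence

# Assumed question format: {"question": str, "choices": [str, ...], "answer": int}
AnswerFn = Callable[[str, str, Sequence[str]], int]

def caption_utility(caption: str, questions: Sequence[dict], answer_fn: AnswerFn) -> float:
    """Fraction of multiple-choice questions answered correctly from the caption alone."""
    if not questions:
        raise ValueError("at least one question is required")
    correct = sum(
        1 for q in questions
        if answer_fn(caption, q["question"], q["choices"]) == q["answer"]
    )
    return correct / len(questions)

def keyword_answerer(caption: str, question: str, choices: Sequence[str]) -> int:
    """Toy stand-in for an LLM: pick the first choice mentioned in the caption, else choice 0."""
    lowered = caption.lower()
    for index, choice in enumerate(choices):
        if choice.lower() in lowered:
            return index
    return 0

caption = "A red mug on a wooden desk"
questions = [
    {"question": "What color is the mug?", "choices": ["blue", "red", "green"], "answer": 1},
    {"question": "What is the mug on?", "choices": ["desk", "floor"], "answer": 0},
    {"question": "How many mugs are there?", "choices": ["two", "three", "one"], "answer": 2},
]
print(caption_utility(caption, questions, keyword_answerer))
```

The third question fails because the caption never states the count, which is the kind of lost image-level information the benchmark is meant to expose.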

[88] FlowerDance: MeanFlow for Efficient and Refined 3D Dance Generation

Kaixing Yang,Xulong Tang,Ziqiao Peng,Xiangyue Zhang,Puwei Wang,Jun He,Hongyan Liu

Main category: cs.CV

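TL;DR: FlowerDance is an efficient music-to-dance generation method that combines MeanFlow with physical consistency constraints to produce high-quality, physically plausible, and expressive dance motion in only a few sampling steps. A BiMamba backbone and channel-level cross-modal fusion give fast inference and low memory use, and the method supports interactive motion editing.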

Details Motivation: Existing music-to-dance generation methods are inefficient, which limits expressiveness and real-time performance in practical applications such as high-fidelity 3D rendering. Method: Proposes FlowerDance, which combines MeanFlow with physical consistency constraints to reduce the number of sampling steps, and adopts a non-autoregressive architecture with a BiMamba backbone and channel-level cross-modal fusion to improve inference speed and memory efficiency while supporting motion editing. Result: Experiments on the AIST++ and FineDance datasets show that FlowerDance reaches state-of-the-art results in both motion quality and generation efficiency, with significantly faster inference and lower memory use. Conclusion: FlowerDance greatly improves generation efficiency while maintaining high-quality dance motion, and its practicality and interactivity make it suitable for applications such as virtual reality and digital entertainment. Abstract: Music-to-dance generation aims to translate auditory signals into expressive human motion, with broad applications in virtual reality, choreography, and digital entertainment. Despite promising progress, the limited generation efficiency of existing methods leaves insufficient computational headroom for high-fidelity 3D rendering, thereby constraining the expressiveness of 3D characters during real-world applications. Thus, we propose FlowerDance, which not only generates refined motion with physical plausibility and artistic expressiveness, but also achieves significant generation efficiency in inference speed and memory utilization. Specifically, FlowerDance combines MeanFlow with Physical Consistency Constraints, which enables high-quality motion generation with only a few sampling steps. Moreover, FlowerDance leverages a simple but efficient model architecture with a BiMamba-based backbone and Channel-Level Cross-Modal Fusion, which generates dance in an efficient non-autoregressive manner. Meanwhile, FlowerDance supports motion editing, enabling users to interactively refine dance sequences. Extensive experiments on AIST++ and FineDance show that FlowerDance achieves state-of-the-art results in both motion quality and generation efficiency. Code will be released upon acceptance.
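
To illustrate why an average-velocity model needs only a few sampling steps, here is a toy one-dimensional sketch. It assumes the MeanFlow-style update z_r = z_t - (t - r) * u(z_t, r, t), with t = 1 as noise and t = 0 as data. The closed-form `average_velocity` replaces a trained network, and the single data point `X0` is a made-up target. FlowerDance's physical consistency constraints and music conditioning are not modeled.

```python
X0 = 1.0  # hypothetical "data" point the toy flow should reach at t = 0

def average_velocity(z_t: float, r: float, t: float) -> float:
    """Toy average velocity for the straight path z_t = (1 - t) * X0 + t * noise.

    On this path the instantaneous velocity is constant (noise - X0), so its
    average over [r, t] equals (z_t - X0) / t. A real model would be a network.
    """
    return (z_t - X0) / t

def sample(z_noise: float, num_steps: int) -> float:
    """Integrate from t = 1 (noise) to t = 0 (data) in num_steps jumps."""
    z = z_noise
    times = [1.0 - i / num_steps for i in range(num_steps + 1)]
    for t, r in zip(times[:-1], times[1:]):
        z = z - (t - r) * average_velocity(z, r, t)  # z_r = z_t - (t - r) * u
    return z

print(sample(3.0, num_steps=1))  # one jump from noise to data
print(sample(3.0, num_steps=2))  # two jumps reach the same point
```

With an exact average velocity, one step and many steps land on the same sample. A learned model only approximates this, which is presumably why a few steps are still used in practice.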

[89] LungNoduleAgent: A Collaborative Multi-Agent System for Precision Diagnosis of Lung Nodules

Cheng Yang,Hui Jin,Xinlei Yu,Zhipeng Wang,Yaoqun Liu,Fenglei Fan,Dajiang Lei,Gangyong Jia,Changmiao Wang,Ruiquan Ge

Main category: cs.CV

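TL;DR: This paper proposes LungNoduleAgent, a collaborative multi-agent system for analyzing lung CT scans. Its three modules improve the accuracy of nodule description and malignancy grading, and it outperforms existing models on several datasets.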

Details Motivation: Existing multimodal large language models are limited in accurately describing nodule morphology and incorporating medical expertise, which affects their reliability in clinical use; the potential of multi-agent systems in pathology has not been fully explored. Method: Decomposes diagnosis into three modules: the Nodule Spotter detects nodules, the Radiologist uses localized image description techniques to generate detailed CT reports, and the Doctor Agent System performs malignancy reasoning from images and reports, supported by a pathology knowledge base and a multi-agent framework. Result: Experiments on two private datasets and the public LIDC-IDRI dataset show that LungNoduleAgent outperforms mainstream vision-language models, agent systems, and expert models on nodule description and malignancy grading. Conclusion: Region-level semantic alignment and multi-agent collaboration are essential for lung nodule diagnosis, and LungNoduleAgent is a promising foundational tool for supporting clinical lung nodule analysis. Abstract: Diagnosing lung cancer typically involves physicians identifying lung nodules in computed tomography (CT) scans and generating diagnostic reports based on their morphological features and medical expertise. Although advancements have been made in using multimodal large language models for analyzing lung CT scans, challenges remain in accurately describing nodule morphology and incorporating medical expertise. These limitations affect the reliability and effectiveness of these models in clinical settings. Collaborative multi-agent systems offer a promising strategy for achieving a balance between generality and precision in medical applications, yet their potential in pathology has not been thoroughly explored. To bridge these gaps, we introduce LungNoduleAgent, an innovative collaborative multi-agent system specifically designed for analyzing lung CT scans. LungNoduleAgent streamlines the diagnostic process into sequential components, improving precision in describing nodules and grading malignancy through three primary modules. The first module, the Nodule Spotter, coordinates clinical detection models to accurately identify nodules. The second module, the Radiologist, integrates localized image description techniques to produce comprehensive CT reports. Finally, the Doctor Agent System performs malignancy reasoning by using images and CT reports, supported by a pathology knowledge base and a multi-agent system framework. Extensive testing on two private datasets and the public LIDC-IDRI dataset indicates that LungNoduleAgent surpasses mainstream vision-language models, agent systems, and advanced expert models. These results highlight the importance of region-level semantic alignment and multi-agent collaboration in diagnosing nodules. LungNoduleAgent stands out as a promising foundational tool for supporting clinical analyses of lung nodules.

[90] PG-ControlNet: A Physics-Guided ControlNet for Generative Spatially Varying Image Deblurring

Hakki Motorcu,Mujdat Cetin

Main category: cs.CV

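TL;DR: This paper proposes a new image deblurring framework that combines a powerful generative prior with explicit, dense physical constraints to address the ill-posed problem of spatially varying blur, striking a good balance between physical accuracy and perceptual realism.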

Details Motivation: Existing learning-based deblurring methods struggle to reconcile physical constraints with perceptual quality: model-based methods enforce strong physical constraints but over-smooth textures, while generative models look good but tend to hallucinate details. This paper aims to combine the strengths of both. Method: Models the degradation field as a dense continuum of high-dimensional compressed kernels and uses this descriptor field to condition diffusion sampling under a ControlNet architecture, preserving physical accuracy while improving the perceptual quality of texture detail. Result: Experiments show that the method outperforms state-of-the-art model-based methods and generative models in complex, severely blurred scenes, with gains on metrics such as PSNR and LPIPS. Conclusion: The proposed framework successfully combines the advantages of model-based and generative approaches, modulating a generative prior with dense physical constraints to produce more realistic and physically consistent deblurring results. Abstract: Spatially varying image deblurring remains a fundamentally ill-posed problem, especially when degradations arise from complex mixtures of motion and other forms of blur under significant noise. State-of-the-art learning-based approaches generally fall into two paradigms: model-based deep unrolling methods that enforce physical constraints by modeling the degradations, but often produce over-smoothed, artifact-laden textures, and generative models that achieve superior perceptual quality yet hallucinate details due to weak physical constraints. In this paper, we propose a novel framework that uniquely reconciles these paradigms by taming a powerful generative prior with explicit, dense physical constraints. Rather than oversimplifying the degradation field, we model it as a dense continuum of high-dimensional compressed kernels, ensuring that minute variations in motion and other degradation patterns are captured. We leverage this rich descriptor field to condition a ControlNet architecture, strongly guiding the diffusion sampling process. Extensive experiments demonstrate that our method effectively bridges the gap between physical accuracy and perceptual realism, outperforming state-of-the-art model-based methods as well as generative baselines in challenging, severely blurred scenarios.

[91] MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization

Yingjie Xia,Xi Wang,Jinglei Shi,Vicky Kalogeiton,Jian Yang

Main category: cs.CV

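TL;DR: This paper proposes MUSE, the first unified framework for emotional image generation and editing. Using a test-time scaling strategy, it synthesizes emotions without additional training and addresses how to guide synthesis stably, when to introduce emotional guidance, and which emotion to guide toward.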

Details Motivation: Existing image emotional synthesis methods separate generation from editing, which is inefficient and limits applications such as therapy and storytelling. A unified framework that handles both tasks is needed. Method: Adopts a strategy similar to Test-Time Scaling (TTS), using an off-the-shelf emotion classifier to control emotional tokens through gradient-based optimization; determines the best time to introduce emotional guidance from semantic similarity; and designs a multi-emotion loss to reduce interference between emotions. Result: Experiments show that MUSE outperforms existing methods in emotional accuracy and semantic diversity while keeping a good balance among content fidelity, text alignment, and realistic emotional expression. Conclusion: MUSE establishes a new paradigm for image emotional synthesis, providing the first unified generation and editing framework that needs neither model updates nor specialized datasets. Abstract: Images evoke emotions that profoundly influence perception, often prioritized over content. Current Image Emotional Synthesis (IES) approaches artificially separate generation and editing tasks, creating inefficiencies and limiting applications where these tasks naturally intertwine, such as therapeutic interventions or storytelling. In this work, we introduce MUSE, the first unified framework capable of both emotional generation and editing. By adopting a strategy conceptually aligned with Test-Time Scaling (TTS), which is widely used in both the LLM and diffusion model communities, it avoids the need to further update the diffusion model or to build specialized emotional synthesis datasets. More specifically, MUSE addresses three key questions in emotional synthesis: (1) HOW to stably guide synthesis by leveraging an off-the-shelf emotion classifier with gradient-based optimization of emotional tokens; (2) WHEN to introduce emotional guidance by identifying the optimal timing using semantic similarity as a supervisory signal; and (3) WHICH emotion to guide synthesis through a multi-emotion loss that reduces interference from inherent and similar emotions. Experimental results show that MUSE performs favorably against all methods for both generation and editing, improving emotional accuracy and semantic diversity while maintaining an optimal balance between desired content, adherence to text prompts, and realistic emotional expression. It establishes a new paradigm for emotion synthesis.

[92] Long-Term Alzheimers Disease Prediction: A Novel Image Generation Method Using Temporal Parameter Estimation with Normal Inverse Gamma Distribution on Uneven Time Series

Xin Hong,Xinze Sun,Yinhao Li,Yen-Wei Chen

Main category: cs.CV

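TL;DR: This paper proposes a model based on a temporally parameterized Normal Inverse Gamma distribution (T-NIG) that generates brain images under irregular time intervals for long-term Alzheimer's disease prediction, preserving disease-related characteristics and reducing uncertainty.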

Details Motivation: Existing methods struggle to maintain disease-related characteristics when handling sequential data with irregular time intervals, which hurts the accuracy of long-term AD prediction. Method: Proposes the T-NIG model, which uses brain images from two time points to generate intermediate and future images; extracts features using coordinate neighborhoods; introduces a time parameter into the Normal Inverse Gamma distribution to model feature changes over uneven time intervals; and uses uncertainty estimation to reduce epistemic and aleatoric uncertainty. Result: T-NIG shows state-of-the-art performance on both short-term and long-term prediction tasks, accurately forecasting disease progression while preserving disease-related characteristics under irregular temporal distributions. Conclusion: T-NIG effectively addresses the loss of disease-related characteristics in image generation under irregular time intervals and improves the robustness and accuracy of long-term AD prediction. Abstract: Image generation can provide physicians with an imaging diagnosis basis in the prediction of Alzheimer's Disease (AD). Recent research has shown that long-term AD predictions by image generation often face difficulties maintaining disease-related characteristics when dealing with irregular time intervals in sequential data. Considering that the time-related aspects of the distribution can reflect changes in disease-related characteristics when images are distributed unevenly, this research proposes a model to estimate the temporal parameter within the Normal Inverse Gamma Distribution (T-NIG) to assist in generating images over the long term. The T-NIG model employs brain images from two different time points to create intermediate brain images, forecast future images, and predict the disease. T-NIG is designed by identifying features using coordinate neighborhoods. It incorporates a time parameter into the normal inverse gamma distribution to understand how features change in brain imaging sequences that have varying time intervals. Additionally, T-NIG utilizes uncertainty estimation to reduce both epistemic and aleatoric uncertainties in the model, which arise from insufficient temporal data. In particular, the T-NIG model demonstrates state-of-the-art performance in both short-term and long-term prediction tasks within the dataset. Experimental results indicate that T-NIG is proficient in forecasting disease progression while maintaining disease-related characteristics, even when faced with an irregular temporal data distribution.
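
For readers unfamiliar with the Normal Inverse Gamma (NIG) distribution, the standard form and the uncertainty decomposition commonly derived from it in evidential regression are given below. The abstract does not state how T-NIG injects the time parameter, so the symbols $\gamma, \nu, \alpha, \beta$ are the generic NIG parameters; treating them as functions of a time interval $\Delta t$ is an assumption made here for illustration only.

```latex
% Standard NIG prior over the mean \mu and variance \sigma^2 of a Gaussian
\mu \mid \sigma^2 \sim \mathcal{N}\!\left(\gamma,\ \sigma^2/\nu\right),
\qquad
\sigma^2 \sim \Gamma^{-1}(\alpha,\ \beta),
\qquad \nu > 0,\ \alpha > 1,\ \beta > 0.

% Prediction and the two uncertainty terms
\mathbb{E}[\mu] = \gamma,
\qquad
\underbrace{\mathbb{E}[\sigma^2] = \frac{\beta}{\alpha - 1}}_{\text{aleatoric}},
\qquad
\underbrace{\operatorname{Var}[\mu] = \frac{\beta}{\nu(\alpha - 1)}}_{\text{epistemic}}.

% Illustrative assumption (not from the abstract): time-dependent parameters
\gamma = \gamma(\Delta t),\quad \nu = \nu(\Delta t),\quad
\alpha = \alpha(\Delta t),\quad \beta = \beta(\Delta t).
```

Under this reading, sparse temporal data would show up as small $\nu$ and therefore large epistemic uncertainty, consistent with the abstract's remark that the uncertainties arise from insufficient temporal data.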

[93] MIRA: Multimodal Iterative Reasoning Agent for Image Editing

Ziyun Zeng,Hang Hua,Jiebo Luo

Main category: cs.CV

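TL;DR: This paper proposes MIRA, a lightweight multimodal reasoning agent that performs natural-language-instructed image editing through an iterative perception-reasoning-action loop, significantly improving semantic consistency and visual quality on complex instructions.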

Details Motivation: Existing diffusion-based image editing methods struggle to understand compositional relationships, contextual cues, or referring expressions in complex natural-language instructions, so edits drift from the user's intent. A method that parses complex instructions better and edits precisely is needed. Method: Proposes MIRA (Multimodal Iterative Reasoning Agent), which uses an iterative perception-reasoning-action framework to generate atomic edit instructions step by step and refines its decisions with visual feedback; builds MIRA-Editing, a multimodal tool-use dataset of 150K samples; and trains with a two-stage SFT + GRPO strategy. Result: Paired with several open-source image editing models (such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit), MIRA significantly improves semantic consistency and perceptual quality, matching or exceeding proprietary systems such as GPT-Image and Nano-Banana. Conclusion: By simulating multi-turn human interaction, MIRA effectively resolves semantic drift in image editing under complex instructions, and its generality and practicality make it a new solution for instruction-guided image editing. Abstract: Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.

[94] AnchorOPT: Towards Optimizing Dynamic Anchors for Adaptive Prompt Learning

Zheng Li,Yibing Song,Xin Zhang,Lei Luo,Xiang Li,Jian Yang

Main category: cs.CV

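TL;DR: Proposes AnchorOPT, a dynamic anchor-based prompt learning framework that learns task-relevant anchor values and an optimizable position matrix, improving the adaptability and generalization of CLIP models across tasks and training stages.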

Details Motivation: Existing CLIP-based prompt learning methods use static textual tokens as anchors, which lack flexibility across tasks and training stages and limit generalization. Method: AnchorOPT introduces dynamism in two dimensions: anchor values are learned dynamically from task data instead of using handcrafted fixed text, and the positional relationship between anchors and soft prompts is adaptively optimized through a learnable position matrix conditioned on the training stage and task context. Training has two stages: first learn and freeze the anchors, then optimize the soft prompts and the position matrix. Result: Experiments show that, with only a simple learnable anchor and position matrix, AnchorOPT matches or exceeds some methods that add extra modules or regularization techniques, and yields consistent gains across multiple datasets. Conclusion: As a plug-and-play module, AnchorOPT effectively increases the flexibility and expressiveness of existing prompt learning frameworks, with good generality and application prospects. Abstract: Existing prompt learning methods, which are built upon CLIP models, leverage textual tokens as anchors to guide the learnable soft tokens. This guidance improves CLIP's generalization. However, these anchors, static in both value and position, lack cross-task and stage-adaptive flexibility. To address this limitation, we propose AnchorOPT, a dynamic anchor-based prompt learning framework. Specifically, AnchorOPT introduces dynamism in two key dimensions: (i) anchor values eschew handcrafted explicit textual tokens (e.g., "shape", "color"), instead learning dynamically from task-specific data; and (ii) the positional relationship between anchor and soft tokens is no longer fixed but adaptively optimized via a learnable position matrix conditioned on the training stage and task context. Training occurs in two stages: we first learn the anchor tokens, then freeze and transfer them to the second stage for optimization of soft tokens and the position matrix. Extensive experiments demonstrate that using only a simple learnable anchor and position matrix achieves performance comparable to or exceeding some methods incorporating additional learnable modules or regularization techniques. As a plug-and-play module, AnchorOPT integrates seamlessly into existing frameworks, yielding consistent performance gains across diverse datasets. Code is publicly available at https://github.com/zhengli97/ATPrompt.

[95] CLRecogEye : Curriculum Learning towards exploiting convolution features for Dynamic Iris Recognition

Geetanjali Sharma,Gaurav Jaswal,Aditya Nigam,Raghavendra Ramachandra

Main category: cs.CV

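TL;DR: Proposes a new iris recognition framework based on a 3D-CNN and curriculum learning that models the spatio-temporal structure of iris features to improve robustness and generalization under rotation, scale change, reflections, and blur.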

Details Motivation: Most existing iris recognition methods rely on point-to-point distance measures, fail to exploit the spatio-spatial-temporal structure of iris patterns, and are not robust enough to rotation, scale variation, specular reflections, and defocus blur. Method: Splits each iris image along one dimension into a sequence of sub-images and feeds it to a 3D-CNN to capture spatial and spatio-temporal features; trains the model with a curriculum learning strategy and optimizes end-to-end with triplet and ArcFace losses so that the feature space encodes temporal dependencies. Result: The proposed method shows stronger discriminative ability under various disturbances and significantly improves the robustness and generalization of iris authentication. Conclusion: By introducing a 3D-CNN and curriculum learning, the framework effectively exploits the spatio-temporal structure of iris features and provides a more robust and general solution for iris recognition under challenging conditions. Abstract: Iris authentication algorithms have achieved impressive recognition performance, making them highly promising for real-world applications such as border control, citizen identification, and both criminal investigations and commercial systems. However, their robustness is still challenged by variations in rotation, scale, specular reflections, and defocus blur. In addition, most existing approaches rely on straightforward point-to-point comparisons, typically using cosine or L2 distance, without effectively leveraging the spatio-spatial-temporal structure of iris patterns. To address these limitations, we propose a novel and generalized matching pipeline that learns rich spatio-spatial-temporal representations of iris features. Our approach first splits each iris image along one dimension, generating a sequence of sub-images that serve as input to a 3D-CNN, enabling the network to capture both spatial and spatio-spatial-temporal cues. To further enhance the modeling of spatio-spatial-temporal feature dynamics, we train the model in a curriculum manner. This design allows the network to embed temporal dependencies directly into the feature space, improving discriminability in the deep metric domain. The framework is trained end-to-end with triplet and ArcFace loss in a curriculum manner, enforcing highly discriminative embeddings despite challenges like rotation, scale, reflections, and blur. This design yields a robust and generalizable solution for iris authentication. GitHub code: https://github.com/GeetanjaliGTZ/CLRecogEye
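
Two ingredients named in the abstract, splitting an image along one dimension into a sequence of sub-images and the triplet loss, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the strip count, the margin of 0.2, and the use of plain (non-squared) L2 distance are all choices made here, and the 3D-CNN, ArcFace loss, and curriculum schedule are omitted.

```python
import math

def split_into_strips(image, num_strips):
    """Split a 2D image (list of rows) along its width into equal-width sub-images.

    The resulting list can be stacked along a new axis to form the sequence
    (depth) dimension of a 3D-CNN input.
    """
    width = len(image[0])
    if width % num_strips != 0:
        raise ValueError("image width must be divisible by num_strips")
    step = width // num_strips
    return [[row[i * step:(i + 1) * step] for row in image] for i in range(num_strips)]

def l2_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: pull same-identity pairs together, push others apart."""
    return max(0.0, l2_distance(anchor, positive) - l2_distance(anchor, negative) + margin)

# A tiny 2 x 8 "image" split into 4 strips of shape 2 x 2.
image = [[0, 1, 2, 3, 4, 5, 6, 7],
         [8, 9, 10, 11, 12, 13, 14, 15]]
strips = split_into_strips(image, num_strips=4)
print(len(strips), strips[0])

# The loss vanishes once the negative is farther than the positive by the margin.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
```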

[96] Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis

Jiyun Bae,Hyunjong Ok,Sangwoo Mo,Jaeho Lee

Main category: cs.CV

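TL;DR: This study introduces Idis, a visual question-answering dataset with varied distractors, to examine how visual distractors affect test-time scaling in vision-language models. It finds that visual distractors differ fundamentally from textual ones: both show inverse scaling, but visual distractors lower accuracy without increasing reasoning length.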

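Details Motivation: To study whether vision-language models in multimodal settings show the inverse scaling effect seen in text-only language models, and in particular how distracting information affects reasoning and performance. Method: Builds Idis, a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions; analyzes the effect of visual distractors on reasoning length, attribute-count tracking, and accuracy; and checks whether the trends hold on visual bias benchmarks such as Waterbirds. Result: Visual distractors reduce accuracy without lengthening reasoning, revealing a fundamental difference from textual distractors; tracking attribute counts within reasoning traces clarifies how distractors, reasoning length, and accuracy interact; and the proposed prompting strategy mitigates bias-driven predictions. Conclusion: The inverse scaling mechanism triggered by visual distractors in multimodal models differs from that of textual distractors, so targeted mitigation strategies are needed to improve robustness and reasoning efficiency. Abstract: How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.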

[97] Pygmalion Effect in Vision: Image-to-Clay Translation for Reflective Geometry Reconstruction

Gayoung Lee,Junho Kim,Jin-Hwa Kim,Junmo Kim

Main category: cs.CV

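TL;DR: Proposes the "Pygmalion Effect in Vision", which suppresses specular reflections through image-to-clay translation to enable robust 3D reconstruction from multi-view images containing complex reflections.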

Details Motivation: View-dependent reflections entangle appearance and geometry, making reflection a long-standing challenge in 3D reconstruction. Method: Introduces a dual-branch network with a BRDF-based reflective branch and a clay-guided branch that stabilizes geometry and refines surface normals, trained jointly with synthesized reflection-free clay-like images as the supervision signal. Result: Experiments on synthetic and real datasets show substantial improvements in normal accuracy and mesh completeness over existing methods. Conclusion: Translating radiance into neutrality ("seeing by unshining") can serve as a strong inductive bias for learning the geometry of reflective objects. Abstract: Understanding reflection remains a long-standing challenge in 3D reconstruction due to the entanglement of appearance and geometry under view-dependent reflections. In this work, we present the Pygmalion Effect in Vision, a novel framework that metaphorically "sculpts" reflective objects into clay-like forms through image-to-clay translation. Inspired by the myth of Pygmalion, our method learns to suppress specular cues while preserving intrinsic geometric consistency, enabling robust reconstruction from multi-view images containing complex reflections. Specifically, we introduce a dual-branch network in which a BRDF-based reflective branch is complemented by a clay-guided branch that stabilizes geometry and refines surface normals. The two branches are trained jointly using the synthesized clay-like images, which provide a neutral, reflection-free supervision signal that complements the reflective views. Experiments on both synthetic and real datasets demonstrate substantial improvement in normal accuracy and mesh completeness over existing reflection-handling methods. Beyond technical gains, our framework reveals that seeing by unshining, translating radiance into neutrality, can serve as a powerful inductive bias for reflective object geometry learning.

[98] G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

Wenbo Hu,Jingli Lin,Yilin Long,Yunlong Ran,Lihan Jiang,Yifan Wang,Chenming Zhu,Runsen Xu,Tai Wang,Jiangmiao Pang

Main category: cs.CV

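TL;DR: G$^2$VLM is a geometry-grounded vision-language model that incorporates 3D visual geometry features to improve spatial understanding and reasoning, predicting 3D attributes and improving spatial-task performance without requiring dense annotations.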

Details Motivation: Existing vision-language models perform poorly on spatial intelligence, mainly because they lack a geometry learning process that reconstructs 3D space from 2D images. Method: Proposes G$^2$VLM, which trains on multi-view image and video data, natively integrates learned 3D visual geometry features, and strengthens spatial understanding and 3D reconstruction through in-context learning and interleaved reasoning. Result: Matches state-of-the-art feed-forward models on 3D reconstruction and achieves better or competitive results on spatial understanding and reasoning tasks. Conclusion: G$^2$VLM unifies a semantically strong VLM with low-level 3D vision tasks, can serve as a strong baseline for spatial-intelligence research, and may enable applications such as 3D scene editing. Abstract: Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.

[99] Scaling Foundation Models for Radar Scene Understanding

Pushkal Mishra,Kshitiz Bansal,Dinesh Bharadia

Main category: cs.CV

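TL;DR: Proposes RadarFM, a radar foundation model trained with structured spatial language supervision, which aims to unify scene-level representation learning in radar perception and improve cross-task transfer.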

Details Motivation: Existing radar perception methods are mostly task-specific and fragmented; without a unified representation, models transfer and generalize poorly across downstream tasks. The combination of radar with foundation models also remains underexplored. Method: Proposes a structured caption framework that encodes vehicle distributions in native radar coordinates, and a hash-aware contrastive learning objective that quantifies continuous scene similarity and supports fine-grained spatial reasoning; uses the CARLA simulator to generate large-scale annotated datasets. Result: Introduces localization-aware metrics that assess spatial accuracy beyond traditional detection measures; experiments validate the model's effectiveness and generalization across diverse driving scenarios. Conclusion: RadarFM achieves unified radar representation learning through structured language supervision and contrastive learning, moves radar perception toward foundation models, and shows good potential for cross-task transfer. Abstract: Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions. Recent advances in foundation models have transformed visual and language understanding, yet their integration with radar sensing remains largely underexplored. Existing radar approaches are fragmented and task-specific; each downstream task employs distinct architectures and training objectives, preventing transfer across tasks. In this work, we introduce RadarFM: a radar foundation model that learns unified scene-level representations through structured spatial language supervision. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in native radar coordinates, and (2) a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, enabling fine-grained spatial reasoning. Leveraging the CARLA simulator, we generate large-scale, well-annotated radar datasets across diverse driving scenarios. We also propose localization-aware metrics that assess spatial accuracy beyond traditional detection measures.

[100] EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens

Ze Feng,Sen Yang,Boqiang Duan,Wankou Yang,Jingdong Wang

Main category: cs.CV

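TL;DR: Proposes EM-KD, a new knowledge distillation paradigm for strengthening efficient multimodal large language models (MLLMs). It aligns the vision tokens of teacher and student using Manhattan distance and the Hungarian matching algorithm, and introduces two distillation strategies (VLAD and VSD) to improve vision-language understanding, outperforming existing methods by a large margin on multiple benchmarks.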

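Details Motivation: Existing efficient MLLMs lose information when they compress vision tokens, which degrades comprehension; conventional knowledge distillation overlooks the differences in fine-grained visual understanding caused by the unbalanced numbers of vision tokens in teacher and student. Method: Proposes EM-KD. It first computes the Manhattan distance between teacher and student vision logits and aligns the vision tokens along the spatial dimension with the Hungarian algorithm. It then applies two distillation strategies: Vision-Language Affinity Distillation (VLAD), which minimizes the smooth L1 distance between the teacher's and student's affinity matrices computed between text tokens and aligned vision tokens, and Vision Semantic Distillation (VSD), which uses reverse KL divergence to measure the difference between the probability distributions of the aligned vision logits over the vocabulary space. Result: On multiple benchmarks, models trained with EM-KD significantly outperform previous efficient MLLMs in both accuracy and efficiency; EM-KD also outperforms other distillation methods that are equipped with the proposed matching strategy for fair comparison. Conclusion: EM-KD effectively closes the fine-grained comprehension gap that unbalanced vision-token counts cause in knowledge distillation, gives efficient multimodal models stronger retention and fusion of visual information, and advances lightweight MLLMs. Abstract: Efficient Multimodal Large Language Models (MLLMs) compress vision tokens to reduce resource consumption, but the loss of visual information can degrade comprehension capabilities. Although some prior works introduce Knowledge Distillation to enhance student models, they overlook the fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between the efficient student and vanilla teacher. In this paper, we propose EM-KD, a novel paradigm that enhances the Efficient MLLMs with Knowledge Distillation. To overcome the challenge of unbalanced vision tokens, we first calculate the Manhattan distance between the vision logits of teacher and student, and then align them in the spatial dimension with the Hungarian matching algorithm. After alignment, EM-KD introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD) and 2) Vision Semantic Distillation (VSD). Specifically, VLAD calculates the affinity matrix between text tokens and aligned vision tokens, and minimizes the smooth L1 distance of the student and the teacher affinity matrices. Considering the semantic richness of vision logits in the final layer, VSD employs the reverse KL divergence to measure the discrete probability distributions of the aligned vision logits over the vocabulary space.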
Comprehensive evaluation on diverse benchmarks demonstrates that the EM-KD-trained model outperforms prior Efficient MLLMs in both accuracy and efficiency by a large margin, validating its effectiveness. Compared with previous distillation methods, which are equipped with our proposed vision token matching strategy for fair comparison, EM-KD also achieves better performance.
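
The alignment step described in this abstract (Manhattan distance between teacher and student vision logits, followed by a minimum-cost matching) can be illustrated on toy data. The sketch below is not the authors' implementation: the function names and the 2-dimensional "logits" are hypothetical, and a brute-force search over assignments stands in for the Hungarian algorithm, which solves the same assignment problem in polynomial time.

```python
from itertools import permutations


def manhattan(u, v):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def match_tokens(student, teacher):
    """Assign each student token to a distinct teacher token at minimum total L1 cost.

    Brute force over all injective assignments, so only practical for tiny inputs.
    The Hungarian algorithm reaches the same optimum efficiently.
    """
    cost = [[manhattan(s, t) for t in teacher] for s in student]
    best_cost, best_assign = float("inf"), None
    for assign in permutations(range(len(teacher)), len(student)):
        total = sum(cost[i][j] for i, j in enumerate(assign))
        if total < best_cost:
            best_cost, best_assign = total, assign
    return list(best_assign), best_cost


# Toy data (assumed): 2 student tokens, 3 teacher tokens, 2-dimensional vectors.
student = [[0.9, 0.1], [0.1, 0.8]]
teacher = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
assignment, total = match_tokens(student, teacher)
# assignment[i] is the index of the teacher token matched to student token i.
```

Here the student has fewer tokens than the teacher, which mirrors the "unbalanced vision tokens" setting: student token 0 is matched to teacher token 2 and student token 1 to teacher token 0, leaving teacher token 1 unmatched.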

[101] FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain

YuAn Wang,Xiaofan Li,Chi Huang,Wenhao Zhang,Hao Li,Bosheng Wang,Xun Sun,Jun Wang

Main category: cs.CV

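TL;DR: Proposes FaithFusion, a framework that fuses 3DGS with diffusion models and uses pixel-wise Expected Information Gain (EIG) to achieve controllable driving-scene reconstruction with geometric fidelity and visual consistency, reaching state-of-the-art performance under large viewpoint shifts.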

Details Motivation: In controllable driving-scene reconstruction, existing methods struggle to keep geometry accurate while generating visually realistic appearance; under large viewpoint changes they are prone to over-restoration and geometric drift. Method: Proposes the FaithFusion framework, which uses pixel-wise Expected Information Gain (EIG) as a unified policy: EIG guides the diffusion model to refine high-uncertainty regions, and pixel-level weighting distills the edits back into 3D Gaussian Splatting (3DGS), giving a plug-and-play fusion system that needs no extra priors or structural modifications. Result: Experiments on the Waymo dataset show SOTA results on NTA-IoU, NTL-IoU, and FID, with an FID of 107.47 maintained even at a 6-meter lane shift. Conclusion: Through EIG-driven fusion of 3DGS and diffusion models, FaithFusion balances geometric consistency against appearance realism and is suited to high-quality driving-scene generation under large viewpoint changes. Abstract: In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce FaithFusion, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications. Extensive experiments on the Waymo dataset demonstrate that our approach attains SOTA performance across NTA-IoU, NTL-IoU, and FID, maintaining an FID of 107.47 even at a 6-meter lane shift. Our code is available at https://github.com/wangyuanbiubiubiu/FaithFusion.

[102] Deformation-aware Temporal Generation for Early Prediction of Alzheimers Disease

Xin Honga,Jie Lin,Minghui Wang

Main category: cs.CV

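TL;DR: Proposes the Deformation-Aware Temporal Generative Network (DATGN), a new method for early prediction of Alzheimer's disease (AD) that automatically generates and analyzes sequences of brain MRI images. It handles the missing data common in temporal sequences and generates future MRI images that follow the disease's progression, which improves classification accuracy.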

Details Motivation: Alzheimer's disease is a neurodegenerative disease marked by progressive brain atrophy, and early prediction can help slow its progression. Existing methods mostly rely on manually extracted morphological features from brain images and handle incomplete MRI time series poorly, so a method is needed that learns disease-related morphological changes automatically and copes with missing data. Method: Proposes DATGN, which first interpolates incomplete MRI time series; then uses a bidirectional temporal deformation-aware module to model how brain morphology changes over time and generate future MRI images consistent with AD progression; and finally feeds the generated synthetic data to classifiers such as SVM, CNN, and 3DCNN to improve classification. Result: On the ADNI dataset, DATGN performs well on image-quality metrics such as PSNR and MMSE; the generated MRI images match real brain-atrophy trends; with synthetic data added, accuracy improves by 6.21% to 16% for AD vs. NC classification and by 7.34% to 21.25% for AD vs. MCI vs. NC classification. Conclusion: DATGN models the evolution of brain morphology in Alzheimer's disease and generates high-quality future MRI images; it addresses missing data in temporal sequences, significantly improves early prediction accuracy, and has potential as an aid to clinical diagnosis. Abstract: Alzheimer's disease (AD), a degenerative brain condition, can benefit from early prediction to slow its progression. As the disease progresses, patients typically undergo brain atrophy. Current prediction methods for Alzheimer's disease largely involve analyzing morphological changes in brain images through manual feature extraction. This paper proposes a novel method, the Deformation-Aware Temporal Generative Network (DATGN), to automate the learning of morphological changes in brain images associated with disease progression for early prediction. Given the common occurrence of missing data in the temporal sequences of MRI images, DATGN initially interpolates incomplete sequences. Subsequently, a bidirectional temporal deformation-aware module guides the network in generating future MRI images that adhere to the disease's progression, facilitating early prediction of Alzheimer's disease. DATGN was tested for the generation of temporal sequences of future MRI images using the ADNI dataset, and the experimental results are competitive in terms of PSNR and MMSE image quality metrics. Furthermore, when DATGN-generated synthetic data was integrated into SVM-, CNN-, and 3DCNN-based classification methods, significant improvements were achieved, ranging from 6.21\% to 16\% in AD vs. NC classification accuracy and from 7.34\% to 21.25\% in AD vs. MCI vs. NC classification accuracy.
The qualitative visualization results indicate that DATGN produces MRI images consistent with the brain atrophy trend in Alzheimer's disease, enabling early disease prediction.

[103] Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models

Changlin Li,Jiawei Zhang,Zeyi Shi,Zongxin Yang,Zhihui Li,Xiaojun Chang

Main category: cs.CV

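TL;DR: Proposes EntPruner, an entropy-guided progressive pruning framework for large-scale vision generative models such as diffusion and flow models. It scores module importance with Conditional Entropy Deviation (CED) and pairs this with a zero-shot adaptive pruning strategy, achieving up to 2.22x inference speedup while preserving generation quality.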

Details Motivation: Pre-trained large-scale vision generative models often carry significant parameter redundancy when transferred to downstream tasks, which hurts efficiency; an effective pruning method is needed that balances performance against model size. Method: Proposes entropy-guided pruning, which uses data-dependent Conditional Entropy Deviation (CED) to measure the importance of network blocks, and a zero-shot adaptive progressive pruning framework that decides dynamically during training when and how much to prune, avoiding mode collapse and making pruning more robust. Result: Extensive experiments on DiT and SiT models show up to 2.22x inference speedup on ImageNet and three downstream datasets while maintaining competitive generation quality. Conclusion: EntPruner offers an effective, adaptive pruning scheme for diffusion and flow models that automatically slims the model for different downstream tasks and markedly improves inference efficiency without sacrificing generation diversity or fidelity. Abstract: Large-scale vision generative models, including diffusion and flow models, have demonstrated remarkable performance in visual generation tasks. However, transferring these pre-trained models to downstream tasks often results in significant parameter redundancy. In this paper, we propose EntPruner, an entropy-guided automatic progressive pruning framework for diffusion and flow models. First, we introduce entropy-guided pruning, a block-level importance assessment strategy specifically designed for generative models. Unlike discriminative models, generative models require preserving the diversity and condition-fidelity of the output distribution. As the importance of each module can vary significantly across downstream tasks, EntPruner prioritizes pruning of less important blocks using data-dependent Conditional Entropy Deviation (CED) as a guiding metric. CED quantifies how much the distribution diverges from the learned conditional data distribution after removing a block. Second, we propose a zero-shot adaptive pruning framework to automatically determine when and how much to prune during training. This dynamic strategy avoids the pitfalls of one-shot pruning, mitigating mode collapse, and preserving model performance. Extensive experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22$\times$ inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets.

[104] CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion

Dianbing Xi,Jiepeng Wang,Yuanzhi Liang,Xi Qiu,Jialun Liu,Hao Pan,Yuchi Huo,Rui Wang,Haibin Huang,Chi Zhang,Xuelong Li

Main category: cs.CV

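TL;DR: Proposes CtrlVDiff, a unified diffusion model that combines multiple graphics modalities (depth, normals, segmentation, edges, and material properties) for high-quality, controllable video generation and understanding. A Hybrid Modality Control Strategy (HMCS) and a multimodally aligned dataset, MMVideo, address the insufficiency of geometric cues and incomplete multimodal inputs, yielding strong temporal consistency and fine-grained editing such as relighting and material replacement.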

Details Motivation: Video generation that relies only on geometric cues (e.g., depth, edges) cannot sufficiently constrain appearance, materials, and illumination, which limits editing and tends to cause temporal inconsistency. More graphics modalities are needed to make understanding and control more precise. Method: Proposes the CtrlVDiff model with a Hybrid Modality Control Strategy (HMCS) that accepts any subset of multimodal inputs (depth, normals, segmentation, edges, albedo, roughness, metallic, and so on) and maintains temporal consistency through feature routing and fusion; builds MMVideo, a large-scale multimodally aligned dataset that mixes real and synthetic videos, for training. Result: Surpasses state-of-the-art methods on video understanding and generation tasks, supports layer-wise edits (relighting, material adjustment, object insertion), stays robust when modalities are missing, and shows excellent temporal coherence and generation fidelity. Conclusion: By adding rich graphics-based semantic modalities and a hybrid control strategy, CtrlVDiff achieves more precise, controllable, and physically plausible video generation and editing, confirming the key role of multimodal cooperation in video modeling. Abstract: We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold: geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations. We then propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions.
Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.

[105] DeepRFTv2: Kernel-level Learning for Image Deblurring

Xintian Mao,Haofei Song,Yin-Nian Liu,Qingli Li,Yan Wang

Main category: cs.CV

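TL;DR: Proposes the Fourier Kernel Estimator (FKE), which turns the spatial-domain convolution problem into a multiplication problem in the frequency domain, enabling kernel-level learning of the blur process with low complexity and no additional supervision; combined with a multi-scale reversible network architecture, it improves image deblurring performance.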

Details Motivation: Existing deep networks are limited to pixel-level learning, so deblurring models do not truly capture the underlying mechanism of blur, in particular the blur process at the kernel level. Method: Proposes the Fourier Kernel Estimator (FKE), which applies the activation operation in Fourier space and converts spatial-domain convolution into frequency-domain multiplication; convolves the estimated kernel with network-extracted features rather than the raw image to better learn the essence of blur; and designs a decoupled multi-scale architecture of reversible sub-U-Nets that extracts features more efficiently and lowers training memory. Result: Achieves state-of-the-art performance on motion deblurring, learns physically meaningful blur kernels, and shows potential for other kernel-related problems. Conclusion: By modeling blur at the kernel level in the frequency domain and pairing it with an efficient multi-scale architecture, FKE gives the network a deeper understanding of the blur process and significantly improves deblurring quality and model efficiency. Abstract: It is well-known that if a network aims to learn how to deblur, it should understand the blur process. Blurring is naturally caused by the convolution of the sharp image with the blur kernel. Thus, allowing the network to learn the blur process at the kernel level can significantly improve the image deblurring performance. However, current deep networks are still at the pixel-level learning stage, either performing end-to-end pixel-level restoration or stage-wise pseudo kernel-level restoration, failing to enable the deblur model to understand the essence of the blur. To this end, we propose Fourier Kernel Estimator (FKE), which considers the activation operation in Fourier space and converts the convolution problem in the spatial domain to a multiplication problem in Fourier space. Our FKE, jointly optimized with the deblur model, enables the network to learn the kernel-level blur process with low complexity and without any additional supervision. Furthermore, we change the convolution object of the kernel from "image" to network-extracted "feature", whose rich semantic and structural information is more suitable to blur process learning. With the convolution of the feature and the estimated kernel, our model can learn the essence of blur at the kernel level. To further improve the efficiency of feature extraction, we design a decoupled multi-scale architecture with multiple hierarchical sub-unets with a reversible strategy, which allows better multi-scale encoding and decoding with low training memory.
Extensive experiments indicate that our method achieves state-of-the-art motion deblurring results and shows potential for handling other kernel-related problems. Analysis also shows our kernel estimator is able to learn physically meaningful kernels. The code will be available at https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur.
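
The conversion this abstract relies on, from convolution in the spatial domain to multiplication in Fourier space, is the standard convolution theorem. The sketch below checks it numerically on a toy 1-D signal with circular boundary handling; it illustrates the identity only, not FKE itself, and the signal, kernel, and function names are assumptions made for the example.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (the inverse includes the 1/n factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def circular_convolve(x, h):
    """Direct circular convolution in the spatial domain."""
    n = len(x)
    return [sum(x[m] * h[(i - m) % n] for m in range(n)) for i in range(n)]


sharp = [0.0, 1.0, 4.0, 2.0, 0.0, 0.0, 3.0, 1.0]     # toy 1-D "sharp" signal
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]  # toy blur kernel, sums to 1

blur_spatial = circular_convolve(sharp, kernel)

# The same blur computed as an element-wise product in Fourier space.
product = [a * b for a, b in zip(dft(sharp), dft(kernel))]
blur_fourier = [v.real for v in dft(product, inverse=True)]

# Largest difference between the two results; expected to be at rounding-error level.
max_gap = max(abs(a - b) for a, b in zip(blur_spatial, blur_fourier))
```

With a fast Fourier transform the Fourier-space route costs O(n log n) rather than O(n^2), which is one reason kernel operations are often moved into the frequency domain; images would use the 2-D transform in the same way.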

[106] Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning

Changlin Li,Jiawei Zhang,Shuhao Liu,Sihao Lin,Zeyi Shi,Zhihui Li,Xiaojun Chang

Main category: cs.CV

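TL;DR: Proposes Ent-Prog, an efficient training framework for diffusion models in human video generation, which uses entropy-guided prioritization and progressive learning to significantly reduce training time and GPU memory consumption.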

Details Motivation: Training diffusion models on high-resolution, multi-frame data is computationally expensive and memory-hungry, so a more efficient training method is needed. Method: Proposes Conditional Entropy Inflation (CEI) to assess the importance of model components, and an adaptive progressive training schedule that dynamically increases computational complexity according to convergence efficiency. Result: Experiments on three datasets show that Ent-Prog achieves up to 2.2x training speedup and 2.4x GPU memory reduction without harming generation performance. Conclusion: Ent-Prog is an effective training framework that significantly lowers the training cost of diffusion models for human video generation while preserving generation quality. Abstract: Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that adaptively increases computational complexity during training by measuring the convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets demonstrate the effectiveness of Ent-Prog, achieving up to 2.2$\times$ training speedup and 2.4$\times$ GPU memory reduction without compromising generative performance.

[107] Referring Video Object Segmentation with Cross-Modality Proxy Queries

Baoli Sun,Xinzhu Ma,Ning Wang,Zhihui Wang,Zhiyong Wang

Main category: cs.CV

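TL;DR: Proposes ProxyFormer, a new architecture that tackles cross-modal alignment in referring video object segmentation (RVOS) by introducing proxy queries that strengthen the fusion of visual and language semantics; it outperforms existing methods on multiple benchmarks.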

Details Motivation: Existing RVOS methods have two main problems in cross-modal alignment: they do not model inter-frame dependencies, and they integrate textual constraints too late, so attention drifts away from the target object. Method: Proposes ProxyFormer, which introduces proxy queries that are updated and propagated across multiple stages of the video feature encoder to fuse visual and textual semantics dynamically; decouples cross-modal interaction into temporal and spatial dimensions to cut computational cost; and designs a Joint Semantic Consistency (JSC) training strategy. Result: Experiments on four mainstream RVOS benchmarks show that ProxyFormer significantly outperforms state-of-the-art methods. Conclusion: Through dynamically evolving proxy queries, ProxyFormer strengthens cross-modal semantic alignment and inter-frame dependency modeling, improving the accuracy and coherence of object tracking in RVOS. Abstract: Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism built upon transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features to focus on the non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs.
Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of our ProxyFormer over the state-of-the-art methods.

[108] TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models

Jiaming He,Guanyu Hou,Hongwei Li,Zhicong Huang,Kangjie Chen,Yi Yu,Wenbo Jiang,Guowen Xu,Tianwei Zhang

Main category: cs.CV

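TL;DR: Proposes TEAR, a temporal-aware automated red-teaming framework for uncovering safety risks in text-to-video generation models, especially those tied to temporal dynamics.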

Details Motivation: Existing safety evaluation methods mainly target static image and text generation and struggle to capture the complex temporal dynamics of video generation, so the temporal safety of T2V models needs dedicated evaluation. Method: TEAR uses a temporal-aware test generator optimized in two stages, initial generator training followed by temporal-aware online preference learning, and a cyclically optimized refine model that improves prompt stealthiness and attack effectiveness. Result: Experiments on multiple open-source and commercial T2V systems show that TEAR's attack success rate exceeds 80%, well above the previous best of 57%. Conclusion: TEAR effectively exposes safety risks in the temporal dynamics of T2V models and offers a new direction for safety evaluation of future video generation models. Abstract: Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose a TEmporal-aware Automated Red-teaming framework, named TEAR, an automated framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach: initial generator training and temporal-aware online preference learning, to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. A refine model is additionally adopted to cyclically improve prompt stealthiness and adversarial effectiveness. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems with over 80% attack success rate, a significant boost over the prior best result of 57%.

[109] LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs

Shichu Sun,Yichen Zhang,Haolin Song,Zonghao Guo,Chi Chen,Yidan Zhang,Yuan Yao,Zhiyuan Liu,Maosong Sun

Main category: cs.CV

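TL;DR: Proposes LLaVA-UHD v3, a multimodal large language model built around Progressive Visual Compression (PVC), which significantly reduces inference latency while maintaining strong performance.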

Details Motivation: Global high-resolution visual encoding improves performance but is computationally expensive; the authors seek an efficient, broadly compatible visual encoding architecture that balances the two. Method: Proposes PVC, which comprises refined patch embedding and a hierarchical windowed token compression module and can be integrated into a standard ViT for progressive compression. Result: The resulting ViT-UHD performs well on multiple benchmarks and reduces time-to-first-token (TTFT) by 2.4x relative to MoonViT; LLaVA-UHD v3 matches Qwen2-VL in performance while reducing TTFT by 1.9x. Conclusion: PVC is an effective visual token compression framework that greatly improves inference efficiency while preserving the multimodal model's understanding ability, advancing efficient MLLMs. Abstract: Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Many recent MLLMs increasingly favor global native-resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into a standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual modeling, and (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates competitive performance with MoonViT while reducing TTFT (time-to-first-token) by 2.4x, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves competitive performance to Qwen2-VL, while further reducing TTFT by 1.9x. We will release all code and checkpoints to support future research on efficient MLLMs.
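
To make "windowed token compression" concrete, the sketch below merges each non-overlapping 2x2 window of a token grid into a single token and applies the step twice, so the token count drops by 4x per stage. Plain averaging is an assumption chosen for illustration (the abstract only says local token representations are aggregated), and the function name and toy grid are hypothetical.

```python
def compress_windows(grid, window=2):
    """Average each non-overlapping window x window block of tokens into one token."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    assert h % window == 0 and w % window == 0, "grid must divide evenly into windows"
    out = []
    for r in range(0, h, window):
        row = []
        for c in range(0, w, window):
            block = [grid[r + i][c + j] for i in range(window) for j in range(window)]
            row.append([sum(tok[k] for tok in block) / len(block) for k in range(d)])
        out.append(row)
    return out


# Toy 4x4 grid of 2-dimensional tokens; the token at (r, c) is [r, c].
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
stage1 = compress_windows(grid)    # 4x4 -> 2x2 tokens
stage2 = compress_windows(stage1)  # 2x2 -> 1x1 token
```

Applying the compression at several depths, rather than once at the end, is what makes the scheme progressive: later layers attend over fewer tokens, which is where latency savings such as a lower TTFT would come from.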

[110] Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation

Joonhyung Park,Hyeongwon Jang,Joowon Kim,Eunho Yang

Main category: cs.CV

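TL;DR: Proposes GridAR, a test-time scaling framework for visual autoregressive models; grid-partitioned progressive generation and a layout-specified prompt reformulation strategy significantly improve the quality and efficiency of text-to-image generation.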

Details Motivation: Test-time compute scaling has not yet been explored for visual autoregressive models, which also suffer from erroneous generation trajectories and the lack of a blueprint of the whole canvas, limiting the benefit of scaling. Method: Proposes the GridAR framework, a grid-partitioned progressive generation scheme that prunes infeasible candidates early and fixes viable ones as anchors to guide subsequent decoding, combined with a layout-specified prompt reformulation strategy that infers a feasible layout from partial views to guide generation. Result: With N=4, GridAR outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%; on image editing with PIE-Bench, semantic preservation improves by 13.9%. Conclusion: GridAR improves the generation quality and efficiency of visual autoregressive models under limited test-time compute, extends to image editing, and addresses the missing blueprint and wasted computation during generation. Abstract: Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, GridAR achieves higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%.
It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.

[111] Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding

Yutao Tang,Cheng Zhao,Gaurav Mittal,Rohith Kukkala,Rama Chellappa,Cheng Peng,Mei Chen

Main category: cs.CV

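TL;DR: Proposes NDTokenizer3D, a generalist 3D vision-language model that tokenizes 3D scenes in a fine-grained, holistic way using a multi-scale NDT representation and the MSDec decoder, and supports a range of 3D understanding tasks and human-interactive prompting.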

Details Motivation: Existing 3D vision-language models still struggle to tokenize 3D scenes into holistic scene tokens and to use those tokens uniformly across tasks. Method: Proposes a three-stage scene tokenization pipeline built on a multi-scale NDT representation and a Multi-Scale NDT Decoder (MSDec); MSDec fuses cross-scale features into global scene tokens usable by large language models and is extended to interactive prompting and segmentation decoding. Result: NDTokenizer3D achieves marked improvements on 3D referring segmentation, 3D visual question answering, and 3D dense captioning, with fine-grained understanding and unified multi-task handling. Conclusion: With a compact, unified design, NDTokenizer3D connects language reasoning with 3D spatial understanding and offers an efficient, flexible solution for general 3D vision-language modeling. Abstract: Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.
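
A Normal Distributions Transform summarizes the points that fall in each spatial cell by a Gaussian, that is, a mean and a covariance. The sketch below builds that summary at two cell sizes to suggest what a multi-scale NDT representation could look like; the cell sizes, function name, and toy point cloud are assumptions, and the paper's actual construction may differ.

```python
from collections import defaultdict
from math import floor


def ndt_cells(points, cell_size):
    """Group 3-D points into cubic cells and summarise each cell by mean and covariance."""
    buckets = defaultdict(list)
    for p in points:
        key = tuple(floor(c / cell_size) for c in p)
        buckets[key].append(p)

    cells = {}
    for key, pts in buckets.items():
        n = len(pts)
        mean = [sum(p[k] for p in pts) / n for k in range(3)]
        # Population covariance (divide by n); a single-point cell gets all zeros.
        cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in pts) / n
                for b in range(3)] for a in range(3)]
        cells[key] = {"count": n, "mean": mean, "cov": cov}
    return cells


# Toy point cloud (assumed): two small clumps along the x axis.
points = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (1.2, 0.2, 0.1), (1.4, 0.4, 0.1)]
fine = ndt_cells(points, cell_size=1.0)    # two cells, one per clump
coarse = ndt_cells(points, cell_size=2.0)  # one cell covering all four points
```

The fine scale keeps local geometry (each clump gets its own Gaussian) while the coarse scale captures global context in fewer cells, which matches the abstract's stated goal of preserving both.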

[112] When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models

Hui Lu,Yi Yu,Yiming Yang,Chenyu Yi,Qixin Zhang,Bingquan Shen,Alex C. Kot,Xudong Jiang

Main category: cs.CV

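TL;DR: Proposes UPA-RFAS, a universal, transferable adversarial patch attack on vision-language-action (VLA) models that remains effective under unknown architectures and cross-model settings, exposing security vulnerabilities of VLA systems in practical scenarios.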

Details Motivation: Most existing adversarial patches overfit to a single model and transfer poorly in black-box settings, and universal attacks on VLA models have not been studied systematically. Method: Proposes the UPA-RFAS framework, which combines feature-space optimization with a robustness-augmented two-phase min-max training procedure and adds attention-dominance and semantic-misalignment losses to generate physically realizable universal adversarial patches without labels. Result: Experiments show that the method transfers well across models, tasks, and viewpoints over a variety of VLA models, manipulation tasks, and real physical environments. Conclusion: UPA-RFAS reveals a real patch-attack threat to VLA-driven robots and provides a strong baseline for future research on defenses. Abstract: Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of universal, transferable adversarial patches against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce UPA-RFAS (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an $\ell_1$ deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text$\to$vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.

[113] You Can Trust Your Clustering Model: A Parameter-free Self-Boosting Plug-in for Deep Clustering

Hanyang Li,Yuheng Jia,Hui Liu,Junhui Hou

Main category: cs.CV

TL;DR: Proposes DCBoost, a parameter-free plug-in that uses local structural consistency to strengthen the global feature structure of deep clustering models, significantly improving the clustering performance of several existing methods.

Details Motivation: Existing deep clustering methods suffer from a mismatch between global and local feature structures, which leads to blurred cluster boundaries and poor separation. Method: High-confidence samples are selected through adaptive k-nearest-neighbor consistency filtering to serve as reliable anchors, and a discriminative loss computed on these samples promotes intra-class compactness and inter-class separability to guide network optimization. Result: Experiments on multiple benchmark datasets show that DCBoost significantly improves existing deep clustering models, improving current state-of-the-art baselines (e.g., ProPos) by more than 3% and raising the silhouette coefficient by more than 7×. Conclusion: DCBoost is an effective plug-and-play module that uses reliable local information to strengthen the global feature structure and significantly improve deep clustering performance. Abstract: Recent deep clustering models have produced impressive clustering performance. However, a common issue with existing methods is the disparity between global and local feature structures. While local structures typically show strong consistency and compactness within class samples, global features often present intertwined boundaries and poorly separated clusters. Motivated by this observation, we propose DCBoost, a parameter-free plug-in designed to enhance the global feature structures of current deep clustering models. By harnessing reliable local structural cues, our method aims to elevate clustering performance effectively. Specifically, we first identify high-confidence samples through adaptive $k$-nearest neighbors-based consistency filtering, aiming to select a sufficient number of samples with high label reliability to serve as trustworthy anchors for self-supervision. Subsequently, these samples are utilized to compute a discriminative loss, which promotes both intra-class compactness and inter-class separability, to guide network optimization. Extensive experiments across various benchmark datasets showcase that our DCBoost significantly improves the clustering performance of diverse existing deep clustering models. Notably, our method improves the performance of current state-of-the-art baselines (e.g., ProPos) by more than 3% and amplifies the silhouette coefficient by over $7\times$. Code is available at .
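
The neighbor-consistency filtering described above can be illustrated with a small sketch. This is a simplified version with a fixed `k` and a fixed agreement threshold, whereas the paper describes an adaptive $k$; the function name `knn_consistent_mask`, the threshold, and the toy data are assumptions, not the authors' implementation.

```python
import numpy as np

def knn_consistent_mask(features, labels, k=3, min_agreement=0.6):
    """Keep samples whose k nearest neighbors mostly share their cluster label."""
    # Pairwise Euclidean distances between all samples.
    diffs = features[:, None, :] - features[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Fraction of neighbors that agree with each sample's own label.
    agreement = (labels[neighbors] == labels[:, None]).mean(axis=1)
    return agreement >= min_agreement

# Two well-separated toy clusters; the last sample carries a wrong label.
features = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                     [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 0])
mask = knn_consistent_mask(features, labels)
```

Only the samples selected by `mask` would then act as anchors for a discriminative loss; the mislabeled sample is rejected because none of its neighbors share its label.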

[114] BotaCLIP: Contrastive Learning for Botany-Aware Representation of Earth Observation Data

Selene Cerna,Sara Si-Moussi,Wilfried Thuiller,Hadrien Hendrikx,Vincent Miele

Main category: cs.CV

TL;DR: This paper proposes BotaCLIP, a lightweight multimodal contrastive framework that injects ecological domain knowledge into a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical survey data, adapting the model without training from scratch.

Details Motivation: In data-scarce real-world applications such as biodiversity modeling, domain knowledge must be incorporated into foundation models effectively to improve downstream ecological tasks while avoiding expensive retraining. Method: BotaCLIP uses multimodal contrastive learning to align aerial imagery with botanical survey data and adds a regularization strategy that mitigates catastrophic forgetting, achieving domain adaptation without retraining from scratch. Result: On three ecological tasks (plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation), BotaCLIP representations outperform both the original DOFA and supervised baselines and transfer better. Conclusion: Domain-aware adaptation of foundation models can inject expert knowledge effectively and support efficient representation learning in data-limited settings, offering a practical route for applying foundation models in environmental science. Abstract: Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relevés. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the resulting embeddings serve as transferable representations for downstream predictors. Motivated by real-world applications in biodiversity modeling, we evaluated BotaCLIP representations in three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation. The results showed consistent improvements over those derived from DOFA and supervised baselines. More broadly, this work illustrates how domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings, enabling frugal representation learning.

[115] Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition

Baoli Sun,Yihan Wang,Xinzhu Ma,Zhihui Wang,Kun Lu,Zhiyong Wang

Main category: cs.CV

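TL;DR: This paper proposes Action-Region Tracking (ART), a new framework for fine-grained action recognition that uses a query-response mechanism to discover and track the dynamics of local regions. It incorporates textual semantics from vision-language models and improves recognition with a multi-level contrastive constraint and a task-specific fine-tuning mechanism, outperforming existing methods on several benchmarks.

Details Motivation: Existing action recognition methods usually capture coarse-grained motion patterns and struggle to recognize the subtle, temporally evolving local differences between fine-grained action categories, so a method that captures and tracks these subtle dynamics is needed to improve accuracy. Method: The ART framework includes a region-specific semantic activation module that uses discriminative and text-constrained semantics as queries to capture the most relevant region responses in each frame, then organizes these responses into action tracklets that characterize region-based action dynamics. A multi-level tracklet contrastive constraint optimizes the tracklets, and a task-specific fine-tuning mechanism refines the textual semantic representations extracted by the VLM. Result: Comprehensive experiments on several widely used action recognition benchmarks show that the proposed method outperforms previous state-of-the-art methods. Conclusion: By combining text-constrained semantic queries with region-response tracking, ART improves fine-grained action recognition and confirms the importance of mining local spatio-temporal dynamics and refining semantic representations for this task. Abstract: Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions evolving over time. In this work, we introduce the Action-Region Tracking (ART) framework, a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details, enabling effective distinction of similar actions. Specifically, we propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction among spatial and temporal dimensions with corresponding video features. The captured region responses are organized into action tracklets, which characterize region-based action dynamics by linking related responses across video frames in a coherent sequence. The text-constrained queries encode nuanced semantic representations derived from textual descriptions of action labels extracted by language branches within Visual Language Models (VLMs). To optimize the action tracklets, we design a multi-level tracklet contrastive constraint among region responses at spatial and temporal levels, enabling effective discrimination within each frame and correlation between adjacent frames.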
Additionally, a task-specific fine-tuning mechanism refines textual semantics such that semantic representations encoded by VLMs are preserved while optimized for task preferences. Comprehensive experiments on widely used action recognition benchmarks demonstrate superiority over previous state-of-the-art baselines.

[116] From Diffusion to One-Step Generation: A Comparative Study of Flow-Based Models with Application to Image Inpainting

Umang Agarwal,Rudraksh Sangore,Sumit Laddha

Main category: cs.CV

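TL;DR: This study systematically compares three generative modeling paradigms (DDPM, CFM, and MeanFlow), implemented with a unified TinyUNet architecture on CIFAR-10. CFM reaches an FID of 24.15 with 50 sampling steps, far better than DDPM, while MeanFlow achieves a 50× inference speedup through one-step generation at an FID of 29.15. CFM is also extended to image inpainting, where fine-tuning substantially improves PSNR and SSIM across several mask types, confirming the value of inpainting-aware training.

Details Motivation: To compare how different generative models perform under the same architecture, and to explore the potential of efficient sampling and practical applications such as image inpainting. Method: DDPM, CFM, and MeanFlow are implemented with a unified TinyUNet architecture and evaluated by FID on CIFAR-10; CFM is extended to image inpainting with mask-guided sampling and a fine-tuning strategy. Result: CFM reaches FID 24.15 with 50 sampling steps, far better than DDPM's 402.98; MeanFlow generates in a single step with FID 29.15 and 50× faster inference; in inpainting, PSNR rises from 4.95 to 8.57 dB and SSIM from 0.289 to 0.418. Conclusion: CFM gives the best generation quality and MeanFlow has a clear advantage in inference efficiency; with fine-tuning, CFM can be applied effectively to downstream tasks such as image inpainting. Abstract: We present a comprehensive comparative study of three generative modeling paradigms: Denoising Diffusion Probabilistic Models (DDPM), Conditional Flow Matching (CFM), and MeanFlow. While DDPM and CFM require iterative sampling, MeanFlow enables direct one-step generation by modeling the average velocity over time intervals. We implement all three methods using a unified TinyUNet architecture (<1.5M parameters) on CIFAR-10, demonstrating that CFM achieves an FID of 24.15 with 50 steps, significantly outperforming DDPM (FID 402.98). MeanFlow achieves FID 29.15 with single-step sampling -- a 50X reduction in inference time. We further extend CFM to image inpainting, implementing mask-guided sampling with four mask types (center, random bbox, irregular, half). Our fine-tuned inpainting model achieves substantial improvements: PSNR increases from 4.95 to 8.57 dB on center masks (+73%), and SSIM improves from 0.289 to 0.418 (+45%), demonstrating the effectiveness of inpainting-aware training.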
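
To make the contrast between iterative sampling and one-step generation concrete, here is a toy sketch. It replaces the learned network with an analytic velocity field for a straight path toward a fixed `target`, and it integrates from t = 0 to t = 1; both choices, along with every function name, are assumptions for illustration and not the paper's implementation or time convention.

```python
import numpy as np

def straight_path_velocity(x, t, target):
    """Toy instantaneous velocity of a straight path that reaches `target` at t = 1."""
    return (target - x) / (1.0 - t)

def euler_sample(x0, target, num_steps):
    """Iterative sampling: integrate the instantaneous velocity with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t stays below 1, so the division above is safe
        x = x + dt * straight_path_velocity(x, t, target)
    return x

def one_step_sample(x0, average_velocity):
    """One-step sampling: a single jump over [0, 1] using the average velocity."""
    return x0 + average_velocity

x0 = np.array([0.0, 2.0])
target = np.array([1.0, -1.0])
multi = euler_sample(x0, target, num_steps=50)                 # 50 velocity evaluations
single = one_step_sample(x0, average_velocity=target - x0)     # 1 evaluation
```

Both samplers end at the same point here because the toy path is exactly straight; with a learned model the single evaluation trades some quality for the 50-fold reduction in network calls that the abstract reports.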

[117] T3-Tracer: A Tri-level Temporal-Aware Framework for Audio Forgery Detection and Localization

Shuhan Xia,Xuannan Liu,Xing Cui,Peipei Li

Main category: cs.CV

TL;DR: This paper proposes T3-Tracer, the first framework to jointly analyze audio at the frame, segment, and audio levels for detecting partial audio forgery, using the FA-FAM and SMDAM modules for fine-grained forgery detection and boundary identification.

Details Motivation: Existing methods only detect forgery in individual frames independently and lack hierarchical modeling of transient and sustained anomalies across temporal scales, so they cope poorly with the localized semantic tampering found in partial audio forgery. Method: The T3-Tracer framework contains a Frame-Audio Feature Aggregation Module (FA-FAM) and a Segment-level Multi-Scale Discrepancy-Aware Module (SMDAM). FA-FAM fuses frame-level and audio-level information to judge the authenticity of each frame, and SMDAM uses a dual-branch structure to model inter-frame differences over multi-scale temporal windows and identify forgery boundaries. Result: Experiments on three challenging datasets show that the method achieves state-of-the-art performance in partial audio forgery detection. Conclusion: Through joint multi-level analysis, T3-Tracer improves the detection of partial audio forgery, performing especially well at capturing forgery boundaries and global semantic inconsistencies. Abstract: Recently, partial audio forgery has emerged as a new form of audio manipulation. Attackers selectively modify partial but semantically critical frames while preserving the overall perceptual authenticity, making such forgeries particularly difficult to detect. Existing methods focus on independently detecting whether a single frame is forged, lacking the hierarchical structure to capture both transient and sustained anomalies across different temporal levels. To address these limitations, we identify three key levels relevant to partial audio forgery detection and present T3-Tracer, the first framework that jointly analyzes audio at the frame, segment, and audio levels to comprehensively detect forgery traces. T3-Tracer consists of two complementary core modules: the Frame-Audio Feature Aggregation Module (FA-FAM) and the Segment-level Multi-Scale Discrepancy-Aware Module (SMDAM). FA-FAM is designed to detect the authenticity of each audio frame. It combines both frame-level and audio-level temporal information to detect intra-frame forgery cues and global semantic inconsistencies. To further refine and correct frame detection, we introduce SMDAM to detect forgery boundaries at the segment level. It adopts a dual-branch architecture that jointly models frame features and inter-frame differences across multi-scale temporal windows, effectively identifying abrupt anomalies that appear at the forged boundaries. Extensive experiments conducted on three challenging datasets demonstrate that our approach achieves state-of-the-art performance.

[118] FIELDS: Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision

Chen Ling,Henglin Shi,Hedvig Kjellström

Main category: cs.CV

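TL;DR: This paper proposes FIELDS, a new method that introduces direct 3D expression parameter supervision and an emotion recognition branch to address the failure of existing 3D face reconstruction techniques to capture subtle emotional detail, producing high-fidelity face models with authentic emotion from a single image.

Details Motivation: Existing 3D face reconstruction methods rely mainly on 2D supervision and lack 3D ground truth, so they struggle to recover the subtle emotional information in facial expressions. This work aims to improve how faithfully 3D face reconstruction reproduces genuine emotional expression. Method: The FIELDS framework combines self-supervised 2D image consistency cues with authentic 3D expression parameters from spontaneous 4D facial scans as a direct supervision signal, and adds an intensity-aware emotion recognition loss to better capture genuine emotional content while avoiding exaggerated expressions. Result: From a single input image, FIELDS generates emotion-rich and highly realistic 3D face models, significantly improves in-the-wild facial expression recognition while preserving naturalness, bridges the 2D/3D domain gap, and mitigates expression-intensity bias. Conclusion: By introducing direct 3D expression supervision and an emotion-aware loss, FIELDS reconstructs 3D faces rich in emotional detail more accurately, and outperforms existing methods in both expression fidelity and downstream task performance. Abstract: Facial expressions convey the bulk of emotional information in human communication, yet existing 3D face reconstruction methods often miss subtle affective details due to reliance on 2D supervision and lack of 3D ground truth. We propose FIELDS (Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision) to address these limitations by extending self-supervised 2D image consistency cues with direct 3D expression parameter supervision and an auxiliary emotion recognition branch. Our encoder is guided by authentic expression parameters from spontaneous 4D facial scans, while an intensity-aware emotion loss encourages the 3D expression parameters to capture genuine emotion content without exaggeration. This dual-supervision strategy bridges the 2D/3D domain gap and mitigates expression-intensity bias, yielding high-fidelity 3D reconstructions that preserve subtle emotional cues. From a single image, FIELDS produces emotion-rich face models with highly realistic expressions, significantly improving in-the-wild facial expression recognition performance without sacrificing naturalness.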

[119] Shift-Equivariant Complex-Valued Convolutional Neural Networks

Quentin Gabot,Teck-Yian Lim,Jérémy Fix,Joana Frontera-Pons,Chengfang Ren,Jean-Philippe Ovarlez

Main category: cs.CV

TL;DR: This paper extends Learnable Polyphase Sampling (LPS) to complex-valued neural networks and proposes a projection layer from the complex domain to the real domain, placed before the Gumbel Softmax, that preserves shift equivariance and invariance. Experiments confirm its effectiveness on classification, reconstruction, and semantic segmentation of polarimetric SAR images.

Details Motivation: Downsampling and upsampling operations break shift equivariance and invariance in conventional convolutional neural networks, and there is no systematic way to guarantee these properties. Method: LPS is extended to complex-valued neural networks, a complex-to-real projection layer is introduced, and the result is combined with the Gumbel Softmax to obtain theoretically guaranteed shift equivariance and invariance. Result: The method is validated on classification, reconstruction, and semantic segmentation of polarimetric SAR images, where it shows good equivariance and invariance behavior. Conclusion: The proposed complex-valued LPS framework offers an effective route to systematic shift equivariance and invariance, applies to a range of computer vision tasks, and is especially advantageous for complex-valued data such as SAR images. Abstract: Convolutional neural networks have shown remarkable performance in recent years on various computer vision problems. However, the traditional convolutional neural network architecture lacks a critical property: shift equivariance and invariance, broken by downsampling and upsampling operations. Although data augmentation techniques can help the model learn the latter property empirically, a consistent and systematic way to achieve this goal is by designing downsampling and upsampling layers that theoretically guarantee these properties by construction. Adaptive Polyphase Sampling (APS) introduced the cornerstone for shift invariance, later extended to shift equivariance with Learnable Polyphase up/downsampling (LPS) applied to real-valued neural networks. In this paper, we extend the work on LPS to complex-valued neural networks both from a theoretical perspective and with a novel building block of a projection layer from $\mathbb{C}$ to $\mathbb{R}$ before the Gumbel Softmax. We finally evaluate this extension on several computer vision problems, specifically for either the invariance property in classification tasks or the equivariance property in both reconstruction and semantic segmentation problems, using polarimetric Synthetic Aperture Radar images.

[120] AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs

Shuhan Xia,Peipei Li,Xuannan Liu,Dongsen Zhang,Xinyu Guo,Zekun Li

Main category: cs.CV

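TL;DR: This paper presents AVFakeBench, the first audio-video forgery detection benchmark covering rich forgery semantics. It contains 12K audio-video questions spanning seven forgery types and four levels of annotation, with high-quality forged data generated by a multi-stage hybrid forgery framework. Evaluation shows that current audio-video large models still have clear weaknesses in fine-grained perception and reasoning.

Details Motivation: Existing audio-video forgery detection benchmarks are limited to DeepFake content and single-granularity annotations and cannot reflect the complex, diverse forgery scenarios of the real world, so a more comprehensive, multi-level evaluation benchmark is needed. Method: AVFakeBench contains 12K carefully curated audio-video questions covering seven forgery types across human and general subjects. A multi-stage hybrid forgery framework combines task-planning models with expert generative models to produce high-quality forged samples, and a multi-task evaluation suite covers binary judgment, forgery type classification, detail localization, and explanatory reasoning. Result: Eleven audio-video large language models and two mainstream detection methods were evaluated on AVFakeBench; AV-LMMs show potential as forgery detectors but perform weakly in fine-grained perception and logical reasoning. Conclusion: AVFakeBench provides a more comprehensive evaluation platform for audio-video forgery detection, exposes the limitations of existing models in complex forgery scenarios, and points future research toward fine-grained analysis and explainable reasoning. Abstract: The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human subject and general subject. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery types classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.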

[121] LaGen: Towards Autoregressive LiDAR Scene Generation

Sizhuo Zhou,Xiaosong Jia,Fanrui Zhang,Junjie Li,Juyong Zhang,Yukang Feng,Jianwen Sun,Songbur Wong,Junqi You,Junchi Yan

Main category: cs.CV

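TL;DR: This paper presents LaGen, the first framework capable of long-horizon, frame-by-frame autoregressive generation of LiDAR scenes. It generates high-fidelity 4D point clouds from a single input frame conditioned on bounding boxes, and uses scene decoupling estimation and noise modulation modules to improve interactivity and reduce error accumulation.

Details Motivation: Existing LiDAR generation methods support only single-frame generation, while prediction methods lack interactivity and cannot generate long horizons frame by frame, so neither meets the need in autonomous driving for interactive, long-duration scene modeling. Method: The LaGen framework generates LiDAR sequences autoregressively, frame by frame. A scene decoupling estimation module enables interactive control of object-level content, a noise modulation module suppresses error accumulation over long horizons, and an evaluation protocol for long-horizon generation is built on nuScenes data. Result: Experiments show that LaGen outperforms existing generation and prediction models on long-horizon LiDAR scene generation, with the largest gains in the quality of later frames. Conclusion: LaGen is the first LiDAR world model to support long-horizon frame-by-frame generation, opening a new technical route for interactive simulation and planning in autonomous driving. Abstract: Generative world models for autonomous driving (AD) have become a trending topic. Unlike the widely studied image modality, in this work we explore generative world models for LiDAR data. Existing generation methods for LiDAR data only support single frame generation, while existing prediction approaches require multiple frames of historical input and can only deterministically predict multiple frames at once, lacking interactivity. Both paradigms fail to support long-horizon interactive generation. To this end, we introduce LaGen, which to the best of our knowledge is the first framework capable of frame-by-frame autoregressive generation of long-horizon LiDAR scenes. LaGen is able to take a single-frame LiDAR input as a starting point and effectively utilize bounding box information as conditions to generate high-fidelity 4D scene point clouds. In addition, we introduce a scene decoupling estimation module to enhance the model's interactive generation capability for object-level content, as well as a noise modulation module to mitigate error accumulation during long-horizon generation. We construct a protocol based on nuScenes for evaluating long-horizon LiDAR scene generation. Experimental results comprehensively demonstrate LaGen outperforms state-of-the-art LiDAR generation and prediction models, especially on the later frames.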

[122] Unlocking Zero-shot Potential of Semi-dense Image Matching via Gaussian Splatting

Juncheng Chen,Chao Xu,Yanjun Cao

Main category: cs.CV

TL;DR: This paper presents MatchGS, the first systematic framework that uses 3D Gaussian Splatting (3DGS) for zero-shot image matching, significantly improving matching performance through geometric correction and 2D-3D representation alignment.

Details Motivation: Learning-based image matching depends on high-quality training data, but data generated with existing 3DGS suffers from geometric inaccuracy and depth bias, which prevents reliable correspondence labeling. Method: A two-part approach: (1) a geometrically faithful data generation pipeline that refines 3DGS geometry to produce highly precise correspondence labels; (2) a 2D-3D representation alignment strategy that injects the explicit 3D knowledge of 3DGS into the 2D matcher, guiding it to learn viewpoint-invariant 3D representations. Result: The generated ground-truth correspondences reduce epipolar error by up to 40 times, support supervision under extreme viewpoint changes, and provide self-supervisory signals through Gaussian attributes. State-of-the-art matchers trained only on this data improve zero-shot performance on public benchmarks by up to 17.7%. Conclusion: With proper geometric correction, 3DGS can serve as a scalable, high-fidelity, structurally rich data source that advances a new generation of robust zero-shot image matchers. Abstract: Learning-based image matching critically depends on large-scale, diverse, and geometrically accurate training data. 3D Gaussian Splatting (3DGS) enables photorealistic novel-view synthesis and thus is attractive for data generation. However, its geometric inaccuracies and biased depth rendering currently prevent robust correspondence labeling. To address this, we introduce MatchGS, the first framework designed to systematically correct and leverage 3DGS for robust, zero-shot image matching. Our approach is twofold: (1) a geometrically-faithful data generation pipeline that refines 3DGS geometry to produce highly precise correspondence labels, enabling the synthesis of a vast and diverse range of viewpoints without compromising rendering fidelity; and (2) a 2D-3D representation alignment strategy that infuses 3DGS' explicit 3D knowledge into the 2D matcher, guiding 2D semi-dense matchers to learn viewpoint-invariant 3D representations. Our generated ground-truth correspondences reduce the epipolar error by up to 40 times compared to existing datasets, enable supervision under extreme viewpoint changes, and provide self-supervisory signals through Gaussian attributes. Consequently, state-of-the-art matchers trained solely on our data achieve significant zero-shot performance gains on public benchmarks, with improvements of up to 17.7%. Our work demonstrates that with proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, and structurally-rich data source, paving the way for a new generation of robust zero-shot image matchers.

[123] Co-Training Vision Language Models for Remote Sensing Multi-task Learning

Qingyun Li,Shuran Ma,Junwei Luo,Yi Yu,Yue Zhou,Fengxiang Wang,Xudong Lu,Xiaoxing Wang,Xin He,Yushi Chen,Xue Yang,Junchi Yan

Main category: cs.CV

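TL;DR: This paper presents RSCoVLM, a simple yet flexible vision language model baseline for remote sensing multi-task learning. With a data curation engine, a dynamic-resolution strategy, and a Zoom-in Chain mechanism, it achieves leading performance across many tasks, and all resources are open-sourced to advance general-purpose remote sensing models.

Details Motivation: Following the success of Transformers on individual remote sensing tasks, researchers want a unified model that performs well on many tasks at once. Multi-task learning (MTL) offers better generalization, scalability, and practicality than single-task methods. Vision language models (VLMs) have also shown promise in remote sensing image understanding, grounding, and ultra-high-resolution image reasoning, but a unified, open baseline is missing. This work therefore aims to establish a general VLM baseline for remote sensing MTL. Method: RSCoVLM comprises: (1) a data curation engine covering data acquisition, offline processing and integration, and online loading and weighting, which generates flexible vision-language conversations; (2) a unified dynamic-resolution strategy that handles the varied scales of remote sensing imagery; (3) a Zoom-in Chain mechanism for ultra-high-resolution images, with the accompanying LRS-VQA-Zoom dataset; (4) enhanced object detection capability and a new evaluation protocol for fair comparison between VLMs and conventional detection models. Result: Experiments show that RSCoVLM achieves state-of-the-art performance across many remote sensing tasks, surpassing existing remote sensing VLMs and approaching specialized expert models. The dynamic-resolution strategy and Zoom-in Chain reduce the computational burden, and the new evaluation protocol improves comparability. All tools, model weights, and datasets are open-sourced. Conclusion: As a simple yet flexible baseline, RSCoVLM advances general-purpose vision language models for multi-task, cross-scale remote sensing understanding, and its open-source release supports reproducibility and further research in the field. Abstract: With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning, respectively. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. Firstly, we create the data curation engine, including data acquisition, offline processing and integrating, as well as online loading and weighting. This data engine effectively addresses complex RS data environment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. The strategies are flexible and effectively mitigate the computational burdens.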
Additionally, we significantly enhance the model's object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All the training and evaluating tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.

[124] PathMamba: A Hybrid Mamba-Transformer for Topologically Coherent Road Segmentation in Satellite Imagery

Jules Decaestecker,Nicolas Vigne

Main category: cs.CV

TL;DR: This paper proposes PathMamba, a new hybrid architecture that combines Mamba's state space model with the Transformer's global reasoning for road segmentation in satellite imagery, markedly improving topological continuity while maintaining high accuracy and linear computational efficiency.

Details Motivation: Existing Vision Transformer methods capture global context in road segmentation, but their quadratic computational complexity limits efficient deployment on resource-constrained devices; road networks are long, continuous structures that call for more efficient sequence modeling. Method: PathMamba uses Mamba blocks to capture the continuous topology of roads and Transformer blocks to fuse global context, yielding efficient and topologically consistent segmentation. Result: The model reaches state-of-the-art performance on the DeepGlobe and Massachusetts Roads datasets, with a particularly large gain on the APLS metric, while keeping computational cost low. Conclusion: By combining Mamba's linear efficiency with the Transformer's global modeling, PathMamba extracts roads with high accuracy and strong topological continuity, offering an efficient and practical solution for remote sensing image segmentation. Abstract: Achieving both high accuracy and topological continuity in road segmentation from satellite imagery is a critical goal for applications ranging from urban planning to disaster response. State-of-the-art methods often rely on Vision Transformers, which excel at capturing global context, yet their quadratic complexity is a significant barrier to efficient deployment, particularly for on-board processing in resource-constrained platforms. In contrast, emerging State Space Models like Mamba offer linear-time efficiency and are inherently suited to modeling long, continuous structures. We posit that these architectures have complementary strengths. To this end, we introduce PathMamba, a novel hybrid architecture that integrates Mamba's sequential modeling with the Transformer's global reasoning. Our design strategically uses Mamba blocks to trace the continuous nature of road networks, preserving topological structure, while integrating Transformer blocks to refine features with global context. This approach yields topologically superior segmentation maps without the prohibitive scaling costs of pure attention-based models. Our experiments on the DeepGlobe Road Extraction and Massachusetts Roads datasets demonstrate that PathMamba sets a new state-of-the-art. Notably, it significantly improves topological continuity, as measured by the APLS metric, setting a new benchmark while remaining computationally competitive.

[125] CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation

Chenyu Liu,Hongze Chen,Jingzhi Bao,Lingting Zhu,Runze Zhang,Weikai Chen,Zeyu Hu,Yingda Yin,Keyang Luo,Xin Wang

Main category: cs.CV

TL;DR: This paper proposes CaliTex, a 3D texture generation framework based on geometry-calibrated attention that uses Part-Aligned Attention and Condition-Routed Attention to resolve the cross-view inconsistency of existing diffusion models.

Details Motivation: Existing 3D texture generation systems produce textures that are inconsistent across views because of attention ambiguity, which undermines a stable coupling between geometry and appearance. Method: A geometry-calibrated attention mechanism is introduced, consisting of Part-Aligned Attention, which spatially aligns semantically matched parts, and Condition-Routed Attention, which routes appearance information through geometry-conditioned pathways, combined with a two-stage diffusion transformer. Result: CaliTex preserves spatial fidelity well, generates seamless and view-consistent textures, and outperforms existing methods against both open-source and commercial baselines. Conclusion: CaliTex makes geometric coherence an inherent behavior of the network, effectively resolving cross-view inconsistency in diffusion-based 3D texture generation. Abstract: Despite major advances brought by diffusion-based models, current 3D texture generation systems remain hindered by cross-view inconsistency -- textures that appear convincing from one viewpoint often fail to align across others. We find that this issue arises from attention ambiguity, where unstructured full attention is applied indiscriminately across tokens and modalities, causing geometric confusion and unstable appearance-structure coupling. To address this, we introduce CaliTex, a framework of geometry-calibrated attention that explicitly aligns attention with 3D structure. It introduces two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention which routes appearance information through geometry-conditioned pathways to maintain spatial fidelity. Coupled with a two-stage diffusion transformer, CaliTex makes geometric coherence an inherent behavior of the network rather than a byproduct of optimization. Empirically, CaliTex produces seamless and view-consistent textures and outperforms both open-source and commercial baselines.

[126] HTTM: Head-wise Temporal Token Merging for Faster VGGT

Weitian Wang,Lukas Meiner,Rai Shubham,Cecilia De La Parra,Akash Kumar

Main category: cs.CV

TL;DR: Proposes HTTM, a training-free 3D token merging method for accelerating the VGGT model, achieving up to 7× faster inference while preserving performance.

Details Motivation: The global attention mechanism of VGGT causes high inference latency in large-scene 3D reconstruction, so an acceleration method is needed. Method: Head-wise temporal merging (HTTM) merges tokens at the granularity of individual attention heads, exploiting head-level spatial and temporal structure to merge more efficiently. Result: HTTM achieves up to 7× acceleration with a negligible performance drop in GPU-based inference. Conclusion: By preserving the distinctiveness of features across attention heads, HTTM improves VGGT inference efficiency and suits large-scene reconstruction with long-sequence inputs. Abstract: The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. Thus, HTTM achieves up to 7x acceleration with negligible performance drops in a GPU-based inference.

[127] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Stefanos Koutoupis,Michaela Areti Zervou,Konstantinos Kontras,Maarten De Vos,Panagiotis Tsakalides,Grigorios Tsagatakis

Main category: cs.CV

TL;DR: This paper proposes the Contrastive Fusion (ConFu) framework, which extends the conventional contrastive learning objective to jointly embed individual modalities and their fused representations, capturing higher-order modality dependencies while maintaining good pairwise alignment.

Details Motivation: Existing multimodal learning methods focus mainly on pairwise alignment, model higher-order interactions poorly, and are limited on single-modality tasks. A method is needed that captures higher-order relationships while preserving pairwise ones. Method: ConFu adds a fused-modality contrastive term to the conventional pairwise contrastive objective and embeds each individual modality together with its fused combinations in a unified representation space, learning a joint multimodal representation. Result: ConFu is validated on synthetic data and real-world multimodal benchmarks, where it better captures cross-modal complementarity and higher-order dependencies (such as XOR relationships), achieves competitive results on retrieval and classification, and supports unified one-to-one and two-to-one retrieval. Conclusion: ConFu models higher-order modality interactions effectively while maintaining strong pairwise alignment, providing a more comprehensive and scalable contrastive framework for multimodal representation learning. Abstract: Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
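
The claim that XOR-like dependencies cannot be recovered through pairwise alignment can be checked on a four-sample toy case. The sketch below is only an illustration of that statistical point, not the ConFu objective; the binary "modalities" `a`, `b`, `c` and the hand-built fused feature are assumptions.

```python
import numpy as np

# Three toy binary "modalities": c is the XOR of a and b.
a = np.array([0, 0, 1, 1])
b = np.array([0, 1, 0, 1])
c = a ^ b

# Pairwise view: c is uncorrelated with a alone and with b alone.
corr_ac = np.corrcoef(a, c)[0, 1]
corr_bc = np.corrcoef(b, c)[0, 1]

# Fused view: a feature built from (a, b) jointly determines c exactly.
fused_ab = (2 * a - 1) * (2 * b - 1)  # +1 when a == b, -1 otherwise
corr_fused_c = np.corrcoef(fused_ab, c)[0, 1]
```

Here both pairwise correlations vanish while the fused feature is perfectly (negatively) correlated with `c`, which is the kind of signal a fused-modality contrastive term is meant to expose.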

[128] Hybrid SIFT-SNN for Efficient Anomaly Detection of Traffic Flow-Control Infrastructure

Munish Rathee,Boris Bačić,Maryam Doborjeh

Main category: cs.CV

TL;DR: Proposes the SIFT-SNN framework, a low-latency neuromorphic signal-processing pipeline for structural anomalies in transport infrastructure that combines SIFT feature extraction with spiking neural network classification. It reaches 92.3% accuracy and 9.5 ms inference on a real-world dataset and supports interpretable, low-power edge deployment.

Details Motivation: Real-time, low-power structural safety monitoring of transport infrastructure (such as movable concrete barriers) under changing conditions requires overcoming the high latency, poor interpretability, and high power consumption of conventional CNNs on edge devices. Method: The hybrid SIFT-SNN framework uses SIFT for spatial feature encoding, converts the features into spikes with a latency-driven spike conversion layer, classifies them with a Leaky Integrate-and-Fire (LIF) Spiking Neural Network (SNN), and is validated by deployment on an embedded system. Result: On the 6,000-frame Auckland Harbour Bridge dataset, the system achieves 92.3% ± 0.8% classification accuracy with a per-frame inference time of 9.5 ms and 8.1% spike activity, giving sub-10-millisecond latency with real-time, low-power operation. Conclusion: The SIFT-SNN framework delivers fast, low-power structural anomaly detection while keeping spatial features interpretable, is suitable for edge deployment, and offers a generalizable approach to safety monitoring of transport infrastructure, although its generalization to unseen field conditions remains to be validated. Abstract: This paper presents the SIFT-SNN framework, a low-latency neuromorphic signal-processing pipeline for real-time detection of structural anomalies in transport infrastructure. The proposed approach integrates Scale-Invariant Feature Transform (SIFT) for spatial feature encoding with a latency-driven spike conversion layer and a Leaky Integrate-and-Fire (LIF) Spiking Neural Network (SNN) for classification. The Auckland Harbour Bridge dataset is recorded under various weather and lighting conditions, comprising 6,000 labelled frames that include both real and synthetically augmented unsafe cases. The presented system achieves a classification accuracy of 92.3% (±0.8%) with a per-frame inference time of 9.5 ms. Achieved sub-10 millisecond latency, combined with sparse spike activity (8.1%), enables real-time, low-power edge deployment. Unlike conventional CNN-based approaches, the hybrid SIFT-SNN pipeline explicitly preserves spatial feature grounding, enhances interpretability, supports transparent decision-making, and operates efficiently on embedded hardware. Although synthetic augmentation improved robustness, generalisation to unseen field conditions remains to be validated. The SIFT-SNN framework is validated through a working prototype deployed on a consumer-grade system and framed as a generalisable case study in structural safety monitoring for movable concrete barriers, which, as a traffic flow-control infrastructure, is deployed in over 20 cities worldwide.
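
For readers unfamiliar with the Leaky Integrate-and-Fire neuron named above, the following is a minimal discrete-time sketch of a single LIF unit. The decay factor `beta`, the threshold, the reset-to-zero rule, and the constant input are assumptions chosen for illustration; the paper's actual neuron parameters and spike conversion layer are not given in the abstract.

```python
def lif_spike_train(inputs, beta=0.9, threshold=1.0):
    """Simulate one Leaky Integrate-and-Fire neuron over a sequence of input currents."""
    membrane = 0.0
    spikes = []
    for current in inputs:
        # Leak: decay the previous potential, then integrate the new input.
        membrane = beta * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes

# A constant sub-threshold input needs several steps to accumulate before each spike.
spikes = lif_spike_train([0.3] * 8)
```

The neuron stays silent on most steps and fires only when enough input has accumulated, which is the source of the sparse spike activity that the abstract links to low-power operation.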

[129] SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

Tae-Min Choi,Tae Kyeong Jeong,Garam Kim,Jaemin Lee,Yeongyoon Koh,In Cheul Choi,Jae-Ho Chung,Jong Woong Park,Juyoun Park

Main category: cs.CV

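TL;DR: SurgMLLMBench is a unified multimodal benchmark for developing and evaluating interactive multimodal large language models for surgical scene understanding. It includes pixel-level segmentation and structured VQA annotations and introduces the new MAVIS dataset.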

Details Motivation: Existing surgical datasets mostly use a visual question answering format with heterogeneous taxonomies and lack pixel-level segmentation support, which limits model comparability and applicability. Method: Builds the SurgMLLMBench benchmark with pixel-level instrument segmentation masks and structured VQA annotations under a unified taxonomy, covering laparoscopic, robot-assisted, and micro-surgical domains, and integrates the newly collected MAVIS dataset. Result: A single model trained on SurgMLLMBench performs consistently across surgical domains and generalizes effectively to unseen datasets. Conclusion: SurgMLLMBench provides a robust, public resource for multimodal surgical AI research, supporting reproducible evaluation and the development of interactive surgical reasoning models. Abstract: Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.

[130] PFF-Net: Patch Feature Fitting for Point Cloud Normal Estimation

Qing Li,Huifang Feng,Kanle Shi,Yue Gao,Yi Fang,Yu-Shen Liu,Zhizhong Han

Main category: cs.CV

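TL;DR: Proposes a point cloud normal estimation method based on multi-scale feature fusion. A patch feature fitting (PFF) framework performs multi-scale feature aggregation and cross-scale feature compensation, addressing the difficulty of choosing a neighborhood scale for different data and geometries, and achieves state-of-the-art performance with fewer parameters and shorter running time.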

Details Motivation: Existing point cloud normal estimation methods struggle to determine a suitable local neighborhood size for different data or geometries and often rely on parameter-heavy strategies, making it hard to balance accuracy and efficiency. Method: Proposes a patch feature fitting (PFF) framework based on multi-scale feature fusion. It comprises a multi-scale feature aggregation module, which progressively aggregates multi-scale patch features toward the patch center while shrinking the patch, and a cross-scale feature compensation module, which improves the reusability of large-scale features and reveals associated information across scales, in order to approximate the optimal geometric description. Result: Achieves state-of-the-art normal estimation performance on both synthetic and real-world datasets, with fewer model parameters and shorter running time. Conclusion: The proposed multi-scale feature approximation strategy adapts to the varying scales of local patches and provides an optimal feature description, significantly improving the robustness, accuracy, and efficiency of point cloud normal estimation. Abstract: Estimating the normal of a point requires constructing a local patch to provide center-surrounding context, but determining the appropriate neighborhood size is difficult when dealing with different data or geometries. Existing methods commonly employ various parameter-heavy strategies to extract a full feature description from the input patch. However, they still have difficulties in accurately and efficiently predicting normals for various point clouds. In this work, we present a new idea of feature extraction for robust normal estimation of point clouds. We use the fusion of multi-scale features from different neighborhood sizes to address the issue of selecting reasonable patch sizes for various data or geometries. We seek to model a patch feature fitting (PFF) based on multi-scale features to approximate the optimal geometric description for normal estimation and implement the approximation process via multi-scale feature aggregation and cross-scale feature compensation. The feature aggregation module progressively aggregates the patch features of different scales to the center of the patch and shrinks the patch size by removing points far from the center. It not only enables the network to precisely capture the structure characteristic in a wide range, but also describes highly detailed geometries. The feature compensation module ensures the reusability of features from earlier layers of large scales and reveals associated information in different patch sizes. Our approximation strategy based on aggregating the features of multiple scales enables the model to achieve scale adaptation of varying local patches and deliver the optimal feature description. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets with fewer network parameters and shorter running time.

[131] Endo-G$^{2}$T: Geometry-Guided & Temporally Aware Time-Embedded 4DGS For Endoscopic Scenes

Yangle Liu,Fengze Li,Kan Liu,Jieming Ma

Main category: cs.CV

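TL;DR: Proposes Endo-G²T, a geometry-guided and temporally aware training framework for 4D Gaussian splatting, for high-quality and efficient reconstruction of dynamic endoscopic scenes.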

Details Motivation: Endoscopic video exhibits strong view-dependent effects (such as specularities, wet reflections, and occlusions), and pure photometric supervision tends to cause early geometric drift that degrades reconstruction quality. Geometry needs to be anchored early while maintaining temporal consistency and efficiency. Method: (1) Generates geometry-guided priors from confidence-gated monocular depth and distills them softly using scale-invariant depth and gradient losses; (2) designs a time-embedded Gaussian field with a rotor-like rotation representation for temporally coherent dynamic modeling; (3) adopts a keyframe-constrained streaming strategy that optimizes keyframes under a max-points budget while non-keyframes receive lightweight updates, improving efficiency and long-horizon stability. Result: On the EndoNeRF and StereoMIS-P1 datasets, Endo-G²T reaches the state of the art among monocular reconstruction methods and significantly outperforms existing baselines. Conclusion: Through geometric prior guidance and temporally aware modeling, Endo-G²T effectively mitigates geometric drift in endoscopic video and achieves high-quality, efficient, and temporally consistent dynamic scene reconstruction. Abstract: Endoscopic (endo) video exhibits strong view-dependent effects such as specularities, wet reflections, and occlusions. Pure photometric supervision misaligns with geometry and triggers early geometric drift, where erroneous shapes are reinforced during densification and become hard to correct. We ask how to anchor geometry early for 4D Gaussian splatting (4DGS) while maintaining temporal consistency and efficiency in dynamic endoscopic scenes. Thus, we present Endo-G$^{2}$T, a geometry-guided and temporally aware training scheme for time-embedded 4DGS. First, geo-guided prior distillation converts confidence-gated monocular depth into supervision with scale-invariant depth and depth-gradient losses, using a warm-up-to-cap schedule to inject priors softly and avoid early overfitting. Second, a time-embedded Gaussian field represents dynamics in XYZT with a rotor-like rotation parameterization, yielding temporally coherent geometry with lightweight regularization that favors smooth motion and crisp opacity boundaries. Third, keyframe-constrained streaming improves efficiency and long-horizon stability through keyframe-focused optimization under a max-points budget, while non-keyframes advance with lightweight updates. Across EndoNeRF and StereoMIS-P1 datasets, Endo-G$^{2}$T achieves state-of-the-art results among monocular reconstruction baselines.
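To make the "confidence-gated, scale-invariant depth loss" idea concrete, here is a minimal sketch. It assumes the widely used log-space scale-invariant formulation (mean squared log difference minus the squared mean log difference) and a hard confidence threshold; the abstract does not give the exact loss, so the formula, the threshold `conf_min`, and the weight `lam` are assumptions.

```python
import math


def scale_invariant_depth_loss(pred, target, conf, conf_min=0.5, lam=1.0):
    """Log-space scale-invariant depth loss over confidence-gated pixels.

    pred, target: positive depth values per pixel (flat lists).
    conf: per-pixel confidence of the monocular depth prior in [0, 1].
    Pixels with conf < conf_min are ignored (hypothetical gating rule).
    """
    d = [
        math.log(p) - math.log(t)
        for p, t, c in zip(pred, target, conf)
        if c >= conf_min
    ]
    if not d:
        return 0.0  # nothing trusted enough to supervise
    n = len(d)
    mean_sq = sum(x * x for x in d) / n
    sq_mean = (sum(d) / n) ** 2
    # With lam = 1 a global scale factor between pred and target costs nothing.
    return mean_sq - lam * sq_mean
```

With `lam=1.0`, a prediction that differs from the prior only by a constant scale factor incurs (numerically) zero loss, which is why this family of losses suits monocular depth priors whose absolute scale is unknown.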

[132] Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning

Xin Gu,Haoji Zhang,Qihang Fan,Jingxuan Niu,Zhipeng Zhang,Libo Zhang,Guang Chen,Fan Chen,Longyin Wen,Sijie Zhu

Main category: cs.CV

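TL;DR: Proposes STVG-o1, the first framework that lets multimodal large language models (MLLMs) reach state-of-the-art spatio-temporal video grounding (STVG) performance without architectural changes. A bounding-box chain-of-thought mechanism and a multi-dimensional reinforcement reward function markedly improve region-word alignment and localization accuracy, giving leading results on several benchmarks.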

Details Motivation: Existing MLLMs perform poorly on STVG, mainly because of misaligned training objectives and weak fine-grained region-word alignment in visual encoders. A new method is needed to improve MLLM precision on this task. Method: Proposes the STVG-o1 framework, which introduces a bounding-box chain-of-thought mechanism to reason explicitly about spatio-temporal locations, and designs a multi-dimensional reinforcement reward function with format, consistency, temporal, spatial, and think rewards, providing geometry-aware supervision through reinforcement fine-tuning. Result: Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 exceeds the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matches specialized models on VidSTG, and surpasses existing MLLM-based methods by large margins. Conclusion: STVG-o1 brings off-the-shelf MLLMs to the state of the art on STVG and shows strong open-vocabulary generalization, demonstrating that MLLMs can serve as effective backbones for precise spatio-temporal grounding. Abstract: Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions. Despite their strong language understanding, multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. To address this, we propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without any architectural modifications. Our method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations in an intermediate step before producing the final prediction. We further design a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning. Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matching specialized models on VidSTG, and surpassing all existing MLLM-based approaches by large margins. It also demonstrates strong open-vocabulary generalization across datasets, establishing MLLMs as viable and powerful backbones for precise spatio-temporal grounding. Our code and models will be released.
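The temporal and spatial parts of such a reward are typically built on intersection-over-union. The sketch below shows one plausible way to combine a temporal IoU and a per-frame box IoU into a single score; the equal weights, the function names, and the averaging over ground-truth frames are assumptions, and the format, consistency, and think rewards mentioned in the abstract are not modeled here.

```python
def temporal_iou(pred, gt):
    """IoU of two time spans given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred_span, gt_span, pred_boxes, gt_boxes, w_t=0.5, w_s=0.5):
    """Weighted sum of temporal IoU and mean per-frame box IoU.

    pred_boxes, gt_boxes: dicts mapping frame index -> box.
    Ground-truth frames with no predicted box contribute an IoU of 0.
    """
    if gt_boxes:
        spatial = sum(
            box_iou(pred_boxes[f], gt_boxes[f]) for f in gt_boxes if f in pred_boxes
        ) / len(gt_boxes)
    else:
        spatial = 0.0
    return w_t * temporal_iou(pred_span, gt_span) + w_s * spatial
```

A perfect prediction scores 1.0 under this toy reward, and a prediction that nails the time span but misses every box is capped at `w_t`, which is the kind of geometry-aware signal reinforcement fine-tuning can exploit.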

[133] Monet: Reasoning in Latent Visual Space Beyond Images and Language

Qixun Wang,Yang Shi,Yifei Wang,Yuanxing Zhang,Pengfei Wan,Kun Gai,Xianghua Ying,Yisen Wang

Main category: cs.CV

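TL;DR: Proposes the Monet framework, which enables multimodal large language models to reason directly in the latent visual space by generating continuous embeddings as intermediate visual thoughts, improving visual reasoning ability.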

Details Motivation: Existing visual reasoning methods are limited by external tools, lack the flexibility of human abstract visual thinking, and struggle to reason effectively in latent space. Method: Proposes a three-stage distillation-based supervised fine-tuning (SFT) pipeline and designs VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. Also builds Monet-SFT-125K, a high-quality interleaved text-image reasoning dataset with 125K samples. Result: Monet-7B shows consistent gains on real-world perception and reasoning benchmarks and strong generalization on challenging abstract visual reasoning tasks. Conclusion: The work advances autonomous reasoning of multimodal large models in the latent visual space and provides an effective framework and empirical lessons for future visual reasoning research. Abstract: "Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.

[134] DiverseVAR: Balancing Diversity and Quality of Next-Scale Visual Autoregressive Models

Mingue Park,Prin Phunyaphibarn,Phillip Y. Lee,Minhyuk Sung

Main category: cs.CV

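TL;DR: Proposes the DiverseVAR framework, which injects noise into the text embedding at test time and combines it with a newly proposed scale-travel latent refinement technique, significantly improving the generation diversity of visual autoregressive models without retraining or fine-tuning while preserving image quality.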

Details Motivation: Visual autoregressive (VAR) models perform well at image generation but often produce highly similar images for the same prompt. This lack of diversity has been overlooked in research focused on image quality. Method: First injects noise into the text embedding to increase diversity; then proposes scale-travel, which uses a multi-scale autoencoder to extract coarse-scale tokens and resume generation at intermediate stages in order to preserve image quality. Result: Experiments show that the method reaches a new Pareto frontier between diversity and image quality, significantly increasing the diversity of generated results while minimizing quality degradation. Conclusion: DiverseVAR provides an efficient, training-free diversity enhancement scheme for VAR models, addressing a key limitation in text-to-image generation. Abstract: We introduce DiverseVAR, a framework that enhances the diversity of text-conditioned visual autoregressive models (VAR) at test time without requiring retraining, fine-tuning, or substantial computational overhead. While VAR models have recently emerged as strong competitors to diffusion and flow models for image generation, they suffer from a critical limitation in diversity, often producing nearly identical images even for simple prompts. This issue has largely gone unnoticed amid the predominant focus on image quality. We address this limitation at test time in two stages. First, inspired by diversity enhancement techniques in diffusion models, we propose injecting noise into the text embedding. This introduces a trade-off between diversity and image quality: as diversity increases, the image quality sharply declines. To preserve quality, we propose scale-travel: a novel latent refinement technique inspired by time-travel strategies in diffusion models. Specifically, we use a multi-scale autoencoder to extract coarse-scale tokens that enable us to resume generation at intermediate stages. Extensive experiments show that combining text-embedding noise injection with our scale-travel refinement significantly enhances diversity while minimizing image-quality degradation, achieving a new Pareto frontier in the diversity-quality trade-off.

[135] SAM Guided Semantic and Motion Changed Region Mining for Remote Sensing Change Captioning

Futian Wang,Mengqi Wang,Xiao Wang,Haowen Wang,Jin Tang

Main category: cs.CV

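TL;DR: Proposes a remote sensing change captioning method that combines the SAM foundation model with a knowledge graph. By fusing global visual features, semantic- and motion-level change regions, and object-of-interest information, it produces more accurate natural language descriptions of changes.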

Details Motivation: Existing remote sensing change captioning methods have weak region awareness and limited temporal alignment and lack explicit modeling of regions of interest, which limits caption accuracy and interpretability. Method: Uses a CNN/Transformer to extract global visual features, uses the SAM model to segment semantic- and motion-level change regions, and builds a knowledge graph to introduce object-of-interest knowledge; the heterogeneous sources are fused with cross-attention, and a Transformer decoder generates the natural language description. Result: Achieves state-of-the-art performance on several mainstream remote sensing change captioning datasets, significantly outperforming existing methods and confirming the value of region-level representations and knowledge injection. Conclusion: Combining the SAM foundation model with a knowledge graph is an effective and promising direction for remote sensing change captioning, strengthening the model's ability to understand and describe changed regions. Abstract: Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed between two remote sensing images captured at different times. Existing methods typically employ CNNs/Transformers to extract visual representations from the given images or incorporate auxiliary tasks to enhance the final results, with weak region awareness and limited temporal alignment. To address these issues, this paper explores the use of the SAM (Segment Anything Model) foundation model to extract region-level representations and inject region-of-interest knowledge into the captioning framework. Specifically, we employ a CNN/Transformer model to extract global-level vision features, leverage the SAM foundation model to delineate semantic- and motion-level change regions, and utilize a specially constructed knowledge graph to provide information about objects of interest. These heterogeneous sources of information are then fused via cross-attention, and a Transformer decoder is used to generate the final natural language description of the observed changes. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple widely used benchmark datasets. The source code of this paper will be released on https://github.com/Event-AHU/SAM_ChangeCaptioning

[136] E-M3RF: An Equivariant Multimodal 3D Re-assembly Framework

Adeela Islam,Stefano Fiorini,Manuel Lecha,Theodore Tsesmelis,Stuart James,Pietro Morerio,Alessio Del Bue

Main category: cs.CV

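TL;DR: Proposes E-M3RF, an equivariant multimodal 3D reassembly framework that combines geometric and color features and uses SE(3) flow matching to predict fragment transformations, significantly improving reassembly accuracy on cultural heritage datasets.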

Details Motivation: Existing deep learning methods for 3D reassembly rely mainly on geometric features, perform poorly when geometry is insufficient or ambiguous (for example with small, eroded, or symmetric fragments), and lack physical constraints that prevent overlap. Method: Proposes the E-M3RF framework, which extracts geometric features from 3D point positions with a rotation-equivariant encoder, encodes per-point color features with a Transformer, fuses them into a multimodal representation, and predicts the transformations needed for reassembly with SE(3) flow matching. Result: Experiments on four datasets (Breaking Bad, Fantastic Breaks, RePAIR, and Presious) show that on RePAIR, E-M3RF reduces rotation error by 23.1%, translation error by 13.2%, and Chamfer Distance by 18.4% compared with existing methods. Conclusion: By fusing geometry and color into a multimodal representation within an equivariant network, E-M3RF improves the reassembly accuracy of 3D fragments in complex scenarios and is particularly suited to practical applications such as cultural heritage restoration. Abstract: 3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been challenged by deep learning methods rather than classical optimization. While learning approaches have shown promising results, most still rely primarily on geometric features to assemble a whole from its parts. As a result, methods struggle when geometry alone is insufficient or ambiguous, for example, for small, eroded, or symmetric fragments. Additionally, solutions do not impose physical constraints that explicitly prevent overlapping assemblies. To address these limitations, we introduce E-M3RF, an equivariant multimodal 3D reassembly framework that takes as input the point clouds, containing both point positions and colors of fractured fragments, and predicts the transformations required to reassemble them using SE(3) flow matching. Each fragment is represented by both geometric and color features: i) 3D point positions are encoded as rotation-consistent geometric features using a rotation-equivariant encoder, ii) the colors at each 3D point are encoded with a transformer. The two feature sets are then combined to form a multimodal representation. We experimented on four datasets: two synthetic datasets, Breaking Bad and Fantastic Breaks, and two real-world cultural heritage datasets, RePAIR and Presious, demonstrating that E-M3RF on the RePAIR dataset reduces rotation error by 23.1% and translation error by 13.2%, while Chamfer Distance decreases by 18.4% compared to competing methods.

[137] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings

Jiajie Zhang,Sören Schwertfeger,Alexander Kleiner

Main category: cs.CV

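TL;DR: Proposes an unsupervised framework that automatically extracts and organizes Vision-Language-Action (VLA) pre-training data from continuous industrial video streams.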

Details Motivation: To exploit large amounts of unlabeled video of human operations and advance embodied AI in manufacturing. Method: First trains a lightweight motion tokenizer to encode motion dynamics, then uses an unsupervised action segmenter based on "Latent Action Energy" to discover and segment semantically coherent action primitives. Result: Effective segmentation of key tasks is confirmed on public benchmarks and a proprietary electric motor assembly dataset, and clustering plus quantitative assessment with a Vision-Language Model confirms the semantic coherence of the action primitives. Conclusion: This is the first fully automated end-to-end system for extracting VLA pre-training data from unstructured industrial video, offering a scalable solution for embodied AI in manufacturing. Abstract: We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.

[138] EvRainDrop: HyperGraph-guided Completion for Effective Frame and Event Stream Aggregation

Futian Wang,Fan Zhang,Xiao Wang,Mengqi Wang,Dexing Huang,Jin Tang

Main category: cs.CV

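TL;DR: Proposes a hypergraph-based spatio-temporal event stream completion mechanism that connects events at different times and locations through hypergraphs and completes sparse events by passing contextual information. It supports fusion with the RGB modality and improves event classification performance.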

Details Motivation: Existing event representation learning methods struggle with the undersampling problem caused by spatial sparsity. Method: Designs a hypergraph-guided spatio-temporal event stream completion mechanism that connects event tokens through hypergraphs and uses message passing to complete sparse events; introduces RGB tokens as hypergraph nodes for multi-modal completion; and aggregates node information across time steps with self-attention to fuse multi-modal features. Result: Extensive experiments on single-label and multi-label event classification tasks validate the effectiveness of the proposed framework. Conclusion: The method alleviates the spatial sparsity of event data, achieves better multi-modal feature learning and fusion, and improves event classification performance. Abstract: Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically use event frames, voxels, or tensors as input. Although these approaches have achieved notable progress, they struggle to address the undersampling problem caused by spatial sparsity. In this paper, we propose a novel hypergraph-guided spatio-temporal event stream completion mechanism, which connects event tokens across different times and spatial locations via hypergraphs and leverages contextual information message passing to complete these sparse events. The proposed method can flexibly incorporate RGB tokens as nodes in the hypergraph within this completion framework, enabling multi-modal hypergraph-based information completion. Subsequently, we aggregate hypergraph node information across different time steps through self-attention, enabling effective learning and fusion of multi-modal features. Extensive experiments on both single- and multi-label event classification tasks fully validated the effectiveness of our proposed framework. The source code of this paper will be released on https://github.com/Event-AHU/EvRainDrop.
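A rough sketch of how hypergraph message passing can fill in a sparse token follows. It assumes the simplest possible two-step scheme (average member features into each hyperedge, then average incident hyperedge features back into each node); the actual layers, weights, and hyperedge construction used by the authors are not described in the abstract, and the function name is hypothetical.

```python
def hypergraph_message_pass(node_feats, hyperedges):
    """One round of node -> hyperedge -> node mean aggregation.

    node_feats: list of feature vectors, one per token (event or RGB).
    hyperedges: list of lists of node indices; each inner list is one hyperedge.
    """
    dim = len(node_feats[0])
    # Step 1: each hyperedge summarizes its member nodes.
    edge_feats = [
        [sum(node_feats[i][k] for i in members) / len(members) for k in range(dim)]
        for members in hyperedges
    ]
    # Step 2: each node collects the summaries of the hyperedges it belongs to.
    updated = []
    for i, feat in enumerate(node_feats):
        incident = [e for e, members in zip(edge_feats, hyperedges) if i in members]
        if not incident:
            updated.append(list(feat))  # isolated node keeps its feature
        else:
            updated.append(
                [sum(e[k] for e in incident) / len(incident) for k in range(dim)]
            )
    return updated


# Node 1 is an "empty" token (no events); it borrows context from its neighbours.
completed = hypergraph_message_pass([[1.0], [0.0], [3.0]], [[0, 1], [1, 2]])
```

Because a hyperedge can join any number of tokens, an RGB token could be added as one more node in a hyperedge without changing the code, which mirrors the multi-modal extension described in the abstract.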

[139] MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices

Shuai Zhang,Bao Tang,Siyuan Yu,Yueting Zhu,Jingfeng Yao,Ya Zou,Shanglin Yuan,Li Yu,Wenyu Liu,Xinggang Wang

Main category: cs.CV

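TL;DR: Proposes MobileI2V, a lightweight diffusion model for real-time 720p image-to-video generation on mobile devices. It combines an efficient attention design, a time-step distillation strategy, and mobile-specific optimizations, giving a large speed-up while maintaining high quality.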

Details Motivation: Because diffusion models are computationally expensive and slow to sample, existing methods struggle to achieve real-time, high-resolution video generation on resource-constrained mobile devices, so an efficient I2V solution is needed. Method: Proposes a linear hybrid architecture denoiser that balances efficiency and quality; designs a time-step distillation strategy that compresses sampling from more than 20 steps to only 2; and applies mobile-oriented attention optimizations to speed up inference. Result: Achieves 720p video generation on mobile devices at under 100 ms per frame, roughly 10 times faster than before, with quality comparable to existing models. Conclusion: MobileI2V achieves high-quality, real-time image-to-video generation on mobile devices for the first time, offering a practical route to on-device video generation applications. Abstract: Recently, video generation has witnessed rapid advancements, drawing increasing attention to image-to-video (I2V) synthesis on mobile devices. However, the substantial computational complexity and slow generation speed of diffusion models pose significant challenges for real-time, high-resolution video generation on resource-constrained mobile devices. In this work, we propose MobileI2V, a 270M lightweight diffusion model for real-time image-to-video generation on mobile devices. The core lies in: (1) We analyzed the performance of linear attention modules and softmax attention modules on mobile devices, and proposed a linear hybrid architecture denoiser that balances generation efficiency and quality. (2) We design a time-step distillation strategy that compresses the I2V sampling steps from more than 20 to only two without significant quality loss, resulting in a 10-fold increase in generation speed. (3) We apply mobile-specific attention optimizations that yield a 2-fold speed-up for attention operations during on-device inference. MobileI2V enables, for the first time, fast 720p image-to-video generation on mobile devices, with quality comparable to existing models. Under one-step conditions, each frame of 720p video is generated in less than 100 ms. Our code is available at: https://github.com/hustvl/MobileI2V.
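For readers unfamiliar with the linear attention modules that point (1) compares against softmax attention, here is a small reference sketch. It assumes the common kernel feature map phi(x) = elu(x) + 1; the abstract does not say which kernel MobileI2V uses, so this is a generic illustration rather than the model's layer.

```python
import math


def phi(x):
    """Feature map elu(x) + 1, which keeps every entry strictly positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Kernelized attention whose cost grows linearly with sequence length.

    The key/value summary `kv` and normalizer `k_sum` are built once,
    then reused for every query instead of forming an N x N score matrix.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d_k                     # sum_j phi(k_j)
    for k, v in zip(keys, values):
        fk = phi(k)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in queries:
        fq = phi(q)
        denom = sum(fq[a] * k_sum[a] for a in range(d_k))
        out.append(
            [sum(fq[a] * kv[a][b] for a in range(d_k)) / denom for b in range(d_v)]
        )
    return out
```

The per-query work depends only on the feature dimensions, not on the number of tokens, which is the property that makes such modules attractive for high-resolution video on mobile hardware.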

[140] Frequency-Aware Token Reduction for Efficient Vision Transformer

Dong-Jae Lee,Jiwan Hur,Jaehyun Choi,Jaemyung Yu,Junmo Kim

Main category: cs.CV

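TL;DR: Proposes a frequency-aware token reduction strategy that separates high-frequency and low-frequency tokens and handles them differently, lowering the computational cost of Vision Transformers while mitigating rank collapsing and over-smoothing and improving both efficiency and performance.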

Details Motivation: Existing token reduction methods ignore the frequency characteristics of self-attention (such as rank collapsing and over-smoothing), which leads to performance loss or information loss, so a reduction strategy that preserves key frequency components is needed. Method: Partitions tokens into high-frequency and low-frequency groups: high-frequency tokens are selectively preserved, and low-frequency tokens are aggregated into one compact direct current (DC) token that retains the main low-frequency information, reducing the token count while maintaining model expressiveness. Result: Experiments show that the method significantly reduces computational overhead, mitigates rank collapsing and over-smoothing, and improves accuracy on several tasks; a frequency analysis of prior methods also reveals their implicit limitations. Conclusion: The proposed frequency-aware token reduction strategy improves Vision Transformer efficiency while maintaining or even improving performance, and offers a new perspective for understanding and improving token reduction. Abstract: Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity concerning token length remains a significant challenge. To address this, token reduction methods have been widely explored. However, existing approaches often overlook the frequency characteristics of self-attention, such as the rank collapsing and over-smoothing phenomena. In this paper, we propose a frequency-aware token reduction strategy that improves computational efficiency while preserving performance by mitigating rank collapsing. Our method partitions tokens into high-frequency tokens and low-frequency tokens. High-frequency tokens are selectively preserved, while low-frequency tokens are aggregated into a compact direct current token to retain essential low-frequency components. Through extensive experiments and analysis, we demonstrate that our approach significantly improves accuracy while reducing computational overhead and mitigating rank collapsing and over-smoothing. Furthermore, we analyze the previous methods, shedding light on their implicit frequency characteristics and limitations.
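The partition-and-aggregate idea can be sketched as follows. The sketch assumes that a token's high-frequency content is measured by its distance from the mean token (the DC component across the sequence); the paper's actual scoring rule is not given in the abstract, and the names `reduce_tokens` and `keep` are illustrative.

```python
import math


def reduce_tokens(tokens, keep):
    """Keep the `keep` most high-frequency tokens and merge the rest into one DC token.

    tokens: list of feature vectors. Returns the kept tokens in their original
    order, followed by a single averaged token for everything that was dropped.
    """
    n, dim = len(tokens), len(tokens[0])
    # DC component: the mean over all tokens.
    dc = [sum(t[k] for t in tokens) / n for k in range(dim)]
    # High-frequency score: how far each token deviates from the DC component.
    score = [
        math.sqrt(sum((t[k] - dc[k]) ** 2 for k in range(dim))) for t in tokens
    ]
    order = sorted(range(n), key=lambda i: score[i], reverse=True)
    kept_idx = sorted(order[:keep])
    rest = [i for i in range(n) if i not in kept_idx]
    reduced = [tokens[i] for i in kept_idx]
    if rest:
        # Aggregate low-frequency tokens so their shared content is not lost.
        reduced.append(
            [sum(tokens[i][k] for i in rest) / len(rest) for k in range(dim)]
        )
    return reduced


reduced = reduce_tokens([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [5.0, 1.0]], keep=1)
```

Four tokens become two here: the outlier survives untouched and the three near-identical tokens collapse into one average, so the sequence shrinks without discarding the low-frequency signal outright.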

[141] Merge and Bound: Direct Manipulations on Weights for Class Incremental Learning

Taehoon Kim,Donghwan Jang,Bohyung Han

Main category: cs.CV

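TL;DR: Proposes Merge-and-Bound (M&B), a new training method for class incremental learning that optimizes by directly manipulating model weights in parameter space, introducing two weight merging strategies and a bounded update technique to reduce catastrophic forgetting.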

Details Motivation: In class incremental learning, models tend to forget knowledge of earlier tasks as parameters are updated. Existing methods usually rely on complex architecture or objective changes and lack effective strategies that optimize directly in parameter space. Method: Proposes inter-task and intra-task weight merging: the former unifies old models by averaging the weights of models from all previous stages, and the latter merges parameters within the current stage to help learn the current task. A bounded update technique also limits the magnitude of parameter updates so the new model stays close to the old one, reducing forgetting. Result: Extensive evaluation on standard CIL benchmarks shows that M&B clearly outperforms current state-of-the-art methods without modifying the network architecture or learning objective, so it integrates easily into existing methods. Conclusion: Merge-and-Bound offers an effective, simple, replay-free CIL training paradigm and shows that controlled weight merging in parameter space is a viable, high-performing way to mitigate catastrophic forgetting. Abstract: We present a novel training approach, named Merge-and-Bound (M&B), for Class Incremental Learning (CIL), which directly manipulates model weights in the parameter space for optimization. Our algorithm involves two types of weight merging: inter-task weight merging and intra-task weight merging. Inter-task weight merging unifies previous models by averaging the weights of models from all previous stages. On the other hand, intra-task weight merging facilitates the learning of the current task by combining the model parameters within the current stage. For reliable weight merging, we also propose a bounded update technique that aims to optimize the target model with minimal cumulative updates and preserve knowledge from previous tasks; this strategy reveals that it is possible to effectively obtain new models near old ones, reducing catastrophic forgetting. M&B is seamlessly integrated into existing CIL methods without modifying architecture components or revising learning objectives. We extensively evaluate our algorithm on standard CIL benchmarks and demonstrate superior performance compared to state-of-the-art methods.
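Both ingredients, weight averaging and a bounded update, are easy to picture on flat weight vectors. The sketch below assumes uniform averaging for merging and an L2-ball projection around the merged old model for the bound; the abstract only says updates are kept minimal, so the projection rule and the `radius` parameter are assumptions rather than the authors' procedure.

```python
import math


def average_weights(models):
    """Uniformly average a list of weight vectors (inter- or intra-task merging)."""
    n = len(models)
    return [sum(m[k] for m in models) / n for k in range(len(models[0]))]


def bound_update(new, base, radius):
    """Project `new` back onto the L2 ball of the given radius around `base`."""
    delta = [a - b for a, b in zip(new, base)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= radius:
        return list(new)  # already close enough to the old model
    scale = radius / norm
    return [b + scale * d for b, d in zip(base, delta)]


# Merge the models of two earlier stages, then keep the newly trained
# weights within a fixed distance of that merged model.
old = average_weights([[1.0, 2.0], [3.0, 4.0]])
bounded = bound_update([5.0, 7.0], old, radius=2.5)
```

Staying inside a small ball around the merged old weights is one simple way to realize "new models near old ones", trading some plasticity for less forgetting.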

[142] CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation

Shizhe Sun,Wataru Ohyama

Main category: cs.CV

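TL;DR: Proposes CanKD, a cross-attention-based non-local knowledge distillation method that strengthens pixel-level feature relationships to improve knowledge transfer between teacher and student models.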

Details Motivation: Traditional self-attention-based distillation methods align teacher and student feature maps independently and cannot fully capture pixel-wise dependencies across feature maps, which limits knowledge transfer. Method: Introduces a cross-attention mechanism so that each pixel in the student feature map can dynamically attend to all pixels in the teacher feature map, enabling non-local knowledge transfer, with training requiring only an additional loss function. Result: Outperforms existing feature and hybrid distillation methods on object detection and image segmentation tasks, confirming its effectiveness. Conclusion: CanKD offers a new paradigm for attention-guided knowledge distillation in computer vision tasks, with stronger feature representation learning. Abstract: We propose Cross-Attention-based Non-local Knowledge Distillation (CanKD), a novel feature-based knowledge distillation framework that leverages cross-attention mechanisms to enhance the knowledge transfer process. Unlike traditional self-attention-based distillation methods that align teacher and student feature maps independently, CanKD enables each pixel in the student feature map to dynamically consider all pixels in the teacher feature map. This non-local knowledge transfer more thoroughly captures pixel-wise relationships, improving feature representation learning. Our method introduces only an additional loss function to achieve superior performance compared with existing attention-guided distillation methods. Extensive experiments on object detection and image segmentation tasks demonstrate that CanKD outperforms state-of-the-art feature and hybrid distillation methods. These experimental results highlight CanKD's potential as a new paradigm for attention-guided distillation in computer vision tasks. Code is available at https://github.com/tori-hotaru/CanKD
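The core operation, every student pixel attending to every teacher pixel, is ordinary scaled dot-product cross-attention. The sketch below treats flattened feature maps as lists of pixel vectors, uses the student as queries and the teacher as keys and values, and then compares the attended result to the teacher with a mean squared error. The projection-free attention and the MSE target are simplifying assumptions; the released CanKD code should be consulted for the actual loss.

```python
import math


def cross_attention(student, teacher):
    """For each student pixel, return a softmax-weighted mix of all teacher pixels."""
    dim = len(student[0])
    out = []
    for q in student:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in teacher]
        m = max(logits)  # subtract the max for numerical stability
        w = [math.exp(l - m) for l in logits]
        z = sum(w)
        out.append(
            [sum(wi / z * k[j] for wi, k in zip(w, teacher)) for j in range(dim)]
        )
    return out


def distill_loss(student, teacher):
    """MSE between the cross-attended student map and the teacher map.

    Assumes both maps have the same number of pixels and channels.
    """
    attended = cross_attention(student, teacher)
    n = len(teacher) * len(teacher[0])
    return sum(
        (a - t) ** 2
        for row_a, row_t in zip(attended, teacher)
        for a, t in zip(row_a, row_t)
    ) / n
```

Since this adds only a loss term computed from existing feature maps, it fits the abstract's claim that the method needs nothing beyond an additional loss function at training time.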

[143] Generalized Design Choices for Deepfake Detectors

Lorenzo Pellegrini,Serafino Pandolfini,Davide Maltoni,Matteo Ferrara,Marco Prati,Marco Ramilli

Main category: cs.CV

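TL;DR: Systematically studies how different design choices affect the accuracy and generalization of deepfake detection models, aiming to establish architecture-agnostic best practices that improve detection and reach state-of-the-art performance on the AI-GenBench benchmark.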

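Details Motivation: The performance of deepfake detection methods often depends on implementation details (such as data preprocessing, augmentation strategies, and optimization techniques), which makes fair comparison and identification of the key factors difficult. A systematic analysis of these design choices is therefore needed. Method: Uses controlled experiments to isolate the effect of individual design factors in training, inference, and incremental updates, evaluates their impact on detection performance, and distills broadly applicable design guidelines. Result: The experiments identify a set of design choices that consistently improve deepfake detection and reach state-of-the-art performance on the AI-GenBench benchmark. Conclusion: The proposed architecture-agnostic best practices support the design and development of future deepfake detection systems and improve model accuracy and generalization. Abstract: The effectiveness of deepfake detection methods often depends less on their core design and more on implementation details such as data preprocessing, augmentation strategies, and optimization techniques. These factors make it difficult to fairly compare detectors and to understand which factors truly contribute to their performance. To address this, we systematically investigate how different design choices influence the accuracy and generalization capabilities of deepfake detection models, focusing on aspects related to training, inference, and incremental updates. By isolating the impact of individual factors, we aim to establish robust, architecture-agnostic best practices for the design and development of future deepfake detection systems. Our experiments identify a set of design choices that consistently improve deepfake detection and enable state-of-the-art performance on the AI-GenBench benchmark.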

[144] Self-Paced Learning for Images of Antinuclear Antibodies

Yiyang Jiang,Guangwu Qian,Jiaxin Wu,Qi Huang,Qing Li,Yongkang Wu,Xiao-Yong Wei

Main category: cs.CV

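TL;DR: Proposes a new end-to-end framework for antinuclear antibody (ANA) detection that uses instance sampling, pseudo-label assignment, and a self-paced learning strategy to improve performance in the multi-instance, multi-label setting, achieving state-of-the-art results.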

Details Motivation: Manual ANA detection is slow, labor-intensive, and requires extensive training, and many fluorescent pattern combinations exist; traditional machine learning methods struggle with the multi-instance, multi-label (MIML) complexity found in clinical practice. Method: Proposes a new MIML framework that needs no preprocessing and has three components: an instance sampler (which filters out low-confidence instances), a probabilistic pseudo-label dispatcher (which assigns labels according to distinguishability), and a self-paced weight adjustment mechanism (which adapts training dynamically), learning end to end directly from raw microscope images. Result: On the ANA dataset, gains of up to +7.0% F1-Macro and +12.6% mAP over the best prior method; top-2 rankings on several public medical MIML benchmarks, with Hamming loss and one-error reduced by up to 18.2% and 26.9%. Conclusion: The framework addresses the MIML challenges of ANA detection with greater robustness and generalization, offering an advanced and practical solution for automated clinical ANA testing. Abstract: Antinuclear antibody (ANA) testing is a crucial method for diagnosing autoimmune disorders, including lupus, Sjögren's syndrome, and scleroderma. Despite its importance, manual ANA detection is slow, labor-intensive, and demands years of training. ANA detection is complicated by over 100 coexisting antibody types, resulting in vast fluorescent pattern combinations. Although machine learning and deep learning have enabled automation, ANA detection in real-world clinical settings presents unique challenges as it involves multi-instance, multi-label (MIML) learning. In this paper, a novel framework for ANA detection is proposed that handles the complexities of MIML tasks using unaltered microscope images without manual preprocessing. Inspired by human labeling logic, it identifies consistent ANA sub-regions and assigns aggregated labels accordingly. These steps are implemented using three task-specific components: an instance sampler, a probabilistic pseudo-label dispatcher, and self-paced weight learning rate coefficients. The instance sampler suppresses low-confidence instances by modeling pattern confidence, while the dispatcher adaptively assigns labels based on instance distinguishability. Self-paced learning adjusts training according to empirical label observations. Our framework overcomes limitations of traditional MIML methods and supports end-to-end optimization. Extensive experiments on one ANA dataset and three public medical MIML benchmarks demonstrate the superiority of our framework. On the ANA dataset, our model achieves up to +7.0% F1-Macro and +12.6% mAP gains over the best prior method, setting new state-of-the-art results. It also ranks top-2 across all key metrics on public datasets, reducing Hamming loss and one-error by up to 18.2% and 26.9%, respectively. The source code can be accessed at https://github.com/fletcherjiang/ANA-SelfPacedLearning.

[145] EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?

Pierre Adorni,Minh-Tan Pham,Stéphane May,Sébastien Lefèvre

Main category: cs.CV

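TL;DR: Proposes an efficient Ensemble-of-Specialists framework for remote sensing foundation models, using lightweight, reusable task-specialist modules to make remote sensing modeling sustainable and scalable.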

Details Motivation: Existing remote sensing foundation models depend on very large parameter counts and datasets, which leads to heavy compute use and a large carbon footprint, limits accessibility, and conflicts with the principles of green AI. Method: Uses a modular design that decomposes training into multiple lightweight, task-specific ConvNeXtV2 specialist models, supporting freezing, reuse, federated learning, pruning, and continuous integration. Result: Delivers an efficient, interpretable, and extensible framework for remote sensing foundation models that supports distributed collaboration and deployment in resource-constrained settings. Conclusion: The approach offers a new direction for building sustainable, efficient remote sensing foundation models, lowering the barrier to entry and improving environmental friendliness. Abstract: Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need for training separate models for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to only a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast with the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs.

[146] The Age-specific Alzheimer's Disease Prediction with Characteristic Constraints in Nonuniform Time Span

Xin Hong,Kaifeng Huang

Main category: cs.CV

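TL;DR: This study proposes a sequential image generation method guided by quantitative metrics and combines it with an age-scaling factor to generate age-specific MRI images, with the aim of improving long-term prediction accuracy for Alzheimer's disease.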

Details Motivation: Early identification of Alzheimer's disease is essential for personalized treatment, but generating images that accurately reflect disease characteristics is challenging when the input sequence is acquired at irregular time intervals. Method: Proposes a sequential image generation method guided by quantitative metrics, introduces an age-scaling factor, and optimizes MRI synthesis with an age-scaled pixel loss. Result: Ablation studies show that including quantitative metrics significantly improves the accuracy of MRI synthesis and that the age-scaled pixel loss improves iterative image generation; in long-term disease prediction, the Structural Similarity Index reaches 0.882, indicating a high degree of similarity in the synthesized images. Conclusion: The method preserves key features of disease progression and generates high-quality age-specific MRI images, which helps improve Alzheimer's disease prediction. Abstract: Alzheimer's disease is a debilitating disorder marked by a decline in cognitive function. Timely identification of the disease is essential for the development of personalized treatment strategies that aim to mitigate its progression. The application of generated images for the prediction of Alzheimer's disease poses challenges, particularly in accurately representing the disease's characteristics when input sequences are captured at irregular time intervals. This study presents an innovative methodology for sequential image generation, guided by quantitative metrics, to maintain the essential features indicative of disease progression. Furthermore, an age-scaling factor is integrated into the process to produce age-specific MRI images, facilitating the prediction of advanced stages of the disease. The results obtained from the ablation study suggest that the inclusion of quantitative metrics significantly improves the accuracy of MRI image synthesis. Furthermore, the application of age-scaled pixel loss contributed to the enhanced iterative generation of MRI images. In terms of long-term disease prognosis, the Structural Similarity Index reached a peak value of 0.882, indicating a substantial degree of similarity in the synthesized images.

[147] Video Generation Models Are Good Latent Reward Models

Xiaoyue Mi,Wenqing Yu,Jiesong Lian,Shibo Jie,Ruizhe Zhong,Zijun Liu,Guozhen Zhang,Zixiang Zhou,Zhiyong Xu,Yuan Zhou,Qinglin Lu,Fan Tang

Main category: cs.CV

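TL;DR: Proposes the PRFL framework, which performs preference optimization in the noisy latent space and thereby addresses the high computational cost and lack of early-stage supervision caused by existing video reward models operating in pixel space.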

Details Motivation: Existing video reward models rely on vision-language models designed for pixel-space inputs, which confines ReFL optimization to late denoising steps, incurs heavy computational overhead, and cannot effectively optimize motion dynamics and structural coherence. Method: Uses a pre-trained video generation model to perform reward modeling directly in the noisy latent space and proposes the PRFL framework, which backpropagates gradients through the full denoising process without VAE decoding. Result: Experiments show that PRFL significantly improves alignment with human preferences while substantially reducing memory consumption and training time. Conclusion: By performing reward feedback learning in latent space, PRFL provides a more efficient and more effective preference-alignment method for video generation. Abstract: Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.

[148] UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes

Kang Du,Xue Liao,Junpeng Xia,Chaozheng Guo,Yi Gu,Yirui Guan,Duotun Wang,Sheng Huang,Zeyu Wang

Main category: cs.CV

TL;DR: UAVLight is a new benchmark for illumination-robust 3D reconstruction using UAVs, featuring multi-view captures under varying natural lighting conditions while maintaining consistent geometry and viewpoints.

Details Motivation: Illumination inconsistency due to changing sunlight, shadows, and cloud cover poses a major challenge for 3D reconstruction methods; existing datasets either lack sufficient lighting variation or introduce confounding geometric/semantic changes over long time spans. Method: The authors introduce UAVLight, a controlled real-world dataset with repeatable, geo-referenced UAV flight paths capturing each scene at multiple times of day to ensure natural illumination changes without altering geometry, calibration, or viewpoints. Result: UAVLight enables standardized evaluation of reconstruction methods under diverse lighting, supporting the development of more robust, consistent, and relightable 3D models in outdoor environments. Conclusion: UAVLight provides a reliable and realistic benchmark for studying illumination robustness in 3D reconstruction, particularly valuable for advancing UAV-based and neural rendering approaches. Abstract: Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable. However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness. We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. 
Each scene is captured along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints. With standardized evaluation protocols across lighting conditions, UAVLight provides a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments.

[149] Multimodal Robust Prompt Distillation for 3D Point Cloud Models

Xiang Gu,Liming Lu,Xu Zheng,Anan Du,Yongbin Zhou,Shuchao Pang

Main category: cs.CV

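TL;DR: Proposes an efficient Multimodal Robust Prompt Distillation (MRPD) framework that improves the defense of 3D point cloud models against adversarial attacks by distilling robust features during training, with no extra overhead at inference.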

Details Motivation: Existing adversarial defenses for 3D point clouds suffer from high computational overhead and poor generalization across attacks, so an efficient and general solution is needed. Method: Designs a teacher-student framework, MRPD, that fuses robust embeddings from three teacher modalities (a vision model on depth projections, a high-performance 3D model, and a text encoder) and uses a confidence-gated mechanism to dynamically align them and distill lightweight prompts into the student model. Result: Substantially outperforms existing defenses under a variety of white-box and black-box attacks, performs better on clean data, and adds no computational cost at inference. Conclusion: MRPD offers a practical new paradigm for building efficient, robust 3D vision systems, achieving strong generalization and low inference overhead through multimodal knowledge distillation. Abstract: Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applications. Existing defense methods often suffer from (1) high computational overhead and (2) poor generalization ability across diverse attack types. To bridge these gaps, we propose a novel yet efficient teacher-student framework, namely Multimodal Robust Prompt Distillation (MRPD), for distilling robust 3D point cloud models. It learns lightweight prompts by aligning the student point cloud model's features with robust embeddings from three distinct teachers: a vision model processing depth projections, a high-performance 3D model, and a text encoder. To ensure reliable knowledge transfer, this distillation is guided by a confidence-gated mechanism which dynamically balances the contribution of all input modalities. Notably, since all distillation takes place during the training stage, there is no additional computational cost at inference. Extensive experiments demonstrate that MRPD substantially outperforms state-of-the-art defense methods against a wide range of white-box and black-box attacks, while even achieving better performance on clean data. Our work presents a new, practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge.

[150] Enhanced Landmark Detection Model in Pelvic Fluoroscopy using 2D/3D Registration Loss

Chou Mo,Yehyun Suh,J. Ryan Martin,Daniel Moyer

Main category: cs.CV

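TL;DR: Proposes a U-Net framework that incorporates 2D/3D landmark registration to improve the accuracy of anatomical landmark detection in intra-operative pelvic X-ray images under variable patient pose.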

Details Motivation: Most existing pelvic X-ray landmark detection methods assume a fixed antero-posterior view and struggle with changes in intra-operative imaging angle and patient positioning, so a detection method that is robust to pose variation is needed. Method: Incorporates 2D/3D landmark registration into U-Net training through a Pose Estimation Loss, and compares a baseline U-Net, a U-Net trained with this loss, and a U-Net fine-tuned with this loss. Result: Under simulated realistic intra-operative conditions with variable pose, the proposed method significantly improves landmark detection accuracy over the baseline. Conclusion: A U-Net that incorporates 2D/3D registration information improves the robustness and accuracy of pelvic landmark detection in complex intra-operative settings and has potential for clinical use. Abstract: Automated landmark detection offers an efficient approach for medical professionals to understand patient anatomic structure and positioning using intra-operative imaging. While current detection methods for pelvic fluoroscopy demonstrate promising accuracy, most assume a fixed Antero-Posterior view of the pelvis. However, orientation often deviates from this standard view, either due to repositioning of the imaging unit or of the target structure itself. To address this limitation, we propose a novel framework that incorporates 2D/3D landmark registration into the training of a U-Net landmark prediction model. We analyze the performance difference by comparing landmark detection accuracy between the baseline U-Net, U-Net trained with Pose Estimation Loss, and U-Net fine-tuned with Pose Estimation Loss under realistic intra-operative conditions where patient pose is variable.

[151] Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

Teng Hu,Zhentao Yu,Guozhen Zhang,Zihan Su,Zhengguang Zhou,Youliang Zhang,Yuan Zhou,Qinglin Lu,Ran Yi

Main category: cs.CV

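TL;DR: This paper proposes Harmony, a framework that addresses the alignment problem in synchronized audio-video generation with open-source generative models; with a new training paradigm, an attention module, and an improved CFG method, it reaches SOTA in both audio-video synchronization and generation quality.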

Details Motivation: Existing open-source audio-video generation models have significant synchronization problems, which stem mainly from correspondence drift in the joint diffusion process, inefficient attention mechanisms, and intra-modal bias. Method: Proposes the Harmony framework: Cross-Task Synergy training to suppress latent drift, a Global-Local Decoupled Interaction Module to strengthen temporal alignment, and a Synchronization-Enhanced CFG (SyncCFG) to amplify the cross-modal alignment signal. Result: Experiments show that Harmony significantly outperforms existing methods in generation fidelity and fine-grained audio-video synchronization, setting a new state of the art. Conclusion: Through mechanistic design, Harmony effectively resolves the synchronization bottleneck in joint audio-video generation and offers a generalizable solution for multimodal generation. Abstract: The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
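For orientation, the conventional CFG that the abstract contrasts against is commonly written as below. This is only the standard baseline in one common notation: the symbols ($\epsilon_\theta$ for the denoiser, $x_t$ for the noisy latent at step $t$, $c$ for the condition, $\varnothing$ for the null condition, $w$ for the guidance scale) are assumptions of this note and do not come from the abstract, which gives no formula for SyncCFG.

```latex
\hat{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)
```

With $w = 1$ this reduces to the plain conditional prediction; larger $w$ pushes the sample further along the direction that separates the conditional from the unconditional prediction.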

[152] Deep Learning-Based Multiclass Classification of Oral Lesions with Stratified Augmentation

Joy Naoum,Revana Salama,Ali Hamdi

Main category: cs.CV

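TL;DR: This study proposes a deep learning multiclass model for early detection of oral lesions; stratified data splitting, data augmentation, and oversampling are used to handle class imbalance, and the model is reported to outperform existing methods in accuracy, precision, and recall.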

Details Motivation: Because benign and malignant oral lesions are hard to tell apart visually at an early stage, oral cancer is often diagnosed late, so reliable computer-aided diagnosis systems are needed to improve early detection and clinical outcomes. Method: Builds a deep learning multiclass classifier for sixteen types of oral lesions, combining stratified data splitting, advanced data augmentation, and oversampling to mitigate small dataset size and class imbalance. Result: Experiments reach 83.33% accuracy, 89.12% precision, and 77.31% recall, reported as clearly better than current mainstream methods, with improved classification of minority classes in particular. Conclusion: The proposed framework shows potential as a first step toward reliable, deployable computer-aided systems for early oral cancer diagnosis. Abstract: Oral cancer is highly common across the globe and is mostly diagnosed during the later stages due to the close visual similarity among benign, precancerous, and malignant lesions in the oral cavity. Implementing computer-aided diagnosis systems early on has the potential to greatly improve clinical outcomes. This research intends to use deep learning to build a multiclass classifier for sixteen different oral lesions. To overcome the challenges of limited and imbalanced datasets, the proposed technique combines stratified data splitting with advanced data augmentation and oversampling to perform the classification. The experimental results, which achieved 83.33 percent accuracy, 89.12 percent precision, and 77.31 percent recall, demonstrate the superiority of the suggested model over state-of-the-art methods now in use. The results also indicate that oversampling and augmentation strategies are effective in situations where minority-class classification performance matters. As a first step toward trustworthy computer-aided diagnostic systems for the early detection of oral cancer in clinical settings, the suggested framework shows promise.
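As a rough illustration of the two data-handling ideas named in the abstract, the sketch below performs a per-class (stratified) split and then random oversampling of the training indices. It is not the authors' pipeline: the function names, the 20% test fraction, and the choice of plain random duplication as the oversampling method are all assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps roughly the same test share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # At least one test sample per class (assumes every class has >= 2 samples).
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

def oversample(train_indices, labels, seed=0):
    """Duplicate random minority-class indices until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_indices:
        by_class[labels[idx]].append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for label in sorted(by_class):
        members = by_class[label]
        balanced.extend(members)
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced

# Hypothetical toy labels: 10 samples of class "a", 5 of class "b".
labels = ["a"] * 10 + ["b"] * 5
train, test = stratified_split(labels)
balanced = oversample(train, labels)
```

Oversampling is applied only to the training indices, after the split, so duplicated samples cannot leak into the test set.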

[153] MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training

Haotian Xue,Qi Chen,Zhonghao Wang,Xun Huang,Eli Shechtman,Jinrong Xie,Yongxin Chen

Main category: cs.CV

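TL;DR: MoGAN is a motion-centric post-training framework that needs no reward model or human preference data; by training a DiT-based optical-flow discriminator together with a distribution-matching regularizer, it significantly improves the motion realism of video diffusion models while preserving visual fidelity.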

Details Motivation: Existing video diffusion models fall short in motion coherence, dynamics, and realism; the standard denoising MSE objective gives no direct supervision on temporal consistency, so outputs show jitter, ghosting, or implausible dynamics. Method: Built on a 3-step distilled video diffusion model, a DiT-based optical-flow discriminator is trained to distinguish real from generated motion, and a distribution-matching regularizer is added to maintain visual quality. Result: Experiments on Wan2.1-T2V-1.3B show that MoGAN improves the VBench motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model; on VideoJAM-Bench the gains are +7.4% and +8.8%, respectively, with comparable or better aesthetic and image-quality scores. A human study also shows that users prefer the motion quality of MoGAN. Conclusion: MoGAN significantly improves motion realism in video generation without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Abstract: Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, allowing models to achieve low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Built atop a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity. With experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage: https://xavihart.github.io/mogan.

[154] ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images

M. Naseer Subhani

Main category: cs.CV

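TL;DR: Proposes a self-prompting, point-supervised framework that improves SAM's segmentation performance on remote sensing images through a Refine-Requery-Reinforce loop.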

Details Motivation: SAM performs well on natural images but poorly on remote sensing images because of domain shift and the scarcity of dense annotations. Method: Uses a Refine-Requery-Reinforce loop: coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Result: On three remote sensing benchmarks (WHU, HRSID, and NWPU VHR-10), the method consistently surpasses pretrained SAM and recent point-supervised methods. Conclusion: Self-prompting and semantic alignment provide an effective path toward scalable adaptation of foundation segmentation models to remote sensing using only point-level annotations. Abstract: Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting, point-supervised framework that adapts SAM to RSIs using only sparse point annotations. Our method employs a Refine-Requery-Reinforce loop, where coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM's segmentation quality and domain robustness through self-guided prompt adaptation. We evaluate our proposed method on three RSI benchmark datasets, including WHU, HRSID, and NWPU VHR-10, showing that our method consistently surpasses pretrained SAM and recent point-supervised segmentation methods. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.
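The Requery step relies on box prompts built from the model's own pseudo-masks. The abstract does not say how those boxes are constructed, so the sketch below shows one plausible reading under stated assumptions: take the tight bounding box of the nonzero mask pixels, optionally padded by a margin. The function name `mask_to_box` and the `margin` parameter are hypothetical.

```python
def mask_to_box(mask, margin=0):
    """Return (x_min, y_min, x_max, y_max) around nonzero pixels, or None if empty.

    `mask` is a list of rows of 0/1 values; `margin` pads the box in pixels
    and is clamped to the image bounds.
    """
    height, width = len(mask), len(mask[0])
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None  # empty pseudo-mask: no box prompt can be built
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    x_min = max(min(cols) - margin, 0)
    y_min = max(min(rows) - margin, 0)
    x_max = min(max(cols) + margin, width - 1)
    y_max = min(max(rows) + margin, height - 1)
    return (x_min, y_min, x_max, y_max)

# A small hypothetical pseudo-mask produced from a point prompt.
pseudo_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_box(pseudo_mask)
padded_box = mask_to_box(pseudo_mask, margin=1)
```

The resulting box would then be passed back to the segmentation model as a box prompt; that call is omitted here because it depends on the specific SAM interface in use.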

[155] Active Learning for GCN-based Action Recognition

Hichem Sahbi

Main category: cs.CV

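TL;DR: Proposes a new label-efficient graph convolutional network (GCN) model for skeleton-based action recognition, which selects representative exemplars with an adversarial strategy and introduces bidirectional and stable GCN architectures to improve performance.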

Details Motivation: Existing GCNs for skeleton-based action recognition depend on large amounts of labeled data, which is scarce in practice and limits their use. Method: Designs a new acquisition function that uses an adversarial strategy to balance representativeness, diversity, and uncertainty when selecting informative samples, and proposes bidirectional and stable GCN architectures that strengthen the mapping between the ambient and latent spaces. Result: Extensive evaluation on two challenging skeleton-based action recognition benchmarks shows significant improvements over prior work. Conclusion: The proposed label-efficient GCN model improves recognition accuracy while reducing annotation requirements, making it suitable for practical settings with limited labeled data. Abstract: Despite the notable success of graph convolutional networks (GCNs) in skeleton-based action recognition, their performance often depends on large volumes of labeled data, which are frequently scarce in practical settings. To address this limitation, we propose a novel label-efficient GCN model. Our work makes two primary contributions. First, we develop a novel acquisition function that employs an adversarial strategy to identify a compact set of informative exemplars for labeling. This selection process balances representativeness, diversity, and uncertainty. Second, we introduce bidirectional and stable GCN architectures. These enhanced networks facilitate a more effective mapping between the ambient and latent data spaces, enabling a better understanding of the learned exemplar distribution. Extensive evaluations on two challenging skeleton-based action recognition benchmarks reveal significant improvements achieved by our label-efficient GCNs compared to prior work.

[156] Qwen3-VL Technical Report

Shuai Bai,Yuxuan Cai,Ruizhe Chen,Keqin Chen,Xionghui Chen,Zesen Cheng,Lianghao Deng,Wei Ding,Chang Gao,Chunjiang Ge,Wenbin Ge,Zhifang Guo,Qidong Huang,Jie Huang,Fei Huang,Binyuan Hui,Shutong Jiang,Zhaohai Li,Mingsheng Li,Mei Li,Kaixin Li,Zicheng Lin,Junyang Lin,Xuejing Liu,Jiawei Liu,Chenglong Liu,Yang Liu,Dayiheng Liu,Shixuan Liu,Dunjie Lu,Ruilin Luo,Chenxu Lv,Rui Men,Lingchen Meng,Xuancheng Ren,Xingzhang Ren,Sibo Song,Yuchong Sun,Jun Tang,Jianhong Tu,Jianqiang Wan,Peng Wang,Pengfei Wang,Qiuyue Wang,Yuxuan Wang,Tianbao Xie,Yiheng Xu,Haiyang Xu,Jin Xu,Zhibo Yang,Mingkun Yang,Jianxin Yang,An Yang,Bowen Yu,Fei Zhang,Hang Zhang,Xi Zhang,Bo Zheng,Humen Zhong,Jingren Zhou,Fan Zhou,Jing Zhou,Yuanzhi Zhu,Ke Zhu

Main category: cs.CV

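TL;DR: Qwen3-VL is the most capable vision-language model in the Qwen series to date; it supports interleaved text, image, and video inputs of up to 256K tokens, performs strongly in pure-text understanding, long-context modeling, and multimodal reasoning, and attains leading performance through three architectural upgrades.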

Details Motivation: Improving vision-language models on long contexts, interleaved multimodal content, and complex reasoning tasks requires stronger spatial-temporal modeling, tighter vision-language alignment, and precise temporal alignment. Method: Proposes the Qwen3-VL model family, with dense and mixture-of-experts variants at several scales, and introduces an enhanced interleaved-MRoPE, DeepStack integration of multi-level ViT features, and text-based time alignment to improve temporal modeling for video. Result: Achieves leading performance on benchmarks including MMMU, MathVista, and MathVision, with strong pure-text understanding, 256K-token long-context handling, and multimodal reasoning across images and video. Conclusion: Qwen3-VL has broad application potential in multimodal understanding and reasoning and can serve as a foundation model for image-grounded reasoning, agentic decision-making, and multimodal code intelligence. Abstract: We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. 
We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

[157] Continual Error Correction on Low-Resource Devices

Kirill Paramonov,Mete Ozay,Aristeidis Mystakidis,Nikolaos Tsalikidis,Dimitrios Sotos,Anastasios Drosou,Dimitrios Tzovaras,Hyunjun Kim,Kiseok Chang,Sangdok Mo,Namwoong Kim,Woojong Yoo,Jijoong Moon,Umberto Michieli

Main category: cs.CV

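TL;DR: Proposes a lightweight AI error-correction system based on prototype updates, combining server-side knowledge distillation with on-device prototype adaptation to achieve efficient, low-overhead few-shot error correction on resource-constrained devices.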

Details Motivation: Existing AI error-detection methods lack efficient correction mechanisms; on resource-constrained devices in particular, retraining the model is impractical, which degrades user experience. Method: A server-side foundation model produces compact feature representations through knowledge distillation, and the device runs a prototype-based classifier whose prototypes are updated from a few samples to correct errors, avoiding full model fine-tuning. Result: In one-shot scenarios on Food-101 and Flowers-102, the system corrects over 50% of errors with forgetting below 0.02% and very low computational overhead, and its practicality is validated with an Android app. Conclusion: The system offers an efficient, practical way to correct AI prediction errors on resource-constrained devices, balancing performance, storage, and real-time requirements. Abstract: The proliferation of AI models in everyday devices has highlighted a critical challenge: prediction errors that degrade user experience. While existing solutions focus on error detection, they rarely provide efficient correction mechanisms, especially for resource-constrained devices. We present a novel system enabling users to correct AI misclassifications through few-shot learning, requiring minimal computational resources and storage. Our approach combines server-side foundation model training with on-device prototype-based classification, enabling efficient error correction through prototype updates rather than model retraining. The system consists of two key components: (1) a server-side pipeline that leverages knowledge distillation to transfer robust feature representations from foundation models to device-compatible architectures, and (2) a device-side mechanism that enables ultra-efficient error correction through prototype adaptation. We demonstrate our system's effectiveness on both image classification and object detection tasks, achieving over 50% error correction in one-shot scenarios on Food-101 and Flowers-102 datasets while maintaining minimal forgetting (less than 0.02%) and negligible computational overhead. Our implementation, validated through an Android demonstration app, proves the system's practicality in real-world scenarios.
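To make the idea of correcting errors through prototype updates rather than retraining concrete, here is a minimal nearest-prototype classifier in which a user correction simply folds one more feature vector into a class mean. This is a generic sketch, not the paper's system: the class name `PrototypeClassifier`, the running-mean update rule, the Euclidean distance, and the toy 2-D features are all assumptions for illustration.

```python
import math

class PrototypeClassifier:
    """Nearest-class-mean classifier over fixed feature vectors."""

    def __init__(self):
        self.sums = {}    # label -> running sum of feature vectors
        self.counts = {}  # label -> number of examples seen

    def add_example(self, label, feature):
        """Fold one labeled feature vector into that label's prototype."""
        if label not in self.sums:
            self.sums[label] = [0.0] * len(feature)
            self.counts[label] = 0
        self.sums[label] = [s + f for s, f in zip(self.sums[label], feature)]
        self.counts[label] += 1

    def prototype(self, label):
        n = self.counts[label]
        return [s / n for s in self.sums[label]]

    def predict(self, feature):
        """Return the label whose prototype is closest in Euclidean distance."""
        return min(self.sums, key=lambda lb: math.dist(self.prototype(lb), feature))

# Hypothetical features, standing in for the output of a distilled backbone.
clf = PrototypeClassifier()
clf.add_example("pizza", [1.0, 0.0])
clf.add_example("salad", [0.0, 1.0])

query = [0.4, 0.6]
before = clf.predict(query)           # misclassified by the initial prototypes
clf.add_example("pizza", [0.4, 0.6])  # one-shot user correction
after = clf.predict(query)
```

The correction touches only one small vector and one counter, which is why this style of update is cheap in both compute and storage compared with fine-tuning network weights.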

[158] CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow

Ruisheng Han,Kanglei Zhou,Shuang Chen,Amir Atapour-Abarghouei,Hubert P. H. Shum

Main category: cs.CV

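TL;DR: Proposes CaFlow, a framework that combines counterfactual de-confounding with bidirectional time-conditioned flow for action quality assessment and achieves SOTA performance.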

Details Motivation: Existing methods depend on costly annotations or unidirectional temporal modeling and struggle with confounders and temporal modeling in long-term AQA. Method: A Causal Counterfactual Regularization (CCR) module separates causal from confounding features in a self-supervised manner and strengthens robustness through counterfactual interventions; a BiT-Flow module models forward and backward dynamics with a cycle-consistency constraint to stabilize representations. Result: Experiments on multiple long-term AQA benchmarks show that CaFlow clearly outperforms existing methods and achieves the best performance. Conclusion: Through bidirectional temporal modeling and causal de-confounding, CaFlow improves the accuracy and robustness of long-term action quality assessment. Abstract: Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation. Long-term AQA, as in figure skating or rhythmic gymnastics, is especially challenging since it requires modeling extended temporal dynamics while remaining robust to contextual confounders. Existing approaches either depend on costly annotations or rely on unidirectional temporal modeling, making them vulnerable to spurious correlations and unstable long-term representations. To this end, we propose CaFlow, a unified framework that integrates counterfactual de-confounding with bidirectional time-conditioned flow. The Causal Counterfactual Regularization (CCR) module disentangles causal and confounding features in a self-supervised manner and enforces causal robustness through counterfactual interventions, while the BiT-Flow module models forward and backward dynamics with a cycle-consistency constraint to produce smoother and more coherent representations. Extensive experiments on multiple long-term AQA benchmarks demonstrate that CaFlow achieves state-of-the-art performance. Code is available at https://github.com/Harrison21/CaFlow

[159] Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

Tianyi Xiong,Yi Ge,Ming Li,Zuolong Zhang,Pranav Kulkarni,Kaishen Wang,Qi He,Zeying Zhu,Chenxi Liu,Ruibo Chen,Tong Zheng,Yanshuo Chen,Xiyao Wang,Renrui Zhang,Wenhu Chen,Heng Huang

Main category: cs.CV

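TL;DR: This study introduces Multi-Crit, a benchmark for evaluating how well large multimodal models follow diverse, fine-grained evaluation criteria, and reveals weaknesses of current models in multi-criterion consistency, flexible criterion switching, and conflict recognition.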

Details Motivation: Although large multimodal models (LMMs) are widely used as judges in multimodal evaluation, their ability to follow diverse, fine-grained evaluation criteria has not been fully explored. Method: Builds a benchmark named Multi-Crit that covers open-ended generation and verifiable reasoning tasks, collects challenging response pairs with multi-criterion human annotations through a rigorous data curation pipeline, and proposes three new metrics for pluralistic criteria adherence, criterion-switching flexibility, and recognition of preference conflicts. Result: A comprehensive analysis of 25 LMMs shows that 1) proprietary models still struggle to adhere consistently to pluralistic criteria, especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following different criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but does not generalize to fine-grained criterion-level judgment. Reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models are also analyzed. Conclusion: As the first benchmark to systematically evaluate how well multimodal judges follow pluralistic criteria, Multi-Crit lays the foundation for reliable and steerable multimodal AI evaluation systems. Abstract: Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria, especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.

[160] Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models

Naifu Zhang,Wei Tao,Xi Xiao,Qianpu Sun,Yuxin Zheng,Wentao Mo,Peiqiang Wang,Nan Zhang

Main category: cs.CV

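TL;DR: This paper proposes ADVLA, a framework that efficiently disrupts the action predictions of Vision-Language-Action (VLA) models by applying adversarial perturbations directly to features projected from the visual encoder into the textual feature space. Under low-amplitude and locally sparse constraints, it reaches an attack success rate of nearly 100% with almost imperceptible perturbations and low computational cost, significantly outperforming conventional patch-based attacks.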

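Details Motivation: Existing adversarial attacks on VLA models require costly end-to-end training and produce conspicuous perturbation patches, which limits their practical use; an efficient, low-perturbation, low-cost attack is therefore needed to assess the security of VLA models. Method: ADVLA applies adversarial perturbations directly to the visual encoder's output features after projection into the textual feature space and uses attention guidance to make the perturbations focused and sparse. It proposes three strategies (enhancing sensitivity, enforcing sparsity, and concentrating perturbations) and combines them with Top-K masking to attack effectively with very few pixel changes. Result: Under an $L_{\infty}=4/255$ constraint, ADVLA with Top-K masking modifies less than 10% of the image patches and reaches an attack success rate of nearly 100%; the perturbations concentrate on critical regions and are almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch attacks. Conclusion: ADVLA effectively weakens the downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoids the high training cost and conspicuous perturbations of conventional patch attacks, and demonstrates the effectiveness and practical value of attacking the VLA feature space. Abstract: In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies less than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.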
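The combination of an $L_{\infty}$ budget with Top-K patch masking can be pictured with a small, model-free sketch: rank patches by an attention score, keep the top K, and take one signed step of size $\epsilon$ only inside those patches. This is not ADVLA's implementation. The sketch works on toy pixel patches rather than projected features, and the attention scores, gradients, and function names are hypothetical inputs standing in for quantities a real model would supply.

```python
def topk_patch_mask(attention, k):
    """Return a 0/1 mask selecting the k patches with the highest attention."""
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(attention))]

def sparse_linf_step(patches, grads, attention, k, eps=4 / 255):
    """One signed-gradient step of size eps, applied only to the top-k patches.

    `patches` and `grads` are lists of equally sized lists of floats; pixel
    values are assumed to lie in [0, 1] and are clamped back into that range.
    """
    mask = topk_patch_mask(attention, k)
    perturbed = []
    for patch, grad, m in zip(patches, grads, mask):
        new_patch = []
        for x, g in zip(patch, grad):
            sign = (g > 0) - (g < 0)         # -1, 0, or +1
            value = x + m * eps * sign       # masked patches are left unchanged
            new_patch.append(min(max(value, 0.0), 1.0))
        perturbed.append(new_patch)
    return perturbed, mask

# Hypothetical toy input: four 2-pixel patches with made-up gradients and attention.
patches = [[0.5, 0.5], [0.2, 0.8], [1.0, 0.0], [0.3, 0.3]]
grads = [[1.0, -1.0], [2.0, 2.0], [1.0, -1.0], [-1.0, 1.0]]
attention = [0.1, 0.7, 0.9, 0.2]
adv, mask = sparse_linf_step(patches, grads, attention, k=2)
```

Because each pixel moves by at most `eps` and only masked-in patches move at all, the result stays inside the $L_{\infty}$ ball while touching a fixed fraction of patches.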

[161] Revolutionizing Glioma Segmentation & Grading Using 3D MRI - Guided Hybrid Deep Learning Models

Pandiyaraju V,Sreya Mynampati,Abishek Karthik,Poovarasan L,D. Saraswathi

Main category: cs.CV

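TL;DR: Proposes a hybrid deep learning model that combines U-Net segmentation with DenseNet-VGG classification and adds multi-head and spatial-channel attention for accurate glioma segmentation and classification in 3D MRI; experiments report a Dice coefficient of 98% and classification accuracy of 99%, outperforming traditional methods.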

Details Motivation: Gliomas have a high mortality rate, and early, accurate diagnosis is critical for treatment. Existing methods fall short in accuracy and in attending to clinically relevant features, so a more efficient and more interpretable automated diagnostic model is needed. Method: Builds a U-Net-based 3D tumor segmentation model and a classification network that fuses DenseNet and VGG, with multi-head attention and spatial-channel attention; the MRI data are preprocessed with normalization, resampling, and data augmentation. Result: Segmentation reaches a Dice coefficient of 98% with a high Mean IoU, and classification accuracy reaches 99%, outperforming traditional CNNs and attention-free models on all reported metrics; the attention mechanisms improve the focus on clinically important features and the interpretability of the model. Conclusion: The hybrid deep learning framework performs strongly on automatic glioma segmentation and classification and shows considerable potential for supporting timely, reliable clinical diagnosis and treatment planning. Abstract: Gliomas are brain tumor types that have a high mortality rate which means early and accurate diagnosis is important for therapeutic intervention for the tumors. To address this difficulty, the proposed research will develop a hybrid deep learning model which integrates U-Net based segmentation and a hybrid DenseNet-VGG classification network with multi-head attention and spatial-channel attention capabilities. The segmentation model will precisely demarcate the tumors in a 3D volume of MRI data guided by spatial and contextual information. The classification network, which combines a branch of both DenseNet and VGG, will incorporate the demarcated tumor on which features with attention mechanisms would be focused on clinically relevant features. High-dimensional 3D MRI data could successfully be utilized in the model through preprocessing steps which are normalization, resampling, and data augmentation. Through a variety of measures the framework is evaluated: measures of performance in segmentation are Dice coefficient and Mean Intersection over Union (IoU) and measures of performance in classification are accuracy, precision, recall, and F1-score. The hybrid framework that has been proposed has demonstrated through physical testing that it has the capability of obtaining a Dice coefficient of 98% in tumor segmentation, and 99% on classification accuracy, outperforming traditional CNN models and attention-free methods. Utilizing multi-head attention mechanisms enhances notions of priority in aspects of the tumor that are clinically significant, and enhances interpretability and accuracy.
The results suggest that the framework has great potential to facilitate the timely and reliable diagnosis and grading of glioma by clinicians, allowing for better planning of patient treatment.
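
For reference, the two segmentation metrics named in the abstract above, the Dice coefficient and IoU, can be computed on binary 3D masks as in the minimal sketch below. This is not the authors' evaluation code: the function names, the `eps` smoothing term, and the toy volumes are assumptions made for illustration.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks (eps avoids 0/0)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def iou(pred, target, eps=1e-7):
    """IoU = |A ∩ B| / |A ∪ B| for two binary masks (eps avoids 0/0)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (intersection + eps) / (union + eps)

# Toy 4x4x4 volumes (hypothetical data): each mask covers 32 voxels, 16 shared.
pred_mask = np.zeros((4, 4, 4), dtype=bool)
true_mask = np.zeros((4, 4, 4), dtype=bool)
pred_mask[:2] = True
true_mask[1:3] = True

dice_value = dice_coefficient(pred_mask, true_mask)
iou_value = iou(pred_mask, true_mask)
```

On these toy masks the Dice coefficient is about 0.5 and the IoU about 0.33. A Mean IoU would average the per-class or per-volume IoU values; the abstract does not say which.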

[162] Seeing without Pixels: Perception from Camera Trajectories

Zihui Xue,Kristen Grauman,Dima Damen,Andrew Zisserman,Tengda Han

Main category: cs.CV

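TL;DR: This paper is the first systematic study of whether video content can be perceived from the camera trajectory alone, without pixels. It proposes CamFormer, a contrastive learning framework that maps camera pose trajectories into a joint embedding space aligned with natural language. The results show that camera motion is a highly informative signal that reveals the action or scene content of a video, with applications in cross-modal alignment, classification, and temporal analysis.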

Details Motivation: To explore whether video content can be perceived from the camera trajectory alone, without relying on video pixels, challenging the conventional visual understanding paradigm and seeking a lighter, more robust video representation. Method: Proposes CamFormer, a contrastive-learning-based encoder framework that encodes camera pose sequences into embeddings and aligns them with natural language descriptions for cross-modal learning. Result: Experiments show that camera trajectories effectively reflect video content ("how you move" reveals "what you are doing"); the learned representations perform well on a range of downstream tasks, including cross-modal matching, classification, and temporal analysis, and are robust across different camera pose estimation methods. Conclusion: The camera trajectory is a lightweight, robust, and versatile modality for perceiving video content and offers a new perspective on video understanding. Abstract: Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
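
The abstract above says only that CamFormer is trained with a contrastive learning framework that aligns trajectory embeddings with natural language; it does not specify the loss. As a rough illustration of how such alignment is commonly set up, the sketch below implements a symmetric InfoNCE (CLIP-style) objective in NumPy. The function name, the `temperature` value, and the toy embeddings are assumptions, not details from the paper.

```python
import numpy as np

def symmetric_info_nce(traj_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of traj_emb and row i of text_emb form the positive pair;
    every other row in the batch serves as a negative.
    """
    # L2-normalise so the dot product is cosine similarity.
    traj = traj_emb / np.linalg.norm(traj_emb, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = traj @ text.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct match for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the trajectory-to-text and text-to-trajectory directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Toy batch (hypothetical data): perfectly aligned pairs vs. shifted pairs.
trajectory_embeddings = np.eye(4)
aligned_text = np.eye(4)
shifted_text = np.roll(np.eye(4), 1, axis=0)

aligned_loss = symmetric_info_nce(trajectory_embeddings, aligned_text)
shifted_loss = symmetric_info_nce(trajectory_embeddings, shifted_text)
```

With aligned pairs the loss is close to zero, and with shifted pairs it is large, which is the behaviour a contrastive objective of this kind is designed to produce.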

[163] Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Yusuf Dalva,Guocheng Gordon Qian,Maya Goldenberg,Tsai-Shien Chen,Kfir Aberman,Sergey Tulyakov,Pinar Yanardag,Kuan-Chieh Jackson Wang

Main category: cs.CV

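TL;DR: Canvas-to-Image is a unified framework that consolidates multiple control signals into a single canvas interface, enabling high-fidelity, multimodal control over image generation.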

Details Motivation: Existing diffusion models struggle to keep generated images faithful and compositional when text prompts, subject references, spatial layouts, and other controls are specified at the same time. Method: Encodes the different control signals into a single composite canvas image and uses a multi-task canvas training strategy that jointly optimizes the model's understanding and integration of heterogeneous controls within one paradigm. Result: Experiments on multi-task datasets show that the method significantly outperforms state-of-the-art methods in identity preservation and control adherence, and that it handles complex scenarios such as multi-person composition, pose control, layout constraints, and multi-control generation. Conclusion: Canvas-to-Image models multiple control signals in a unified way, improving the generation fidelity and generalization of diffusion models under complex user intent. Abstract: While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.