cs.CL

[1] Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability

Hen-Hsen Huang

Main category: cs.CL

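TL;DR: This paper argues that, in settings with limited resources and expertise, the focus of efficiency should shift from complexity at scale to robust simplicity. It proposes a research agenda covering efficient architectures that need no retraining, lightweight fine-tuning, economical reasoning, and dynamic knowledge management, and it advocates Overhead-Aware Efficiency (OAE), which accounts for adoption cost, sustainability, and fairness, as a new standard.

Details Motivation: Existing LLM efficiency methods (such as MoE, speculative decoding, and complex RAG) mainly serve hyperscale providers with vast infrastructure. Small and medium-sized enterprises, hospitals, schools, and similar organizations struggle to bear their overhead and complexity, which worsens technological inequality and carbon emissions. Method: Proposes a new research agenda: retrofitting pretrained models without retraining, lightweight fine-tuning that preserves alignment, methods that reduce the cost of long reasoning chains, knowledge management that does not depend on heavy RAG, and Overhead-Aware Efficiency (OAE) as an evaluation standard. Result: Offers organizations outside Big Tech a viable path to LLM deployment, lowering the barrier to adoption and improving sustainability and fairness. Conclusion: Redefining efficiency to include adoption cost, environmental impact, and fairness can democratize LLM technology, so that optimization actually reduces inequality and carbon waste. Abstract: Large language models (LLMs) have become indispensable, but the most celebrated efficiency methods -- mixture-of-experts (MoE), speculative decoding, and complex retrieval-augmented generation (RAG) -- were built for hyperscale providers with vast infrastructure and elite teams. Outside that context, their benefits collapse into overhead, fragility, and wasted carbon. The result is that a handful of Big Tech companies benefit, while thousands of hospitals, schools, governments, and enterprises are left without viable options. We argue that the next frontier is not greater sophistication at scale, but robust simplicity: efficiency that thrives under modest resources and minimal expertise. We propose a new research agenda: retrofitting pretrained models with more efficient architectures without retraining, inventing lightweight fine-tuning that preserves alignment, making reasoning economical despite long chains of thought, enabling dynamic knowledge management without heavy RAG pipelines, and adopting Overhead-Aware Efficiency (OAE) as a standard benchmark. By redefining efficiency to include adoption cost, sustainability, and fairness, we can democratize LLM deployment -- ensuring that optimization reduces inequality and carbon waste rather than amplifying them.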

[2] Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology

Tcharlies Schmitz

Main category: cs.CL

TL;DR: This paper proposes Harmonic Token Projection (HTP), a reversible and deterministic framework for generating text embeddings without training, vocabularies, or stochastic parameters.

Details Motivation: To provide a transparent, efficient, and interpretable alternative that avoids the reliance of conventional neural embeddings on data-driven training and complex optimization. Method: Encodes each token as a harmonic trajectory derived from its Unicode integer representation, establishing a bijective mapping between discrete symbols and continuous vector space. Result: Experiments on STS-B and its multilingual extension show that HTP reaches a Spearman correlation of ρ = 0.68 in English and maintains stable performance across ten languages, with negligible computational cost and sub-millisecond latency per sentence pair. Conclusion: Meaningful semantic relations can emerge from deterministic geometry, offering a transparent and efficient alternative to data-driven embeddings. Abstract: This paper introduces the Harmonic Token Projection (HTP), a reversible and deterministic framework for generating text embeddings without training, vocabularies, or stochastic parameters. Unlike neural embeddings that rely on statistical co-occurrence or optimization, HTP encodes each token analytically as a harmonic trajectory derived from its Unicode integer representation, establishing a bijective and interpretable mapping between discrete symbols and continuous vector space. The harmonic formulation provides phase-coherent projections that preserve both structure and reversibility, enabling semantic similarity estimation from purely geometric alignment. Experimental evaluation on the Semantic Textual Similarity Benchmark (STS-B) and its multilingual extension shows that HTP achieves a Spearman correlation of ρ = 0.68 in English, maintaining stable performance across ten languages with negligible computational cost and sub-millisecond latency per sentence pair. This demonstrates that meaningful semantic relations can emerge from deterministic geometry, offering a transparent and efficient alternative to data-driven embeddings. Keywords: Harmonic Token Projection, reversible embedding, deterministic encoding, semantic similarity, multilingual representation.

[3] A centroid based framework for text classification in itsm environments

Hossein Mohanna,Ali Ait-Bachir

Main category: cs.CL

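TL;DR: Proposes a dual-embedding centroid-based classification framework for text classification with hierarchical taxonomies. On IT service management ticket classification it performs on par with Support Vector Machines while offering better interpretability and substantially faster training and updates.

Details Motivation: IT Service Management (ITSM) systems must classify support tickets into tree-structured taxonomies. Existing methods fall short on interpretability and update efficiency, so a method is needed that balances performance, interpretability, and efficient incremental updates. Method: A dual-embedding centroid classification framework that maintains separate semantic and lexical centroid representations for each category and combines them through reciprocal rank fusion at inference time. Result: On a dataset of 8,968 tickets across 123 categories, the method reaches a hierarchical F1 of 0.731, slightly above the SVM's 0.727; training is 5.9 times faster and incremental updates are up to 152 times faster; excluding embedding computation, batch processing is 8.6-8.8 times faster. Conclusion: The method keeps performance high while providing good interpretability and very high computational efficiency, making it suitable for production ITSM systems that prioritize interpretability and operational efficiency. Abstract: Text classification with hierarchical taxonomies is a fundamental requirement in IT Service Management (ITSM) systems, where support tickets must be categorized into tree-structured taxonomies. We present a dual-embedding centroid-based classification framework that maintains separate semantic and lexical centroid representations per category, combining them through reciprocal rank fusion at inference time. The framework achieves performance competitive with Support Vector Machines (hierarchical F1: 0.731 vs 0.727) while providing interpretability through centroid representations. Evaluated on 8,968 ITSM tickets across 123 categories, this method achieves 5.9 times faster training and up to 152 times faster incremental updates. It also achieves an 8.6-8.8 times speedup across batch sizes (100-1000 samples) when embedding computation is excluded. These results make the method suitable for production ITSM environments prioritizing interpretability and operational efficiency.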
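The fusion step named in the abstract can be illustrated with a minimal sketch of standard reciprocal rank fusion. This is not the authors' implementation: the function name, the category labels, and the smoothing constant k = 60 (a common default in the RRF literature) are assumptions made for illustration only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of category labels into one ranking.

    Each label receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. k is an assumed smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, label in enumerate(ranking, start=1):
            scores[label] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically.
    return sorted(scores, key=lambda label: (-scores[label], label))


# Hypothetical per-ticket rankings from the two centroid views.
semantic_ranking = ["network", "email", "hardware"]
lexical_ranking = ["network", "printer", "email"]

fused = reciprocal_rank_fusion([semantic_ranking, lexical_ranking])
print(fused)
```

Because the fusion works on ranks rather than raw similarity scores, the semantic and lexical views do not need to share a score scale, which is one plausible reason to combine them this way.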

[4] PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

Yongfu Xue

Main category: cs.CL

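TL;DR: PIRA is a new training paradigm that improves the data efficiency of reward models and mitigates reward overoptimization through preference-based instruction reformulation, multi-task reward aggregation, and averaging of value-head outputs under dropout.

Details Motivation: Traditional discriminative reward models suffer from low data efficiency and are vulnerable to reward overoptimization, which limits how well large language models can be aligned with human preferences. Method: Proposes the PIRA training paradigm: (1) reformulating question-answer pairs into preference-based instructions; (2) aggregating rewards from diverse preference tasks to reduce bias; (3) averaging value-head outputs under varying dropout rates to stabilize rewards. Result: Extensive experiments show that PIRA significantly outperforms traditional methods in data efficiency and model robustness, and effectively mitigates reward overoptimization. Conclusion: PIRA offers a practical approach to building more efficient and stable reward models, helping to better align large language models with human preferences. Abstract: Reward models are crucial for aligning Large Language Models (LLMs) with human preferences but face two representative challenges. First, traditional discriminative reward models usually concatenate questions and responses directly as input, resulting in low data efficiency. Second, reward models are vulnerable to reward overoptimization. We propose PIRA, a training paradigm addressing these issues through three strategies: (1) Reformulating question-answer pairs into preference-based instructions for clearer and more explicit task specification, (2) aggregating rewards from diverse preference tasks to reduce bias and improve robustness, and (3) averaging value-head outputs under varying dropout rates to stabilize rewards. Extensive experiments have demonstrated the effectiveness of PIRA.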

Mann Khatri,Mirza Yusuf,Rajiv Ratn Shah,Ponnurangam Kumaraguru

Main category: cs.CL

TL;DR: This paper studies how restructuring legal documents by rhetorical role, defining legal terms, and emulating court reasoning can improve the performance of large language models on legal tasks; experiments show that these methods significantly raise F1 scores.

Details Motivation: Large language models perform well in general domains but struggle in specialized fields such as law because they lack domain-specific pretraining, and legal texts are typically long and intricate, which makes them hard to process effectively. Method: In a zero-shot setting, the model is improved in three ways: (i) reorganizing documents by rhetorical role; (ii) defining rhetorical roles to introduce legal terminology; (iii) emulating the step-by-step reasoning of courts. Experiments are run on three Indian legal judgment prediction datasets. Result: Organizing the data or explaining key legal terms significantly improves model performance, with F1 gains over the baseline ranging from about 1.5% to 4.36%. Conclusion: Structured presentation of information, definitions of terminology, and emulation of human reasoning can effectively strengthen the comprehension and reasoning of large language models on complex legal tasks. Abstract: Large Language Models (LLMs), trained on extensive datasets from the web, exhibit remarkable general reasoning skills. Despite this, they often struggle in specialized areas like law, mainly because they lack domain-specific pretraining. The legal field presents unique challenges, as legal documents are generally long and intricate, making it hard for models to process the full text efficiently. Previous studies have examined in-context approaches to address the knowledge gap, boosting model performance in new domains without full domain alignment. In our paper, we analyze model behavior on legal tasks by conducting experiments in three areas: (i) reorganizing documents based on rhetorical roles to assess how structured information affects long context processing and model decisions, (ii) defining rhetorical roles to familiarize the model with legal terminology, and (iii) emulating the step-by-step reasoning of courts regarding rhetorical roles to enhance model reasoning. These experiments are conducted in a zero-shot setting across three Indian legal judgment prediction datasets. Our results reveal that organizing data or explaining key legal terms significantly boosts model performance, with a minimum increase of ~1.5% and a maximum improvement of 4.36% in F1 score compared to the baseline.

[6] MindSET: Advancing Mental Health Benchmarking through Large-Scale Social Media Data

Saad Mankarious,Ayah Zirikly,Daniel Wiechmann,Elma Kerz,Edward Kempa,Yu Qiao

Main category: cs.CL

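TL;DR: This paper presents MindSET, a new benchmark dataset for mental health analysis built from self-reported diagnoses on Reddit. It contains over 13 million annotated posts covering seven mental health conditions and improves markedly on earlier datasets in scale, data quality, and diversity. Rigorous preprocessing (language filtering, removal of unsuitable content and duplicates) ensures data quality, and LIWC is used for linguistic analysis. Experiments show that models trained on MindSET significantly outperform those trained on earlier benchmarks on diagnosis detection, with F1 for Autism detection improving by up to 18 points.

Details Motivation: Existing benchmark datasets for mental health analysis are outdated, insufficiently cleaned, and poorly equipped to handle the diversity of social media content (such as multilingual and harmful material), which limits research progress. A larger, higher-quality, and more representative dataset is needed to advance the field. Method: Collects posts from Reddit users with self-reported diagnoses to build the MindSET dataset; applies rigorous preprocessing, including language filtering and removal of NSFW and duplicate content; uses LIWC to analyze linguistic features; and evaluates the dataset through binary classification experiments with fine-tuned language models and Bag-of-Words (BoW) models. Result: MindSET contains over 13 million annotated posts, more than twice the size of the largest previous dataset; the linguistic analysis reveals differences in language use between groups with different mental health conditions; classification experiments show that models trained on MindSET perform better across several mental health conditions, with F1 for Autism detection improving by up to 18 points. Conclusion: MindSET is a large-scale, high-quality benchmark dataset for mental health analysis that can effectively support social-media-based mental health research, aid early risk detection and deeper analysis of emerging psychological trends, and provide a solid foundation for future work. Abstract: Social media data has become a vital resource for studying mental health, offering real-time insights into thoughts, emotions, and behaviors that traditional methods often miss. Progress in this area has been facilitated by benchmark datasets for mental health analysis; however, most existing benchmarks have become outdated due to limited data availability, inadequate cleaning, and the inherently diverse nature of social media content (e.g., multilingual and harmful material). We present a new benchmark dataset, MindSET, curated from Reddit using self-reported diagnoses to address these limitations. The dataset contains over 13M annotated posts across seven mental health conditions, more than twice the size of previous benchmarks. To ensure data quality, we applied rigorous preprocessing steps, including language filtering and removal of Not Safe for Work (NSFW) and duplicate content. We further performed a linguistic analysis using LIWC to examine psychological term frequencies across the eight groups represented in the dataset. To demonstrate the dataset utility, we conducted binary classification experiments for diagnosis detection using both fine-tuned language models and Bag-of-Words (BoW) features. Models trained on MindSET consistently outperformed those trained on previous benchmarks, achieving up to an 18-point improvement in F1 for Autism detection.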
Overall, MindSET provides a robust foundation for researchers exploring the intersection of social media and mental health, supporting both early risk detection and deeper analysis of emerging psychological trends.

[7] Semantics Meet Signals: Dual Codebook Representation Learning for Generative Recommendation

Zheng Hui,Xiaokai Wei,Reza Shirkavand,Chen Wang,Weizhi Zhang,Alejandro Peláez,Michelle Gong

Main category: cs.CL

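TL;DR: This paper proposes FlexCode, a popularity-aware framework for generative recommendation that adaptively allocates the token budget between a collaborative filtering codebook and a semantic codebook, improving recommendation accuracy and long-tail robustness.

Details Motivation: Existing generative recommendation methods encode all items with a single uniform codebook, ignoring how popular and long-tail items differ in their reliance on collaborative signals versus semantics, which leads to inefficient representations and limited generalization. Method: Proposes the FlexCode framework, which uses a dual-codebook structure (a collaborative filtering codebook and a semantic codebook), dynamically allocates the token budget with a lightweight MoE, and adds alignment and smoothness objectives to maintain consistency across the popularity spectrum. Result: Experiments on public and industrial-scale datasets show that FlexCode consistently outperforms strong baselines, with better recommendation accuracy and better recommendations for long-tail items. Conclusion: FlexCode offers a new mechanism for token representation in generative recommendation, effectively balancing memorization and generalization and strengthening overall model performance and robustness. Abstract: Generative recommendation has recently emerged as a powerful paradigm that unifies retrieval and generation, representing items as discrete semantic tokens and enabling flexible sequence modeling with autoregressive models. Despite its success, existing approaches rely on a single, uniform codebook to encode all items, overlooking the inherent imbalance between popular items rich in collaborative signals and long-tail items that depend on semantic understanding. We argue that this uniform treatment limits representational efficiency and hinders generalization. To address this, we introduce FlexCode, a popularity-aware framework that adaptively allocates a fixed token budget between a collaborative filtering (CF) codebook and a semantic codebook. A lightweight MoE dynamically balances CF-specific precision and semantic generalization, while an alignment and smoothness objective maintains coherence across the popularity spectrum. We perform experiments on both public and industrial-scale datasets, showing that FlexCode consistently outperforms strong baselines. FlexCode provides a new mechanism for token representation in generative recommenders, achieving stronger accuracy and tail robustness, and offering a new perspective on balancing memorization and generalization in token-based recommendation models.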

[8] Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic

Saleh Almohaimeed,May Alsofyani,Saad Almohaimeed,Mansour Al Ghanim,Liqiang Wang

Main category: cs.CL

TL;DR: This paper presents Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset, runs 40 experiments with GPT-3.5 and GPT-4.5, and proposes a new method called GAT corrector that improves execution and interaction accuracy in both zero-shot and in-context learning settings.

Details Motivation: Existing text-to-SQL research focuses mainly on English and Chinese and lacks support for Arabic; this paper aims to fill that gap. Method: Builds the Ar-SParC dataset of 3,450 question sequences (10,225 questions in total), runs experiments with GPT-3.5-turbo and GPT-4.5-turbo combining four question representation methods and six in-context learning techniques, and proposes the GAT corrector method to improve performance. Result: In the zero-shot setting, GAT corrector improves execution accuracy (EX) by 1.9% and interaction accuracy (IX) by 1.9% on average; in the in-context learning setting it improves EX by 1.72% and IX by 0.92%. An ablation study explains why it outperforms the earlier GAT verifier. Conclusion: Ar-SParC is an important resource for Arabic text-to-SQL, and GAT corrector effectively improves model performance, particularly for Arabic. Abstract: In recent years, the task of cross-domain, context-dependent text-to-SQL has received significant attention. It enables users with no prior knowledge of SQL to have a conversation with databases using natural language. However, most of the available datasets and research have been conducted in English, along with some work in Chinese. To date, no effort has been made to address this task in the Arabic language. In this paper, we introduce Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset. The dataset consists of 3,450 sequences of interrelated questions, each sequence containing an average of approximately three questions, which results in a total of 10,225 questions along with their corresponding SQL queries. We conducted 40 experiments on the Ar-SParC dataset using two large language models, GPT-3.5-turbo and GPT-4.5-turbo, applying 10 different prompt engineering techniques, including four question representation methods and six in-context learning techniques. Furthermore, we developed a novel approach named GAT corrector, which enhanced the performance across all 40 experiments, yielding an average improvement of 1.9% in execution accuracy (EX) and 1.9% in interaction accuracy (IX) under zero-shot settings, and an average increase of 1.72% EX and 0.92% IX under in-context learning settings. Finally, we conducted an ablation study with two more experiments to explain why the GAT corrector outperformed the previous GAT verifier technique, particularly for the Arabic language.

[9] Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes

Matthew W. Kenaston,Umair Ayub,Mihir Parmar,Muhammad Umair Anjum,Syed Arsalan Ahmed Naqvi,Priya Kumar,Samarth Rawal,Aadel A. Chaudhuri,Yousef Zakharia,Elizabeth I. Heath,Tanios S. Bekaii-Saab,Cui Tao,Eliezer M. Van Allen,Ben Zhou,YooJung Choi,Chitta Baral,Irbaz Bin Riaz

Main category: cs.CL

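TL;DR: This study develops a hierarchical taxonomy for identifying GPT-4 reasoning errors on real oncology notes. It finds reasoning errors in 23% of interpretations, with confirmation bias and anchoring bias most common, and links these errors to guideline-discordant and potentially harmful recommendations.

Details Motivation: Although large language models score well on clinical benchmarks, they may reach correct conclusions through faulty reasoning, a safety risk for oncology decision support that existing accuracy-based evaluation does not capture. Method: Using breast and pancreatic cancer cases from the CORAL dataset, the authors annotate 600 reasoning traces to build a three-tier taxonomy, validate it on 822 responses to prostate cancer consult notes, and analyze error types through cognitive bias frameworks. Result: Reasoning errors occur in 23% of interpretations, mainly confirmation bias and anchoring bias; these errors are associated with guideline-discordant and potentially harmful clinical recommendations, particularly in advanced disease management; automated evaluators based on state-of-the-art language models can detect that an error is present but cannot reliably classify its subtype. Conclusion: Large language models may give fluent but clinically unsafe recommendations when their reasoning is flawed; the proposed taxonomy provides a generalizable framework for evaluating and improving reasoning fidelity and should be applied before clinical deployment. Abstract: Despite high performance on clinical benchmarks, large language models may reach correct conclusions through faulty reasoning, a failure mode with safety implications for oncology decision support that is not captured by accuracy-based evaluation. In this two-cohort retrospective study, we developed a hierarchical taxonomy of reasoning errors from GPT-4 chain-of-thought responses to real oncology notes and tested its clinical relevance. Using breast and pancreatic cancer notes from the CORAL dataset, we annotated 600 reasoning traces to define a three-tier taxonomy mapping computational failures to cognitive bias frameworks. We validated the taxonomy on 822 responses from prostate cancer consult notes spanning localized through metastatic disease, simulating extraction, analysis, and clinical recommendation tasks. Reasoning errors occurred in 23 percent of interpretations and dominated overall errors, with confirmation bias and anchoring bias most common. Reasoning failures were associated with guideline-discordant and potentially harmful recommendations, particularly in advanced disease management. Automated evaluators using state-of-the-art language models detected error presence but could not reliably classify subtypes. These findings show that large language models may provide fluent but clinically unsafe recommendations when reasoning is flawed. The taxonomy provides a generalizable framework for evaluating and improving reasoning fidelity before clinical deployment.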

[10] Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

Bharadwaj Yadavalli

Main category: cs.CL

TL;DR: This paper proposes Dynamic Template Selection (DTS), which adaptively matches response templates to query complexity and substantially lowers the output-token cost of large language models without harming response quality.

Details Motivation: Large language model deployments generally apply one uniform prompting strategy, which wastes many output tokens on simple questions; because output tokens cost far more than input tokens, this wastes both resources and money. Method: Proposes the DTS framework and compares two routing approaches, one based on an MLP and one on a fine-tuned RoBERTa, which use pre-computed embeddings or a Transformer model to judge query complexity and dynamically select a suitable response template. Result: On 1,000 MMLU questions, the MLP router reaches 90.5% accuracy, slightly above RoBERTa; 9,000 production API calls confirm generalization across major LLM providers, with token consumption reduced by 32.6% to 33.9%. Conclusion: DTS enables generation on demand and significantly reduces cost; it has both theoretical and practical value and offers a generalizable solution for deploying large models efficiently. Abstract: Contemporary large language model deployments typically employ uniform prompting strategies across diverse query types, applying verbose response patterns to both complex analytical tasks and straightforward factual questions. This one-size-fits-all methodology leads to substantial token inefficiency, a concern amplified by the significant cost differential between input and output tokens--the latter commanding 4-8x higher prices across major providers. We present Dynamic Template Selection (DTS), which adaptively matches response templates to query complexity, achieving significant cost reductions without compromising response quality. We compared two routing approaches: a simple MLP that uses pre-computed embeddings and a more complex fine-tuned RoBERTa transformer. Through comprehensive evaluation on 1,000 MMLU questions, we find that the MLP router achieves 90.5% routing accuracy on held-out test data, marginally exceeding RoBERTa's performance (89.5%) despite utilizing 125M fewer parameters. Notably, our empirical analysis reveals provider-agnostic behavior in template selection--routing decisions generalize effectively across 3 major LLM providers (OpenAI GPT-4, Google Gemini, and Anthropic Claude), as validated through 9,000 production API calls. While routing accuracy remains consistent at 90.5% across providers, observed token reductions vary from 32.6% to 33.9%, reflecting provider-specific generation characteristics.
This work contributes several key elements: formal problem formulation with theoretical grounding in machine learning, four algorithms with corresponding complexity analyses, and extensive empirical validation across production systems.
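To see how an output-token reduction translates into a cost reduction, the arithmetic can be sketched as below. The token counts and the unit input price are made-up illustration values, and the function names are hypothetical; only the 4x price ratio and the 32.6% reduction are taken from the ranges quoted in the abstract.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost of one request given per-token prices."""
    return input_tokens * input_price + output_tokens * output_price


def savings_fraction(input_tokens, output_tokens, output_reduction, price_ratio):
    """Fraction of total cost saved when output tokens shrink by output_reduction.

    price_ratio is the output price divided by the input price; the input
    price is normalised to 1.0 because only the ratio matters here.
    """
    baseline = request_cost(input_tokens, output_tokens, 1.0, price_ratio)
    routed = request_cost(
        input_tokens, output_tokens * (1.0 - output_reduction), 1.0, price_ratio
    )
    return 1.0 - routed / baseline


# Assumed workload: 200 input tokens and 500 output tokens per request.
saving = savings_fraction(
    input_tokens=200, output_tokens=500, output_reduction=0.326, price_ratio=4.0
)
print(f"{saving:.3f}")
```

The saving is always somewhat smaller than the raw output-token reduction, because input tokens are unaffected; it approaches the output reduction as the price ratio or the share of output tokens grows.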

[11] LLMs-Powered Accurate Extraction, Querying and Intelligent Management of Literature derived 2D Materials Data

Lijun Shang,Yadong Yu,Wenqiang Kang,Jian Zhou,Dongyue Gao,Pan Xiang,Zhe Liu,Mengyan Dai,Zhonglu Guo,Zhimei Sun

Main category: cs.CL

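TL;DR: Two-dimensional materials have broad applications in energy storage and conversion thanks to their unique physicochemical and electronic properties, but information on their synthesis is scattered across a large body of literature and needs systematic extraction and integration.

Details Motivation: Information on the synthesis methods and properties of two-dimensional materials is scattered across a large number of published papers and is hard to obtain and use efficiently, so it needs to be systematically analyzed and summarized. Method: Analyzes published research papers to extract information on the preparation methods and properties of two-dimensional materials and on their applications in energy storage and conversion, then organizes and summarizes it. Result: Summarizes the key properties and preparation methods of two-dimensional materials and reveals the structure-property relationships behind their applications in energy. Conclusion: Systematically organizing the information on two-dimensional materials in the literature helps accelerate materials discovery and application development and advances energy technology. Abstract: Two-dimensional (2D) materials have shown widespread applications in energy storage and conversion owing to their unique physicochemical and electronic properties. Most of the valuable information for the materials, such as their properties and preparation methods, is included in the published research papers. However, due to the dispersion of synthe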

[12] Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

Trung Cuong Dang,David Mohaisen

Main category: cs.CL

TL;DR: Proposes a new multi-prefix memorization framework for detecting memorized training data in large language models, measuring the robustness of a memory by the diversity of its retrieval paths.

Details Motivation: Existing definitions of memorization fall short of fully capturing the phenomenon in aligned models, and a more effective definition is needed to assess privacy and copyright risks. Method: Defines a sequence as memorized if an external adversarial search can find a large number of distinct prefixes that elicit it, shifting the emphasis from a single extraction path to quantifying the robustness of a memory. Result: Experiments on open-source and aligned chat models show that the method effectively distinguishes memorized from non-memorized content. Conclusion: The multi-prefix memorization framework provides a reliable and practical tool for auditing data leakage in large language models. Abstract: Large language models, trained on massive corpora, are prone to verbatim memorization of training data, creating significant privacy and copyright risks. While previous works have proposed various definitions for memorization, many exhibit shortcomings in comprehensively capturing this phenomenon, especially in aligned models. To address this, we introduce a novel framework: multi-prefix memorization. Our core insight is that memorized sequences are deeply encoded and thus retrievable via a significantly larger number of distinct prefixes than non-memorized content. We formalize this by defining a sequence as memorized if an external adversarial search can identify a target count of distinct prefixes that elicit it. This framework shifts the focus from single-path extraction to quantifying the robustness of a memory, measured by the diversity of its retrieval paths. Through experiments on open-source and aligned chat models, we demonstrate that our multi-prefix definition reliably distinguishes memorized from non-memorized data, providing a robust and practical tool for auditing data leakage in LLMs.
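The counting criterion in the definition can be sketched as follows. This is only an illustration under stated assumptions: the adversarial prefix search is replaced by a fixed candidate list, the model is a toy lookup table, and "elicits" is taken to mean that the continuation starts with the target sequence. All names here are hypothetical and are not taken from the paper.

```python
def count_eliciting_prefixes(generate, candidate_prefixes, target_sequence):
    """Count distinct prefixes whose continuation starts with the target.

    generate is any callable mapping a prefix string to a continuation
    string; here it stands in for a language model.
    """
    distinct = set(candidate_prefixes)
    return sum(1 for prefix in distinct if generate(prefix).startswith(target_sequence))


def is_memorized(generate, candidate_prefixes, target_sequence, target_count):
    """Apply the multi-prefix criterion: enough distinct prefixes elicit the target."""
    hits = count_eliciting_prefixes(generate, candidate_prefixes, target_sequence)
    return hits >= target_count


def toy_generate(prefix):
    """Toy stand-in for a model: a fixed prefix-to-continuation table."""
    table = {
        "The secret key is": "hunter2 and nothing else",
        "Password:": "hunter2",
        "Hello": "world",
    }
    return table.get(prefix, "")


# The list below stands in for the output of an adversarial prefix search.
prefixes = ["The secret key is", "Password:", "Hello", "Password:"]
leaked = is_memorized(toy_generate, prefixes, "hunter2", target_count=2)
benign = is_memorized(toy_generate, prefixes, "world", target_count=2)
print(leaked, benign)
```

Deduplicating the candidates before counting matters, since the definition asks for distinct prefixes rather than repeated hits from the same one.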

[13] SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

Jiaojiao Han,Wujiang Xu,Mingyu Jin,Mengnan Du

Main category: cs.CL

TL;DR: This paper proposes SAGE, an agent-based framework for explaining sparse autoencoder features whose active, iterative explanation process significantly improves the accuracy and interpretability of our understanding of the internal features of large language models.

Details Motivation: The internal mechanisms of large language models are opaque. Sparse autoencoders (SAEs) help decompose their representations, but explaining the features that SAEs capture remains challenging and calls for a more systematic and reliable method. Method: Proposes the SAGE framework, which turns feature explanation into an active, agent-driven process: the system generates multiple candidate explanations, designs targeted experiments to test them, and iteratively refines the explanations based on activation feedback. Result: Experiments on SAE features from a range of language models show that SAGE significantly outperforms state-of-the-art baselines in both generative and predictive accuracy. Conclusion: By introducing active reasoning and empirical validation, SAGE effectively improves the quality of explanations of LLM internal features and offers a new paradigm for interpretability research. Abstract: Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE AGentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.

[14] Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

Asad Aali,Muhammad Ahmed Mohsin,Vasiliki Bikia,Arnav Singhvi,Richard Gaus,Suhana Bedi,Hejie Cui,Miguel Fuentes,Alyssa Unell,Yifan Mai,Jordan Cahoon,Michael Pfeffer,Roxana Daneshjou,Sanmi Koyejo,Emily Alsentzer,Percy Liang,Christopher Potts,Nigam H. Shah,Akshay S. Chaudhari

Main category: cs.CL

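TL;DR: This paper presents a reproducible framework combining DSPy and HELM that improves language model benchmarking through structured prompting (such as chain-of-thought). It finds that conventional fixed prompts underestimate performance and distort rankings, whereas structured prompting estimates the performance ceiling more accurately and makes evaluation more reliable.

Details Motivation: Existing language model evaluation frameworks (such as HELM) rely on fixed prompts that do not generalize across models, producing inaccurate performance estimates that may understate true capability and mislead deployment decisions. Method: Builds a reproducible integrated DSPy+HELM framework and uses four structured prompting methods to evaluate four frontier language models on seven benchmarks in the general and medical domains, comparing against the original HELM results. Result: Without structured prompting: (i) HELM underestimates performance by 4% on average; (ii) the standard deviation across benchmarks is 2% higher; (iii) leaderboard rankings flip on 3 of 7 benchmarks; (iv) introducing chain-of-thought reasoning reduces model sensitivity to prompt design. Conclusion: Structured prompting estimates the performance ceiling of language models more accurately and makes benchmarks more useful for decisions; this is the first large-scale systematic study of how prompting methods affect benchmarks, and the tools are open-sourced. Abstract: As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. While frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, they often rely on fixed prompts that fail to generalize across LMs, yielding unrepresentative performance estimates. Unless we estimate each LM's ceiling (maximum achievable via changes to the prompt), we risk underestimating performance. Declarative prompting frameworks, such as DSPy, offer a scalable alternative to manual prompt engineering by crafting structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated across established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs across seven benchmarks (general/medical domain) against existing HELM baseline scores. We find that without structured prompting: (i) HELM underestimates LM performance (by 4% average), (ii) performance estimates vary more across benchmarks (+2% standard deviation), (iii) performance gaps are misrepresented (leaderboard rankings flip on 3/7 benchmarks), and (iv) introducing reasoning (chain-of-thought) reduces LM sensitivity to prompt design (smaller Δ across prompts).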
To our knowledge, this is the first large-scale benchmarking study to empirically characterize LM behavior across benchmarks and prompting methods, showing that scalable performance ceiling estimation enables more decision-useful benchmarks. We open-source (i) DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).
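The notion of a per-model ceiling, and how it can flip a leaderboard, can be shown with a small sketch. The model names, prompt labels, and scores below are invented for illustration and are not results from the paper; the helper names are hypothetical.

```python
def ceiling(scores_by_prompt):
    """Best score a model reaches over the prompting methods that were tried."""
    return max(scores_by_prompt.values())


def rank_models(score_by_model):
    """Model names ordered from highest to lowest score (ties by name)."""
    return sorted(score_by_model, key=lambda name: (-score_by_model[name], name))


# Invented accuracy scores for two hypothetical models on one benchmark.
scores = {
    "model_a": {"fixed_prompt": 0.70, "chain_of_thought": 0.72},
    "model_b": {"fixed_prompt": 0.68, "chain_of_thought": 0.75},
}

fixed_ranking = rank_models({m: s["fixed_prompt"] for m, s in scores.items()})
ceiling_ranking = rank_models({m: ceiling(s) for m, s in scores.items()})
print(fixed_ranking, ceiling_ranking)
```

In this toy case the fixed prompt ranks model_a first, while the ceiling estimate ranks model_b first, which is the kind of ranking flip the abstract reports on 3 of 7 benchmarks.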

[15] Length-MAX Tokenizer for Language Models

Dong Dong,Weijie Su

Main category: cs.CL

TL;DR: This paper proposes a new tokenizer, Length-MAX, that reduces the number of tokens used in language model training and inference by minimizing the average number of tokens per character. It yields 13-18% fewer tokens than BPE across settings and markedly improves training efficiency, inference speed, and downstream task performance.

Details Motivation: Conventional tokenization methods (such as BPE) optimize mainly for subword frequency and may not fully account for the length efficiency of the text representation. This paper aims to make language modeling more efficient by optimizing average token length rather than frequency alone. Method: Casts the objective as a graph partitioning problem and designs a greedy approximation algorithm to build the vocabulary, which yields the Length-MAX tokenizer. Result: On FineWeb and data from diverse domains, Length-MAX yields 14-18% fewer tokens than BPE (13% at 64K); GPT-2 models need 17-19% fewer training steps to converge; inference latency falls by 12.7-13.7% and throughput rises by 16%; LAMBADA perplexity drops by 11.7% and HellaSwag accuracy rises by 4.3%; vocabulary coverage reaches 99.62% and the OOV rate is 0.12%. Conclusion: Optimizing average token length is an effective way to make language models more efficient, reducing compute consumption without sacrificing, and often improving, downstream performance, and it is ready for production deployment. Abstract: We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14-18% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5%, 17.2%, and 18.5% fewer steps, respectively, to reach a fixed validation loss, and 13.7%, 12.7%, and 13.7% lower inference latency, together with a 16% throughput gain at 124M, while consistently improving on downstream tasks including reducing LAMBADA perplexity by 11.7% and enhancing HellaSwag accuracy by 4.3%. Moreover, the Length-MAX tokenizer achieves 99.62% vocabulary coverage and the out-of-vocabulary rate remains low at 0.12% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing -- and often improving -- downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18% at inference.
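The metric the tokenizer targets, average tokens per character, and the derived relative token reduction can be computed as in the sketch below. It does not implement the Length-MAX algorithm itself; the two segmentations of the sample text are hand-written, hypothetical examples rather than output from BPE or Length-MAX, and the function names are assumptions.

```python
def tokens_per_character(tokenized_texts):
    """Average tokens per character over a corpus of tokenized texts.

    Each element is a list of token strings whose concatenation is the
    original text, so character counts are recovered from the tokens.
    """
    total_tokens = sum(len(tokens) for tokens in tokenized_texts)
    total_chars = sum(len("".join(tokens)) for tokens in tokenized_texts)
    return total_tokens / total_chars


def relative_token_reduction(baseline_tokens, new_tokens):
    """Fraction of tokens saved by the new segmentation relative to the baseline."""
    return 1.0 - new_tokens / baseline_tokens


# Hand-written segmentations of the same 11-character text.
baseline_segmentation = [["the", " c", "at", " s", "at"]]
longer_token_segmentation = [["the", " cat", " sat"]]

baseline_tpc = tokens_per_character(baseline_segmentation)
longer_tpc = tokens_per_character(longer_token_segmentation)
reduction = relative_token_reduction(
    len(baseline_segmentation[0]), len(longer_token_segmentation[0])
)
print(baseline_tpc, longer_tpc, reduction)
```

Since KV-cache size grows linearly with sequence length in tokens, a lower tokens-per-character figure shrinks the cache for the same text by about the same fraction, which is consistent with the memory saving the abstract mentions.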

[16] Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Tianxin Wei,Noveen Sachdeva,Benjamin Coleman,Zhankui He,Yuanchen Bei,Xuying Ning,Mengting Ai,Yunzhe Li,Jingrui He,Ed H. Chi,Chi Wang,Shuo Chen,Fernando Pereira,Wang-Cheng Kang,Derek Zhiyuan Cheng

Main category: cs.CL

TL;DR: This paper presents Evo-Memory, a streaming benchmark and framework for evaluating the self-evolving memory of LLM agents over continuous task streams. It emphasizes the dynamic accumulation and updating of memory and introduces the ExpRAG and ReMem methods to improve experience reuse and continual improvement.

Details Motivation: Existing memory evaluations focus mainly on static conversational settings and overlook the ability to accumulate and reuse experience across dynamic task streams, whereas LLM agents in real applications need to retrieve, integrate, and update memory continuously during deployment. Method: Builds a continuous task-stream benchmark from 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets; unifies and implements over ten representative memory modules; and proposes ExpRAG as a baseline for experience reuse along with ReMem, a closed-loop pipeline that combines reasoning, actions, and memory updates. Result: Evo-Memory can effectively evaluate how LLM agents search, adapt, and evolve memory over continuous interaction; experiments show that ReMem outperforms existing methods in making use of experience and in continual performance improvement. Conclusion: The Evo-Memory framework and the ReMem method enable systematic evaluation and enhancement of self-evolving memory in LLM agents, advancing the long-term planning and problem-solving abilities of stateful agents in complex dynamic environments. Abstract: Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.

[17] Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation

Ali Jahan,Masood Ghayoomi,Annette Hautli-Janisz

Main category: cs.CL

TL;DR: 本文提出了一种跨语言方法来解决低资源语言中的论点挖掘问题,通过构建三种训练场景并在英语和波斯语上进行实验,结果表明轻量级的跨语言模型在波斯语上的表现优于基于大语言模型增强的方法。

Details Motivation: Low-resource languages lack sufficient annotated data, so conventional argument mining methods are hard to apply effectively, and new ways of mitigating data scarcity are needed. Method: The study designs three training scenarios: zero-shot transfer (training on English data only), English training augmented with synthetic examples generated by LLMs, and a cross-lingual model that combines the English data with manually translated Persian text. Evaluation uses the English Microtext corpus and its parallel Persian translation. Result: The zero-shot transfer model reaches F1 scores of 50.2% on the English test set and 50.7% on the Persian test set; the LLM-augmented model improves these to 59.2% and 69.3%; the cross-lingual model achieves the best result, 74.8% F1 on the Persian test set. Conclusion: A lightweight cross-lingual training strategy can overcome data shortage in low-resource languages, outperforms the more resource-intensive augmentation approach on argument mining, and offers a practical path for low-resource language processing. Abstract: Argument mining is a subfield of natural language processing that aims to identify and extract argument components, such as premises and conclusions, within a text and to recognize the relations between them. It reveals the logical structure of texts for use in tasks like knowledge extraction. This paper applies a cross-lingual approach to argument mining for low-resource languages by constructing three training scenarios. We examine the models on English, as a high-resource language, and Persian, as a low-resource language. To this end, we evaluate the models on the English Microtext corpus (Peldszus and Stede, 2015) and its parallel Persian translation. The learning scenarios are as follows: (i) zero-shot transfer, where the model is trained solely on the English data, (ii) English-only training enhanced by synthetic examples generated by Large Language Models (LLMs), and (iii) a cross-lingual model that combines the original English data with manually translated Persian sentences. The zero-shot transfer model attains F1 scores of 50.2% on the English test set and 50.7% on the Persian test set. The LLM-based augmentation model improves performance to 59.2% on English and 69.3% on Persian. The cross-lingual model, trained on both languages but evaluated solely on the Persian test set, surpasses the LLM-based variant, achieving an F1 of 74.8%. Results indicate that a lightweight cross-lingual blend can considerably outperform the more resource-intensive augmentation pipelines, and that it offers a practical pathway for argument mining to overcome data shortage in low-resource languages.

[18] Emergence and Localisation of Semantic Role Circuits in LLMs

Nura Aljaafari,Danilo S. Carvalho,André Freitas

Main category: cs.CL

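TL;DR: This paper proposes a method that combines role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study how large language models implement semantic roles.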

Details Motivation: Although large language models display semantic competence, the internal mechanisms that support abstract semantic structure remain unclear. Method: Introduces role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to identify and analyze semantic role circuits. Result: Finds highly concentrated circuits (89-94% of attribution within 28 nodes), gradual structural refinement rather than phase transitions, and moderate cross-scale conservation (24-59% component overlap) alongside high spectral similarity. Conclusion: LLMs form compact, causally isolated mechanisms for abstract semantic structure, and these mechanisms transfer partially across scales and architectures. Abstract: Despite displaying semantic competence, large language models' internal mechanisms that ground abstract semantic structure remain insufficiently characterised. We propose a method integrating role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study how LLMs implement semantic roles. Our analysis uncovers: (i) highly concentrated circuits (89-94% attribution within 28 nodes); (ii) gradual structural refinement rather than phase transitions, with larger models sometimes bypassing localised circuits; and (iii) moderate cross-scale conservation (24-59% component overlap) alongside high spectral similarity. These findings suggest that LLMs form compact, causally isolated mechanisms for abstract semantic structure, and these mechanisms exhibit partial transfer across scales and architectures.

[19] Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs

Reham Omar,Abdelghny Orogat,Ibrahim Abdelaziz,Omij Mangukiya,Panos Kalnis,Essam Mansour

Main category: cs.CL

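TL;DR: Chatty-KG is a modular multi-agent system for conversational question answering over knowledge graphs. It combines retrieval-augmented generation with structured execution, generating SPARQL queries through task-specialized LLM agents, and significantly outperforms existing methods in both single-turn and multi-turn settings.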

Details Motivation: Existing knowledge graph QA systems are limited in handling multi-turn dialogue, context tracking, and dynamic knowledge graphs, while LLMs lack direct access to private and dynamic knowledge graphs; a solution that combines the strengths of both is needed. Method: Proposes Chatty-KG, a modular multi-agent system in which specialized LLM agents collaborate on contextual interpretation, dialogue tracking, entity and relation linking, and query planning, and which translates natural-language questions into executable queries by generating SPARQL. Result: Experiments on several large and diverse knowledge graphs show that Chatty-KG significantly outperforms existing baselines in both single-turn and multi-turn settings, with higher F1 and P@1 scores; it is compatible with a range of commercial and open-weight LLMs and performs stably across them. Conclusion: Chatty-KG unifies conversational flexibility with the structured grounding of knowledge graphs, providing a scalable and extensible approach to reliable multi-turn KGQA that adapts to evolving knowledge graphs without fine-tuning. Abstract: Conversational Question Answering over Knowledge Graphs (KGs) combines the factual grounding of KG-based QA with the interactive nature of dialogue systems. KGs are widely used in enterprise and domain applications to provide structured, evolving, and reliable knowledge. Large language models (LLMs) enable natural and context-aware conversations, but lack direct access to private and dynamic KGs. Retrieval-augmented generation (RAG) systems can retrieve graph content but often serialize structure, struggle with multi-turn context, and require heavy indexing. Traditional KGQA systems preserve structure but typically support only single-turn QA, incur high latency, and struggle with coreference and context tracking. To address these limitations, we propose Chatty-KG, a modular multi-agent system for conversational QA over KGs. Chatty-KG combines RAG-style retrieval with structured execution by generating SPARQL queries through task-specialized LLM agents. These agents collaborate for contextual interpretation, dialogue tracking, entity and relation linking, and efficient query planning, enabling accurate and low-latency translation of natural questions into executable queries. Experiments on large and diverse KGs show that Chatty-KG significantly outperforms state-of-the-art baselines in both single-turn and multi-turn settings, achieving higher F1 and P@1 scores. Its modular design preserves dialogue coherence and supports evolving KGs without fine-tuning or pre-processing. Evaluations with commercial (e.g., GPT-4o, Gemini-2.0) and open-weight (e.g., Phi-4, Gemma 3) LLMs confirm broad compatibility and stable performance. Overall, Chatty-KG unifies conversational flexibility with structured KG grounding, offering a scalable and extensible approach for reliable multi-turn KGQA.

[20] TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models

Ioana Buhnila,Aman Sinha,Mathieu Constant

Main category: cs.CL

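TL;DR: Using the TrackList analysis pipeline and the newly built RefoMed-EN dataset, this study evaluates LLMs on different types of medical queries. Models perform best on definition-type questions and worst on exemplification-type questions, and they tend to paraphrase frequent, popular knowledge more than rare, specialized content.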

Details Motivation: LLMs answer definition-type questions well but perform poorly on other answer types (such as examples and paraphrases); the study investigates the cause of this drop and the influence of the training data. Method: Proposes the TrackList analysis pipeline and RefoMed-EN, an English medical dataset of 6170 human-annotated terms with their definitions, denominations, exemplifications, explanations, or paraphrases; analyzes how concept frequency (head vs. tail) affects output quality using syntactic and semantic similarity metrics, statistical correlations, and embeddings. Result: Models perform best on definition-type questions and worst on exemplification-type questions; for definition-type questions they paraphrase more on frequent, popular knowledge and less on rare, technical knowledge, especially in expert texts. Conclusion: LLMs are limited in handling linguistically diverse queries, particularly when expressing low-frequency and specialized knowledge, which reflects the strong influence of the pre-training data distribution on model outputs. Abstract: Large Language Models (LLMs) have proven efficient in giving definition-type answers to user input queries. While giving various types of answers, such as examples and paraphrases, is an easy task for humans, LLMs struggle to provide correct answers to queries other than definition-type ones. In this study, we evaluated this drop in performance using TrackList, a fine-grained linguistic and statistical analysis pipeline to investigate the impact of the pre-training data on LLMs' answers to diverse linguistic queries. We also introduce RefoMed-EN, an English dataset consisting of 6170 human-annotated medical terms alongside their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We studied whether the high frequency of a concept (head) or low frequency (tail) impacts the language model's performance. We evaluated the quality of the LLM's output using syntactic and semantic similarity metrics, statistical correlations and embeddings. Results showed that the LLM's task performance for definition-type questions is the highest, while for the exemplification type it is the lowest. Additionally, we showed that for definition-type questions, large language models are prone to paraphrase more on popular and frequent knowledge and less on tail and technical knowledge, especially in the expert texts.

[21] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels

Anantha Padmanaban Krishna Kumar

Main category: cs.CL

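TL;DR: This paper asks whether in-context learning (ICL) can override the label semantics of a pre-trained model or only refines them. Treating LLMs as prompt-induced classifiers and comparing their behavior under "natural" and "inverted" label demonstrations, the author proposes a semantic anchor view: ICL relies mainly on stable semantic directions formed during pre-training and cannot truly invert label meanings. Experiments cover eight classification tasks and eight open-source LLMs (1-12B parameters); models fail to learn anti-semantic classifiers, and the semantic override rate is zero.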

Details Motivation: To probe the mechanism behind in-context learning (ICL), clarifying whether it rebuilds the semantic mapping or depends on a pre-trained semantic backbone, and thereby to understand the basic limits of few-shot prompting. Method: Treats LLMs as prompt-induced classifiers, designs "natural" and "inverted" demonstration settings, introduces three alignment metrics (truth, prior, and prompt alignment) and a semantic override rate, and runs experiments on eight classification tasks and eight open-source LLMs. Result: With natural demonstrations, ICL improves accuracy while keeping prior alignment high; most correct predictions coincide with zero-shot behavior. With inverted demonstrations, models cannot build coherent anti-semantic classifiers; prompt alignment rises only at the cost of accuracy, and the semantic override rate is zero. Conclusion: ICL mainly adjusts how inputs project onto pre-trained semantic directions rather than flexibly remapping label meanings. This supports the semantic anchor view and indicates that, at these scales, ICL alone cannot override pre-trained semantics; deeper interventions are needed to invert them. Abstract: Can in-context learning (ICL) override pre-trained label semantics, or does it merely refine an existing semantic backbone? We address this question by treating LLMs as prompt-induced classifiers and contrasting their behavior under natural demonstrations (with correct labels) and inverted demonstrations (systematically flipping label meanings). We decompose ICL behavior into three alignment metrics (truth, prior, and prompt alignment) and introduce a semantic override rate, defined as correctness under flipped semantics. Across eight classification tasks and eight open-source LLMs (1--12B parameters), we find consistent evidence for a semantic anchor view. With natural demonstrations, ICL improves accuracy while maintaining strong prior alignment; most correct predictions coincide with zero-shot behavior, even when the prior is weak. With inverted demonstrations, models cannot learn coherent anti-semantic classifiers: prompt alignment increases only by sacrificing accuracy, and semantic override rates remain exactly zero in our few-shot 1--12B setting. Rather than flexibly remapping label meanings, ICL primarily adjusts how inputs project onto stable semantic directions learned during pre-training, clarifying fundamental limits of few-shot prompting and suggesting that overriding label semantics at these scales requires interventions beyond ICL. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/semantic-anchors-icl.

[22] Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection

Michael Iskandardinata,William Christian,Derwin Suhartono

Main category: cs.CL

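TL;DR: Proposes a retrieval-aware sarcasm detection method that combines external knowledge retrieval with elicitation of the model's internal knowledge, significantly improving LLM performance across several datasets.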

Details Motivation: Current LLMs are unreliable on sarcastic text that involves cultural variation, slang, or unfamiliar vocabulary, because they lack sufficient contextual support. Method: Building on Pragmatic Metacognitive Prompting (PMP), introduces two context-enrichment strategies: web-based retrieval to supply non-parametric knowledge, and elicitation of the model's own internal knowledge for self-knowledge awareness. Result: On Twitter Indonesia Sarcastic, non-parametric retrieval improves macro-F1 by 9.87%; self-knowledge retrieval improves macro-F1 by 3.29% on SemEval and 4.08% on MUStARD. Conclusion: Contextual information is critical to LLM performance in sarcasm detection, especially for culturally specific expressions and terms unknown to the model. Abstract: Detecting sarcasm remains a challenging task in the field of Natural Language Processing (NLP) despite recent advances in neural network approaches. Currently, Pre-trained Language Models (PLMs) and Large Language Models (LLMs) are the preferred approach for sarcasm detection. However, the complexity of sarcastic text, combined with linguistic diversity and cultural variation across communities, has made the task more difficult even for PLMs and LLMs. Beyond that, those models also exhibit unreliable detection of words or tokens that require extra grounding for analysis. Building on a state-of-the-art prompting method in LLMs for sarcasm detection called Pragmatic Metacognitive Prompting (PMP), we introduce a retrieval-aware approach that incorporates retrieved contextual information for each target text. Our pipeline explores two complementary ways to provide context: adding non-parametric knowledge using web-based retrieval when the model lacks necessary background, and eliciting the model's own internal knowledge for a self-knowledge awareness strategy. We evaluated our approach on three datasets: Twitter Indonesia Sarcastic, SemEval-2018 Task 3, and MUStARD. Non-parametric retrieval resulted in a significant 9.87% macro-F1 improvement on Twitter Indonesia Sarcastic compared to the original PMP method. Self-knowledge retrieval improves macro-F1 by 3.29% on SemEval and by 4.08% on MUStARD. These findings highlight the importance of context in enhancing LLM performance on the sarcasm detection task, particularly where culturally specific slang, references, or terms unknown to the LLMs are involved. Future work will focus on optimizing the retrieval of relevant contextual information and examining how retrieval quality affects performance. The experiment code is available at: https://github.com/wllchrst/sarcasm-detection_pmp_knowledge-base.

[23] Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

Thura Aung,Eaint Kay Khaing Kyaw,Ye Kyaw Thu,Thazin Myint Oo,Thepchai Supnithi

Main category: cs.CL

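TL;DR: This paper studies Kolmogorov-Arnold Networks (KANs) as classification heads in place of conventional MLPs for classification in low-resource languages such as Burmese. Experiments show that KANs are competitive with or better than MLPs across several embedding types, combining efficiency with expressiveness.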

Details Motivation: In low-resource languages, it is common to fine-tune only the classification layer while keeping encoder weights frozen, but the fixed non-linearity of MLPs limits expressiveness and efficiency, so a more expressive and efficient alternative is needed. Method: Uses KANs (FourierKAN, EfficientKAN, and FasterKAN) as classification heads, combined with TF-IDF, fastText, and multilingual Transformer (mBERT, Distil-mBERT) embeddings, and evaluates them on a classification task. Result: EfficientKAN with fastText achieves the highest F1 score (0.928); FasterKAN offers the best balance between speed and accuracy; on Transformer embeddings, EfficientKAN matches or slightly outperforms MLPs (0.917 F1 with mBERT). Conclusion: KANs are more expressive and efficient alternatives to MLP classification heads and are well suited to text classification in low-resource languages. Abstract: In low-resource languages like Burmese, classification tasks often fine-tune only the final classification layer, keeping pre-trained encoder weights frozen. While Multi-Layer Perceptrons (MLPs) are commonly used, their fixed non-linearity can limit expressiveness and increase computational cost. This work explores Kolmogorov-Arnold Networks (KANs) as alternative classification heads, evaluating Fourier-based FourierKAN, Spline-based EfficientKAN, and Grid-based FasterKAN across diverse embeddings including TF-IDF, fastText, and multilingual transformers (mBERT, Distil-mBERT). Experimental results show that KAN-based heads are competitive with or superior to MLPs. EfficientKAN with fastText achieved the highest F1-score (0.928), while FasterKAN offered the best trade-off between speed and accuracy. On transformer embeddings, EfficientKAN matched or slightly outperformed MLPs with mBERT (0.917 F1). These findings highlight KANs as expressive, efficient alternatives to MLPs for low-resource language classification.

[24] Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

Bryan E. Tuck,Rakesh M. Verma

Main category: cs.CL

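TL;DR: This study evaluates 28 LLM configurations on 58 character-level constraint satisfaction tasks (word puzzles). Architectural differences affect performance far more than parameter scale; high-capacity models improve with a larger reasoning budget, while mid-sized models saturate or degrade. Models also fail at very high rates on common words with unusual spelling, exposing an over-reliance on statistical regularities over orthographic validity. Scaling parameters or compute alone appears insufficient for hard orthographic constraints, and dedicated architectural innovation may be required.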

Details Motivation: LLMs perform poorly when generating text under strict orthographic constraints, yet systematic cross-architecture evaluation is lacking. The study aims to reveal how different architectures differ on character-level constraint satisfaction and why. Method: Tests 28 configurations across three model families (Qwen3, Claude Haiku-4.5, and GPT-5-mini) on 58 word puzzles that require character-level constraint satisfaction, and uses human solver data (ratings from 10,000 solvers per puzzle) for difficulty calibration and error analysis. Result: Architectural differences produce large performance gaps (F1 of 0.761 vs. 0.343, a 2.0-2.2x difference), far exceeding the 83% gain from eightfold parameter scaling within a family; high-capacity models gain clearly from a larger thinking budget, while mid-sized models saturate or degrade; models miss common words with unusual spelling such as "data", "poop", and "loll" 89-96% of the time, although human success on them is 86-95%. Conclusion: Scaling parameters or compute budget alone does not effectively improve character-level constraint satisfaction. Current models over-rely on distributional statistics and struggle with atypical but valid spelling patterns; dedicated architectural designs or training objectives may be needed. Abstract: Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-architecture evaluation remains limited. We evaluate 28 configurations spanning three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Architectural differences produce substantially larger performance gaps (2.0-2.2x, F1=0.761 vs. 0.343) than parameter scaling within families (83% gain from eightfold scaling), suggesting that constraint satisfaction may require specialized architectural features or training objectives beyond standard language model scaling. Thinking budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade. These patterns are inconsistent with uniform compute benefits. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but consistent calibration (r=0.24-0.38) across all families, yet identify systematic failures on common words with unusual orthography ("data", "poop", "loll": 86-95% human success, 89-96% model miss rate). These failures reveal over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns, suggesting architectural innovations may be required beyond simply scaling parameters or computational budgets.

[25] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

Ye Bhone Lin,Thura Aung,Ye Kyaw Thu,Thazin Myint Oo

Main category: cs.CL

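TL;DR: This paper presents the first study of ASR error correction for low-resource Burmese, proposing sequence-to-sequence Transformer models that combine IPA and alignment features and significantly improve word- and character-level accuracy.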

Details Motivation: Burmese is under-resourced for automatic speech recognition (ASR), and existing ASR output contains many errors, so effective error correction is needed to improve recognition quality. Method: Uses sequence-to-sequence Transformer models, explores several feature integration strategies (including International Phonetic Alphabet (IPA) and alignment information), and runs ASR error correction (AEC) experiments on five ASR backbones. Result: The AEC model combining IPA and alignment features reduces the average WER of the ASR models from 51.56 to 39.82 (before augmentation) and raises chrF++ from 0.5864 to 0.627, both clearly better than the baseline. Conclusion: ASR error correction effectively improves recognition quality for low-resource languages, and feature design (such as IPA and alignment information) plays an important role. Abstract: This paper investigates sequence-to-sequence Transformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IPA and alignment information. To our knowledge, this is the first study addressing ASR error correction specifically for Burmese. We evaluate five ASR backbones and show that our ASR Error Correction (AEC) approaches consistently improve word- and character-level accuracy over baseline outputs. The proposed AEC model, combining IPA and alignment features, reduced the average WER of ASR models from 51.56 to 39.82 before augmentation (and from 51.56 to 43.59 after augmentation) and improved chrF++ scores from 0.5864 to 0.627, demonstrating consistent gains over the baseline ASR outputs without AEC. Our results highlight the robustness of AEC and the importance of feature design for improving ASR outputs in low-resource settings.
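
For readers unfamiliar with the WER figures quoted above, the following is a minimal sketch of the standard word error rate (word-level edit distance divided by reference length) and of the relative reduction implied by the reported averages. It is not the authors' evaluation code: the function names are hypothetical, and the whitespace tokenization is an assumption (Burmese text would first need word segmentation, which is not shown here).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes tokens are separated by whitespace."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before, after):
    """Relative improvement of an error metric, e.g. WER before and after correction."""
    return (before - after) / before


if __name__ == "__main__":
    print(wer("a b c", "a x c"))
    # Averages quoted in the abstract: 51.56 -> 39.82 before augmentation.
    print(round(relative_reduction(51.56, 39.82), 4))
```

Under these assumptions, the drop from 51.56 to 39.82 corresponds to a relative WER reduction of roughly 23%.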

[26] MortgageLLM: Domain-Adaptive Pretraining with Residual Instruction Transfer, Alignment Tuning, and Task-Specific Routing

Manish Jain,Satheesh Kumar Ponnambalam,Salman Faroz,Chandrakanth Lns,Vinay Sharma

Main category: cs.CL

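TL;DR: MortgageLLM is a dual-expert large language model for mortgage finance. It uses an instruction residual technique to restore instruction following after domain adaptation, balancing performance on structured tasks and conversational Q&A.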

Details Motivation: In specialized domains such as mortgage finance, general-purpose LLMs lack domain knowledge, and a single multi-task fine-tune forces a trade-off between structured-task performance and conversational ability. Method: Builds a dual-track specialization framework on LLaMA-3.1-8B: one expert handles conversational Q&A (optimized with DPO) and the other handles structured tasks such as classification and summarization (SFT). The instruction residual technique restores instruction following, and a few-shot task routing mechanism, run by one of the experts itself, dispatches tasks. Result: On domain benchmarks, MortgageLLM (MLM v2) significantly outperforms the baseline: the summarization score rises from 3.99 to 4.58, Q&A from 4.0 to 4.09, and classification from 1.2 to 2.6; BERTScore reaches 0.77, 0.68, and 0.75 on the three tasks, all above the baseline. Conclusion: A dual-expert architecture combined with the instruction residual technique and intelligent routing resolves the conflict between structured tasks and natural conversation in a specialized domain, offering a workable design for vertical-domain LLMs. Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities across general domains, yet their application to specialized sectors such as mortgage finance requires domain-specific knowledge augmentation while preserving instruction-following fidelity. We present MortgageLLM, a novel domain-specific large language model that addresses this dual challenge. It is developed using a dual-track specialization framework from a single base model (LLaMA-3.1-8B). We opted for this dual-expert approach as a single multi-task model suffers from performance trade-offs, where optimizing for structured tasks (via SFT) degrades conversational fidelity (via DPO). Our dual-track method solves this by creating two specialists, allowing each to be optimally trained for its distinct capability. Our approach applies the instruction residual technique to restore instruction-following capabilities post-domain adaptation without supervised fine-tuning. We contribute: (1) application of this residual technique to the highly specialized mortgage finance domain; (2) a dual-expert architecture combining a conversational Q&A model and a structured task model for classification and summarization; and (3) an intelligent task routing mechanism using few-shot classification performed by one of the expert models itself. We validate our approach on domain-specific benchmarks, where our final model (MLM v2) significantly outperforms the base LLaMA-3.1-8B-Instruct, achieving an LLM-as-a-Judge summarization score of 4.58 (vs. 3.99), a Q&A score of 4.09 (vs. 4.0), and a classification score of 2.6 (vs. 1.2). On semantic similarity, our model achieved a BERTScore of 0.77 for summarization (vs. 0.74), 0.68 for Q&A (vs. 0.58), and 0.75 for classification (vs. 0.73), substantially outperforming baseline approaches.

[27] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

Yuhang Wang,Yanxu Zhu,Dongyuan Lu,Jitao Sang

Main category: cs.CL

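TL;DR: This paper proposes SGASA, an adaptive safety alignment framework that uses synthesized guidelines and augmented training to make reasoning models safer against malicious jailbreak prompts.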

Details Motivation: Adversarial jailbreak prompts are covert and deceptive and can easily bypass existing safety mechanisms, so a safety alignment method that adaptively reinforces defenses is needed. Method: The SGASA framework has two stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts, and Alignment Fine-tuning, which uses supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed those guidelines into the model. Result: Experiments on multiple datasets show that SGASA significantly improves model safety, defending against harmful adversarial prompts while reducing false refusals of benign requests. Conclusion: SGASA is a scalable, adaptive safety alignment method that strengthens the safety robustness of reasoning models under adversarial attack. Abstract: Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Due to the covert and deceptive nature of such prompts, they can often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen models' robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts; and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.

[28] Can Finetuing LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

Steven Wang,Kyle Hunt,Shaojie Tang,Kenneth Joseph

Main category: cs.CL

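TL;DR: This study examines whether fine-tuning LLMs on small samples of human survey data makes them simulate human behavior more realistically. Fine-tuning improves diversity, subgroup alignment, and belief-action coherence, but the fine-tuned models still fail to reproduce the regression coefficients of the original study, indicating that they cannot yet replace human participants in inferential analyses.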

Details Motivation: Whether LLMs can substitute for human participants is disputed; existing work shows systematic biases when LLMs simulate human behavior, especially for minority subgroups and behavioral coherence, so ways to improve them need to be explored. Method: Using an experiment on information disclosure behavior, compares human and LLM-generated responses along several dimensions (distributional divergence, subgroup alignment, belief-action coherence, and recovery of regression coefficients) to assess the effect of fine-tuning LLMs on small human samples. Result: Fine-tuning substantially improves heterogeneity, subgroup alignment, and belief-action coherence, but recovery of the original study's regression coefficients still fails; even the best fine-tuned models do not reach an acceptable level. Conclusion: Fine-tuning on small samples improves LLMs' ability to simulate human behavior, but the generated data are still not adequate to replace human participants in formal statistical inference. Abstract: There is ongoing debate about whether large language models (LLMs) can serve as substitutes for human participants in survey and experimental research. While recent work in fields such as marketing and psychology has explored the potential of LLM-based simulation, a growing body of evidence cautions against this practice: LLMs often fail to align with real human behavior, exhibiting limited diversity, systematic misalignment for minority subgroups, insufficient within-group variance, and discrepancies between stated beliefs and actions. This study examines an important and distinct question in this domain: whether fine-tuning on a small subset of human survey data, such as that obtainable from a pilot study, can mitigate these issues and yield realistic simulated outcomes. Using a behavioral experiment on information disclosure, we compare human and LLM-generated responses across multiple dimensions, including distributional divergence, subgroup alignment, belief-action coherence, and the recovery of regression coefficients. We find that fine-tuning on small human samples substantially improves heterogeneity, alignment, and belief-action coherence relative to the base model. However, even the best-performing fine-tuned models fail to reproduce the regression coefficients of the original study, suggesting that LLM-generated data remain unsuitable for replacing human participants in formal inferential analyses.

[29] Developing an Open Conversational Speech Corpus for the Isan Language

Adisai Na-Thalang,Chanakan Wittayasakpan,Kritsadha Phatcharoen,Supakit Buakaw

Main category: cs.CL

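TL;DR: This paper describes the development of the first open conversational speech dataset for the Isan language. The dataset contains natural speech that captures authentic linguistic phenomena of the dialect, and the work addresses transcription challenges caused by the lack of a standard orthography.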

Details Motivation: To advance research on Isan, the most widely spoken regional dialect in Thailand, to address the gap left by existing speech corpora that are based mostly on read speech rather than natural conversation, and to promote inclusive AI. Method: Designs practical transcription protocols for Isan that balance linguistic authenticity with the needs of computational processing, and builds an open speech dataset containing spontaneous conversation, code-switching, and dialectal features. Result: Builds and publicly releases the first natural conversational speech dataset for Isan, addressing the transcription difficulties caused by inconsistent orthography, which stems from tonal differences between Thai and Isan. Conclusion: The dataset is an important resource for speech research on low-resource languages, helps advance conversational speech modeling, and supports the representation of minority languages in AI. Abstract: This paper introduces the development of the first open conversational speech dataset for the Isan language, the most widely spoken regional dialect in Thailand. Unlike existing speech corpora that are primarily based on read or scripted speech, this dataset consists of natural speech, thereby capturing authentic linguistic phenomena such as colloquialisms, spontaneous prosody, disfluencies, and frequent code-switching with central Thai. A key challenge in building this resource lies in the lack of a standardized orthography for Isan. Current writing practices vary considerably, due to the different lexical tones between Thai and Isan. This variability complicates the design of transcription guidelines and poses questions regarding consistency, usability, and linguistic authenticity. To address these issues, we establish practical transcription protocols that balance the need for representational accuracy with the requirements of computational processing. By releasing this dataset as an open resource, we aim to contribute to inclusive AI development, support research on underrepresented languages, and provide a basis for addressing the linguistic and technical challenges inherent in modeling conversational speech.

[30] PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

Robert Belanec,Branislav Pecher,Ivan Srba,Maria Bielikova

Main category: cs.CL

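TL;DR: This paper presents PEFT-Bench, a unified end-to-end benchmark for evaluating parameter-efficient fine-tuning (PEFT) methods on autoregressive LLMs, and introduces the PSCP score, which accounts for trainable parameters, inference speed, and training memory.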

Details Motivation: Although LLMs perform well on many tasks, their scale brings high computational and environmental costs that limit accessibility. Existing evaluations of PEFT methods have limited coverage and are hard to reproduce, so a unified, reproducible benchmark is needed. Method: Builds a unified evaluation framework, PEFT-Bench, covering 27 NLP datasets and 6 PEFT methods, and proposes a new metric, PSCP, that jointly measures the number of trainable parameters, inference speed, and training memory usage. Result: Demonstrates the use of PEFT-Bench across multiple datasets and PEFT methods, with the PSCP metric enabling a more comprehensive performance assessment. Conclusion: PEFT-Bench provides a reproducible, multi-dimensional evaluation platform for PEFT methods, helping to advance and compare efficient fine-tuning techniques. Abstract: Despite the state-of-the-art performance of Large Language Models (LLMs) achieved on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility. Parameter-efficient fine-tuning (PEFT) methods address this challenge by reducing the number of trainable parameters while maintaining strong downstream performance. Despite the increased development in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce. To bridge this gap, we introduce PEFT-Bench, a unified end-to-end benchmark for evaluating diverse PEFT methods on autoregressive LLMs. We demonstrate its usage across 27 NLP datasets and 6 PEFT methods. To account for different PEFT training and inference factors, we also introduce the PEFT Soft Score Penalties (PSCP) metric, which takes trainable parameters, inference speed, and training memory usage into account.

[31] Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text

Kai Kugler

Main category: cs.CL

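TL;DR: The first systematic study of Martin's Law (the relationship between word frequency and polysemy) in text generated by neural language models during training; it finds a non-monotonic developmental trajectory with an optimal semantic window.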

Details Motivation: To examine whether and how neural language models follow Martin's Law, a statistical regularity of human language, during training, in order to understand how their internal semantic structure evolves. Method: Uses DBSCAN clustering of contextualized embeddings as an operationalization of word senses, and analyzes how the frequency-polysemy relationship evolves across 30 training checkpoints of four Pythia models of different sizes (70M-1B). Result: Martin's Law emerges around checkpoint 100, peaks at checkpoint 104 (r > 0.6), and then degrades quickly; small models suffer catastrophic semantic collapse late in training, while large models degrade gracefully; the frequency-specificity trade-off stays stable throughout training (r ≈ -0.3). Conclusion: LLM compliance with linguistic regularities does not increase monotonically with training; instead there is a brief optimal semantic window, which reveals a dynamic balance in semantic development. Abstract: We present the first systematic investigation of Martin's Law - the empirical relationship between word frequency and polysemy - in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin's Law emerges around checkpoint 100, reaches peak correlation (r > 0.6) at checkpoint 104, then degrades by checkpoint 105. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable (r ≈ -0.3) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text is not monotonically increasing with training, but instead follows a balanced trajectory with an optimal semantic window. This work establishes a novel methodology for evaluating emergent linguistic structure in neural language models.
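
As a rough illustration of what a frequency-polysemy correlation such as the reported r could look like in code, the sketch below computes a Pearson correlation between log word frequency and the number of senses per word. It is only a sketch under stated assumptions: the abstract does not say whether frequencies are log-transformed, the sense counts are taken as given (the DBSCAN clustering step is not reproduced), and the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def frequency_polysemy_correlation(frequencies, sense_counts):
    """Correlate log10 word frequency with sense count (log scaling is an assumption)."""
    log_freqs = [math.log10(f) for f in frequencies]
    return pearson_r(log_freqs, sense_counts)


if __name__ == "__main__":
    # Toy data, not taken from the paper: more frequent words have more senses.
    freqs = [10, 100, 1000, 10000]
    senses = [1, 2, 2, 4]
    print(frequency_polysemy_correlation(freqs, senses))
```

A positive value on real data would be consistent with Martin's Law; tracking it per checkpoint would give a trajectory of the kind the abstract describes.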

[32] Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model

Joshua Fonseca Rivera

Main category: cs.CL

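TL;DR: Through fine-tuning, a 7B-parameter language model is trained to reliably detect and report injected "thoughts", reaching 85% accuracy with no false positives and satisfying three of Lindsey's criteria.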

Details Motivation: To study whether the introspective ability of language models can be improved by direct training rather than relying on natural emergence. Method: Fine-tunes the model on transient single-token injections so that it learns to detect and report the injected semantic content, and tests generalization to unseen concepts. Result: The model improves from near-complete failure (0.4% accuracy) to 85% accuracy with a 0% false positive rate, and generalizes well to unseen concepts. Conclusion: At least part of introspective behavior can be directly induced through training, offering a feasible path toward built-in AI transparency. Abstract: Lindsey (2025) investigates introspective awareness in language models through four experiments, finding that models can sometimes detect and identify injected activation patterns -- but unreliably (~20% success in the best model). We focus on the first of these experiments -- self-report of injected "thoughts" -- and ask whether this capability can be directly trained rather than waiting for emergence. Through fine-tuning on transient single-token injections, we transform a 7B parameter model from near-complete failure (0.4% accuracy, 6.7% false positive rate) to reliable detection (85% accuracy on held-out concepts at α=40, 0% false positives). Our model detects fleeting "thoughts" injected at a single token position, retains that information, and reports the semantic content across subsequent generation steps. On this task, our trained model satisfies three of Lindsey's criteria: accuracy (correct identification), grounding (0/60 false positives), and internality (detection precedes verbalization). Generalization to unseen concept vectors (7.5pp gap) demonstrates the model learns a transferable skill rather than memorizing specific vectors, though this does not establish metacognitive representation in Lindsey's sense. These results address an open question raised by Lindsey: whether "training for introspection would help eliminate cross-model differences." We show that at least one component of introspective behavior can be directly induced, offering a pathway to built-in AI transparency.

[33] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

Antonín Jarolím,Martin Fajčík,Lucia Makaiová

Main category: cs.CL

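TL;DR: This paper studies the extraction of fine-grained evidence for verifying or refuting misinformation in Czech and Slovak news comments. It builds a two-way annotated dataset created by paid annotators and evaluates how well several LLMs agree with human annotations.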

Details Motivation: Misinformation often spreads in online news comments, so effective methods are needed to identify the specific text evidence that supports or refutes a claim, particularly for Czech and Slovak, where resources and prior work are scarce. Method: Builds a new bilingual (Czech and Slovak) fine-grained evidence dataset annotated by paid annotators, then evaluates several LLMs on copying evidence exactly from the source text and analyzes their error rates. Result: LLMs often fail to copy evidence verbatim from the source text, producing invalid outputs; llama3.1:8b performs well despite its small size, while gpt-oss-120b underperforms; qwen3:14b, deepseek-r1:32b, and gpt-oss:20b strike a good balance between model size and agreement with human annotations. Conclusion: Fine-grained evidence extraction remains challenging for current LLMs, especially faithful copying from the source text; smaller models can outperform larger ones, and performance depends not only on parameter count but also on training data and architecture. Abstract: Misinformation frequently spreads in user comments under online news articles, highlighting the need for effective methods to detect factually incorrect information. To strongly support or refute claims extracted from such comments, it is necessary to identify relevant documents and pinpoint the exact text spans that justify or contradict each claim. This paper focuses on the latter task -- fine-grained evidence extraction for Czech and Slovak claims. We create a new dataset containing two-way annotated fine-grained evidence created by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations. The results reveal that LLMs often fail to copy evidence verbatim from the source text, leading to invalid outputs. Error-rate analysis shows that the llama3.1:8b model achieves a high proportion of correct outputs despite its relatively small size, while the gpt-oss-120b model underperforms despite having many more parameters. Furthermore, the models qwen3:14b, deepseek-r1:32b, and gpt-oss:20b demonstrate an effective balance between model size and alignment with human annotations.
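
The verbatim-copy requirement mentioned above can be made concrete with a small validity check. The following is a minimal sketch, not the authors' evaluation script: it assumes an output is invalid when an extracted span is empty or does not occur as an exact substring of the source, and the function names and the sample sentence are hypothetical.

```python
def is_verbatim_span(evidence: str, source: str) -> bool:
    """True if the evidence is non-empty and appears character-for-character in the source."""
    return bool(evidence.strip()) and evidence in source


def invalid_output_rate(spans, source):
    """Fraction of extracted spans that are not verbatim copies of the source text."""
    if not spans:
        return 0.0
    invalid = sum(1 for span in spans if not is_verbatim_span(span, source))
    return invalid / len(spans)


if __name__ == "__main__":
    source = "Praha je hlavní město České republiky."
    model_spans = ["hlavní město", "hlavni mesto"]  # second span drops the diacritics
    print(invalid_output_rate(model_spans, source))
```

Exact substring matching is deliberately strict: a span that differs only in diacritics or whitespace counts as invalid here, which is one plausible reading of "verbatim"; a looser, normalized comparison would be a different design choice.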

[34] Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation

Zhifeng Hao,Qibin Song,Ruichu Cai,Boyan Xu

Main category: cs.CL

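TL;DR: DSR-SQL is a dual-state reasoning framework that improves LLM Text-to-SQL performance on complex databases by modeling the interaction between a context state and a generation state, reaching competitive accuracy without post-training or in-context examples.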

Details Motivation: Existing Chain-of-Thought-based Text-to-SQL methods struggle to maintain coherent reasoning on complex enterprise databases because of limited context capacity, unreliable schema linking, and weak semantic grounding. Method: Propose the DSR-SQL framework, comprising an adaptive context state (which refines large schemas and selects relevant structures) and a progressive generation state (which generates SQL through feedback-guided state transitions), enabling the model to self-correct and align with user intent. Result: 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set, a competitive showing. Conclusion: DSR-SQL effectively improves LLM Text-to-SQL reasoning on complex databases without additional training or in-context examples, making it practical and scalable. Abstract: Recent divide-and-conquer reasoning approaches, particularly those based on Chain-of-Thought (CoT), have substantially improved the Text-to-SQL capabilities of Large Language Models (LLMs). However, when applied to complex enterprise databases, such methods struggle to maintain coherent reasoning due to limited context capacity, unreliable schema linking, and weak grounding in database semantics. To overcome these issues, we introduce DSR-SQL, a Dual-State Reasoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. The first constructs a compact, semantically faithful environment by refining large schemas and selecting relevant structures, while the second formalizes SQL synthesis as feedback-guided state transitions, enabling the model to self-correct and align with user intent. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set. Our implementation will be open-sourced at: https://github.com/DMIRLAB-Group/DSR-SQL.

[35] Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning

Kaifeng Hong,Yinglong Zhang,Xiaoying Hong,Xuewen Xia,Xing Xu

Main category: cs.CL

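TL;DR: This paper proposes Odin, a new architecture that injects graph structure into Transformers through an oriented dual-module mechanism, fusing text and graph structure effectively while avoiding over-smoothing; it also proposes Light Odin, a lightweight variant that keeps performance competitive at significantly lower computational cost.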

Details Motivation: Existing approaches to text-attributed graphs are limited: GNNs suffer from over-smoothing and hop-dependent diffusion, while Transformers ignore graph topology. A model is needed that combines strong text understanding with structural reasoning. Method: Propose the Odin architecture, which injects multi-hop graph structure at specific Transformer layers to obtain low-, mid-, and high-level structural abstraction aligned with the semantic hierarchy; aggregate on the global [CLS] representation to avoid over-smoothing; further design Light Odin for efficiency. Result: On multiple text-rich graph benchmarks, Odin achieves state-of-the-art accuracy, and Light Odin remains competitive at significantly reduced computational cost. Conclusion: Odin and its lightweight variant Light Odin form a unified, hop-free framework for structure-text integration that is expressive, efficient, and scalable. Abstract: Text-attributed graphs require models to effectively combine strong textual understanding with structurally informed reasoning. Existing approaches either rely on GNNs--limited by over-smoothing and hop-dependent diffusion--or employ Transformers that overlook graph topology and treat nodes as isolated sequences. We propose Odin (Oriented Dual-module INtegration), a new architecture that injects graph structure into Transformers at selected depths through an oriented dual-module mechanism. Unlike message-passing GNNs, Odin does not rely on multi-hop diffusion; instead, multi-hop structures are integrated at specific Transformer layers, yielding low-, mid-, and high-level structural abstraction aligned with the model's semantic hierarchy. Because aggregation operates on the global [CLS] representation, Odin fundamentally avoids over-smoothing and decouples structural abstraction from neighborhood size or graph topology. We further establish that Odin's expressive power strictly contains that of both pure Transformers and GNNs. To make the design efficient in large-scale or low-resource settings, we introduce Light Odin, a lightweight variant that preserves the same layer-aligned structural abstraction for faster training and inference. Experiments on multiple text-rich graph benchmarks show that Odin achieves state-of-the-art accuracy, while Light Odin delivers competitive performance with significantly reduced computational cost. Together, Odin and Light Odin form a unified, hop-free framework for principled structure-text integration. The source code of this model has been released at https://github.com/hongkaifeng/Odin.

[36] A Systematic Study of Model Merging Techniques in Large Language Models

Oğuz Kağan Hitit,Leander Girrbach,Zeynep Akata

Main category: cs.CL

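TL;DR: This paper presents a large-scale systematic evaluation of six state-of-the-art model merging methods on large language models (LLMs) and finds that the simplest method, Task Arithmetic, is the only one that reliably improves performance; more complex methods often degrade it, indicating that existing merging techniques do not transfer directly to modern LLMs and that LLM-specific merging algorithms and fine-tuning methods are needed.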

Details Motivation: It is unclear whether model merging methods that work for small models and classifiers generalize to LLMs, so a systematic evaluation of existing methods on LLMs is needed. Method: Evaluate six state-of-the-art merging methods (including recent subspace methods) at scale across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks, measuring the merged model's gains relative to the base model and the best individual checkpoint. Result: Task Arithmetic, the oldest and simplest method, is the only approach that reliably yields performance gains on LLMs; other interference-aware and subspace merging methods typically cause significant performance drops. Conclusion: Current merging techniques do not transfer directly to modern LLMs; future work should design LLM-specific merging algorithms and merging-aware fine-tuning methods. Abstract: Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.
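For readers unfamiliar with Task Arithmetic, the usual formulation adds scaled task vectors (fine-tuned weights minus base weights) back onto the base model. The sketch below shows that arithmetic on plain Python lists; the parameter layout, the `scale` value, and the function name are illustrative assumptions, not the configuration evaluated in the paper.

```python
def task_arithmetic(base, checkpoints, scale=1.0):
    """Merge checkpoints as base + scale * sum(checkpoint - base), per parameter.

    base and each checkpoint map a parameter name to a flat list of floats.
    """
    merged = {}
    for name, base_weights in base.items():
        merged[name] = [
            b + scale * sum(ckpt[name][i] - b for ckpt in checkpoints)
            for i, b in enumerate(base_weights)
        ]
    return merged


# Toy example with one two-element parameter and two fine-tuned checkpoints.
base = {"w": [1.0, 2.0]}
checkpoints = [{"w": [2.0, 2.0]}, {"w": [1.0, 4.0]}]
print(task_arithmetic(base, checkpoints, scale=0.5))
```

In practice the same update would be applied tensor by tensor to real model weights, and the scaling coefficient is typically tuned on held-out data.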

[37] Hierarchical Ranking Neural Network for Long Document Readability Assessment

Yurui Zheng,Yijun Chen,Shaohong Zhang

Main category: cs.CL

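TL;DR: This paper proposes a bidirectional readability assessment mechanism and a pairwise sorting algorithm to improve text readability assessment, covering both sentence-level and document-level prediction, and outperforms baseline models on Chinese and English datasets.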

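Details Motivation: Existing deep learning approaches to readability assessment often ignore either text length or the ordinal relationship among readability labels, which limits their effectiveness. Method: Propose a bidirectional readability assessment mechanism that uses contextual information to identify semantically rich regions of the text for sentence-level readability prediction, then uses the sentence-level labels to assist document-level prediction; introduce a pairwise sorting algorithm based on label subtraction to model the ordinal relationship between readability levels. Result: Experiments on Chinese and English datasets show that the proposed model is competitive and outperforms other baseline models. Conclusion: The method improves readability assessment accuracy; in particular, combining sentence-level prediction with ordinal-relationship modeling strengthens the model's judgment of overall text difficulty. Abstract: Readability assessment aims to evaluate the reading difficulty of a text. In recent years, while deep learning technology has been gradually applied to readability assessment, most approaches fail to consider either the length of the text or the ordinal relationship of readability labels. This paper proposes a bidirectional readability assessment mechanism that captures contextual information to identify regions with rich semantic information in the text, thereby predicting the readability level of individual sentences. These sentence-level labels are then used to assist in predicting the overall readability level of the document. Additionally, a pairwise sorting algorithm is introduced to model the ordinal relationship between readability levels through label subtraction. Experimental results on Chinese and English datasets demonstrate that the proposed model achieves competitive performance and outperforms other baseline models.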
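The abstract does not spell out the pairwise sorting algorithm, so the following is only one plausible reading of "label subtraction": for each pair of documents, the sign of the difference between their readability labels becomes the ordering target. The function name and the choice to drop tied pairs are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_targets(labels):
    """Build (i, j, target) triples where target is the sign of labels[i] - labels[j].

    Pairs with equal labels carry no ordering information and are skipped
    (an assumption; the paper may handle ties differently).
    """
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        diff = labels[i] - labels[j]
        if diff != 0:
            pairs.append((i, j, 1 if diff > 0 else -1))
    return pairs


# Readability levels for three documents; documents 0 and 2 share a level.
print(pairwise_targets([3, 1, 3]))
```

Targets of this form are what a generic pairwise ranking loss would consume, which is why the ordinal structure of the labels is preserved rather than treated as unordered classes.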

[38] Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

Lina Conti,Dennis Fucci,Marco Gaido,Matteo Negri,Guillaume Wisniewski,Luisa Bentivogli

Main category: cs.CL

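TL;DR: This study examines how speech translation models assign gender to speaker-referring terms, finding that models rely not only on gender associations in the training data but also on acoustic information and first-person pronouns to infer speaker gender, going beyond the language model's masculine default.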

Details Motivation: In speech translation, acoustic characteristics may lead to misgendering the speaker, and how current models assign gender is not well understood, so their internal mechanisms need to be examined in order to reduce modality-specific bias. Method: Across three language pairs (en-es/fr/it), analyze how training data patterns, internal language model biases, and acoustic information interact, and use contrastive feature attribution on spectrograms to study how models assign gender. Result: Models do not simply replicate term-specific gender associations from the training data but learn broader patterns of masculine prevalence; although the internal language model is strongly biased toward the masculine, models can override that preference based on acoustic input; the model with higher gender accuracy uses first-person pronouns to link gendered terms to the speaker and draws gender information from across the frequency spectrum rather than from pitch alone. Conclusion: Speech translation models assign gender by combining contextual pronouns with distributed acoustic features, revealing a mechanism that goes beyond pitch and helping to understand and mitigate modality-specific bias. Abstract: Unlike text, speech conveys information about the speaker, such as gender, through acoustic cues like pitch. This gives rise to modality-specific bias concerns. For example, in speech translation (ST), when translating from languages with notional gender, such as English, into languages where gender-ambiguous terms referring to the speaker are assigned grammatical gender, the speaker's vocal characteristics may play a role in gender assignment. This risks misgendering speakers, whether through masculine defaults or vocal-based assumptions. Yet, how ST models make these decisions remains poorly understood. We investigate the mechanisms ST models use to assign gender to speaker-referring terms across three language pairs (en-es/fr/it), examining how training data patterns, internal language model (ILM) biases, and acoustic information interact. We find that models do not simply replicate term-specific gender associations from training data, but learn broader patterns of masculine prevalence. While the ILM exhibits strong masculine bias, models can override these preferences based on acoustic input. Using contrastive feature attribution on spectrograms, we reveal that the model with higher gender accuracy relies on a previously unknown mechanism: using first-person pronouns to link gendered terms back to the speaker, accessing gender information distributed across the frequency spectrum rather than concentrated in pitch.

[39] Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects

Husne Ara Rubaiyeat,Hasan Mahmud,Md Kamrul Hasan

Main category: cs.CL

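TL;DR: This paper introduces the IsharaKhobor dataset and its subsets to advance research on Bangla Sign Language Translation (BdSLT), providing a publicly available sentence-level dataset along with benchmarks and ablations on vocabulary restriction and canonicalization.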

Details Motivation: Bangla Sign Language is extremely low-resource and lacks a standard sentence-level dataset, which holds back AI-based assistive tools for deaf and hard-of-hearing people, so a high-quality dataset is needed to enable research. Method: Present the IsharaKhobor dataset and two subsets (IsharaKhobor_small and IsharaKhobor_canonical_small), benchmark with landmark-based raw and RQE embeddings, and run ablations on vocabulary restriction and canonicalization. Result: The IsharaKhobor datasets were built and publicly released; experiments under different settings show how vocabulary canonicalization and restriction affect model performance, providing a benchmark for BdSLT. Conclusion: IsharaKhobor fills a resource gap in Bangla Sign Language Translation, supports future research, and illustrates the importance of data preprocessing in low-resource sign language translation. Abstract: Bangla Sign Language Translation (BdSLT) has been severely constrained so far because the language itself is very low-resource. Creating a standard sentence-level dataset for BdSLT is of immense importance for developing AI-based assistive tools for deaf and hard-of-hearing people in the Bangla-speaking community. In this paper, we present a dataset, IsharaKhobor, and two subsets of it to enable research. We also present the challenges in developing the dataset and a way forward by benchmarking with landmark-based raw and RQE embeddings. We perform ablations on vocabulary restriction and canonicalization within the dataset, which resulted in two more datasets, IsharaKhobor_small and IsharaKhobor_canonical_small. The dataset is publicly available at: www.kaggle.com/datasets/hasanssl/isharakhobor [1].

[40] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

Minjoon Choi

Main category: cs.CL

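TL;DR: This paper proposes the RoParQ benchmark and the XParaCon metric for measuring how consistently large language models answer paraphrased questions, and uses reasoning-based supervised fine-tuning to align models toward semantic invariance; experiments show the method significantly improves robustness.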

Details Motivation: LLMs behave inconsistently on paraphrased questions, suggesting reliance on surface patterns rather than genuine semantic understanding, so better evaluation tools and training strategies are needed to improve semantic consistency. Method: Build the RoParQ benchmark by generating paraphrases of standard dataset questions with proprietary models and keeping the examples that elicit inconsistent confidence from a judge model; propose the XParaCon metric, which quantifies robustness via the standard deviation of accuracies across question variants; apply a reasoning-based, paraphrase-aware supervised fine-tuning (SFT) strategy for alignment. Result: Lightweight models fine-tuned with the proposed SFT strategy show markedly better cross-paraphrase consistency, reaching levels comparable to much larger pre-trained models. Conclusion: RoParQ, XParaCon, and the SFT method improve robustness in multiple-choice QA and reduce reliance on surface patterns, supporting more reliable LLMs. Abstract: Large Language Models (LLMs) often exhibit inconsistent behavior when answering paraphrased questions, suggesting a reliance on surface-level patterns rather than true semantic understanding. To address this limitation, we introduce RoParQ, a benchmark specifically constructed to evaluate cross-paraphrase consistency in closed-book multiple-choice QA. This benchmark is derived from standard datasets by generating paraphrases via proprietary models and selectively retaining examples that elicit inconsistent confidence from a judge model. We further propose XParaCon, a novel evaluation metric that quantifies a model's robustness by measuring the standard deviation of accuracies across question variants. Additionally, we implement a reasoning-based, paraphrase-aware Supervised Fine-Tuning (SFT) strategy designed to align models toward semantic invariance. Our experiments demonstrate that this targeted alignment significantly enhances robustness. Notably, fine-tuned lightweight models achieved consistency levels comparable to much larger pre-trained models. These results highlight the efficacy of our approach in mitigating superficial memorization and fostering more robust, reliable LLMs.
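The quantity at the core of XParaCon, the standard deviation of accuracies across question variants, can be sketched as follows. This covers only what the abstract states: any scaling or transformation the paper applies on top is not reproduced, and the use of the population standard deviation and the function names are assumptions.

```python
from statistics import pstdev


def variant_accuracies(correct):
    """correct[q][v] is True if variant v of question q was answered correctly.

    Returns one accuracy per variant index, averaged over questions.
    """
    n_variants = len(correct[0])
    return [sum(row[v] for row in correct) / len(correct) for v in range(n_variants)]


def accuracy_spread(correct):
    """Population standard deviation of the per-variant accuracies (lower is more consistent)."""
    return pstdev(variant_accuracies(correct))


# Two questions, three paraphrase variants each (illustrative data).
results = [
    [True, True, False],
    [True, False, False],
]
print(variant_accuracies(results), accuracy_spread(results))
```

A model that answers every paraphrase of a question the same way yields a spread of zero, regardless of whether its answers are right or wrong, so the spread is best read next to plain accuracy.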

[41] Auxiliary Metrics Help Decoding Skill Neurons in the Wild

Yixiu Zhao,Xiaozhi Wang,Zijun Yao,Lei Hou,Juanzi Li

Main category: cs.CL

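TL;DR: This paper proposes a lightweight, general method that identifies neurons encoding specific skills in large language models by correlating neuron activations with auxiliary metrics such as external labels or model confidence, and validates it on several tasks.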

Details Motivation: LLMs are capable but internally opaque, and it is hard to tell which neurons are responsible for which skills, so an interpretable method for locating skill-related neurons is needed. Method: Building on soft prompt training, correlate neuron activations with auxiliary metrics (such as external labels and the model's own confidence) to automatically identify skill-related neurons without manual token aggregation; the method extends to complex multi-skill scenarios. Result: Skill-related neurons were detected on open-ended generation, natural language inference, and BigBench arithmetic reasoning tasks, and previously unidentified shortcuts in arithmetic reasoning were uncovered. Conclusion: The method is simple and effective, revealing task-specific, interpretable neuron behavior in LLMs and improving interpretability. Abstract: Large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, yet their internal mechanisms remain largely opaque. In this paper, we introduce a simple, lightweight, and broadly applicable method with a focus on isolating neurons that encode specific skills. Building upon prior work that identified "skill neurons" via soft prompt training on classification tasks, our approach extends the analysis to complex scenarios involving multiple skills. We correlate neuron activations with auxiliary metrics -- such as external labels and the model's own confidence score -- thereby uncovering interpretable and task-specific behaviors without the need for manual token aggregation. We empirically validate our method on tasks spanning open-ended text generation and natural language inference, demonstrating its ability to detect neurons that not only drive known skills but also reveal previously unidentified shortcuts in arithmetic reasoning on BigBench.
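The idea of correlating neuron activations with an auxiliary metric can be sketched with a plain Pearson correlation. The abstract does not say which correlation measure or ranking rule the authors use, so Pearson correlation, ranking by absolute value, and the names `pearson` and `rank_neurons` are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_neurons(activations, metric):
    """Rank neuron indices by |correlation| between activation and the auxiliary metric.

    activations[i][j] is the activation of neuron j on example i;
    metric[i] is the auxiliary value (e.g. a label or confidence) for example i.
    """
    n_neurons = len(activations[0])
    scored = []
    for j in range(n_neurons):
        column = [row[j] for row in activations]
        scored.append((abs(pearson(column, metric)), j))
    return [j for _, j in sorted(scored, reverse=True)]


# Neuron 0 tracks the label; neuron 1 is constant and carries no signal.
activations = [[0.1, 1.0], [0.9, 1.0], [0.2, 1.0], [0.8, 1.0]]
labels = [0, 1, 0, 1]
print(rank_neurons(activations, labels))
```

Because only a per-example activation and a per-example scalar are needed, the same routine works whether the scalar is an external label or the model's own confidence.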

[42] Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

Dongyang Fan,Diba Hashemi,Sai Praneeth Karimireddy,Martin Jaggi

Main category: cs.CL

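TL;DR: This study explores adding diverse metadata (such as fine-grained indicators of document quality) to LLM pretraining to improve training efficiency, finds that finely grained metadata is more effective, and proposes metadata appending as an auxiliary task and learnable meta-tokens to speed up training, yielding practical guidelines for more efficient LLM pretraining.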

Details Motivation: Prior work used only one metadata signal, URLs, to accelerate LLM pretraining and left the potential of other metadata types unexplored; this paper systematically studies a broader range of metadata and its effect on training efficiency. Method: Evaluate multiple metadata types (such as document quality indicators) in pretraining, compare prepending with appending, introduce metadata appending in which predicting the metadata serves as an auxiliary task, train learnable meta-tokens with a masked loss, and probe latent representations to understand how metadata shapes learning. Result: Fine-grained metadata (such as document quality indicators) accelerates pretraining more effectively; appending metadata as an auxiliary task improves training efficiency; learnable meta-tokens recover part of the speedup and induce quality-aware latent structure; probing shows how metadata shapes what the model learns. Conclusion: Different types of metadata, especially fine-grained ones, can significantly improve LLM pretraining efficiency; well-designed integration (such as auxiliary prediction and learnable tokens) improves both training speed and performance, offering practical strategies. Abstract: Incorporating metadata in Large Language Model (LLM) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that other types of metadata, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.

[43] The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry

Anna Marklová,Ondřej Vinš,Martina Vokáčová,Jiří Milička

Main category: cs.CL

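TL;DR: This study examines how well native speakers can tell Czech AI-generated poetry from human-written poetry and how they judge it aesthetically, finding that people struggle to distinguish the two and rate poems they believe to be AI-written lower, a clear authorship bias.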

Details Motivation: To examine the quality and reception of LLM-generated poetry in a non-English, low-resource language such as Czech, and in particular how readers' beliefs about authorship affect aesthetic judgment. Method: Czech native speakers guessed whether poems were AI- or human-written and rated them aesthetically; recognition accuracy and rating differences were analyzed, with a logistic regression model relating liking to authorship judgments. Result: Participants identified authorship correctly only 45.8% of the time on average, indicating that AI-generated and human-written poems are hard to tell apart; poems believed to be AI-written were rated more negatively even though AI poems in fact received equal or higher ratings; the more a poem was liked, the less likely its authorship was judged correctly; familiarity with poetry or literary background had no effect on recognition. Conclusion: AI can produce convincing poetry in a morphologically complex, low-resource Slavic language such as Czech, and readers' aesthetic evaluations are shaped by their beliefs about authorship rather than by the quality of the text itself. Abstract: Large language models are increasingly capable of producing creative texts, yet most studies on AI-generated poetry focus on English -- a language that dominates training data. In this paper, we examine the perception of AI- and human-written Czech poetry. We ask if Czech native speakers are able to identify it and how they aesthetically judge it. Participants performed at chance level when guessing authorship (45.8% correct on average), indicating that Czech AI-generated poems were largely indistinguishable from human-written ones. Aesthetic evaluations revealed a strong authorship bias: when participants believed a poem was AI-generated, they rated it less favorably, even though AI poems were in fact rated equally or more favorably than human ones on average. The logistic regression model showed that the more participants liked a poem, the less likely they were to assign its authorship accurately. Familiarity with poetry or literary background had no effect on recognition accuracy. Our findings show that AI can convincingly produce poetry even in a morphologically complex, low-resource (with respect to the training data of AI models) Slavic language such as Czech. The results suggest that readers' beliefs about authorship and the aesthetic evaluation of the poem are interconnected.

[44] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

Dong Wang,Yang Li,Ansong Ni,Ching-Feng Yeh,Youssef Emad,Xinjie Lei,Liam Robbins,Karthik Padthe,Hu Xu,Xian Li,Asli Celikyilmaz,Ramya Raghavendra,Lifei Huang,Carole-Jean Wu,Shang-Wen Li

Main category: cs.CL

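TL;DR: This paper proposes Matrix, a decentralized framework for large-scale synthetic data generation that coordinates multi-agent workflows through distributed message queues and significantly raises data generation throughput.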

Details Motivation: Existing multi-agent synthetic data generation frameworks depend on a centralized orchestrator or are hardcoded for specific domains, which limits scalability and flexibility. Method: Matrix uses a decentralized architecture built on Ray in which control and data flow are represented as serialized messages passed through distributed queues; each task is advanced independently by lightweight agents, and compute-intensive operations are handled by distributed services. Result: Across diverse data generation scenarios, Matrix achieves 2 to 15 times higher throughput on identical hardware without compromising output quality. Conclusion: Matrix offers a modular, configurable framework that efficiently supports diverse synthetic data generation tasks and scales well. Abstract: Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present Matrix, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves 2-15x higher data generation throughput under identical hardware resources, without compromising output quality.
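The message-passing idea can be illustrated with a single-process toy in which each message carries its own state and each agent decides where the message goes next. This is not Matrix's API and does not use Ray: the `run` function, the agent signature, and the in-memory deques standing in for distributed queues are all assumptions for illustration.

```python
from collections import deque


def run(agents, first_agent, payload):
    """Route one task through agents via per-agent queues.

    Each agent is a function payload -> (next_agent_name or None, new_payload).
    The routing decision lives in the agent's return value, not in a central plan.
    """
    queues = {name: deque() for name in agents}
    queues[first_agent].append({"history": [], "payload": payload})
    finished = []
    while any(queues.values()):
        for name, queue in queues.items():
            while queue:
                message = queue.popleft()
                next_agent, message["payload"] = agents[name](message["payload"])
                message["history"].append(name)
                if next_agent is None:
                    finished.append(message)
                else:
                    queues[next_agent].append(message)
    return finished


# Two hypothetical agents: a writer that hands off to a critic, which ends the task.
agents = {
    "writer": lambda text: ("critic", text + " draft"),
    "critic": lambda text: (None, text + " reviewed"),
}
print(run(agents, "writer", "topic"))
```

In a real deployment the polling loop would be replaced by independent workers consuming from distributed queues, which is where the scalability benefit described in the abstract would come from.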

[45] ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

Hongjin Su,Shizhe Diao,Ximing Lu,Mingjie Liu,Jiacheng Xu,Xin Dong,Yonggan Fu,Peter Belcak,Hanrong Ye,Hongxu Yin,Yi Dong,Evelina Bakhturina,Tao Yu,Yejin Choi,Jan Kautz,Pavlo Molchanov

Main category: cs.CL

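TL;DR: This paper proposes ToolOrchestra, a method for training small orchestration models (Orchestrator) with reinforcement learning to coordinate a variety of intelligent tools, achieving higher accuracy and efficiency on complex tasks than large models such as GPT-5.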

Details Motivation: LLMs are powerful but costly and still struggle with hard problems such as Humanity's Last Exam (HLE), so more efficient, lower-cost solutions are needed. Method: Use reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards to train an 8B orchestrator that manages multiple tools and other models, enabling efficient tool calling and task decomposition. Result: Orchestrator scores 37.1% on HLE, above GPT-5's 35.1%, while being 2.5x more efficient; on tau2-Bench and FRAMES it surpasses GPT-5 by a wide margin at about 30% of the cost and generalizes well to unseen tools. Conclusion: Composing diverse tools with a lightweight orchestration model achieves the best trade-off between performance and cost, opening a path to practical, scalable tool-augmented reasoning systems. Abstract: Large language models are powerful generalists, yet solving deep and complex problems such as those in Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.

[46] Revisiting Generalization Across Difficulty Levels: It's Not So Easy

Yeganeh Kordi,Nihal V. Nayak,Max Zuo,Ilana Nguyen,Stephen H. Bach

Main category: cs.CL

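TL;DR: Through a large-scale, fine-grained analysis that ranks examples in six datasets by difficulty using thousands of LLMs and Item Response Theory (IRT), this study systematically evaluates how well LLMs generalize across task difficulty. Cross-difficulty generalization turns out to be limited: training on only easy or only hard data does not yield consistent gains across all difficulties, which underscores the importance of covering a range of difficulties in both training and evaluation.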

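Details Motivation: Prior work disagrees on how the difficulty of training data affects performance on test data of different difficulty, and lacks an objective, fine-grained measure of difficulty. This paper uses a more objective, larger-scale approach to systematically study LLM generalization across difficulty and to inform data curation and evaluation. Method: Use Item Response Theory (IRT) to compute a difficulty score for each example from the outputs of thousands of LLMs on six datasets, excluding human judgments of difficulty; then systematically evaluate training and test performance across difficulty groups and analyze cross-difficulty generalization patterns. Result: 1. IRT scores based on LLM outputs give a more objective, finer-grained estimate of example difficulty; 2. generalization across difficulty is limited: training on easier data mostly improves performance on easy examples, and training on harder data does not reliably improve performance on easy ones; 3. no single-difficulty training strategy improves consistently across all test difficulties. Conclusion: Training and evaluation data should span a wide range of difficulties, and relying on a single difficulty level is risky; current LLMs transfer poorly across difficulty, so data selection and evaluation need careful design. Abstract: We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs' generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.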
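To make the IRT-based difficulty ranking concrete, the sketch below fits the simplest IRT variant, the one-parameter (Rasch) model, in which the probability that model m answers item i correctly is sigmoid(ability_m - difficulty_i). The abstract does not state which IRT model or fitting procedure the authors use, so the Rasch model, plain gradient ascent, and the hyperparameters here are assumptions.

```python
from math import exp


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + exp(-z))


def fit_rasch(responses, steps=200, lr=0.1):
    """Fit abilities and difficulties by gradient ascent on the Rasch log-likelihood.

    responses[m][i] is 1 if model m answered item i correctly, else 0.
    """
    n_models, n_items = len(responses), len(responses[0])
    ability = [0.0] * n_models
    difficulty = [0.0] * n_items
    for _ in range(steps):
        grad_a = [0.0] * n_models
        grad_d = [0.0] * n_items
        for m in range(n_models):
            for i in range(n_items):
                err = responses[m][i] - sigmoid(ability[m] - difficulty[i])
                grad_a[m] += err   # d logL / d ability_m
                grad_d[i] -= err   # d logL / d difficulty_i
        ability = [a + lr * g for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g for d, g in zip(difficulty, grad_d)]
        # Shift both by the mean difficulty: predictions are unchanged,
        # but the scale is pinned so that difficulties average to zero.
        shift = sum(difficulty) / n_items
        ability = [a - shift for a in ability]
        difficulty = [d - shift for d in difficulty]
    return ability, difficulty


# Four models, three items; item 0 is solved most often, item 2 least often.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
]
abilities, difficulties = fit_rasch(responses)
print(difficulties)
```

Sorting items by the fitted difficulty gives a ranking that depends only on model responses, which is the property the abstract emphasizes over human difficulty judgments.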

cs.CV [Back]

[47] Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?

David Amebley,Sayanton Dibbo

Main category: cs.CV

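TL;DR: This paper proposes a neuroscience-inspired topological regularization framework (tau) to strengthen the resilience of multi-modal vision-language models (VLMs) against black-box membership inference attacks (MIA); experiments show the method markedly lowers attack success while preserving model utility.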

Details Motivation: Privacy attack research has focused mainly on unimodal systems, while multi-modal models (vision-language models in particular) face new privacy risks in deployment, and it is unclear whether neuro-inspired structure improves their privacy resilience. Method: Propose a neuroscience-inspired topological regularization (tau) framework that yields NEURO variants of VLMs (tau > 0), and evaluate membership inference attacks on three VLMs (BLIP, PaliGemma 2, and ViT-GPT2) across the COCO, CC3M, and NoCaps datasets, comparing attack success and model utility between baseline and NEURO versions. Result: On BLIP with COCO, MIA attack success against the NEURO VLM drops by 24% in mean ROC-AUC, while MPNet and ROUGE-2 metrics indicate caption quality comparable to the original model; results on the other models and datasets confirm the consistency of the findings. Conclusion: Neuroscience-inspired topological regularization can improve the resilience of multi-modal VLMs to membership inference attacks without significantly harming utility, offering a viable path toward more secure multi-modal AI systems. Abstract: In the age of agentic AI, the growing deployment of multi-modal models (MMs) has introduced new attack vectors that can leak sensitive training data in MMs, causing privacy leakage. This paper investigates a black-box privacy attack, i.e., membership inference attack (MIA) on multi-modal vision-language models (VLMs). State-of-the-art research analyzes privacy attacks primarily on unimodal AI-ML systems, while recent studies indicate MMs can also be vulnerable to privacy attacks. While researchers have demonstrated that biologically inspired neural network representations can improve unimodal model resilience against adversarial attacks, it remains unexplored whether neuro-inspired MMs are resilient against privacy attacks. In this work, we introduce a systematic neuroscience-inspired topological regularization (tau) framework to analyze MM VLMs' resilience against image-text-based inference privacy attacks. We examine this phenomenon using three VLMs: BLIP, PaliGemma 2, and ViT-GPT2, across three benchmark datasets: COCO, CC3M, and NoCaps. Our experiments compare the resilience of baseline and neuro VLMs (with topological regularization), where the tau > 0 configuration defines the NEURO variant of VLM. Our results on the BLIP model using the COCO dataset illustrate that MIA attack success in NEURO VLMs drops by 24% mean ROC-AUC, while achieving similar model utility (similarities between generated and reference captions) in terms of MPNet and ROUGE-2 metrics. This shows neuro VLMs are comparatively more resilient against privacy attacks, while not significantly compromising model utility. Our extensive evaluation with PaliGemma 2 and ViT-GPT2 models, on two additional datasets: CC3M and NoCaps, further validates the consistency of the findings. This work contributes to the growing understanding of privacy risks in MMs and provides evidence on neuro VLMs' resilience to privacy threats.
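Attack success in this abstract is reported as ROC-AUC. As a reminder of what that number measures, the sketch below computes ROC-AUC for a membership inference attack as the probability that a randomly chosen training member receives a higher attack score than a randomly chosen non-member. The scores are made-up illustrative values, and the function name is hypothetical; the paper's attack and scoring function are not reproduced here.

```python
def roc_auc(member_scores, nonmember_scores):
    """ROC-AUC via pairwise comparison: P(member score > non-member score), ties count 0.5."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Illustrative attack scores: higher means "more likely a training member".
members = [0.9, 0.8, 0.4]
nonmembers = [0.5, 0.3, 0.2]
print(roc_auc(members, nonmembers))
```

An AUC of 0.5 means the attacker cannot separate members from non-members better than chance, so a drop toward 0.5 indicates improved privacy resilience.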

[48] Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

Inferix Team,Tianyu Feng,Yizeng Han,Jiahao He,Yuanyu He,Xi Lin,Teng Liu,Hanfeng Lu,Jiasheng Tang,Wei Wang,Zhiyuan Wang,Jichao Wu,Mingyang Yang,Yinghao Yu,Zeyu Zhang,Bohan Zhuang

Main category: cs.CV

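TL;DR: Inferix is a next-generation inference engine designed for immersive world synthesis; by optimizing semi-autoregressive decoding it supports efficient, variable-length, high-quality video generation for agentic AI, embodied AI, and gaming.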

Details Motivation: Existing video diffusion models are limited in generating long, physically realistic, interactive high-quality video, and standard LLM inference systems are not built for world simulation, so an efficient inference engine tailored to world models is needed. Method: Propose Inferix, which adopts the semi-autoregressive (block-diffusion) decoding paradigm, applying diffusion within each block while conditioning autoregressively on previous blocks, and introduces LLM-style KV cache management for efficiency and quality; it also integrates LV-Bench for fine-grained evaluation and supports interactive video streaming and profiling. Result: Inferix enables more coherent, stable, and efficient long video generation with real-time interaction and realistic simulation, and benchmarks minute-long video generation through LV-Bench. Conclusion: As an inference engine built for world models, Inferix balances generation quality and efficiency and helps move vision-centric foundation models toward a new paradigm with perception, understanding, and reasoning capabilities. Abstract: World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in blocks, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.

[49] Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection?

Kun Guo,Yun Shen,Xijun Wang,Chaoqun You,Yun Rui,Tony Q. S. Quek

Main category: cs.CV

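TL;DR: Proposes LTED-Ada, an adaptive algorithm based on deep reinforcement learning that optimizes the choice between local tracking and edge detection for video object recognition in edge computing environments.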

Details Motivation: Fast, accurate video object recognition on resource-constrained devices is challenging, particularly the decision of when to run edge detection versus local tracking. Method: Formulate long-term optimization problems for single-device and multi-device scenarios, design the LTED-Ada algorithm with deep reinforcement learning, and add federated learning in the multi-device scenario to improve the policy's generalization. Result: Hardware-in-the-loop experiments show that LTED-Ada outperforms baselines across frame rates and performance requirements, balancing accuracy, delay, and computation. Conclusion: With its adaptive decision mechanism, LTED-Ada markedly improves the efficiency and accuracy of video object recognition in mobile edge computing and suits applications such as traffic cameras. Abstract: Fast and accurate video object recognition, which relies on frame-by-frame video analytics, remains a challenge for resource-constrained devices such as traffic cameras. Recent advances in mobile edge computing have made it possible to offload computation-intensive object detection to edge servers equipped with high-accuracy neural networks, while lightweight and fast object tracking algorithms run locally on devices. This hybrid approach offers a promising solution but introduces a new challenge: deciding when to perform edge detection versus local tracking. To address this, we formulate two long-term optimization problems for both single-device and multi-device scenarios, taking into account the temporal correlation of consecutive frames and the dynamic conditions of mobile edge networks. Based on the formulation, we propose LTED-Ada for the single-device setting, a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection, according to the frame rate as well as recognition accuracy and delay requirements. In the multi-device setting, we further enhance LTED-Ada using federated learning to enable collaborative policy training across devices, thereby improving its generalization to unseen frame rates and performance requirements. Finally, we conduct extensive hardware-in-the-loop experiments using multiple Raspberry Pi 4B devices and a personal computer as the edge server, demonstrating the superiority of LTED-Ada.

[50] DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

Haibo HU,Lianming Huang,Nan Guan,Chun Jason Xue

Main category: cs.CV

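TL;DR: DeeAD is a training-free, action-guided early-exit framework that accelerates inference in Vision-Language Action (VLA) models by evaluating the physical feasibility of intermediate trajectories, significantly reducing latency while preserving planning quality.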

Details Motivation: VLA models unify perception, reasoning, and trajectory generation for autonomous driving, but their deep transformer stacks cause significant inference latency that limits real-time use. Method: Propose DeeAD, which uses lightweight planning priors (such as navigation or low-precision planning) to check whether an intermediate trajectory is physically feasible (deviation < 2 m) and exits inference early if so; add a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. Result: On the Bench2Drive benchmark, DeeAD achieves up to 28% transformer-layer sparsity and 29% latency reduction while preserving planning quality and safety. Conclusion: DeeAD is a plug-and-play acceleration scheme that needs no retraining, improves VLA inference efficiency, and suits autonomous driving systems with strict real-time requirements. Abstract: Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., Navigation or Low-precision Planning) within a tolerable deviation (<2 m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.
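The early-exit criterion, stopping once an intermediate trajectory stays within a tolerable deviation of a planning prior, can be sketched as below. The abstract gives only the 2 m tolerance; measuring deviation as the maximum pointwise Euclidean distance, the 2D waypoint format, and the function names are assumptions, and the multi-hop layer-skipping controller is not modeled.

```python
from math import hypot


def max_deviation(trajectory, prior):
    """Largest pointwise Euclidean distance (in meters) between two waypoint lists."""
    return max(
        hypot(x1 - x2, y1 - y2)
        for (x1, y1), (x2, y2) in zip(trajectory, prior)
    )


def first_exit_layer(layer_trajectories, prior, tolerance_m=2.0):
    """Index of the first layer whose decoded trajectory is within tolerance of the prior.

    Falls back to the last layer if no intermediate trajectory qualifies.
    """
    for layer, trajectory in enumerate(layer_trajectories):
        if max_deviation(trajectory, prior) < tolerance_m:
            return layer
    return len(layer_trajectories) - 1


# Hypothetical prior and trajectories decoded at three successive layers.
prior = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
layer_trajectories = [
    [(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)],   # 3.0 m off: keep going
    [(0.0, 1.0), (1.0, 1.5), (2.0, 1.0)],   # 1.5 m off: within tolerance
    [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)],
]
print(first_exit_layer(layer_trajectories, prior))
```

Checking against a physical prior rather than a confidence score means the exit decision is expressed in meters, which makes the tolerance directly interpretable for a driving stack.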

[51] Foundry: Distilling 3D Foundation Models for the Edge

Guillaume Letellier,Siddharth Srivastava,Frédéric Jurie,Gaurav Sharma

Main category: cs.CV

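TL;DR: This paper proposes Foundation Model Distillation (FMD), a new paradigm for compressing large self-supervised models that preserves their general-purpose representational power while improving deployment efficiency. It also presents Foundry, the first FMD method for 3D point clouds, which learns SuperTokens that reconstruct the teacher's token-level representations so that a compact student still transfers well across downstream tasks.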

Details Motivation: Large foundation models are powerful general-purpose feature extractors, but their size and computational cost make them hard to deploy on edge devices; existing compression techniques often sacrifice the generality that makes foundation models valuable, which limits wider use. Method: Proposes the Foundation Model Distillation (FMD) framework and a concrete method, Foundry: the student learns a compact set of SuperTokens that reconstruct the teacher's token-level representations, preserving the core structure and general expressiveness of its latent space. Result: A single distilled Foundry model performs close to the original foundation model on diverse downstream tasks, including classification, part segmentation, and few-shot learning, while using significantly fewer tokens and FLOPs, which makes deployment on resource-constrained devices more feasible. Conclusion: FMD offers an effective new direction for compressing foundation models, retaining their downstream-agnostic generality while greatly reducing cost; Foundry demonstrates the feasibility and potential of the approach for 3D point clouds. Abstract: Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient 'specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach, Foundry, trains a student to learn a compressed set of SuperTokens that reconstruct the teacher's token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks (classification, part segmentation, and few-shot scenarios), approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.

[52] DinoLizer: Learning from the Best for Generative Inpainting Localization

Minh Thong Doi,Jan Butora,Vincent Itier,Jérémie Boulanger,Patrick Bas

Main category: cs.CV

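TL;DR: This paper proposes DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. The backbone is pretrained on the B-Free dataset to detect synthetic images, and a linear classification head on the ViT patch embeddings predicts manipulated regions. A sliding-window strategy handles large images, and post-processing yields binary manipulation masks. Experiments show that DinoLizer outperforms state-of-the-art methods across datasets and post-processing operations, with a 12% higher average IoU.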

Details Motivation: Existing image manipulation detectors perform poorly on generative inpainting, particularly in distinguishing semantic modifications from non-semantic edits and in staying robust to common post-processing such as compression and added noise; a method that localizes manipulated regions precisely and generalizes well is needed. Method: Adds a linear classification head on the ViT patch embeddings of a DINOv2 model to predict manipulated regions at a 14×14 patch resolution; trains the head to focus on semantically altered regions and ignore non-semantic edits; uses a sliding-window strategy for images larger than the model input and post-processes the output heatmaps into the final binary manipulation masks. Result: DinoLizer surpasses state-of-the-art local manipulation detectors on several inpainting datasets built from different generative models; its average IoU is 12% higher than that of the next best model, with a larger margin after post-processing; it is robust to resizing, noise, and JPEG compression; ablation studies comparing DINOv2 and DINOv3 confirm DinoLizer's superiority. Conclusion: DinoLizer exploits the strong visual representations of DINOv2 to localize manipulated regions accurately and remains robust to realistic post-processing, showing the potential of ViTs for image forensics. Abstract: We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. We add a linear classification head on top of the Vision Transformer's patch embeddings to predict manipulations at a $14\times 14$ patch resolution. The head is trained to focus on semantically altered regions, treating non-semantic edits as part of the original content. Because the ViT accepts only fixed-size inputs, we use a sliding-window strategy to aggregate predictions over larger images; the resulting heatmaps are post-processed to refine the estimated binary manipulation masks. Empirical results show that DinoLizer surpasses state-of-the-art local manipulation detectors on a range of inpainting datasets derived from different generative models. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. On average, DinoLizer achieves a 12% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing. Our experiments with off-the-shelf DINOv2 demonstrate the strong representational power of Vision Transformers for this task. Finally, extensive ablation studies comparing DINOv2 and its successor, DINOv3, in deepfake localization confirm DinoLizer's superiority. The code will be publicly available upon acceptance of the paper.
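
One plausible way to aggregate fixed-size window predictions over a larger image is sketched below. It is a sketch under assumptions, not the authors' code: averaging overlapping windows, the 0.5 threshold, and the `predict_window` callback (a hypothetical stand-in for the model) are illustrative choices that the abstract does not specify.

```python
def window_starts(length, window, stride):
    """Start offsets that cover [0, length), with the last window flush to the border."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts

def aggregate_heatmap(height, width, window, stride, predict_window):
    """Average overlapping per-window heatmaps into one full-size heatmap.

    predict_window(top, left) is a hypothetical model stand-in that returns a
    window x window list of manipulation scores in [0, 1].
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window] so that every pixel is covered")
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top in window_starts(height, window, stride):
        for left in window_starts(width, window, stride):
            heat = predict_window(top, left)
            for i in range(min(window, height - top)):
                for j in range(min(window, width - left)):
                    total[top + i][left + j] += heat[i][j]
                    count[top + i][left + j] += 1
    return [[total[r][c] / count[r][c] for c in range(width)] for r in range(height)]

def binarize(heatmap, threshold=0.5):
    """Turn a score heatmap into a binary manipulation mask."""
    return [[1 if value >= threshold else 0 for value in row] for row in heatmap]

def toy_predict(top, left, window=4):
    """Toy predictor: flags every pixel whose absolute column index is at least 6."""
    return [[1.0 if left + j >= 6 else 0.0 for j in range(window)] for _ in range(window)]

mask = binarize(aggregate_heatmap(10, 10, 4, 3, toy_predict))
```

Placing the last window flush against the border keeps every pixel covered without padding the image, at the cost of some extra overlap near the edges.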

[53] CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design

Daeheon Jeong,Seoyeon Byun,Kihoon Son,Dae Hyun Kim,Juho Kim

Main category: cs.CV

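TL;DR: This paper introduces CANVAS, the first benchmark for evaluating vision-language models (VLMs) on tool-based user interface design. It contains 598 tasks and measures a VLM's ability to carry out UI design through tool invocations in design replication and design modification tasks.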

Details Motivation: Current VLMs can operate design software through tool invocation, but no benchmark evaluates their ability to iterate on UI designs in a real design environment, so a standardized evaluation framework is needed to measure and improve their potential for design collaboration. Method: Proposes the CANVAS benchmark: 598 tool-based tasks sampled from 3.3K mobile UI designs across 30 function-based categories, covering design replication and design modification; models must modify the UI design step by step through context-aware tool invocations, and ground-truth references are provided for evaluation. Result: Experiments indicate that leading VLMs make more strategic tool invocations, which improves design quality; the study also identifies common error patterns, pointing to directions for improvement. Conclusion: CANVAS provides an effective benchmark for evaluating VLMs' UI design ability inside real design tools, reveals both the potential and the shortcomings of current models in tool use, and supports the application of VLMs to human-AI collaborative design. Abstract: User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision language models (VLMs) with tool invocation suggest these models can operate design software to edit a UI design through iteration. Understanding and enhancing this capacity is important, as it highlights VLMs' potential to collaborate with designers within conventional software. However, as no existing benchmark evaluates tool-based design performance, the capacity remains unknown. To address this, we introduce CANVAS, a benchmark for VLMs on tool-based user interface design. Our benchmark contains 598 tool-based design tasks paired with ground-truth references sampled from 3.3K mobile UI designs across 30 function-based categories (e.g., onboarding, messaging). In each task, a VLM updates the design step-by-step through context-based tool invocations (e.g., create a rectangle as a button background), linked to design software. Specifically, CANVAS incorporates two task types: (i) design replication evaluates the ability to reproduce a whole UI screen; (ii) design modification evaluates the ability to modify a specific part of an existing screen. Results suggest that leading models exhibit more strategic tool invocations, improving design quality. Furthermore, we identify common error patterns models exhibit, guiding future work in enhancing tool-based design capabilities.

[54] Text-Guided Semantic Image Encoder

Raghuveer Thirukovalluru,Xiaochuang Han,Bhuwan Dhingra,Emily Dinan,Maha Elbayad

Main category: cs.CV

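TL;DR: Proposes the Text-Guided Semantic Image Encoder (TIE), which conditions image encoding on text to improve both the performance and the inference efficiency of vision-language models on image-to-text tasks.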

Details Motivation: Conventional image encoders are pretrained independently of the text query and cannot adapt to the specific downstream task or text, which leads to semantic mismatch and inefficiency. Method: Designs the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on the input text query. Result: TIE improves results by +1.5 and +1.3 points on average at the 1B and 3B scales, with gains of up to 6 points on tasks such as DocVQA; it achieves better performance with only half as many image tiles, notably improving inference efficiency, and generalizes well to generic queries. Conclusion: Through text-guided image encoding, TIE improves the semantic alignment, inference efficiency, and interpretability of vision-language models, offering a new paradigm for image encoder design. Abstract: Image encoders, a fundamental component of vision-language models (VLMs), are typically pretrained independently before being aligned with a language model. This standard paradigm results in encoders that process images agnostically, without regard to the specific downstream task or text query. To address this limitation, we propose the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on the input text query. VLMs equipped with TIE outperform their conventional counterparts by +1.5 and +1.3 points on average across nine image-to-text benchmarks at the 1B and 3B scales, respectively, with gains reaching up to 6 points on tasks such as DocVQA and InfoVQA. Moreover, TIE-based VLMs attain superior performance while utilizing only half as many image tiles (tokens), resulting in notably improved inference efficiency. TIE also generalizes well with generic queries, indicating that text-conditioned training effectively optimizes the encoder to capture key visual features. Qualitative analysis confirms that TIE consistently attends to query-relevant regions, enhancing both interpretability and query-specific grounding.

[55] One Patch is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues

Sindhuja Penchala,Gavin Money,Gabriel Marques,Samuel Wood,Jessica Kirschman,Travis Atkison,Shahram Rahimi,Noorbakhsh Amiri Golilarz

Main category: cs.CV

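TL;DR: This paper proposes SMARC, a model that reconstructs and classifies surface materials from a single contiguous patch covering only 10% of the image. Combining a partial-convolution U-Net with a classification head, it achieves state-of-the-art RGB reconstruction and material classification under extremely sparse visual input.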

Details Motivation: Existing methods rely on dense or full-scene observations and are hard to apply when the field of view is limited or only partial observations are available, so a method that understands material surfaces from extremely sparse visual input is needed. Method: Proposes SMARC, which combines a partial-convolution U-Net with a classification head and uses a single contiguous 10% image patch for spatial inpainting and material classification, unifying surface reconstruction and recognition. Result: On the Touch and Go dataset, SMARC reaches a PSNR of 17.55 dB and a material classification accuracy of 85.10%, outperforming five established models including ViT, MAE, and Swin Transformer. Conclusion: Partial convolution has clear advantages for spatial reasoning under missing data, and SMARC offers an effective solution for surface understanding from minimal visual input. Abstract: Understanding material surfaces from sparse visual cues is critical for applications in robotics, simulation, and material perception. However, most existing methods rely on dense or full-scene observations, limiting their effectiveness in constrained or partial-view environments. To address this challenge, we introduce SMARC, a unified model for Surface MAterial Reconstruction and Classification from minimal visual input. Given only a single 10% contiguous patch of the image, SMARC recognizes and reconstructs the full RGB surface while simultaneously classifying the material category. Our architecture combines a Partial Convolutional U-Net with a classification head, enabling both spatial inpainting and semantic understanding under extreme observation sparsity. We compared SMARC against five models including convolutional autoencoders [17], Vision Transformer (ViT) [13], Masked Autoencoder (MAE) [5], Swin Transformer [9], and DETR [2] using the Touch and Go dataset [16] of real-world surface textures. SMARC achieves state-of-the-art results with a PSNR of 17.55 dB and a material classification accuracy of 85.10%. Our findings highlight the advantages of partial convolution in spatial reasoning under missing data and establish a strong foundation for minimal-vision surface understanding.
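
To put the reported PSNR in context, the standard definition is PSNR = 10·log10(MAX² / MSE). The sketch below implements that definition and inverts it; the flat pixel lists and the assumption that pixel values lie in [0, 1] are illustrative choices, since the abstract does not state the value range used.

```python
import math

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel sequences."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_for_psnr(psnr_db, max_value=1.0):
    """Mean squared error implied by a given PSNR (inverse of the formula above)."""
    return max_value ** 2 / 10 ** (psnr_db / 10.0)

example_db = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])  # MSE of about 0.01
implied_mse = mse_for_psnr(17.55)
```

Under the [0, 1] assumption, 17.55 dB corresponds to a mean squared error of roughly 0.0176, which is plausible given that about 90% of each image has to be inferred rather than observed.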

[56] LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Zuhao Yang,Sudong Wang,Kaichen Zhang,Keming Wu,Sicong Leng,Yifan Zhang,Chengwei Qin,Shijian Lu,Xingxuan Li,Lidong Bing

Main category: cs.CV

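TL;DR: Proposes LongVT, a framework for end-to-end reasoning over long videos through a multimodal chain-of-tool-thought. It uses the temporal grounding ability of large multimodal models (LMMs) for global-to-local video understanding and releases the VideoSIAH data suite to support further research.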

Details Motivation: Existing large multimodal models tend to hallucinate on long videos because the evidence is sparse and dispersed, and they lack an effective mechanism for combining a global overview with detailed local analysis. Method: Designs the LongVT framework, which uses the LMM's temporal grounding ability as a native video cropping tool to interleave multiple rounds of global skimming and local refinement; builds and releases the VideoSIAH data suite for training and evaluation; adopts a three-stage training strategy (cold-start supervised fine-tuning, agentic reinforcement learning, and agentic reinforcement fine-tuning). Result: On four challenging long-video understanding and reasoning benchmarks, LongVT consistently outperforms existing strong baselines. Conclusion: By imitating how humans watch long videos, LongVT achieves more reliable long-video reasoning grounded in visual evidence and offers a new paradigm for reducing hallucination. Abstract: Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks.
Our codes, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .

[57] Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models

Souradeep Dutta,Keshav Bulia,Neena S Nair

Main category: cs.CV

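TL;DR: This paper revisits the KRISP model from Facebook AI Research and presents a lightweight reproduction with fewer parameters. Although it reaches only about 75% of the original model's performance, the reproduction exposes design flaws and practical problems in the original. Ablation studies examine the scalability and effectiveness of knowledge-enhanced VQA architectures under resource constraints, with validation on synthetic VQA data and the DAQUAR dataset. According to the authors, the lightweight model prevents AI hallucinations and suits edge devices such as smartphones and AR-VR hardware, supporting offline visual reasoning.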

Details Motivation: The original KRISP model is computationally expensive and depends on a large backbone, which makes it hard to deploy in resource-constrained settings. This work explores whether a lightweight reproduction is feasible and surfaces design flaws and practical problems that the original paper did not fully discuss. Method: Proposes a lightweight KRISP reproduction that reduces the parameter count and restricts the scope of the external knowledge graph; evaluates different configurations through systematic ablation studies; validates on synthetic VQA data and the DAQUAR dataset. Result: The lightweight model reaches about 75% of the original model's performance; the authors report that it prevents AI hallucinations, can run on edge devices such as phones and AR-VR hardware, and supports offline visual reasoning. Conclusion: Knowledge-enhanced VQA models can retain reasonably good performance with far fewer parameters and are more practical and flexible to deploy, especially in resource-constrained settings. Abstract: Facebook AI Research introduced KRISP [4], which integrates structured external knowledge into pipelines for vision-language reasoning. Despite its effectiveness, the original model has been developed for industrial-scale training, is computationally demanding, and is tightly connected to a large backbone. In this work, we reexamine KRISP from a different angle and offer a lightweight reproduction with significantly fewer parameters. Even though our replicated model achieves only about 75% of the original's performance, the replication process uncovers a number of design flaws, real-world pitfalls, and implicit problems that were not fully covered in the original paper. We offer insights into the scalability and efficacy of knowledge-enhanced VQA architectures under resource constraints through systematic ablation studies, which include a proof-of-concept on synthetic VQA data and evaluation on the DAQUAR dataset. Our model, configured with a low parameter setup and constrained by the external knowledge graph domain, prevents AI hallucinations and generates outputs solely within that domain. Minimal parameters allow us to function on edge devices like smartphones and AR-VR, further improving offline visual reasoning.

[58] Intriguing Properties of Dynamic Sampling Networks

Dario Morle,Reid Zaffino

Main category: cs.CV

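TL;DR: This paper proposes a new operator called "warping" that unifies the various dynamic sampling methods in deep learning and analyzes it theoretically. The analysis reveals an asymmetry between the forward and backward passes and a fundamental difference from traditional convolution operators, and it examines the conditions for stable training of dynamic sampling networks as well as discretization effects.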

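Details Motivation: Dynamic sampling mechanisms perform well in many computer vision models but lack a unified framework for theoretical analysis; a more basic, analyzable general operator is needed to unify them and understand their properties. Method: Proposes the "warping" operator as a general form of dynamic sampling; analyzes it theoretically by statistically modeling the inputs as IID variables and as homogeneous random fields, and combines this with numerical experiments on forward and backward pass behavior, discretization effects, and training stability; also introduces a novel loss-landscape visualization based on gradient updates. Result: Shows that warping can reconstruct deformable convolutions, active convolutional units, and spatial transformer networks; finds a unique asymmetry between the forward and backward passes; indicates that these mechanisms form a class of operators orthogonal to traditional translation-invariant convolutions; gives conditions that ensure stable training; analyzes the statistical effects of discretization; proposes a new loss-landscape visualization technique. Conclusion: Dynamic sampling mechanisms can be modeled uniformly with the warping operator, which has a distinctive mathematical structure and training dynamics that differ from those of traditional convolution; the paper provides a theoretical basis and practical tools for designing and training such models. Abstract: Dynamic sampling mechanisms in deep learning architectures have demonstrated utility across many computer vision models, though the theoretical analysis of these structures has not yet been unified. In this paper we connect the various dynamic sampling methods by developing and analyzing a novel operator which generalizes existing methods, which we term "warping". Warping provides a minimal implementation of dynamic sampling which is amenable to analysis, and can be used to reconstruct existing architectures including deformable convolutions, active convolutional units, and spatial transformer networks. Using our formalism, we provide statistical analysis of the operator by modeling the inputs as both IID variables and homogeneous random fields. Extending this analysis, we discover a unique asymmetry between the forward and backward pass of the model training. We demonstrate that these mechanisms represent an entirely different class of orthogonal operators to the traditional translationally invariant operators defined by convolutions. With a combination of theoretical analysis and empirical investigation, we find the conditions necessary to ensure stable training of dynamic sampling networks. In addition, a statistical analysis of discretization effects is presented. Finally, we introduce a novel loss landscape visualization which utilizes gradient update information directly, to better understand learning behavior.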
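
As a rough intuition for what a dynamic sampling operator does, the sketch below resamples a 1D signal at positions shifted by per-element offsets, using linear interpolation. It is only an assumed, simplified stand-in: the paper's formal definition of warping is not given in the abstract, and in a real network the offsets would be predicted from the input rather than fixed by hand.

```python
import math

def warp_1d(signal, offsets):
    """Sample `signal` at positions i + offsets[i] with linear interpolation.

    Positions are clamped to the valid index range, one of several possible
    border conventions (an assumption of this sketch).
    """
    n = len(signal)
    out = []
    for i, delta in enumerate(offsets):
        pos = min(max(i + delta, 0.0), n - 1.0)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out

signal = [0.0, 10.0, 20.0, 30.0]
identity = warp_1d(signal, [0.0, 0.0, 0.0, 0.0])  # zero offsets leave the signal unchanged
shifted = warp_1d(signal, [0.5, 0.5, 0.5, 0.5])   # half-step offsets sample between neighbours
```

Because each output depends on the offsets only through the interpolation weights, the gradient with respect to an offset is a local difference of neighbouring samples. That gives some intuition for why the forward and backward passes of such operators can behave differently, although the paper's analysis is far more general than this toy case.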

[59] $Δ$-NeRF: Incremental Refinement of Neural Radiance Fields through Residual Control and Knowledge Transfer

Kriti Ghosh,Devjyoti Chakraborty,Lakshmish Ramaswamy,Suchendra M. Bhandarkar,In Kee Kim,Nancy O'Hare,Deepak Mishra

Main category: cs.CV

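TL;DR: This paper proposes $Δ$-NeRF, a modular residual framework that refines a NeRF incrementally without access to past data, suited to continuously observed settings such as satellite remote sensing.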

Details Motivation: Existing NeRF methods must be retrained when new views are added, which makes them ill-suited to settings where data arrives as a stream (such as satellite terrain analysis), and they are prone to catastrophic forgetting. Method: $Δ$-NeRF keeps a frozen base NeRF and adds a residual controller that injects per-layer corrections; an uncertainty-aware gating mechanism adaptively combines base and refined predictions; a view selection strategy reduces the amount of training data; knowledge distillation compresses the model. Result: Experiments on satellite imagery show that $Δ$-NeRF matches joint training while reducing training time by 30-42%; it improves PSNR by up to 43.5% over naive fine-tuning, surpasses joint training on some metrics, and can be compressed to 20% of the original size. Conclusion: $Δ$-NeRF effectively addresses catastrophic forgetting in incremental NeRF learning and enables efficient, compact continual refinement, with potential for long-term deployment in practical settings such as remote sensing. Abstract: Neural Radiance Fields (NeRFs) have demonstrated remarkable capabilities in 3D reconstruction and novel view synthesis. However, most existing NeRF frameworks require complete retraining when new views are introduced incrementally, limiting their applicability in domains where data arrives sequentially. This limitation is particularly problematic in satellite-based terrain analysis, where regions are repeatedly observed over time. Incremental refinement of NeRFs remains underexplored, and naive approaches suffer from catastrophic forgetting when past data is unavailable. We propose $Δ$-NeRF, a unique modular residual framework for incremental NeRF refinement. $Δ$-NeRF introduces several novel techniques including: (1) a residual controller that injects per-layer corrections into a frozen base NeRF, enabling refinement without access to past data; (2) an uncertainty-aware gating mechanism that prevents overcorrection by adaptively combining base and refined predictions; and (3) a view selection strategy that reduces training data by up to 47% while maintaining performance. Additionally, we employ knowledge distillation to compress the enhanced model into a compact student network (20% of original size). Experiments on satellite imagery demonstrate that $Δ$-NeRF achieves performance comparable to joint training while reducing training time by 30-42%. $Δ$-NeRF consistently outperforms existing baselines, achieving an improvement of up to 43.5% in PSNR over naive fine-tuning and surpassing joint training on some metrics.

[60] Layer-Aware Video Composition via Split-then-Merge

Ozgur Kara,Yujia Chen,Ming-Hsuan Yang,James M. Rehg,Wen-Sheng Chu,Du Tran

Main category: cs.CV

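TL;DR: Proposes the Split-then-Merge (StM) framework, which decomposes and recomposes unlabeled videos in a self-supervised way to improve control in generative video composition and make better use of data.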

Details Motivation: To remove the dependence of generative video composition on annotated data or handcrafted rules and to address its data scarcity. Method: Splits a large corpus of unlabeled videos into dynamic foreground and background layers and trains by self-composing them; introduces a transformation-aware training pipeline, multi-layer fusion and augmentation, and an identity-preservation loss. Result: Outperforms current state-of-the-art methods on quantitative benchmarks and in human and VLLM-based qualitative evaluations. Conclusion: StM effectively learns how dynamic subjects interact with scenes, enabling more realistic video generation. Abstract: We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods in both quantitative benchmarks and in human and VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io

[61] SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Md Tanvirul Alam,Saksham Aggarwal,Justin Yang Chae,Nidhi Rastogi

Main category: cs.CV

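TL;DR: Sphinx is a synthetic environment for visual perception and reasoning with 25 task types. Evaluation shows that even the state-of-the-art GPT-5 reaches only 51.1% accuracy, well below human performance; reinforcement learning with verifiable rewards (RLVR) substantially improves model performance.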

Details Motivation: To build a controllable environment with verifiable ground-truth solutions for systematically evaluating and improving the reasoning of vision-language models on core cognitive tasks. Method: Procedurally generates puzzles from a variety of visual elements (motifs, charts, geometric primitives, and so on) to build a benchmark of 25 task types, and applies reinforcement learning with verifiable rewards (RLVR) to improve model performance. Result: Current state-of-the-art large vision-language models such as GPT-5 reach only 51.1% accuracy on Sphinx, well below human level, while RLVR improves performance on these tasks and on external visual reasoning benchmarks. Conclusion: Sphinx provides a scalable, measurable testbed for visual reasoning, exposes the weaknesses of existing LVLMs in cognitive reasoning, and suggests that RLVR is an effective way to improve multimodal reasoning. Abstract: We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.

[62] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

Samuele Dell'Erba,Andrew D. Bagdanov

Main category: cs.CV

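TL;DR: This paper proposes OVI, a training-free and data-free optimization method that replaces the expensive diffusion prior network in text-to-image generation. It introduces new constraints that improve image quality and exposes flaws in current evaluation benchmarks.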

Details Motivation: Existing text-to-image diffusion models rely on prior networks that are computationally expensive and need large amounts of training data; this paper questions whether they are necessary and explores a more efficient, lightweight alternative. Method: Proposes Optimization-based Visual Inversion (OVI), which initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize its cosine similarity with the text prompt embedding; introduces two new constraints, a Mahalanobis-based loss and a Nearest-Neighbor loss, to regularize the optimization. Result: Experiments on Kandinsky 2.2 show that OVI can serve as an alternative to a traditional prior; the analysis finds a flaw in current benchmarks such as T2I-CompBench++, where using the text embedding alone as the prior already scores highly; the proposed constraints, especially the Nearest-Neighbor one, improve visual fidelity, with quantitative scores comparable to or higher than the state-of-the-art data-efficient prior. Conclusion: OVI offers a promising training-free alternative to priors in text-to-image generation, and the findings call for a rethinking of existing evaluation benchmarks. Abstract: Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with the input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
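
The core optimization loop described above, starting from a random vector and iteratively increasing its cosine similarity with a target embedding, can be sketched in a few lines. This is a toy illustration under assumptions: it works on raw 8-dimensional vectors with plain gradient ascent, whereas the actual method operates on pseudo-tokens passed through an encoder and adds the Mahalanobis and Nearest-Neighbor regularizers, none of which is modeled here. The embedding values, step count, and learning rate are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_grad(v, t):
    """Gradient of cosine(v, t) with respect to v."""
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_t = math.sqrt(sum(b * b for b in t))
    cos = cosine(v, t)
    return [tj / (norm_v * norm_t) - cos * vj / norm_v ** 2 for vj, tj in zip(v, t)]

def invert(text_embedding, steps=500, lr=0.5, seed=0):
    """Gradient ascent on cosine similarity from a random start.

    Returns (latent, initial_similarity, final_similarity).
    """
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in text_embedding]
    initial = cosine(latent, text_embedding)
    for _ in range(steps):
        grad = cosine_grad(latent, text_embedding)
        latent = [vj + lr * gj for vj, gj in zip(latent, grad)]
    return latent, initial, cosine(latent, text_embedding)

# Hypothetical 8-dimensional stand-in for a text prompt embedding.
text_embedding = [0.2, -0.5, 0.1, 0.9, -0.3, 0.4, 0.0, -0.7]
latent, start_sim, final_sim = invert(text_embedding)
```

In this unconstrained toy setting the optimum simply points along the text embedding itself, which may help explain why the abstract treats "text embedding as prior" as a baseline and adds regularizers that pull the solution toward the distribution of realistic images.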

[63] RefTr: Recurrent Refinement of Confluent Trajectories for 3D Vascular Tree Centerline Graphs

Roman Naeem,David Hagerman,Jennifer Alvén,Fredrik Kahl

Main category: cs.CV

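TL;DR: RefTr is a Transformer-based 3D image-to-graph model for generating vascular tree centerlines; by recurrently refining confluent trajectories it achieves high recall and an accurate tree topology.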

Details Motivation: Accurately detecting the centerlines of tubular structures with correct tree topology is essential for clinical diagnosis and surgical navigation; high recall is especially important because missing small branches can lead to fatal mistakes. Method: Proposes RefTr, which uses a Producer-Refiner architecture: the Producer generates initial confluent trajectories and the Refiner recurrently refines them with a Transformer decoder; an efficient non-maximum suppression algorithm merges duplicate branches. Result: Across multiple public centerline datasets, RefTr achieves higher recall than existing methods with comparable precision, along with faster inference and a 2.4x reduction in decoder parameters. Conclusion: RefTr improves performance while maintaining a valid tree topology, showing potential as a new state-of-the-art framework for vascular tree analysis in 3D medical imaging. Abstract: Tubular trees, such as blood vessels and lung airways, are essential for material transport within the human body. Accurately detecting their centerlines with correct tree topology is critical for clinical tasks such as diagnosis, treatment planning, and surgical navigation. In these applications, maintaining high recall is crucial, as missing small branches can result in fatal mistakes caused by incomplete assessments or undetected abnormalities. We present RefTr, a 3D image-to-graph model for centerline generation of vascular trees via recurrent refinement of confluent trajectories. RefTr uses a Producer-Refiner architecture based on a Transformer decoder, where the Producer proposes a set of initial confluent trajectories that are recurrently refined by the Refiner to produce final trajectories, which form the centerline graph. The confluent trajectory representation enables refinement of complete trajectories while explicitly enforcing a valid tree topology. The recurrent refinement scheme improves precision and reuses the same Refiner block across multiple steps, yielding a 2.4x reduction in decoder parameters compared to previous SOTA. We also introduce an efficient non-maximum suppression algorithm for spatial tree graphs to merge duplicate branches and boost precision. Across multiple public centerline datasets, RefTr achieves superior recall and comparable precision to previous SOTA, while offering faster inference and substantially fewer parameters, demonstrating its potential as a new state-of-the-art framework for vascular tree analysis in 3D medical imaging.

[64] MODEST: Multi-Optics Depth-of-Field Stereo Dataset

Nisarg K. Trivedi,Vinayak A. Belludi,Li-Yun Wang,Pardis Taghavi,Dante Lok

Main category: cs.CV

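TL;DR: This paper presents the first high-resolution stereo DSLR dataset, with 18,000 images of real scenes that systematically vary focal length and aperture, to improve the generalization of depth estimation, shallow depth-of-field rendering, and related tasks under real optical conditions.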

Details Motivation: Depth estimation research is constrained by the lack of large-scale, high-fidelity real stereo DSLR datasets, so models perform poorly under real optical conditions and it is hard to evaluate how well models trained on synthetic data transfer to the real world. Method: Captures 9 complex scenes at 10 focal lengths and 5 apertures, for 50 optical configurations and 2000 images per scene at a resolution of 5472×3648 px, and provides calibration image sets, calibration files, and evaluation code. Result: The dataset includes challenging elements such as multi-scale optical illusions, reflective surfaces, transparent glass, fine-grained details, and lighting variations, and it exposes the limitations of current monocular depth, stereo depth, and depth-of-field methods under real optical conditions. Conclusion: The dataset helps bridge the realism gap between synthetic data and real camera optics and provides a reliable real-world evaluation benchmark for depth estimation and related tasks. Abstract: Reliable depth estimation under real optical conditions remains a core challenge for camera vision in systems such as autonomous robotics and augmented reality. Despite recent progress in depth estimation and depth-of-field rendering, research remains constrained by the lack of large-scale, high-fidelity, real stereo DSLR datasets, limiting real-world generalization and evaluation of models trained on synthetic data as shown extensively in literature. We present the first high-resolution (5472$\times$3648px) stereo DSLR dataset with 18000 images, systematically varying focal length and aperture across complex real scenes and capturing the optical realism and complexity of professional camera systems. For 9 scenes with varying scene complexity, lighting and background, images are captured with two identical camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), spanning 50 optical configurations in 2000 images per scene. This full-range optics coverage enables controlled analysis of geometric and optical effects for monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D scene reconstruction and novel view synthesis. Each focal configuration has a dedicated calibration image set, supporting evaluation of classical and learning based methods for intrinsic and extrinsic calibration. The dataset features challenging visual elements such as multi-scale optical illusions, reflective surfaces, mirrors, transparent glass walls, fine-grained details, and natural / artificial ambient light variations.
This work attempts to bridge the realism gap between synthetic training data and real camera optics, and demonstrates challenges with the current state-of-the-art monocular, stereo depth and depth-of-field methods. We release the dataset, calibration files, and evaluation code to support reproducible research on real-world optical generalization.
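
The dataset figures quoted in the abstract are mutually consistent, as the short check below shows. The last two quantities are derived under assumptions that the abstract does not state: that images are spread evenly over the 50 configurations, and that every image belongs to a left/right pair from the two camera assemblies.

```python
scenes = 9
focal_lengths = 10
apertures = 5
images_per_scene = 2000
cameras = 2  # "two identical camera assemblies"

# Quantities stated directly in the abstract.
optical_configurations = focal_lengths * apertures  # 50
total_images = scenes * images_per_scene            # 18000

# Derived quantities (assumptions: even split across configurations, all images paired).
images_per_configuration = images_per_scene // optical_configurations  # 40
stereo_pairs_per_configuration = images_per_configuration // cameras   # 20
```

If those two assumptions hold, each scene would contribute 20 stereo pairs per focal-length and aperture setting.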

[65] Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

Sree Bhattacharyya,Yaman Kumar Singla,Sudhir Yarram,Somesh Kumar Singh,Harini S,James Z. Wang

Main category: cs.CV

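TL;DR: This paper presents the first large-scale unsupervised dataset for visual content memorability, containing over 82,000 videos with recall descriptions. It draws on tip-of-the-tongue (ToT) retrieval queries from platforms such as Reddit to capture fine-grained memorability signals in open-ended recall. Vision-language models trained on the dataset outperform existing models at generating memorability descriptions and enable multimodal ToT retrieval, advancing research on visual memorability.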

Details Motivation: Existing visual memorability datasets depend on expensive human annotation, are limited in scale, and provide only aggregate scores, so they miss the fine-grained memory signals in open-ended recall and limit how well models can capture and apply human memory mechanisms. Method: Collects tip-of-the-tongue (ToT) retrieval queries from platforms such as Reddit as unsupervised signals to build a large-scale dataset of over 82,000 videos with corresponding recall descriptions; fine-tunes vision-language models to generate recall descriptions and trains a multimodal ToT retrieval model with a contrastive learning strategy. Result: Models trained on the dataset outperform state-of-the-art models such as GPT-4o on open-ended recall generation and achieve multimodal ToT retrieval for the first time; experiments show the dataset supports both memorability-related tasks. Conclusion: The work provides the first large-scale unsupervised dataset of visual memorability signals together with accompanying models, opening a new direction for visual memorability research and improving both recall generation and ToT retrieval. Abstract: Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that our unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. We also employ a contrastive training strategy to create the first model capable of performing multimodal ToT retrieval. Our dataset and models present a novel direction, facilitating progress in visual content memorability research.

[66] Estimating Fog Parameters from a Sequence of Stereo Images

Yining Ding,João F. C. Mota,Andrew M. Wallace,Sen Wang

Main category: cs.CV

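TL;DR: Proposes a method that dynamically estimates fog parameters from a sequence of stereo foggy images, using joint optimization to avoid the error accumulation of earlier methods, and builds SDIRF, the first real foggy driving dataset of its kind, for validation.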

Details Motivation: Existing fog parameter estimation methods mostly estimate parameters sequentially, which makes them prone to error accumulation, and they struggle with real fog that is globally inhomogeneous; the lack of high-quality annotated data also holds back research on visual perception in fog. Method: Proposes a joint optimization framework that estimates all fog parameters simultaneously; assumes fog is only locally homogeneous in order to handle globally inhomogeneous fog; builds on the atmospheric scattering model using stereo image sequences and photometric calibration parameters; provides an add-on module that can be integrated into SLAM or odometry systems. Result: Achieves the most accurate estimates on synthetic data and adapts better to real fog on SDIRF; updates parameters dynamically and outperforms prior methods; releases the SDIRF dataset with 34k frames, photometric calibration, and counterpart clear-weather data. Conclusion: The method estimates the parameters of real fog more accurately and robustly, and the SDIRF dataset is an important resource for future research on vision in fog. Abstract: We propose a method which, given a sequence of stereo foggy images, estimates the parameters of a fog model and updates them dynamically. In contrast with previous approaches, which estimate the parameters sequentially and thus are prone to error propagation, our algorithm estimates all the parameters simultaneously by solving a novel optimisation problem. By assuming that fog is only locally homogeneous, our method effectively handles real-world fog, which is often globally inhomogeneous. The proposed algorithm can be easily used as an add-on module in existing visual Simultaneous Localisation and Mapping (SLAM) or odometry systems in the presence of fog. In order to assess our method, we also created a new dataset, the Stereo Driving In Real Fog (SDIRF), consisting of high-quality, consecutive stereo frames of real, foggy road scenes under a variety of visibility conditions, totalling over 40 minutes and 34k frames. As a first-of-its-kind, SDIRF contains the camera's photometric parameters calibrated in a lab environment, which is a prerequisite for correctly applying the atmospheric scattering model to foggy images. The dataset also includes the counterpart clear data of the same routes recorded in overcast weather, which is useful for companion work in image defogging and depth reconstruction. We conducted extensive experiments using both synthetic foggy data and real foggy sequences from SDIRF to demonstrate the superiority of the proposed algorithm over prior methods. Our method not only produces the most accurate estimates on synthetic data, but also adapts better to real fog.
We make our code and SDIRF publicly available to the community at https://github.com/SenseRoboticsLab/estimating-fog-parameters with the aim of advancing the research on visual perception in fog.
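
The abstract relies on the atmospheric scattering model without writing it out. As a rough illustration only, the sketch below uses the commonly cited single-pixel form I = J·t + A·(1 − t) with t = exp(−β·d); the function names and the toy inversion for β are assumptions made here and do not reproduce the paper's joint optimisation.

```python
import math


def transmission(beta: float, depth_m: float) -> float:
    """Fraction of scene radiance that survives depth_m metres of fog: t = exp(-beta * d)."""
    return math.exp(-beta * depth_m)


def foggy_intensity(clear: float, airlight: float, beta: float, depth_m: float) -> float:
    """Observed intensity under the scattering model: I = J * t + A * (1 - t)."""
    t = transmission(beta, depth_m)
    return clear * t + airlight * (1.0 - t)


def estimate_beta(observed: float, clear: float, airlight: float, depth_m: float) -> float:
    """Invert the model for beta at one pixel when J, A and depth are known (toy case)."""
    t = (observed - airlight) / (clear - airlight)
    return -math.log(t) / depth_m
```

In this toy setting the clear intensity and airlight are known; the abstract's point is that, in practice, all fog parameters have to be estimated together from stereo sequences.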

[67] V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

Jiancheng Pan,Runze Wang,Tianwen Qian,Mohammad Mahdi,Yanwei Fu,Xiangyang Xue,Xiaomeng Huang,Luc Van Gool,Danda Pani Paudel,Yuqian Fu

Main category: cs.CV

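TL;DR: This paper proposes V^2-SAM, a unified cross-view object correspondence framework that extends SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators, achieving state-of-the-art performance on several tasks.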

Details Motivation: Because of large viewpoint and appearance changes, existing segmentation models such as SAM2 are hard to apply directly to cross-view object correspondence, so a framework that can cope with these variations is needed. Method: The V^2-SAM framework contains two prompt generators: the Cross-View Anchor Prompt Generator (V^2-Anchor), built on DINOv3 features, provides geometry-aware coordinate prompts, and the Cross-View Visual Prompt Generator (V^2-Visual) aligns ego and exo representations at the feature and structural levels through a newly designed visual prompt matcher; a multi-expert design and a Post-hoc Cyclic Consistency Selector (PCCS) adaptively pick the most reliable prediction. Result: Extensive experiments on Ego-Exo4D, DAVIS-2017, and HANDAL-X validate V^2-SAM and show state-of-the-art performance on these benchmarks. Conclusion: V^2-SAM extends SAM2 to cross-view scenarios and, for the first time, enables coordinate-based prompting for cross-view correspondence; combined with visual prompts and cyclic-consistency selection, it markedly improves the accuracy and robustness of object correspondence across viewpoints. Abstract: Cross-view object correspondence, exemplified by the representative task of ego-exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V^2-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V^2-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V^2-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego-exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V^2-SAM, achieving new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).
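
The abstract says PCCS selects the most reliable expert based on cyclic consistency but gives no formula. One plausible reading, sketched below under that assumption, is to map each expert's target-view mask back to the source view and keep the expert whose round trip best overlaps the original query mask. The mask representation, the `backward` callable, and the use of IoU are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Mask = FrozenSet[Tuple[int, int]]  # a mask as a set of (row, col) pixels


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_by_cycle_consistency(
    query_mask: Mask,
    forward_preds: Dict[str, Mask],
    backward: Callable[[Mask], Mask],
) -> Tuple[str, Dict[str, float]]:
    """Score each expert by how well its prediction maps back onto the query mask."""
    scores = {
        name: iou(query_mask, backward(pred)) for name, pred in forward_preds.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Here `backward` stands in for whatever model maps a target-view mask back to the source view; in the paper that role would be played by the correspondence model itself.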

[68] Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation

Taehoon Kim,Henry Gouk,Timothy Hospedales

Main category: cs.CV

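TL;DR: Proposes Null-Text Test-Time Alignment (Null-TTA), which aligns diffusion models at test time by optimising the unconditional text embedding, avoiding reward hacking while preserving semantic consistency.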

Details Motivation: Existing test-time alignment methods tend to either under-optimise or over-optimise (reward hack), and lack an effective mechanism for aligning on the semantic manifold. Method: Optimise the unconditional text embedding in classifier-free guidance, rather than latent or noise variables, and exploit the structured semantics of the text embedding space to keep the alignment semantically coherent. Result: Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation and effectively preventing reward hacking. Conclusion: Semantic-space optimisation is an effective and principled new paradigm for test-time alignment. Abstract: Test-time alignment (TTA) aims to adapt models to specific rewards during inference. However, existing methods tend to either under-optimise or over-optimise (reward hack) the target reward function. We propose Null-Text Test-Time Alignment (Null-TTA), which aligns diffusion models by optimising the unconditional embedding in classifier-free guidance, rather than manipulating latent or noise variables. Due to the structured semantic nature of the text embedding space, this ensures alignment occurs on a semantically coherent manifold and prevents reward hacking (exploiting non-semantic noise patterns to improve the reward). Since the unconditional embedding in classifier-free guidance serves as the anchor for the model's generative distribution, Null-TTA directly steers the model's generative distribution towards the target reward rather than just adjusting the samples, even without updating model parameters. Thanks to these desirable properties, we show that Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation. This establishes semantic-space optimisation as an effective and principled novel paradigm for TTA.
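
To see why the unconditional embedding acts as an anchor, it may help to recall the standard classifier-free guidance combination, in which the final noise estimate extrapolates from the unconditional prediction towards the conditional one. The sketch below shows only that combination on plain lists; the function name is illustrative, and the reward-driven optimisation of the null embedding that Null-TTA performs is not reproduced here.

```python
from typing import List


def cfg_noise(eps_uncond: List[float], eps_cond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: eps = eps_uncond + scale * (eps_cond - eps_uncond).

    eps_uncond is the prediction obtained with the null-text (unconditional) embedding,
    so changing that embedding shifts the point every guided sample is extrapolated from.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

With `scale = 0` the output equals the unconditional prediction and with `scale = 1` it equals the conditional one; larger scales push further away from the anchor.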

[69] GaINeR: Geometry-Aware Implicit Network Representation

Weronika Jakubowska,Mikołaj Zieliński,Rafał Tobiasz,Krzysztof Byrski,Maciej Zięba,Dominik Belter,Przemysław Spurek

Main category: cs.CV

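TL;DR: Proposes GaINeR, a geometry-aware image representation framework that combines learnable Gaussian distributions with a neural implicit representation and supports continuous representation, interpretable structure, and local editing.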

Details Motivation: Traditional implicit neural representations (INRs) lack explicit geometric structure, which makes local editing and physical interaction difficult and limits their use in dynamic or interactive settings. Method: GaINeR combines trainable Gaussian distributions with a neural network-based INR; for a given image coordinate it retrieves the K nearest Gaussians, aggregates distance-weighted embeddings, and predicts the RGB value with a neural network. Result: The design provides continuous image representation and explicit geometric structure, and supports flexible local editing and physically aware manipulation. Conclusion: GaINeR keeps high-fidelity reconstruction while improving structural interpretability and interactivity, offering a new direction for dynamic and editable image modeling. Abstract: Implicit Neural Representations (INRs) have become an essential tool for modeling continuous 2D images, enabling high-fidelity reconstruction, super-resolution, and compression. Popular architectures such as SIREN, WIRE, and FINER demonstrate the potential of INR for capturing fine-grained image details. However, traditional INRs often lack explicit geometric structure and have limited capabilities for local editing or integration with physical simulation, restricting their applicability in dynamic or interactive settings. To address these limitations, we propose GaINeR: Geometry-Aware Implicit Network Representation, a novel framework for 2D images that combines trainable Gaussian distributions with a neural network-based INR. For a given image coordinate, the model retrieves the K nearest Gaussians, aggregates distance-weighted embeddings, and predicts the RGB value via a neural network. This design enables continuous image representation, interpretable geometric structure, and flexible local editing, providing a foundation for physically aware and interactive image manipulation. The official implementation of our method is publicly available at https://github.com/WJakubowska/GaINeR.
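
The retrieval-and-aggregation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the official implementation: the Gaussian-kernel weighting, the `sigma` parameter, and the data layout are assumptions, and the neural network that maps the aggregated embedding to RGB is left out.

```python
import math
from typing import List, Sequence, Tuple

Gaussian = Tuple[Tuple[float, float], List[float]]  # (2D mean, embedding vector)


def aggregate_embedding(
    xy: Tuple[float, float],
    gaussians: Sequence[Gaussian],
    k: int,
    sigma: float = 1.0,
) -> List[float]:
    """Distance-weighted average of the embeddings of the k Gaussians nearest to xy."""
    nearest = sorted(
        ((math.dist(xy, mean), emb) for mean, emb in gaussians),
        key=lambda pair: pair[0],
    )[:k]
    # Assumed weighting: an isotropic Gaussian kernel on the distance to each mean.
    weights = [math.exp(-(d * d) / (2.0 * sigma * sigma)) for d, _ in nearest]
    total = sum(weights)
    dim = len(nearest[0][1])
    return [
        sum(w * emb[i] for w, (_, emb) in zip(weights, nearest)) / total
        for i in range(dim)
    ]
```

Because each pixel depends only on nearby Gaussians, moving or deleting a Gaussian changes the image locally, which is the property the abstract associates with local editing.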

[70] A deep learning model to reduce agent dose for contrast-enhanced MRI of the cerebellopontine angle cistern

Yunjie Chen,Rianne A. Weber,Olaf M. Neve,Stephan R. Romeijn,Erik F. Hensen,Jelmer M. Wolterink,Qian Tao,Marius Staring,Berit M. Verbist

Main category: cs.CV

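TL;DR: This study develops a deep learning model that restores standard-dose image quality from very low-dose (10%-30%) contrast-enhanced T1-weighted MRI, markedly improving image quality and tumor segmentation performance and supporting substantially reduced-dose scanning of lesions in the cerebellopontine angle cistern.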

Details Motivation: Reduce the contrast agent dose used in MRI examinations to lower potential side effects while maintaining adequate image quality and diagnostic accuracy. Method: Using multi-center retrospective data, deep learning models restore standard-dose images from simulated low-dose T1ce MRI, and their image quality and segmentation performance are evaluated; a head and neck radiologist provides subjective ratings. Result: As the input dose increased, the structural similarity index of the DL-restored images rose from 0.639 to 0.993 and the PSNR from 21.6 dB to 41.4 dB; at 10% input dose, the Dice coefficient for tumor segmentation improved from 0.673 to 0.734 and the surface distances improved; restored images from both 10% and 30% doses were rated excellent, with the 30% images considered more informative. Conclusion: The deep learning model improves the image quality of low-dose T1ce MRI, making lesion detection and diagnostic characterization possible with only 10%-30% of the standard dose, and shows potential for clinical use. Abstract: Objectives: To evaluate a deep learning (DL) model for reducing the agent dose of contrast-enhanced T1-weighted MRI (T1ce) of the cerebellopontine angle (CPA) cistern. Materials and methods: In this multi-center retrospective study, T1 and T1ce of vestibular schwannoma (VS) patients were used to simulate low-dose T1ce with varying reductions of contrast agent dose. DL models were trained to restore standard-dose T1ce from the low-dose simulation. The image quality and segmentation performance of the DL-restored T1ce were evaluated. A head and neck radiologist was asked to rate DL-restored images in multiple aspects, including image quality and diagnostic characterization. Results: 203 MRI studies from 72 VS patients (mean age, 58.51 ± 14.73, 39 men) were evaluated. As the input dose increased, the structural similarity index measure of the restored T1ce increased from 0.639 ± 0.113 to 0.993 ± 0.009, and the peak signal-to-noise ratio increased from 21.6 ± 3.73 dB to 41.4 ± 4.84 dB. At 10% input dose, using DL-restored T1ce for segmentation improved the Dice from 0.673 to 0.734, the 95% Hausdorff distance from 2.38 mm to 2.07 mm, and the average surface distance from 1.00 mm to 0.59 mm. Both DL-restored T1ce from 10% and 30% input doses showed excellent images, with the latter being considered more informative. Conclusion: The DL model improved the image quality of low-dose MRI of the CPA cistern, which makes lesion detection and diagnostic characterization possible with 10% - 30% of the standard dose.
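
For readers less familiar with the reported metrics, the sketch below computes two of them, PSNR and the Dice coefficient, on flattened toy data. It uses the textbook definitions; the `max_value` intensity range and the flattening of images into lists are assumptions for illustration, and the study's own evaluation code may differ in detail.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient of two binary masks: 2 * |P and T| / (|P| + |T|)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total
```

Higher is better for both: PSNR grows as the restored image approaches the reference, and Dice reaches 1.0 when the predicted and reference tumor masks coincide.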

[71] Smooth regularization for efficient video recognition

Gil Goldman,Raja Giryes,Mahadev Satyanarayanan

Main category: cs.CV

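TL;DR: Proposes a smooth regularization method based on a Gaussian Random Walk that improves lightweight video recognition models by encouraging continuity of representations across video frames, clearly outperforming existing methods on several models.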

Details Motivation: Lightweight video recognition models struggle to capture complex temporal dynamics and need a stronger temporal inductive bias to improve performance. Method: A smooth regularization technique models the changes in intermediate-layer embeddings of consecutive frames as a Gaussian Random Walk (GRW), penalizing abrupt representational shifts and favoring low-acceleration solutions that better match the natural temporal coherence of video. Result: On Kinetics-600, lightweight models gain 3.8% to 6.4% in accuracy; the MoViNets family improves on the current state of the art by 3.8% to 6.1% under the same FLOP constraints, and MobileNetV3 and MoViNets-Stream gain 4.9% to 6.4% over prior models with comparable memory footprints. Conclusion: By introducing a temporal smoothness prior, the regularization improves how lightweight models capture temporal dynamics in video, yields clear performance gains, and applies to multiple architectures. Abstract: We propose a smooth regularization technique that instills a strong temporal inductive bias in video recognition models, particularly benefiting lightweight architectures. Our method encourages smoothness in the intermediate-layer embeddings of consecutive frames by modeling their changes as a Gaussian Random Walk (GRW). This penalizes abrupt representational shifts, thereby promoting low-acceleration solutions that better align with the natural temporal coherence inherent in videos. By leveraging this enforced smoothness, lightweight models can more effectively capture complex temporal dynamics. Applied to such models, our technique yields a 3.8% to 6.4% accuracy improvement on Kinetics-600. Notably, the MoViNets model family trained with our smooth regularization improves the current state of the art by 3.8% to 6.1% within their respective FLOP constraints, while MobileNetV3 and the MoViNets-Stream family achieve gains of 4.9% to 6.4% over prior state-of-the-art models with comparable memory footprints. Our code and models are available at https://github.com/gilgoldm/grw-smoothing.
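
One way to read the "low-acceleration" phrasing: if frame-to-frame changes in an embedding follow a Gaussian random walk, then the second differences are independent Gaussian steps, and their negative log-likelihood is proportional to a sum of squared second differences. The sketch below implements that penalty as an assumption; the authors' actual loss, layer choice, and weighting may differ and are in their repository.

```python
from typing import List


def acceleration_penalty(embeddings: List[List[float]]) -> float:
    """Mean squared second difference of per-frame embeddings.

    embeddings[t] is the intermediate-layer embedding of frame t. The penalty is zero
    for a trajectory that moves at constant velocity and grows with abrupt shifts.
    """
    penalty = 0.0
    for prev, cur, nxt in zip(embeddings, embeddings[1:], embeddings[2:]):
        penalty += sum((n - 2.0 * c + p) ** 2 for p, c, n in zip(prev, cur, nxt))
    return penalty / max(len(embeddings) - 2, 1)
```

In training, a term like this would be added to the classification loss with a small weight, so that the model is nudged towards temporally coherent features without being forced to ignore real motion.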

[72] Open Vocabulary Compositional Explanations for Neuron Alignment

Biagio La Rosa,Leilani H. Gilpin

Main category: cs.CV

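TL;DR: Proposes an open vocabulary compositional explanation framework for the vision domain that uses masks generated by open vocabulary semantic segmentation to analyze how neurons respond to arbitrary concepts, making the interpretability method more flexible and more widely applicable.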

Details Motivation: Existing compositional explanation methods depend on human-annotated data, which restricts them to specific domains and predefined concepts; this paper aims to remove that restriction. Method: The framework has three steps: specifying arbitrary concepts, generating masks with an open vocabulary semantic segmentation model, and deriving compositional explanations from those masks. Result: Compared with traditional methods, the framework performs on par or better on quantitative metrics and human interpretability, and supports flexible explanations across tasks and properties. Conclusion: The framework delivers open vocabulary neuron explanations without human annotation and widens the scope of compositional explanations for interpreting vision models. Abstract: Neurons are the fundamental building blocks of deep neural networks, and their interconnections allow AI to achieve unprecedented results. Motivated by the goal of understanding how neurons encode information, compositional explanations leverage logical relationships between concepts to express the spatial alignment between neuron activations and human knowledge. However, these explanations rely on human-annotated datasets, restricting their applicability to specific domains and predefined concepts. This paper addresses this limitation by introducing a framework for the vision domain that allows users to probe neurons for arbitrary concepts and datasets. Specifically, the framework leverages masks generated by open vocabulary semantic segmentation to compute open vocabulary compositional explanations. The proposed framework consists of three steps: specifying arbitrary concepts, generating semantic segmentation masks using open vocabulary models, and deriving compositional explanations from these masks. The paper compares the proposed framework with previous methods for computing compositional explanations both in terms of quantitative metrics and human interpretability, analyzes the differences in explanations when shifting from human-annotated data to model-annotated data, and showcases the additional capabilities provided by the framework in terms of flexibility of the explanations with respect to the tasks and properties of interest.
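
The third step, deriving a compositional explanation from concept masks, can be illustrated with a small brute-force search: build logical combinations of concept masks and keep the formula whose mask best overlaps the neuron's activation mask. This is only a sketch under assumptions made here (pixel sets as masks, formulas of at most two concepts, IoU as the alignment score); published methods typically search longer formulas with beam search.

```python
from itertools import combinations, permutations
from typing import Dict, Set, Tuple

Pixel = Tuple[int, int]


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_explanation(
    neuron_mask: Set[Pixel], concept_masks: Dict[str, Set[Pixel]]
) -> Tuple[str, float]:
    """Return the formula over one or two concepts with the highest IoU to the neuron mask."""
    candidates = dict(concept_masks)  # single-concept formulas
    for a, b in combinations(concept_masks, 2):
        candidates[f"{a} OR {b}"] = concept_masks[a] | concept_masks[b]
        candidates[f"{a} AND {b}"] = concept_masks[a] & concept_masks[b]
    for a, b in permutations(concept_masks, 2):
        candidates[f"{a} AND NOT {b}"] = concept_masks[a] - concept_masks[b]
    scores = {name: iou(neuron_mask, mask) for name, mask in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

In the open vocabulary setting, the entries of `concept_masks` would come from a segmentation model prompted with user-chosen concept names rather than from a human-annotated dataset.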

[73] UruDendro4: A Benchmark Dataset for Automatic Tree-Ring Detection in Cross-Section Images of Pinus taeda L

Henry Marichal,Joaquin Blanco,Diego Passarella,Gregory Randall

Main category: cs.CV

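TL;DR: This paper introduces the UruDendro4 dataset, 102 wood cross-section images of loblolly pine (Pinus taeda L.) with manually annotated annual rings, which supports volumetric modeling of annual rings from samples taken at multiple heights. It also provides a performance baseline for automatic ring detection, where DeepCS-TRD performs best, and shows that the dataset improves model generalization on the ring detection task.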

Details Motivation: Existing tree-ring datasets have few samples and lack sampling at multiple heights, so they cannot support three-dimensional or volumetric modeling of annual rings; manual measurement is time-consuming and imprecise, so better datasets and automated methods are needed to improve the accuracy and usefulness of ring detection. Method: Build the UruDendro4 dataset from 102 Pinus taeda L. cross-section images taken at different heights along the stem, with manually annotated ring boundaries; run state-of-the-art deep learning methods such as DeepCS-TRD for automatic ring detection, using ablation experiments to validate the parameter configuration; evaluate the gain in generalization when the dataset is included in training. Result: Unlike existing public datasets, UruDendro4 supports volumetric modeling of annual rings from cross-sections at multiple heights; DeepCS-TRD achieves a mean Average Precision of 0.838, a mean Average Recall of 0.782, and an Adapted Rand Error of 0.084 on it; experiments show that training with the dataset improves model generalization. Conclusion: UruDendro4 fills the gap in multi-height sampling data for tree-ring research, supports more accurate automatic ring detection and volumetric modeling, and helps improve model generalization, benefiting forestry research and sustainable forest management. Abstract: Tree-ring growth represents the annual wood increment for a tree, and quantifying it allows researchers to assess which silvicultural practices are best suited for each species. Manual measurement of this growth is time-consuming and often imprecise, as it is typically performed along 4 to 8 radial directions on a cross-sectional disc. In recent years, automated algorithms and datasets have emerged to enhance accuracy and automate the delineation of annual rings in cross-sectional images. To address the scarcity of wood cross-section data, we introduce the UruDendro4 dataset, a collection of 102 image samples of Pinus taeda L., each manually annotated with annual growth rings. Unlike existing public datasets, UruDendro4 includes samples extracted at multiple heights along the stem, allowing for the volumetric modeling of annual growth using manually delineated rings. This dataset (images and annotations) allows the development of volumetric models for annual wood estimation based on cross-sectional imagery. Additionally, we provide a performance baseline for automatic ring detection on this dataset using state-of-the-art methods. The highest performance was achieved by the DeepCS-TRD method, with a mean Average Precision of 0.838, a mean Average Recall of 0.782, and an Adapted Rand Error score of 0.084. A series of ablation experiments were conducted to empirically validate the final parameter configuration.
Furthermore, we empirically demonstrate that training a learning model including this dataset improves the model's generalization in the tree-ring detection task.

[74] BUSTR: Breast Ultrasound Text Reporting with a Descriptor-Aware Vision-Language Model

Rawa Mohammed,Mina Attin,Bryar Shareef

Main category: cs.CV

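TL;DR: Proposes BUSTR, a multitask vision-language framework that generates breast ultrasound reports without paired image-report supervision, combining token-level cross-entropy with a representation alignment loss to improve clinical efficacy and natural language generation performance.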

Details Motivation: Automated report generation for breast ultrasound is limited by the lack of paired image-report datasets, and large language models carry a risk of hallucination, so a method is needed that requires no paired supervision and improves clinical accuracy. Method: BUSTR builds reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features, learns descriptor-aware visual representations with a multi-head Swin encoder, and aligns visual and textual tokens through a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss between input and output representations. Result: On the two public datasets BrEaST and BUS-BRA, BUSTR consistently improves both standard natural language generation metrics and clinical efficacy metrics, particularly for key targets such as BI-RADS category and pathology. Conclusion: A descriptor-aware vision model trained with a combined token-level and alignment loss can improve the quality and clinical usefulness of automatic reports without paired image-report data. Abstract: Automated radiology report generation (RRG) for breast ultrasound (BUS) is limited by the lack of paired image-report datasets and the risk of hallucinations from large language models. We propose BUSTR, a multitask vision-language framework that generates BUS reports without requiring paired image-report supervision. BUSTR constructs reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features, learns descriptor-aware visual representations with a multi-head Swin encoder trained using a multitask loss over dataset-specific descriptor sets, and aligns visual and textual tokens via a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss between input and output representations. We evaluate BUSTR on two public BUS datasets, BrEaST and BUS-BRA, which differ in size and available descriptors. Across both datasets, BUSTR consistently improves standard natural language generation metrics and clinical efficacy metrics, particularly for key targets such as BI-RADS category and pathology. Our results show that this descriptor-aware vision model, trained with a combined token-level and alignment loss, improves both automatic report metrics and clinical efficacy without requiring paired image-report data. The source code can be found at https://github.com/AAR-UNLV/BUSTR
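
The dual-level objective in the abstract combines a token-level cross-entropy with a cosine-similarity alignment loss. A minimal sketch of how two such terms might be combined is given below; the `1 - cosine` form of the alignment term, the weighting factor `lam`, and the pooled vector representations are assumptions for illustration and are not taken from the BUSTR code.

```python
import math
from typing import List, Sequence


def token_cross_entropy(probs: Sequence[Sequence[float]], target_ids: Sequence[int]) -> float:
    """Mean negative log-probability assigned to the correct token at each position."""
    return -sum(math.log(p[t]) for p, t in zip(probs, target_ids)) / len(target_ids)


def cosine_alignment_loss(u: List[float], v: List[float]) -> float:
    """1 - cosine similarity, so identical directions give a loss of (about) zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dual_level_loss(probs, target_ids, input_repr, output_repr, lam: float = 1.0) -> float:
    """Token-level term plus lam times the representation-level alignment term."""
    return token_cross_entropy(probs, target_ids) + lam * cosine_alignment_loss(
        input_repr, output_repr
    )
```

The token-level term rewards getting each report word right, while the alignment term pulls the representation of the generated report towards that of the input, which is one plausible way to discourage unsupported content.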

[75] Beyond Realism: Learning the Art of Expressive Composition with StickerNet

Haoming Lu,David Kocharian,Humphrey Shi

Main category: cs.CV

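TL;DR: This paper defines the expressive composition task, which emphasizes stylistic diversity and looser placement logic to reflect how users edit on real creative platforms, and proposes StickerNet, a two-stage framework that predicts a sticker's opacity, location, mask, and scale. Trained on a dataset built from real user editing behavior, it outperforms baseline models in user studies and quantitative evaluations.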

Details Motivation: Traditional image composition focuses on visual realism and semantic plausibility, but many real users care more about artistry, playfulness, and social engagement, so a new composition paradigm is needed to reflect this expressive intent. Method: StickerNet is a two-stage framework: the first stage determines the composition type, and the second predicts placement parameters (such as location, scale, opacity, and mask) according to that type; the dataset is built from 1.8 million real user editing actions collected on an anonymous online platform. Result: Experiments show that StickerNet outperforms common baselines in user studies and on quantitative metrics and more closely matches human sticker-placement behavior, confirming the value of learning from real editing behavior. Conclusion: The work opens a new direction in image composition that is oriented towards expressiveness and user intent, learning from real user behavior rather than pursuing visual realism, and offers a new perspective on visual understanding. Abstract: As a widely used operation in image editing workflows, image composition has traditionally been studied with a focus on achieving visual realism and semantic plausibility. However, in practical editing scenarios of the modern content creation landscape, many compositions are not intended to preserve realism. Instead, users of online platforms motivated by gaining community recognition often aim to create content that is more artistic, playful, or socially engaging. Taking inspiration from this observation, we define the expressive composition task, a new formulation of image composition that embraces stylistic diversity and looser placement logic, reflecting how users edit images on real-world creative platforms. To address this underexplored problem, we present StickerNet, a two-stage framework that first determines the composition type, then predicts placement parameters such as opacity, mask, location, and scale accordingly. Unlike prior work that constructs datasets by simulating object placements on real images, we directly build our dataset from 1.8 million editing actions collected on an anonymous online visual creation and editing platform, each reflecting user-community validated placement decisions. This grounding in authentic editing behavior ensures strong alignment between task definition and training supervision. User studies and quantitative evaluations show that StickerNet outperforms common baselines and closely matches human placement behavior, demonstrating the effectiveness of learning from real-world editing patterns despite the inherent ambiguity of the task.
This work introduces a new direction in visual understanding that emphasizes expressiveness and user intent over realism.

[76] TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

Md Adnan Arefeen,Biplob Debnath,Srimat Chakradhar

Main category: cs.CV

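TL;DR: This paper proposes TrafficLens, an algorithm for efficiently processing video from multi-camera traffic intersections; it applies vision-language models sequentially and uses object-level similarity detection to cut redundant computation, substantially shortening video-to-text conversion time.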

Details Motivation: The large volume of video produced by traffic cameras is hard to analyze efficiently, and existing approaches fall short on timeliness, especially because combining them with large language models requires a time-consuming video-to-text step. Method: TrafficLens takes a sequential approach that exploits the overlapping coverage areas of cameras, iteratively applying vision-language models with varying token limits and using the output for the previous camera as the prompt for the next; an object-level similarity detector skips redundant VLM invocations. Result: Experiments on real-world datasets show that TrafficLens reduces video-to-text conversion time by up to 4× while maintaining information accuracy. Conclusion: TrafficLens offers an efficient and accurate solution for video analysis in multi-camera traffic scenes, improving how quickly and effectively video data can be used. Abstract: Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (LLMs) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing overlapping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to $4\times$ while maintaining information accuracy.
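
The sequential prompting and the similarity-based bypass can be pictured with the toy loop below. It is a sketch under stated assumptions: detected objects are represented as sets of labels, similarity is Jaccard overlap with a hypothetical `threshold`, and `vlm` is a placeholder callable rather than any real model API. The varying token limits mentioned in the abstract are not modelled.

```python
from typing import Callable, List, Optional, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two sets of detected object labels."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def describe_cameras(
    detections: List[Set[str]],
    vlm: Callable[[int, str], str],
    threshold: float = 0.8,
) -> List[str]:
    """Describe each camera in turn, reusing the last description for near-duplicate views."""
    descriptions: List[str] = []
    last_objects: Optional[Set[str]] = None
    last_text = ""
    for cam_id, objects in enumerate(detections):
        if last_objects is not None and jaccard(objects, last_objects) >= threshold:
            descriptions.append(last_text)  # redundant view: skip the VLM call
            continue
        last_text = vlm(cam_id, last_text)  # previous output serves as the prompt
        last_objects = objects
        descriptions.append(last_text)
    return descriptions
```

The saving comes from the skipped calls: cameras with overlapping coverage often see nearly the same objects, so only views that add something new trigger the expensive model.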

[77] Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI

Al Amin,Kamrul Hasan,Liang Hong,Sharif Ullah

Main category: cs.CV

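TL;DR: This paper proposes a privacy-preserving federated learning framework that combines Vision Transformers (ViT) with homomorphic encryption (HE) for histopathology image classification across medical institutions. It performs secure aggregation by encrypting the ViT CLS token, greatly reducing communication overhead and resisting reconstruction attacks.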

Details Motivation: Privacy regulations such as HIPAA restrict the sharing of medical data, and model gradients in conventional federated learning remain vulnerable to reconstruction attacks, so a more secure and efficient privacy-preserving mechanism is needed. Method: A Vision Transformer extracts the CLS token as a compact feature representation, and the CLS token is encrypted with CKKS homomorphic encryption before transmission, allowing secure aggregation and inference on ciphertexts at the server. Result: Communication is reduced 30-fold compared with gradient encryption; in a three-client setup, raw gradients are highly susceptible to inversion attacks (PSNR: 52.26 dB, SSIM: 0.999, NMI: 0.741), whereas the proposed method prevents such attacks and needs only 326 KB of encrypted data transmission per aggregation round; global classification accuracy is 96.12% in the unencrypted domain and 90.02% in the encrypted domain. Conclusion: Encrypting CLS tokens with homomorphic encryption provides strong privacy together with efficient communication and usable accuracy, and suits multi-institution collaborative medical image analysis. Abstract: Collaborative machine learning across healthcare institutions promises improved diagnostic accuracy by leveraging diverse datasets, yet privacy regulations such as HIPAA prohibit direct patient data sharing. While federated learning (FL) enables decentralized training without raw data exchange, recent studies show that model gradients in conventional FL remain vulnerable to reconstruction attacks, potentially exposing sensitive medical information. This paper presents a privacy-preserving federated learning framework combining Vision Transformers (ViT) with homomorphic encryption (HE) for secure multi-institutional histopathology classification. The approach leverages the ViT CLS token as a compact 768-dimensional feature representation for secure aggregation, encrypting these tokens using CKKS homomorphic encryption before transmission to the server. We demonstrate that encrypting CLS tokens achieves a 30-fold communication reduction compared to gradient encryption while maintaining strong privacy guarantees. Through evaluation on a three-client federated setup for lung cancer histopathology classification, we show that gradients are highly susceptible to model inversion attacks (PSNR: 52.26 dB, SSIM: 0.999, NMI: 0.741), enabling near-perfect image reconstruction. In contrast, the proposed CLS-protected HE approach prevents such attacks while enabling encrypted inference directly on ciphertexts, requiring only 326 KB of encrypted data transmission per aggregation round.
The framework achieves 96.12 percent global classification accuracy in the unencrypted domain and 90.02 percent in the encrypted domain.

[78] Inversion-Free Style Transfer with Dual Rectified Flows

Yingying Deng,Xiangyu He,Fan Tang,Weiming Dong,Xucheng Yin

Main category: cs.CV

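TL;DR: Proposes an inversion-free style transfer framework based on dual rectified flows that achieves efficient, high-quality image stylization using forward passes only.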

Details Motivation: Existing training-free diffusion-based style transfer methods rely on computationally expensive inversion, and inaccurate inversion causes visual distortions, which limits both efficiency and quality. Method: A dual rectified flow framework predicts content and style trajectories in parallel and fuses their velocity fields through dynamic midpoint interpolation; attention injection guides style integration; the whole process needs no inversion, only forward inference. Result: The method delivers strong style transfer across varied style and content combinations, balancing high visual fidelity, good content preservation, and high computational efficiency. Conclusion: The method offers an efficient and robust inversion-free solution for style transfer that outperforms existing diffusion-based baselines. Abstract: Style transfer, a pivotal task in image processing, synthesizes visually compelling images by seamlessly blending realistic content with artistic styles, enabling applications in photo editing and creative design. While mainstream training-free diffusion-based methods have greatly advanced style transfer in recent years, their reliance on computationally expensive inversion processes compromises efficiency and introduces visual distortions when inversion is inaccurate. To address these limitations, we propose a novel inversion-free style transfer framework based on dual rectified flows, which tackles the challenge of finding an unknown stylized distribution from two distinct inputs (content and style images), using only forward passes. Our approach predicts content and style trajectories in parallel, then fuses them through a dynamic midpoint interpolation that integrates velocities from both paths while adapting to the evolving stylized image. By jointly modeling the content, style, and stylized distributions, our velocity field design achieves robust fusion and avoids the shortcomings of naive overlays. Attention injection further guides style integration, enhancing visual fidelity, content preservation, and computational efficiency. Extensive experiments demonstrate generalization across diverse styles and content, providing an effective and efficient pipeline for style transfer.

[79] RefOnce: Distilling References into a Prototype Memory for Referring Camouflaged Object Detection

Yu-Huan Wu,Zi-Xuan Zhu,Yan Wang,Liangli Zhen,Deng-Ping Fan

Main category: cs.CV

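TL;DR: Proposes a new Ref-COD framework that distills reference information into a class-prototype memory during training and generates a guidance vector at inference without any reference images, giving a simple and efficient route to referring camouflaged object detection.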

Details Motivation: Existing Ref-COD methods depend on reference images at test time, which makes deployment harder, adds latency, and increases the data-collection burden; this paper aims to remove the need for test-time reference images. Method: A class-prototype memory keeps one prototype per category, updated with an exponential moving average (EMA); mixture weights predicted from the query combine the prototypes into a guidance vector, and a bidirectional attention alignment module narrows the representation gap between reference statistics and camouflaged query features. Result: Evaluated on the large-scale R2C7K benchmark, the method matches or exceeds current state-of-the-art methods without test-time reference images. Conclusion: The framework achieves Ref-COD without test-time reference images, is practical and efficient, and offers a simple, effective solution for real deployment. Abstract: Referring Camouflaged Object Detection (Ref-COD) segments specified camouflaged objects in a scene by leveraging a small set of referring images. Though effective, current systems adopt a dual-branch design that requires reference images at test time, which limits deployability and adds latency and data-collection burden. We introduce a Ref-COD framework that distills references into a class-prototype memory during training and synthesizes a reference vector at inference via a query-conditioned mixture of prototypes. Concretely, we maintain an EMA-updated prototype per category and predict mixture weights from the query to produce a guidance vector without any test-time references. To bridge the representation gap between reference statistics and camouflaged query features, we propose a bidirectional attention alignment module that adapts both the query features and the class representation. Thus, our approach yields a simple, efficient path to Ref-COD without mandatory references. We evaluate the proposed method on the large-scale R2C7K benchmark. Extensive experiments demonstrate competitive or superior performance of the proposed method compared with recent state-of-the-art methods. Code is available at https://github.com/yuhuan-wu/RefOnce.
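
The two mechanisms named in the abstract, an EMA-updated prototype per category and a query-conditioned mixture of prototypes, can be sketched as below. The momentum value and, in particular, the use of softmax over dot products to obtain the mixture weights are assumptions made for illustration; the paper predicts the weights from the query with its own learned module.

```python
import math
from typing import List


def ema_update(prototype: List[float], feature: List[float], momentum: float = 0.9) -> List[float]:
    """Exponential moving average: keep `momentum` of the old prototype, blend in the new feature."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(prototype, feature)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guidance_vector(query: List[float], prototypes: List[List[float]]) -> List[float]:
    """Mix class prototypes with weights derived from the query (assumed: dot-product softmax)."""
    logits = [sum(q * p for q, p in zip(query, proto)) for proto in prototypes]
    weights = softmax(logits)
    dim = len(prototypes[0])
    return [sum(w * proto[i] for w, proto in zip(weights, prototypes)) for i in range(dim)]
```

During training, `ema_update` would be called with features from reference images of each class; at inference only `guidance_vector` is needed, which is why no reference images have to be supplied.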

[80] Wavefront-Constrained Passive Obscured Object Detection

Zhiwen Zheng,Yiwei Ouyang,Zhao Huang,Tao Zhang,Xiaoshuai Zhang,Huiyu Zhou,Wenwen Tang,Shaowei Jiang,Jin Liu,Xingru Huang

Main category: cs.CV

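TL;DR: Proposes WavePCNet, a new physics-driven network that simulates wavefront propagation to improve the perception of obscured objects, showing better accuracy and robustness than existing methods on four real datasets.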

Details Motivation: Existing methods for imaging beyond the field of view fail to capture the physics of coherent light propagation, and under low signal-to-noise conditions they tend to converge to non-physical solutions, which harms stability and reliability. Method: WavePCNet includes a Tri-Phase Wavefront Complex-Propagation Reprojection (TriWCP) module that models complex amplitude propagation and constrains coherent behavior, a momentum memory mechanism that suppresses the accumulation of perturbations, and a High-frequency Cross-layer Compensation Enhancement module that uses multi-scale receptive fields and frequency-selective pathways to improve structural consistency and robustness. Result: Experiments on four physically collected datasets show that WavePCNet clearly outperforms state-of-the-art methods in localizing and segmenting obscured objects, with higher accuracy and better resistance to interference. Conclusion: By tightly combining physical priors with deep learning, WavePCNet addresses multiple scattering and noise in imaging beyond the field of view and improves the perception of obscured objects in complex environments. Abstract: Accurately localizing and segmenting obscured objects from faint light patterns beyond the field of view is highly challenging due to multiple scattering and medium-induced perturbations. Most existing methods, based on real-valued modeling or local convolutional operations, are inadequate for capturing the underlying physics of coherent light propagation. Moreover, under low signal-to-noise conditions, these methods often converge to non-physical solutions, severely compromising the stability and reliability of the observation. To address these challenges, we propose a novel physics-driven Wavefront Propagating Compensation Network (WavePCNet) to simulate wavefront propagation and enhance the perception of obscured objects. This WavePCNet integrates the Tri-Phase Wavefront Complex-Propagation Reprojection (TriWCP) to incorporate complex amplitude transfer operators to precisely constrain coherent propagation behavior, along with a momentum memory mechanism to effectively suppress the accumulation of perturbations. Additionally, a High-frequency Cross-layer Compensation Enhancement is introduced to construct frequency-selective pathways with multi-scale receptive fields and dynamically model structural consistency across layers, further boosting the model's robustness and interpretability under complex environmental conditions. Extensive experiments conducted on four physically collected datasets demonstrate that WavePCNet consistently outperforms state-of-the-art methods across both accuracy and robustness.

[81] GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision

Yuxiao Xiang,Junchi Chen,Zhenchao Jin,Changtao Miao,Haojie Yuan,Qi Chu,Tao Gong,Nenghai Yu

Main category: cs.CV

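TL;DR: This paper proposes GuardTrace-VL, a safety auditor that monitors the full reasoning process of multimodal large reasoning models (MLRMs) on vision-language tasks, jointly analyzing image and text content to detect potentially unsafe content at the intermediate reasoning stage.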

Details Motivation: Existing multimodal safety guards focus mainly on the input question and the final answer and overlook harmful content that may appear in the intermediate reasoning (such as biased inferences or policy-violating use of visual context), which creates deployment risk; a safety check that covers the whole reasoning chain is therefore needed. Method: GuardTrace-VL uses joint image-text analysis to monitor the full Question-Thinking-Answer (QTA) pipeline; the GuardTrace dataset is built with diverse prompting strategies and refined through a model- and human-based voting and verification pipeline to produce high-quality annotations; a three-stage progressive training scheme, combined with data refinement, lets the model learn fine-grained safety preferences for different risk levels. Result: On a test set covering in-domain and out-of-domain scenarios, GuardTrace-VL reaches an F1 score of 93.1% on unsafe reasoning detection, a 13.5% improvement over the previous strongest multimodal safety defense methods. Conclusion: GuardTrace-VL addresses the neglect of intermediate reasoning risk in existing safety guards, markedly improves the safety of multimodal large reasoning models in complex scenarios, and shows good generalization and practical promise. Abstract: Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via an MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, the GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The code will be made publicly available.

[82] From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition

Jingxi Chen,Yixiao Zhang,Xiaoye Qian,Zongxia Li,Cornelia Fermuller,Caren Chen,Yiannis Aloimonos

Main category: cs.CV

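TL;DR: This paper proposes a lightweight fine-tuning approach based on a diffusion model for decomposing a single image into layers; trained on synthetic data, it performs strongly on object removal and occlusion recovery.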

Details Motivation: Decomposing a single image into layers remains challenging because methods and data are limited, yet layered representations are essential for image editing. Method: Adapt a diffusion-based inpainting model through lightweight fine-tuning for layer decomposition, and introduce a multi-modal context fusion module with linear attention complexity. Result: Trained on synthetic data, the model achieves superior performance on object removal and occlusion recovery. Conclusion: The method decomposes a single image into layers effectively, preserves detail better, and opens new possibilities for image editing and creative applications. Abstract: Images can be viewed as layered compositions, foreground objects over background, with potential occlusions. This layered representation enables independent editing of elements, offering greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightweight finetuning. To further preserve detail in the latent space, we introduce a novel multi-modal context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.

[83] Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning

Xiaoxing You,Qiang Huang,Lingyu Li,Chi Zhang,Xiaopeng Liu,Min Zhang,Jun Yu

Main category: cs.CV

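TL;DR: Proposes MERGE, the first multimodal entity-aware retrieval-augmented generation framework for news image captioning. It builds an entity-centric multimodal knowledge base to improve information coverage, cross-modal alignment, and visual-entity grounding, and significantly outperforms existing methods on several datasets.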

Details Motivation: Existing news image captioning methods face three challenges: incomplete information coverage, weak cross-modal alignment, and suboptimal visual-entity grounding. Method: Proposes the MERGE framework, which builds an entity-centric multimodal knowledge base (EMKB) combining textual, visual, and structured knowledge, improves cross-modal alignment with a multistage hypothesis-caption strategy, and strengthens visual-entity matching through image-guided dynamic retrieval. Result: On GoodNews and NYTimes800k, CIDEr improves by 6.84 and 1.16 and the NER F1-score by 4.14 and 2.64; on the unseen Visual News dataset, CIDEr improves by 20.17 and the F1-score by 6.22. Conclusion: MERGE improves information completeness, cross-modal alignment, and entity grounding in news image captioning, with strong robustness and domain adaptability. Abstract: News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.

[84] MetaRank: Task-Aware Metric Selection for Model Transferability Estimation

Yuhang Liu,Wenjie Zhao,Yunhui Guo

Main category: cs.CV

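TL;DR: This paper proposes MetaRank, a meta-learning framework for automatic, task-aware selection of Model Transferability Estimation (MTE) metrics. It embeds textual descriptions of datasets and MTE metrics into a shared semantic space and trains a meta-predictor with a listwise objective, so that MTE metrics can be ranked efficiently for a new target dataset.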

Details Motivation: MTE metric selection usually relies on average historical performance and lacks task adaptivity, while the effectiveness of different MTE metrics varies considerably across target tasks, so an automatic, task-aware selection mechanism is needed. Method: Formulates MTE metric selection as a learning-to-rank problem; encodes textual descriptions of datasets and MTE metrics with a pretrained language model into a shared semantic space; and trains a meta-predictor offline on diverse meta-tasks with a listwise loss that optimizes its ranking of the top-performing MTE metrics. Result: Experiments across 11 pretrained models and 11 target datasets show that MetaRank identifies and prioritizes the MTE metrics best suited to a given task, clearly outperforming fixed or heuristic selection. Conclusion: MetaRank provides automatic, task-aware MTE metric selection, improving the efficiency and accuracy of model selection in transfer learning and validating semantics-driven meta-learning in this setting. Abstract: Selecting an appropriate pre-trained source model is a critical, yet computationally expensive, task in transfer learning. Model Transferability Estimation (MTE) methods address this by providing efficient proxy metrics to rank models without full fine-tuning. In practice, the choice of which MTE metric to use is often ad hoc or guided simply by a metric's average historical performance. However, we observe that the effectiveness of MTE metrics is highly task-dependent and no single metric is universally optimal across all target datasets. To address this gap, we introduce MetaRank, a meta-learning framework for automatic, task-aware MTE metric selection. We formulate metric selection as a learning-to-rank problem. Rather than relying on conventional meta-features, MetaRank encodes textual descriptions of both datasets and MTE metrics using a pretrained language model, embedding them into a shared semantic space. A meta-predictor is then trained offline on diverse meta-tasks to learn the intricate relationship between dataset characteristics and metric mechanisms, optimized with a listwise objective that prioritizes correctly ranking the top-performing metrics. During the subsequent online phase, MetaRank efficiently ranks the candidate MTE metrics for a new, unseen target dataset based on its textual description, enabling practitioners to select the most appropriate metric a priori. Extensive experiments across 11 pretrained models and 11 target datasets demonstrate the strong effectiveness of our approach.
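
The abstract does not give the exact form of the listwise objective, so the following is only a minimal sketch of one common choice, a ListNet-style top-one cross-entropy. The function names, the metric labels, and the use of softmax-normalized "true quality" values as targets are all assumptions made for illustration, not details from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(predicted_scores, true_quality):
    """ListNet-style loss: cross-entropy between the target and predicted
    top-one distributions over the candidate metrics (hypothetical setup)."""
    target = softmax(true_quality)
    predicted = softmax(predicted_scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


def rank_metrics(names, predicted_scores):
    """Order candidate metric names from highest to lowest predicted score."""
    order = sorted(range(len(names)), key=lambda i: predicted_scores[i], reverse=True)
    return [names[i] for i in order]


# Hypothetical example: three candidate metrics for one target dataset.
quality = [0.9, 0.5, 0.1]                 # how well each metric actually worked
good = listwise_loss([2.0, 1.0, 0.0], quality)   # scores agree with quality
bad = listwise_loss([0.0, 1.0, 2.0], quality)    # scores reverse the order
ranking = rank_metrics(["metric_a", "metric_b", "metric_c"], [0.1, 2.0, 1.0])
```

Because the softmax target puts most of its mass on the best-performing metric, this kind of loss penalizes mistakes at the top of the list more than mistakes near the bottom, which matches the abstract's emphasis on correctly ranking the top performers.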

[85] Structure-Aware Prototype Guided Trusted Multi-View Classification

Haojian Huang,Jiahao Shi,Zhe Liu,Harold Haodong Chen,Han Fang,Hao Sun,Zhongjiang He

Main category: cs.CV

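TL;DR: Proposes a new multi-view classification framework that introduces prototypes to represent each view's neighbor structure, simplifying the learning of intra-view relations and enabling dynamic alignment of inter-view structure, which improves the reliability and efficiency of classification.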

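Details Motivation: Existing trusted multi-view classification methods model globally dense neighbor relationships, which is computationally expensive and cannot readily guarantee cross-view consistency; they also aggregate evidence with manually assigned weights, with no guarantee that neighbor structures are consistent within the class space, which undermines the trustworthiness of classification. Method: Introduces prototypes to represent each view's neighbor structure, simplifying the learning of intra-view neighbor relations, and uses a dynamic alignment mechanism to optimize the consistency of intra- and inter-view structure, making cross-view consensus discovery more efficient and reliable. Result: Experiments on multiple public multi-view datasets show that the method matches or exceeds prevalent trusted multi-view classification methods in downstream performance and robustness. Conclusion: The framework addresses the shortcomings of earlier methods in computational efficiency, cross-view consistency, and trust guarantees, offering a more efficient and reliable solution for trusted multi-view classification. Abstract: Trustworthy multi-view classification (TMVC) addresses the challenge of achieving reliable decision-making in complex scenarios where multi-source information is heterogeneous, inconsistent, or even conflicting. Existing TMVC approaches predominantly rely on globally dense neighbor relationships to model intra-view dependencies, leading to high computational costs and an inability to directly ensure consistency across inter-view relationships. Furthermore, these methods typically aggregate evidence from different views through manually assigned weights, lacking guarantees that the learned multi-view neighbor structures are consistent within the class space, thus undermining the trustworthiness of classification outcomes. To overcome these limitations, we propose a novel TMVC framework that introduces prototypes to represent the neighbor structures of each view. By simplifying the learning of intra-view neighbor relations and enabling dynamic alignment of intra- and inter-view structure, our approach facilitates more efficient and consistent discovery of cross-view consensus. Extensive experiments on multiple public multi-view datasets demonstrate that our method achieves competitive downstream performance and robustness compared to prevalent TMVC methods.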

[86] CameraMaster: Unified Camera Semantic-Parameter Control for Photography Retouching

Qirui Yang,Yang Yang,Ying Zeng,Xiaobin Hu,Bo Li,Huanjing Yue,Jingyu Yang,Peng-Tao Jiang

Main category: cs.CV

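TL;DR: This paper proposes CameraMaster, a unified camera-aware framework for image retouching that decouples the camera directive and fuses the photographer's intent with precise camera parameters, achieving physically consistent, parameter-controllable image editing.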

Details Motivation: Existing text-guided diffusion models struggle to control camera parameters such as exposure, white balance, and zoom precisely during image retouching; they depend on ambiguous text prompts or separate adjustment modules, which limits scalability and sensitivity to subtle variations. Method: Proposes the CameraMaster framework, which explicitly decouples the camera directive, introduces a camera parameter embedding that modulates both the directive and the content semantics, and injects the modulated directive into the content features via cross-attention; the directive and parameter embeddings are also injected into the time embedding as conditioning and gating signals, giving unified, layer-wise modulation throughout denoising. Result: A large-scale dataset of 78K image-prompt pairs was built for training and evaluation; experiments show that CameraMaster responds monotonically and near-linearly to parameter changes, supports seamless multi-parameter composition, and significantly outperforms existing methods. Conclusion: CameraMaster achieves precise, predictable, and composable camera parameter control, improving the physical consistency and controllability of image retouching and pointing to a new direction for text-guided diffusion models in professional image editing. Abstract: Text-guided diffusion models have greatly advanced image editing and generation. However, achieving physically consistent image retouching with precise parameter control (e.g., exposure, white balance, zoom) remains challenging. Existing methods either rely solely on ambiguous and entangled text prompts, which hinders precise camera control, or train separate heads/weights for parameter adjustment, which compromises scalability, multi-parameter composition, and sensitivity to subtle variations. To address these limitations, we propose CameraMaster, a unified camera-aware framework for image retouching. The key idea is to explicitly decouple the camera directive and then coherently integrate two critical information streams: a directive representation that captures the photographer's intent, and a parameter embedding that encodes precise camera settings. CameraMaster first uses the camera parameter embedding to modulate both the camera directive and the content semantics. The modulated directive is then injected into the content features via cross-attention, yielding a strongly camera-sensitive semantic context. In addition, the directive and camera embeddings are injected as conditioning and gating signals into the time embedding, enabling unified, layer-wise modulation throughout the denoising process and enforcing tight semantic-parameter alignment. To train and evaluate CameraMaster, we construct a large-scale dataset of 78K image-prompt pairs annotated with camera parameters. Extensive experiments show that CameraMaster produces monotonic and near-linear responses to parameter variations, supports seamless multi-parameter composition, and significantly outperforms existing methods.

[87] CaptionQA: Is Your Caption as Useful as the Image Itself?

Shijia Yang,Yunong Liu,Bohan Zhai,Ximeng Sun,Zicheng Liu,Emad Barsoum,Manling Li,Chenfeng Xu

Main category: cs.CV

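TL;DR: This paper proposes CaptionQA, a utility-based benchmark for image captions that measures caption quality by how well generated captions support downstream tasks. It covers several domains, provides fine-grained taxonomies and a large set of annotated questions, and reveals substantial shortfalls in the caption utility of existing models.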

Details Motivation: Existing caption evaluation methods fail to answer a core question: can a caption effectively stand in for the image in real downstream tasks? A usage-based evaluation is therefore needed. Method: Proposes the CaptionQA benchmark, covering four domains (Natural, Document, E-commerce, and Embodied AI) with a fine-grained taxonomy of 25 top-level categories and 69 subcategories, evaluated on 33,027 densely annotated multiple-choice questions; an LLM answers the questions from captions alone, directly measuring how much image utility the caption preserves. Result: State-of-the-art multimodal large models that perform similarly on traditional image-QA benchmarks differ substantially in caption utility, with drops of up to 32%. Conclusion: CaptionQA exposes the limitations of current image captions in practical tasks, highlights the need to look at caption utility and not only linguistic quality, and provides an extensible open-source pipeline for new domains. Abstract: Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains--Natural, Document, E-commerce, and Embodied AI--each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models nearly identical on traditional image-QA benchmarks drop by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.
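
The evaluation protocol described above (a text-only answerer sees the caption, never the image) can be sketched roughly as follows. This is not the CaptionQA pipeline: the record layout, `caption_utility`, and the toy `keyword_answerer` that stands in for an LLM are all hypothetical names chosen for this illustration.

```python
def caption_utility(questions, answer_fn):
    """Fraction of multiple-choice questions answered correctly from the
    caption alone. `answer_fn` stands in for a text-only LLM call."""
    correct = 0
    for item in questions:
        prediction = answer_fn(item["caption"], item["question"], item["choices"])
        correct += prediction == item["answer"]
    return correct / len(questions)


def keyword_answerer(caption, question, choices):
    """Toy stand-in for an LLM: return the first choice that literally
    appears in the caption, or None when the caption lacks the information."""
    for choice in choices:
        if choice.lower() in caption.lower():
            return choice
    return None


# Hypothetical records: one caption, two questions about the same image.
caption = "A red mug on a wooden desk."
questions = [
    {"caption": caption, "question": "What color is the mug?",
     "choices": ["blue", "red", "green"], "answer": "red"},
    {"caption": caption, "question": "What is printed on the mug?",
     "choices": ["hello", "coffee", "stars"], "answer": "coffee"},
]
utility = caption_utility(questions, keyword_answerer)
```

The second question fails because the caption never mentions the print on the mug, which is the kind of lost image-level information a utility score of this sort is meant to surface.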

[88] FlowerDance: MeanFlow for Efficient and Refined 3D Dance Generation

Kaixing Yang,Xulong Tang,Ziqiao Peng,Xiangyue Zhang,Puwei Wang,Jun He,Hongyan Liu

Main category: cs.CV

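TL;DR: FlowerDance is an efficient music-to-dance generation model that produces high-quality 3D dance quickly while preserving physical plausibility and artistic expressiveness.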

Details Motivation: Existing music-to-dance generation methods are inefficient and leave too little compute for high-fidelity 3D rendering, which limits expressiveness in real applications. Method: Proposes FlowerDance, which combines MeanFlow with physical consistency constraints, uses a BiMamba-based backbone with channel-level cross-modal fusion to generate dance motion efficiently in a non-autoregressive manner, and supports interactive motion editing. Result: Experiments on AIST++ and FineDance show that FlowerDance reaches SOTA in both generation quality and efficiency, with markedly better inference speed and memory utilization. Conclusion: Through its architectural design, FlowerDance achieves efficient, high-quality music-driven dance generation with good practical potential, particularly for scenarios that need real-time interaction and high rendering quality. Abstract: Music-to-dance generation aims to translate auditory signals into expressive human motion, with broad applications in virtual reality, choreography, and digital entertainment. Despite promising progress, the limited generation efficiency of existing methods leaves insufficient computational headroom for high-fidelity 3D rendering, thereby constraining the expressiveness of 3D characters during real-world applications. Thus, we propose FlowerDance, which not only generates refined motion with physical plausibility and artistic expressiveness, but also achieves significant generation efficiency in inference speed and memory utilization. Specifically, FlowerDance combines MeanFlow with Physical Consistency Constraints, which enables high-quality motion generation with only a few sampling steps. Moreover, FlowerDance leverages a simple but efficient model architecture with BiMamba-based backbone and Channel-Level Cross-Modal Fusion, which generates dance in an efficient non-autoregressive manner. Meanwhile, FlowerDance supports motion editing, enabling users to interactively refine dance sequences. Extensive experiments on AIST++ and FineDance show that FlowerDance achieves state-of-the-art results in both motion quality and generation efficiency. Code will be released upon acceptance.

[89] LungNoduleAgent: A Collaborative Multi-Agent System for Precision Diagnosis of Lung Nodules

Cheng Yang,Hui Jin,Xinlei Yu,Zhipeng Wang,Yaoqun Liu,Fenglei Fan,Dajiang Lei,Gangyong Jia,Changmiao Wang,Ruiquan Ge

Main category: cs.CV

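TL;DR: Proposes LungNoduleAgent, a collaborative multi-agent system for analyzing lung CT scans that significantly improves the accuracy of lung nodule description and malignancy grading.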

Details Motivation: Existing vision-language models fall short in describing lung nodule morphology and in incorporating medical expertise, which affects clinical reliability; the potential of multi-agent systems in pathology has not been fully explored. Method: Builds a three-module system consisting of the Nodule Spotter, the Radiologist, and the Doctor Agent System, which respectively detect nodules, generate reports from localized image descriptions, and perform malignancy reasoning supported by a knowledge base. Result: Tested on two private datasets and the public LIDC-IDRI dataset, the system outperforms mainstream vision-language models, agent systems, and expert models, supporting the value of region-level semantic alignment and multi-agent collaboration. Conclusion: By combining multi-agent collaboration with pathology knowledge, LungNoduleAgent provides high-precision support for clinical analysis of lung nodules and has significant clinical promise. Abstract: Diagnosing lung cancer typically involves physicians identifying lung nodules in computed tomography (CT) scans and generating diagnostic reports based on their morphological features and medical expertise. Although advancements have been made in using multimodal large language models for analyzing lung CT scans, challenges remain in accurately describing nodule morphology and incorporating medical expertise. These limitations affect the reliability and effectiveness of these models in clinical settings. Collaborative multi-agent systems offer a promising strategy for achieving a balance between generality and precision in medical applications, yet their potential in pathology has not been thoroughly explored. To bridge these gaps, we introduce LungNoduleAgent, an innovative collaborative multi-agent system specifically designed for analyzing lung CT scans. LungNoduleAgent streamlines the diagnostic process into sequential components, improving precision in describing nodules and grading malignancy through three primary modules. The first module, the Nodule Spotter, coordinates clinical detection models to accurately identify nodules. The second module, the Radiologist, integrates localized image description techniques to produce comprehensive CT reports. Finally, the Doctor Agent System performs malignancy reasoning by using images and CT reports, supported by a pathology knowledge base and a multi-agent system framework. Extensive testing on two private datasets and the public LIDC-IDRI dataset indicates that LungNoduleAgent surpasses mainstream vision-language models, agent systems, and advanced expert models. These results highlight the importance of region-level semantic alignment and multi-agent collaboration in diagnosing nodules. LungNoduleAgent stands out as a promising foundational tool for supporting clinical analyses of lung nodules.

[90] PG-ControlNet: A Physics-Guided ControlNet for Generative Spatially Varying Image Deblurring

Hakki Motorcu,Mujdat Cetin

Main category: cs.CV

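TL;DR: This paper proposes a new image deblurring framework that combines a powerful generative prior with explicit, dense physical constraints to handle complex spatially varying blur, striking a good balance between physical accuracy and perceptual realism.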

Details Motivation: Existing learning-based deblurring methods struggle to satisfy physical constraints and perceptual quality at once: model-based methods are more physically consistent but over-smooth textures, while generative models look better but tend to hallucinate details. This paper aims to combine the strengths of both. Method: Models the degradation field as a dense continuum of high-dimensional compressed kernels and uses this descriptor field to condition diffusion sampling in a ControlNet architecture, producing high-quality detail while keeping strong physical constraints. Result: Experiments show that the method outperforms state-of-the-art model-based and generative deblurring methods in complex scenes with severe blur and heavy noise. Conclusion: The framework combines the advantages of model-based and generative approaches, producing realistic, natural textures while remaining physically plausible, and improves spatially varying deblurring. Abstract: Spatially varying image deblurring remains a fundamentally ill-posed problem, especially when degradations arise from complex mixtures of motion and other forms of blur under significant noise. State-of-the-art learning-based approaches generally fall into two paradigms: model-based deep unrolling methods that enforce physical constraints by modeling the degradations, but often produce over-smoothed, artifact-laden textures, and generative models that achieve superior perceptual quality yet hallucinate details due to weak physical constraints. In this paper, we propose a novel framework that uniquely reconciles these paradigms by taming a powerful generative prior with explicit, dense physical constraints. Rather than oversimplifying the degradation field, we model it as a dense continuum of high-dimensional compressed kernels, ensuring that minute variations in motion and other degradation patterns are captured. We leverage this rich descriptor field to condition a ControlNet architecture, strongly guiding the diffusion sampling process. Extensive experiments demonstrate that our method effectively bridges the gap between physical accuracy and perceptual realism, outperforming state-of-the-art model-based methods as well as generative baselines in challenging, severely blurred scenarios.

[91] MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization

Yingjie Xia,Xi Wang,Jinglei Shi,Vicky Kalogeiton,Jian Yang

Main category: cs.CV

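TL;DR: MUSE is the first unified framework for image emotional synthesis, handling both emotional generation and editing without additional diffusion-model training or specialized datasets. It steers emotion through gradient-based optimization and semantic-similarity guidance, balancing emotional accuracy, semantic diversity, and text consistency.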

Details Motivation: Existing image emotional synthesis methods separate generation from editing, which is inefficient and limits applications such as therapy and storytelling, so a unified, efficient framework covering both tasks is needed. Method: Proposes the MUSE framework, which adopts a strategy similar to Test-Time Scaling (TTS): it uses an off-the-shelf emotion classifier with gradient-based optimization to steer emotional tokens, uses semantic similarity to decide when to introduce emotional guidance, and designs a multi-emotion loss to reduce interference between emotions, unifying emotional generation and editing. Result: Experiments show that MUSE outperforms existing methods on both emotional generation and editing, clearly improving emotional accuracy and semantic diversity while keeping content plausible and faithful to the text prompt. Conclusion: MUSE establishes a new paradigm for emotion synthesis, offering a unified, efficient, training-free solution for controlling emotion in images with broad application prospects. Abstract: Images evoke emotions that profoundly influence perception, often prioritized over content. Current Image Emotional Synthesis (IES) approaches artificially separate generation and editing tasks, creating inefficiencies and limiting applications where these tasks naturally intertwine, such as therapeutic interventions or storytelling. In this work, we introduce MUSE, the first unified framework capable of both emotional generation and editing. By adopting a strategy conceptually aligned with Test-Time Scaling (TTS), which is widely used in both the LLM and diffusion model communities, it avoids the need for additional updates to the diffusion model or specialized emotional synthesis datasets. More specifically, MUSE addresses three key questions in emotional synthesis: (1) HOW to stably guide synthesis by leveraging an off-the-shelf emotion classifier with gradient-based optimization of emotional tokens; (2) WHEN to introduce emotional guidance by identifying the optimal timing using semantic similarity as a supervisory signal; and (3) WHICH emotion to guide synthesis through a multi-emotion loss that reduces interference from inherent and similar emotions. Experimental results show that MUSE performs favorably against all methods for both generation and editing, improving emotional accuracy and semantic diversity while maintaining an optimal balance between desired content, adherence to text prompts, and realistic emotional expression. It establishes a new paradigm for emotion synthesis.

[92] Long-Term Alzheimers Disease Prediction: A Novel Image Generation Method Using Temporal Parameter Estimation with Normal Inverse Gamma Distribution on Uneven Time Series

Xin Hong,Xinze Sun,Yinhao Li,Yen-Wei Chen

Main category: cs.CV

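TL;DR: Proposes a model based on a temporally parameterized Normal Inverse Gamma distribution (T-NIG) that generates brain images under irregular time intervals for long-term prediction of Alzheimer's disease (AD); the model preserves disease-related characteristics well.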

Details Motivation: When handling sequential data with irregular time intervals, existing methods have difficulty preserving disease-related characteristics in long-term image generation, so a new method that models the temporal properties of the distribution is needed. Method: The T-NIG model takes brain images from two time points, introduces a time parameter into the normal inverse gamma distribution, and combines coordinate-neighborhood features with uncertainty estimation to generate intermediate and future images, thereby modeling irregular time series. Result: T-NIG reaches state-of-the-art performance on both short-term and long-term prediction tasks, reduces the epistemic and aleatoric uncertainty caused by insufficient temporal data, and forecasts disease progression accurately. Conclusion: T-NIG handles brain image generation under irregular time intervals and preserves disease-related characteristics in long-term AD prediction, with potential for clinical use. Abstract: Image generation can provide physicians with an imaging diagnosis basis in the prediction of Alzheimer's Disease (AD). Recent research has shown that long-term AD predictions by image generation often face difficulties maintaining disease-related characteristics when dealing with irregular time intervals in sequential data. Considering that the time-related aspects of the distribution can reflect changes in disease-related characteristics when images are distributed unevenly, this research proposes a model to estimate the temporal parameter within the Normal Inverse Gamma Distribution (T-NIG) to assist in generating images over the long term. The T-NIG model employs brain images from two different time points to create intermediate brain images, forecast future images, and predict the disease. T-NIG is designed by identifying features using coordinate neighborhoods. It incorporates a time parameter into the normal inverse gamma distribution to understand how features change in brain imaging sequences that have varying time intervals. Additionally, T-NIG utilizes uncertainty estimation to reduce both epistemic and aleatoric uncertainties in the model, which arise from insufficient temporal data. In particular, the T-NIG model demonstrates state-of-the-art performance in both short-term and long-term prediction tasks within the dataset. Experimental results indicate that T-NIG is proficient in forecasting disease progression while maintaining disease-related characteristics, even when faced with an irregular temporal data distribution.
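
For readers unfamiliar with how a normal inverse gamma (NIG) distribution yields the two kinds of uncertainty mentioned above, the sketch below computes the standard NIG moments as they are commonly used in evidential regression. It is an assumption that T-NIG uses these same quantities; the abstract does not say how the time parameter enters the distribution, so that part is deliberately left out, and the function and parameter names are illustrative.

```python
def nig_uncertainty(gamma, nu, alpha, beta):
    """Moments of NIG(gamma, nu, alpha, beta) placed over a Gaussian (mu, sigma^2).

    Returns (prediction, aleatoric, epistemic) where
      prediction = E[mu]      = gamma
      aleatoric  = E[sigma^2] = beta / (alpha - 1)
      epistemic  = Var[mu]    = beta / (nu * (alpha - 1))
    Both variance terms require alpha > 1.
    """
    if alpha <= 1 or nu <= 0 or beta <= 0:
        raise ValueError("need alpha > 1, nu > 0, beta > 0")
    aleatoric = beta / (alpha - 1)
    epistemic = beta / (nu * (alpha - 1))
    return gamma, aleatoric, epistemic


# Hypothetical parameters for one voxel.
prediction, aleatoric, epistemic = nig_uncertainty(gamma=0.5, nu=2.0, alpha=3.0, beta=4.0)
```

The epistemic term shrinks as `nu` grows, which is one way to read the abstract's point that uncertainty arises from insufficient temporal data: more supporting evidence lowers the variance of the predicted mean, while the aleatoric term reflects noise in the data itself.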

[93] MIRA: Multimodal Iterative Reasoning Agent for Image Editing

Ziyun Zeng,Hang Hua,Jiebo Luo

Main category: cs.CV

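TL;DR: MIRA is a lightweight, plug-and-play multimodal reasoning agent that runs an iterative perception-reasoning-action loop, parsing complex natural-language instructions step by step and using visual feedback to edit images, which significantly improves semantic consistency and perceptual quality.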

Details Motivation: When handling complex natural-language editing instructions, diffusion models often misread compositional relationships, contextual information, or referring expressions, causing semantic drift or failed edits, so a stronger mechanism for parsing and executing instructions is needed. Method: Proposes MIRA (Multimodal Iterative Reasoning Agent), which uses an iterative perception-reasoning-action framework, the 150K-sample MIRA-Editing dataset, and a two-stage SFT + GRPO training pipeline to generate atomic edit instructions step by step and adjust its decisions using visual feedback. Result: MIRA significantly improves editing results on several open-source image editing models (such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit), matching or exceeding proprietary systems such as GPT-Image and Nano-Banana in semantic consistency and visual quality. Conclusion: By simulating a human-like multi-turn interactive editing process, MIRA addresses image editing under complex instructions and is both general and practical. Abstract: Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.
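
The control flow of a perception-reasoning-action loop of this kind can be sketched as below. This shows only the loop structure under simplifying assumptions: the "image" is a plain dict of attributes, and `propose_step` and `apply_edit` are hypothetical callables standing in for the reasoning agent and the underlying image editor; none of these names come from the paper.

```python
def iterative_edit(image, instruction, propose_step, apply_edit, max_steps=5):
    """Repeat: inspect the current image, propose one atomic edit, apply it.
    Stops when the proposer returns None or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        step = propose_step(image, instruction, history)  # perceive + reason
        if step is None:                                  # nothing left to change
            break
        image = apply_edit(image, step)                   # act
        history.append(step)
    return image, history


def toy_propose(image, instruction, history):
    """Toy reasoner: return the first attribute that still differs from the goal."""
    for key, value in instruction.items():
        if image.get(key) != value:
            return (key, value)
    return None


def toy_apply(image, step):
    """Toy editor: set a single attribute and return a new image dict."""
    key, value = step
    return {**image, key: value}


start = {"sky": "day", "car": "red"}
goal = {"sky": "sunset", "car": "blue"}
final_image, steps = iterative_edit(start, goal, toy_propose, toy_apply)
```

The point of the structure is that each proposal is made against the current image, not the original one, so a failed or partial edit is visible to the next reasoning step; a single static plan would not get that feedback.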

[94] CLRecogEye : Curriculum Learning towards exploiting convolution features for Dynamic Iris Recognition

Geetanjali Sharma,Gaurav Jaswal,Aditya Nigam,Raghavendra Ramachandra

Main category: cs.CV

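TL;DR: Proposes a new matching framework for iris authentication based on a 3D-CNN and curriculum learning that captures the spatio-temporal structure of iris features, improving robustness and discriminability under rotation, scale changes, specular reflections, and blur.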

Details Motivation: Most existing iris recognition methods rely on point-to-point distance comparisons, do not model the spatio-temporal structure of iris patterns effectively, and are not robust enough to rotation, scale changes, specular reflections, and defocus blur. Method: Splits each iris image along one dimension into a sequence of sub-images that feed a 3D-CNN, capturing spatial and spatio-temporal features; trains the model with a curriculum learning strategy; and optimizes end-to-end with triplet and ArcFace losses to strengthen temporal dependencies and discriminability in the feature space. Result: The method shows stronger feature discriminability and matching robustness under complex disturbances, significantly improving iris recognition performance. Conclusion: By introducing a 3D-CNN and curriculum learning, the framework embeds spatio-temporal dynamics into the iris feature representation, giving a more robust and generalizable solution for iris authentication. Abstract: Iris authentication algorithms have achieved impressive recognition performance, making them highly promising for real-world applications such as border control, citizen identification, and both criminal investigations and commercial systems. However, their robustness is still challenged by variations in rotation, scale, specular reflections, and defocus blur. In addition, most existing approaches rely on straightforward point-to-point comparisons, typically using cosine or L2 distance, without effectively leveraging the spatio-spatial-temporal structure of iris patterns. To address these limitations, we propose a novel and generalized matching pipeline that learns rich spatio-spatial-temporal representations of iris features. Our approach first splits each iris image along one dimension, generating a sequence of sub-images that serve as input to a 3D-CNN, enabling the network to capture both spatial and spatio-spatial-temporal cues. To further enhance the modeling of spatio-spatial-temporal feature dynamics, we train the model in a curriculum manner. This design allows the network to embed temporal dependencies directly into the feature space, improving discriminability in the deep metric domain. The framework is trained end-to-end with triplet and ArcFace loss in a curriculum manner, enforcing highly discriminative embeddings despite challenges like rotation, scale, reflections, and blur. This design yields a robust and generalizable solution for iris authentication. GitHub code: https://github.com/GeetanjaliGTZ/CLRecogEye
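
Two ingredients named in the abstract are simple enough to sketch in isolation: cutting an image into a sequence of sub-images along one dimension, and the standard triplet loss. The code below is a plain-Python illustration only; the strip width, the margin value, and the function names are assumptions, and the 3D-CNN, ArcFace loss, and curriculum schedule are not shown.

```python
import math


def split_into_strips(image, n_strips):
    """Split a 2D image (list of rows) along its width into equal-width
    sub-images, returned left to right as a sequence."""
    width = len(image[0])
    if width % n_strips:
        raise ValueError("width must be divisible by n_strips")
    w = width // n_strips
    return [[row[i * w:(i + 1) * w] for row in image] for i in range(n_strips)]


def l2_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: the anchor should be closer to the positive
    (same identity) than to the negative by at least `margin`."""
    return max(0.0, l2_distance(anchor, positive) - l2_distance(anchor, negative) + margin)


# Toy 2x4 "image" split into two 2x2 strips.
strips = split_into_strips([[1, 2, 3, 4], [5, 6, 7, 8]], n_strips=2)

# Toy embeddings: an easy triplet (zero loss) and a violated one.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Treating the ordered strips as frames is what lets a 3D convolution see relationships between neighboring regions of the iris, and the triplet term then shapes the resulting embedding space so that same-identity samples cluster together.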

[95] Pygmalion Effect in Vision: Image-to-Clay Translation for Reflective Geometry Reconstruction

Gayoung Lee,Junho Kim,Jin-Hwa Kim,Junmo Kim

Main category: cs.CV

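TL;DR: This paper proposes a new framework inspired by the Pygmalion myth that suppresses specular reflections by translating images into clay-like forms, enabling robust 3D reconstruction of objects from multi-view images containing complex reflections.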

Details Motivation: View-dependent reflections entangle appearance with geometry, so understanding reflection has long been a challenge in 3D reconstruction, and existing methods struggle to recover object geometry accurately when complex reflections are present. Method: Proposes a dual-branch network with a BRDF-based reflective branch and a clay-guided branch. Synthesized reflection-free, clay-like images serve as a neutral supervision signal, and the two branches are trained jointly to stabilize geometry and refine surface normals. Result: Experiments on synthetic and real datasets show that the method clearly outperforms existing reflection-handling methods in normal accuracy and mesh completeness. Conclusion: "Seeing by unshining", that is, translating radiance into a neutral representation, can serve as an effective inductive bias for learning the geometry of reflective objects and offers a new way to approach the reflection problem. Abstract: Understanding reflection remains a long-standing challenge in 3D reconstruction due to the entanglement of appearance and geometry under view-dependent reflections. In this work, we present the Pygmalion Effect in Vision, a novel framework that metaphorically "sculpts" reflective objects into clay-like forms through image-to-clay translation. Inspired by the myth of Pygmalion, our method learns to suppress specular cues while preserving intrinsic geometric consistency, enabling robust reconstruction from multi-view images containing complex reflections. Specifically, we introduce a dual-branch network in which a BRDF-based reflective branch is complemented by a clay-guided branch that stabilizes geometry and refines surface normals. The two branches are trained jointly using the synthesized clay-like images, which provide a neutral, reflection-free supervision signal that complements the reflective views. Experiments on both synthetic and real datasets demonstrate substantial improvement in normal accuracy and mesh completeness over existing reflection-handling methods. Beyond technical gains, our framework reveals that seeing by unshining, translating radiance into neutrality, can serve as a powerful inductive bias for reflective object geometry learning.

[96] Scaling Foundation Models for Radar Scene Understanding

Pushkal Mishra,Kshitiz Bansal,Dinesh Bharadia

Main category: cs.CV

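TL;DR: This paper proposes RadarFM, a radar foundation model trained with structured spatial language supervision, which aims to unify scene-level representation learning in radar perception and address the fragmented, task-specific nature of existing methods.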

Details Motivation: Existing radar perception methods are mostly task-specific with scattered architectures and cannot transfer across tasks; meanwhile, the success of foundation models in vision and language has not yet carried over to radar perception. Method: Proposes a structured caption framework that encodes vehicle distributions in native radar coordinates, and designs a hash-aware contrastive learning objective that quantifies continuous scene similarity instead of binary matching, enabling fine-grained spatial reasoning. Uses the CARLA simulator to generate large-scale, richly annotated radar datasets and proposes localization-aware evaluation metrics. Result: Experiments validate RadarFM across diverse driving scenarios, and the new metrics measure spatial accuracy better than traditional detection metrics. Conclusion: With structured spatial language supervision and a new contrastive learning strategy, RadarFM learns unified, transferable radar perception representations and advances radar foundation models. Abstract: Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions. Recent advances in foundation models have transformed visual and language understanding, yet their integration with radar sensing remains largely underexplored. Existing radar approaches are fragmented and task-specific; each downstream task employs distinct architectures and training objectives, preventing transfer across tasks. In this work, we introduce RadarFM: a radar foundation model that learns unified scene-level representations through structured spatial language supervision. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in native radar coordinates, and (2) a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, enabling fine-grained spatial reasoning. Leveraging the CARLA simulator, we generate large-scale, well-annotated radar datasets across diverse driving scenarios. We also propose localization-aware metrics that assess spatial accuracy beyond traditional detection measures.

[97] EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens

Ze Feng,Sen Yang,Boqiang Duan,Wankou Yang,Jingdong Wang

Main category: cs.CV

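TL;DR: This paper proposes EM-KD, a new knowledge distillation paradigm for enhancing efficient multimodal large language models (MLLMs). It aligns teacher and student vision tokens using Manhattan distance and Hungarian matching, and introduces Vision-Language Affinity Distillation (VLAD) and Vision Semantic Distillation (VSD), mitigating the fine-grained comprehension gap caused by unbalanced vision tokens and outperforming prior methods by a large margin in both accuracy and efficiency.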

Details Motivation: Efficient MLLMs lose information when compressing vision tokens, which hurts comprehension; earlier work introduced knowledge distillation but overlooked the fine-grained visual comprehension gap caused by the unbalanced number of vision tokens between teacher and student. Method: Proposes EM-KD, which first computes the Manhattan distance between the vision logits of teacher and student and aligns them in the spatial dimension with the Hungarian algorithm, then applies two distillation strategies: Vision-Language Affinity Distillation (VLAD), which aligns the affinity matrices between text and vision tokens by minimizing a smooth L1 loss, and Vision Semantic Distillation (VSD), which uses reverse KL divergence to measure the difference between the aligned vision logits as distributions over the vocabulary space. Result: EM-KD is validated on multiple benchmarks, where the trained model surpasses prior efficient MLLMs by a large margin in accuracy and efficiency; it also outperforms other distillation methods when they are equipped with the proposed vision token matching strategy for a fair comparison. Conclusion: With an effective vision token alignment mechanism and two distillation strategies, EM-KD substantially improves the visual comprehension of efficient MLLMs and offers a new approach to optimizing multimodal models under resource constraints. Abstract: Efficient Multimodal Large Language Models (MLLMs) compress vision tokens to reduce resource consumption, but the loss of visual information can degrade comprehension capabilities. Although some prior works introduce Knowledge Distillation to enhance student models, they overlook the fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between the efficient student and vanilla teacher. In this paper, we propose EM-KD, a novel paradigm that enhances the Efficient MLLMs with Knowledge Distillation. To overcome the challenge of unbalanced vision tokens, we first calculate the Manhattan distance between the vision logits of teacher and student, and then align them in the spatial dimension with the Hungarian matching algorithm. After alignment, EM-KD introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD) and 2) Vision Semantic Distillation (VSD). Specifically, VLAD calculates the affinity matrix between text tokens and aligned vision tokens, and minimizes the smooth L1 distance of the student and the teacher affinity matrices. Considering the semantic richness of vision logits in the final layer, VSD employs the reverse KL divergence to measure the discrete probability distributions of the aligned vision logits over the vocabulary space. Comprehensive evaluation on diverse benchmarks demonstrates that the EM-KD trained model outperforms prior Efficient MLLMs on both accuracy and efficiency with a large margin, validating its effectiveness. Compared with previous distillation methods, which are equipped with our proposed vision token matching strategy for fair comparison, EM-KD also achieves better performance.
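
The alignment step and the VSD term can be illustrated with a small, dependency-free sketch. It finds the minimum-cost one-to-one match between a few student tokens and a larger set of teacher tokens under Manhattan distance, then evaluates a reverse KL divergence, KL(student || teacher), on one matched pair. The brute-force search over permutations is a stand-in used here only because the example is tiny; it returns the same optimum the Hungarian algorithm would, but it is not the paper's implementation, and all names and toy values are assumptions.

```python
import math
from itertools import permutations


def manhattan(a, b):
    """L1 distance between two logit vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_tokens(student, teacher):
    """For each student token, choose a distinct teacher token so that the
    total Manhattan cost is minimal (exhaustive search, small inputs only)."""
    best_cost, best = math.inf, None
    for perm in permutations(range(len(teacher)), len(student)):
        cost = sum(manhattan(s, teacher[j]) for s, j in zip(student, perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return list(best), best_cost


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) between the softmax distributions."""
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    return sum(s * math.log(s / t) for s, t in zip(p_s, p_t))


# Toy logits: 2 student tokens, 3 teacher tokens (unbalanced on purpose).
student = [[0.0, 1.0], [5.0, 5.0]]
teacher = [[5.0, 4.0], [9.0, 9.0], [0.0, 0.0]]
assignment, total_cost = match_tokens(student, teacher)
vsd_term = reverse_kl(student[0], teacher[assignment[0]])
```

For realistic token counts an exhaustive search is infeasible, and a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment`, which accepts rectangular cost matrices, would be the usual substitute.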

[98] FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain

YuAn Wang,Xiaofan Li,Chi Huang,Wenhao Zhang,Hao Li,Bosheng Wang,Xun Sun,Jun Wang

Main category: cs.CV

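TL;DR: This paper proposes FaithFusion, a framework that fuses 3DGS with diffusion models, using pixel-wise Expected Information Gain (EIG) to achieve controllable driving-scene reconstruction with both geometric fidelity and visual realism. It performs well under large viewpoint shifts and reaches SOTA results on several metrics.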

Details Motivation: In controllable driving-scene reconstruction and 3D scene generation, it is hard to keep geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts; existing methods lack pixel-wise, 3D-consistent editing criteria, which leads to over-restoration and geometric drift. Method: Proposes the FaithFusion framework, which uses pixel-wise Expected Information Gain (EIG) as a unified policy: EIG guides the diffusion model as a spatial prior to refine high-uncertainty regions, and pixel-level weighting distills the edits back into 3D Gaussian Splatting (3DGS), forming a plug-and-play system that needs no extra priors or structural modifications. Result: Experiments on the Waymo dataset show state-of-the-art results on NTA-IoU, NTL-IoU, and FID, with an FID of 107.47 maintained even at a 6-meter lane shift. Conclusion: Through EIG, FaithFusion fuses 3DGS and diffusion models effectively and clearly improves the geometric consistency and visual quality of driving-scene reconstruction without extra conditions or structural changes. Abstract: In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce FaithFusion, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications. Extensive experiments on the Waymo dataset demonstrate that our approach attains SOTA performance across NTA-IoU, NTL-IoU, and FID, maintaining an FID of 107.47 even at 6 meters lane shift. Our code is available at https://github.com/wangyuanbiubiubiu/FaithFusion.

[99] Deformation-aware Temporal Generation for Early Prediction of Alzheimers Disease

Xin Honga,Jie Lin,Minghui Wang

Main category: cs.CV

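TL;DR: Proposes DATGN, a new method for early prediction of Alzheimer's disease that generates future MRI images; it handles incomplete temporal sequences and improves classification accuracy.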

Details Motivation: Existing Alzheimer's disease prediction methods rely on manually extracted morphological features of brain images and struggle with the missing data common in MRI temporal sequences, so a model is needed that learns the morphological changes of disease progression automatically and accepts incomplete input. Method: Proposes the Deformation-Aware Temporal Generative Network (DATGN), which first interpolates incomplete MRI temporal sequences and then uses a bidirectional temporal deformation-aware module to guide the network in generating future MRI images that follow the disease's progression, enabling early prediction. Result: Validation on the ADNI dataset shows that DATGN performs well on the PSNR and MMSE metrics; the generated synthetic data raises the AD vs. NC classification accuracy of SVM, CNN, and 3DCNN classifiers by 6.21% to 16% and the AD vs. MCI vs. NC accuracy by 7.34% to 21.25%; visualizations show that the generated images follow the brain atrophy trend. Conclusion: DATGN effectively models how brain morphology changes as Alzheimer's disease progresses, achieves more accurate early prediction by generating future MRI images, significantly improves downstream classification, and has potential for clinical use. Abstract: Alzheimer's disease (AD), a degenerative brain condition, can benefit from early prediction to slow its progression. As the disease progresses, patients typically undergo brain atrophy. Current prediction methods for Alzheimer's disease largely involve analyzing morphological changes in brain images through manual feature extraction. This paper proposes a novel method, the Deformation-Aware Temporal Generative Network (DATGN), to automate the learning of morphological changes in brain images about disease progression for early prediction. Given the common occurrence of missing data in the temporal sequences of MRI images, DATGN initially interpolates incomplete sequences. Subsequently, a bidirectional temporal deformation-aware module guides the network in generating future MRI images that adhere to the disease's progression, facilitating early prediction of Alzheimer's disease. DATGN was tested for the generation of temporal sequences of future MRI images using the ADNI dataset, and the experimental results are competitive in terms of PSNR and MMSE image quality metrics. Furthermore, when DATGN-generated synthetic data was integrated into the SVM vs. CNN vs. 3DCNN-based classification methods, significant improvements were achieved from 6.21% to 16% in AD vs. NC classification accuracy and from 7.34% to 21.25% in AD vs. MCI vs. NC classification accuracy. The qualitative visualization results indicate that DATGN produces MRI images consistent with the brain atrophy trend in Alzheimer's disease, enabling early disease prediction.

[100] Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models

Changlin Li,Jiawei Zhang,Zeyi Shi,Zongxin Yang,Zhihui Li,Xiaojun Chang

Main category: cs.CV

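TL;DR: This paper proposes EntPruner, an entropy-guided automatic progressive pruning framework for efficiently compressing large-scale vision generative models (such as diffusion and flow models), achieving up to 2.22x inference speedup while preserving generation quality.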

Details Motivation: Pre-trained large-scale vision generative models carry significant parameter redundancy on downstream tasks, so transferring them directly is inefficient, and conventional pruning methods tend to cause mode collapse and reduced generation diversity. Method: Proposes an entropy-guided pruning strategy that uses the data-dependent Conditional Entropy Deviation (CED) as a block importance metric, measuring how far the output distribution departs from the original conditional distribution after a block is removed; also designs a zero-shot adaptive pruning framework that decides dynamically, during training, when and how much to prune. Result: In experiments on DiT and SiT models, EntPruner achieves up to 2.22x inference speedup while keeping generation quality competitive on ImageNet and three downstream datasets, and effectively avoids mode collapse. Conclusion: Through entropy guidance and an adaptive mechanism, EntPruner provides an efficient and stable pruning scheme for diffusion and flow models that balances compression efficiency and generative performance. Abstract: Large-scale vision generative models, including diffusion and flow models, have demonstrated remarkable performance in visual generation tasks. However, transferring these pre-trained models to downstream tasks often results in significant parameter redundancy. In this paper, we propose EntPruner, an entropy-guided automatic progressive pruning framework for diffusion and flow models. First, we introduce entropy-guided pruning, a block-level importance assessment strategy specifically designed for generative models. Unlike discriminative models, generative models require preserving the diversity and condition-fidelity of the output distribution. As the importance of each module can vary significantly across downstream tasks, EntPruner prioritizes pruning of less important blocks using data-dependent Conditional Entropy Deviation (CED) as a guiding metric. CED quantifies how much the distribution diverges from the learned conditional data distribution after removing a block. Second, we propose a zero-shot adaptive pruning framework to automatically determine when and how much to prune during training. This dynamic strategy avoids the pitfalls of one-shot pruning, mitigating mode collapse and preserving model performance. Extensive experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22$\times$ inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets.

[101] AnchorOPT: Towards Optimizing Dynamic Anchors for Adaptive Prompt Learning

Zheng Li,Yibing Song,Xin Zhang,Lei Luo,Xiang Li,Jian Yang

Main category: cs.CV

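TL;DR: Proposes AnchorOPT, a dynamic anchor-based prompt learning framework that learns anchor values from task data and adaptively optimizes the positional relationship between anchors and soft tokens, improving the generalization of CLIP models.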

Details Motivation: Existing CLIP-based prompt learning methods use static textual tokens as anchors, which lack flexibility across tasks and training stages and limit the model's adaptability and generalization. Method: AnchorOPT introduces dynamism in two dimensions: first, anchor values are learned from task-specific data rather than handcrafted; second, a learnable position matrix, conditioned on the training stage and task context, adaptively adjusts the relative positions of anchors and soft tokens. Training has two stages: the anchors are learned and frozen first, then the soft tokens and position matrix are optimized. Result: Experiments show that, using only a simple learnable anchor and position matrix, AnchorOPT matches or exceeds some methods that add extra modules or regularization techniques, and as a plug-and-play module it brings consistent gains across multiple datasets. Conclusion: Through its dynamic anchor mechanism, AnchorOPT makes prompt learning more flexible and adaptive, improves CLIP's performance on downstream tasks, and is broadly applicable and practical. Abstract: Existing prompt learning methods, which are built upon CLIP models, leverage textual tokens as anchors to guide the learnable soft tokens. This guidance improves CLIP generalizations. However, these anchors, static in both value and position, lack cross-task and stage-adaptive flexibility. To address this limitation, we propose AnchorOPT, a dynamic anchor-based prompt learning framework. Specifically, AnchorOPT introduces dynamism in two key dimensions: (i) anchor values eschew handcrafted explicit textual tokens (e.g., "shape", "color"), instead learning dynamically from task-specific data; and (ii) the positional relationship between anchor and soft tokens is no longer fixed but adaptively optimized via a learnable position matrix conditioned on the training stage and task context. Training occurs in two stages: we first learn the anchor tokens, then freeze and transfer them to the second stage for optimization of soft tokens and the position matrix. Extensive experiments demonstrate that using only a simple learnable anchor and position matrix achieves performance comparable to or exceeding some methods incorporating additional learnable modules or regularization techniques. As a plug-and-play module, AnchorOPT integrates seamlessly into existing frameworks, yielding consistent performance gains across diverse datasets. Code is publicly available at https://github.com/zhengli97/ATPrompt.

[102] CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion

Dianbing Xi,Jiepeng Wang,Yuanzhi Liang,Xi Qiu,Jialun Liu,Hao Pan,Yuchi Huo,Rui Wang,Haibin Huang,Chi Zhang,Xuelong Li

Main category: cs.CV

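TL;DR: This paper proposes CtrlVDiff, a unified diffusion model that combines multiple graphics modalities (such as depth, normals, segmentation, edges, and material properties) to improve video understanding and controllable generation, using a hybrid modality control strategy to achieve fine-grained editing with high temporal consistency.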

Details Motivation: Relying only on geometric cues (such as depth and edges) for video generation cannot accurately express appearance, materials, and illumination, which limits editing and tends to cause temporal drift; more graphics modalities are needed to supply richer physical constraints. Method: Proposes the CtrlVDiff model, which adopts a Hybrid Modality Control Strategy (HMCS), supports any subset of multimodal inputs, and includes mechanisms that keep it robust to missing inputs and temporally consistent during generation; builds the MMVideo dataset, which combines real and synthetic videos with per-pixel multimodal annotations for training. Result: On video understanding and generation tasks, CtrlVDiff outperforms existing methods in controllability and generation quality, supports layer-wise edits (such as relighting, material replacement, and object insertion), and remains stable when some modalities are missing. Conclusion: Introducing multimodal information such as graphics intrinsics within a unified diffusion framework effectively improves video understanding and controllable generation, offering a more physically meaningful and temporally coherent solution for complex video editing. Abstract: We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold: geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations. We then propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions. Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.

[103] Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis

Jiyun Bae,Hyunjong Ok,Sangwoo Mo,Jaeho Lee

Main category: cs.CV

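TL;DR: This study introduces Idis, an image question-answering dataset containing visual distractors, to examine how distracting information affects test-time scaling in vision-language models, and finds that visual distractors differ fundamentally from textual ones.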

Details Motivation: To investigate whether visual distractors in multimodal settings cause an inverse scaling effect like the one seen in text models, and so to understand how VLMs reason when faced with irrelevant information. Method: Builds Idis, a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions, and analyzes model performance and reasoning-trace characteristics under these distractors. Result: Visual distractors lower accuracy without increasing reasoning length, an inverse scaling pattern that differs from that of textual distractors; tracking attribute counts within reasoning traces helps explain how distractors, reasoning length, and accuracy relate; the trend is also confirmed on visual bias benchmarks such as Waterbirds. Conclusion: Visual distractors affect VLMs through a different mechanism than textual distractors, and a simple prompting strategy is proposed to mitigate bias-driven predictions. Abstract: How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.

[104] DeepRFTv2: Kernel-level Learning for Image Deblurring

Xintian Mao,Haofei Song,Yin-Nian Liu,Qingli Li,Yan Wang

Main category: cs.CV

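TL;DR: Proposes the Fourier Kernel Estimator (FKE), which turns convolution into multiplication in the frequency domain to learn the kernel-level blur process with low complexity and no extra supervision; combined with a decoupled multi-scale architecture, it reaches SOTA performance on motion deblurring.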

Details Motivation: Existing deep networks are confined to pixel-level learning, cannot truly understand the process that produces blur, and lack an effective model of the kernel-level blur mechanism. Method: Proposes the Fourier Kernel Estimator (FKE), which applies the activation operation in the frequency domain and converts spatial-domain convolution into frequency-domain multiplication; convolves the kernel with network-extracted features rather than the raw image to learn the blur process better; designs a decoupled multi-scale architecture with a reversible strategy that makes multi-scale feature extraction more efficient and lowers memory use. Result: Achieves state-of-the-art motion deblurring on multiple datasets; the kernel estimator learns physically meaningful blur kernels, and the model extends well to other kernel-related tasks. Conclusion: Through frequency-domain kernel estimation, FKE models the essence of blur so that the network learns the kernel-level blur process, which significantly improves deblurring performance while remaining efficient and showing potential to generalize. Abstract: It is well-known that if a network aims to learn how to deblur, it should understand the blur process. Blurring is naturally caused by the convolution of the sharp image with the blur kernel. Thus, allowing the network to learn the blur process at the kernel level can significantly improve the image deblurring performance. But current deep networks are still at the pixel-level learning stage, either performing end-to-end pixel-level restoration or stage-wise pseudo kernel-level restoration, failing to enable the deblur model to understand the essence of the blur. To this end, we propose Fourier Kernel Estimator (FKE), which considers the activation operation in Fourier space and converts the convolution problem in the spatial domain to a multiplication problem in Fourier space. Our FKE, jointly optimized with the deblur model, enables the network to learn the kernel-level blur process with low complexity and without any additional supervision. Furthermore, we change the convolution object of the kernel from "image" to network extracted "feature", whose rich semantic and structural information is more suitable to blur process learning. With the convolution of the feature and the estimated kernel, our model can learn the essence of blur at the kernel level. To further improve the efficiency of feature extraction, we design a decoupled multi-scale architecture with multiple hierarchical sub-unets with a reversible strategy, which allows better multi-scale encoding and decoding in low training memory. Extensive experiments indicate that our method achieves state-of-the-art motion deblurring results and shows potential for handling other kernel-related problems. Analysis also shows our kernel estimator is able to learn physically meaningful kernels. The code will be available at https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur.
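The identity the FKE abstract relies on, that spatial-domain convolution becomes element-wise multiplication in Fourier space, can be checked numerically. The following is a minimal standard-library sketch on a 1D signal with circular convolution; it is not the authors' estimator, and the names `dft`, `circular_conv`, `fourier_conv`, `sharp`, and `blur_kernel` are hypothetical, chosen only for illustration.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def circular_conv(signal, kernel):
    """Blur in the spatial domain: circular convolution of two equal-length lists."""
    n = len(signal)
    return [sum(signal[(i - j) % n] * kernel[j] for j in range(n)) for i in range(n)]

def fourier_conv(signal, kernel):
    """Same blur via the convolution theorem: multiply spectra, then invert."""
    spectrum = [a * b for a, b in zip(dft(signal), dft(kernel))]
    return [v.real for v in dft(spectrum, inverse=True)]

# Toy "sharp image" (1D) and a blur kernel zero-padded to the same length.
sharp = [0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0]
blur_kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]

spatial = circular_conv(sharp, blur_kernel)
spectral = fourier_conv(sharp, blur_kernel)
max_gap = max(abs(a - b) for a, b in zip(spatial, spectral))
print(f"largest difference between the two blurs: {max_gap:.2e}")
```

The two results agree up to floating-point error. With a fast Fourier transform in place of the naive `dft`, the frequency-domain route costs O(n log n) rather than O(n^2), which is one plausible reading of the "low complexity" the abstract mentions.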

[105] G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

Wenbo Hu,Jingli Lin,Yilin Long,Yunlong Ran,Lihan Jiang,Yifan Wang,Chenming Zhu,Runsen Xu,Tai Wang,Jiangmiao Pang

Main category: cs.CV

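TL;DR: G$^2$VLM is a geometry-grounded vision-language model that learns 3D visual geometry features to improve spatial understanding and reasoning; it trains on multi-view image and video data without extra annotations and performs strongly on 3D reconstruction and spatial understanding tasks.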

Details Motivation: Existing vision-language models lack robustness in spatial intelligence and perform poorly on spatial understanding and reasoning tasks, mainly because they lack a visual geometry learning process that reconstructs 3D space from 2D images. Method: Proposes G$^2$VLM, which integrates 3D visual geometry features into the vision-language framework and, through in-context learning and interleaved reasoning, directly predicts 3D attributes and strengthens spatial reasoning; it trains on multi-view image and video data while also drawing on 3D priors that are otherwise hard to obtain. Result: Experiments show that G$^2$VLM matches state-of-the-art feed-forward models on 3D reconstruction and is better than or competitive with existing methods across spatial understanding and reasoning tasks. Conclusion: G$^2$VLM unifies a semantically strong vision-language model with low-level 3D vision tasks, gives the community a strong baseline, and may enable future applications such as 3D scene editing. Abstract: Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.

[106] Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning

Changlin Li,Jiawei Zhang,Shuhao Liu,Sihao Lin,Zeyi Shi,Zhihui Li,Xiaojun Chang

Main category: cs.CV

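TL;DR: Proposes Ent-Prog, an efficient training framework for diffusion models in human video generation that uses Conditional Entropy Inflation and an adaptive progressive schedule to substantially cut training time and GPU memory use.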

Details Motivation: Training diffusion models on high-resolution, multi-frame data is computationally expensive and memory-intensive, so a more efficient training method is needed. Method: Introduces Conditional Entropy Inflation (CEI) to assess the importance of different model components, and adopts an adaptive progressive training schedule that raises computational complexity dynamically according to convergence efficiency. Result: Experiments on three datasets show that Ent-Prog achieves up to 2.2x training speedup and 2.4x GPU memory reduction without sacrificing generative performance. Conclusion: Ent-Prog is an effective training framework that markedly improves training efficiency and lowers resource use while preserving generation quality. Abstract: Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that adaptively increases computational complexity during training by measuring the convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets demonstrate the effectiveness of Ent-Prog, achieving up to 2.2$\times$ training speedup and 2.4$\times$ GPU memory reduction without compromising generative performance.

[107] Referring Video Object Segmentation with Cross-Modality Proxy Queries

Baoli Sun,Xinzhu Ma,Ning Wang,Zhihui Wang,Zhiyong Wang

Main category: cs.CV

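TL;DR: This paper proposes ProxyFormer, a new Referring Video Object Segmentation (RVOS) architecture that introduces proxy queries to strengthen the alignment of visual and textual semantics and propagates them dynamically during video feature encoding, improving inter-frame dependency modeling and target tracking accuracy.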

Details Motivation: Existing methods do not model inter-frame dependencies when handling cross-modality alignment and introduce textual constraints too late, which makes it hard to track the target object accurately. Method: Proposes ProxyFormer, which uses proxy queries to integrate visual and textual semantics and propagates them through a multi-stage video feature encoder; decouples cross-modality interaction into temporal and spatial dimensions to reduce computational cost; and designs a Joint Semantic Consistency (JSC) training strategy to align semantics. Result: Experiments on four mainstream RVOS benchmarks show that ProxyFormer outperforms existing state-of-the-art methods. Conclusion: With dynamically updated proxy queries, ProxyFormer improves cross-modality semantic alignment and inter-frame consistency, and markedly raises the accuracy and robustness of RVOS. Abstract: Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism built upon transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features to focus on the non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of our ProxyFormer to the state-of-the-art methods.

[108] TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models

Jiaming He,Guanyu Hou,Hongwei Li,Zhicong Huang,Kangjie Chen,Yi Yu,Wenbo Jiang,Guowen Xu,Tianwei Zhang

Main category: cs.CV

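TL;DR: This paper proposes TEAR, a temporal-aware automated red-teaming framework for uncovering safety risks tied to temporal dynamics in text-to-video (T2V) models; it generates stealthy adversarial prompts through two-stage optimization, and experiments show an attack success rate above 80%, well ahead of prior methods.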

Details Motivation: Existing safety evaluation methods mainly target static image and text generation and cannot capture the complex temporal dynamics of video generation, so an evaluation framework aimed specifically at the temporal safety risks of T2V models is needed. Method: Proposes the TEAR framework, which uses a temporal-aware test generator optimized in two stages, initial generator training and temporal-aware online preference learning, to produce prompts that are textually harmless yet elicit policy-violating video output, and introduces a refine model that cyclically improves the prompts' stealthiness and adversarial effectiveness. Result: In experiments on multiple open-source and commercial T2V systems, TEAR achieves an attack success rate above 80%, a significant improvement over the previous best result (57%). Conclusion: TEAR effectively exposes safety vulnerabilities in the temporal dynamics of T2V models, highlights the shortcomings of existing models in controlling the safety of dynamic content, and offers useful guidance for designing safer video generation models. Abstract: Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose a TEmporal-aware Automated Red-teaming framework, named TEAR, an automated framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach: initial generator training and temporal-aware online preference learning, to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. A refine model is also adopted to improve the prompt stealthiness and adversarial effectiveness cyclically. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems with over 80% attack success rate, a significant boost from prior best result of 57%.

[109] LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs

Shichu Sun,Yichen Zhang,Haolin Song,Zonghao Guo,Chi Chen,Yidan Zhang,Yuan Yao,Zhiyuan Liu,Maosong Sun

Main category: cs.CV

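TL;DR: This paper proposes LLaVA-UHD v3, a multimodal large language model built around Progressive Visual Compression (PVC), which substantially lowers inference latency while keeping performance high.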

Details Motivation: To study the current trend in multimodal models of favoring global native-resolution visual encoding over slice-based methods, and to address the high computational overhead it brings. Method: Proposes the PVC method, consisting of a refined patch embedding and a hierarchical windowed token compression module, which can be integrated into a standard ViT for efficient encoding. Result: ViT-UHD performs well on multiple benchmarks and reduces time-to-first-token (TTFT) by 2.4x compared with MoonViT; LLaVA-UHD v3 performs on par with Qwen2-VL while further reducing TTFT by 1.9x. Conclusion: PVC effectively balances efficiency and performance in multimodal models and offers a practical path to efficient MLLMs. Abstract: Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Many recent MLLMs increasingly favor global native-resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual modeling, (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates competitive performance with MoonViT while reducing TTFT (time-to-first-token) by 2.4x, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves competitive performance to Qwen2-VL, while further reducing TTFT by 1.9x. We will release all code and checkpoints to support future research on efficient MLLMs.

[110] Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation

Joonhyung Park,Hyeongwon Jang,Joowon Kim,Eunho Yang

Main category: cs.CV

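TL;DR: Proposes GridAR, a test-time scaling framework for visual autoregressive models that uses grid-partitioned progressive generation and layout-specified prompt reformulation to improve text-to-image quality while lowering cost.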

Details Motivation: Existing test-time compute scaling methods (such as Best-of-N) work poorly for visual autoregressive models, because they lack a global plan of the canvas and waste computation on erroneous generation trajectories. Method: Introduces the GridAR framework, which uses a grid-partitioned progressive generation scheme that prunes infeasible candidates early and fixes viable ones as anchors to guide subsequent decoding, combined with a layout-specified prompt reformulation strategy that infers a feasible layout from partial views to guide generation. Result: With N=4, GridAR outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%; on image editing with PIE-Bench, semantic preservation improves by 13.9%. Conclusion: GridAR effectively improves the generation quality and efficiency of visual autoregressive models under test-time scaling and generalizes well across tasks. Abstract: Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, GridAR achieves higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.

[111] Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding

Yutao Tang,Cheng Zhao,Gaurav Mittal,Rohith Kukkala,Rama Chellappa,Cheng Peng,Mei Chen

Main category: cs.CV

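TL;DR: This paper proposes NDTokenizer3D, a generalist 3D vision-language model built on a multi-scale NDT representation; a novel three-stage scene tokenization pipeline gives it fine-grained 3D scene understanding and notable gains across several tasks.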

Details Motivation: Existing 3D vision-language models still struggle to tokenize 3D scenes into holistic scene tokens and to apply those tokens across diverse tasks; this paper aims to build a unified, general framework that connects language reasoning with 3D spatial understanding. Method: Proposes NDTokenizer3D, which comprises a multi-scale NDT representation and a Multi-Scale NDT Decoder (MSDec); it first builds the multi-scale NDT representation from raw point clouds, then uses MSDec to progressively fuse cross-scale features into scene tokens usable by an LLM, and reuses MSDec to support interactive prompting and segmentation decoding. Result: Achieves notable performance gains on 3D referring segmentation, 3D visual question answering, and 3D dense captioning, demonstrating the model's fine-grained understanding and generality. Conclusion: With a unified architecture, NDTokenizer3D achieves efficient 3D scene tokenization and multi-task support, offering a compact and powerful solution for 3D vision-language understanding. Abstract: Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.

[112] When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models

Hui Lu,Yi Yu,Yiming Yang,Chenyu Yi,Qixin Zhang,Bingquan Shen,Alex C. Kot,Xudong Jiang

Main category: cs.CV

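TL;DR: This paper proposes UPA-RFAS, a universal, transferable adversarial patch attack framework for Vision-Language-Action (VLA) models that attacks effectively across models, tasks, and viewpoints under unknown architectures, fine-tuned variants, and sim-to-real transfer.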

Details Motivation: Most existing adversarial patches overfit to a single model and lack universality and transferability in black-box settings, which limits thorough evaluation of VLA system security. Method: Proposes the UPA-RFAS framework, which combines a feature-shift objective in a shared feature space (with an ℓ₁ deviation prior and a repulsive InfoNCE loss), a robustness-augmented two-phase min-max optimization (the inner loop generates invisible sample-wise perturbations and the outer loop optimizes the universal patch), and two VLA-specific losses, Patch Attention Dominance and Patch Semantic Misalignment, which manipulate text-to-vision attention and induce image-text mismatch. Result: UPA-RFAS was validated on a variety of VLA models, manipulation suites, and physical experiments; the results show good transferability and attack effectiveness across models, tasks, and viewpoints, and the method also works in the physical world. Conclusion: UPA-RFAS reveals a practical patch-based attack surface for VLA-driven robots and establishes a strong baseline for future research on defenses. Abstract: Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of universal, transferable adversarial patches against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce UPA-RFAS (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an $\ell_1$ deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text$\to$vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.

[113] You Can Trust Your Clustering Model: A Parameter-free Self-Boosting Plug-in for Deep Clustering

Hanyang Li,Yuheng Jia,Hui Liu,Junhui Hou

Main category: cs.CV

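TL;DR: This paper proposes DCBoost, a parameter-free plug-in that strengthens the global feature structure of deep clustering models by exploiting the consistency of local structure, improving clustering performance.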

Details Motivation: Existing deep clustering methods suffer from a mismatch between global and local feature structures: local structures are compact, while global structures have blurred boundaries and poor separation. Method: Selects high-confidence samples as reliable anchors through adaptive k-nearest-neighbor consistency filtering, then computes a discriminative loss on these samples to optimize the network, promoting intra-class compactness and inter-class separability. Result: Experiments on multiple benchmark datasets show that DCBoost significantly improves a variety of deep clustering models, raising the performance of current state-of-the-art methods (such as ProPos) by more than 3% and the silhouette coefficient by more than 7x. Conclusion: DCBoost is an effective plug-and-play module that markedly improves the global feature structure in deep clustering and raises overall clustering performance. Abstract: Recent deep clustering models have produced impressive clustering performance. However, a common issue with existing methods is the disparity between global and local feature structures. While local structures typically show strong consistency and compactness within class samples, global features often present intertwined boundaries and poorly separated clusters. Motivated by this observation, we propose DCBoost, a parameter-free plug-in designed to enhance the global feature structures of current deep clustering models. By harnessing reliable local structural cues, our method aims to elevate clustering performance effectively. Specifically, we first identify high-confidence samples through adaptive $k$-nearest neighbors-based consistency filtering, aiming to select a sufficient number of samples with high label reliability to serve as trustworthy anchors for self-supervision. Subsequently, these samples are utilized to compute a discriminative loss, which promotes both intra-class compactness and inter-class separability, to guide network optimization. Extensive experiments across various benchmark datasets showcase that our DCBoost significantly improves the clustering performance of diverse existing deep clustering models. Notably, our method improves the performance of current state-of-the-art baselines (e.g., ProPos) by more than 3% and amplifies the silhouette coefficient by over $7\times$. Code is available at .
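The k-nearest-neighbor consistency filtering described for DCBoost can be pictured with a small standard-library sketch. It is a simplification under stated assumptions, not the authors' procedure: the abstract calls the filtering adaptive and the plug-in parameter-free, whereas this sketch uses a fixed `k` and a fixed `min_agreement` threshold, and the function names and toy data are hypothetical.

```python
import math

def knn_consistency(features, labels, k):
    """For each sample, return the fraction of its k nearest neighbours
    (Euclidean distance) that share its pseudo-label."""
    scores = []
    for i, fi in enumerate(features):
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        agree = sum(labels[j] == labels[i] for j in neighbours)
        scores.append(agree / k)
    return scores

def select_anchors(features, labels, k=2, min_agreement=1.0):
    """Keep samples whose neighbourhood agrees with their pseudo-label.
    Assumption: fixed k and threshold stand in for the adaptive rule."""
    scores = knn_consistency(features, labels, k)
    return [i for i, s in enumerate(scores) if s >= min_agreement]

# Toy features: two tight clusters plus one sample (index 6) whose
# pseudo-label disagrees with the cluster it sits next to.
features = [(0, 0), (0.2, 0), (0, 0.2), (5, 5), (5.2, 5), (5, 5.2), (0.3, 0.3)]
labels = [0, 0, 0, 1, 1, 1, 1]

anchors = select_anchors(features, labels)
print(anchors)
```

Only the samples whose neighbors agree with their pseudo-label survive as anchors; the inconsistent sample is left out, so a discriminative loss computed on the anchors is not pulled toward an unreliable label.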

[114] BotaCLIP: Contrastive Learning for Botany-Aware Representation of Earth Observation Data

Selene Cerna,Sara Si-Moussi,Wilfried Thuiller,Hadrien Hendrikx,Vincent Miele

Main category: cs.CV

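TL;DR: This paper proposes BotaCLIP, a lightweight multimodal contrastive framework that injects domain knowledge (ecological structure in particular) into a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with plant community survey data, and it outperforms baseline methods on several ecological tasks.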

Details Motivation: Real-world applications such as biodiversity modeling require injecting domain expertise into foundation models to improve representation learning in data-scarce settings, while avoiding training from scratch or high computational cost. Method: Proposes the BotaCLIP framework, which uses lightweight multimodal contrastive learning to align the pre-trained DOFA model with botanical relevé data and introduces a regularization strategy to mitigate catastrophic forgetting, so that ecological structure is injected while the original model's capabilities are retained. Result: On three ecological tasks (plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation), BotaCLIP representations outperform DOFA and supervised baselines, showing stronger transferability. Conclusion: Domain-aware adaptation of foundation models can bring expert knowledge into data-scarce settings and enable efficient, frugal representation learning, offering a practical route for applying foundation models in specialized domains. Abstract: Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relevés. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the resulting embeddings serve as transferable representations for downstream predictors. Motivated by real-world applications in biodiversity modeling, we evaluated BotaCLIP representations in three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation. The results showed consistent improvements over those derived from DOFA and supervised baselines. More broadly, this work illustrates how domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings, enabling frugal representation learning.
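As a rough illustration of the contrastive alignment mentioned in the BotaCLIP abstract, the sketch below computes a one-directional InfoNCE-style loss between paired image and survey embeddings. This is a generic contrastive objective, not necessarily the loss used in the paper; the function name `info_nce`, the temperature value, and the toy embeddings are assumptions, and the paper's regularization term against catastrophic forgetting is not modeled because the abstract does not specify it.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def info_nce(image_emb, survey_emb, temperature=0.1):
    """Mean InfoNCE loss where row i of each list forms the positive pair.

    Hypothetical helper: a generic contrastive loss, not the paper's exact objective.
    """
    img = [_normalize(v) for v in image_emb]
    srv = [_normalize(v) for v in survey_emb]
    total = 0.0
    for i, a in enumerate(img):
        logits = [_dot(a, b) / temperature for b in srv]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(img)


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(aligned, aligned))  # small loss: positives are the closest pairs
print(info_nce(aligned, swapped))  # large loss: positives are the farthest pairs
```

The loss is low when each aerial-image embedding is closest to its own survey embedding and high when the pairing is scrambled, which is the signal that pulls the two modalities into a shared space.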

[115] Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition

Baoli Sun,Yihan Wang,Xinzhu Ma,Zhihui Wang,Kun Lu,Zhiyong Wang

Main category: cs.CV

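TL;DR: Proposes the Action-Region Tracking (ART) framework, which uses a query-response mechanism to discover and track the dynamics of local regions in fine-grained actions, improving fine-grained action recognition.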

Details Motivation: Existing methods struggle to capture the subtle differences between fine-grained action categories within local spatio-temporal regions, which limits recognition performance. Method: Designs a region-specific semantic activation module that uses discriminative and text-constrained semantics as queries to capture the most relevant region responses in each frame; builds action tracklets to characterize region dynamics across frames, and optimizes the tracklet representation with a multi-level tracklet contrastive constraint and a task-specific fine-tuning mechanism. Result: Outperforms previous state-of-the-art methods on several mainstream action recognition benchmarks. Conclusion: The ART framework improves the modeling of local detail in fine-grained action recognition; by combining the textual semantics of vision-language models with spatio-temporal region tracking, it distinguishes actions more precisely. Abstract: Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions evolving over time. In this work, we introduce the Action-Region Tracking (ART) framework, a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details, enabling effective distinction of similar actions. Specifically, we propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction among spatial and temporal dimensions with corresponding video features. The captured region responses are organized into action tracklets, which characterize region-based action dynamics by linking related responses across video frames in a coherent sequence. The text-constrained queries encode nuanced semantic representations derived from textual descriptions of action labels extracted by language branches within Visual Language Models (VLMs). To optimize the action tracklets, we design a multi-level tracklet contrastive constraint among region responses at spatial and temporal levels, enabling effective discrimination within each frame and correlation between adjacent frames. Additionally, a task-specific fine-tuning mechanism refines textual semantics such that semantic representations encoded by VLMs are preserved while optimized for task preferences. Comprehensive experiments on widely used action recognition benchmarks demonstrate the superiority of ART over previous state-of-the-art baselines.

[116] From Diffusion to One-Step Generation: A Comparative Study of Flow-Based Models with Application to Image Inpainting

Umang Agarwal,Rudraksh Sangore,Sumit Laddha

Main category: cs.CV

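TL;DR: This study systematically compares three generative models (DDPM, CFM, and MeanFlow). Experiments on CIFAR-10 with a unified TinyUNet architecture show that CFM performs best, while MeanFlow offers a significant inference-speed advantage with single-step generation; CFM extended to image inpainting also shows substantial gains.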

Details Motivation: To compare the performance of different generative models under identical conditions and explore their potential for image generation and inpainting. Method: Implements DDPM, CFM, and MeanFlow with a unified TinyUNet architecture and evaluates FID on CIFAR-10; extends CFM to image inpainting with a mask-guided sampling strategy and experiments on four mask types. Result: CFM reaches an FID of 24.15 with 50 sampling steps, far better than DDPM's 402.98; MeanFlow reaches an FID of 29.15 with single-step generation, cutting inference time by a factor of 50; on inpainting, PSNR rises from 4.95 to 8.57 dB and SSIM from 0.289 to 0.418. Conclusion: CFM gives the best generation quality and MeanFlow has a clear efficiency advantage; the fine-tuned CFM is markedly effective on inpainting, confirming its application potential. Abstract: We present a comprehensive comparative study of three generative modeling paradigms: Denoising Diffusion Probabilistic Models (DDPM), Conditional Flow Matching (CFM), and MeanFlow. While DDPM and CFM require iterative sampling, MeanFlow enables direct one-step generation by modeling the average velocity over time intervals. We implement all three methods using a unified TinyUNet architecture (<1.5M parameters) on CIFAR-10, demonstrating that CFM achieves an FID of 24.15 with 50 steps, significantly outperforming DDPM (FID 402.98). MeanFlow achieves FID 29.15 with single-step sampling -- a 50X reduction in inference time. We further extend CFM to image inpainting, implementing mask-guided sampling with four mask types (center, random bbox, irregular, half). Our fine-tuned inpainting model achieves substantial improvements: PSNR increases from 4.95 to 8.57 dB on center masks (+73%), and SSIM improves from 0.289 to 0.418 (+45%), demonstrating the effectiveness of inpainting-aware training.
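The relative gains quoted in the abstract (+73% PSNR, +45% SSIM) follow directly from the before and after values. The short check below reproduces them; the helper name `relative_gain` is introduced only for this illustration.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Values reported in the abstract for center masks.
psnr_gain = relative_gain(4.95, 8.57)
ssim_gain = relative_gain(0.289, 0.418)

print(f"PSNR gain: {psnr_gain:.1f}%")
print(f"SSIM gain: {ssim_gain:.1f}%")
```

Both results round to the figures given in the abstract. Note that the PSNR gain is a ratio of decibel values, which is how the abstract reports it, not a ratio of the underlying error power.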

[117] T3-Tracer: A Tri-level Temporal-Aware Framework for Audio Forgery Detection and Localization

Shuhan Xia,Xuannan Liu,Xing Cui,Peipei Li

Main category: cs.CV

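TL;DR: This paper proposes T3-Tracer, the first framework to jointly analyze audio at the frame, segment, and audio levels for detecting partial audio forgery. It comprises two modules, FA-FAM and SMDAM, and achieves state-of-the-art performance on three datasets.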

Details Motivation: Existing methods only detect independently whether a single frame is forged and lack hierarchical modeling of transient and sustained anomalies across multiple temporal scales, making it hard to handle the tampering of local key frames in partial audio forgery. Method: Proposes the T3-Tracer framework, comprising the FA-FAM module, which fuses frame-level and audio-level temporal information to assess frame authenticity, and the SMDAM module, which uses a dual-branch structure to model frame features and inter-frame differences over multi-scale temporal windows to detect forgery boundaries. Result: Experiments on three challenging datasets show that T3-Tracer outperforms existing methods on partial audio forgery detection and reaches state-of-the-art performance. Conclusion: Through multi-level, multi-scale joint modeling, T3-Tracer improves partial audio forgery detection, particularly in capturing forgery boundaries and global semantic inconsistencies. Abstract: Recently, partial audio forgery has emerged as a new form of audio manipulation. Attackers selectively modify partial but semantically critical frames while preserving the overall perceptual authenticity, making such forgeries particularly difficult to detect. Existing methods focus on independently detecting whether a single frame is forged, lacking the hierarchical structure to capture both transient and sustained anomalies across different temporal levels. To address these limitations, we identify three key levels relevant to partial audio forgery detection and present T3-Tracer, the first framework that jointly analyzes audio at the frame, segment, and audio levels to comprehensively detect forgery traces. T3-Tracer consists of two complementary core modules: the Frame-Audio Feature Aggregation Module (FA-FAM) and the Segment-level Multi-Scale Discrepancy-Aware Module (SMDAM). FA-FAM is designed to detect the authenticity of each audio frame. It combines both frame-level and audio-level temporal information to detect intra-frame forgery cues and global semantic inconsistencies. To further refine and correct frame detection, we introduce SMDAM to detect forgery boundaries at the segment level. It adopts a dual-branch architecture that jointly models frame features and inter-frame differences across multi-scale temporal windows, effectively identifying abrupt anomalies that appear at the forged boundaries. Extensive experiments conducted on three challenging datasets demonstrate that our approach achieves state-of-the-art performance.

[118] FIELDS: Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision

Chen Ling,Henglin Shi,Hedvig Kjellström

Main category: cs.CV

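TL;DR: Proposes FIELDS, which uses direct 3D expression parameter supervision and an emotion recognition branch to produce high-fidelity, emotion-rich 3D face reconstructions from a single image while preserving subtle emotional cues.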

Details Motivation: Existing 3D face reconstruction methods rely on 2D supervision and lack 3D ground truth, so they struggle to capture subtle facial expression details and lose emotional information. Method: Proposes FIELDS, which combines self-supervised 2D image consistency with direct 3D expression parameter supervision and an intensity-aware emotion recognition loss; authentic expression parameters from spontaneous 4D facial scans guide the encoder, and an auxiliary emotion recognition branch strengthens the modeling of emotional content. Result: Achieves high-fidelity 3D face reconstruction from a single image and significantly improves in-the-wild facial expression recognition while preserving naturalness and mitigating expression-intensity bias. Conclusion: The dual-supervision strategy of FIELDS bridges the 2D/3D domain gap and yields more realistic, emotion-rich, and detail-accurate 3D facial expression reconstruction. Abstract: Facial expressions convey the bulk of emotional information in human communication, yet existing 3D face reconstruction methods often miss subtle affective details due to reliance on 2D supervision and lack of 3D ground truth. We propose FIELDS (Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision) to address these limitations by extending self-supervised 2D image consistency cues with direct 3D expression parameter supervision and an auxiliary emotion recognition branch. Our encoder is guided by authentic expression parameters from spontaneous 4D facial scans, while an intensity-aware emotion loss encourages the 3D expression parameters to capture genuine emotion content without exaggeration. This dual-supervision strategy bridges the 2D/3D domain gap and mitigates expression-intensity bias, yielding high-fidelity 3D reconstructions that preserve subtle emotional cues. From a single image, FIELDS produces emotion-rich face models with highly realistic expressions, significantly improving in-the-wild facial expression recognition performance without sacrificing naturalness.

[119] Shift-Equivariant Complex-Valued Convolutional Neural Networks

Quentin Gabot,Teck-Yian Lim,Jérémy Fix,Joana Frontera-Pons,Chengfang Ren,Jean-Philippe Ovarlez

Main category: cs.CV

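TL;DR: This paper extends Learnable Polyphase Sampling (LPS) to complex-valued neural networks and adds a projection layer from the complex to the real domain before the Gumbel Softmax, to achieve shift invariance and equivariance in classification, reconstruction, and semantic segmentation tasks, with application to polarimetric SAR images.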

Details Motivation: Downsampling and upsampling operations break shift equivariance and invariance in conventional convolutional neural networks, which therefore lack theoretical guarantees; the problem needs to be addressed systematically through architectural design. Method: Extends LPS to complex-valued neural networks, introduces a projection layer from the complex to the real domain, and combines it with Gumbel Softmax for differentiable sampling. Result: Classification, reconstruction, and semantic segmentation experiments on polarimetric SAR images confirm that the method improves shift invariance and equivariance. Conclusion: The proposed complex-valued LPS structure offers an effective route to theoretically guaranteed shift equivariance and invariance, and is particularly suited to complex-valued data that require phase information, such as SAR images. Abstract: Convolutional neural networks have shown remarkable performance in recent years on various computer vision problems. However, the traditional convolutional neural network architecture lacks a critical property: shift equivariance and invariance, broken by downsampling and upsampling operations. Although data augmentation techniques can help the model learn the latter property empirically, a consistent and systematic way to achieve this goal is by designing downsampling and upsampling layers that theoretically guarantee these properties by construction. Adaptive Polyphase Sampling (APS) introduced the cornerstone for shift invariance, later extended to shift equivariance with Learnable Polyphase up/downsampling (LPS) applied to real-valued neural networks. In this paper, we extend the work on LPS to complex-valued neural networks both from a theoretical perspective and with a novel building block of a projection layer from $\mathbb{C}$ to $\mathbb{R}$ before the Gumbel Softmax. We finally evaluate this extension on several computer vision problems, specifically for either the invariance property in classification tasks or the equivariance property in both reconstruction and semantic segmentation problems, using polarimetric Synthetic Aperture Radar images.
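To make the polyphase idea concrete, here is a minimal 1-D sketch in the spirit of Adaptive Polyphase Sampling (APS), applied to a complex-valued signal. It selects the polyphase component with the largest energy, using the squared magnitude as a real-valued score. This is the fixed max-energy rule, not the learnable LPS selection with Gumbel Softmax that the paper extends; the function names and the toy signal are assumptions made for illustration.

```python
def polyphase_components(x, stride=2):
    """Split a sequence into its `stride` polyphase components."""
    return [x[k::stride] for k in range(stride)]


def aps_downsample(x, stride=2):
    """Keep the polyphase component with the largest energy.

    The energy |z|^2 maps complex samples to a real score, so the
    selection rule is well defined for complex-valued inputs.
    """
    components = polyphase_components(x, stride)
    energies = [sum(abs(z) ** 2 for z in c) for c in components]
    best = max(range(stride), key=lambda k: energies[k])
    return components[best]


def circular_shift(x, s):
    """Circularly shift a list to the right by s positions."""
    s %= len(x)
    return x[-s:] + x[:-s] if s else list(x)


def is_rotation(a, b):
    """True if list b is a circular rotation of list a."""
    return len(a) == len(b) and any(b == a[k:] + a[:k] for k in range(len(a)))


signal = [1 + 2j, 0.5j, -3 + 1j, 2, 0.1 - 0.1j, 4j, 1, -1 - 1j]
reference = aps_downsample(signal)
for shift in range(len(signal)):
    shifted_output = aps_downsample(circular_shift(signal, shift))
    print(shift, is_rotation(reference, shifted_output))
```

Plain strided downsampling always keeps the even-indexed samples, so a one-sample shift changes which samples survive. With the energy-based choice, every circular shift of the input selects the same set of samples, and the output differs only by a circular shift.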

[120] AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs

Shuhan Xia,Peipei Li,Xuannan Liu,Dongsen Zhang,Xinyu Guo,Zekun Li

Main category: cs.CV

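TL;DR: This paper presents AVFakeBench, the first comprehensive audio-video forgery detection benchmark, covering multiple forgery types and multi-level annotations to address the limited diversity and complexity of existing benchmarks.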

Details Motivation: Existing audio-video forgery detection benchmarks are mostly limited to DeepFakes and single-granularity annotations and cannot reflect complex real-world forgery scenarios, so a more comprehensive and challenging benchmark is needed to advance the field. Method: Proposes a multi-stage hybrid forgery framework that combines proprietary task-planning models with expert generative models to produce high-quality, diverse audio-video forgery samples; builds AVFakeBench with 12K audio-video questions covering seven forgery types and four annotation levels, and designs a multi-task evaluation framework. Result: Evaluating 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench shows that AV-LMMs have potential as emerging forgery detectors but still have clear weaknesses in fine-grained perception and reasoning. Conclusion: AVFakeBench provides a new, more challenging evaluation platform for audio-video forgery detection, exposes the strengths and weaknesses of current models, and encourages further research in the field. Abstract: The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human and general subjects. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery type classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.

[121] LaGen: Towards Autoregressive LiDAR Scene Generation

Sizhuo Zhou,Xiaosong Jia,Fanrui Zhang,Junjie Li,Juyong Zhang,Yukang Feng,Jianwen Sun,Songbur Wong,Junqi You,Junchi Yan

Main category: cs.CV

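TL;DR: This paper proposes LaGen, the first framework capable of frame-by-frame autoregressive generation of long-horizon LiDAR scenes. It generates high-fidelity 4D point clouds conditioned on a single input frame and bounding boxes, and uses scene decoupling estimation and noise modulation modules to improve interactivity and reduce error accumulation.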

Details Motivation: Existing LiDAR data generation methods support only single-frame generation, while prediction methods lack interactivity and cannot generate frame by frame over long horizons, so they fall short of the interactive, long-duration scene simulation needed in autonomous driving. Method: Proposes the LaGen framework, which generates LiDAR sequences autoregressively frame by frame; introduces a scene decoupling estimation module to strengthen interactive generation of object-level content and a noise modulation module to mitigate error accumulation in long-horizon generation; conditions 4D point cloud generation on a single LiDAR frame and bounding box information. Result: Builds an evaluation protocol for long-horizon LiDAR generation on top of nuScenes; experiments show that LaGen clearly outperforms state-of-the-art generation and prediction models in generation quality, especially on later frames. Conclusion: LaGen is the first framework to support long-horizon interactive LiDAR scene generation, with clear advantages in coherence, detail fidelity, and error control, and offers a new approach to simulation and planning in autonomous driving. Abstract: Generative world models for autonomous driving (AD) have become a trending topic. Unlike the widely studied image modality, in this work we explore generative world models for LiDAR data. Existing generation methods for LiDAR data only support single frame generation, while existing prediction approaches require multiple frames of historical input and can only deterministically predict multiple frames at once, lacking interactivity. Both paradigms fail to support long-horizon interactive generation. To this end, we introduce LaGen, which to the best of our knowledge is the first framework capable of frame-by-frame autoregressive generation of long-horizon LiDAR scenes. LaGen is able to take a single-frame LiDAR input as a starting point and effectively utilize bounding box information as conditions to generate high-fidelity 4D scene point clouds. In addition, we introduce a scene decoupling estimation module to enhance the model's interactive generation capability for object-level content, as well as a noise modulation module to mitigate error accumulation during long-horizon generation. We construct a protocol based on nuScenes for evaluating long-horizon LiDAR scene generation. Experimental results comprehensively demonstrate that LaGen outperforms state-of-the-art LiDAR generation and prediction models, especially on the later frames.

[122] Unlocking Zero-shot Potential of Semi-dense Image Matching via Gaussian Splatting

Juncheng Chen,Chao Xu,Yanjun Cao

Main category: cs.CV

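TL;DR: MatchGS is the first framework to systematically correct and leverage 3D Gaussian Splatting (3DGS) for robust zero-shot image matching; its geometrically faithful data generation and 2D-3D representation alignment strategy significantly improve matching performance.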

Details Motivation: Learning-based image matching depends on high-quality training data. 3DGS can render photorealistic images, but its geometric inaccuracies and biased depth rendering prevent reliable correspondence labeling, so a method that corrects these issues is needed. Method: Proposes the MatchGS framework: (1) a geometrically faithful data generation pipeline that refines 3DGS geometry to produce highly precise correspondence labels; (2) a 2D-3D representation alignment strategy that injects the explicit 3D knowledge of 3DGS into the 2D matcher, guiding it to learn viewpoint-invariant 3D representations. Result: The generated ground-truth correspondences reduce epipolar error by up to 40 times, support supervision under extreme viewpoint changes, and provide self-supervisory signals through Gaussian attributes; state-of-the-art matchers trained solely on this data improve zero-shot performance on public benchmarks by up to 17.7%. Conclusion: With proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, and structurally rich data source that advances a new generation of robust zero-shot image matchers. Abstract: Learning-based image matching critically depends on large-scale, diverse, and geometrically accurate training data. 3D Gaussian Splatting (3DGS) enables photorealistic novel-view synthesis and thus is attractive for data generation. However, its geometric inaccuracies and biased depth rendering currently prevent robust correspondence labeling. To address this, we introduce MatchGS, the first framework designed to systematically correct and leverage 3DGS for robust, zero-shot image matching. Our approach is twofold: (1) a geometrically-faithful data generation pipeline that refines 3DGS geometry to produce highly precise correspondence labels, enabling the synthesis of a vast and diverse range of viewpoints without compromising rendering fidelity; and (2) a 2D-3D representation alignment strategy that infuses 3DGS' explicit 3D knowledge into the 2D matcher, guiding 2D semi-dense matchers to learn viewpoint-invariant 3D representations. Our generated ground-truth correspondences reduce the epipolar error by up to 40 times compared to existing datasets, enable supervision under extreme viewpoint changes, and provide self-supervisory signals through Gaussian attributes. Consequently, state-of-the-art matchers trained solely on our data achieve significant zero-shot performance gains on public benchmarks, with improvements of up to 17.7%. Our work demonstrates that with proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, and structurally-rich data source, paving the way for a new generation of robust zero-shot image matchers.

[123] Co-Training Vision Language Models for Remote Sensing Multi-task Learning

Qingyun Li,Shuran Ma,Junwei Luo,Yi Yu,Yue Zhou,Fengxiang Wang,Xudong Lu,Xiaoxing Wang,Xin He,Yushi Chen,Xue Yang,Junchi Yan

Main category: cs.CV

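TL;DR: This paper presents RSCoVLM, a simple and flexible vision language model baseline for remote sensing multi-task learning. With a data curation engine, a dynamic-resolution strategy, and the Zoom-in Chain mechanism, it achieves leading performance across diverse tasks, and all resources are open-sourced to advance general-purpose remote sensing models.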

Details Motivation: Following the success of Transformers on individual remote sensing tasks, researchers want a single model that performs well and efficiently across multiple tasks. Multi-task learning (MTL) offers better generalization and practicality than single-task approaches, and existing vision language models (VLMs) have shown promise in areas such as remote sensing image understanding, but a unified, flexible, and efficient baseline is lacking. This work therefore aims to establish a general VLM baseline for remote sensing MTL. Method: Proposes the RSCoVLM model: (1) a data curation engine covering data acquisition, offline integration, and online loading and weighting, which generates flexible vision-language conversations; (2) a unified dynamic-resolution strategy to handle the varying scales in remote sensing imagery; (3) the Zoom-in Chain mechanism and its companion dataset LRS-VQA-Zoom for ultra-high-resolution (UHR) images; (4) enhanced object detection capability and a new evaluation protocol for fair comparison between VLMs and conventional detection models. Result: Experiments show that RSCoVLM achieves state-of-the-art performance on multiple remote sensing tasks, surpassing existing remote sensing VLMs and rivaling specialized expert models. Dynamic resolution and Zoom-in Chain reduce the computational burden, and the new evaluation protocol improves comparability. All tools, model weights, and datasets are open-sourced. Conclusion: As a simple yet flexible vision-language baseline, RSCoVLM advances remote sensing multi-task learning, validates the effectiveness of a unified text interface for remote sensing MTL, and provides a solid foundation and open resources for future research on general-purpose remote sensing models. Abstract: With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning, respectively. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. Firstly, we create the data curation engine, including data acquisition, offline processing and integrating, as well as online loading and weighting. This data engine effectively addresses complex RS data environment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. The strategies are flexible and effectively mitigate the computational burdens. Additionally, we significantly enhance the model's object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All the training and evaluating tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.

[124] PathMamba: A Hybrid Mamba-Transformer for Topologically Coherent Road Segmentation in Satellite Imagery

Jules Decaestecker,Nicolas Vigne

Main category: cs.CV

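TL;DR: This paper proposes PathMamba, a new hybrid architecture that combines Mamba's state space modeling with the Transformer's global reasoning for road segmentation in satellite imagery. It significantly improves topological continuity while maintaining high accuracy and linear computational efficiency.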

Details Motivation: Vision Transformer-based methods capture global context in road segmentation, but their quadratic computational complexity hinders efficient deployment on resource-constrained devices; road networks are long, continuous structures that call for more efficient sequence modeling. Method: Proposes PathMamba, which uses Mamba blocks to model road continuity and preserve topological structure, and adds Transformer blocks to refine features with global context, forming a complementary hybrid architecture. Result: Experiments on the DeepGlobe and Massachusetts Roads datasets show that PathMamba sets a new state of the art on topological metrics such as APLS, clearly outperforming existing methods while remaining computationally efficient. Conclusion: PathMamba balances accuracy, topological continuity, and computational cost in road segmentation, offering a practical option for future deployment on resource-constrained platforms (such as vehicle-mounted or on-board satellite systems). Abstract: Achieving both high accuracy and topological continuity in road segmentation from satellite imagery is a critical goal for applications ranging from urban planning to disaster response. State-of-the-art methods often rely on Vision Transformers, which excel at capturing global context, yet their quadratic complexity is a significant barrier to efficient deployment, particularly for on-board processing in resource-constrained platforms. In contrast, emerging State Space Models like Mamba offer linear-time efficiency and are inherently suited to modeling long, continuous structures. We posit that these architectures have complementary strengths. To this end, we introduce PathMamba, a novel hybrid architecture that integrates Mamba's sequential modeling with the Transformer's global reasoning. Our design strategically uses Mamba blocks to trace the continuous nature of road networks, preserving topological structure, while integrating Transformer blocks to refine features with global context. This approach yields topologically superior segmentation maps without the prohibitive scaling costs of pure attention-based models. Our experiments on the DeepGlobe Road Extraction and Massachusetts Roads datasets demonstrate that PathMamba sets a new state-of-the-art. Notably, it significantly improves topological continuity, as measured by the APLS metric, setting a new benchmark while remaining computationally competitive.

[125] CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation

Chenyu Liu,Hongze Chen,Jingzhi Bao,Lingting Zhu,Runze Zhang,Weikai Chen,Zeyu Hu,Yingda Yin,Keyang Luo,Xin Wang

Main category: cs.CV

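TL;DR: This paper proposes CaliTex, a 3D texture generation framework based on geometry-calibrated attention, which uses structured attention mechanisms to resolve cross-view texture inconsistency.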

Details Motivation: Existing diffusion models suffer from cross-view inconsistency in 3D texture generation, which stems from attention ambiguity and leads to unstable coupling between geometric structure and appearance. Method: Proposes the CaliTex framework, comprising two modules, Part-Aligned Attention and Condition-Routed Attention, combined with a two-stage diffusion transformer to align attention with 3D geometric structure. Result: Experiments show that CaliTex outperforms open-source and commercial baselines at generating seamless, view-consistent textures. Conclusion: CaliTex makes geometric coherence an inherent behavior of the network, significantly improving the quality and stability of 3D texture generation. Abstract: Despite major advances brought by diffusion-based models, current 3D texture generation systems remain hindered by cross-view inconsistency -- textures that appear convincing from one viewpoint often fail to align across others. We find that this issue arises from attention ambiguity, where unstructured full attention is applied indiscriminately across tokens and modalities, causing geometric confusion and unstable appearance-structure coupling. To address this, we introduce CaliTex, a framework of geometry-calibrated attention that explicitly aligns attention with 3D structure. It introduces two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention which routes appearance information through geometry-conditioned pathways to maintain spatial fidelity. Coupled with a two-stage diffusion transformer, CaliTex makes geometric coherence an inherent behavior of the network rather than a byproduct of optimization. Empirically, CaliTex produces seamless and view-consistent textures and outperforms both open-source and commercial baselines.

[126] HTTM: Head-wise Temporal Token Merging for Faster VGGT

Weitian Wang,Lukas Meiner,Rai Shubham,Cecilia De La Parra,Akash Kumar

Main category: cs.CV

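TL;DR: This paper proposes head-wise temporal merging (HTTM), a training-free 3D token merging method that accelerates inference of the Visual Geometry Grounded Transformer (VGGT) for large-scale scene reconstruction, achieving up to 7x speedup with negligible performance loss.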

Details Motivation: VGGT can jointly infer key 3D attributes, but its global attention causes significant latency on long-sequence inputs, limiting its use in large-scene reconstruction. Method: Proposes head-wise temporal merging (HTTM), which merges tokens at multi-head granularity and exploits the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios at lower merging cost. Result: HTTM achieves up to 7x acceleration in GPU inference with minimal performance degradation. Conclusion: By preserving the uniqueness of feature tokens and exploiting head-level structure, HTTM overcomes the limitations of existing token merging methods and provides an efficient solution for applying VGGT at scale. Abstract: The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. Thus, HTTM achieves up to 7x acceleration with negligible performance drops in GPU-based inference.
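The difference between uniform and head-wise merging can be seen in a toy Python sketch. It is a deliberate simplification, not the HTTM algorithm: each call merges only the single most similar token pair by averaging and copies the average back to both positions, and temporal correspondence is ignored. All function names and the toy features are assumptions made for illustration.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def merge_most_similar(tokens):
    """Average the most similar pair and write the average back to both slots."""
    n = len(tokens)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    i, j = max(pairs, key=lambda p: cosine(tokens[p[0]], tokens[p[1]]))
    avg = [(a + b) / 2 for a, b in zip(tokens[i], tokens[j])]
    out = [list(t) for t in tokens]
    out[i], out[j] = avg, list(avg)
    return out


def concat_heads(heads):
    """Concatenate per-head features into one vector per token."""
    n = len(heads[0])
    return [sum((h[t] for h in heads), []) for t in range(n)]


def uniform_merge(heads):
    # One merge decision shared by all heads, taken on concatenated features.
    return merge_most_similar(concat_heads(heads))


def headwise_merge(heads):
    # Each head takes its own merge decision before concatenation.
    return concat_heads([merge_most_similar(h) for h in heads])


def count_distinct(tokens):
    return len({tuple(t) for t in tokens})


# heads[h][t] is the feature of token t in attention head h (toy values).
heads = [
    [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0], [0.2, 1.0]],
]
print(count_distinct(uniform_merge(heads)))
print(count_distinct(headwise_merge(heads)))
```

In this toy setting, the shared decision leaves two tokens with identical output features, whereas per-head decisions pick different pairs in each head, so all three concatenated tokens stay distinct. This mirrors the abstract's point about preserving token uniqueness after head concatenation.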

[127] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Stefanos Koutoupis,Michaela Areti Zervou,Konstantinos Kontras,Maarten De Vos,Panagiotis Tsakalides,Grigorios Tsagatakis

Main category: cs.CV

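TL;DR: Proposes Contrastive Fusion (ConFu), a framework that extends the traditional pairwise contrastive objective to jointly embed individual modalities and their fused representations, capturing higher-order dependencies while maintaining strong pairwise alignment.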

Details Motivation: Existing multimodal learning methods focus mainly on pairwise alignment, struggle to model higher-order interactions among multiple modalities, and are limited on single-modality tasks. Method: Introduces the ConFu framework, which embeds individual and fused modalities in a unified representation space and extends the traditional contrastive objective with a fused-modality contrastive term to jointly optimize pairwise and higher-order modality relationships. Result: Validated on synthetic and real-world multimodal benchmarks, ConFu better captures cross-modal complementarity and higher-order dependencies (such as XOR relationships), is competitive on retrieval and classification tasks, and supports unified one-to-one and two-to-one retrieval. Conclusion: ConFu accommodates both higher-order modality interactions and pairwise alignment, enabling more comprehensive joint representation learning in complex multimodal settings. Abstract: Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
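The claim that XOR-like relationships cannot be recovered through pairwise alignment alone can be checked on the smallest possible example. The sketch below treats two independent fair bits `x` and `y` as two modalities and `z = x XOR y` as the third; the variable names are assumptions for illustration, and the example says nothing about how ConFu itself is implemented.

```python
from itertools import product


def covariance(xs, ys):
    """Population covariance of two equally weighted samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# All four equally likely outcomes of two fair bits, with z = x XOR y.
samples = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
xs, ys, zs = zip(*samples)

cov_xz = covariance(xs, zs)  # what pairwise alignment of x and z could exploit
cov_yz = covariance(ys, zs)  # what pairwise alignment of y and z could exploit
fused_to_z = {(x, y): z for x, y, z in samples}  # the fused pair determines z

print(cov_xz, cov_yz)
print(fused_to_z)
```

Each single modality has zero covariance with `z` (for binary variables this means independence), yet the pair `(x, y)` determines `z` exactly. A term that aligns a fused pair with the third modality is what gives access to this kind of dependency.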

[128] Hybrid SIFT-SNN for Efficient Anomaly Detection of Traffic Flow-Control Infrastructure

Munish Rathee,Boris Bačić,Maryam Doborjeh

Main category: cs.CV

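TL;DR: Proposes the SIFT-SNN framework, which combines SIFT with a spiking neural network for low-latency, high-accuracy real-time detection of structural anomalies in transport infrastructure.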

Details Motivation: Real-time detection of structural anomalies in transport infrastructure calls for a low-latency, interpretable system suited to edge deployment, to make safety monitoring more efficient. Method: Uses SIFT for spatial feature encoding, combined with a latency-driven spike conversion layer and a LIF spiking neural network for classification, trained and validated on real and synthetically augmented data. Result: On the Auckland Harbour Bridge dataset, achieves 92.3% classification accuracy (± 0.8%), a per-frame inference time of 9.5 ms, and sparse spike activity of 8.1%, supporting real-time, low-power edge deployment. Conclusion: The SIFT-SNN framework delivers low-latency, efficient inference while keeping spatial features interpretable, is suited to embedded deployment, and offers a generalizable solution for transport infrastructure safety monitoring, though generalization to unseen field conditions still needs to be validated. Abstract: This paper presents the SIFT-SNN framework, a low-latency neuromorphic signal-processing pipeline for real-time detection of structural anomalies in transport infrastructure. The proposed approach integrates Scale-Invariant Feature Transform (SIFT) for spatial feature encoding with a latency-driven spike conversion layer and a Leaky Integrate-and-Fire (LIF) Spiking Neural Network (SNN) for classification. The Auckland Harbour Bridge dataset is recorded under various weather and lighting conditions, comprising 6,000 labelled frames that include both real and synthetically augmented unsafe cases. The presented system achieves a classification accuracy of 92.3% (± 0.8%) with a per-frame inference time of 9.5 ms. The achieved sub-10-millisecond latency, combined with sparse spike activity (8.1%), enables real-time, low-power edge deployment. Unlike conventional CNN-based approaches, the hybrid SIFT-SNN pipeline explicitly preserves spatial feature grounding, enhances interpretability, supports transparent decision-making, and operates efficiently on embedded hardware. Although synthetic augmentation improved robustness, generalisation to unseen field conditions remains to be validated. The SIFT-SNN framework is validated through a working prototype deployed on a consumer-grade system and framed as a generalisable case study in structural safety monitoring for movable concrete barriers, which, as traffic flow-control infrastructure, are deployed in over 20 cities worldwide.
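The two spiking components named in the abstract, latency-based spike conversion and a Leaky Integrate-and-Fire (LIF) neuron, can be sketched in a few lines of Python. The abstract does not give the encoding rule or neuron parameters, so the linear latency mapping, the decay `beta`, the threshold, and the reset-to-zero rule below are textbook-style assumptions, and both function names are hypothetical.

```python
def latency_encode(value, num_steps):
    """Map a feature value in [0, 1] to a spike time: stronger means earlier.

    Assumed linear rule; the paper's conversion layer may differ.
    """
    value = min(max(value, 0.0), 1.0)
    return int(round((1.0 - value) * (num_steps - 1)))


def lif_spike_times(input_current, beta=0.9, threshold=1.0):
    """Simulate one discrete-time LIF neuron and return its spike times.

    The membrane potential decays by `beta` each step, integrates the
    input, and resets to zero after crossing `threshold`.
    """
    potential = 0.0
    spikes = []
    for t, current in enumerate(input_current):
        potential = beta * potential + current
        if potential >= threshold:
            spikes.append(t)
            potential = 0.0
    return spikes


# A strong feature response fires earlier than a weak one.
print(latency_encode(0.9, num_steps=8), latency_encode(0.2, num_steps=8))

# A constant sub-threshold input must be integrated over several steps.
print(lif_spike_times([0.3] * 8))
```

With a constant input of 0.3, the neuron needs four steps to reach the threshold, so it fires only twice in eight steps. Sparse firing of this kind is what makes spiking networks attractive for low-power inference.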

[129] SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

Tae-Min Choi,Tae Kyeong Jeong,Garam Kim,Jaemin Lee,Yeongyoon Koh,In Cheul Choi,Jae-Ho Chung,Jong Woong Park,Juyoun Park

Main category: cs.CV

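TL;DR: SurgMLLMBench is a unified multimodal benchmark intended to advance research on interactive multimodal large language models for surgical scene understanding. It includes pixel-level segmentation and structured VQA annotations and introduces the new MAVIS dataset.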

Details Motivation: Existing surgical datasets mostly use a visual question answering format with heterogeneous taxonomies and lack support for pixel-level segmentation, which limits model comparability and applicability. Method: Builds SurgMLLMBench, a unified multimodal benchmark that integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains, and introduces the newly collected MAVIS dataset. Result: A single model trained on SurgMLLMBench performs consistently across surgical domains and generalizes effectively to unseen datasets. Conclusion: SurgMLLMBench is a strong resource for multimodal surgical AI research, supporting reproducible evaluation and the development of interactive surgical reasoning models. Abstract: Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.

[130] PFF-Net: Patch Feature Fitting for Point Cloud Normal Estimation

Qing Li,Huifang Feng,Kanle Shi,Yue Gao,Yi Fang,Yu-Shen Liu,Zhizhong Han

Main category: cs.CV

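TL;DR: Proposes a point cloud normal estimation method based on multi-scale feature fusion, achieving robust and efficient normal estimation through patch feature aggregation and cross-scale feature compensation.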

Details Motivation: Existing methods struggle to choose a suitable local neighborhood size across different data or geometries, and they are parameter-heavy and inefficient, which makes it hard to balance accuracy against computational cost. Method: Introduces a multi-scale feature fusion strategy and designs a patch feature fitting (PFF) framework with a multi-scale feature aggregation module and a cross-scale feature compensation module: the former progressively aggregates multi-scale features toward the patch center while shrinking the patch, and the latter reuses early large-scale features to strengthen the information links between scales. Result: Achieves state-of-the-art performance on both synthetic and real-world datasets while reducing network parameters and running time. Conclusion: The method adapts effectively to local patches of different scales, provides the optimal feature description, and yields more accurate, efficient, and robust normal estimation. Abstract: Estimating the normal of a point requires constructing a local patch to provide center-surrounding context, but determining the appropriate neighborhood size is difficult when dealing with different data or geometries. Existing methods commonly employ various parameter-heavy strategies to extract a full feature description from the input patch. However, they still have difficulties in accurately and efficiently predicting normals for various point clouds. In this work, we present a new idea of feature extraction for robust normal estimation of point clouds. We use the fusion of multi-scale features from different neighborhood sizes to address the issue of selecting reasonable patch sizes for various data or geometries. We seek to model a patch feature fitting (PFF) based on multi-scale features to approximate the optimal geometric description for normal estimation and implement the approximation process via multi-scale feature aggregation and cross-scale feature compensation. The feature aggregation module progressively aggregates the patch features of different scales to the center of the patch and shrinks the patch size by removing points far from the center. It not only enables the network to precisely capture the structural characteristics over a wide range, but also describes highly detailed geometries. The feature compensation module ensures the reusability of features from earlier layers of large scales and reveals associated information in different patch sizes. Our approximation strategy based on aggregating the features of multiple scales enables the model to achieve scale adaptation of varying local patches and deliver the optimal feature description. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets with fewer network parameters and less running time.

[131] Endo-G$^{2}$T: Geometry-Guided & Temporally Aware Time-Embedded 4DGS For Endoscopic Scenes

Yangle Liu,Fengze Li,Kan Liu,Jieming Ma

Main category: cs.CV

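TL;DR: Proposes Endo-G²T, a geometry-guided, temporally aware training framework for 4D Gaussian splatting in dynamic endoscopic scenes, which achieves stable, efficient, and accurate 4D reconstruction through geometric prior distillation, a time-embedded representation, and keyframe-constrained streaming optimization.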

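Details Motivation: Endoscopic video exhibits strong view-dependent effects (such as specularities, wet reflections, and occlusions), and pure photometric supervision easily causes early geometric drift that is hard to correct. Existing methods struggle to balance geometric accuracy, temporal consistency, and computational efficiency in dynamic scenes. Method: 1) Geometry-guided prior distillation: confidence-gated monocular depth is converted into scale-invariant depth and depth-gradient losses, with a warm-up-to-cap schedule injecting the prior softly; 2) Time-embedded Gaussian field: dynamics are modeled in four-dimensional space-time (XYZT) with a rotor-like rotation parameterization, plus lightweight regularization for smooth motion and crisp opacity boundaries; 3) Keyframe-constrained streaming training: optimization focuses on keyframes under a max-points budget while non-keyframes receive lightweight updates, improving efficiency and long-horizon stability. Result: On the EndoNeRF and StereoMIS-P1 datasets, Endo-G²T achieves the best performance among monocular reconstruction methods, clearly outperforming existing baselines with stronger geometric stability and visual quality. Conclusion: Through early geometric anchoring and temporally aware modeling, Endo-G²T mitigates the geometric drift caused by photometric errors in endoscopic video and provides an efficient, robust solution for high-quality 4D reconstruction of dynamic medical imagery. Abstract: Endoscopic (endo) video exhibits strong view-dependent effects such as specularities, wet reflections, and occlusions. Pure photometric supervision misaligns with geometry and triggers early geometric drift, where erroneous shapes are reinforced during densification and become hard to correct. We ask how to anchor geometry early for 4D Gaussian splatting (4DGS) while maintaining temporal consistency and efficiency in dynamic endoscopic scenes. Thus, we present Endo-G$^{2}$T, a geometry-guided and temporally aware training scheme for time-embedded 4DGS. First, geo-guided prior distillation converts confidence-gated monocular depth into supervision with scale-invariant depth and depth-gradient losses, using a warm-up-to-cap schedule to inject priors softly and avoid early overfitting. Second, a time-embedded Gaussian field represents dynamics in XYZT with a rotor-like rotation parameterization, yielding temporally coherent geometry with lightweight regularization that favors smooth motion and crisp opacity boundaries. Third, keyframe-constrained streaming improves efficiency and long-horizon stability through keyframe-focused optimization under a max-points budget, while non-keyframes advance with lightweight updates. Across EndoNeRF and StereoMIS-P1 datasets, Endo-G$^{2}$T achieves state-of-the-art results among monocular reconstruction baselines.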
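The first component (confidence-gated, scale-invariant depth supervision with a warm-up-to-cap weight) can be sketched as follows. The abstract does not give the exact formulas, so this sketch assumes the widely used scale-invariant log-depth loss and a linear ramp; `conf_min`, `lam`, `warmup_steps`, and `cap` are hypothetical parameters chosen for illustration.

```python
import math


def scale_invariant_depth_loss(pred, target, conf, conf_min=0.5, lam=1.0):
    """Scale-invariant log-depth loss over pixels whose confidence passes the gate.

    With lam = 1.0 the loss is unchanged when `pred` is multiplied by any
    positive constant, which suits monocular depth of unknown scale.
    """
    diffs = [math.log(p) - math.log(t)
             for p, t, c in zip(pred, target, conf) if c >= conf_min]
    if not diffs:
        return 0.0
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    mean = sum(diffs) / n
    return mean_sq - lam * mean * mean


def prior_weight(step, warmup_steps=1000, cap=0.1):
    """Warm-up-to-cap schedule (assumed linear): ramp from 0 to `cap`, then hold."""
    return cap * min(1.0, step / warmup_steps)


# A prediction that is exactly half the target depth everywhere incurs
# (numerically) zero loss; the low-confidence outlier is gated out.
loss = scale_invariant_depth_loss(
    pred=[1.0, 2.0, 100.0], target=[2.0, 4.0, 8.0], conf=[0.9, 0.9, 0.1])
```

The total training objective would then add `prior_weight(step) * loss` to the photometric term, so the depth prior has little influence at the very start and a bounded influence later.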

[132] Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning

Xin Gu,Haoji Zhang,Qihang Fan,Jingxuan Niu,Zhipeng Zhang,Libo Zhang,Guang Chen,Fan Chen,Longyin Wen,Sijie Zhu

Main category: cs.CV

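TL;DR: This paper proposes STVG-o1, the first framework that lets multimodal large language models (MLLMs) reach state-of-the-art performance on spatio-temporal video grounding without architectural modifications. By introducing a bounding-box chain-of-thought mechanism and a multi-dimensional reinforcement reward function, it markedly improves fine-grained region-word alignment and achieves leading results on multiple benchmarks.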

Details Motivation: Existing MLLMs perform poorly on spatio-temporal video grounding (STVG), mainly because of misaligned training objectives and visual encoders that lack fine-grained region-word alignment. Method: Proposes the STVG-o1 framework, which introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations before prediction, and designs a multi-dimensional reinforcement learning reward function with format, consistency, temporal, spatial, and think rewards, providing geometry-aware supervision through reinforcement fine-tuning. Result: Sets new state-of-the-art results on HCSTVG-v1/v2, with m_tIoU on HCSTVG-v1 7.3% higher than the best task-specific method; matches specialized models on VidSTG; surpasses all existing MLLM-based methods by large margins; and shows strong open-vocabulary generalization. Conclusion: STVG-o1 shows that off-the-shelf MLLMs, without architectural changes, can become efficient and precise spatio-temporal grounding tools given a suitable reasoning mechanism and training strategy, opening a new path for applying MLLMs to video understanding. Abstract: Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions. Despite their strong language understanding, multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. To address this, we propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without any architectural modifications. Our method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations in an intermediate step before producing the final prediction. We further design a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning. Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matching specialized models on VidSTG, and surpassing all existing MLLM-based approaches by large margins. It also demonstrates strong open-vocabulary generalization across datasets, establishing MLLMs as viable and powerful backbones for precise spatio-temporal grounding. Our code and models will be released.
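The temporal and spatial rewards, and the m_tIoU metric, all rest on intersection-over-union. The sketch below shows one plausible way such geometry-aware reward terms could be computed; the abstract does not specify the reward formulas, so the equal weighting, the function names, and the per-frame averaging are assumptions, and the format, consistency, and think rewards are omitted.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred_span, gt_span, pred_boxes, gt_boxes, w_t=0.5, w_s=0.5):
    """Weighted sum of a temporal and a spatial term (weights are assumptions).

    pred_boxes / gt_boxes map frame index -> box; ground-truth frames with no
    predicted box contribute an IoU of 0 to the spatial average.
    """
    t = temporal_iou(pred_span, gt_span)
    ious = [box_iou(pred_boxes[f], gt_boxes[f]) if f in pred_boxes else 0.0
            for f in gt_boxes]
    s = sum(ious) / len(ious) if ious else 0.0
    return w_t * t + w_s * s


# Half-overlapping 10-second spans give a temporal IoU of 5 / 15.
t_iou = temporal_iou((0, 10), (5, 15))
```

Averaging `temporal_iou` over all test samples gives the mean temporal IoU that the m_tIoU figure in the abstract refers to.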

[133] Monet: Reasoning in Latent Visual Space Beyond Images and Language

Qixun Wang,Yang Shi,Yifei Wang,Yuanxing Zhang,Pengfei Wan,Kun Gai,Xianghua Ying,Yisen Wang

Main category: cs.CV

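TL;DR: This paper proposes Monet, a framework that lets multimodal large language models reason directly in the latent visual space by generating continuous embeddings as intermediate visual thoughts, and introduces the VLPO reinforcement learning method to strengthen latent reasoning.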

Details Motivation: Existing visual reasoning methods are constrained by external tools and lack the flexibility of human-like abstract visual thinking, so the limits of text-based reasoning need to be overcome. Method: Designs a three-stage distillation-based supervised fine-tuning (SFT) pipeline, builds Monet-SFT-125K, a high-quality interleaved text-image CoT dataset with 125K samples, and proposes VLPO, which incorporates latent embeddings into policy gradient updates. Result: Monet-7B shows consistent gains on real-world perception and reasoning benchmarks and strong generalization on out-of-distribution abstract visual reasoning tasks. Conclusion: Monet achieves visual reasoning in the latent space, addresses the problems of computational cost and insufficient supervision, and offers a practical path and empirical lessons for future research on visual latent reasoning. Abstract: "Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.

[134] DiverseVAR: Balancing Diversity and Quality of Next-Scale Visual Autoregressive Models

Mingue Park,Prin Phunyaphibarn,Phillip Y. Lee,Minhyuk Sung

Main category: cs.CV

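TL;DR: Proposes DiverseVAR, a framework that improves the generation diversity of visual autoregressive models without retraining by injecting noise into the text embedding at test time and applying a scale-travel refinement, while effectively preserving image quality.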

Details Motivation: Visual autoregressive (VAR) models perform well at image generation but lack diversity, often producing highly similar images for the same prompt, a problem that has been overlooked in research focused on image quality. Method: First, noise is injected into the text embedding to increase diversity; second, scale-travel is proposed, which uses a multi-scale autoencoder to extract coarse-scale tokens and resume generation at intermediate stages in order to maintain image quality. Result: Experiments show that the method significantly increases generation diversity while minimizing image-quality degradation, achieving a new Pareto frontier in the diversity-quality trade-off. Conclusion: DiverseVAR addresses the limited diversity of VAR models without retraining or high computational cost and provides a better test-time optimization scheme for autoregressive image generation. Abstract: We introduce DiverseVAR, a framework that enhances the diversity of text-conditioned visual autoregressive models (VAR) at test time without requiring retraining, fine-tuning, or substantial computational overhead. While VAR models have recently emerged as strong competitors to diffusion and flow models for image generation, they suffer from a critical limitation in diversity, often producing nearly identical images even for simple prompts. This issue has largely gone unnoticed amid the predominant focus on image quality. We address this limitation at test time in two stages. First, inspired by diversity enhancement techniques in diffusion models, we propose injecting noise into the text embedding. This introduces a trade-off between diversity and image quality: as diversity increases, the image quality sharply declines. To preserve quality, we propose scale-travel: a novel latent refinement technique inspired by time-travel strategies in diffusion models. Specifically, we use a multi-scale autoencoder to extract coarse-scale tokens that enable us to resume generation at intermediate stages. Extensive experiments show that combining text-embedding noise injection with our scale-travel refinement significantly enhances diversity while minimizing image-quality degradation, achieving a new Pareto frontier in the diversity-quality trade-off.

[135] SAM Guided Semantic and Motion Changed Region Mining for Remote Sensing Change Captioning

Futian Wang,Mengqi Wang,Xiao Wang,Haowen Wang,Jin Tang

Main category: cs.CV

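TL;DR: This paper proposes a remote sensing change captioning method that combines the SAM foundation model with a knowledge graph, fusing global visual features, semantic- and motion-level change regions, and object-of-interest information to produce more accurate natural language descriptions of changes.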

Details Motivation: Existing remote sensing change captioning methods have weak region awareness and limited temporal alignment and do not model regions of interest effectively, which limits caption accuracy and interpretability. Method: A CNN/Transformer extracts global visual features, the SAM model segments semantic- and motion-level change regions, and a specially constructed knowledge graph supplies information about objects of interest; these heterogeneous sources are fused via cross-attention, and a Transformer decoder generates the natural language description. Result: Achieves state-of-the-art performance on multiple mainstream remote sensing change captioning datasets. Conclusion: The method strengthens the model's region awareness and knowledge guidance, improves the accuracy and readability of remote sensing change captions, and offers a new technical path for the field. Abstract: Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed between two remote sensing images captured at different times. Existing methods typically employ CNNs/Transformers to extract visual representations from the given images or incorporate auxiliary tasks to enhance the final results, with weak region awareness and limited temporal alignment. To address these issues, this paper explores the use of the SAM (Segment Anything Model) foundation model to extract region-level representations and inject region-of-interest knowledge into the captioning framework. Specifically, we employ a CNN/Transformer model to extract global-level vision features, leverage the SAM foundation model to delineate semantic- and motion-level change regions, and utilize a specially constructed knowledge graph to provide information about objects of interest. These heterogeneous sources of information are then fused via cross-attention, and a Transformer decoder is used to generate the final natural language description of the observed changes. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple widely used benchmark datasets. The source code of this paper will be released on https://github.com/Event-AHU/SAM_ChangeCaptioning

[136] E-M3RF: An Equivariant Multimodal 3D Re-assembly Framework

Adeela Islam,Stefano Fiorini,Manuel Lecha,Theodore Tsesmelis,Stuart James,Pietro Morerio,Alessio Del Bue

Main category: cs.CV

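TL;DR: This paper proposes E-M3RF, an equivariant multimodal 3D reassembly framework that combines geometric and color features and uses SE(3) flow matching to predict fragment transformations, substantially improving reassembly accuracy in challenging fragment scenarios.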

Details Motivation: Existing deep learning methods for 3D reassembly rely mainly on geometric features, perform poorly when geometry is insufficient or ambiguous (for example, small, eroded, or symmetric fragments), and lack physical constraints that prevent overlapping assemblies. Method: Proposes the E-M3RF framework, which uses a rotation-equivariant encoder to extract geometric features from 3D point positions and a Transformer to encode per-point color, fuses them into a multimodal representation, and predicts the transformations between fragments via SE(3) flow matching. Result: Experiments on four datasets (Breaking Bad, Fantastic Breaks, RePAIR, and Presious) show that on RePAIR, E-M3RF reduces rotation error by 23.1%, translation error by 13.2%, and Chamfer Distance by 18.4% compared with existing methods. Conclusion: By combining a multimodal geometry-and-color representation with equivariant modeling, E-M3RF improves the accuracy of 3D fragment reassembly and is especially suited to real-world cases where geometric information is limited. Abstract: 3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been challenged by deep learning methods rather than classical optimization. While learning approaches have shown promising results, most still rely primarily on geometric features to assemble a whole from its parts. As a result, methods struggle when geometry alone is insufficient or ambiguous, for example, for small, eroded, or symmetric fragments. Additionally, solutions do not impose physical constraints that explicitly prevent overlapping assemblies. To address these limitations, we introduce E-M3RF, an equivariant multimodal 3D reassembly framework that takes as input the point clouds, containing both point positions and colors of fractured fragments, and predicts the transformations required to reassemble them using SE(3) flow matching. Each fragment is represented by both geometric and color features: i) 3D point positions are encoded as rotation-consistent geometric features using a rotation-equivariant encoder, ii) the colors at each 3D point are encoded with a transformer. The two feature sets are then combined to form a multimodal representation. We experimented on four datasets: two synthetic datasets, Breaking Bad and Fantastic Breaks, and two real-world cultural heritage datasets, RePAIR and Presious, demonstrating that E-M3RF on the RePAIR dataset reduces rotation error by 23.1% and translation error by 13.2%, while Chamfer Distance decreases by 18.4% compared to competing methods.
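Chamfer Distance, one of the reported metrics, measures how closely two point sets match. The brute-force sketch below uses one common convention (mean squared nearest-neighbour distance, summed over both directions); the paper may use a different normalisation, so treat the exact form as an assumption.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets.

    Each set is a non-empty list of (x, y, z) tuples. For every point the
    squared distance to its nearest neighbour in the other set is taken,
    averaged per set, and the two directional averages are summed.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Shifting a two-point fragment by one unit along y gives a distance of 2.0
# under this convention (1.0 from each direction).
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
cd = chamfer_distance(reference, shifted)
```

This version is quadratic in the number of points; for real fragments a spatial index such as a k-d tree would normally replace the inner `min`.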

[137] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings

Jiajie Zhang,Sören Schwertfeger,Alexander Kleiner

Main category: cs.CV

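TL;DR: Proposes an unsupervised framework that automatically extracts and organizes Vision-Language-Action (VLA) pre-training data from continuous industrial video streams.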

Details Motivation: To exploit large amounts of unlabeled human demonstration video and address the lack of high-quality, structured pre-training data for VLA models in manufacturing. Method: First trains a lightweight motion tokenizer to encode motion dynamics, then uses an unsupervised action segmenter based on a "Latent Action Energy" metric to discover and segment semantically coherent action primitives, outputting segmented video clips with their corresponding latent action sequences. Result: Evaluations on public benchmarks and a proprietary electric motor assembly dataset show effective segmentation of key tasks, and clustering plus quantitative assessment with a vision-language model confirm the semantic coherence of the discovered action primitives. Conclusion: This is the first fully automated, end-to-end system for extracting VLA pre-training data from unstructured industrial video, offering a scalable solution for embodied AI integration in manufacturing. Abstract: We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.

[138] EvRainDrop: HyperGraph-guided Completion for Effective Frame and Event Stream Aggregation

Futian Wang,Fan Zhang,Xiao Wang,Mengqi Wang,Dexing Huang,Jin Tang

Main category: cs.CV

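TL;DR: This paper proposes a hypergraph-based spatio-temporal event stream completion mechanism that connects event tokens across different times and locations through hypergraphs and uses contextual message passing to compensate for the spatial sparsity of events. It can also fuse RGB modality information, and experiments show it is effective on both single-label and multi-label event classification.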

Details Motivation: Existing event representation learning methods suffer from undersampling when handling the spatial sparsity of event data and struggle to preserve spatio-temporal information. Method: Proposes a hypergraph-guided spatio-temporal event stream completion mechanism that builds hypergraphs connecting event tokens across time and space, performs contextual message passing to complete sparse events, adds RGB tokens to the hypergraph for multi-modal completion, and aggregates node information from different time steps with self-attention. Result: Extensive experiments on single-label and multi-label event classification tasks validate the effectiveness of the framework. Conclusion: The method mitigates the undersampling caused by the spatial sparsity of event data, enables effective multi-modal feature learning and fusion, and offers a new approach to event representation learning. Abstract: Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically use event frames, voxels, or tensors as input. Although these approaches have achieved notable progress, they struggle to address the undersampling problem caused by spatial sparsity. In this paper, we propose a novel hypergraph-guided spatio-temporal event stream completion mechanism, which connects event tokens across different times and spatial locations via hypergraphs and leverages contextual information message passing to complete these sparse events. The proposed method can flexibly incorporate RGB tokens as nodes in the hypergraph within this completion framework, enabling multi-modal hypergraph-based information completion. Subsequently, we aggregate hypergraph node information across different time steps through self-attention, enabling effective learning and fusion of multi-modal features. Extensive experiments on both single- and multi-label event classification tasks fully validated the effectiveness of our proposed framework. The source code of this paper will be released on https://github.com/Event-AHU/EvRainDrop.

[139] MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices

Shuai Zhang,Bao Tang,Siyuan Yu,Yueting Zhu,Jingfeng Yao,Ya Zou,Shanglin Yuan,Li Yu,Wenyu Liu,Xinggang Wang

Main category: cs.CV

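TL;DR: This paper proposes MobileI2V, a lightweight diffusion model for real-time, high-resolution image-to-video generation on mobile devices. A linear hybrid attention architecture, a time-step distillation strategy, and mobile-specific attention optimizations substantially increase generation speed while maintaining quality.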

Details Motivation: Diffusion models for image-to-video generation on mobile devices face high computational complexity and slow generation, which makes real-time use difficult. Method: 1) Designs a linear hybrid architecture denoiser that balances efficiency and quality; 2) proposes a time-step distillation strategy that compresses sampling to two steps; 3) applies mobile-specific attention optimizations. Result: Generates 720p video on mobile devices at under 100 ms per frame under one-step conditions, with time-step distillation giving a 10-fold speed-up, while keeping quality comparable to existing models. Conclusion: MobileI2V is the first to achieve fast, high-quality 720p image-to-video generation on mobile devices, providing a practical option for mobile video generation applications. Abstract: Recently, video generation has witnessed rapid advancements, drawing increasing attention to image-to-video (I2V) synthesis on mobile devices. However, the substantial computational complexity and slow generation speed of diffusion models pose significant challenges for real-time, high-resolution video generation on resource-constrained mobile devices. In this work, we propose MobileI2V, a 270M lightweight diffusion model for real-time image-to-video generation on mobile devices. The core lies in: (1) We analyze the performance of linear attention modules and softmax attention modules on mobile devices, and propose a linear hybrid architecture denoiser that balances generation efficiency and quality. (2) We design a time-step distillation strategy that compresses the I2V sampling steps from more than 20 to only two without significant quality loss, resulting in a 10-fold increase in generation speed. (3) We apply mobile-specific attention optimizations that yield a 2-fold speed-up for attention operations during on-device inference. MobileI2V enables, for the first time, fast 720p image-to-video generation on mobile devices, with quality comparable to existing models. Under one-step conditions, the generation speed of each frame of 720p video is less than 100 ms. Our code is available at: https://github.com/hustvl/MobileI2V.

[140] Frequency-Aware Token Reduction for Efficient Vision Transformer

Dong-Jae Lee,Jiwan Hur,Jaehyun Choi,Jaemyung Yu,Junmo Kim

Main category: cs.CV

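TL;DR: This paper proposes a frequency-aware token reduction strategy that separates high-frequency and low-frequency tokens and aggregates the low-frequency components, improving the efficiency and performance of Vision Transformers while mitigating rank collapse and over-smoothing.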

Details Motivation: The computational complexity of Vision Transformers grows quadratically with token length, and existing token reduction methods overlook the frequency characteristics of self-attention (such as rank collapse and over-smoothing), so a more effective compression strategy is needed. Method: Tokens are partitioned into high-frequency and low-frequency groups; high-frequency tokens are selectively preserved, and low-frequency tokens are aggregated into a compact direct current (DC) token that retains essential low-frequency information. Result: Experiments show that the method significantly improves accuracy while reducing computational overhead and mitigates rank collapse and over-smoothing. The implicit frequency characteristics and limitations of earlier methods are also analyzed. Conclusion: The proposed frequency-aware token reduction strategy improves computational efficiency while preserving model performance and offers a new perspective for understanding and improving token compression in Vision Transformers. Abstract: Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity concerning token length remains a significant challenge. To address this, token reduction methods have been widely explored. However, existing approaches often overlook the frequency characteristics of self-attention, such as the rank collapsing and over-smoothing phenomena. In this paper, we propose a frequency-aware token reduction strategy that improves computational efficiency while preserving performance by mitigating rank collapsing. Our method partitions tokens into high-frequency tokens and low-frequency tokens. High-frequency tokens are selectively preserved, while low-frequency tokens are aggregated into a compact direct current token to retain essential low-frequency components. Through extensive experiments and analysis, we demonstrate that our approach significantly improves accuracy while reducing computational overhead and mitigating rank collapsing and over-smoothing. Furthermore, we analyze the previous methods, shedding light on their implicit frequency characteristics and limitations.
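The partition-and-aggregate step can be illustrated with a toy example. The abstract does not say how tokens are scored as high- or low-frequency, so this sketch assumes a simple proxy: the DC component is the mean over all tokens, and a token's high-frequency energy is the norm of its deviation from that mean. The function name and the `keep` parameter are hypothetical.

```python
import math


def reduce_tokens(tokens, keep):
    """Keep the `keep` tokens with the largest deviation from the token mean and
    merge the remaining (low-frequency) tokens into one averaged DC token.

    `tokens` is a non-empty list of equal-length feature vectors (lists of floats).
    Returns the kept tokens in their original order, followed by the DC token
    when at least one token was merged.
    """
    n, d = len(tokens), len(tokens[0])
    mean = [sum(t[j] for t in tokens) / n for j in range(d)]

    def hf_energy(t):
        # Norm of the residual after removing the DC (mean) component.
        return math.sqrt(sum((t[j] - mean[j]) ** 2 for j in range(d)))

    order = sorted(range(n), key=lambda i: hf_energy(tokens[i]), reverse=True)
    kept_idx = sorted(order[:keep])
    low_idx = order[keep:]

    reduced = [tokens[i] for i in kept_idx]
    if low_idx:
        dc = [sum(tokens[i][j] for i in low_idx) / len(low_idx) for j in range(d)]
        reduced.append(dc)
    return reduced


# Four tokens shrink to two: the outlier is kept, the three similar
# tokens collapse into a single DC token close to [1.0, 1.0].
tokens = [[1.0, 1.0], [1.2, 0.8], [5.0, -3.0], [0.8, 1.2]]
reduced = reduce_tokens(tokens, keep=1)
```

Because self-attention cost is quadratic in the token count, going from n tokens to keep + 1 tokens reduces that cost by roughly a factor of (n / (keep + 1))².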

[141] Merge and Bound: Direct Manipulations on Weights for Class Incremental Learning

Taehoon Kim,Donghwan Jang,Bohyung Han

Main category: cs.CV

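TL;DR: Proposes Merge-and-Bound (M&B), a new training method for class incremental learning that optimizes by directly manipulating model weights in parameter space. It combines inter-task and intra-task weight merging with a bounded update technique to reduce catastrophic forgetting, and improves performance without modifying the architecture or the learning objective.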

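Details Motivation: In class incremental learning, retaining knowledge of old tasks and avoiding catastrophic forgetting is a core challenge. Existing methods usually depend on complex architectural designs or on storing old data, whereas this work aims for a simpler, more general optimization route that operates directly on weights in parameter space. Method: Proposes Merge-and-Bound (M&B), which uses two kinds of weight merging: inter-task merging (averaging the weights of models from all previous stages) and intra-task merging (combining parameters within the current stage). A bounded update technique limits the magnitude of parameter updates so the new model stays close to the old one and forgets less. The method integrates into existing CIL methods without changing the model structure or learning objective. Result: Extensive evaluation on standard CIL benchmarks shows that M&B outperforms current state-of-the-art methods, confirming its effectiveness at mitigating catastrophic forgetting and improving overall performance. Conclusion: By performing controlled weight merging directly in parameter space, M&B offers a simple yet powerful optimization paradigm for class incremental learning, shows that minimizing cumulative updates can balance old and new knowledge, and has good generality and application potential. Abstract: We present a novel training approach, named Merge-and-Bound (M&B), for Class Incremental Learning (CIL), which directly manipulates model weights in the parameter space for optimization. Our algorithm involves two types of weight merging: inter-task weight merging and intra-task weight merging. Inter-task weight merging unifies previous models by averaging the weights of models from all previous stages. On the other hand, intra-task weight merging facilitates the learning of the current task by combining the model parameters within the current stage. For reliable weight merging, we also propose a bounded update technique that aims to optimize the target model with minimal cumulative updates and preserve knowledge from previous tasks; this strategy reveals that it is possible to effectively obtain new models near old ones, reducing catastrophic forgetting. M&B is seamlessly integrated into existing CIL methods without modifying architecture components or revising learning objectives. We extensively evaluate our algorithm on standard CIL benchmarks and demonstrate superior performance compared to state-of-the-art methods.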
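As a rough illustration of the two weight-space operations, the sketch below treats each model as a flat list of weights. Averaging matches the abstract's description of inter-task merging; the L2-ball projection in `bounded_update` is only one plausible way to keep a new model near an old one and is an assumption, as is the `radius` parameter.

```python
import math


def average_weights(models):
    """Inter-task merge: element-wise mean of flat weight vectors from earlier stages."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def bounded_update(old, new, radius):
    """Keep `new` within an L2 ball of the given radius centred at `old`.

    If the update is already small enough it is returned unchanged; otherwise
    it is scaled back onto the ball's surface along the same direction.
    """
    delta = [b - a for a, b in zip(old, new)]
    norm = math.sqrt(sum(x * x for x in delta))
    if norm <= radius:
        return list(new)
    scale = radius / norm
    return [a + scale * x for a, x in zip(old, delta)]


# Merge two earlier-stage models, then clip an overly large step
# (norm 5) taken from the merged weights back to norm 1.
merged = average_weights([[1.0, 2.0], [3.0, 4.0]])
proposed = [merged[0] + 3.0, merged[1] + 4.0]
bounded = bounded_update(merged, proposed, radius=1.0)
```

In a real training loop the same projection would be applied to every parameter tensor (or to their concatenation) after each optimizer step or each stage.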

[142] CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation

Shizhe Sun,Wataru Ohyama

Main category: cs.CV

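TL;DR: Proposes CanKD, a cross-attention-based non-local knowledge distillation method that strengthens knowledge transfer by letting each pixel of the student feature map dynamically attend to all pixels of the teacher feature map.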

Details Motivation: Traditional self-attention-based distillation methods align teacher and student feature maps independently and cannot fully capture long-range dependencies between pixels, which limits feature representation learning. Method: Introduces a cross-attention mechanism for non-local knowledge transfer, implemented by adding a single loss function, so the student network learns the teacher network's feature structure more completely. Result: Outperforms existing feature and hybrid distillation methods on object detection and image segmentation tasks, confirming its effectiveness. Conclusion: CanKD offers a new and effective paradigm for attention-guided distillation in vision tasks. Abstract: We propose Cross-Attention-based Non-local Knowledge Distillation (CanKD), a novel feature-based knowledge distillation framework that leverages cross-attention mechanisms to enhance the knowledge transfer process. Unlike traditional self-attention-based distillation methods that align teacher and student feature maps independently, CanKD enables each pixel in the student feature map to dynamically consider all pixels in the teacher feature map. This non-local knowledge transfer more thoroughly captures pixel-wise relationships, improving feature representation learning. Our method introduces only an additional loss function to achieve superior performance compared with existing attention-guided distillation methods. Extensive experiments on object detection and image segmentation tasks demonstrate that CanKD outperforms state-of-the-art feature and hybrid distillation methods. These experimental results highlight CanKD's potential as a new paradigm for attention-guided distillation in computer vision tasks. Code is available at https://github.com/tori-hotaru/CanKD

[143] Generalized Design Choices for Deepfake Detectors

Lorenzo Pellegrini,Serafino Pandolfini,Davide Maltoni,Matteo Ferrara,Marco Prati,Marco Ramilli

Main category: cs.CV

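TL;DR: This paper systematically studies how different design choices affect the accuracy and generalization of deepfake detection models, aiming to establish architecture-agnostic best practices that improve detection and achieve state-of-the-art results on the AI-GenBench benchmark.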

Details Motivation: The performance of deepfake detection methods is often driven by implementation details (such as data preprocessing, augmentation strategies, and optimization techniques), which makes fair comparison and the identification of key factors difficult. Method: Conducts a systematic experimental analysis that isolates the effect of individual design factors in training, inference, and incremental updates. Result: Identifies a set of design choices that consistently improve deepfake detection and reaches state-of-the-art performance on the AI-GenBench benchmark. Conclusion: Provides a robust, architecture-agnostic set of best practices to guide the design and development of future deepfake detection systems. Abstract: The effectiveness of deepfake detection methods often depends less on their core design and more on implementation details such as data preprocessing, augmentation strategies, and optimization techniques. These factors make it difficult to fairly compare detectors and to understand which factors truly contribute to their performance. To address this, we systematically investigate how different design choices influence the accuracy and generalization capabilities of deepfake detection models, focusing on aspects related to training, inference, and incremental updates. By isolating the impact of individual factors, we aim to establish robust, architecture-agnostic best practices for the design and development of future deepfake detection systems. Our experiments identify a set of design choices that consistently improve deepfake detection and enable state-of-the-art performance on the AI-GenBench benchmark.

[144] Self-Paced Learning for Images of Antinuclear Antibodies

Yiyang Jiang,Guangwu Qian,Jiaxin Wu,Qi Huang,Qing Li,Yongkang Wu,Xiao-Yong Wei

Main category: cs.CV

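TL;DR: This paper proposes a new framework for antinuclear antibody (ANA) detection that uses instance sampling, pseudo-label assignment, and a self-paced learning strategy to handle the complexity of multi-instance, multi-label learning without manual preprocessing, substantially improving detection performance.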

Details Motivation: Manual ANA detection is time-consuming and requires extensive training, many fluorescent pattern combinations exist, and conventional machine learning methods struggle with the multi-instance, multi-label (MIML) challenges of real clinical settings. Method: Proposes an end-to-end MIML framework that works on unaltered microscope images and has three components: an instance sampler (which suppresses low-confidence instances), a probabilistic pseudo-label dispatcher (which assigns labels adaptively), and a self-paced weight adjustment mechanism (which adapts training to label observations). Result: On the ANA dataset, F1-Macro improves by up to 7.0% and mAP by up to 12.6% over the best prior method; on public MIML medical datasets the model ranks in the top two, reducing Hamming loss and one-error by up to 18.2% and 26.9%, respectively. Conclusion: The framework overcomes limitations of traditional MIML methods, achieves more accurate and robust ANA detection, shows good clinical promise, and offers a new approach for MIML tasks in medical imaging. Abstract: Antinuclear antibody (ANA) testing is a crucial method for diagnosing autoimmune disorders, including lupus, Sjögren's syndrome, and scleroderma. Despite its importance, manual ANA detection is slow, labor-intensive, and demands years of training. ANA detection is complicated by over 100 coexisting antibody types, resulting in vast fluorescent pattern combinations. Although machine learning and deep learning have enabled automation, ANA detection in real-world clinical settings presents unique challenges as it involves multi-instance, multi-label (MIML) learning. In this paper, a novel framework for ANA detection is proposed that handles the complexities of MIML tasks using unaltered microscope images without manual preprocessing. Inspired by human labeling logic, it identifies consistent ANA sub-regions and assigns aggregated labels accordingly. These steps are implemented using three task-specific components: an instance sampler, a probabilistic pseudo-label dispatcher, and self-paced weight learning rate coefficients. The instance sampler suppresses low-confidence instances by modeling pattern confidence, while the dispatcher adaptively assigns labels based on instance distinguishability. Self-paced learning adjusts training according to empirical label observations. Our framework overcomes limitations of traditional MIML methods and supports end-to-end optimization. Extensive experiments on one ANA dataset and three public medical MIML benchmarks demonstrate the superiority of our framework. On the ANA dataset, our model achieves up to +7.0% F1-Macro and +12.6% mAP gains over the best prior method, setting new state-of-the-art results. It also ranks top-2 across all key metrics on public datasets, reducing Hamming loss and one-error by up to 18.2% and 26.9%, respectively. The source code can be accessed at https://github.com/fletcherjiang/ANA-SelfPacedLearning.

[145] EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?

Pierre Adorni,Minh-Tan Pham,Stéphane May,Sébastien Lefèvre

Main category: cs.CV

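TL;DR: Proposes an efficient "Ensemble-of-Specialists" framework for remote sensing foundation models, which improves efficiency and extensibility through lightweight specialist models that can be frozen and reused, and supports federated training and continuous integration.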

Details Motivation: Existing remote sensing foundation models depend on very large models and datasets, consume heavy resources, are hard to access widely, and are not environmentally sustainable, so an efficient and sustainable alternative is needed. Method: Uses an Ensemble-of-Specialists framework that decomposes training into multiple lightweight, task-specific ConvNeXtV2 specialists, each of which can be frozen and reused, supporting modular and federated training. Result: The approach is reported to offer strong advantages in efficiency, interpretability, and extensibility, and supports pruning and continuous specialist integration. Conclusion: The framework sets a new direction for building scalable, efficient, and sustainable remote sensing foundation models and is particularly suited to resource-constrained and collaborative settings. Abstract: Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need for training separate models for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to only a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast with the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs.

[146] The Age-specific Alzheimer 's Disease Prediction with Characteristic Constraints in Nonuniform Time Span

Xin Hong,Kaifeng Huang

Main category: cs.CV

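TL;DR: This study proposes a sequential image generation method guided by quantitative metrics and combines it with an age-scaling factor to generate age-specific MRI images, improving the accuracy of long-term Alzheimer's disease prediction.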

Details Motivation: Alzheimer's disease needs to be identified early so that personalized treatment strategies can be developed, but generating images that accurately reflect disease characteristics from inputs captured at irregular time intervals is challenging. Method: Uses a sequential image generation method guided by quantitative metrics and introduces an age-scaling factor to refine MRI image synthesis. Result: Ablation experiments show that quantitative metrics significantly improve the accuracy of MRI image synthesis and that the age-scaled pixel loss helps iterative generation produce higher-quality images, with the Structural Similarity Index reaching 0.882. Conclusion: The method preserves key features of disease progression and improves image-based Alzheimer's disease prediction on irregular time series. Abstract: Alzheimer's disease is a debilitating disorder marked by a decline in cognitive function. Timely identification of the disease is essential for the development of personalized treatment strategies that aim to mitigate its progression. The application of generated images for the prediction of Alzheimer's disease poses challenges, particularly in accurately representing the disease's characteristics when input sequences are captured at irregular time intervals. This study presents an innovative methodology for sequential image generation, guided by quantitative metrics, to maintain the essential features indicative of disease progression. Furthermore, an age-scaling factor is integrated into the process to produce age-specific MRI images, facilitating the prediction of advanced stages of the disease. The results obtained from the ablation study suggest that the inclusion of quantitative metrics significantly improves the accuracy of MRI image synthesis. Furthermore, the application of age-scaled pixel loss contributed to the enhanced iterative generation of MRI images. In terms of long-term disease prognosis, the Structural Similarity Index reached a peak value of 0.882, indicating a substantial degree of similarity in the synthesized images.

[147] Video Generation Models Are Good Latent Reward Models

Xiaoyue Mi,Wenqing Yu,Jiesong Lian,Shibo Jie,Ruizhe Zhong,Zijun Liu,Guozhen Zhang,Zixiang Zhou,Zhiyong Xu,Yuan Zhou,Qinglin Lu,Fan Tang

Main category: cs.CV

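TL;DR: Proposes the PRFL framework, which performs reward feedback learning in the noisy latent space to enable efficient preference optimization for video generation, significantly reducing memory consumption and training time while improving alignment with human preferences.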

Details Motivation: Existing pixel-space video reward models are computationally expensive and memory-hungry, and they optimize only in the late denoising stages, leaving early-stage motion dynamics and structural coherence unsupervised. Method: Exploits the temporal modeling capability of pre-trained video generation models in the noisy latent space to design PRFL, which performs reward modeling and gradient backpropagation entirely in latent space without VAE decoding. Result: Compared with RGB-space ReFL, PRFL significantly reduces memory consumption and training time and shows better alignment with human preferences across multiple experiments. Conclusion: Reward feedback learning in latent space is a more efficient and effective option for video generation, and PRFL offers a new viable path for preference optimization in video generation. Abstract: Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.

[148] UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes

Kang Du,Xue Liao,Junpeng Xia,Chaozheng Guo,Yi Gu,Yirui Guan,Duotun Wang,Sheng Huang,Zeyu Wang

Main category: cs.CV

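TL;DR: This paper presents UAVLight, a controlled yet real benchmark dataset for illumination-robust 3D reconstruction, which enables evaluation under natural lighting variation with consistent geometry by repeating captures at different times of day.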

Details Motivation: Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction; existing datasets cannot effectively separate the effects of lighting changes from geometric or semantic changes, which makes illumination robustness hard to evaluate. Method: Designs repeatable, geo-referenced UAV flight paths and captures the same scene at multiple fixed times of day, keeping geometry, calibration, and viewpoints consistent while introducing natural lighting variation, and provides standardized evaluation protocols. Result: The UAVLight dataset achieves illumination diversity while preserving scene consistency, supporting reliable evaluation of MVS, SfM, and neural rendering methods under lighting changes. Conclusion: UAVLight provides a reliable illumination-robustness benchmark for outdoor multi-view 3D reconstruction and promotes the development of methods that are consistent, faithful, and relightable in real environments. Abstract: Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable. However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness. We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints. With standardized evaluation protocols across lighting conditions, UAVLight provides a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments.

[149] Multimodal Robust Prompt Distillation for 3D Point Cloud Models

Xiang Gu,Liming Lu,Xu Zheng,Anan Du,Yongbin Zhou,Shuchao Pang

Main category: cs.CV

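TL;DR: Proposes an efficient Multimodal Robust Prompt Distillation (MRPD) framework that uses a teacher-student setup to strengthen the defense of 3D point cloud models against adversarial attacks; knowledge is distilled during training, so inference incurs no extra overhead.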

Details Motivation: Existing adversarial defenses for 3D point clouds suffer from high computational overhead and poor generalization across attack types, so a more efficient and general solution is needed. Method: Designs the teacher-student framework MRPD, in which three teachers (a vision model on depth projections, a high-performance 3D model, and a text encoder) produce robust embeddings, and the student model's features are aligned to them through lightweight prompt learning and a confidence-gated mechanism. Result: MRPD substantially outperforms existing defenses under a variety of white-box and black-box attacks, performs better on clean data, and adds no computational cost at inference. Conclusion: MRPD offers a practical new paradigm for building robust 3D vision systems by efficiently fusing multimodal knowledge. Abstract: Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applications. Existing defense methods often suffer from (1) high computational overhead and (2) poor generalization ability across diverse attack types. To bridge these gaps, we propose a novel yet efficient teacher-student framework, namely Multimodal Robust Prompt Distillation (MRPD), for distilling robust 3D point cloud models. It learns lightweight prompts by aligning the student point cloud model's features with robust embeddings from three distinct teachers: a vision model processing depth projections, a high-performance 3D model, and a text encoder. To ensure a reliable knowledge transfer, this distillation is guided by a confidence-gated mechanism which dynamically balances the contribution of all input modalities. Notably, since the distillation happens entirely during the training stage, there is no additional computational cost at inference. Extensive experiments demonstrate that MRPD substantially outperforms state-of-the-art defense methods against a wide range of white-box and black-box attacks, while even achieving better performance on clean data. Our work presents a new, practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge.

[150] Enhanced Landmark Detection Model in Pelvic Fluoroscopy using 2D/3D Registration Loss

Chou Mo,Yehyun Suh,J. Ryan Martin,Daniel Moyer

Main category: cs.CV

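TL;DR: Proposes a U-Net framework that incorporates 2D/3D landmark registration to improve landmark detection accuracy on intra-operative pelvic X-ray images under variable patient pose.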

Details Motivation: Most existing pelvic X-ray landmark detection methods assume a fixed antero-posterior view and struggle with changes in intra-operative imaging angle and patient positioning. Method: Introduces 2D/3D landmark registration information into U-Net training, using a Pose Estimation Loss for training or fine-tuning to improve detection robustness under non-standard views. Result: Compared with the baseline U-Net, models trained with the Pose Estimation Loss show higher landmark detection accuracy under variable intra-operative conditions. Conclusion: The framework effectively improves automated landmark detection when patient pose varies and has strong potential for clinical application. Abstract: Automated landmark detection offers an efficient approach for medical professionals to understand patient anatomic structure and positioning using intra-operative imaging. While current detection methods for pelvic fluoroscopy demonstrate promising accuracy, most assume a fixed Antero-Posterior view of the pelvis. However, orientation often deviates from this standard view, either due to repositioning of the imaging unit or of the target structure itself. To address this limitation, we propose a novel framework that incorporates 2D/3D landmark registration into the training of a U-Net landmark prediction model. We analyze the performance difference by comparing landmark detection accuracy between the baseline U-Net, U-Net trained with Pose Estimation Loss, and U-Net fine-tuned with Pose Estimation Loss under realistic intra-operative conditions where patient pose is variable.

[151] Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

Teng Hu,Zhentao Yu,Guozhen Zhang,Zihan Su,Zhengguang Zhou,Youliang Zhang,Yuan Zhou,Qinglin Lu,Ran Yi

Main category: cs.CV

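TL;DR: This paper presents the Harmony framework, which addresses the key challenge of audio-video synchronization in generative AI through cross-task synergy training, a global-local decoupled interaction module, and a synchronization-enhanced CFG method, significantly improving the quality and alignment accuracy of audio-video generation.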

Details Motivation: Existing open-source models fall short on audio-video alignment, limited mainly by correspondence drift in the joint diffusion process, inefficient global attention mechanisms, and the intra-modal bias of conventional CFG. Method: Proposes the Harmony framework: 1) cross-task synergy training to reduce drift; 2) a Global-Local Decoupled Interaction Module for precise temporal alignment; 3) Synchronization-Enhanced CFG (SyncCFG), which strengthens the alignment signal during inference. Result: Experiments show that Harmony significantly outperforms existing methods in both generation fidelity and fine-grained audio-video synchronization, reaching a new state of the art. Conclusion: Through its mechanistic design, Harmony effectively addresses the root problems of audio-video synchronization and offers a new optimization direction for multimodal generative models. Abstract: The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
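As background for the "conventional Classifier-Free Guidance" that this entry contrasts with SyncCFG, the standard CFG update is shown below. The notation ($\epsilon_\theta$ for the denoiser, $c$ for the condition, $w$ for the guidance scale) is the common convention and is assumed here; the abstract does not give the exact form of SyncCFG, so only the baseline is written out.

```latex
\hat{\epsilon}_\theta(x_t, c) =
  \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)
```

With $w = 1$ this reduces to the plain conditional prediction; larger $w$ pushes the sample toward the condition $c$. Because the difference term measures the effect of the condition as a whole, it does not by itself single out cross-modal timing, which is consistent with the intra-modal bias the abstract describes.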

[152] Deep Learning-Based Multiclass Classification of Oral Lesions with Stratified Augmentation

Joy Naoum,Revana Salama,Ali Hamdi

Main category: cs.CV

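TL;DR: This study proposes a deep learning-based multiclass classifier for 16 different oral lesions and uses stratified data splitting, data augmentation, and oversampling to overcome a limited and imbalanced dataset. Experimental results show that the model outperforms existing methods in accuracy, precision, and recall, indicating potential for early detection of oral cancer.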

Details Motivation: Because benign and malignant lesions are hard to tell apart visually at an early stage, oral cancer is often diagnosed late, so a reliable computer-aided diagnosis system is needed to improve early detection. Method: Builds a multiclass classifier with deep learning, combining stratified data splitting, advanced data augmentation, and oversampling. Result: The experiments reach 83.33% accuracy, 89.12% precision, and 77.31% recall, outperforming current state-of-the-art methods. Conclusion: The proposed framework performs notably well on minority classes, demonstrating the effectiveness of the oversampling and augmentation strategies and offering a promising first step toward trustworthy computer-aided diagnosis systems in clinical settings. Abstract: Oral cancer is highly common across the globe and is mostly diagnosed during the later stages due to the close visual similarity among benign, precancerous, and malignant lesions in the oral cavity. Implementing computer-aided diagnosis systems early on has the potential to greatly improve clinical outcomes. This research uses deep learning to build a multiclass classifier for sixteen different oral lesions. To overcome the challenges of limited and imbalanced datasets, the proposed technique combines stratified data splitting with advanced data augmentation and oversampling to perform the classification. The experimental results, which achieved 83.33 percent accuracy, 89.12 percent precision, and 77.31 percent recall, demonstrate the superiority of the suggested model over state-of-the-art methods now in use. The suggested model demonstrates the effectiveness of oversampling and augmentation strategies, with noteworthy classification performance on minority classes. As a first step toward trustworthy computer-aided diagnostic systems for the early detection of oral cancer in clinical settings, the suggested framework shows promise.
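The stratified splitting and oversampling mentioned in this entry can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the 20% test fraction, and plain random duplication as the oversampling scheme are all hypothetical choices.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # At least one test sample per class (assumes every class has >= 2 samples).
        n_test = max(1, round(len(idxs) * test_fraction))
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


def oversample(train_idx, labels, seed=0):
    """Randomly duplicate minority-class indices until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        balanced.extend(rng.choice(idxs) for _ in range(target - len(idxs)))
    return balanced
```

Oversampling is applied only to the training indices, after the split, so that duplicated minority samples cannot leak into the test set and inflate the reported metrics.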

[153] MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training

Haotian Xue,Qi Chen,Zhonghao Wang,Xun Huang,Eli Shechtman,Jinrong Xie,Yongxin Chen

Main category: cs.CV

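TL;DR: MoGAN is a motion-centric post-training framework that needs no reward model or human preference data; using a DiT-based optical-flow discriminator and a distribution-matching regularizer, it significantly improves the motion realism and temporal consistency of video diffusion models while preserving visual fidelity.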

Details Motivation: Existing video diffusion models achieve good frame-level fidelity but lack direct supervision on temporal consistency, so generated videos show jitter, ghosting, or implausible dynamics. Method: On top of a 3-step distilled video diffusion model, trains a DiT-based optical-flow discriminator to distinguish real from generated motion, combined with a distribution-matching regularizer to preserve visual quality. Result: Experiments on Wan2.1-T2V-1.3B show that on VBench MoGAN improves the motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model; on VideoJAM-Bench the gains are +7.4% and +8.8% respectively, with aesthetic and image-quality scores maintained or even improved. A human study also shows a preference for MoGAN on motion quality. Conclusion: MoGAN effectively improves motion realism and temporal consistency in video generation without sacrificing visual quality or inference efficiency, offering a practical path toward fast, high-quality video generation. Abstract: Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, allowing models to achieve low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Built atop a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity. With experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage: https://xavihart.github.io/mogan.

[154] ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images

M. Naseer Subhani

Main category: cs.CV

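TL;DR: Proposes a self-prompting, point-supervised framework that improves SAM's segmentation performance on remote sensing images through a Refine-Requery-Reinforce loop; using only sparse point annotations, it clearly outperforms pretrained SAM and other point-supervised methods.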

Details Motivation: Because of severe domain shift and scarce dense annotations, existing interactive segmentation models such as SAM perform poorly on remote sensing images, so an efficient segmentation method adapted to the remote sensing domain is needed. Method: Uses a Refine-Requery-Reinforce loop: coarse pseudo-masks are first generated from the initial points (Refine), the results are then improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce), achieving self-guided prompt adaptation without full-mask supervision. Result: The method is validated on three remote sensing benchmark datasets (WHU, HRSID, and NWPU VHR-10), where it consistently outperforms pretrained SAM and recent point-supervised segmentation methods. Conclusion: Self-prompting and semantic alignment provide an effective path toward scalable adaptation of foundation segmentation models to remote sensing applications using point-level annotations. Abstract: Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting, point-supervised framework that adapts SAM to RSIs using only sparse point annotations. Our method employs a Refine-Requery-Reinforce loop, where coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM's segmentation quality and domain robustness through self-guided prompt adaptation. We evaluate our proposed method on three RSI benchmark datasets, including WHU, HRSID, and NWPU VHR-10, showing that our method consistently surpasses pretrained SAM and recent point-supervised segmentation methods. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.
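One plausible reading of the "self-constructed box prompts" in the Requery step is that a tight bounding box is derived from the current pseudo-mask and fed back to the model as a prompt. The sketch below shows only that mask-to-box step; `mask_to_box` is a hypothetical helper, and the abstract does not confirm that the boxes are built this way.

```python
def mask_to_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the foreground pixels, or None if empty.

    `mask` is a list of rows, each a list of 0/1 values (a binary pseudo-mask).
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None  # no foreground, so there is no box to prompt with
    return (min(cols), min(rows), max(cols), max(rows))
```

In a loop of this kind, the returned box would be passed to the segmentation model in place of, or alongside, the original point; an empty mask returns `None` so the caller can fall back to the point prompt.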

[155] Active Learning for GCN-based Action Recognition

Hichem Sahbi

Main category: cs.CV

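TL;DR: Proposes a label-efficient graph convolutional network (GCN) model that selects informative samples with an adversarial strategy and introduces bidirectional and stable GCN architectures, improving skeleton-based action recognition when labeled data is scarce.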

Details Motivation: Existing GCNs for skeleton-based action recognition depend on large amounts of labeled data, which are scarce in practice, limiting their applicability. Method: Designs a new acquisition function that uses an adversarial strategy to balance representativeness, diversity, and uncertainty when selecting the most informative samples; also proposes bidirectional and stable GCN architectures that strengthen the mapping between the ambient and latent spaces. Result: Extensive evaluation on two challenging skeleton-based action recognition benchmarks shows that the proposed method significantly outperforms prior work, especially when labeled data is limited. Conclusion: The proposed label-efficient GCN model reduces dependence on labeled data while improving recognition performance, making it suitable for skeleton-based action recognition in practical settings. Abstract: Despite the notable success of graph convolutional networks (GCNs) in skeleton-based action recognition, their performance often depends on large volumes of labeled data, which are frequently scarce in practical settings. To address this limitation, we propose a novel label-efficient GCN model. Our work makes two primary contributions. First, we develop a novel acquisition function that employs an adversarial strategy to identify a compact set of informative exemplars for labeling. This selection process balances representativeness, diversity, and uncertainty. Second, we introduce bidirectional and stable GCN architectures. These enhanced networks facilitate a more effective mapping between the ambient and latent data spaces, enabling a better understanding of the learned exemplar distribution. Extensive evaluations on two challenging skeleton-based action recognition benchmarks reveal significant improvements achieved by our label-efficient GCNs compared to prior work.

[156] Qwen3-VL Technical Report

Shuai Bai,Yuxuan Cai,Ruizhe Chen,Keqin Chen,Xionghui Chen,Zesen Cheng,Lianghao Deng,Wei Ding,Chang Gao,Chunjiang Ge,Wenbin Ge,Zhifang Guo,Qidong Huang,Jie Huang,Fei Huang,Binyuan Hui,Shutong Jiang,Zhaohai Li,Mingsheng Li,Mei Li,Kaixin Li,Zicheng Lin,Junyang Lin,Xuejing Liu,Jiawei Liu,Chenglong Liu,Yang Liu,Dayiheng Liu,Shixuan Liu,Dunjie Lu,Ruilin Luo,Chenxu Lv,Rui Men,Lingchen Meng,Xuancheng Ren,Xingzhang Ren,Sibo Song,Yuchong Sun,Jun Tang,Jianhong Tu,Jianqiang Wan,Peng Wang,Pengfei Wang,Qiuyue Wang,Yuxuan Wang,Tianbao Xie,Yiheng Xu,Haiyang Xu,Jin Xu,Zhibo Yang,Mingkun Yang,Jianxin Yang,An Yang,Bowen Yu,Fei Zhang,Hang Zhang,Xi Zhang,Bo Zheng,Humen Zhong,Jingren Zhou,Fan Zhou,Jing Zhou,Yuanzhi Zhu,Ke Zhu

Main category: cs.CV

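TL;DR: Qwen3-VL is the most capable multimodal model in the Qwen series; it supports interleaved text, image, and video inputs of up to 256K tokens, performs strongly on pure-text understanding, long-context modeling, and multimodal reasoning, and achieves advanced performance on image and video tasks through three architectural upgrades.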

Details Motivation: To improve vision-language models on long contexts, interleaved multimodal content, and complex reasoning tasks, and to overcome the limitations of existing models in cross-modal alignment, spatial-temporal modeling, and temporal grounding. Method: Proposes the Qwen3-VL model with three key techniques: an enhanced interleaved-MRoPE for better spatial-temporal modeling; DeepStack integration, which fuses multi-level ViT features to strengthen vision-language alignment; and text-based time alignment for more precise temporal grounding in video. Dense and mixture-of-experts (MoE) variants are provided at multiple scales. Result: Qwen3-VL achieves leading performance on several established multimodal benchmarks such as MMMU, MathVista, and MathVision, with strong pure-text understanding, 256K long-context support, and advanced reasoning on single-image, multi-image, and video tasks. Conclusion: Qwen3-VL is a capable multimodal foundation model suited to practical applications such as image-grounded reasoning, agentic decision-making, and multimodal code generation, providing a solid technical basis for real-world workflows. Abstract: We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

[157] Continual Error Correction on Low-Resource Devices

Kirill Paramonov,Mete Ozay,Aristeidis Mystakidis,Nikolaos Tsalikidis,Dimitrios Sotos,Anastasios Drosou,Dimitrios Tzovaras,Hyunjun Kim,Kiseok Chang,Sangdok Mo,Namwoong Kim,Woojong Yoo,Jijoong Moon,Umberto Michieli

Main category: cs.CV

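TL;DR: Proposes an efficient AI error correction system based on prototype updates, combining server-side knowledge distillation with on-device prototype adaptation to correct AI prediction errors on resource-constrained devices with low overhead and few samples.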

Details Motivation: Existing AI error detection methods lack efficient correction mechanisms for resource-constrained devices; frequent model retraining must be avoided to keep compute and storage overhead low. Method: A server-side foundation model is used for knowledge distillation to produce a lightweight feature extractor; on the device, a prototype update mechanism enables few-shot error correction without retraining the model. Result: In one-shot scenarios on the Food-101 and Flowers-102 datasets, the system achieves an error correction rate above 50% with a forgetting rate below 0.02% and very low computational overhead; its practicality is validated with an Android app. Conclusion: The system achieves efficient, low-resource AI error correction and is suitable for real-world deployment on resource-constrained edge devices. Abstract: The proliferation of AI models in everyday devices has highlighted a critical challenge: prediction errors that degrade user experience. While existing solutions focus on error detection, they rarely provide efficient correction mechanisms, especially for resource-constrained devices. We present a novel system enabling users to correct AI misclassifications through few-shot learning, requiring minimal computational resources and storage. Our approach combines server-side foundation model training with on-device prototype-based classification, enabling efficient error correction through prototype updates rather than model retraining. The system consists of two key components: (1) a server-side pipeline that leverages knowledge distillation to transfer robust feature representations from foundation models to device-compatible architectures, and (2) a device-side mechanism that enables ultra-efficient error correction through prototype adaptation. We demonstrate our system's effectiveness on both image classification and object detection tasks, achieving over 50% error correction in one-shot scenarios on Food-101 and Flowers-102 datasets while maintaining minimal forgetting (less than 0.02%) and negligible computational overhead. Our implementation, validated through an Android demonstration app, proves the system's practicality in real-world scenarios.
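The idea of correcting errors "through prototype updates rather than model retraining" can be illustrated with a nearest-prototype classifier whose class prototypes are running means of embeddings. This is a minimal sketch under assumptions: the Euclidean distance, the running-mean update, and the function names are illustrative choices, not details taken from the paper.

```python
import math


def nearest_prototype(embedding, prototypes):
    """Return the label whose prototype is closest (Euclidean) to the embedding."""
    return min(prototypes, key=lambda label: math.dist(embedding, prototypes[label]))


def correct_error(embedding, true_label, prototypes, counts):
    """Fold one user-corrected sample into the running-mean prototype of its true class.

    No network weights change; only the stored prototype vector and its count move.
    """
    n = counts.get(true_label, 0)
    old = prototypes.get(true_label, [0.0] * len(embedding))
    prototypes[true_label] = [(o * n + e) / (n + 1) for o, e in zip(old, embedding)]
    counts[true_label] = n + 1
```

For example, with prototypes `{"cat": [0.0, 0.0], "dog": [4.0, 0.0]}` an embedding `[2.5, 0.0]` is first labeled "dog"; after a single call to `correct_error` with the true label "cat", the same embedding is classified as "cat". The update costs one vector average per correction, which is why this style of correction fits low-resource devices.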

[158] CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow

Ruisheng Han,Kanglei Zhou,Shuang Chen,Amir Atapour-Abarghouei,Hubert P. H. Shum

Main category: cs.CV

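TL;DR: Proposes the CaFlow framework, which combines counterfactual de-confounding with bidirectional time-conditioned flow for long-term action quality assessment and achieves state-of-the-art performance.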

Details Motivation: Long-term action quality assessment must model long-range temporal dynamics while resisting contextual confounders; existing methods depend on costly annotations or unidirectional temporal modeling and are prone to spurious correlations and unstable representations. Method: Proposes the CaFlow framework, which includes a Causal Counterfactual Regularization (CCR) module that separates causal and confounding features in a self-supervised manner and strengthens causal robustness through counterfactual interventions, and a BiT-Flow module that models forward and backward dynamics under a cycle-consistency constraint to produce smoother, more coherent representations. Result: Extensive experiments on multiple long-term AQA benchmarks show that CaFlow achieves state-of-the-art performance. Conclusion: By integrating counterfactual de-confounding and bidirectional temporal modeling, CaFlow effectively improves the accuracy and robustness of long-term action quality assessment and has broad application prospects. Abstract: Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation. Long-term AQA, as in figure skating or rhythmic gymnastics, is especially challenging since it requires modeling extended temporal dynamics while remaining robust to contextual confounders. Existing approaches either depend on costly annotations or rely on unidirectional temporal modeling, making them vulnerable to spurious correlations and unstable long-term representations. To this end, we propose CaFlow, a unified framework that integrates counterfactual de-confounding with bidirectional time-conditioned flow. The Causal Counterfactual Regularization (CCR) module disentangles causal and confounding features in a self-supervised manner and enforces causal robustness through counterfactual interventions, while the BiT-Flow module models forward and backward dynamics with a cycle-consistency constraint to produce smoother and more coherent representations. Extensive experiments on multiple long-term AQA benchmarks demonstrate that CaFlow achieves state-of-the-art performance. Code is available at https://github.com/Harrison21/CaFlow

[159] Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

Tianyi Xiong,Yi Ge,Ming Li,Zuolong Zhang,Pranav Kulkarni,Kaishen Wang,Qi He,Zeying Zhu,Chenxi Liu,Ruibo Chen,Tong Zheng,Yanshuo Chen,Xiyao Wang,Renrui Zhang,Wenhu Chen,Heng Huang

Main category: cs.CV

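TL;DR: This paper presents the Multi-Crit benchmark for evaluating how well large multimodal models follow diverse, fine-grained evaluation criteria, and reveals that existing models lack consistency and flexibility in multi-criterion judgment.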

Details Motivation: Although large multimodal models (LMMs) are widely used as judges in multimodal evaluation, their ability to follow diverse, fine-grained evaluation criteria has not been fully explored. Method: Builds a benchmark named Multi-Crit that covers open-ended generation and verifiable reasoning tasks, collects challenging response pairs with multi-criterion human annotations through a rigorous data curation pipeline, and proposes three new metrics to evaluate pluralistic criteria adherence, criterion-switching flexibility, and recognition of preference conflicts. Result: A comprehensive analysis of 25 LMMs shows that: 1) proprietary models still struggle to adhere consistently to pluralistic criteria, especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to fine-grained criterion-level judgment. Conclusion: As the first benchmark to systematically evaluate multimodal judges' criteria-following ability, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation systems. Abstract: Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria, especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.

[160] Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models

Naifu Zhang,Wei Tao,Xi Xiao,Qianpu Sun,Yuxin Zheng,Wentao Mo,Peiqiang Wang,Nan Zhang

Main category: cs.CV

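TL;DR: This paper proposes a new adversarial attack framework named ADVLA, which applies perturbations directly to the features that the visual encoder maps into the textual feature space, disrupting the action predictions of Vision-Language-Action (VLA) models efficiently and inconspicuously. The method requires no costly end-to-end training and offers low amplitude, sparsity, and a high attack success rate.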

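Details Motivation: Existing adversarial attack methods for VLA models rely on costly end-to-end training and generate conspicuous perturbation patches, which limits their practicality; a more efficient and less noticeable attack is therefore needed. Method: The ADVLA framework applies adversarial perturbations directly to features output by the visual encoder and projected into the textual feature space, uses attention guidance to make the perturbations focused and sparse, and proposes three strategies: enhancing sensitivity, enforcing sparsity, and concentrating the perturbed regions. With a Top-K masking strategy, fewer than 10% of image patches are modified under an L∞=4/255 constraint. Result: Experiments show that ADVLA reaches an attack success rate of nearly 100% at very low perturbation amplitude; the perturbations are concentrated on critical regions and almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. Conclusion: ADVLA effectively weakens the downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoids the high training cost and conspicuous perturbations of traditional patch attacks, and demonstrates distinctive effectiveness and practical value for attacks on the VLA feature space. Abstract: In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies less than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.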
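The combination of Top-K masking and an $L_{\infty}$ bound described in this entry can be sketched as two small steps: keep only the K patches with the highest attention, then clip whatever perturbation remains to the allowed amplitude. The code below is an illustrative assumption about how such a constraint could be applied; the function names, the per-patch list layout, and the use of raw attention scores for ranking are hypothetical and not taken from the paper.

```python
def topk_patch_mask(attention, k):
    """Return a 0/1 mask that keeps the k patches with the highest attention score."""
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(attention))]


def sparse_bounded_perturbation(delta, mask, eps=4 / 255):
    """Clip each per-patch perturbation to [-eps, eps] and zero it outside the mask.

    `delta` is a list of patches, each a list of perturbation values.
    """
    return [
        [max(-eps, min(eps, value)) * keep for value in patch]
        for patch, keep in zip(delta, mask)
    ]
```

With 4 patches and `k=1`, only the single most-attended patch carries a nonzero perturbation, and every value in it stays within `4/255` in magnitude, which mirrors the "focused and sparse" behaviour the abstract reports.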

[161] Revolutionizing Glioma Segmentation & Grading Using 3D MRI - Guided Hybrid Deep Learning Models

Pandiyaraju V,Sreya Mynampati,Abishek Karthik,Poovarasan L,D. Saraswathi

Main category: cs.CV

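TL;DR: Proposes a hybrid deep learning model that combines U-Net segmentation with DenseNet-VGG classification for accurate segmentation and classification of gliomas in 3D MRI; it introduces multi-head attention and spatial-channel attention mechanisms and reports a 98% Dice coefficient and 99% classification accuracy, outperforming traditional methods.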

Details Motivation: Gliomas have a high mortality rate, and early, accurate diagnosis is critical for treatment, but traditional CNN models lack accuracy and interpretability when handling high-dimensional 3D MRI data. Method: Uses U-Net for tumor segmentation and a hybrid DenseNet-VGG network for classification, with multi-head attention and spatial-channel attention mechanisms; the 3D MRI data are preprocessed with normalization, resampling, and data augmentation. Result: Reaches a 98% Dice coefficient and a high IoU on segmentation, and 99% accuracy with correspondingly high precision, recall, and F1-score on classification, outperforming traditional CNNs and attention-free models. Conclusion: The hybrid framework shows superior performance in automatic glioma segmentation and classification, improves interpretability and clinical usefulness, and has the potential to help clinicians diagnose and grade gliomas quickly and reliably. Abstract: Gliomas are brain tumors with a high mortality rate, so early and accurate diagnosis is important for therapeutic intervention. To address this difficulty, the proposed research develops a hybrid deep learning model that integrates U-Net-based segmentation with a hybrid DenseNet-VGG classification network equipped with multi-head attention and spatial-channel attention. The segmentation model precisely delineates tumors in a 3D MRI volume, guided by spatial and contextual information. The classification network, which combines DenseNet and VGG branches, takes the delineated tumor as input and uses attention mechanisms to focus on clinically relevant features. Preprocessing steps (normalization, resampling, and data augmentation) allow the model to use high-dimensional 3D MRI data. The framework is evaluated with several measures: the Dice coefficient and mean Intersection over Union (IoU) for segmentation, and accuracy, precision, recall, and F1-score for classification. In testing, the proposed hybrid framework obtained a Dice coefficient of 98% for tumor segmentation and 99% classification accuracy, outperforming traditional CNN models and attention-free methods. The multi-head attention mechanisms prioritize clinically significant aspects of the tumor, improving interpretability and accuracy. The results suggest that the framework has great potential to help clinicians diagnose and grade glioma in a timely and reliable way, allowing for better planning of patient treatment.
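The two segmentation metrics named in this entry, the Dice coefficient and Intersection over Union, have standard definitions that are easy to state in code. The helper below is a minimal sketch for flat binary masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions, and the paper's own evaluation code is not shown in the abstract.

```python
def dice_and_iou(pred, target):
    """Compute (Dice, IoU) for two flat binary masks (lists of 0/1) of equal length."""
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    # Convention (assumed): two empty masks count as a perfect match.
    dice = 2 * intersection / (pred_sum + target_sum) if (pred_sum + target_sum) else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou
```

The two scores are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU for the same prediction; a 98% Dice corresponds to an IoU of roughly 96%.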

[162] Seeing without Pixels: Perception from Camera Trajectories

Zihui Xue,Kristen Grauman,Dima Damen,Andrew Zisserman,Tengda Han

Main category: cs.CV

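TL;DR: This paper is the first to systematically study whether video content can be perceived from the camera trajectory alone, without pixels. It proposes CamFormer, a contrastive learning framework that maps camera pose trajectories into a joint embedding space aligned with natural language. The results show that the camera motion trajectory is a highly informative signal that reveals the activity or observed content in a video, supports applications such as cross-modal alignment, classification, and temporal analysis, and remains robust across different pose estimation methods.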

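Details Motivation: To explore whether video content can be understood from camera motion trajectories alone, without visual pixels, challenging the conventional video understanding paradigm and probing the semantic expressiveness of a non-traditional modality. Method: Proposes CamFormer, a contrastive learning-based encoder framework that encodes camera pose sequences into embeddings aligned with natural language descriptions, trained to capture semantic information in a cross-modal space. Result: CamFormer performs well on a range of downstream tasks, including cross-modal retrieval, action classification, and temporal analysis; its performance remains robust across different camera pose estimation methods (high-fidelity multi-sensor and standard RGB-only input), validating the camera trajectory as a lightweight, general-purpose modality. Conclusion: The camera trajectory itself carries rich semantic information and can serve as a lightweight, robust, and versatile modality for video content understanding, suggesting new directions for video analysis in low-bandwidth and privacy-preserving settings. Abstract: Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.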
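A common way to align two modalities in a joint embedding space is an InfoNCE-style contrastive loss, where each trajectory embedding should be most similar to its own caption among all captions in the batch. The sketch below assumes that loss, cosine similarity, and a temperature of 0.07; the abstract only says "contrastive learning framework", so the exact objective used for CamFormer is not confirmed, and all names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(traj_embs, text_embs, temperature=0.07):
    """Mean cross-entropy of matching trajectory i to text i among all texts in the batch."""
    total = 0.0
    for i, traj in enumerate(traj_embs):
        logits = [cosine(traj, text) / temperature for text in text_embs]
        top = max(logits)  # subtract the max before exponentiating for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(traj_embs)
```

The loss is near zero when each trajectory embedding points in the same direction as its paired text embedding and grows when the pairing is wrong, which is the pressure that pulls matching trajectory and language embeddings together during training.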

[163] Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Yusuf Dalva,Guocheng Gordon Qian,Maya Goldenberg,Tsai-Shien Chen,Kfir Aberman,Sergey Tulyakov,Pinar Yanardag,Kuan-Chieh Jackson Wang

Main category: cs.CV

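TL;DR: This paper proposes Canvas-to-Image, a framework that encodes multiple heterogeneous control signals (such as text, pose, and layout) into a single composite canvas image, enabling high-fidelity, multimodal control of image generation.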

Details Motivation: Existing diffusion models struggle to keep generated images faithful and compositional when handling several control inputs at once, such as text, reference images, and spatial layouts, and they lack a unified control mechanism. Method: The approach fuses multiple control signals into a single composite canvas image and introduces a Multi-Task Canvas Training strategy that trains the diffusion model to jointly understand these signals within a unified learning paradigm. Result: The method is validated on multi-task datasets; experiments show that it significantly outperforms existing state-of-the-art methods in identity preservation and control adherence, particularly in complex scenarios such as multi-person composition, pose control, and layout constraints. Conclusion: Canvas-to-Image provides a unified, general framework for multimodal control that integrates heterogeneous control signals and improves the fidelity and flexibility of diffusion models under complex user intent. Abstract: While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.