cs.CL

[1] Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability

Hen-Hsen Huang

Main category: cs.CL

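TL;DR: This paper argues that LLM efficiency research should shift from large-scale, complex methods toward simple, efficient solutions suited to resource-constrained settings. It proposes retrofitting architectures without retraining, lightweight fine-tuning, economical reasoning, and dynamic knowledge management, and advocates "Overhead-Aware Efficiency", which accounts for adoption cost, sustainability, and fairness, as a new standard.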

Details

Motivation: Existing LLM efficiency techniques (such as MoE, speculative decoding, and complex RAG) depend on large-scale infrastructure and specialist teams, which puts them out of reach for small and mid-sized organizations and worsens technological inequality and carbon emissions.

Method: Proposes a new research agenda: retrofitting pretrained model architectures without retraining, developing lightweight fine-tuning methods, making long chain-of-thought reasoning economical, enabling knowledge management without heavy RAG, and introducing Overhead-Aware Efficiency (OAE) as an evaluation standard.

Result: Outlines a paradigm for using LLMs in resource-constrained settings that is efficient, robust, and easy to deploy, lowering the barrier to deployment.

Conclusion: Redefining efficiency to include adoption cost, sustainability, and fairness helps democratize LLM deployment and reduces the technology gap and the environmental burden.

Abstract: Large language models (LLMs) have become indispensable, but the most celebrated efficiency methods -- mixture-of-experts (MoE), speculative decoding, and complex retrieval-augmented generation (RAG) -- were built for hyperscale providers with vast infrastructure and elite teams. Outside that context, their benefits collapse into overhead, fragility, and wasted carbon. The result is that a handful of Big Tech companies benefit, while thousands of hospitals, schools, governments, and enterprises are left without viable options. We argue that the next frontier is not greater sophistication at scale, but robust simplicity: efficiency that thrives under modest resources and minimal expertise. We propose a new research agenda: retrofitting pretrained models with more efficient architectures without retraining, inventing lightweight fine-tuning that preserves alignment, making reasoning economical despite long chains of thought, enabling dynamic knowledge management without heavy RAG pipelines, and adopting Overhead-Aware Efficiency (OAE) as a standard benchmark. By redefining efficiency to include adoption cost, sustainability, and fairness, we can democratize LLM deployment -- ensuring that optimization reduces inequality and carbon waste rather than amplifying them.

[2] Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology

Tcharlies Schmitz

Main category: cs.CL

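TL;DR: This paper proposes Harmonic Token Projection (HTP), a reversible and deterministic framework for generating text embeddings without training, vocabularies, or stochastic parameters.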

Details

Motivation: To provide a transparent, efficient, and interpretable text embedding method that avoids the dependence of conventional neural embeddings on training data and heavy computation.

Method: Converts each token's Unicode integer representation into a harmonic trajectory and encodes it analytically as a point in continuous vector space, giving a bijective mapping between discrete symbols and the vector space.

Result: On STS-B and its multilingual extension, HTP reaches a Spearman correlation of ρ = 0.68 in English and stays stable across ten languages, at very low computational cost and with sub-millisecond latency per sentence pair.

Conclusion: Semantic relations can emerge from deterministic geometric structure; HTP offers a transparent and efficient alternative to data-driven embeddings.

Abstract: This paper introduces the Harmonic Token Projection (HTP), a reversible and deterministic framework for generating text embeddings without training, vocabularies, or stochastic parameters. Unlike neural embeddings that rely on statistical co-occurrence or optimization, HTP encodes each token analytically as a harmonic trajectory derived from its Unicode integer representation, establishing a bijective and interpretable mapping between discrete symbols and continuous vector space. The harmonic formulation provides phase-coherent projections that preserve both structure and reversibility, enabling semantic similarity estimation from purely geometric alignment. Experimental evaluation on the Semantic Textual Similarity Benchmark (STS-B) and its multilingual extension shows that HTP achieves a Spearman correlation of ρ = 0.68 in English, maintaining stable performance across ten languages with negligible computational cost and sub-millisecond latency per sentence pair. This demonstrates that meaningful semantic relations can emerge from deterministic geometry, offering a transparent and efficient alternative to data-driven embeddings. Keywords: Harmonic Token Projection, reversible embedding, deterministic encoding, semantic similarity, multilingual representation.

[3] A centroid based framework for text classification in itsm environments

Hossein Mohanna,Ali Ait-Bachir

Main category: cs.CL

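TL;DR: Proposes a dual-embedding centroid-based text classification framework for hierarchical classification in IT Service Management that combines strong performance, interpretability, and efficient training.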

Details

Motivation: In IT Service Management, support tickets must be classified accurately into a hierarchical tree structure, and existing methods struggle to balance performance, interpretability, and update efficiency.

Method: Uses dual-embedding (semantic and lexical) centroid representations, maintaining separate semantic and lexical centroids per category and combining their results at inference time through reciprocal rank fusion.

Result: On 8,968 tickets across 123 categories, hierarchical F1 reaches 0.731 (versus 0.727 for SVM), training is 5.9 times faster, incremental updates are up to 152 times faster, and batch inference is 8.6-8.8 times faster when embedding computation is excluded.

Conclusion: The method keeps classification performance high while substantially improving training and update efficiency, and it is interpretable, which suits production ITSM systems that prioritize maintainability and efficiency.

Abstract: Text classification with hierarchical taxonomies is a fundamental requirement in IT Service Management (ITSM) systems, where support tickets must be categorized into tree-structured taxonomies. We present a dual-embedding centroid-based classification framework that maintains separate semantic and lexical centroid representations per category, combining them through reciprocal rank fusion at inference time. The framework achieves performance competitive with Support Vector Machines (hierarchical F1: 0.731 vs 0.727) while providing interpretability through centroid representations. Evaluated on 8,968 ITSM tickets across 123 categories, this method achieves 5.9 times faster training and up to 152 times faster incremental updates. It also delivers an 8.6-8.8 times speedup across batch sizes (100-1000 samples) when embedding computation is excluded. These results make the method suitable for production ITSM environments prioritizing interpretability and operational efficiency.
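The fusion step named in this abstract can be illustrated with a minimal sketch of reciprocal rank fusion. The abstract does not give the paper's exact scoring, so this uses the commonly cited form score = sum of 1 / (k + rank); the constant k = 60, the function name, and the example category labels are all assumptions for illustration only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of category labels into a single ranking.

    Each label receives 1 / (k + rank) from every list it appears in.
    k = 60 is a common default, not a value taken from the abstract.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, label in enumerate(ranking, start=1):
            scores[label] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda label: (-scores[label], label))


# Hypothetical per-ticket rankings from the two centroid spaces.
semantic_ranking = ["network", "email", "hardware"]
lexical_ranking = ["network", "printer", "email"]
fused = reciprocal_rank_fusion([semantic_ranking, lexical_ranking])
print(fused)
```

Because the fusion uses only ranks, the semantic and lexical similarity scores never need to be put on a common scale.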

[4] PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

Yongfu Xue

Main category: cs.CL

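TL;DR: Proposes PIRA, a training paradigm that improves the data efficiency and robustness of reward models by reformulating instructions, aggregating rewards across multiple tasks, and stabilizing value-head outputs.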

Details

Motivation: Traditional discriminative reward models suffer from low data efficiency and vulnerability to overoptimization, and need improvement to align better with human preferences.

Method: Reformulates question-answer pairs into preference-based instructions, aggregates rewards from different preference tasks, and averages value-head outputs under different dropout rates to stabilize rewards.

Result: Extensive experiments confirm PIRA's effectiveness in improving data efficiency, reducing bias, and strengthening reward stability.

Conclusion: PIRA mitigates the data-utilization and overoptimization challenges of reward models and improves LLM alignment.

Abstract: Reward models are crucial for aligning Large Language Models (LLMs) with human preferences but face two representative challenges. First, traditional discriminative reward models usually concatenate questions and responses directly as input, resulting in low data efficiency. Second, reward models are vulnerable to reward overoptimization. We propose PIRA, a training paradigm addressing these issues through three strategies: (1) Reformulating question-answer pairs into preference-based instructions for clearer and more explicit task specification, (2) aggregating rewards from diverse preference tasks to reduce bias and improve robustness, and (3) averaging value-head outputs under varying dropout rates to stabilize rewards. Extensive experiments have demonstrated the effectiveness of PIRA.

Mann Khatri,Mirza Yusuf,Rajiv Ratn Shah,Ponnurangam Kumaraguru

Main category: cs.CL

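TL;DR: This paper studies how restructuring legal documents, defining rhetorical roles, and emulating court reasoning can improve LLM performance on legal tasks; experiments show that these methods raise F1 scores significantly.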

Details

Motivation: LLMs perform well in general domains but poorly in specialized domains such as law because they lack domain-specific pretraining, and legal texts are typically long and complex, which makes them hard to process effectively.

Method: In a zero-shot setting, runs experiments on three Indian legal judgment prediction datasets to analyze how document reorganization, rhetorical-role definitions, and emulation of step-by-step court reasoning affect model performance.

Result: Organizing the data or explaining key legal terms significantly improves model performance, with F1 gains over the baseline ranging from about 1.5% to 4.36%.

Conclusion: Structured presentation of information, term explanations, and emulation of human reasoning can strengthen LLM understanding and reasoning in the legal domain without full in-domain training.

Abstract: Large Language Models (LLMs), trained on extensive datasets from the web, exhibit remarkable general reasoning skills. Despite this, they often struggle in specialized areas like law, mainly because they lack domain-specific pretraining. The legal field presents unique challenges, as legal documents are generally long and intricate, making it hard for models to process the full text efficiently. Previous studies have examined in-context approaches to address the knowledge gap, boosting model performance in new domains without full domain alignment. In our paper, we analyze model behavior on legal tasks by conducting experiments in three areas: (i) reorganizing documents based on rhetorical roles to assess how structured information affects long context processing and model decisions, (ii) defining rhetorical roles to familiarize the model with legal terminology, and (iii) emulating the step-by-step reasoning of courts regarding rhetorical roles to enhance model reasoning. These experiments are conducted in a zero-shot setting across three Indian legal judgment prediction datasets. Our results reveal that organizing data or explaining key legal terms significantly boosts model performance, with a minimum increase of ~1.5% and a maximum improvement of 4.36% in F1 score compared to the baseline.

[6] MindSET: Advancing Mental Health Benchmarking through Large-Scale Social Media Data

Saad Mankarious,Ayah Zirikly,Daniel Wiechmann,Elma Kerz,Edward Kempa,Yu Qiao

Main category: cs.CL

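TL;DR: This paper presents MindSET, a new large-scale dataset for mental health analysis built from 13 million annotated Reddit posts with self-reported diagnoses across seven mental health conditions, addressing the outdated data, low quality, and limited diversity of existing datasets. Rigorous cleaning and linguistic analysis validate the dataset's quality, and models trained on it significantly outperform earlier benchmarks on diagnosis detection.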

Details

Motivation: Existing social media datasets for mental health research suffer from outdated data, insufficient cleaning, and inadequate handling of multilingual and harmful content, which limits research progress; a high-quality, large-scale, diverse benchmark dataset is needed.

Method: Collects Reddit posts based on self-reported diagnoses to build MindSET; applies rigorous preprocessing, including language filtering, NSFW removal, and deduplication; performs linguistic analysis with LIWC; and runs binary diagnosis-detection experiments with fine-tuned language models and Bag-of-Words models.

Result: MindSET contains over 13 million annotated posts, more than twice the size of previous datasets; models improve markedly on diagnosis detection, with F1 for autism detection rising by up to 18 points; the linguistic analysis reveals differences in language features across groups with different mental health conditions.

Conclusion: MindSET is a high-quality, large-scale dataset for mental health research that supports analysis of mental states on social media, early risk identification, and deeper study of emerging psychological trends.

Abstract: Social media data has become a vital resource for studying mental health, offering real-time insights into thoughts, emotions, and behaviors that traditional methods often miss. Progress in this area has been facilitated by benchmark datasets for mental health analysis; however, most existing benchmarks have become outdated due to limited data availability, inadequate cleaning, and the inherently diverse nature of social media content (e.g., multilingual and harmful material). We present a new benchmark dataset, MindSET, curated from Reddit using self-reported diagnoses to address these limitations. The dataset contains over 13M annotated posts across seven mental health conditions, more than twice the size of previous benchmarks. To ensure data quality, we applied rigorous preprocessing steps, including language filtering and removal of Not Safe for Work (NSFW) and duplicate content. We further performed a linguistic analysis using LIWC to examine psychological term frequencies across the eight groups represented in the dataset. To demonstrate the dataset utility, we conducted binary classification experiments for diagnosis detection using both fine-tuned language models and Bag-of-Words (BoW) features. Models trained on MindSET consistently outperformed those trained on previous benchmarks, achieving up to an 18-point improvement in F1 for Autism detection. Overall, MindSET provides a robust foundation for researchers exploring the intersection of social media and mental health, supporting both early risk detection and deeper analysis of emerging psychological trends.

[7] Semantics Meet Signals: Dual Codebook Representation Learning for Generative Recommendation

Zheng Hui,Xiaokai Wei,Reza Shirkavand,Chen Wang,Weizhi Zhang,Alejandro Peláez,Michelle Gong

Main category: cs.CL

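TL;DR: Proposes FlexCode, a popularity-aware generative recommendation framework that dynamically allocates the token budget between a collaborative filtering codebook and a semantic codebook, improving recommendation accuracy and long-tail robustness.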

Details

Motivation: Existing generative recommendation models encode items with a single uniform codebook, ignoring the difference between popular and long-tail items in collaborative signals and semantic dependence, which leads to inefficient representations and limited generalization.

Method: Designs the FlexCode framework with a dual-codebook structure (a collaborative filtering codebook and a semantic codebook), uses a lightweight MoE mechanism to allocate a fixed token budget dynamically, and introduces alignment and smoothness objectives to keep representations consistent across popularity levels.

Result: Experiments on public and industrial-scale datasets show that FlexCode outperforms strong baselines, with gains in both recommendation accuracy and long-tail performance.

Conclusion: FlexCode provides a new mechanism for token representation in generative recommendation that balances memorization and generalization and strengthens the model's overall expressiveness and robustness.

Abstract: Generative recommendation has recently emerged as a powerful paradigm that unifies retrieval and generation, representing items as discrete semantic tokens and enabling flexible sequence modeling with autoregressive models. Despite its success, existing approaches rely on a single, uniform codebook to encode all items, overlooking the inherent imbalance between popular items rich in collaborative signals and long-tail items that depend on semantic understanding. We argue that this uniform treatment limits representational efficiency and hinders generalization. To address this, we introduce FlexCode, a popularity-aware framework that adaptively allocates a fixed token budget between a collaborative filtering (CF) codebook and a semantic codebook. A lightweight MoE dynamically balances CF-specific precision and semantic generalization, while an alignment and smoothness objective maintains coherence across the popularity spectrum. We perform experiments on both public and industrial-scale datasets, showing that FlexCode consistently outperforms strong baselines. FlexCode provides a new mechanism for token representation in generative recommenders, achieving stronger accuracy and tail robustness, and offering a new perspective on balancing memorization and generalization in token-based recommendation models.

[8] Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic

Saleh Almohaimeed,May Alsofyani,Saad Almohaimeed,Mansour Al Ghanim,Liqiang Wang

Main category: cs.CL

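TL;DR: This paper introduces Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset, runs systematic experiments with large language models and a range of prompt engineering techniques, and proposes the GAT corrector, which improves Arabic text-to-SQL parsing performance.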

Details

Motivation: Arabic lacks research and datasets for cross-domain, context-dependent text-to-SQL, which limits the language's progress in natural language interfaces to databases; a dedicated dataset and effective solutions are needed.

Method: Builds the Ar-SParC dataset of 3,450 question sequences (10,225 questions in total); runs 40 experiments with two large models, GPT-3.5-turbo and GPT-4.5-turbo, combining four question representation methods and six in-context learning techniques; proposes the GAT corrector to improve SQL generation accuracy and analyzes its effectiveness with an ablation study.

Result: The GAT corrector yields an average gain of 1.9% in execution accuracy (EX) and 1.9% in interaction accuracy (IX) under zero-shot settings, and 1.72% EX and 0.92% IX under in-context learning settings; the experiments confirm its advantage on Arabic text-to-SQL.

Conclusion: Ar-SParC fills the gap for Arabic in cross-domain, context-dependent text-to-SQL, and the GAT corrector's effectiveness suggests that correction mechanisms designed around a language's characteristics can improve generation quality, offering a direction for text-to-SQL research in languages other than English.

Abstract: In recent years, the task of cross-domain, context-dependent text-to-SQL has received significant attention. It enables users with no prior knowledge of SQL to have a conversation with databases using natural language. However, most of the available datasets and research have been conducted in English, along with some work in Chinese. To this date, no effort has been made to address this task in the Arabic language. In this paper, we introduce Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset. The dataset consists of 3,450 sequences of interrelated questions, each sequence containing an average of approximately three questions, which results in a total of 10,225 questions along with their corresponding SQL queries. We conducted 40 experiments on the Ar-SParC dataset using two large language models, GPT-3.5-turbo and GPT-4.5-turbo, applying 10 different prompt engineering techniques, including four question representation methods and six in-context learning techniques. Furthermore, we developed a novel approach named GAT corrector, which enhanced the performance across all 40 experiments, yielding an average improvement of 1.9% in execution accuracy (EX) and 1.9% in interaction accuracy (IX) under zero-shot settings, and an average increase of 1.72% EX and 0.92% IX under in-context learning settings. Finally, we conducted an ablation study with two more experiments to explain why the GAT corrector outperformed the previous GAT verifier technique, particularly for the Arabic language.

[9] Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes

Matthew W. Kenaston,Umair Ayub,Mihir Parmar,Muhammad Umair Anjum,Syed Arsalan Ahmed Naqvi,Priya Kumar,Samarth Rawal,Aadel A. Chaudhuri,Yousef Zakharia,Elizabeth I. Heath,Tanios S. Bekaii-Saab,Cui Tao,Eliezer M. Van Allen,Ben Zhou,YooJung Choi,Chitta Baral,Irbaz Bin Riaz

Main category: cs.CL

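TL;DR: This study develops a hierarchical taxonomy for identifying reasoning errors made by GPT-4 on real oncology notes and finds that 23% of interpretations contain reasoning errors, chiefly confirmation bias and anchoring bias, which can lead to unsafe clinical recommendations.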

Details

Motivation: Large language models perform well on clinical benchmarks but may reach correct conclusions through faulty reasoning; this reasoning flaw is a safety risk for oncology decision support, and existing accuracy-based evaluation cannot capture it.

Method: In a retrospective two-cohort study, builds a three-tier taxonomy of reasoning errors from breast and pancreatic cancer cases in the CORAL dataset, validates its clinical relevance on prostate cancer consult notes, and analyzes reasoning traces across extraction, analysis, and recommendation tasks.

Result: Reasoning errors occur in 23% of interpretations, with confirmation bias and anchoring bias most common; these errors are associated with guideline-discordant and potentially harmful recommendations, especially in advanced disease management; state-of-the-art automated evaluators detect that an error is present but cannot reliably classify its subtype.

Conclusion: Large language models may give plausible-looking but clinically unsafe recommendations when their reasoning is flawed; the proposed taxonomy offers a generalizable framework for evaluating and improving reasoning fidelity and should be applied before clinical deployment.

Abstract: Despite high performance on clinical benchmarks, large language models may reach correct conclusions through faulty reasoning, a failure mode with safety implications for oncology decision support that is not captured by accuracy-based evaluation. In this two-cohort retrospective study, we developed a hierarchical taxonomy of reasoning errors from GPT-4 chain-of-thought responses to real oncology notes and tested its clinical relevance. Using breast and pancreatic cancer notes from the CORAL dataset, we annotated 600 reasoning traces to define a three-tier taxonomy mapping computational failures to cognitive bias frameworks. We validated the taxonomy on 822 responses from prostate cancer consult notes spanning localized through metastatic disease, simulating extraction, analysis, and clinical recommendation tasks. Reasoning errors occurred in 23 percent of interpretations and dominated overall errors, with confirmation bias and anchoring bias most common. Reasoning failures were associated with guideline-discordant and potentially harmful recommendations, particularly in advanced disease management. Automated evaluators using state-of-the-art language models detected error presence but could not reliably classify subtypes. These findings show that large language models may provide fluent but clinically unsafe recommendations when reasoning is flawed. The taxonomy provides a generalizable framework for evaluating and improving reasoning fidelity before clinical deployment.

[10] Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

Bharadwaj Yadavalli

Main category: cs.CL

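TL;DR: This paper proposes Dynamic Template Selection (DTS), which adaptively matches response templates to query complexity and substantially reduces LLM output token cost while preserving response quality.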

Details

Motivation: Existing uniform prompting strategies are inefficient across different query types; in particular, verbose responses to simple questions incur high output token costs.

Method: Proposes Dynamic Template Selection (DTS) and compares two routing approaches, an MLP and a fine-tuned RoBERTa, which use pre-computed embeddings or a Transformer model respectively to judge query complexity and select a suitable template.

Result: On 1,000 MMLU questions, the MLP router reaches 90.5% accuracy, slightly above RoBERTa's 89.5%; 9,000 API calls across three major LLM providers confirm that routing decisions generalize, with token consumption reduced by 32.6% to 33.9%.

Conclusion: DTS delivers cost savings without sacrificing quality and generalizes well across platforms, making it a practical option for efficient inference in real deployments.

Abstract: Contemporary large language model deployments typically employ uniform prompting strategies across diverse query types, applying verbose response patterns to both complex analytical tasks and straightforward factual questions. This one-size-fits-all methodology leads to substantial token inefficiency, a concern amplified by the significant cost differential between input and output tokens--the latter commanding 4-8x higher prices across major providers. We present Dynamic Template Selection (DTS), which adaptively matches response templates to query complexity, achieving significant cost reductions without compromising response quality. We compared two routing approaches: a simple MLP that uses pre-computed embeddings and a more complex fine-tuned RoBERTa transformer. Through comprehensive evaluation on 1,000 MMLU questions, we find that the MLP router achieves 90.5% routing accuracy on held-out test data, marginally exceeding RoBERTa's performance (89.5%) despite utilizing 125M fewer parameters. Notably, our empirical analysis reveals provider-agnostic behavior in template selection--routing decisions generalize effectively across 3 major LLM providers (OpenAI GPT-4, Google Gemini, and Anthropic Claude), as validated through 9,000 production API calls. While routing accuracy remains consistent at 90.5% across providers, observed token reductions vary from 32.6% to 33.9%, reflecting provider-specific generation characteristics. This work contributes several key elements: formal problem formulation with theoretical grounding in machine learning, four algorithms with corresponding complexity analyses, and extensive empirical validation across production systems.
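To make the cost argument concrete, the arithmetic behind an output-token reduction can be sketched as follows. The price and token volume are hypothetical placeholders, not figures from the paper; only the 32.6% reduction is taken from the abstract, and the function names are illustrative.

```python
def output_cost(num_output_tokens, price_per_million_tokens):
    """Cost of generating num_output_tokens at a given per-million-token price."""
    return num_output_tokens * price_per_million_tokens / 1_000_000


def savings_from_reduction(baseline_output_tokens, token_reduction, price_per_million_tokens):
    """Money saved when output tokens shrink by the fraction token_reduction."""
    reduced_tokens = baseline_output_tokens * (1 - token_reduction)
    baseline_cost = output_cost(baseline_output_tokens, price_per_million_tokens)
    reduced_cost = output_cost(reduced_tokens, price_per_million_tokens)
    return baseline_cost - reduced_cost


# Hypothetical workload: 1M output tokens at an assumed price of 10.0 per million.
saved = savings_from_reduction(1_000_000, 0.326, 10.0)
print(f"saved: {saved:.2f}")
```

Since the saving scales linearly with the output-token price, the 4-8x premium on output tokens mentioned in the abstract is what makes trimming responses more valuable than trimming prompts.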

[11] LLMs-Powered Accurate Extraction, Querying and Intelligent Management of Literature derived 2D Materials Data

Lijun Shang,Yadong Yu,Wenqiang Kang,Jian Zhou,Dongyue Gao,Pan Xiang,Zhe Liu,Mengyan Dai,Zhonglu Guo,Zhimei Sun

Main category: cs.CL

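TL;DR: This paper discusses the application of two-dimensional materials in energy storage and conversion and stresses the need to extract key information from a scattered research literature.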

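Details

Motivation: Information about two-dimensional materials is scattered across a large number of research papers, which makes key data such as properties and preparation methods hard to obtain efficiently.

Method: Systematically analyzes published research papers to summarize the physicochemical properties, electronic properties, and preparation methods of two-dimensional materials.

Result: Highlights the application potential of two-dimensional materials in the energy field and proposes an approach to integrating the information.

Conclusion: Effectively integrating the information in the literature helps accelerate the development and application of two-dimensional materials in energy technology.

Abstract: Two-dimensional (2D) materials have shown widespread applications in energy storage and conversion owing to their unique physicochemical and electronic properties. Most of the valuable information for the materials, such as their properties and preparation methods, is included in the published research papers. However, due to the dispersion of synthe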

[12] Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

Trung Cuong Dang,David Mohaisen

Main category: cs.CL

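TL;DR: This paper proposes a new framework, multi-prefix memorization, for detecting memorized training data in large language models; it assesses the robustness of a memory by the number of distinct prefixes that can elicit a given sequence.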

Details

Motivation: Existing definitions of memorization fall short in capturing memorization in aligned models and do not fully reflect data-leakage risk, so a more comprehensive and robust definition is needed.

Method: Defines a sequence as memorized if an external adversarial search can find enough distinct prefixes that make the model generate it, turning memorization detection into the problem of quantifying the diversity of retrieval paths.

Result: Experiments on open-source and aligned chat models show that the method distinguishes memorized from non-memorized content and is more reliable than single-path extraction.

Conclusion: The multi-prefix memorization framework offers a practical and robust way to audit data leakage in large language models.

Abstract: Large language models, trained on massive corpora, are prone to verbatim memorization of training data, creating significant privacy and copyright risks. While previous works have proposed various definitions for memorization, many exhibit shortcomings in comprehensively capturing this phenomenon, especially in aligned models. To address this, we introduce a novel framework: multi-prefix memorization. Our core insight is that memorized sequences are deeply encoded and thus retrievable via a significantly larger number of distinct prefixes than non-memorized content. We formalize this by defining a sequence as memorized if an external adversarial search can identify a target count of distinct prefixes that elicit it. This framework shifts the focus from single-path extraction to quantifying the robustness of a memory, measured by the diversity of its retrieval paths. Through experiments on open-source and aligned chat models, we demonstrate that our multi-prefix definition reliably distinguishes memorized from non-memorized data, providing a robust and practical tool for auditing data leakage in LLMs.
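The definition in this abstract can be sketched as a simple counting test. This is a toy illustration under stated assumptions: a fixed list of candidate prefixes stands in for the paper's adversarial search, `toy_generate` is a hypothetical stand-in for a real model call, and "elicits" is approximated by checking that the generated continuation starts with the target.

```python
def is_multi_prefix_memorized(generate, candidate_prefixes, target, required_count):
    """Return True if at least required_count distinct prefixes elicit target.

    generate: callable mapping a prefix string to the model's continuation.
    candidate_prefixes: stand-in for the prefixes an adversarial search would propose.
    """
    eliciting = set()
    for prefix in candidate_prefixes:
        if prefix in eliciting:
            continue  # count each distinct prefix only once
        if generate(prefix).startswith(target):
            eliciting.add(prefix)
            if len(eliciting) >= required_count:
                return True
    return False


def toy_generate(prefix):
    """Hypothetical model: a lookup table instead of a real LLM."""
    completions = {
        "Call me": " Ishmael.",
        "The narrator said: call me": " Ishmael.",
        "Hello": " world.",
    }
    return completions.get(prefix, "")


prefixes = ["Call me", "Hello", "The narrator said: call me", "Call me"]
print(is_multi_prefix_memorized(toy_generate, prefixes, " Ishmael.", required_count=2))
```

Raising `required_count` makes the test stricter: a sequence reachable through only one lucky prefix no longer counts as memorized.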

[13] SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

Jiaojiao Han,Wujiang Xu,Mingyu Jin,Mengnan Du

Main category: cs.CL

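TL;DR: SAGE is an agent-based framework that improves the interpretability of sparse autoencoder features through an active, explanation-driven approach, outperforming existing methods in generative and predictive accuracy.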

Details

Motivation: The internal mechanisms of large language models are opaque; sparse autoencoders help decompose representations, but the features they capture remain hard to explain.

Method: Proposes the SAGE framework, which systematically generates multiple explanations for each feature, designs targeted experiments to test them, and iteratively refines the explanations based on activation feedback.

Result: Experiments on SAE features from several language models show that SAGE's explanations significantly outperform existing baselines in generative and predictive accuracy.

Conclusion: Through active reasoning and empirical feedback, SAGE improves understanding of SAE features in LLMs and strengthens model interpretability and reliability.

Abstract: Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE AGentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.

[14] Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

Asad Aali,Muhammad Ahmed Mohsin,Vasiliki Bikia,Arnav Singhvi,Richard Gaus,Suhana Bedi,Hejie Cui,Miguel Fuentes,Alyssa Unell,Yifan Mai,Jordan Cahoon,Michael Pfeffer,Roxana Daneshjou,Sanmi Koyejo,Emily Alsentzer,Percy Liang,Christopher Potts,Nigam H. Shah,Akshay S. Chaudhari

Main category: cs.CL

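TL;DR: This paper presents a reproducible framework combining DSPy and HELM that uses structured prompting, especially prompts that elicit chain-of-thought reasoning, to evaluate LLM performance more accurately. It finds that conventional benchmarking without structured prompting underestimates performance and distorts rankings, and shows that scalable performance-ceiling estimation matters for building decision-useful benchmarks.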

Details

Motivation: Existing language model benchmarks such as HELM rely on fixed prompts that generalize poorly and can yield inaccurate performance estimates; addressing this requires systematically evaluating structured prompting frameworks such as DSPy across tasks and quantifying how prompt optimization affects performance evaluation.

Method: Builds a reproducible framework integrating DSPy with HELM, applies four structured prompting methods to evaluate four frontier models on seven general and medical benchmarks, and compares against the original HELM baselines to analyze how prompt changes affect performance estimates, stability, and leaderboard rankings.

Result: (1) Without structured prompting, HELM underestimates performance by 4% on average; (2) the standard deviation of performance estimates is 2% higher; (3) rankings flip on 3 of the 7 benchmarks; (4) introducing chain-of-thought reasoning reduces model sensitivity to prompt design.

Conclusion: Structured prompting, particularly optimizable reasoning prompts, improves the accuracy and stability of benchmarking; scalable performance-ceiling estimation should be part of standard evaluation pipelines to support more reliable deployment decisions.

Abstract: As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. While frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, they often rely on fixed prompts that fail to generalize across LMs, yielding unrepresentative performance estimates. Unless we estimate each LM's ceiling (maximum achievable via changes to the prompt), we risk underestimating performance. Declarative prompting frameworks, such as DSPy, offer a scalable alternative to manual prompt engineering by crafting structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated across established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs across seven benchmarks (general/medical domain) against existing HELM baseline scores. We find that without structured prompting: (i) HELM underestimates LM performance (by 4% average), (ii) performance estimates vary more across benchmarks (+2% standard deviation), (iii) performance gaps are misrepresented (leaderboard rankings flip on 3/7 benchmarks), and (iv) introducing reasoning (chain-of-thought) reduces LM sensitivity to prompt design (smaller Δ across prompts). To our knowledge, this is the first large-scale benchmarking study to empirically characterize LM behavior across benchmarks and prompting methods, showing that scalable performance ceiling estimation enables more decision-useful benchmarks. We open-source (i) DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).

[15] Length-MAX Tokenizer for Language Models

Dong Dong,Weijie Su

Main category: cs.CL

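TL;DR: Proposes Length-MAX, a new tokenizer for language models that reduces the number of tokens in training and inference by minimizing the average number of tokens per character. The method casts length-weighted objective maximization as a graph partitioning problem, solves it with a greedy approximation algorithm, and clearly outperforms conventional BPE on several metrics.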

Details

Motivation: Conventional tokenization methods such as Byte Pair Encoding (BPE) merge mainly by frequency and ignore the effect of token length on efficiency. Improving language model efficiency in training and inference calls for a tokenization strategy that reduces token counts while maintaining good performance.

Method: Casts the goal of minimizing average tokens per character as a graph partitioning problem, proposes the Length-MAX tokenizer, and designs a greedy approximation algorithm to solve the optimization and obtain a more efficient vocabulary.

Result: On FineWeb and other domains, it uses 14%-18% fewer tokens than BPE (13.0% fewer at 64K); GPT-2 training needs 17.2%-18.5% fewer steps to converge, inference latency drops by 12.7%-13.7%, and throughput rises by 16%; downstream results improve, with LAMBADA perplexity down 11.7% and HellaSwag accuracy up 4.3%; vocabulary coverage is 99.62% with an OOV rate of 0.12%; embedding and KV-cache memory at inference fall by 18%.

Conclusion: Optimizing average token length rather than relying on frequency alone is an effective route to more efficient language modeling without sacrificing, and often improving, downstream performance. The Length-MAX tokenizer is practical to deploy and improves training and inference efficiency.

Abstract: We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14-18% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5%, 17.2%, and 18.5% fewer steps, respectively, to reach a fixed validation loss, and 13.7%, 12.7%, and 13.7% lower inference latency, together with a 16% throughput gain at 124M, while consistently improving on downstream tasks including reducing LAMBADA perplexity by 11.7% and enhancing HellaSwag accuracy by 4.3%. Moreover, the Length-MAX tokenizer achieves 99.62% vocabulary coverage and the out-of-vocabulary rate remains low at 0.12% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing -- and often improving -- downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18% at inference.
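The quantity this abstract optimizes, average tokens per character, is easy to measure for any tokenizer. The sketch below only computes the metric and the relative token reduction between two tokenizations; it does not implement the Length-MAX algorithm itself, and the two example tokenizations are made up for illustration.

```python
def tokens_per_character(token_sequences, texts):
    """Average number of tokens per character over a corpus.

    token_sequences[i] is the token list produced for texts[i].
    """
    total_tokens = sum(len(tokens) for tokens in token_sequences)
    total_chars = sum(len(text) for text in texts)
    if total_chars == 0:
        raise ValueError("corpus contains no characters")
    return total_tokens / total_chars


def relative_token_reduction(baseline_token_count, new_token_count):
    """Fraction of tokens saved relative to the baseline tokenizer."""
    return 1 - new_token_count / baseline_token_count


texts = ["the cat sat"]
# Hypothetical outputs of a short-token and a long-token vocabulary.
baseline_tokens = [["the", " c", "at", " s", "at"]]
longer_tokens = [["the", " cat", " sat"]]

baseline_tpc = tokens_per_character(baseline_tokens, texts)
longer_tpc = tokens_per_character(longer_tokens, texts)
reduction = relative_token_reduction(len(baseline_tokens[0]), len(longer_tokens[0]))
print(baseline_tpc, longer_tpc, reduction)
```

A lower tokens-per-character value means shorter sequences for the same text, which is where the reported savings in training steps, latency, and KV-cache memory would come from.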

[16] Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Tianxin Wei,Noveen Sachdeva,Benjamin Coleman,Zhankui He,Yuanchen Bei,Xuying Ning,Mengting Ai,Yunzhe Li,Jingrui He,Ed H. Chi,Chi Wang,Shuo Chen,Fernando Pereira,Wang-Cheng Kang,Derek Zhiyuan Cheng

Main category: cs.CL

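TL;DR: This paper presents Evo-Memory, a benchmark and framework for evaluating the self-evolving memory of LLM agents over continuous task streams. It emphasizes the dynamic accumulation and updating of memory and introduces ExpRAG and ReMem to improve experience reuse and continual learning.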

Details

Motivation: Existing memory evaluations focus mainly on static conversational settings and overlook the ability to accumulate and reuse memory across dynamic task streams; LLM agents in real applications must keep learning from interactions, so a mechanism for testing memory evolution at run time is needed.

Method: Builds the Evo-Memory benchmark, which organizes datasets into sequential task streams and requires the LLM to retrieve, integrate, and update memory after each interaction; implements more than ten representative memory modules and evaluates them on 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets; proposes the ExpRAG baseline and the ReMem (action-think-memory refine) pipeline.

Result: Evaluating multiple memory modules in a unified framework confirms ReMem's effectiveness in promoting memory evolution and experience reuse, improving LLM agent performance on long-horizon tasks.

Conclusion: Evo-Memory fills the gap in evaluating dynamic memory evolution in LLM agents and advances the development of agents capable of continual learning.

Abstract: Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.

[17] Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation

Ali Jahan,Masood Ghayoomi,Annette Hautli-Janisz

Main category: cs.CL

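TL;DR: This paper proposes a cross-lingual approach to argument mining for low-resource languages and tests three training scenarios on English and Persian; the results show that a lightweight cross-lingual model trained with a small amount of bilingual data outperforms LLM-based augmentation.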

Details

Motivation: To address the data scarcity that low-resource languages face in argument mining by exploring effective cross-lingual transfer methods that improve model performance.

Method: Constructs three training scenarios: zero-shot transfer (training on English data only), English training augmented with synthetic examples generated by large language models, and cross-lingual training that combines English and Persian data. Evaluation uses the English Microtext corpus and its parallel Persian translation.

Result: The zero-shot transfer model reaches F1 scores of 50.2% and 50.7% on the English and Persian test sets; the LLM-augmented model improves to 59.2% and 69.3%; the cross-lingual model performs best, reaching an F1 of 74.8% on the Persian test set.

Conclusion: A lightweight cross-lingual data blend overcomes the data bottleneck of low-resource languages and outperforms the more complex LLM augmentation approach, offering a practical and efficient solution for argument mining in low-resource languages.

Abstract: Argument mining is a subfield of natural language processing to identify and extract the argument components, like premises and conclusions, within a text and to recognize the relations between them. It reveals the logical structure of texts to be used in tasks like knowledge extraction. This paper aims at utilizing a cross-lingual approach to argument mining for low-resource languages, by constructing three training scenarios. We examine the models on English, as a high-resource language, and Persian, as a low-resource language. To this end, we evaluate the models based on the English Microtext corpus (Peldszus and Stede, 2015), and its parallel Persian translation. The learning scenarios are as follows: (i) zero-shot transfer, where the model is trained solely with the English data, (ii) English-only training enhanced by synthetic examples generated by Large Language Models (LLMs), and (iii) a cross-lingual model that combines the original English data with manually translated Persian sentences. The zero-shot transfer model attains F1 scores of 50.2% on the English test set and 50.7% on the Persian test set. The LLM-based augmentation model improves the performance up to 59.2% on English and 69.3% on Persian. The cross-lingual model, trained on both languages but evaluated solely on the Persian test set, surpasses the LLM-based variant by achieving an F1 of 74.8%. Results indicate that a lightweight cross-lingual blend can considerably outperform the more resource-intensive augmentation pipelines, and it offers a practical pathway for the argument mining task to overcome data resource shortage on low-resource languages.

[18] Emergence and Localisation of Semantic Role Circuits in LLMs

Nura Aljaafari,Danilo S. Carvalho,André Freitas

Main category: cs.CL

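TL;DR: Proposes a new method for studying how large language models implement semantic roles, finding compact, causally isolated circuits that transfer partially across scales and architectures.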

Details Motivation: Large language models display semantic competence, but how their internals support abstract semantic structure remains unclear. Method: Combines role-cross minimal pairs, temporal emergence analysis, and cross-model comparison. Result: Finds highly concentrated circuits (89-94% of attribution within 28 nodes); semantic structure is refined gradually rather than through phase transitions; larger models sometimes bypass localised circuits; component overlap across scales is moderate while spectral similarity is high. Conclusion: LLMs form compact, causally isolated mechanisms for abstract semantic structure, and these mechanisms transfer partially across scales and architectures. Abstract: Despite displaying semantic competence, large language models' internal mechanisms that ground abstract semantic structure remain insufficiently characterised. We propose a method integrating role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study how LLMs implement semantic roles. Our analysis uncovers: (i) highly concentrated circuits (89-94% attribution within 28 nodes); (ii) gradual structural refinement rather than phase transitions, with larger models sometimes bypassing localised circuits; and (iii) moderate cross-scale conservation (24-59% component overlap) alongside high spectral similarity. These findings suggest that LLMs form compact, causally isolated mechanisms for abstract semantic structure, and these mechanisms exhibit partial transfer across scales and architectures.
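
A concentration figure such as "89-94% attribution within 28 nodes" can be read as the share of total attribution mass held by the k highest-scoring nodes. The sketch below illustrates that reading only; `top_k_share` is a hypothetical helper and the scores are made up, so it should not be taken as the authors' attribution method.

```python
def top_k_share(scores, k):
    """Fraction of total absolute attribution held by the k largest nodes."""
    magnitudes = sorted((abs(s) for s in scores), reverse=True)
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(magnitudes[:k]) / total

# Made-up scores: three dominant nodes followed by a long, weak tail.
scores = [5.0, 3.0, 2.0] + [0.1] * 10
share = top_k_share(scores, k=3)  # roughly 10 / 11
```

A share close to 1 for a small k indicates a compact circuit; a share that grows slowly with k indicates diffuse computation.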

[19] Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs

Reham Omar,Abdelghny Orogat,Ibrahim Abdelaziz,Omij Mangukiya,Panos Kalnis,Essam Mansour

Main category: cs.CL

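TL;DR: Chatty-KG is a modular multi-agent system that generates SPARQL queries through task-specialized LLM agents, enabling efficient and accurate multi-turn conversational question answering over knowledge graphs.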

Details Motivation: Existing approaches to multi-turn conversational KGQA have limitations: retrieval-augmented generation (RAG) systems often lose graph structure and struggle to maintain context, while traditional KGQA systems do not support multi-turn interaction, incur high latency, and lack context tracking. Method: Proposes Chatty-KG, in which several specialized LLM agents collaborate on contextual interpretation, dialogue state tracking, entity and relation linking, and query planning, combining RAG-style retrieval with structured execution to generate SPARQL queries. Result: Experiments on several large knowledge graphs show that Chatty-KG significantly outperforms state-of-the-art methods in both single-turn and multi-turn settings, with higher F1 and P@1 scores and low latency, and that it is compatible with commercial and open-source LLMs. Conclusion: Chatty-KG combines conversational flexibility with the structural strengths of knowledge graphs, providing a scalable and reliable multi-turn QA solution that adapts to evolving KGs without fine-tuning. Abstract: Conversational Question Answering over Knowledge Graphs (KGs) combines the factual grounding of KG-based QA with the interactive nature of dialogue systems. KGs are widely used in enterprise and domain applications to provide structured, evolving, and reliable knowledge. Large language models (LLMs) enable natural and context-aware conversations, but lack direct access to private and dynamic KGs. Retrieval-augmented generation (RAG) systems can retrieve graph content but often serialize structure, struggle with multi-turn context, and require heavy indexing. Traditional KGQA systems preserve structure but typically support only single-turn QA, incur high latency, and struggle with coreference and context tracking. To address these limitations, we propose Chatty-KG, a modular multi-agent system for conversational QA over KGs. Chatty-KG combines RAG-style retrieval with structured execution by generating SPARQL queries through task-specialized LLM agents. These agents collaborate for contextual interpretation, dialogue tracking, entity and relation linking, and efficient query planning, enabling accurate and low-latency translation of natural questions into executable queries. Experiments on large and diverse KGs show that Chatty-KG significantly outperforms state-of-the-art baselines in both single-turn and multi-turn settings, achieving higher F1 and P@1 scores. Its modular design preserves dialogue coherence and supports evolving KGs without fine-tuning or pre-processing. Evaluations with commercial (e.g., GPT-4o, Gemini-2.0) and open-weight (e.g., Phi-4, Gemma 3) LLMs confirm broad compatibility and stable performance. Overall, Chatty-KG unifies conversational flexibility with structured KG grounding, offering a scalable and extensible approach for reliable multi-turn KGQA.

[20] TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models

Ioana Buhnila,Aman Sinha,Mathieu Constant

Main category: cs.CL

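TL;DR: Using the TrackList analysis pipeline and the newly built RefoMed-EN dataset, this study evaluates LLMs on different types of medical queries. Models perform best on definition-type questions and worst on exemplification-type questions, and they are more inclined to paraphrase frequent, popular knowledge than infrequent, specialized content.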

Details Motivation: LLMs answer definition-type questions well but perform worse on other answer types (such as examples and paraphrases); the study investigates how pre-training data affects model answers to linguistically diverse queries. Method: Introduces the TrackList analysis pipeline and RefoMed-EN, an English medical dataset of 6170 annotated terms, and uses syntactic and semantic similarity metrics, statistical correlations, and embeddings to analyze how head and tail concept frequency affects output quality. Result: Performance is highest on definition-type questions and lowest on exemplification-type questions; models paraphrase more on frequent, common knowledge and less on tail and technical knowledge, especially in expert texts. Conclusion: LLM performance drops markedly on non-definition queries, particularly for low-frequency and specialized medical knowledge, reflecting a strong influence of the pre-training data distribution on model outputs. Abstract: Large Language Models (LLMs) have proven efficient in giving definition-type answers to user input queries. While giving various types of answers, such as examples and paraphrases, is an easy task for humans, LLMs struggle to provide correct answers to queries other than definition-type ones. In this study, we evaluated this drop in performance using TrackList, a fine-grained linguistic and statistical analysis pipeline to investigate the impact of the pre-training data on LLM answers to diverse linguistic queries. We also introduce RefoMed-EN, an English dataset consisting of 6170 human-annotated medical terms alongside their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We studied whether the high frequency of a concept (head) or low frequency (tail) impacts the language model's performance. We evaluated the quality of the LLM's output using syntactic and semantic similarity metrics, statistical correlations and embeddings. Results showed that the LLM's task performance for definition-type questions is the highest, while for the exemplification type it is the lowest. Additionally, we showed that for definition-type questions, large language models are prone to paraphrase more on popular and frequent knowledge and less on tail and technical knowledge, especially in the expert texts.

[21] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels

Anantha Padmanaban Krishna Kumar

Main category: cs.CL

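TL;DR: This paper asks whether in-context learning (ICL) can override the label semantics of a pre-trained model or only adjusts within its existing semantics. The author treats LLMs as prompt-induced classifiers, contrasts their behavior under demonstrations with "natural" and "inverted" labels, and proposes metrics including a semantic override rate. The experiments show that ICL does not truly remap label meanings but relies on stable semantic directions formed during pre-training, supporting a "semantic anchor" view.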

Details Motivation: To determine whether ICL can flexibly change label semantics or merely adjusts pre-trained semantics, in order to understand its mechanism and limits. Method: Treats LLMs as prompt-induced classifiers and compares their behavior under natural and inverted label demonstrations; proposes three alignment metrics (truth, prior, and prompt alignment) and a semantic override rate, and runs experiments on eight classification tasks and eight open-source LLMs. Result: With natural demonstrations, ICL improves accuracy while maintaining strong prior alignment; most correct predictions coincide with zero-shot behavior. With inverted demonstrations, models cannot learn anti-semantic classifiers: raising prompt alignment sacrifices accuracy, and the semantic override rate is zero. Conclusion: ICL relies mainly on stable semantic directions acquired in pre-training and cannot effectively override or invert label semantics; it mostly adjusts how inputs project onto those directions. This indicates fundamental limits of ICL at the current scales, and changing semantics requires interventions beyond ICL. Abstract: Can in-context learning (ICL) override pre-trained label semantics, or does it merely refine an existing semantic backbone? We address this question by treating LLMs as prompt-induced classifiers and contrasting their behavior under natural demonstrations (with correct labels) and inverted demonstrations (systematically flipping label meanings). We decompose ICL behavior into three alignment metrics (truth, prior, and prompt alignment) and introduce a semantic override rate, defined as correctness under flipped semantics. Across eight classification tasks and eight open-source LLMs (1--12B parameters), we find consistent evidence for a semantic anchor view. With natural demonstrations, ICL improves accuracy while maintaining strong prior alignment; most correct predictions coincide with zero-shot behavior, even when the prior is weak. With inverted demonstrations, models cannot learn coherent anti-semantic classifiers: prompt alignment increases only by sacrificing accuracy, and semantic override rates remain exactly zero in our few-shot 1--12B setting. Rather than flexibly remapping label meanings, ICL primarily adjusts how inputs project onto stable semantic directions learned during pre-training, clarifying fundamental limits of few-shot prompting and suggesting that overriding label semantics at these scales requires interventions beyond ICL. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/semantic-anchors-icl.
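
The abstract names the three alignment metrics without giving formulas. One plausible reading, shown here purely as an assumption, is that each is an agreement rate between the ICL predictions and a different reference: the gold labels (truth), the zero-shot predictions (prior), and the labels implied by the demonstrations (prompt). The labels and predictions below are invented, and `agreement` and `flip` are hypothetical helpers rather than functions from the linked repository.

```python
def agreement(preds, reference):
    """Fraction of positions where the prediction equals the reference label."""
    if len(preds) != len(reference) or not preds:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == r for p, r in zip(preds, reference)) / len(preds)

def flip(label):
    """Inverted-demonstration mapping for a binary sentiment task (assumed)."""
    return {"positive": "negative", "negative": "positive"}[label]

gold = ["positive", "negative", "positive", "negative"]
zero_shot = ["positive", "negative", "negative", "negative"]     # prior behavior
icl_inverted = ["positive", "negative", "negative", "positive"]  # with flipped demos
prompt_target = [flip(g) for g in gold]  # what the inverted demos ask for

truth_alignment = agreement(icl_inverted, gold)            # 0.5
prior_alignment = agreement(icl_inverted, zero_shot)       # 0.75
prompt_alignment = agreement(icl_inverted, prompt_target)  # 0.5
```

In a binary task with fully inverted labels, truth and prompt alignment sum to one under this reading, which is consistent with the reported trade-off between prompt alignment and accuracy.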

[22] Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection

Michael Iskandardinata,William Christian,Derwin Suhartono

Main category: cs.CL

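TL;DR: This paper proposes a retrieval-aware sarcasm detection method that combines external retrieval with the model's own knowledge to strengthen contextual understanding, significantly improving LLM performance on several datasets.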

Details Motivation: Existing pre-trained and large language models still struggle with sarcastic text marked by linguistic diversity and cultural variation, and in particular do not reliably handle words that require extra background knowledge. Method: Building on Pragmatic Metacognitive Prompting (PMP), adds two ways of supplying context: non-parametric knowledge obtained through web retrieval, and a self-knowledge awareness strategy that elicits the model's internal knowledge. Result: On the Twitter Indonesia Sarcastic dataset, non-parametric retrieval raises macro-F1 by 9.87%; self-knowledge retrieval yields gains of 3.29% on SemEval-2018 and 4.08% on MUStARD. Conclusion: Contextual information is crucial for LLM performance in sarcasm detection, especially with culture-specific slang and unknown terms; future work will optimize retrieval quality and relevance. Abstract: Detecting sarcasm remains a challenging task in Natural Language Processing (NLP) despite recent advances in neural network approaches. Currently, Pre-trained Language Models (PLMs) and Large Language Models (LLMs) are the preferred approach for sarcasm detection. However, the complexity of sarcastic text, combined with linguistic diversity and cultural variation across communities, has made the task more difficult even for PLMs and LLMs. Beyond that, those models also exhibit unreliable detection of words or tokens that require extra grounding for analysis. Building on a state-of-the-art prompting method in LLMs for sarcasm detection called Pragmatic Metacognitive Prompting (PMP), we introduce a retrieval-aware approach that incorporates retrieved contextual information for each target text. Our pipeline explores two complementary ways to provide context: adding non-parametric knowledge using web-based retrieval when the model lacks necessary background, and eliciting the model's own internal knowledge for a self-knowledge awareness strategy. We evaluated our approach on three datasets: Twitter Indonesia Sarcastic, SemEval-2018 Task 3, and MUStARD. Non-parametric retrieval resulted in a significant 9.87% macro-F1 improvement on Twitter Indonesia Sarcastic compared to the original PMP method. Self-knowledge retrieval improves macro-F1 by 3.29% on SemEval and by 4.08% on MUStARD. These findings highlight the importance of context in enhancing LLM performance on the sarcasm detection task, particularly when the text involves culturally specific slang, references, or terms unknown to the LLMs. Future work will focus on optimizing the retrieval of relevant contextual information and examining how retrieval quality affects performance. The experiment code is available at: https://github.com/wllchrst/sarcasm-detection_pmp_knowledge-base.

[23] Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

Thura Aung,Eaint Kay Khaing Kyaw,Ye Kyaw Thu,Thazin Myint Oo,Thepchai Supnithi

Main category: cs.CL

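TL;DR: This study explores Kolmogorov-Arnold Networks (KANs) as classification heads for tasks in low-resource languages such as Burmese. Compared with conventional MLP heads, they deliver competitive or better performance across a range of embeddings while remaining computationally efficient.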

Details Motivation: For low-resource languages, it is common to fine-tune only the classification layer while freezing the encoder weights, but the fixed non-linearity of conventional MLPs limits expressiveness and efficiency, motivating a more expressive and efficient alternative. Method: Evaluates three KAN variants (FourierKAN, EfficientKAN, and FasterKAN) on TF-IDF, fastText, and multilingual transformer (mBERT, Distil-mBERT) embeddings, and compares them with MLPs. Result: EfficientKAN with fastText achieves the highest F1 score (0.928); FasterKAN offers the best balance between speed and accuracy; on transformer embeddings, EfficientKAN with mBERT reaches 0.917 F1, matching or slightly outperforming MLPs. Conclusion: KAN-based classification heads are a more expressive and efficient alternative to MLPs for text classification in low-resource languages. Abstract: In low-resource languages like Burmese, classification tasks often fine-tune only the final classification layer, keeping pre-trained encoder weights frozen. While Multi-Layer Perceptrons (MLPs) are commonly used, their fixed non-linearity can limit expressiveness and increase computational cost. This work explores Kolmogorov-Arnold Networks (KANs) as alternative classification heads, evaluating Fourier-based FourierKAN, Spline-based EfficientKAN, and Grid-based FasterKAN across diverse embeddings including TF-IDF, fastText, and multilingual transformers (mBERT, Distil-mBERT). Experimental results show that KAN-based heads are competitive with or superior to MLPs. EfficientKAN with fastText achieved the highest F1-score (0.928), while FasterKAN offered the best trade-off between speed and accuracy. On transformer embeddings, EfficientKAN matched or slightly outperformed MLPs with mBERT (0.917 F1). These findings highlight KANs as expressive, efficient alternatives to MLPs for low-resource language classification.

[24] Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

Bryan E. Tuck,Rakesh M. Verma

Main category: cs.CL

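TL;DR: This study evaluates 28 LLM configurations on character-level constraint satisfaction across 58 word puzzles, finding that architectural differences affect performance far more than parameter scale and that high-capacity models are more sensitive to the thinking budget. Models do poorly on words that humans solve easily but that have atypical spelling, exposing an over-reliance on statistical regularities at the expense of valid spellings and pointing to a need for dedicated architectural innovation.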

Details Motivation: To systematically evaluate how LLMs of different architectures perform under hard character-level spelling constraints, and to probe the ability and limits of current models to satisfy orthographic constraints in controlled text generation. Method: Tests 28 configurations spanning three model families (Qwen3, Claude Haiku-4.5, and GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction, and calibrates the results against difficulty ratings from human solvers (10,000 per puzzle). Result: The performance gap attributable to architecture (F1: 0.761 vs. 0.343) is far larger than the gain from parameter scaling within a family (an eightfold increase in parameters yields only an 83% gain); high-capacity models benefit substantially from larger thinking budgets, while mid-sized models saturate or degrade; model and human difficulty are modestly correlated (r=0.24-0.38), but on common words with unusual spelling (such as "data" and "poop") models miss 89-96% of the time even though humans succeed 86-95% of the time. Conclusion: Constraint satisfaction in LLMs is limited not only by parameter scale and compute budget but even more by architectural design; over-reliance on distributional plausibility hurts the handling of valid but unconventional spellings, indicating that dedicated architectural features or training objectives are needed to improve controlled generation. Abstract: Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-architecture evaluation remains limited. We evaluate 28 configurations spanning three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Architectural differences produce substantially larger performance gaps (2.0-2.2x, F1=0.761 vs. 0.343) than parameter scaling within families (83% gain from eightfold scaling), suggesting that constraint satisfaction may require specialized architectural features or training objectives beyond standard language model scaling. Thinking budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade. These patterns are inconsistent with uniform compute benefits. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but consistent calibration (r=0.24-0.38) across all families, yet identify systematic failures on common words with unusual orthography ("data", "poop", "loll": 86-95% human success, 89-96% model miss rate). These failures reveal over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns, suggesting architectural innovations may be required beyond simply scaling parameters or computational budgets.
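
The abstract does not specify the puzzle rules. As an illustration only, the sketch below assumes a format in which each answer must use letters from an allowed set, contain one required letter, and meet a minimum length, and it scores a model's word list against the valid set with F1. The rule set, the letter pool, and the helper names are all assumptions.

```python
def satisfies(word, allowed, required, min_len=4):
    """Check the assumed character-level constraints for one candidate word."""
    word = word.lower()
    return (
        len(word) >= min_len
        and required in word
        and set(word) <= set(allowed)  # letters may repeat, but must be allowed
    )

def f1(predicted, gold):
    """Set-based F1 between a model's answers and the valid answers."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

allowed = "adtpol"  # hypothetical letter pool
data_ok = satisfies("data", allowed, required="a")  # True
date_ok = satisfies("date", allowed, required="a")  # False: 'e' is not allowed
score = f1({"data", "poop"}, {"data", "loll"})      # 0.5
```

Words such as "data" or "loll" pass this kind of check despite their repeated letters, which is the sort of orthographically atypical but constraint-valid pattern the abstract says models miss.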

[25] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

Ye Bhone Lin,Thura Aung,Ye Kyaw Thu,Thazin Myint Oo

Main category: cs.CL

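TL;DR: This is the first study of ASR error correction for low-resource Burmese. It proposes a sequence-to-sequence Transformer model that combines IPA and alignment information and significantly improves word- and character-level accuracy.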

Details Motivation: Burmese is a low-resource language, existing ASR systems have high error rates, and dedicated error correction research is lacking, so effective correction methods are needed to improve recognition performance. Method: Builds an ASR error correction (AEC) system with a sequence-to-sequence Transformer that integrates features such as IPA transcriptions and alignment information, and evaluates it on five ASR backbones. Result: The proposed AEC model lowers the average WER of the ASR systems from 51.56 to 39.82 (without augmentation) and raises chrF++ from 0.5864 to 0.627, showing consistent gains. Conclusion: ASR error correction is robust in low-resource settings, and sound feature design is essential for improving ASR output. Abstract: This paper investigates sequence-to-sequence Transformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IPA and alignment information. To our knowledge, this is the first study addressing ASR error correction specifically for Burmese. We evaluate five ASR backbones and show that our ASR Error Correction (AEC) approaches consistently improve word- and character-level accuracy over baseline outputs. The proposed AEC model, combining IPA and alignment features, reduced the average WER of ASR models from 51.56 to 39.82 before augmentation (and 51.56 to 43.59 after augmentation) and improved chrF++ scores from 0.5864 to 0.627, demonstrating consistent gains over the baseline ASR outputs without AEC. Our results highlight the robustness of AEC and the importance of feature design for improving ASR outputs in low-resource settings.
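
For readers unfamiliar with the WER figures quoted above, the sketch below computes the standard word error rate: word-level edit distance divided by the reference length, expressed as a percentage. It assumes the text is already segmented into whitespace-separated tokens, which is itself a non-trivial step for Burmese, and it is not the authors' evaluation script.

```python
def wer(reference, hypothesis):
    """Word error rate in percent: edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# One substitution (b -> x) and one deletion (d) over four reference words.
score = wer("a b c d", "a x c")  # 50.0
```

An error-correction model is then judged by how much it lowers this number when its output replaces the raw ASR hypothesis.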

[26] MortgageLLM: Domain-Adaptive Pretraining with Residual Instruction Transfer, Alignment Tuning, and Task-Specific Routing

Manish Jain,Satheesh Kumar Ponnambalam,Salman Faroz,Chandrakanth Lns,Vinay Sharma

Main category: cs.CL

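TL;DR: MortgageLLM is a dual-expert large language model for the mortgage finance domain. It uses the instruction residual technique to specialize in the domain while preserving instruction-following ability, and it significantly outperforms the baseline model.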

Details Motivation: In specialized domains such as mortgage finance, general-purpose LLMs lack sufficient domain knowledge, while direct fine-tuning harms their instruction-following ability, so a method is needed that adds domain knowledge while preserving conversational ability. Method: Proposes a dual-track specialization framework: two expert models are derived from the same base model (LLaMA-3.1-8B), one for conversational Q&A and one for structured tasks (such as classification and summarization), with the instruction residual technique used to restore instruction following after domain adaptation; an intelligent task routing mechanism based on few-shot classification is also designed, with an expert model itself deciding the task assignment. Result: On domain benchmarks, MortgageLLM (MLM v2) significantly outperforms the base LLaMA-3.1-8B-Instruct, scoring 4.58 (vs. 3.99) on summarization, 4.09 (vs. 4.0) on Q&A, and 2.6 (vs. 1.2) on classification; it also leads on BERTScore semantic similarity across all three tasks. Conclusion: A dual-expert architecture combined with the instruction residual technique resolves the performance trade-off between structured tasks and conversational ability in domain-specific models, offering an efficient and practical path for developing vertical-domain LLMs. Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities across general domains, yet their application to specialized sectors such as mortgage finance requires domain-specific knowledge augmentation while preserving instruction-following fidelity. We present MortgageLLM, a novel domain-specific large language model that addresses this dual challenge. It is developed using a dual-track specialization framework from a single base model (LLaMA-3.1-8B). We opted for this dual-expert approach as a single multi-task model suffers from performance trade-offs, where optimizing for structured tasks (via SFT) degrades conversational fidelity (via DPO). Our dual-track method solves this by creating two specialists, allowing each to be optimally trained for its distinct capability. Our approach applies the instruction residual technique to restore instruction-following capabilities post-domain adaptation without supervised fine-tuning. We contribute: (1) application of this residual technique to the highly specialized mortgage finance domain; (2) a dual-expert architecture combining a conversational Q&A model and a structured task model for classification and summarization; and (3) an intelligent task routing mechanism using few-shot classification performed by one of the expert models itself. We validate our approach on domain-specific benchmarks, where our final model (MLM v2) significantly outperforms the base LLaMA-3.1-8B-Instruct, achieving an LLM-as-a-Judge summarization score of 4.58 (vs. 3.99), a Q&A score of 4.09 (vs. 4.0), and a classification score of 2.6 (vs. 1.2). On semantic similarity, our model achieved a BERTScore of 0.77 for summarization (vs. 0.74), 0.68 for Q&A (vs. 0.58), and 0.75 for classification (vs. 0.73), substantially outperforming baseline approaches.
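
The abstract does not spell out how the instruction residual is computed. A common formulation, assumed here, takes the parameter-wise difference between an instruction-tuned checkpoint and its base model and adds that difference to the domain-adapted base weights. The sketch uses tiny dictionaries of floats in place of real checkpoints; the parameter name and values are invented, and both helpers are hypothetical.

```python
def instruction_residual(instruct, base):
    """Parameter-wise difference: what instruction tuning added to the base."""
    return {
        name: [i - b for i, b in zip(instruct[name], base[name])]
        for name in base
    }

def apply_residual(domain, residual):
    """Add the residual to domain-adapted weights with the same parameter names."""
    return {
        name: [d + r for d, r in zip(domain[name], residual[name])]
        for name in domain
    }

# Toy stand-ins for three checkpoints sharing one architecture.
base = {"w": [1.0, 2.0]}
instruct = {"w": [1.5, 1.0]}  # base after instruction tuning
domain = {"w": [3.0, 2.0]}    # base after continued pretraining on domain text

restored = apply_residual(domain, instruction_residual(instruct, base))
# restored["w"] == [3.5, 1.0]
```

The appeal of this design is that it needs no supervised fine-tuning data for the domain model, only the two public checkpoints and the domain-adapted weights.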

[27] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

Yuhang Wang,Yanxu Zhu,Dongyuan Lu,Jitao Sang

Main category: cs.CL

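TL;DR: This paper proposes the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which generates safety guidelines and uses supervised fine-tuning and direct preference optimization to strengthen reasoning models' defenses against adversarial jailbreak prompts while reducing false refusals of benign requests.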

Details Motivation: Adversarial jailbreak prompts are covert and deceptive and often bypass existing safety mechanisms, leading models to generate harmful content; an approach that adaptively reinforces safety alignment is needed. Method: SGASA has two stages: a data pre-synthesis stage that generates safety guidelines and augmented prompts, and an alignment fine-tuning stage that uses supervised fine-tuning (SFT) and direct preference optimization (DPO) to embed the guidelines into the model. Result: Experiments on multiple datasets show that SGASA significantly improves model safety, strengthens robustness against harmful adversarial prompts, and lowers the refusal rate on normal requests. Conclusion: SGASA is an effective, adaptive, and scalable safety alignment method that lets models autonomously reinforce their defenses, balancing safety and usability. Abstract: Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Due to the covert and deceptive nature of such prompts, they can often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen models' robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts; and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.

[28] Can Finetuing LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

Steven Wang,Kyle Hunt,Shaojie Tang,Kenneth Joseph

Main category: cs.CL

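TL;DR: This study examines whether fine-tuning LLMs on small human survey samples makes them simulate human behavior more realistically. Fine-tuning improves diversity and coherence but cannot reproduce regression coefficients, so LLMs remain unsuitable as replacements for human participants in empirical inferential analysis.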

Details Motivation: To assess whether LLMs can substitute for human participants, especially in survey and experimental research, and to test whether fine-tuning mitigates their misalignment with human behavior. Method: Using an experiment on information disclosure behavior, compares human and LLM-generated responses across several dimensions, including distributional divergence, subgroup alignment, belief-action coherence, and recovery of regression coefficients; LLMs are fine-tuned on small human samples and the effect is evaluated. Result: Fine-tuning substantially improves response diversity, subgroup alignment, and belief-action coherence, but none of the fine-tuned models reproduces the regression coefficients of the original study. Conclusion: Although fine-tuning improves LLMs' ability to simulate human behavior, LLM-generated data remain unsuitable as a replacement for human participant data in formal empirical analysis because they do not recover the statistical inference results. Abstract: There is ongoing debate about whether large language models (LLMs) can serve as substitutes for human participants in survey and experimental research. While recent work in fields such as marketing and psychology has explored the potential of LLM-based simulation, a growing body of evidence cautions against this practice: LLMs often fail to align with real human behavior, exhibiting limited diversity, systematic misalignment for minority subgroups, insufficient within-group variance, and discrepancies between stated beliefs and actions. This study examines an important and distinct question in this domain: whether fine-tuning on a small subset of human survey data, such as that obtainable from a pilot study, can mitigate these issues and yield realistic simulated outcomes. Using a behavioral experiment on information disclosure, we compare human and LLM-generated responses across multiple dimensions, including distributional divergence, subgroup alignment, belief-action coherence, and the recovery of regression coefficients. We find that fine-tuning on small human samples substantially improves heterogeneity, alignment, and belief-action coherence relative to the base model. However, even the best-performing fine-tuned models fail to reproduce the regression coefficients of the original study, suggesting that LLM-generated data remain unsuitable for replacing human participants in formal inferential analyses.

[29] Developing an Open Conversational Speech Corpus for the Isan Language

Adisai Na-Thalang,Chanakan Wittayasakpan,Kritsadha Phatcharoen,Supakit Buakaw

Main category: cs.CL

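TL;DR: This paper presents the first open conversational speech dataset for the Isan language. Built on natural speech, it captures authentic linguistic phenomena of the dialect, addresses the transcription challenges caused by the lack of a standardized orthography, and aims to advance inclusive AI and research on underrepresented languages.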

Details Motivation: Isan lacks a standardized orthography, and existing speech corpora consist mostly of read speech that does not reflect real spoken language, so an open dataset based on natural conversation is needed to support technical research on underrepresented languages and inclusive AI development. Method: Defines practical transcription protocols that balance linguistic authenticity with the needs of computational processing, and collects and annotates natural conversational speech containing everyday spoken features (such as colloquialisms, disfluencies, and code-switching), addressing the inconsistency of Isan writing. Result: An open-access dataset of natural conversational Isan speech was developed and workable transcription guidelines were established, addressing the challenges posed by inconsistent orthography. Conclusion: The dataset provides an important foundation for Isan and other under-resourced languages, supporting progress in conversational speech modeling, language preservation, and inclusive speech technology. Abstract: This paper introduces the development of the first open conversational speech dataset for the Isan language, the most widely spoken regional dialect in Thailand. Unlike existing speech corpora that are primarily based on read or scripted speech, this dataset consists of natural speech, thereby capturing authentic linguistic phenomena such as colloquialisms, spontaneous prosody, disfluencies, and frequent code-switching with Central Thai. A key challenge in building this resource lies in the lack of a standardized orthography for Isan. Current writing practices vary considerably, due to the different lexical tones between Thai and Isan. This variability complicates the design of transcription guidelines and poses questions regarding consistency, usability, and linguistic authenticity. To address these issues, we establish practical transcription protocols that balance the need for representational accuracy with the requirements of computational processing. By releasing this dataset as an open resource, we aim to contribute to inclusive AI development, support research on underrepresented languages, and provide a basis for addressing the linguistic and technical challenges inherent in modeling conversational speech.

[30] PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

Robert Belanec,Branislav Pecher,Ivan Srba,Maria Bielikova

Main category: cs.CL

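TL;DR: This paper presents PEFT-Bench, a unified end-to-end benchmark for evaluating a range of parameter-efficient fine-tuning (PEFT) methods on autoregressive LLMs, and introduces the PSCP metric, which jointly accounts for trainable parameters, inference speed, and training memory.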

Details Motivation: Existing evaluations of PEFT methods cover limited models and datasets and are hard to reproduce, so a unified, reproducible benchmark is needed to evaluate different PEFT methods comprehensively. Method: Builds PEFT-Bench, a unified evaluation framework covering 27 NLP datasets and 6 PEFT methods, and proposes a new metric, PSCP, that jointly considers the number of trainable parameters, inference speed, and training memory usage. Result: Provides a systematic evaluation of multiple PEFT methods across a broad set of tasks, demonstrating the usability of PEFT-Bench and the effectiveness of the PSCP metric. Conclusion: PEFT-Bench and PSCP offer a reproducible, multi-dimensional evaluation standard for future PEFT research, helping advance efficient fine-tuning. Abstract: Despite the state-of-the-art performance of Large Language Models (LLMs) achieved on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility. Parameter-efficient fine-tuning (PEFT) methods address this challenge by reducing the number of trainable parameters while maintaining strong downstream performance. Despite the increased development in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce. To bridge this gap, we introduce PEFT-Bench, a unified end-to-end benchmark for evaluating diverse PEFT methods on autoregressive LLMs. We demonstrate its usage across 27 NLP datasets and 6 PEFT methods. To account for different PEFT training and inference factors, we also introduce the PEFT Soft Score Penalties (PSCP) metric, which takes trainable parameters, inference speed, and training memory usage into account.

[31] Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text

Kai Kugler

Main category: cs.CL

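TL;DR: The first systematic study of Martin's Law, the relationship between word frequency and polysemy, over the course of neural language model training; it finds a non-monotonic developmental trajectory with an optimal semantic window.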

Details Motivation: To examine whether neural language models follow Martin's Law, the empirical relationship between word frequency and polysemy found in human language, during training. Method: Uses DBSCAN clustering of contextualized word embeddings as an operationalization of word senses and analyzes four Pythia models of different sizes across 30 training checkpoints. Result: Martin's Law emerges mid-training, peaks, and then declines; small models undergo semantic collapse at late checkpoints, while large models degrade gracefully; the frequency-specificity trade-off remains stable throughout. Conclusion: Compliance with linguistic regularities in LLM-generated text does not increase monotonically with training but has an optimal semantic stage, suggesting that the way semantic development in models is evaluated should be reconsidered. Abstract: We present the first systematic investigation of Martin's Law - the empirical relationship between word frequency and polysemy - in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin's Law emerges around checkpoint 100, reaches peak correlation (r > 0.6) at checkpoint 104, then degrades by checkpoint 105. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable (r ≈ -0.3) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text is not monotonically increasing with training, but instead follows a balanced trajectory with an optimal semantic window. This work establishes a novel methodology for evaluating emergent linguistic structure in neural language models.

[32] Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model

Joshua Fonseca Rivera

Main category: cs.CL

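TL;DR: Through fine-tuning, a 7B-parameter language model is trained to reliably detect and report injected single-token "thoughts", reaching 85% accuracy with no false positives and satisfying three of Lindsey's criteria, which indicates that introspective behavior can be induced directly by training.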

Details Motivation: Lindsey (2025) found that language models' introspective recognition of injected activation patterns is limited and unreliable (about 20% success); this paper asks whether such self-report can be improved by direct training rather than relying on its natural emergence. Method: Fine-tunes a 7B-parameter language model on transient single-token injection data, training it to detect and report the injected semantic content, and evaluates generalization to unseen concepts and whether the accuracy, grounding, and internality criteria are met. Result: The model improves from near-complete failure (0.4% accuracy, 6.7% false positive rate) to 85% accuracy (at α=40), with 0% false positives on held-out concepts and good generalization to unseen concepts (a gap of only 7.5 percentage points). Conclusion: At least one component of introspective behavior can be induced by direct training, offering a feasible path toward AI systems with built-in transparency and addressing Lindsey's open question of whether training can eliminate cross-model differences. Abstract: Lindsey (2025) investigates introspective awareness in language models through four experiments, finding that models can sometimes detect and identify injected activation patterns -- but unreliably (~20% success in the best model). We focus on the first of these experiments -- self-report of injected "thoughts" -- and ask whether this capability can be directly trained rather than waiting for emergence. Through fine-tuning on transient single-token injections, we transform a 7B parameter model from near-complete failure (0.4% accuracy, 6.7% false positive rate) to reliable detection (85% accuracy on held-out concepts at α=40, 0% false positives). Our model detects fleeting "thoughts" injected at a single token position, retains that information, and reports the semantic content across subsequent generation steps. On this task, our trained model satisfies three of Lindsey's criteria: accuracy (correct identification), grounding (0/60 false positives), and internality (detection precedes verbalization). Generalization to unseen concept vectors (7.5pp gap) demonstrates the model learns a transferable skill rather than memorizing specific vectors, though this does not establish metacognitive representation in Lindsey's sense. These results address an open question raised by Lindsey: whether "training for introspection would help eliminate cross-model differences." We show that at least one component of introspective behavior can be directly induced, offering a pathway to built-in AI transparency.
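
A "transient single-token injection" can be pictured as adding a scaled concept vector to the hidden state at one token position and leaving every other position untouched. The sketch below assumes that additive form, with α as the scaling factor; the layer, the vectors, and the `inject` helper are illustrative and not taken from the paper.

```python
def inject(hidden_states, concept_vector, position, alpha):
    """Return a copy of hidden_states with alpha * concept_vector added at one position."""
    out = [list(h) for h in hidden_states]  # copy so the input stays unchanged
    out[position] = [
        h + alpha * c for h, c in zip(out[position], concept_vector)
    ]
    return out

# Three token positions with two-dimensional toy hidden states.
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
concept = [0.5, -0.25]  # made-up concept direction

steered = inject(hidden, concept, position=1, alpha=40)
# steered[1] == [21.0, -9.0]; positions 0 and 2 are unchanged
```

Because only one position is modified, a model that later reports the concept must have carried the information forward through subsequent generation steps, which is what the abstract describes as retaining the fleeting "thought".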

[33] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

Antonín Jarolím,Martin Fajčík,Lucia Makaiová

Main category: cs.CL

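TL;DR: This paper studies fine-grained evidence extraction for Czech and Slovak claims, builds a dataset with two-way annotations produced by paid annotators, and evaluates how well LLMs agree with the human annotations on this task. Existing models often fail to copy evidence verbatim from the source text, producing invalid outputs; llama3.1:8b performs well while gpt-oss-120b performs poorly, and qwen3:14b, deepseek-r1:32b, and gpt-oss:20b strike a good balance between model size and alignment with human annotations.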

Details Motivation: Misinformation often spreads in comments under online news, so factually incorrect content must be detected effectively. To firmly support or refute claims in comments, the text spans that justify or contradict each claim must be pinpointed. Method: Builds a new dataset of fine-grained evidence with two-way annotations by paid annotators and evaluates several LLMs on it for agreement with the human annotations. Result: LLMs often fail to copy evidence verbatim from the source text, producing invalid outputs; llama3.1:8b achieves a high proportion of correct outputs despite its small size, while gpt-oss-120b underperforms despite having many more parameters; qwen3:14b, deepseek-r1:32b, and gpt-oss:20b balance model size and alignment with human annotations. Conclusion: Fine-grained evidence extraction remains challenging for current LLMs, especially exact copying of source text; performance does not depend solely on parameter count, and some small and mid-sized models align better with human annotations. Abstract: Misinformation frequently spreads in user comments under online news articles, highlighting the need for effective methods to detect factually incorrect information. To strongly support or refute claims extracted from such comments, it is necessary to identify relevant documents and pinpoint the exact text spans that justify or contradict each claim. This paper focuses on the latter task -- fine-grained evidence extraction for Czech and Slovak claims. We create a new dataset containing two-way annotated fine-grained evidence created by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations. The results reveal that LLMs often fail to copy evidence verbatim from the source text, leading to invalid outputs. Error-rate analysis shows that the llama3.1:8b model achieves a high proportion of correct outputs despite its relatively small size, while the gpt-oss-120b model underperforms despite having many more parameters. Furthermore, the models qwen3:14b, deepseek-r1:32b, and gpt-oss:20b demonstrate an effective balance between model size and alignment with human annotations.
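
The "invalid output" failure mode suggests a simple validity check: an extracted span counts only if it appears verbatim in the source. The sketch below is one way to write such a check; treating runs of whitespace as equivalent is an assumption made here, and the paper may apply a stricter or looser criterion.

```python
def is_verbatim(evidence, source):
    """True if the evidence span occurs in the source, ignoring whitespace runs."""
    def normalize(text):
        return " ".join(text.split())
    evidence = normalize(evidence)
    return bool(evidence) and evidence in normalize(source)

# Made-up source text for illustration.
source = "The city council approved the budget on Monday. The vote was 7 to 2."

copied = is_verbatim("approved the budget on Monday", source)       # True
paraphrased = is_verbatim("approved the budget on Tuesday", source)  # False
```

A check like this separates outputs that are merely different from the human annotation from outputs that cannot be aligned to the source at all.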

[34] Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation

Zhifeng Hao,Qibin Song,Ruichu Cai,Boyan Xu

Main category: cs.CL

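TL;DR: DSR-SQL is a dual-state reasoning framework that improves LLM Text-to-SQL performance on complex databases by modeling the interaction between a context state and a generation state, achieving strong results without post-training or in-context examples.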

Details Motivation: Existing chain-of-thought Text-to-SQL methods are limited on complex enterprise databases by context capacity, unreliable schema linking, and weak semantic grounding, and struggle to maintain coherent reasoning. Method: Proposes the DSR-SQL framework with two states: the first builds a compact, semantically faithful context by refining large schemas and selecting relevant structures; the second formalizes SQL generation as feedback-guided state transitions, allowing the model to self-correct and align with user intent. Result: Without any post-training or in-context examples, DSR-SQL reaches 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set. Conclusion: DSR-SQL improves LLM Text-to-SQL capability on complex databases and is practical and scalable; the code will be open-sourced to support reproducibility. Abstract: Recent divide-and-conquer reasoning approaches, particularly those based on Chain-of-Thought (CoT), have substantially improved the Text-to-SQL capabilities of Large Language Models (LLMs). However, when applied to complex enterprise databases, such methods struggle to maintain coherent reasoning due to limited context capacity, unreliable schema linking, and weak grounding in database semantics. To overcome these issues, we introduce DSR-SQL, a Dual-State Reasoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. The first constructs a compact, semantically faithful environment by refining large schemas and selecting relevant structures, while the second formalizes SQL synthesis as feedback-guided state transitions, enabling the model to self-correct and align with user intent. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set. Our implementation will be open-sourced at: https://github.com/DMIRLAB-Group/DSR-SQL.
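
The idea of feedback-guided state transitions can be illustrated with a small execute-and-retry loop: a candidate query is run, and any database error is passed back to the generator for the next attempt. This is a generic sketch under that assumption, not the DSR-SQL implementation; `propose` stands in for an LLM call and returns hard-coded queries, and the table is invented.

```python
import sqlite3

def run_with_feedback(conn, propose, max_steps=3):
    """Execute proposed SQL, feeding any execution error back to the proposer."""
    feedback = None
    for _ in range(max_steps):
        sql = propose(feedback)
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            feedback = f"{sql!r} failed: {exc}"  # becomes input to the next step
    return None, None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 5.5)])

def propose(feedback):
    """Hypothetical stand-in for a model: the first guess uses a wrong column."""
    if feedback is None:
        return "SELECT SUM(amount) FROM orders"
    return "SELECT SUM(total) FROM orders"

sql, rows = run_with_feedback(conn, propose)
# sql is the corrected query; rows == [(15.5,)]
```

Execution errors are only one kind of feedback; a fuller system would also need to catch queries that run successfully but do not match the user's intent.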

[35] Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning

Kaifeng Hong,Yinglong Zhang,Xiaoying Hong,Xuewen Xia,Xing Xu

Main category: cs.CL

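TL;DR: This paper proposes Odin, a new architecture that injects graph structure into selected Transformer layers through an oriented dual-module mechanism, effectively combining text with graph structure. Unlike GNNs that rely on multi-hop diffusion, Odin avoids over-smoothing and is more expressive than both pure Transformers and GNNs. Odin reaches state-of-the-art accuracy on multiple text-rich graph benchmarks, and a lightweight variant, Light Odin, delivers competitive performance at significantly lower computational cost.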

Details Motivation: Existing methods for text-attributed graphs are limited: GNNs suffer from over-smoothing and hop-dependent diffusion, while Transformers ignore graph topology. A new model is needed that combines strong textual understanding with structural reasoning. Method: The Odin architecture injects graph structure through an oriented dual-module mechanism at Transformer layers of selected depth. It does not rely on multi-hop diffusion; instead, multi-hop structures are integrated at specific layers, yielding low-, mid-, and high-level structural abstraction aligned with the semantic hierarchy. Aggregation operates on the global [CLS] representation, which avoids over-smoothing. A lightweight variant, Light Odin, is proposed for efficiency. Result: Odin achieves state-of-the-art accuracy on multiple text-rich graph benchmarks, and Light Odin keeps competitive performance at significantly reduced computational cost. The paper also establishes that Odin is more expressive than pure Transformers and GNNs. Conclusion: Odin and Light Odin form a unified, hop-free framework for structure-text integration that addresses the inherent problems of GNNs and Transformers on text graphs and offers a new paradigm for text-graph modeling. Abstract: Text-attributed graphs require models to effectively combine strong textual understanding with structurally informed reasoning. Existing approaches either rely on GNNs--limited by over-smoothing and hop-dependent diffusion--or employ Transformers that overlook graph topology and treat nodes as isolated sequences. We propose Odin (Oriented Dual-module INtegration), a new architecture that injects graph structure into Transformers at selected depths through an oriented dual-module mechanism. Unlike message-passing GNNs, Odin does not rely on multi-hop diffusion; instead, multi-hop structures are integrated at specific Transformer layers, yielding low-, mid-, and high-level structural abstraction aligned with the model's semantic hierarchy. Because aggregation operates on the global [CLS] representation, Odin fundamentally avoids over-smoothing and decouples structural abstraction from neighborhood size or graph topology. We further establish that Odin's expressive power strictly contains that of both pure Transformers and GNNs. To make the design efficient in large-scale or low-resource settings, we introduce Light Odin, a lightweight variant that preserves the same layer-aligned structural abstraction for faster training and inference. Experiments on multiple text-rich graph benchmarks show that Odin achieves state-of-the-art accuracy, while Light Odin delivers competitive performance with significantly reduced computational cost.
Together, Odin and Light Odin form a unified, hop-free framework for principled structure-text integration. The source code of this model has been released at https://github.com/hongkaifeng/Odin.

[36] A Systematic Study of Model Merging Techniques in Large Language Models

Oğuz Kağan Hitit,Leander Girrbach,Zeynep Akata

Main category: cs.CL

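TL;DR: This paper presents a large-scale, systematic evaluation of six state-of-the-art model merging methods on large language models (LLMs). The simplest method, Task Arithmetic, is the only one that reliably improves performance; the others often degrade it, which indicates that existing merging techniques do not transfer directly to modern LLMs and that LLM-specific merging algorithms and fine-tuning methods are needed.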

Details Motivation: To examine whether model merging methods that work on small models and classifiers deliver the same advantages on large language models (LLMs), and to explore their transferability and effectiveness. Method: Six state-of-the-art merging methods, including recent subspace methods, are evaluated systematically across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks, measuring the gains of merged models relative to the base model and to the best individual checkpoint. Result: Task Arithmetic, the simplest and oldest method, is the only merging approach that reliably improves LLM performance; other interference-aware and subspace merging methods usually cause significant performance drops. Conclusion: Current model merging techniques do not generalize directly to modern LLMs; merging algorithms designed specifically for LLMs, along with merging-aware fine-tuning methods, need to be developed. Abstract: Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.
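
As a reminder of what Task Arithmetic does, the following minimal sketch adds scaled "task vectors" (fine-tuned weights minus base weights) to the base model. It is an illustration under assumptions rather than the paper's code: weights are plain Python lists keyed by parameter name, and the `scale` value of 0.5 is an arbitrary choice.

```python
def task_arithmetic(base, checkpoints, scale=1.0):
    """Merge checkpoints as base + scale * sum(checkpoint - base), per parameter."""
    merged = {}
    for name, base_w in base.items():
        merged[name] = [
            # Task vector entry for one checkpoint is (fine-tuned - base).
            b + scale * sum(ckpt[name][i] - b for ckpt in checkpoints)
            for i, b in enumerate(base_w)
        ]
    return merged


# Toy weights: one parameter tensor "w" with two entries.
base = {"w": [1.0, 2.0]}
ckpt_a = {"w": [1.5, 2.0]}  # fine-tuned on task A (hypothetical)
ckpt_b = {"w": [1.0, 3.0]}  # fine-tuned on task B (hypothetical)

merged = task_arithmetic(base, [ckpt_a, ckpt_b], scale=0.5)
print(merged)  # {'w': [1.25, 2.5]}
```

With real checkpoints the same arithmetic would be applied to every tensor in the state dict, and the scaling coefficient is typically tuned on held-out data.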

[37] Hierarchical Ranking Neural Network for Long Document Readability Assessment

Yurui Zheng,Yijun Chen,Shaohong Zhang

Main category: cs.CL

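TL;DR: This paper proposes a bidirectional readability assessment mechanism and a pairwise sorting algorithm to address the failure of existing deep learning methods to account for text length and the ordinal relationship of readability labels. Sentence-level predictions assist document-level readability assessment, and the model is validated on Chinese and English datasets.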

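Details Motivation: Existing deep learning methods for readability assessment tend to ignore the effect of text length and the ordinal relationship between readability labels, which limits their effectiveness. A method is needed that captures both contextual information and label order. Method: A bidirectional readability assessment mechanism uses contextual information to identify semantically rich regions of the text, predicts sentence-level readability, and uses those predictions for document-level prediction; a pairwise sorting algorithm based on label subtraction models the ordinal relationship between readability levels. Result: Experiments on Chinese and English datasets show that the proposed model performs competitively and outperforms the other baselines. Conclusion: The method improves readability assessment, particularly when handling texts of different lengths and modeling the order of readability levels, and it is suitable for readability analysis in multilingual settings. Abstract: Readability assessment aims to evaluate the reading difficulty of a text. In recent years, while deep learning technology has been gradually applied to readability assessment, most approaches fail to consider either the length of the text or the ordinal relationship of readability labels. This paper proposes a bidirectional readability assessment mechanism that captures contextual information to identify regions with rich semantic information in the text, thereby predicting the readability level of individual sentences. These sentence-level labels are then used to assist in predicting the overall readability level of the document. Additionally, a pairwise sorting algorithm is introduced to model the ordinal relationship between readability levels through label subtraction. Experimental results on Chinese and English datasets demonstrate that the proposed model achieves competitive performance and outperforms other baseline models.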

[38] Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

Lina Conti,Dennis Fucci,Marco Gaido,Matteo Negri,Guillaume Wisniewski,Luisa Bentivogli

Main category: cs.CL

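TL;DR: This study examines how speech translation (ST) models handle terms that refer to the speaker's gender, showing how the models combine training data patterns, internal language model bias, and acoustic information when assigning gender. The models learn a general pattern of masculine prevalence, yet they can override the language model's masculine bias based on acoustic input. Contrastive feature attribution shows that the model with higher gender accuracy uses first-person pronouns to link gendered terms to the speaker and draws on information distributed across the frequency spectrum rather than relying on pitch alone.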

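Details Motivation: Speech carries acoustic cues such as the speaker's gender, so translating from a language without grammatical gender (such as English) into one with grammatical gender can lead to misgendering based on vocal characteristics. How ST models make these gender assignment decisions is still poorly understood, and their mechanisms need to be investigated to reduce modality-specific bias. Method: For three language pairs (en-es/fr/it), the study analyzes how ST models assign grammatical gender to speaker-referring terms during translation, considering the gender distribution in the training data, internal language model (ILM) bias, and the influence of acoustic input. Contrastive feature attribution is applied to mel spectrograms to identify the acoustic features the model uses for gender decisions. Result: The models do not simply replicate term-specific gender associations from the training data but learn a broader pattern of masculine prevalence; although the ILM has a strong masculine bias, the models can adjust this preference based on the acoustic signal; the model with higher accuracy uses first-person pronouns as a bridge to link gender information to the speaker and relies on acoustic features distributed across the whole spectrum rather than on pitch alone. Conclusion: ST models show a complex decision mechanism in translating gendered references, integrating language model priors with acoustic evidence instead of relying purely on pitch or masculine defaults. The findings reveal a new, distributed way of using gender information and offer a new perspective on mitigating gender bias in speech translation. Abstract: Unlike text, speech conveys information about the speaker, such as gender, through acoustic cues like pitch. This gives rise to modality-specific bias concerns. For example, in speech translation (ST), when translating from languages with notional gender, such as English, into languages where gender-ambiguous terms referring to the speaker are assigned grammatical gender, the speaker's vocal characteristics may play a role in gender assignment. This risks misgendering speakers, whether through masculine defaults or vocal-based assumptions. Yet, how ST models make these decisions remains poorly understood. We investigate the mechanisms ST models use to assign gender to speaker-referring terms across three language pairs (en-es/fr/it), examining how training data patterns, internal language model (ILM) biases, and acoustic information interact. We find that models do not simply replicate term-specific gender associations from training data, but learn broader patterns of masculine prevalence. While the ILM exhibits strong masculine bias, models can override these preferences based on acoustic input. Using contrastive feature attribution on spectrograms, we reveal that the model with higher gender accuracy relies on a previously unknown mechanism: using first-person pronouns to link gendered terms back to the speaker, accessing gender information distributed across the frequency spectrum rather than concentrated in pitch.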

[39] Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects

Husne Ara Rubaiyeat,Hasan Mahmud,Md Kamrul Hasan

Main category: cs.CL

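TL;DR: This paper introduces the IsharaKhobor dataset and its subsets to advance research on Bangla Sign Language Translation (BdSLT), addresses the challenges of a low-resource language, and points to future research directions through benchmarking and vocabulary canonicalization.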

Details Motivation: Bangla Sign Language is extremely low-resource and lacks a standard sentence-level dataset, which severely limits the development of AI-based assistive tools for deaf and hard-of-hearing people; a high-quality dataset is needed to advance this research. Method: The paper presents the IsharaKhobor dataset and two subsets (IsharaKhobor_small and IsharaKhobor_canonical_small), benchmarks them with landmark-based raw and RQE embeddings, and runs ablations on vocabulary restriction and canonicalization. Result: The IsharaKhobor dataset and its variants were built and released publicly; the experiments show how different vocabulary handling strategies affect performance on the dataset and provide a benchmark for the BdSLT task. Conclusion: IsharaKhobor fills a gap in Bangla Sign Language Translation, provides a foundation for AI-driven assistive technology, and indicates a feasible path for building sign language datasets under low-resource conditions. Abstract: Bangla Sign Language Translation (BdSLT) has been severely constrained so far, as the language itself is very low-resource. Creating a standard sentence-level dataset for BdSLT is of immense importance for developing AI-based assistive tools for deaf and hard-of-hearing people in the Bangla-speaking community. In this paper, we present a dataset, IsharaKhobor, and two subsets of it to enable research. We also present the challenges in developing the dataset and suggest a way forward by benchmarking with landmark-based raw and RQE embeddings. We perform ablations on vocabulary restriction and canonicalization within the dataset, which resulted in two more datasets, IsharaKhobor_small and IsharaKhobor_canonical_small. The dataset is publicly available at: www.kaggle.com/datasets/hasanssl/isharakhobor [1].

[40] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

Minjoon Choi

Main category: cs.CL

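TL;DR: This paper introduces the RoParQ benchmark and the XParaCon metric for measuring how consistently large language models answer paraphrased questions, and improves alignment toward semantic invariance through reasoning-based supervised fine-tuning; experiments show that the approach significantly improves robustness.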

Details Motivation: Large language models often answer paraphrased questions inconsistently, which suggests that they rely on surface patterns rather than real semantic understanding; a new way to evaluate and improve cross-paraphrase consistency is needed. Method: The RoParQ benchmark is built by generating paraphrases of standard datasets with proprietary models and keeping the examples that elicit inconsistent confidence from a judge model; the XParaCon metric quantifies robustness through the standard deviation of a model's accuracies across question variants; a reasoning-based, paraphrase-aware supervised fine-tuning (SFT) strategy strengthens alignment toward semantic invariance. Result: Lightweight models that receive the targeted fine-tuning reach consistency levels comparable to much larger pre-trained models, and XParaCon reflects the gain in robustness. Conclusion: Together, the RoParQ benchmark, the XParaCon metric, and the paraphrase-aware SFT strategy mitigate superficial memorization and support more reliable LLMs with better semantic understanding. Abstract: Large Language Models (LLMs) often exhibit inconsistent behavior when answering paraphrased questions, suggesting a reliance on surface-level patterns rather than true semantic understanding. To address this limitation, we introduce RoParQ, a benchmark specifically constructed to evaluate cross-paraphrase consistency in closed-book multiple-choice QA. This benchmark is derived from standard datasets by generating paraphrases via proprietary models and selectively retaining examples that elicit inconsistent confidence from a judge model. We further propose XParaCon, a novel evaluation metric that quantifies a model's robustness by measuring the standard deviation of accuracies across question variants. Additionally, we implement a reasoning-based, paraphrase-aware Supervised Fine-Tuning (SFT) strategy designed to align models toward semantic invariance. Our experiments demonstrate that this targeted alignment significantly enhances robustness. Notably, fine-tuned lightweight models achieved consistency levels comparable to much larger pre-trained models. These results highlight the efficacy of our approach in mitigating superficial memorization and fostering more robust, reliable LLMs.
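
The core quantity behind XParaCon, the spread of accuracy across paraphrase variants, can be sketched as follows. The abstract does not give the exact formula, so this block computes only a plain population standard deviation; the function names and the toy predictions are assumptions made for illustration.

```python
from statistics import pstdev


def accuracy(predictions, answers):
    """Fraction of questions answered correctly."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def accuracy_std_across_variants(variant_predictions, answers):
    """Standard deviation of per-variant accuracies (lower means more consistent)."""
    accuracies = [accuracy(preds, answers) for preds in variant_predictions]
    return pstdev(accuracies)


# Gold answers for four multiple-choice questions (toy data).
answers = ["A", "B", "C", "D"]

# One row of predictions per wording of the same four questions.
variant_predictions = [
    ["A", "B", "C", "D"],  # original wording, accuracy 1.00
    ["A", "B", "C", "A"],  # paraphrase 1, accuracy 0.75
    ["A", "C", "C", "A"],  # paraphrase 2, accuracy 0.50
]

spread = accuracy_std_across_variants(variant_predictions, answers)
print(round(spread, 4))
```

A model that answers every variant identically has a spread of zero, whatever its absolute accuracy.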

[41] Auxiliary Metrics Help Decoding Skill Neurons in the Wild

Yixiu Zhao,Xiaozhi Wang,Zijun Yao,Lei Hou,Juanzi Li

Main category: cs.CL

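TL;DR: This paper proposes a lightweight, broadly applicable method that identifies neurons encoding specific skills in large language models by correlating neuron activations with auxiliary metrics such as external labels or the model's own confidence. The method is validated on several tasks and reveals new shortcuts in arithmetic reasoning.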

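Details Motivation: Large language models are highly capable but internally opaque, and existing methods struggle to identify the neurons associated with specific skills, especially in complex scenarios that involve multiple skills. Method: Building on prior work that found "skill neurons" through soft prompt training, the approach extends the analysis to complex multi-skill scenarios by correlating neuron activations with auxiliary metrics such as external labels and the model's confidence, uncovering interpretable, task-specific behaviors without manual token aggregation. Result: The method is validated on open-ended text generation and natural language inference, where it detects neurons that drive known skills and uncovers previously unidentified shortcuts in arithmetic reasoning on BigBench. Conclusion: The proposed method isolates skill-related neurons effectively, is interpretable and broadly applicable, and helps explain the internal workings of large language models. Abstract: Large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, yet their internal mechanisms remain largely opaque. In this paper, we introduce a simple, lightweight, and broadly applicable method with a focus on isolating neurons that encode specific skills. Building upon prior work that identified "skill neurons" via soft prompt training on classification tasks, our approach extends the analysis to complex scenarios involving multiple skills. We correlate neuron activations with auxiliary metrics -- such as external labels and the model's own confidence score -- thereby uncovering interpretable and task-specific behaviors without the need for manual token aggregation. We empirically validate our method on tasks spanning open-ended text generation and natural language inference, demonstrating its ability to detect neurons that not only drive known skills but also reveal previously unidentified shortcuts in arithmetic reasoning on BigBench.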
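
Correlating neuron activations with an auxiliary metric can be illustrated with a few lines of code. This is a sketch under assumptions: the abstract does not say which correlation measure is used, so Pearson correlation is chosen here, and the neuron names, activations, and labels are invented toy values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def rank_neurons(activations, metric):
    """Sort neurons by absolute correlation with the auxiliary metric."""
    scores = {nid: pearson(acts, metric) for nid, acts in activations.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Per-example activations for three hypothetical neurons.
activations = {
    "n0": [0.1, 0.9, 0.2, 0.8],
    "n1": [0.5, 0.4, 0.6, 0.5],
    "n2": [0.9, 0.1, 0.8, 0.2],
}
# Auxiliary metric per example, e.g. an external label (toy values).
metric = [0, 1, 0, 1]

ranking = rank_neurons(activations, metric)
for neuron_id, score in ranking:
    print(neuron_id, round(score, 3))
```

Neurons with a large absolute correlation are candidates for encoding the skill that the metric tracks; the sign indicates whether they activate with or against it.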

[42] Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

Dongyang Fan,Diba Hashemi,Sai Praneeth Karimireddy,Martin Jaggi

Main category: cs.CL

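TL;DR: This study explores accelerating large language model pretraining with several kinds of metadata, such as fine-grained indicators of document quality. It finds that metadata encoded at a finer granularity is more effective, and it proposes metadata appending as an auxiliary task and learnable meta-tokens to improve training efficiency.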

Details Motivation: Prior work used only one metadata signal, URLs, to accelerate LLM pretraining; this paper explores whether other types of metadata can bring greater benefits. Method: The study examines a range of metadata types, analyzes their effect when prepended or appended during pretraining, introduces learnable meta-tokens trained with a masked loss, and uses probing to analyze how metadata shapes latent representations. Result: Metadata such as fine-grained document quality indicators significantly accelerates pretraining; both metadata appending as an auxiliary task and learnable meta-tokens improve training efficiency; probing shows that metadata induces quality-aware latent structure. Conclusion: Effective metadata encodes information at a fine granularity, and a well-designed use of metadata (appending, learnable tokens) improves the efficiency and effectiveness of LLM pretraining. Abstract: Incorporating metadata in Large Language Model (LLM) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that other types of metadata, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.

[43] The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry

Anna Marklová,Ondřej Vinš,Martina Vokáčová,Jiří Milička

Main category: cs.CL

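TL;DR: This study examines whether Czech AI-generated poetry can be told apart from human-written poetry and how both are judged aesthetically. Native speakers could hardly distinguish the two, and they rated poems they believed to be AI-written less favorably, which reveals an authorship bias.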

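Details Motivation: To examine the quality of AI-generated poetry and how it is perceived in a complex language with less training data, such as Czech, and in particular how readers' beliefs about authorship affect aesthetic judgment. Method: Czech native speakers read AI-written and human-written poems; the study assessed their ability to identify the author and their aesthetic evaluations, and used logistic regression to analyze the factors that influence recognition accuracy. Result: Participants performed close to chance when guessing authorship (45.8% correct on average), indicating that Czech AI-generated poems are hard to distinguish from human ones; poems believed to be AI-written received lower aesthetic ratings even though AI poems were in fact rated equally or more favorably; the more participants liked a poem, the less accurately they identified its author; familiarity with poetry or a literary background did not affect recognition accuracy. Conclusion: AI can produce convincing poetry in a morphologically complex, low-resource Slavic language such as Czech, and readers' aesthetic evaluations are influenced by their beliefs about authorship, which shows a link between perception and aesthetic judgment. Abstract: Large language models are increasingly capable of producing creative texts, yet most studies on AI-generated poetry focus on English -- a language that dominates training data. In this paper, we examine the perception of AI- and human-written Czech poetry. We ask if Czech native speakers are able to identify it and how they aesthetically judge it. Participants performed at chance level when guessing authorship (45.8% correct on average), indicating that Czech AI-generated poems were largely indistinguishable from human-written ones. Aesthetic evaluations revealed a strong authorship bias: when participants believed a poem was AI-generated, they rated it less favorably, even though AI poems were in fact rated equally or more favorably than human ones on average. The logistic regression model showed that the more people liked a poem, the less likely they were to assign its authorship accurately. Familiarity with poetry or literary background had no effect on recognition accuracy. Our findings show that AI can convincingly produce poetry even in a morphologically complex, low-resource (with respect to the training data of AI models) Slavic language such as Czech. The results suggest that readers' beliefs about authorship and the aesthetic evaluation of the poem are interconnected.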

[44] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

Dong Wang,Yang Li,Ansong Ni,Ching-Feng Yeh,Youssef Emad,Xinjie Lei,Liam Robbins,Karthik Padthe,Hu Xu,Xian Li,Asli Celikyilmaz,Ramya Raghavendra,Lifei Huang,Carole-Jean Wu,Shang-Wen Li

Main category: cs.CL

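TL;DR: This paper presents Matrix, a decentralized multi-agent framework for synthetic data generation that passes messages through distributed queues and removes the central orchestrator, enabling high-throughput, scalable, and flexible synthetic data production.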

Details Motivation: Existing multi-agent synthetic data frameworks either depend on a centralized orchestrator, which scales poorly, or are limited to specific domains and lack flexibility. Method: Matrix represents control flow and data flow as serialized messages passed through distributed queues in a peer-to-peer architecture; tasks are advanced independently by lightweight agents, compute-intensive operations are handled by distributed services, and the system is built on Ray to support large-scale concurrent workflows. Result: Across several data generation scenarios (multi-agent dialogue, web-based reasoning data extraction, and tool-use trajectory generation for customer service), Matrix achieves 2-15x higher data generation throughput on the same hardware without sacrificing output quality. Conclusion: Matrix offers a modular, configurable, decentralized framework that scales efficiently to tens of thousands of concurrent workflows and applies to a wide range of synthetic data generation tasks. Abstract: Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present Matrix, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves 2-15x higher data generation throughput under identical hardware resources, without compromising output quality.

[45] ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

Hongjin Su,Shizhe Diao,Ximing Lu,Mingjie Liu,Jiacheng Xu,Xin Dong,Yonggan Fu,Peter Belcak,Hanrong Ye,Hongxu Yin,Yi Dong,Evelina Bakhturina,Tao Yu,Yejin Choi,Jan Kautz,Pavlo Molchanov

Main category: cs.CL

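TL;DR: This paper proposes ToolOrchestra, a method for training a small orchestration model (Orchestrator) that coordinates a variety of intelligent tools through reinforcement learning and solves complex tasks with higher accuracy and efficiency than existing methods. The 8B model surpasses GPT-5 on benchmarks such as Humanity's Last Exam at lower cost, which demonstrates the advantage of lightweight orchestration models for tool-augmented reasoning.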

Details Motivation: Large language models are generalists, but on complex problems such as Humanity's Last Exam they still face a ceiling on intelligence and high computational cost, so more efficient and scalable ways of using tools are needed. Method: ToolOrchestra trains a small orchestration model (Orchestrator) with reinforcement learning whose rewards account for outcome, efficiency, and user preference; the orchestrator dynamically selects and schedules suitable tools to complete complex tasks. Result: The 8B Orchestrator scores 37.1% on HLE, ahead of GPT-5 at 35.1%, while being 2.5x more efficient; on tau2-Bench and FRAMES it clearly outperforms GPT-5 at only about 30% of the cost. The model achieves the best trade-off between performance and cost and generalizes to unseen tools. Conclusion: Composing diverse tools with a lightweight orchestration model is more efficient and more effective than relying on a large model alone, and it opens a path toward practical, scalable tool-augmented reasoning systems. Abstract: Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.

[46] Revisiting Generalization Across Difficulty Levels: It's Not So Easy

Yeganeh Kordi,Nihal V. Nayak,Max Zuo,Ilana Nguyen,Stephen H. Bach

Main category: cs.CL

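TL;DR: Through a large-scale, fine-grained analysis that uses thousands of large language models and Item Response Theory (IRT) to estimate task difficulty, this study finds that LLMs generalize only to a limited extent across difficulty levels. Choosing easy or hard training data does not improve performance consistently at every difficulty level, so training and evaluation should include a diverse range of difficulties.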

Details Motivation: Existing research disagrees on how the difficulty of training data affects LLM generalization to tasks of different difficulty; this paper systematically studies cross-difficulty generalization to guide data curation and model evaluation. Method: Examples in six datasets are ranked by difficulty using the outputs of thousands of LLMs together with Item Response Theory (IRT), so that difficulty is determined by model behavior rather than by human judgment; a fine-grained analysis of cross-difficulty generalization is then carried out over multiple models and datasets. Result: Generalization across difficulty levels is often limited; training only on easy data or only on hard data does not yield consistent improvements across the full range of difficulties. Conclusion: To keep LLMs effective and robust, both training and evaluation data should cover a diverse range of examples from easy to hard; taking shortcuts by relying only on easy or only on hard data is risky. Abstract: We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs' generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.
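
For readers unfamiliar with Item Response Theory, the sketch below shows the one-parameter (Rasch) form, in which the probability of a correct answer depends on the gap between a model's ability and an example's difficulty, followed by a split of ranked examples into difficulty bins. The abstract does not state which IRT variant or binning scheme the paper uses, so both are assumptions, and the difficulty values are invented.

```python
import math


def p_correct(ability, difficulty):
    """Rasch (1PL) model: logistic function of ability minus difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def rank_by_difficulty(difficulties):
    """Return example ids ordered from easiest to hardest."""
    return sorted(difficulties, key=difficulties.get)


def split_into_bins(ranked, k):
    """Split a ranked list into k contiguous, near-equal difficulty groups."""
    n = len(ranked)
    return [ranked[i * n // k:(i + 1) * n // k] for i in range(k)]


# Hypothetical fitted difficulties for six examples.
difficulties = {"e1": -1.2, "e2": 0.3, "e3": 2.1, "e4": -0.4, "e5": 1.0, "e6": 0.0}

ranked = rank_by_difficulty(difficulties)
bins = split_into_bins(ranked, 3)
print(ranked)
print(bins)
print(round(p_correct(ability=0.5, difficulty=difficulties["e3"]), 3))
```

In practice the abilities and difficulties are fitted jointly from a large response matrix of models by examples; only the scoring and grouping steps are shown here.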

cs.CV [Back]

[47] Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?

David Amebley,Sayanton Dibbo

Main category: cs.CV

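TL;DR: This paper proposes a neuroscience-inspired topological regularization (tau) framework that strengthens the privacy resilience of multi-modal vision-language models (VLMs) against black-box membership inference attacks (MIA), and validates it on multiple models and datasets.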

Details Motivation: As multi-modal models are deployed more widely, their privacy leakage risks, especially membership inference attacks, become more prominent. Existing research focuses mainly on unimodal systems, and the robustness of multi-modal models, particularly neuro-inspired ones, under privacy attacks remains unexplored. Method: A neuroscience-inspired topological regularization (tau) framework is used to build "NEURO VLM" variants with stronger topological structure. Their defense against image-text membership inference attacks is evaluated on three VLMs (BLIP, PaliGemma 2, and ViT-GPT2) and three benchmark datasets (COCO, CC3M, and NoCaps). Result: NEURO VLMs with tau regularization significantly lower MIA attack success (for example, mean ROC-AUC drops by 24% for BLIP on COCO) while keeping generation quality comparable to the original models (similar MPNet and ROUGE-2 scores). The result is consistent across models and datasets. Conclusion: Neuroscience-inspired topological regularization improves the resilience of multi-modal vision-language models to membership inference attacks without significantly sacrificing utility, offering a new approach to building safer multi-modal AI systems. Abstract: In the age of agentic AI, the growing deployment of multi-modal models (MMs) has introduced new attack vectors that can leak sensitive training data in MMs, causing privacy leakage. This paper investigates a black-box privacy attack, i.e., membership inference attack (MIA) on multi-modal vision-language models (VLMs). State-of-the-art research analyzes privacy attacks primarily on unimodal AI-ML systems, while recent studies indicate MMs can also be vulnerable to privacy attacks. While researchers have demonstrated that biologically inspired neural network representations can improve unimodal model resilience against adversarial attacks, it remains unexplored whether neuro-inspired MMs are resilient against privacy attacks. In this work, we introduce a systematic neuroscience-inspired topological regularization (tau) framework to analyze MM VLMs' resilience against image-text-based inference privacy attacks. We examine this phenomenon using three VLMs: BLIP, PaliGemma 2, and ViT-GPT2, across three benchmark datasets: COCO, CC3M, and NoCaps. Our experiments compare the resilience of baseline and neuro VLMs (with topological regularization), where the tau > 0 configuration defines the NEURO variant of VLM. Our results on the BLIP model using the COCO dataset illustrate that MIA attack success in NEURO VLMs drops by 24% mean ROC-AUC, while achieving similar model utility (similarities between generated and reference captions) in terms of MPNet and ROUGE-2 metrics.
This shows that neuro VLMs are comparatively more resilient against privacy attacks, while not significantly compromising model utility. Our extensive evaluation with the PaliGemma 2 and ViT-GPT2 models on two additional datasets, CC3M and NoCaps, further validates the consistency of the findings. This work contributes to the growing understanding of privacy risks in MMs and provides evidence of neuro VLMs' resilience to privacy threats.
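
Attack success in this abstract is reported as ROC-AUC, which for a membership inference attack equals the probability that a training-set member receives a higher attack score than a non-member. The sketch below computes that quantity directly; the scores are invented toy values and are not related to the paper's reported numbers.

```python
def roc_auc(member_scores, nonmember_scores):
    """AUC as the fraction of (member, non-member) pairs ranked correctly; ties count half."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attack scores (higher means "more likely a training member").
baseline_auc = roc_auc([0.9, 0.8, 0.7], [0.6, 0.4, 0.75])
regularized_auc = roc_auc([0.6, 0.5, 0.7], [0.55, 0.65, 0.5])

print(round(baseline_auc, 3), round(regularized_auc, 3))
```

An AUC near 0.5 means the attacker does no better than guessing, so a defense is judged by how far it pushes the value toward 0.5.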

[48] Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

Inferix Team,Tianyu Feng,Yizeng Han,Jiahao He,Yuanyu He,Xi Lin,Teng Liu,Hanfeng Lu,Jiasheng Tang,Wei Wang,Zhiyuan Wang,Jichao Wu,Mingyang Yang,Yinghao Yu,Zeyu Zhang,Bohan Zhuang

Main category: cs.CV

TL;DR: Inferix是一个专为沉浸式世界合成设计的下一代推理引擎,通过优化半自回归解码过程,支持高效、可变长度和高质量的视频生成,适用于代理AI、具身AI和游戏等领域。

Details Motivation: Existing video diffusion models are limited in generating long, physically realistic, interactive, high-quality videos, while standard LLM inference systems target high-concurrency scenarios and are not suited to complex dynamic world simulation. An efficient inference engine designed around the characteristics of world models is therefore needed. Method: Inferix is an inference engine optimized for the semi-autoregressive (block-diffusion) decoding paradigm, which combines the strengths of diffusion and autoregressive methods by generating video tokens with diffusion inside each block while conditioning on previous blocks. It introduces LLM-style KV cache management for efficient, variable-length video generation, supports interactive video streaming and profiling, and integrates LV-Bench for fine-grained evaluation. Result: Inferix produces more coherent, stable, and high-quality minute-long videos, supports real-time interaction and dynamic world modeling, and outperforms conventional video diffusion models and general-purpose inference systems in generation efficiency and quality. Conclusion: Inferix provides a dedicated high-performance inference framework for world models, advances visual perception, understanding, and reasoning, and may help establish a new paradigm of vision foundation models after LLMs. Abstract: World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in blocks, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics.
Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.

[49] Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection?

Kun Guo,Yun Shen,Xijun Wang,Chaoqun You,Yun Rui,Tony Q. S. Quek

Main category: cs.CV

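TL;DR: This paper proposes LTED-Ada, an adaptive video object recognition framework based on deep reinforcement learning that switches dynamically between detection on the edge server and local tracking to optimize recognition accuracy, delay, and resource consumption, and performs well in both single-device and multi-device scenarios.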

Details Motivation: Fast and accurate video object recognition is challenging on resource-constrained devices such as traffic cameras, and existing hybrid approaches lack an effective strategy for scheduling detection and tracking. Method: Long-term optimization problems are formulated for the single-device and multi-device cases; the LTED-Ada algorithm uses deep reinforcement learning to choose adaptively between local tracking and edge detection, and federated learning enables collaborative policy training across devices. Result: Hardware-in-the-loop experiments show that LTED-Ada adapts to different frame rates and performance requirements and outperforms the compared methods in recognition accuracy and delay. Conclusion: LTED-Ada offers an efficient solution for video analytics in resource-constrained environments, and combining reinforcement learning with federated learning gives it good generalization and practicality. Abstract: Fast and accurate video object recognition, which relies on frame-by-frame video analytics, remains a challenge for resource-constrained devices such as traffic cameras. Recent advances in mobile edge computing have made it possible to offload computation-intensive object detection to edge servers equipped with high-accuracy neural networks, while lightweight and fast object tracking algorithms run locally on devices. This hybrid approach offers a promising solution but introduces a new challenge: deciding when to perform edge detection versus local tracking. To address this, we formulate two long-term optimization problems for both single-device and multi-device scenarios, taking into account the temporal correlation of consecutive frames and the dynamic conditions of mobile edge networks. Based on the formulation, we propose LTED-Ada for the single-device setting, a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection according to the frame rate as well as the recognition accuracy and delay requirements. In the multi-device setting, we further enhance LTED-Ada using federated learning to enable collaborative policy training across devices, thereby improving its generalization to unseen frame rates and performance requirements. Finally, we conduct extensive hardware-in-the-loop experiments using multiple Raspberry Pi 4B devices and a personal computer as the edge server, demonstrating the superiority of LTED-Ada.

[50] DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

Haibo HU,Lianming Huang,Nan Guan,Chun Jason Xue

Main category: cs.CV

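TL;DR: DeeAD is a training-free, action-guided early-exit framework that accelerates planning in Vision-Language Action (VLA) models by evaluating the physical feasibility of intermediate trajectories, achieving up to 28% transformer-layer sparsity and a 29% latency reduction while preserving planning quality and safety.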

Details Motivation: Existing VLA models incur significant inference latency because of their deep Transformer stacks, which limits real-time use in autonomous driving; an efficient acceleration method that needs no retraining is required. Method: The DeeAD framework uses lightweight planning priors (such as navigation or low-precision planning) to check whether an intermediate trajectory lies within an acceptable deviation (<2m) and decides on that basis whether to exit early; a multi-hop controller adaptively skips redundant layers to improve efficiency further. Result: Experiments on the Bench2Drive benchmark show up to 28% transformer-layer sparsity and a 29% latency reduction while planning performance and safety are preserved. Conclusion: DeeAD is a plug-and-play, training-free acceleration scheme for VLA models that balances inference speed and planning quality and suits autonomous driving systems with strict real-time requirements. Abstract: Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., Navigation or Low-precision Planning) within a tolerable deviation (<2m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.
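
The early-exit rule in this abstract (stop once an intermediate trajectory stays within 2 m of a planning prior) can be sketched as a simple loop. This is an illustration under assumptions, not the DeeAD implementation: the deviation is taken to be the mean point-wise Euclidean distance, the per-layer trajectories are supplied as a ready-made iterable instead of being decoded from a real VLA model, and the multi-hop layer skipping is omitted.

```python
import math


def mean_deviation(trajectory, prior):
    """Mean Euclidean distance in metres between matching waypoints (assumed metric)."""
    return sum(math.dist(p, q) for p, q in zip(trajectory, prior)) / len(prior)


def early_exit_plan(layer_trajectories, prior, tolerance_m=2.0):
    """Return (layer index, trajectory) at the first layer that matches the prior."""
    exit_layer, trajectory = -1, None
    for exit_layer, trajectory in enumerate(layer_trajectories):
        if mean_deviation(trajectory, prior) < tolerance_m:
            break  # feasible enough: skip the remaining layers
    return exit_layer, trajectory


# Lightweight planning prior: drive straight ahead (toy waypoints in metres).
prior = [(0.0, 0.0), (0.0, 5.0), (0.0, 10.0)]

# Hypothetical trajectories decoded after layers 0, 1, and 2.
layer_trajectories = [
    [(4.0, 0.0), (5.0, 5.0), (6.0, 10.0)],   # mean deviation 5.0 m
    [(1.0, 0.0), (1.5, 5.0), (2.0, 10.0)],   # mean deviation 1.5 m
    [(0.2, 0.0), (0.3, 5.0), (0.4, 10.0)],   # never evaluated in this run
]

exit_layer, plan = early_exit_plan(iter(layer_trajectories), prior)
print(exit_layer, plan)
```

If no layer meets the tolerance, the loop simply runs to the end and returns the final layer's trajectory, which matches running the full model.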

[51] Foundry: Distilling 3D Foundation Models for the Edge

Guillaume Letellier,Siddharth Srivastava,Frédéric Jurie,Gaurav Sharma

Main category: cs.CV

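TL;DR: Introduces Foundation Model Distillation (FMD), a new approach to compressing self-supervised foundation models, and Foundry, the first FMD framework for 3D point clouds, which substantially cuts compute cost while preserving the model's generality.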

Details Motivation: Large foundation models are hard to deploy on edge devices because of their size and compute cost, and existing compression methods sacrifice generality. Method: Proposes the FMD framework, which compresses a model by training a student to reconstruct the teacher's token-level representations; Foundry uses SuperTokens to capture a compact basis of the teacher's latent space. Result: A single distilled model transfers well to downstream tasks including classification, part segmentation, and few-shot learning, approaching the original foundation model's performance with far fewer tokens and FLOPs. Conclusion: FMD compresses large SSL models effectively, retaining their general-purpose representations while making deployment on resource-constrained devices more feasible. Abstract: Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient 'specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach, Foundry, trains a student to learn a compressed set of SuperTokens that reconstruct the teacher's token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks (classification, part segmentation, and few-shot scenarios), approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.

[52] DinoLizer: Learning from the Best for Generative Inpainting Localization

Minh Thong Doi,Jan Butora,Vincent Itier,Jérémie Boulanger,Patrick Bas

Main category: cs.CV

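TL;DR: DinoLizer is a DINOv2-based model for localizing manipulated regions in generative inpainting. It adds a classification head on the ViT patch embeddings and uses a sliding-window strategy for large images, and it clearly outperforms existing methods across several datasets and post-processing conditions.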

Details Motivation: Existing image manipulation localization methods fall short on generative inpainting, particularly in sensitivity to semantic changes and in robustness, so a more effective and more general method for pinpointing manipulated regions is needed. Method: Starts from a DINOv2 model pretrained on the B-Free dataset and adds a linear classification head on the ViT patch embeddings to predict manipulated regions at $14\times 14$ patch resolution; a sliding-window strategy handles large images, and the heatmaps are post-processed into binary manipulation masks. Result: DinoLizer surpasses state-of-the-art local manipulation detectors on several generative inpainting datasets, with a 12% higher average IoU, and stays robust after resizing, added noise, and JPEG compression; ablations comparing DINOv2 with DINOv3 support the design. Conclusion: DinoLizer exploits the representational power of pretrained DINOv2 to achieve higher accuracy and robustness in manipulation localization, showing the strong potential of Vision Transformers for this task. Abstract: We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. We add a linear classification head on top of the Vision Transformer's patch embeddings to predict manipulations at a $14\times 14$ patch resolution. The head is trained to focus on semantically altered regions, treating non-semantic edits as part of the original content. Because the ViT accepts only fixed-size inputs, we use a sliding-window strategy to aggregate predictions over larger images; the resulting heatmaps are post-processed to refine the estimated binary manipulation masks. Empirical results show that DinoLizer surpasses state-of-the-art local manipulation detectors on a range of inpainting datasets derived from different generative models. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. On average, DinoLizer achieves a 12% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing. Our experiments with off-the-shelf DINOv2 demonstrate the strong representational power of Vision Transformers for this task. Finally, extensive ablation studies comparing DINOv2 and its successor, DINOv3, in deepfake localization confirm DinoLizer's superiority. The code will be publicly available upon acceptance of the paper.
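
The sliding-window step can be illustrated with a small sketch. It assumes that overlapping window predictions are combined by plain averaging, which the abstract does not specify ("aggregate predictions"), and it stands in for the model with a hypothetical `predict(top, left)` callback that returns a window-sized grid of scores. The helper names `window_starts` and `aggregate_heatmap` are illustrative only.

```python
def window_starts(length, window, stride):
    """Start offsets so that windows of size `window` cover [0, length)."""
    if length < window:
        raise ValueError("image side must be at least the window size")
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)  # last window sits flush with the border
    return starts

def aggregate_heatmap(height, width, window, stride, predict):
    """Average per-pixel scores from overlapping windows (assumed rule).
    predict(top, left) must return a window x window list of scores."""
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top in window_starts(height, window, stride):
        for left in window_starts(width, window, stride):
            scores = predict(top, left)
            for dy in range(window):
                for dx in range(window):
                    total[top + dy][left + dx] += scores[dy][dx]
                    count[top + dy][left + dx] += 1
    return [[total[y][x] / count[y][x] for x in range(width)]
            for y in range(height)]
```

Placing the final window flush with the image border guarantees that every pixel is covered at least once, so the division by `count` is always defined.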

[53] CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design

Daeheon Jeong,Seoyeon Byun,Kihoon Son,Dae Hyun Kim,Juho Kim

Main category: cs.CV

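TL;DR: Introduces CANVAS, a new benchmark of 598 tasks for evaluating vision-language models (VLMs) on tool-based user interface design, measuring how well VLMs design iteratively through tool invocations in UI design software.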

Details Motivation: There has been no systematic evaluation of how well VLMs carry out UI design tasks in real design software, and no benchmark exists to support work in this direction. Method: Builds the CANVAS benchmark: 598 tasks with ground-truth references, sampled from 3.3K mobile UI designs across 30 function-based categories and covering two task types, design replication and design modification; VLMs are evaluated through context-based tool invocations in software such as Figma or Sketch. Result: Leading VLMs make more strategic tool invocations, which improves design quality, and the study identifies common error patterns. Conclusion: CANVAS is an effective benchmark for evaluating and improving VLM tool use in real design environments, revealing both the potential and the challenges of assisting designers in practice. Abstract: User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision language models (VLMs) with tool invocation suggest these models can operate design software to edit a UI design through iteration. Understanding and enhancing this capacity is important, as it highlights VLMs' potential to collaborate with designers within conventional software. However, as no existing benchmark evaluates tool-based design performance, the capacity remains unknown. To address this, we introduce CANVAS, a benchmark for VLMs on tool-based user interface design. Our benchmark contains 598 tool-based design tasks paired with ground-truth references sampled from 3.3K mobile UI designs across 30 function-based categories (e.g., onboarding, messaging). In each task, a VLM updates the design step-by-step through context-based tool invocations (e.g., create a rectangle as a button background), linked to design software. Specifically, CANVAS incorporates two task types: (i) design replication evaluates the ability to reproduce a whole UI screen; (ii) design modification evaluates the ability to modify a specific part of an existing screen. Results suggest that leading models exhibit more strategic tool invocations, improving design quality. Furthermore, we identify common error patterns models exhibit, guiding future work in enhancing tool-based design capabilities.

[54] Text-Guided Semantic Image Encoder

Raghuveer Thirukovalluru,Xiaochuang Han,Bhuwan Dhingra,Emily Dinan,Maha Elbayad

Main category: cs.CV

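TL;DR: Proposes TIE, a text-guided semantic image encoder that produces image representations conditioned on the input text query; it outperforms conventional encoders on multiple image-to-text tasks and improves inference efficiency and interpretability.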

Details Motivation: Conventional image encoders are pretrained independently of the downstream task and the text query, so they cannot be optimized for a specific task and do not focus on text-relevant information. Method: Proposes the Text-Guided Semantic Image Encoder (TIE), which makes the encoder's feature representations depend on the input text query. Result: At the 1B and 3B scales, TIE improves the average over nine image-to-text benchmarks by +1.5 and +1.3 points, with gains of up to 6 points on tasks such as DocVQA and InfoVQA; it achieves better performance with only half as many image tiles, which notably improves inference efficiency, and it generalizes well to generic queries. Conclusion: Text-conditioned training optimizes the image encoder to attend to query-relevant visual regions, improving performance, efficiency, interpretability, and task adaptability. Abstract: Image encoders, a fundamental component of vision-language models (VLMs), are typically pretrained independently before being aligned with a language model. This standard paradigm results in encoders that process images agnostically, without regard to the specific downstream task or text query. To address this limitation, we propose the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on the input text query. VLMs equipped with TIE outperform their conventional counterparts by +1.5 and +1.3 points on average across nine image-to-text benchmarks at the 1B and 3B scales, respectively, with gains reaching up to 6 points on tasks such as DocVQA and InfoVQA. Moreover, TIE-based VLMs attain superior performance while utilizing only half as many image tiles (tokens), resulting in notably improved inference efficiency. TIE also generalizes well with generic queries, indicating that text-conditioned training effectively optimizes the encoder to capture key visual features. Qualitative analysis confirms that TIE consistently attends to query-relevant regions, enhancing both interpretability and query-specific grounding.

[55] One Patch is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues

Sindhuja Penchala,Gavin Money,Gabriel Marques,Samuel Wood,Jessica Kirschman,Travis Atkison,Shahram Rahimi,Noorbakhsh Amiri Golilarz

Main category: cs.CV

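TL;DR: Proposes SMARC, a model that reconstructs and classifies surface materials from a single contiguous patch covering only 10% of the image. It combines a partial-convolution U-Net with a classification head to perform spatial inpainting and semantic understanding under extremely sparse visual input, and reports state-of-the-art results on a real-world texture dataset.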

Details Motivation: Existing methods rely on dense or full-scene observations and struggle with surface material understanding under partial views or in constrained environments, which limits their use in robotics, simulation, and material perception. Method: Proposes SMARC, a partial-convolution U-Net with a classification head that reconstructs the full RGB surface from a single 10% contiguous image patch while classifying the material, strengthening spatial reasoning over missing data. Result: On the real-world Touch and Go texture dataset, SMARC reaches a PSNR of 17.55 dB and 85.10% material classification accuracy, outperforming five models: convolutional autoencoders, ViT, MAE, Swin Transformer, and DETR. Conclusion: Partial convolution shows superior spatial reasoning under extremely sparse observations, and SMARC provides an effective solution for surface understanding from minimal visual input. Abstract: Understanding material surfaces from sparse visual cues is critical for applications in robotics, simulation, and material perception. However, most existing methods rely on dense or full-scene observations, limiting their effectiveness in constrained or partial-view environments. To address this challenge, we introduce SMARC, a unified model for Surface MAterial Reconstruction and Classification from minimal visual input. Given only a single 10% contiguous patch of the image, SMARC recognizes and reconstructs the full RGB surface while simultaneously classifying the material category. Our architecture combines a Partial Convolutional U-Net with a classification head, enabling both spatial inpainting and semantic understanding under extreme observation sparsity. We compared SMARC against five models including convolutional autoencoders [17], Vision Transformer (ViT) [13], Masked Autoencoder (MAE) [5], Swin Transformer [9], and DETR [2] using the Touch and Go dataset [16] of real-world surface textures. SMARC achieves state-of-the-art results with a PSNR of 17.55 dB and a material classification accuracy of 85.10%. Our findings highlight the advantages of partial convolution in spatial reasoning under missing data and establish a strong foundation for minimal-vision surface understanding.
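
To put the reported 17.55 dB in context, the standard PSNR definition, 10·log10(peak² / MSE), can be computed as below. This is a generic sketch: the abstract does not say how the authors compute or average PSNR, and the function names and the default 8-bit peak of 255 are assumptions.

```python
import math

def psnr(reference, reconstruction, peak=255.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)

def rmse_fraction(psnr_db):
    """RMSE as a fraction of the peak value implied by a PSNR figure."""
    return 10.0 ** (-psnr_db / 20.0)

implied = rmse_fraction(17.55)
```

Under this definition a PSNR of 17.55 dB corresponds to a root-mean-square error of roughly 13% of the peak value, which is plausible for reconstructing 90% of an image from a 10% patch.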

[56] LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Zuhao Yang,Sudong Wang,Kaichen Zhang,Keming Wu,Sicong Leng,Yifan Zhang,Chengwei Qin,Shijian Lu,Xingxuan Li,Lidong Bing

Main category: cs.CV

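TL;DR: Proposes LongVT, an end-to-end agentic framework that enables "thinking" over long videos via an interleaved Multimodal Chain-of-Tool-Thought, mitigating the hallucinations that large models produce when evidence in long videos is sparse.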

Details Motivation: Existing large multimodal models tend to hallucinate when reasoning over long videos, especially when visual evidence is sparse and temporally dispersed, which makes reliable reasoning difficult. Method: Inspired by how humans first skim a long video globally and then focus on details, LongVT uses the LMM's temporal grounding ability as a native video cropping tool for iterative global-to-local reasoning, interleaving Multimodal Chain-of-Tool-Thought steps that zoom in on video clips and resample frames. Result: LongVT consistently outperforms strong baselines on four challenging long-video understanding and reasoning benchmarks; the authors also build the VideoSIAH data suite, with 247.9K training samples for cold-start supervised fine-tuning and 1,280 curated evaluation QA pairs. Conclusion: By imitating human viewing strategies and using a three-stage training scheme, LongVT improves reasoning accuracy and robustness on long-video understanding and offers an effective remedy for hallucination in long videos. Abstract: Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation.
With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .

[57] Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models

Souradeep Dutta,Keshav Bulia,Neena S Nair

Main category: cs.CV

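TL;DR: Presents a lightweight reproduction of the KRISP model with far fewer parameters, suited to resource-constrained settings; it supports offline visual reasoning on edge devices while avoiding AI hallucinations.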

Details Motivation: The original KRISP model is computationally expensive and depends on a large backbone, making it hard to deploy in resource-constrained settings; this work examines its feasibility and problems in a lightweight configuration. Method: Revisits KRISP through systematic ablation studies, builds a lightweight version with fewer parameters, validates it on synthetic VQA data and the DAQUAR dataset, and restricts the external knowledge graph domain to bound the outputs. Result: The reproduction reaches about 75% of the original model's performance, exposes several design flaws and practical pitfalls in the original, suppresses hallucinations within the knowledge domain, and runs on edge devices. Conclusion: Lightweight knowledge-enhanced VQA architectures remain effective under resource constraints, and the work offers a workable path and practical insights toward reliable vision-language reasoning on the edge. Abstract: Facebook AI Research introduced KRISP [4], which integrates structured external knowledge into pipelines for vision-language reasoning. Despite its effectiveness, the original model was developed for industrial-scale training, is computationally demanding, and is tightly coupled to a large backbone. In this work, we reexamine KRISP from a different angle and offer a lightweight reproduction with significantly fewer parameters. Even though our replicated model reaches only about 75% of the original's performance, the replication process uncovers a number of design flaws, real-world pitfalls, and implicit problems that were not fully covered in the original paper. We offer insights into the scalability and efficacy of knowledge-enhanced VQA architectures under resource constraints through systematic ablation studies, which include a proof-of-concept on synthetic VQA data and evaluation on the DAQUAR dataset. Our model, configured with a low parameter setup and constrained by the external knowledge graph domain, prevents AI hallucinations and generates outputs solely within that domain. Minimal parameters allow us to function on edge devices like smartphones and AR/VR devices, further improving offline visual reasoning.

[58] Intriguing Properties of Dynamic Sampling Networks

Dario Morle,Reid Zaffino

Main category: cs.CV

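TL;DR: Proposes a new operator called "warping" that unifies the dynamic sampling methods used in deep learning and analyzes it theoretically, revealing an asymmetry between the forward and backward passes and a fundamental difference from conventional convolution operators, and examining the conditions for stable training of dynamic sampling networks as well as discretization effects.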

Details Motivation: Dynamic sampling mechanisms perform well in many computer vision models but lack a unified framework for theoretical analysis; a general form that subsumes existing methods is needed to establish a unified view and understand their behavior. Method: Proposes the "warping" operator as a general form of dynamic sampling, analyzes it statistically by modeling the inputs as IID variables and as homogeneous random fields, and introduces a new loss landscape visualization based on gradient updates to study learning behavior. Result: Shows that warping can reconstruct deformable convolutions, active convolutional units, and spatial transformer networks; identifies a distinctive asymmetry between the forward and backward passes; shows that these operators form a class distinct from, and orthogonal to, conventional translation-invariant convolutions; gives conditions that ensure stable training; and analyzes the statistical effects of discretization. Conclusion: Dynamic sampling mechanisms are a new class of operators, and warping offers a concise, generalizable theoretical framework for analyzing such models, which should help in designing more stable and efficient dynamic network architectures. Abstract: Dynamic sampling mechanisms in deep learning architectures have demonstrated utility across many computer vision models, though the theoretical analysis of these structures has not yet been unified. In this paper we connect the various dynamic sampling methods by developing and analyzing a novel operator which generalizes existing methods, which we term "warping". Warping provides a minimal implementation of dynamic sampling which is amenable to analysis, and can be used to reconstruct existing architectures including deformable convolutions, active convolutional units, and spatial transformer networks. Using our formalism, we provide statistical analysis of the operator by modeling the inputs as both IID variables and homogeneous random fields. Extending this analysis, we discover a unique asymmetry between the forward and backward pass of the model training. We demonstrate that these mechanisms represent an entirely different class of orthogonal operators to the traditional translationally invariant operators defined by convolutions. With a combination of theoretical analysis and empirical investigation, we find the conditions necessary to ensure stable training of dynamic sampling networks. In addition, the statistical effects of discretization are studied. Finally, we introduce a novel loss landscape visualization which utilizes gradient update information directly, to better understand learning behavior.

[59] $Δ$-NeRF: Incremental Refinement of Neural Radiance Fields through Residual Control and Knowledge Transfer

Kriti Ghosh,Devjyoti Chakraborty,Lakshmish Ramaswamy,Suchendra M. Bhandarkar,In Kee Kim,Nancy O'Hare,Deepak Mishra

Main category: cs.CV

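TL;DR: Proposes Δ-NeRF, a modular residual framework for incremental NeRF refinement in settings where data arrives as a stream (such as satellite remote sensing). Using a residual controller, an uncertainty-aware gating mechanism, and a view selection strategy, it updates the model efficiently without retraining or storing past data, and it uses knowledge distillation to compress the model, improving training efficiency and performance.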

Details Motivation: Existing NeRF methods must be retrained when views are added, which does not suit settings where data keeps arriving (such as satellite Earth observation), and they are prone to catastrophic forgetting; a method is needed that supports incremental learning, avoids retraining, and retains past knowledge. Method: Δ-NeRF (1) introduces a residual controller that injects per-layer corrections into a frozen base NeRF; (2) uses an uncertainty-aware gating mechanism that adaptively combines base and refined predictions to prevent overcorrection; (3) applies a view selection strategy to reduce the amount of training data; and (4) uses knowledge distillation to compress the enhanced model into a student network 20% of the original size. Result: On satellite imagery, Δ-NeRF matches joint training while cutting training time by 30-42%; it improves PSNR by up to 43.5% over naive fine-tuning and beats joint training on some metrics; view selection removes up to 47% of the training data without hurting performance. Conclusion: Δ-NeRF achieves efficient incremental NeRF refinement, addresses catastrophic forgetting, and balances performance, efficiency, and model size, making it well suited to long-term, continuous-observation applications such as satellite terrain analysis. Abstract: Neural Radiance Fields (NeRFs) have demonstrated remarkable capabilities in 3D reconstruction and novel view synthesis. However, most existing NeRF frameworks require complete retraining when new views are introduced incrementally, limiting their applicability in domains where data arrives sequentially. This limitation is particularly problematic in satellite-based terrain analysis, where regions are repeatedly observed over time. Incremental refinement of NeRFs remains underexplored, and naive approaches suffer from catastrophic forgetting when past data is unavailable. We propose $Δ$-NeRF, a unique modular residual framework for incremental NeRF refinement. $Δ$-NeRF introduces several novel techniques including: (1) a residual controller that injects per-layer corrections into a frozen base NeRF, enabling refinement without access to past data; (2) an uncertainty-aware gating mechanism that prevents overcorrection by adaptively combining base and refined predictions; and (3) a view selection strategy that reduces training data by up to 47% while maintaining performance. Additionally, we employ knowledge distillation to compress the enhanced model into a compact student network (20% of original size). Experiments on satellite imagery demonstrate that $Δ$-NeRF achieves performance comparable to joint training while reducing training time by 30-42%. $Δ$-NeRF consistently outperforms existing baselines, achieving an improvement of up to 43.5% in PSNR over naive fine-tuning and surpassing joint training on some metrics.
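
One way to picture the uncertainty-aware gate is as a convex blend of the frozen base prediction and the refined prediction, with the blend weight driven by an uncertainty estimate. The sketch below is purely illustrative: the sigmoid gate, its `sharpness` and `threshold` parameters, and the function names are hypothetical, since the abstract states only that base and refined predictions are combined adaptively to prevent overcorrection.

```python
import math

def gate_from_uncertainty(uncertainty, sharpness=4.0, threshold=0.5):
    """Hypothetical gate in (0, 1): close to 1 when the refined branch is
    confident (low uncertainty), close to 0 when it is uncertain."""
    return 1.0 / (1.0 + math.exp(sharpness * (uncertainty - threshold)))

def gated_blend(base_rgb, refined_rgb, uncertainty):
    """Convex combination of base and refined colours (assumed form)."""
    g = gate_from_uncertainty(uncertainty)
    return tuple((1.0 - g) * b + g * r for b, r in zip(base_rgb, refined_rgb))

base = (0.0, 0.0, 0.0)
refined = (1.0, 1.0, 1.0)
halfway = gated_blend(base, refined, uncertainty=0.5)     # gate is exactly 0.5
cautious = gated_blend(base, refined, uncertainty=10.0)   # falls back to the base
```

With a blend of this shape, a highly uncertain correction leaves the base prediction almost untouched, which is the overcorrection guard the abstract describes.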

[60] Layer-Aware Video Composition via Split-then-Merge

Ozgur Kara,Yujia Chen,Ming-Hsuan Yang,James M. Rehg,Wen-Sheng Chu,Du Tran

Main category: cs.CV

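TL;DR: Proposes the Split-then-Merge (StM) framework, which improves control in generative video composition and makes better use of data by decomposing and recomposing unlabeled videos.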

Details Motivation: Addresses generative video composition's reliance on annotated data or handcrafted rules, and its data scarcity. Method: Splits a large corpus of unlabeled videos into dynamic foreground and background layers and self-composes them for learning; introduces a transformation-aware training pipeline, multi-layer fusion and augmentation, and an identity-preservation loss. Result: Outperforms state-of-the-art methods on quantitative benchmarks and in human and VLLM-based qualitative evaluations. Conclusion: StM learns complex compositional dynamics effectively, yielding more realistic video generation with better controllability. Abstract: We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods in both quantitative benchmarks and in human- and VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io

[61] SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Md Tanvirul Alam,Saksham Aggarwal,Justin Yang Chae,Nidhi Rastogi

Main category: cs.CV

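TL;DR: Sphinx is a synthetic environment for visual perception and reasoning with 25 task types. Evaluation shows that current state-of-the-art large models fall well below human performance, while reinforcement learning with verifiable rewards improves results substantially.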

Details Motivation: Advancing visual and multimodal reasoning requires a controllable, systematic benchmark environment with verifiable ground-truth solutions. Method: Proposes the Sphinx environment, which procedurally generates puzzles from a variety of visual elements, and applies reinforcement learning with verifiable rewards (RLVR) to improve model performance. Result: State-of-the-art LVLMs such as GPT-5 reach only 51.1% accuracy on Sphinx, well below human performance; RLVR substantially improves results on this benchmark and on external benchmarks. Conclusion: Sphinx provides a scalable, measurable platform for visual reasoning, and RLVR is a promising way to improve the reasoning ability of multimodal models. Abstract: We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.

[62] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

Samuele Dell'Erba,Andrew D. Bagdanov

Main category: cs.CV

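TL;DR: Proposes OVI, a training-free visual inversion method that replaces the expensive text-to-image prior network in diffusion models, and introduces two new constraints to improve image quality; experiments show it is comparable to the best existing methods on several metrics.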

Details Motivation: Existing text-to-image diffusion models depend on prior networks that are computationally expensive and need large amounts of training data; this work asks whether such trained priors can be avoided entirely. Method: Uses Optimization-based Visual Inversion (OVI), which initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize cosine similarity with the text prompt embedding; two new constraints, a Mahalanobis-based regularizer and a Nearest-Neighbor loss, guide the optimization. Result: Experiments on Kandinsky 2.2 show that using the text embedding alone as the prior scores deceptively high on T2I-CompBench++, whereas OVI with the Nearest-Neighbor constraint clearly improves visual fidelity, with quantitative scores matching or exceeding the state-of-the-art data-efficient prior. Conclusion: OVI is a viable training-free, data-free alternative to a trained prior with competitive performance; the study also exposes problems in current evaluation benchmarks and suggests the direction merits further work. Abstract: Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with the input text prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
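
The core optimization loop, "start from random values and iteratively maximize cosine similarity with the text embedding", can be sketched with plain gradient ascent. This toy version works directly on small vectors and is only an assumption-laden illustration: the real method optimizes pseudo-tokens through an encoder, and the Mahalanobis and Nearest-Neighbor constraints are omitted. The names `cosine_grad` and `optimize_latent`, the step size, and the step count are all hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_grad(v, t):
    """Gradient of cosine(v, t) with respect to v."""
    nv = math.sqrt(sum(a * a for a in v))
    nt = math.sqrt(sum(b * b for b in t))
    c = cosine(v, t)
    return [tb / (nv * nt) - c * va / (nv * nv) for va, tb in zip(v, t)]

def optimize_latent(text_emb, init=None, steps=300, lr=0.1, seed=0):
    """Gradient ascent on cosine similarity, starting from `init` or from
    random Gaussian values (standing in for random pseudo-tokens)."""
    rng = random.Random(seed)
    v = list(init) if init is not None else [rng.gauss(0.0, 1.0) for _ in text_emb]
    for _ in range(steps):
        grad = cosine_grad(v, text_emb)
        v = [a + lr * g for a, g in zip(v, grad)]
    return v

text = [0.0, 1.0, 0.0]
start = [1.0, 0.2, 0.0]
latent = optimize_latent(text, init=start)
```

Maximizing similarity alone drives the latent toward the text embedding itself, which is exactly the baseline the abstract reports as scoring well on benchmarks despite lower perceptual quality; the paper's two constraints exist to pull the solution toward realistic image embeddings instead.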

[63] RefTr: Recurrent Refinement of Confluent Trajectories for 3D Vascular Tree Centerline Graphs

Roman Naeem,David Hagerman,Jennifer Alvén,Fredrik Kahl

Main category: cs.CV

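TL;DR: RefTr is a Transformer-based 3D image-to-graph model that generates vascular tree centerlines by recurrently refining confluent trajectories, with high recall, competitive precision, and fewer parameters.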

Details Motivation: Accurately detecting the centerlines of tubular trees (such as blood vessels) with correct topology is critical for clinical diagnosis and surgical navigation, and high recall is especially important because missed small branches can cause serious errors. Method: Proposes RefTr, a Producer-Refiner architecture built on a Transformer decoder: the Producer generates initial confluent trajectories and the Refiner refines them over multiple iterations; an efficient non-maximum suppression algorithm merges duplicate branches. Result: Across multiple public datasets, RefTr achieves higher recall than existing methods with comparable precision, faster inference, and 2.4x fewer decoder parameters. Conclusion: RefTr improves centerline detection while preserving a correct tree topology, and has the potential to become a new state-of-the-art framework for vascular analysis in 3D medical imaging. Abstract: Tubular trees, such as blood vessels and lung airways, are essential for material transport within the human body. Accurately detecting their centerlines with correct tree topology is critical for clinical tasks such as diagnosis, treatment planning, and surgical navigation. In these applications, maintaining high recall is crucial, as missing small branches can result in fatal mistakes caused by incomplete assessments or undetected abnormalities. We present RefTr, a 3D image-to-graph model for centerline generation of vascular trees via recurrent refinement of confluent trajectories. RefTr uses a Producer-Refiner architecture based on a Transformer decoder, where the Producer proposes a set of initial confluent trajectories that are recurrently refined by the Refiner to produce final trajectories, which form the centerline graph. The confluent trajectory representation enables refinement of complete trajectories while explicitly enforcing a valid tree topology. The recurrent refinement scheme improves precision and reuses the same Refiner block across multiple steps, yielding a 2.4x reduction in decoder parameters compared to previous SOTA. We also introduce an efficient non-maximum suppression algorithm for spatial tree graphs to merge duplicate branches and boost precision. Across multiple public centerline datasets, RefTr achieves superior recall and comparable precision to previous SOTA, while offering faster inference and substantially fewer parameters, demonstrating its potential as a new state-of-the-art framework for vascular tree analysis in 3D medical imaging.

[64] MODEST: Multi-Optics Depth-of-Field Stereo Dataset

Nisarg K. Trivedi,Vinayak A. Belludi,Li-Yun Wang,Pardis Taghavi,Dante Lok

Main category: cs.CV

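TL;DR: Presents the first high-resolution stereo DSLR dataset, with 18,000 images that systematically vary focal length and aperture, to improve how well depth estimation, shallow depth-of-field rendering, and related tasks generalize under real optical conditions.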

Details Motivation: Depth estimation research is held back by the lack of large-scale, high-fidelity real stereo DSLR datasets, so models generalize poorly to real scenes; models trained on synthetic data in particular struggle with real optical complexity. Method: Captures 9 complex scenes at every combination of 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22) with two identical DSLR camera assemblies, covering 50 optical configurations with 2000 images per scene and 18000 in total. Each focal configuration has its own calibration image set, and the scenes include challenging elements such as reflections, transparent objects, and optical illusions. Result: The dataset supports evaluation of monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D reconstruction, and novel view synthesis, and exposes the limits of current state-of-the-art methods under real optical conditions; the dataset, calibration files, and evaluation code are provided. Conclusion: The dataset narrows the realism gap between synthetic data and real camera optics and is a valuable resource for improving real-world generalization in depth estimation and related vision tasks. Abstract: Reliable depth estimation under real optical conditions remains a core challenge for camera vision in systems such as autonomous robotics and augmented reality. Despite recent progress in depth estimation and depth-of-field rendering, research remains constrained by the lack of large-scale, high-fidelity, real stereo DSLR datasets, limiting real-world generalization and evaluation of models trained on synthetic data, as shown extensively in the literature. We present the first high-resolution (5472$\times$3648px) stereo DSLR dataset with 18000 images, systematically varying focal length and aperture across complex real scenes and capturing the optical realism and complexity of professional camera systems. For 9 scenes with varying scene complexity, lighting and background, images are captured with two identical camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), spanning 50 optical configurations in 2000 images per scene. This full-range optics coverage enables controlled analysis of geometric and optical effects for monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D scene reconstruction and novel view synthesis. Each focal configuration has a dedicated calibration image set, supporting evaluation of classical and learning based methods for intrinsic and extrinsic calibration. The dataset features challenging visual elements such as multi-scale optical illusions, reflective surfaces, mirrors, transparent glass walls, fine-grained details, and natural / artificial ambient light variations.
This work attempts to bridge the realism gap between synthetic training data and real camera optics, and demonstrates challenges with the current state-of-the-art monocular, stereo depth and depth-of-field methods. We release the dataset, calibration files, and evaluation code to support reproducible research on real-world optical generalization.

[65] Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

Sree Bhattacharyya,Yaman Kumar Singla,Sudhir Yarram,Somesh Kumar Singh,Harini S,James Z. Wang

Main category: cs.CV

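TL;DR: Introduces the first large-scale unsupervised dataset for visual content memorability, with more than 82,000 videos and their recall descriptions. It draws on tip-of-the-tongue (ToT) retrieval queries from platforms such as Reddit to capture fine-grained memorability signals in open-ended recall. Vision-language models trained on the dataset outperform existing methods at generating memorability descriptions and at multimodal ToT retrieval.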

Details Motivation: Existing visual memorability datasets depend on human annotation, which is costly and limits scale, and they provide only aggregate memorability scores without modeling the fine-grained signals in open-ended recall; a scalable, unsupervised way to capture richer memorability features is needed. Method: Collects tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit as an unsupervised signal and builds a large-scale dataset of more than 82,000 videos with matching recall descriptions; fine-tunes large vision-language models on it to generate open-ended recall descriptions and uses contrastive training to build the first model that supports multimodal ToT retrieval. Result: Models trained on the dataset outperform state-of-the-art models such as GPT-4o at open-ended recall generation and enable multimodal ToT retrieval for the first time, confirming the dataset's usefulness and richness for memorability-related tasks. Conclusion: The unsupervised dataset and models open a new direction for visual content memorability research, improve the modeling of complex memorability signals, and scale well. Abstract: Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that our unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. We also employ a contrastive training strategy to create the first model capable of performing multimodal ToT retrieval. Our dataset and models present a novel direction, facilitating progress in visual content memorability research.

[66] Estimating Fog Parameters from a Sequence of Stereo Images

Yining Ding,João F. C. Mota,Andrew M. Wallace,Sen Wang

Main category: cs.CV

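TL;DR: Proposes a method that dynamically estimates fog parameters from a sequence of stereo foggy images, using joint optimization to avoid the error accumulation of earlier methods, and builds SDIRF, a first-of-its-kind real-fog driving dataset, for validation.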

Details Motivation: Most existing fog parameter estimation methods estimate parameters sequentially, which propagates errors, and assume globally homogeneous fog, which limits them in real, inhomogeneous fog; a more robust, dynamic solution suited to practical vision systems is needed. Method: Proposes a joint optimization algorithm that estimates all fog parameters simultaneously from a sequence of stereo foggy images; it assumes fog is only locally homogeneous so as to handle globally inhomogeneous real fog, and it can be added as a plug-in module to existing visual SLAM or odometry systems. Result: Outperforms prior methods on both synthetic data and the real SDIRF dataset, with more accurate estimates and better adaptation to real fog; releases SDIRF (over 40 minutes, 34k frames), which includes the camera's photometric calibration parameters and counterpart clear-weather data. Conclusion: The method improves the accuracy and robustness of parameter estimation for vision systems in fog, and the released dataset is an important resource for research on visual perception in fog, supporting autonomous driving in adverse weather. Abstract: We propose a method which, given a sequence of stereo foggy images, estimates the parameters of a fog model and updates them dynamically. In contrast with previous approaches, which estimate the parameters sequentially and thus are prone to error propagation, our algorithm estimates all the parameters simultaneously by solving a novel optimisation problem. By assuming that fog is only locally homogeneous, our method effectively handles real-world fog, which is often globally inhomogeneous. The proposed algorithm can be easily used as an add-on module in existing visual Simultaneous Localisation and Mapping (SLAM) or odometry systems in the presence of fog. In order to assess our method, we also created a new dataset, the Stereo Driving In Real Fog (SDIRF), consisting of high-quality, consecutive stereo frames of real, foggy road scenes under a variety of visibility conditions, totalling over 40 minutes and 34k frames. As a first-of-its-kind, SDIRF contains the camera's photometric parameters calibrated in a lab environment, which is a prerequisite for correctly applying the atmospheric scattering model to foggy images. The dataset also includes the counterpart clear data of the same routes recorded in overcast weather, which is useful for companion work in image defogging and depth reconstruction. We conducted extensive experiments using both synthetic foggy data and real foggy sequences from SDIRF to demonstrate the superiority of the proposed algorithm over prior methods. Our method not only produces the most accurate estimates on synthetic data, but also adapts better to real fog.
We make our code and SDIRF publicly available\footnote{https://github.com/SenseRoboticsLab/estimating-fog-parameters} to the community with the aim of advancing the research on visual perception in fog.
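The abstract refers to the atmospheric scattering model without writing it out. The sketch below assumes the commonly used single-scattering (Koschmieder) form, I = J·exp(-β·d) + A·(1 - exp(-β·d)), and shows how a scattering coefficient β could be recovered for one pixel when the clear intensity, airlight and depth are known. The function names are hypothetical, and this is not the paper's joint optimisation, only an illustration of the model it builds on.

```python
import math


def foggy_intensity(clear, airlight, beta, depth):
    """Apply the assumed scattering model to one pixel intensity."""
    transmission = math.exp(-beta * depth)
    return clear * transmission + airlight * (1.0 - transmission)


def estimate_beta(observed, clear, airlight, depth):
    """Invert the model for beta, given the other quantities for one pixel."""
    transmission = (observed - airlight) / (clear - airlight)
    return -math.log(transmission) / depth


observed = foggy_intensity(clear=0.2, airlight=0.9, beta=0.05, depth=20.0)
recovered = estimate_beta(observed, clear=0.2, airlight=0.9, depth=20.0)
```

In practice the clear intensity and airlight are unknown as well, which is presumably why the paper estimates all parameters jointly rather than one at a time.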

[67] V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

Jiancheng Pan,Runze Wang,Tianwen Qian,Mohammad Mahdi,Yanwei Fu,Xiangyang Xue,Xiaomeng Huang,Luc Van Gool,Danda Pani Paudel,Yuqian Fu

Main category: cs.CV

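TL;DR: Proposes V^2-SAM, a unified cross-view object correspondence framework that extends SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators, achieving state-of-the-art performance on multiple benchmarks.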

Details Motivation: Because of drastic viewpoint and appearance variations, existing segmentation models such as SAM2 are hard to apply directly to cross-view object correspondence. Method: Proposes the V^2-SAM framework, which comprises a Cross-View Anchor Prompt Generator (V^2-Anchor) built on DINOv3 features and a Cross-View Visual Prompt Generator (V^2-Visual) that enhances appearance-guided cues, together with a multi-expert design and a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert. Result: V^2-SAM is validated on several datasets, including Ego-Exo4D, DAVIS-2017 and HANDAL-X, and achieves new state-of-the-art performance. Conclusion: V^2-SAM extends SAM2 to cross-view scenarios, enables coordinate-based prompting there for the first time, and improves cross-view correspondence accuracy by fusing geometric and appearance cues and selecting experts by cyclic consistency. Abstract: Cross-view object correspondence, exemplified by the representative task of ego-exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V^2-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V^2-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V^2-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego-exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V^2-SAM, achieving new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).
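The abstract does not say how PCCS scores cyclic consistency. One plausible reading, sketched below under that assumption, maps a source mask to the other view and back with each expert and keeps the expert whose round trip best reproduces the original mask (measured by IoU). The expert callables, the mask-as-set-of-pixel-indices representation and all names are hypothetical stand-ins, not the paper's implementation.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of pixel indices."""
    a, b = set(mask_a), set(mask_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_expert(source_mask, experts):
    """Pick the expert whose source -> target -> source round trip is most consistent.

    `experts` maps a name to a (forward, backward) pair of callables.
    """
    best_name, best_score = None, -1.0
    for name, (forward, backward) in experts.items():
        round_trip = backward(forward(source_mask))
        score = iou(source_mask, round_trip)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Toy experts: one cycles back perfectly, the other drifts by one pixel.
toy_experts = {
    "anchor": (lambda m: {i + 10 for i in m}, lambda m: {i - 10 for i in m}),
    "visual": (lambda m: {i + 10 for i in m}, lambda m: {i - 9 for i in m}),
}
chosen, score = select_expert({1, 2, 3}, toy_experts)
```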

[68] Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation

Taehoon Kim,Henry Gouk,Timothy Hospedales

Main category: cs.CV

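TL;DR: Proposes Null-Text Test-Time Alignment (Null-TTA), which aligns diffusion models at test time by optimising the unconditional text embedding, avoiding reward hacking while improving alignment with the target reward.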

Details Motivation: Existing test-time alignment methods tend to either under-optimise or over-optimise (reward hack) the target reward, and lack an effective mechanism for aligning on a semantically coherent manifold. Method: Optimises the unconditional text embedding in classifier-free guidance, rather than latent or noise variables, exploiting the structured semantic nature of the text embedding space so that alignment takes place in semantic space. Result: Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation. Conclusion: Semantic-space optimisation is an effective and principled new paradigm for test-time alignment. Abstract: Test-time alignment (TTA) aims to adapt models to specific rewards during inference. However, existing methods tend to either under-optimise or over-optimise (reward hack) the target reward function. We propose Null-Text Test-Time Alignment (Null-TTA), which aligns diffusion models by optimising the unconditional embedding in classifier-free guidance, rather than manipulating latent or noise variables. Due to the structured semantic nature of the text embedding space, this ensures alignment occurs on a semantically coherent manifold and prevents reward hacking (exploiting non-semantic noise patterns to improve the reward). Since the unconditional embedding in classifier-free guidance serves as the anchor for the model's generative distribution, Null-TTA directly steers the model's generative distribution towards the target reward rather than just adjusting the samples, even without updating model parameters. Thanks to these desirable properties, we show that Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation. This establishes semantic-space optimisation as an effective and principled novel paradigm for TTA.
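For readers unfamiliar with why the unconditional branch acts as an "anchor", the sketch below writes out the standard classifier-free guidance combination of the two noise predictions. The function name and the plain-list representation are illustrative assumptions; the reward-driven optimisation of the null embedding that Null-TTA performs is not shown.

```python
def cfg_noise(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: start from the unconditional
    prediction and move along the (conditional - unconditional) direction."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# With scale 1 the result is the conditional prediction; larger scales
# extrapolate away from the unconditional (null-text) prediction.
plain = cfg_noise([0.0, 1.0], [1.0, 1.0], 1.0)
guided = cfg_noise([0.0, 1.0], [1.0, 1.0], 2.0)
```

Because every guided prediction is expressed relative to the unconditional one, changing the null-text embedding shifts the whole guided prediction, which is consistent with the abstract's description of it as the anchor of the generative distribution.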

[69] GaINeR: Geometry-Aware Implicit Network Representation

Weronika Jakubowska,Mikołaj Zieliński,Rafał Tobiasz,Krzysztof Byrski,Maciej Zięba,Dominik Belter,Przemysław Spurek

Main category: cs.CV

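TL;DR: Proposes GaINeR, a new geometry-aware implicit neural representation that combines trainable Gaussian distributions with a neural network to provide continuous representation of 2D images, interpretable geometric structure, and flexible local editing.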

Details Motivation: Traditional implicit neural representations (INRs) lack explicit geometric structure, which makes local editing and physical simulation difficult and limits their use in dynamic or interactive settings. Method: GaINeR combines trainable Gaussian distributions with a neural network-based INR; for a given image coordinate it retrieves the K nearest Gaussians, aggregates distance-weighted embeddings, and predicts the RGB value with a neural network. Result: The method achieves high-quality image reconstruction with explicit geometric structure, and supports flexible local editing and physically aware interaction. Conclusion: GaINeR keeps the high-fidelity representation capability of INRs while adding geometric structure and editability, broadening their potential for interactive and physically aware applications. Abstract: Implicit Neural Representations (INRs) have become an essential tool for modeling continuous 2D images, enabling high-fidelity reconstruction, super-resolution, and compression. Popular architectures such as SIREN, WIRE, and FINER demonstrate the potential of INR for capturing fine-grained image details. However, traditional INRs often lack explicit geometric structure and have limited capabilities for local editing or integration with physical simulation, restricting their applicability in dynamic or interactive settings. To address these limitations, we propose GaINeR: Geometry-Aware Implicit Network Representation, a novel framework for 2D images that combines trainable Gaussian distributions with a neural network-based INR. For a given image coordinate, the model retrieves the K nearest Gaussians, aggregates distance-weighted embeddings, and predicts the RGB value via a neural network. This design enables continuous image representation, interpretable geometric structure, and flexible local editing, providing a foundation for physically aware and interactive image manipulation. The official implementation of our method is publicly available at https://github.com/WJakubowska/GaINeR.
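The retrieval-and-aggregation step described in the abstract can be sketched as follows. This is a minimal illustration, assuming an isotropic Gaussian kernel with a hypothetical bandwidth `sigma` as the distance weighting; the actual weighting, the trainable Gaussian parameters and the neural network that maps the aggregated embedding to RGB are left out, and the official repository should be consulted for the real implementation.

```python
import math


def aggregate_embedding(query, centres, embeddings, k=2, sigma=1.0):
    """Blend the embeddings of the k Gaussians nearest to `query`."""
    # Squared Euclidean distance from the query coordinate to every centre.
    scored = []
    for index, centre in enumerate(centres):
        d2 = sum((q - c) ** 2 for q, c in zip(query, centre))
        scored.append((d2, index))
    nearest = sorted(scored)[:k]

    # Assumed weighting: closer Gaussians contribute more.
    weights = [math.exp(-d2 / (2.0 * sigma ** 2)) for d2, _ in nearest]
    total = sum(weights)
    dim = len(embeddings[0])
    return [
        sum(w * embeddings[i][j] for w, (_, i) in zip(weights, nearest)) / total
        for j in range(dim)
    ]


centres = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
embeddings = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
blended = aggregate_embedding((0.5, 0.0), centres, embeddings, k=2)
```

Moving or deleting a centre changes the output only for coordinates near that Gaussian, which gives an intuition for the local editing the abstract mentions.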

[70] A deep learning model to reduce agent dose for contrast-enhanced MRI of the cerebellopontine angle cistern

Yunjie Chen,Rianne A. Weber,Olaf M. Neve,Stephan R. Romeijn,Erik F. Hensen,Jelmer M. Wolterink,Qian Tao,Marius Staring,Berit M. Verbist

Main category: cs.CV

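TL;DR: Develops a deep learning model that restores standard-dose image quality from low-dose contrast-enhanced T1-weighted MRI, making reliable lesion detection and diagnostic characterization possible with only 10%-30% of the standard contrast agent dose.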

Details Motivation: To reduce the contrast agent dose used in MRI, lowering patient risk while keeping the image quality needed for diagnosis. Method: Using multi-center retrospective data, T1 and T1ce MRI of vestibular schwannoma patients were used to simulate low-dose scenarios at different dose reductions, deep learning models were trained to restore standard-dose images, and the restored image quality and segmentation performance were evaluated. Result: As the input dose increased, the structural similarity (SSIM) and peak signal-to-noise ratio (PSNR) of the restored images rose markedly; at 10% dose, the segmentation metrics (Dice, Hausdorff distance and surface distance) all improved; the radiologist rated images restored from both 10% and 30% doses as excellent, with the latter more informative. Conclusion: The deep learning model improves the image quality of low-dose MRI of the CPA region, so that reliable diagnosis is possible with only 10%-30% of the standard contrast agent dose. Abstract: Objectives: To evaluate a deep learning (DL) model for reducing the agent dose of contrast-enhanced T1-weighted MRI (T1ce) of the cerebellopontine angle (CPA) cistern. Materials and methods: In this multi-center retrospective study, T1 and T1ce of vestibular schwannoma (VS) patients were used to simulate low-dose T1ce with varying reductions of contrast agent dose. DL models were trained to restore standard-dose T1ce from the low-dose simulation. The image quality and segmentation performance of the DL-restored T1ce were evaluated. A head and neck radiologist was asked to rate DL-restored images in multiple aspects, including image quality and diagnostic characterization. Results: 203 MRI studies from 72 VS patients (mean age, 58.51 ± 14.73, 39 men) were evaluated. As the input dose increased, the structural similarity index measure of the restored T1ce increased from 0.639 ± 0.113 to 0.993 ± 0.009, and the peak signal-to-noise ratio increased from 21.6 ± 3.73 dB to 41.4 ± 4.84 dB. At 10% input dose, using DL-restored T1ce for segmentation improved the Dice from 0.673 to 0.734, the 95% Hausdorff distance from 2.38 mm to 2.07 mm, and the average surface distance from 1.00 mm to 0.59 mm. DL-restored T1ce from both 10% and 30% input doses showed excellent image quality, with the latter being considered more informative. Conclusion: The DL model improved the image quality of low-dose MRI of the CPA cistern, which makes lesion detection and diagnostic characterization possible with 10% - 30% of the standard dose.
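Two of the metrics reported above, PSNR and the Dice coefficient, have simple closed forms. The sketch below computes them on flattened toy inputs using their standard definitions; the function names and the flat-list representation are illustrative choices and are not taken from the study's code.

```python
import math


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, restored)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)


def dice(prediction, target):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    intersection = sum(1 for p, t in zip(prediction, target) if p and t)
    size = sum(1 for p in prediction if p) + sum(1 for t in target if t)
    return 2.0 * intersection / size if size else 1.0


quality_db = psnr([0.0, 0.0], [0.1, 0.1])
overlap = dice([1, 1, 0, 0], [1, 0, 0, 0])
```

Higher is better for both metrics, whereas the Hausdorff and average surface distances reported in the abstract are better when lower.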

[71] Smooth regularization for efficient video recognition

Gil Goldman,Raja Giryes,Mahadev Satyanarayanan

Main category: cs.CV

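TL;DR: Proposes a smooth regularization method based on a Gaussian Random Walk (GRW) that strengthens the temporal inductive bias of video recognition models and substantially improves lightweight models on Kinetics-600.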

Details Motivation: Lightweight video models struggle to capture complex temporal dynamics and need a stronger temporal inductive bias that exploits the temporal coherence inherent in video. Method: Models the changes in intermediate-layer embeddings of consecutive frames as a Gaussian Random Walk (GRW), penalizing abrupt representational shifts and encouraging smooth, low-acceleration temporal evolution. Result: On Kinetics-600, the MoViNets family improves by 3.8%-6.1%, and MobileNetV3 and the MoViNets-Stream family improve by 4.9%-6.4%, each surpassing the state of the art under its FLOP or memory constraint. Conclusion: The smooth regularization technique strengthens the temporal modeling of lightweight video models and improves performance across several architectures, advancing efficient video recognition. Abstract: We propose a smooth regularization technique that instills a strong temporal inductive bias in video recognition models, particularly benefiting lightweight architectures. Our method encourages smoothness in the intermediate-layer embeddings of consecutive frames by modeling their changes as a Gaussian Random Walk (GRW). This penalizes abrupt representational shifts, thereby promoting low-acceleration solutions that better align with the natural temporal coherence inherent in videos. By leveraging this enforced smoothness, lightweight models can more effectively capture complex temporal dynamics. Applied to such models, our technique yields a 3.8% to 6.4% accuracy improvement on Kinetics-600. Notably, the MoViNets model family trained with our smooth regularization improves the current state of the art by 3.8% to 6.1% within their respective FLOP constraints, while MobileNetV3 and the MoViNets-Stream family achieve gains of 4.9% to 6.4% over prior state-of-the-art models with comparable memory footprints. Our code and models are available at https://github.com/gilgoldm/grw-smoothing.
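The abstract does not give the loss itself. If the frame-to-frame changes of an embedding follow a Gaussian random walk, then the change of those changes (the second temporal difference, a discrete acceleration) is zero-mean Gaussian, which suggests a squared penalty on it. The sketch below implements that reading as an assumption; the function name and normalisation are hypothetical, and the authors' repository should be treated as the reference.

```python
def smoothness_penalty(embeddings):
    """Mean squared second temporal difference of per-frame embeddings.

    `embeddings` is a list of equal-length vectors, one per frame.
    """
    total, count = 0.0, 0
    for prev, cur, nxt in zip(embeddings, embeddings[1:], embeddings[2:]):
        # Discrete acceleration: e[t+1] - 2*e[t] + e[t-1].
        total += sum((n - 2.0 * c + p) ** 2 for p, c, n in zip(prev, cur, nxt))
        count += 1
    return total / count if count else 0.0


steady = smoothness_penalty([[0.0], [1.0], [2.0], [3.0]])   # constant velocity
abrupt = smoothness_penalty([[0.0], [0.0], [1.0], [1.0]])   # sudden jump
```

A constant-velocity trajectory incurs no penalty while a sudden jump does, matching the "low-acceleration" behaviour the abstract describes. During training such a term would presumably be added to the classification loss with a weighting factor.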

[72] Open Vocabulary Compositional Explanations for Neuron Alignment

Biagio La Rosa,Leilani H. Gilpin

Main category: cs.CV

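TL;DR: Proposes an open vocabulary compositional explanation framework for the vision domain that uses masks generated by open vocabulary semantic segmentation to analyze how neurons encode arbitrary concepts, making interpretability methods more flexible and more widely applicable.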

Details Motivation: Existing compositional explanation methods rely on human-annotated data, which restricts them to specific domains and predefined concepts; this work aims to remove that restriction. Method: The framework has three steps: specifying arbitrary concepts, generating semantic segmentation masks with open vocabulary models, and deriving compositional explanations from those masks. Result: Compared with previous methods, the framework performs comparably on quantitative metrics and human interpretability, and supports flexible explanations across tasks and properties of interest. Conclusion: The method removes the dependence on human annotation and extends compositional explanations to open concepts and diverse datasets. Abstract: Neurons are the fundamental building blocks of deep neural networks, and their interconnections allow AI to achieve unprecedented results. Motivated by the goal of understanding how neurons encode information, compositional explanations leverage logical relationships between concepts to express the spatial alignment between neuron activations and human knowledge. However, these explanations rely on human-annotated datasets, restricting their applicability to specific domains and predefined concepts. This paper addresses this limitation by introducing a framework for the vision domain that allows users to probe neurons for arbitrary concepts and datasets. Specifically, the framework leverages masks generated by open vocabulary semantic segmentation to compute open vocabulary compositional explanations. The proposed framework consists of three steps: specifying arbitrary concepts, generating semantic segmentation masks using open vocabulary models, and deriving compositional explanations from these masks. The paper compares the proposed framework with previous methods for computing compositional explanations both in terms of quantitative metrics and human interpretability, analyzes the differences in explanations when shifting from human-annotated data to model-annotated data, and showcases the additional capabilities provided by the framework in terms of flexibility of the explanations with respect to the tasks and properties of interest.

[73] UruDendro4: A Benchmark Dataset for Automatic Tree-Ring Detection in Cross-Section Images of Pinus taeda L

Henry Marichal,Joaquin Blanco,Diego Passarella,Gregory Randall

Main category: cs.CV

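TL;DR: Introduces UruDendro4, a new pine tree-ring dataset of 102 Pinus taeda L. image samples taken at different stem heights with manually annotated rings, supporting automatic tree-ring detection and volumetric modeling from cross-section images. It also provides baseline results for current state-of-the-art methods, among which DeepCS-TRD performs best, and shows that the dataset improves model generalization in tree-ring detection.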

Details Motivation: Wood cross-section data are scarce and mostly sampled at a single height, which makes accurate automatic tree-ring detection and volumetric modeling difficult; a more comprehensive, high-quality dataset is needed to advance the relevant algorithms and applications. Method: Builds a new dataset, UruDendro4, containing 102 cross-section images of loblolly pine (Pinus taeda L.), each with manually annotated ring boundaries; the samples come from multiple heights along the stem, which supports volumetric modeling; evaluates several state-of-the-art ring detection methods on the dataset using mAP, mAR and Adapted Rand Error, and validates the parameter configuration through ablation experiments; also tests how including the dataset in training affects model generalization. Result: DeepCS-TRD achieves the best performance on UruDendro4, with an mAP of 0.838, an mAR of 0.782 and an Adapted Rand Error of 0.084; ablation experiments validate the model configuration; training with UruDendro4 improves generalization in tree-ring detection. Conclusion: UruDendro4 is a high-quality tree-ring dataset with multi-height sampling and careful annotation that fills a gap in existing datasets for volumetric modeling; it provides a new benchmark for automatic ring detection and improves the generalization of learning models, supporting forestry research and smart forestry technology. Abstract: Tree-ring growth represents the annual wood increment for a tree, and quantifying it allows researchers to assess which silvicultural practices are best suited for each species. Manual measurement of this growth is time-consuming and often imprecise, as it is typically performed along 4 to 8 radial directions on a cross-sectional disc. In recent years, automated algorithms and datasets have emerged to enhance accuracy and automate the delineation of annual rings in cross-sectional images. To address the scarcity of wood cross-section data, we introduce the UruDendro4 dataset, a collection of 102 image samples of Pinus taeda L., each manually annotated with annual growth rings. Unlike existing public datasets, UruDendro4 includes samples extracted at multiple heights along the stem, allowing for the volumetric modeling of annual growth using manually delineated rings. This dataset (images and annotations) allows the development of volumetric models for annual wood estimation based on cross-sectional imagery. Additionally, we provide a performance baseline for automatic ring detection on this dataset using state-of-the-art methods. The highest performance was achieved by the DeepCS-TRD method, with a mean Average Precision of 0.838, a mean Average Recall of 0.782, and an Adapted Rand Error score of 0.084. A series of ablation experiments were conducted to empirically validate the final parameter configuration. Furthermore, we empirically demonstrate that training a learning model including this dataset improves the model's generalization in the tree-ring detection task.

[74] BUSTR: Breast Ultrasound Text Reporting with a Descriptor-Aware Vision-Language Model

Rawa Mohammed,Mina Attin,Bryar Shareef

Main category: cs.CV

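TL;DR: Proposes BUSTR, a multitask vision-language framework that generates breast ultrasound (BUS) reports without paired image-report supervision. It uses structured descriptors and radiomics features, learns descriptor-aware visual representations with a multi-head Swin encoder, and aligns visual and textual tokens with a dual-level objective.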

Details Motivation: Automated radiology report generation (RRG) is limited by the lack of paired image-report datasets and by the hallucination risk of large language models, so a method is needed that does not depend on paired data and improves clinical efficacy. Method: BUSTR constructs reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features; trains descriptor-aware visual representations with a multi-head Swin encoder and a multitask loss; and aligns visual and textual tokens through a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss between input and output representations. Result: Evaluation on two public BUS datasets, BrEaST and BUS-BRA, shows that BUSTR improves both standard natural language generation metrics and clinical efficacy metrics, particularly for BI-RADS category and pathology. Conclusion: A descriptor-aware vision model trained with a combined token-level and representation-alignment loss improves the quality and clinical usefulness of generated reports without requiring paired image-report data. Abstract: Automated radiology report generation (RRG) for breast ultrasound (BUS) is limited by the lack of paired image-report datasets and the risk of hallucinations from large language models. We propose BUSTR, a multitask vision-language framework that generates BUS reports without requiring paired image-report supervision. BUSTR constructs reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features, learns descriptor-aware visual representations with a multi-head Swin encoder trained using a multitask loss over dataset-specific descriptor sets, and aligns visual and textual tokens via a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss between input and output representations. We evaluate BUSTR on two public BUS datasets, BrEaST and BUS-BRA, which differ in size and available descriptors. Across both datasets, BUSTR consistently improves standard natural language generation metrics and clinical efficacy metrics, particularly for key targets such as BI-RADS category and pathology. Our results show that this descriptor-aware vision model, trained with a combined token-level and alignment loss, improves both automatic report metrics and clinical efficacy without requiring paired image-report data. The source code can be found at https://github.com/AAR-UNLV/BUSTR

[75] Beyond Realism: Learning the Art of Expressive Composition with StickerNet

Haoming Lu,David Kocharian,Humphrey Shi

Main category: cs.CV

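TL;DR: Defines the expressive composition task and presents the StickerNet framework, which learns non-realistic, expressive sticker placement from real user editing behavior, emphasizing user intent and creative expression over visual realism.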

Details Motivation: Existing image composition research focuses heavily on visual realism, yet users of real creative platforms often want content that is artistic, playful or socially engaging, so a composition method that better matches real user intent is needed. Method: Proposes StickerNet, a two-stage framework that first determines the composition type and then predicts placement parameters (such as opacity, mask, location and scale); builds the dataset from 1.8 million real editing actions collected on an anonymous online creation platform. Result: User studies and quantitative evaluations show that StickerNet outperforms common baselines and closely matches human placement behavior, confirming the value of learning from real-world editing patterns. Conclusion: The work opens a new direction in visual understanding oriented toward expressiveness and user intent, moving beyond the traditional reliance on realism and closer to real creative editing scenarios. Abstract: As a widely used operation in image editing workflows, image composition has traditionally been studied with a focus on achieving visual realism and semantic plausibility. However, in practical editing scenarios of the modern content creation landscape, many compositions are not intended to preserve realism. Instead, users of online platforms motivated by gaining community recognition often aim to create content that is more artistic, playful, or socially engaging. Taking inspiration from this observation, we define the expressive composition task, a new formulation of image composition that embraces stylistic diversity and looser placement logic, reflecting how users edit images on real-world creative platforms. To address this underexplored problem, we present StickerNet, a two-stage framework that first determines the composition type, then predicts placement parameters such as opacity, mask, location, and scale accordingly. Unlike prior work that constructs datasets by simulating object placements on real images, we directly build our dataset from 1.8 million editing actions collected on an anonymous online visual creation and editing platform, each reflecting user-community validated placement decisions. This grounding in authentic editing behavior ensures strong alignment between task definition and training supervision. User studies and quantitative evaluations show that StickerNet outperforms common baselines and closely matches human placement behavior, demonstrating the effectiveness of learning from real-world editing patterns despite the inherent ambiguity of the task. This work introduces a new direction in visual understanding that emphasizes expressiveness and user intent over realism.

[76] TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

Md Adnan Arefeen,Biplob Debnath,Srimat Chakradhar

Main category: cs.CV

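TL;DR: Proposes TrafficLens, an algorithm for efficient multi-camera traffic video analysis that uses a sequential approach and object-level similarity detection to reduce VLM invocations, substantially shortening video-to-text conversion time.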

Details Motivation: Existing LLM- and RAG-based systems must first convert traffic video into text, a time-consuming step that delays traffic insights. Method: Uses a sequential strategy that exploits the overlapping coverage areas of cameras, iteratively applies VLMs with varying token limits, feeds previous outputs as prompts for subsequent cameras, and skips redundant VLM invocations with an object-level similarity detector. Result: Experiments show that TrafficLens reduces video-to-text conversion time by up to 4 times while maintaining information accuracy. Conclusion: TrafficLens makes multi-camera traffic video analysis more efficient, supports rapid generation of detailed descriptions, and is suited to practical traffic management. Abstract: Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (LLMs) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing overlapping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to $4\times$ while maintaining information accuracy.
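The control flow described in the abstract (process cameras in sequence, pass the previous description along as a prompt, and skip the VLM when the detected objects barely change) might look roughly like the sketch below. Everything here is an assumption for illustration: the Jaccard similarity on object-label sets, the `threshold` value and the `describe` callable standing in for a VLM are hypothetical, and the varying token limits are not modelled.

```python
def jaccard(objects_a, objects_b):
    """Set overlap between two collections of detected object labels."""
    a, b = set(objects_a), set(objects_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def describe_intersection(camera_objects, describe, threshold=0.8):
    """Describe each camera in turn, reusing text when the scene is redundant."""
    descriptions = []
    last_objects, last_text = None, ""
    for objects in camera_objects:
        if last_objects is not None and jaccard(objects, last_objects) >= threshold:
            descriptions.append(last_text)  # redundant view: skip the VLM call
            continue
        last_text = describe(objects, prompt=last_text)
        last_objects = objects
        descriptions.append(last_text)
    return descriptions


calls = []

def fake_vlm(objects, prompt):
    """Stand-in for a real VLM; records each invocation."""
    calls.append(sorted(objects))
    return "sees " + ", ".join(sorted(objects))

texts = describe_intersection([{"car", "bus"}, {"car", "bus"}, {"bike"}], fake_vlm)
```

In this toy run the second camera reuses the first description, so only two of the three views trigger the (fake) VLM.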

[77] Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI

Al Amin,Kamrul Hasan,Liang Hong,Sharif Ullah

Main category: cs.CV

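TL;DR: Proposes a privacy-preserving federated learning framework that combines Vision Transformers with homomorphic encryption, encrypting CLS tokens to enable secure multi-institutional histopathology classification with high accuracy, strong privacy protection and low communication overhead.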

Details Motivation: Model gradients in conventional federated learning are vulnerable to reconstruction attacks and can leak sensitive medical data, while sharing raw data directly violates privacy regulations, so a more secure privacy-preserving mechanism is needed. Method: Uses the Vision Transformer CLS token as a compact feature representation, encrypts the CLS tokens with CKKS homomorphic encryption before transmission, and performs inference and aggregation directly on ciphertexts, avoiding the transmission of attack-prone gradients. Result: In a three-client lung cancer histopathology classification task, conventional gradient transmission is highly susceptible to model inversion attacks (PSNR: 52.26 dB, SSIM: 0.999, NMI: 0.741), whereas the proposed method prevents such attacks; each aggregation round requires only 326 KB of encrypted data, a 30-fold reduction in communication; accuracy is 96.12% in the unencrypted domain and 90.02% in the encrypted domain. Conclusion: The framework maintains high classification performance while strengthening privacy protection and lowering communication cost, making it suitable for secure collaborative learning across medical institutions. Abstract: Collaborative machine learning across healthcare institutions promises improved diagnostic accuracy by leveraging diverse datasets, yet privacy regulations such as HIPAA prohibit direct patient data sharing. While federated learning (FL) enables decentralized training without raw data exchange, recent studies show that model gradients in conventional FL remain vulnerable to reconstruction attacks, potentially exposing sensitive medical information. This paper presents a privacy-preserving federated learning framework combining Vision Transformers (ViT) with homomorphic encryption (HE) for secure multi-institutional histopathology classification. The approach leverages the ViT CLS token as a compact 768-dimensional feature representation for secure aggregation, encrypting these tokens using CKKS homomorphic encryption before transmission to the server. We demonstrate that encrypting CLS tokens achieves a 30-fold communication reduction compared to gradient encryption while maintaining strong privacy guarantees. Through evaluation on a three-client federated setup for lung cancer histopathology classification, we show that gradients are highly susceptible to model inversion attacks (PSNR: 52.26 dB, SSIM: 0.999, NMI: 0.741), enabling near-perfect image reconstruction. In contrast, the proposed CLS-protected HE approach prevents such attacks while enabling encrypted inference directly on ciphertexts, requiring only 326 KB of encrypted data transmission per aggregation round. The framework achieves 96.12 percent global classification accuracy in the unencrypted domain and 90.02 percent in the encrypted domain.

[78] Inversion-Free Style Transfer with Dual Rectified Flows

Yingying Deng,Xiangyu He,Fan Tang,Weiming Dong,Xucheng Yin

Main category: cs.CV

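TL;DR: Proposes an inversion-free style transfer framework based on dual rectified flows that achieves efficient, high-quality image style transfer using only forward passes.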

Details Motivation: Existing training-free diffusion-based style transfer methods rely on computationally expensive inversion, which is inefficient and causes visual distortions when the inversion is inaccurate. Method: Proposes a forward-only style transfer framework based on dual rectified flows that predicts content and style trajectories in parallel and fuses the velocity fields of both through dynamic midpoint interpolation; an attention injection mechanism further guides style integration. Result: The method generalizes well across diverse style and content combinations, avoids the visual artifacts of conventional methods, and improves content preservation, visual quality and computational efficiency. Conclusion: The proposed inversion-free, forward-only style transfer framework outperforms existing diffusion-based methods in both efficiency and output quality, offering an efficient and reliable solution for practical use. Abstract: Style transfer, a pivotal task in image processing, synthesizes visually compelling images by seamlessly blending realistic content with artistic styles, enabling applications in photo editing and creative design. While mainstream training-free diffusion-based methods have greatly advanced style transfer in recent years, their reliance on computationally expensive inversion processes compromises efficiency and introduces visual distortions when inversion is inaccurate. To address these limitations, we propose a novel inversion-free style transfer framework based on dual rectified flows, which tackles the challenge of finding an unknown stylized distribution from two distinct inputs (content and style images), using only forward passes. Our approach predicts content and style trajectories in parallel, then fuses them through a dynamic midpoint interpolation that integrates velocities from both paths while adapting to the evolving stylized image. By jointly modeling the content, style, and stylized distributions, our velocity field design achieves robust fusion and avoids the shortcomings of naive overlays. Attention injection further guides style integration, enhancing visual fidelity, content preservation, and computational efficiency. Extensive experiments demonstrate generalization across diverse styles and content, providing an effective and efficient pipeline for style transfer.

[79] RefOnce: Distilling References into a Prototype Memory for Referring Camouflaged Object Detection

Yu-Huan Wu,Zi-Xuan Zhu,Yan Wang,Liangli Zhen,Deng-Ping Fan

Main category: cs.CV

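TL;DR: Proposes a new Ref-COD framework that distills reference information into a class-prototype memory during training and synthesizes a reference vector at inference, enabling efficient detection without test-time reference images.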

Details Motivation: Existing Ref-COD systems rely on a dual-branch design that requires reference images at test time, which limits deployability and adds latency and data-collection burden. Method: Maintains an EMA-updated prototype per category, predicts query-conditioned mixture weights over the prototypes to produce a guidance vector, and introduces a bidirectional attention alignment module to bridge the representation gap between reference statistics and camouflaged query features. Result: Evaluated on the large-scale R2C7K benchmark, where experiments show performance competitive with or superior to recent methods. Conclusion: The method offers a simple, efficient path to Ref-COD without mandatory test-time reference images, improving practicality and deployment efficiency. Abstract: Referring Camouflaged Object Detection (Ref-COD) segments specified camouflaged objects in a scene by leveraging a small set of referring images. Though effective, current systems adopt a dual-branch design that requires reference images at test time, which limits deployability and adds latency and data-collection burden. We introduce a Ref-COD framework that distills references into a class-prototype memory during training and synthesizes a reference vector at inference via a query-conditioned mixture of prototypes. Concretely, we maintain an EMA-updated prototype per category and predict mixture weights from the query to produce a guidance vector without any test-time references. To bridge the representation gap between reference statistics and camouflaged query features, we propose a bidirectional attention alignment module that adapts both the query features and the class representation. Thus, our approach yields a simple, efficient path to Ref-COD without mandatory references. We evaluate the proposed method on the large-scale R2C7K benchmark. Extensive experiments demonstrate competitive or superior performance of the proposed method compared with recent state-of-the-art methods. Code is available at https://github.com/yuhuan-wu/RefOnce.
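The two mechanisms named in the abstract, an EMA-updated prototype per category and a query-conditioned mixture of prototypes, can be illustrated with a few lines of code. In this sketch the mixture weights come from a softmax over query-prototype dot products, which is an assumption standing in for whatever learned predictor the paper uses; the function names and the momentum value are likewise hypothetical.

```python
import math


def ema_update(prototype, feature, momentum=0.99):
    """Exponential moving average update of one class prototype (training time)."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(prototype, feature)]


def guidance_vector(query, prototypes):
    """Mix prototypes with query-dependent weights (inference time, no references)."""
    scores = [sum(q * p for q, p in zip(query, proto)) for proto in prototypes]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(prototypes[0])
    return [sum(w * proto[j] for w, proto in zip(weights, prototypes)) for j in range(dim)]


updated = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.5)
mixed = guidance_vector([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])
```

Since the prototypes are stored after training, inference needs only the query image, which is the deployment benefit the abstract emphasises.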

[80] Wavefront-Constrained Passive Obscured Object Detection

Zhiwen Zheng,Yiwei Ouyang,Zhao Huang,Tao Zhang,Xiaoshuai Zhang,Huiyu Zhou,Wenwen Tang,Shaowei Jiang,Jin Liu,Xingru Huang

Main category: cs.CV

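TL;DR: Proposes WavePCNet, a physics-driven network that simulates wavefront propagation to enhance the perception of obscured objects, achieving accurate and robust non-line-of-sight imaging under low signal-to-noise and complex scattering conditions.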

Details Motivation: Existing non-line-of-sight imaging methods ignore the physics of coherent light propagation and tend to converge to non-physical solutions under low signal-to-noise conditions, which undermines stability and reliability. Method: Proposes WavePCNet, which includes a Tri-Phase Wavefront Complex-Propagation Reprojection (TriWCP) module that models coherent light transport with complex amplitude propagation operators, plus a momentum memory mechanism that suppresses the accumulation of perturbations; a High-frequency Cross-layer Compensation Enhancement module builds multi-scale, frequency-selective pathways to dynamically maintain structural consistency. Result: Experiments on four physically collected datasets show that the method clearly outperforms state-of-the-art methods at localizing and segmenting obscured objects, with higher accuracy and robustness. Conclusion: By tightly integrating a physical model with deep learning, WavePCNet improves non-line-of-sight imaging in complex environments and strengthens the model's interpretability and physical consistency. Abstract: Accurately localizing and segmenting obscured objects from faint light patterns beyond the field of view is highly challenging due to multiple scattering and medium-induced perturbations. Most existing methods, based on real-valued modeling or local convolutional operations, are inadequate for capturing the underlying physics of coherent light propagation. Moreover, under low signal-to-noise conditions, these methods often converge to non-physical solutions, severely compromising the stability and reliability of the observation. To address these challenges, we propose a novel physics-driven Wavefront Propagating Compensation Network (WavePCNet) to simulate wavefront propagation and enhance the perception of obscured objects. This WavePCNet integrates the Tri-Phase Wavefront Complex-Propagation Reprojection (TriWCP) to incorporate complex amplitude transfer operators to precisely constrain coherent propagation behavior, along with a momentum memory mechanism to effectively suppress the accumulation of perturbations. Additionally, a High-frequency Cross-layer Compensation Enhancement is introduced to construct frequency-selective pathways with multi-scale receptive fields and dynamically model structural consistency across layers, further boosting the model's robustness and interpretability under complex environmental conditions. Extensive experiments conducted on four physically collected datasets demonstrate that WavePCNet consistently outperforms state-of-the-art methods in both accuracy and robustness.

[81] GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision

Yuxiao Xiang,Junchi Chen,Zhenchao Jin,Changtao Miao,Haojie Yuan,Qi Chu,Tao Gong,Nenghai Yu

Main category: cs.CV

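TL;DR: Proposes GuardTrace-VL, a safety auditor that monitors the entire reasoning process of multimodal large reasoning models (MLRMs) on vision-language tasks, detecting unsafe content in intermediate reasoning through joint image-text analysis.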

Details Motivation: Existing safety guards look only at the input question and the final answer, ignoring the intermediate reasoning that may contain harmful content, so potential risks go undetected. Method: Proposes the GuardTrace-VL model and constructs the GuardTrace dataset, using a three-stage progressive training scheme combined with a data refinement process to learn safety preferences for complex, context-dependent cases at different risk levels. Result: On a test set covering in-domain and out-of-domain scenarios, GuardTrace-VL reaches an F1 score of 93.1% on unsafe reasoning detection, a 13.5% improvement over the previous strongest method. Conclusion: GuardTrace-VL identifies unsafe content arising during multimodal reasoning, improves the safety of MLRM deployment, and shows good generalization and practical potential. Abstract: Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via an MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, the GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The code will be made publicly available.

[82] From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition

Jingxi Chen,Yixiao Zhang,Xiaoye Qian,Zongxia Li,Cornelia Fermuller,Caren Chen,Yiannis Aloimonos

Main category: cs.CV

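TL;DR: Proposes a lightweight finetuning approach based on a diffusion model for single-image layer decomposition, combining training on synthetic data with a novel multi-modal context fusion module, and performs strongly on object removal and occlusion recovery.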

Details Motivation: Decomposing a single image into foreground and background layers remains challenging because methods and data are limited, yet layered representations matter for image editing and content creation. Method: Adapts a diffusion-based inpainting model to layer decomposition through lightweight finetuning, and introduces a multi-modal context fusion module with linear attention complexity to preserve more detail in the latent space. Result: Trained purely on synthetic data, the model achieves superior object removal and occlusion recovery compared with existing methods. Conclusion: The method achieves effective single-image layer decomposition and opens new possibilities for downstream editing and creative applications. Abstract: Images can be viewed as layered compositions, foreground objects over background, with potential occlusions. This layered representation enables independent editing of elements, offering greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightweight finetuning. To further preserve detail in the latent space, we introduce a novel multi-modal context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.

[83] Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning

Xiaoxing You,Qiang Huang,Lingyu Li,Chi Zhang,Xiaopeng Liu,Min Zhang,Jun Yu

Main category: cs.CV

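TL;DR: This paper proposes MERGE, a multimodal entity-aware retrieval-augmented generation framework for news image captioning. By building an entity-centric multimodal knowledge base and a dynamic retrieval mechanism, it significantly outperforms existing methods on multiple datasets and generalizes well.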

Details Motivation: Existing news image captioning methods fall short in information coverage, cross-modal alignment, and visual-entity grounding, and struggle to exploit contextual information to produce high-quality captions. Method: Proposes the MERGE framework, which builds an entity-centric multimodal knowledge base (EMKB) integrating textual, visual, and structured knowledge, uses a multistage hypothesis-caption strategy to improve cross-modal alignment, and strengthens visual-entity matching through image-guided dynamic retrieval. Result: CIDEr improves by +6.84 and +1.16 on GoodNews and NYTimes800k, and named entity recognition F1-score improves by +4.14 and +2.64; on the unseen Visual News dataset, CIDEr improves by +20.17 and F1-score by +6.22, showing strong robustness and domain adaptability. Conclusion: MERGE addresses key challenges in news image captioning, markedly improves caption quality and entity accuracy, generalizes well, and offers a new direction for multimodal news understanding. Abstract: News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.

[84] MetaRank: Task-Aware Metric Selection for Model Transferability Estimation

Yuhang Liu,Wenjie Zhao,Yunhui Guo

Main category: cs.CV

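TL;DR: This paper proposes MetaRank, a meta-learning framework for selecting model transferability estimation (MTE) metrics. It embeds textual descriptions of datasets and MTE metrics into a shared semantic space and trains a meta-predictor with a listwise objective, yielding efficient, task-aware ranking of MTE metrics for new tasks.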

Details Motivation: MTE metrics are usually chosen by experience or by average performance, with no adaptation to the task, even though different metrics perform very differently across target tasks; an automated, task-aware way of selecting MTE metrics is therefore needed. Method: Formulates MTE metric selection as a learning-to-rank problem, encodes textual descriptions of datasets and MTE metrics with a pretrained language model into a shared semantic space, and trains a meta-predictor offline on diverse meta-tasks with a listwise loss that optimizes its ranking of the top-performing metrics. Result: Experiments on 11 pretrained models and 11 target datasets show that MetaRank identifies the MTE metric best suited to a given task and clearly outperforms fixed choices and selection by average performance. Conclusion: MetaRank achieves task-adaptive MTE metric selection, improves the efficiency and accuracy of source-model assessment in transfer learning, and gives practitioners a reliable basis for model selection. Abstract: Selecting an appropriate pre-trained source model is a critical, yet computationally expensive, task in transfer learning. Model Transferability Estimation (MTE) methods address this by providing efficient proxy metrics to rank models without full fine-tuning. In practice, the choice of which MTE metric to use is often ad hoc or guided simply by a metric's average historical performance. However, we observe that the effectiveness of MTE metrics is highly task-dependent and no single metric is universally optimal across all target datasets. To address this gap, we introduce MetaRank, a meta-learning framework for automatic, task-aware MTE metric selection. We formulate metric selection as a learning-to-rank problem. Rather than relying on conventional meta-features, MetaRank encodes textual descriptions of both datasets and MTE metrics using a pretrained language model, embedding them into a shared semantic space. A meta-predictor is then trained offline on diverse meta-tasks to learn the intricate relationship between dataset characteristics and metric mechanisms, optimized with a listwise objective that prioritizes correctly ranking the top-performing metrics. During the subsequent online phase, MetaRank efficiently ranks the candidate MTE metrics for a new, unseen target dataset based on its textual description, enabling practitioners to select the most appropriate metric a priori. Extensive experiments across 11 pretrained models and 11 target datasets demonstrate the strong effectiveness of our approach.
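
To make the idea of a listwise objective concrete, here is a minimal sketch of one common choice, the ListNet-style top-one cross-entropy. The abstract does not say which listwise loss MetaRank uses, so this particular loss, the function names, and the placeholder metric names are all assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(predicted_scores, true_performance):
    """ListNet-style top-one cross-entropy (an assumed stand-in loss).

    Both lists hold one value per candidate metric. The loss is small when
    the predicted scores put their mass on the truly best metrics.
    """
    target = softmax(true_performance)
    pred = softmax(predicted_scores)
    return -sum(t * math.log(p) for t, p in zip(target, pred))


def rank_metrics(metric_names, predicted_scores):
    """Return metric names sorted from highest to lowest predicted score."""
    order = sorted(range(len(metric_names)),
                   key=lambda i: predicted_scores[i], reverse=True)
    return [metric_names[i] for i in order]


# Hypothetical candidates and observed usefulness on one meta-task.
names = ["metric_a", "metric_b", "metric_c"]
observed = [0.9, 0.5, 0.1]

aligned = listwise_loss([3.0, 2.0, 1.0], observed)   # agrees with observed order
reversed_ = listwise_loss([1.0, 2.0, 3.0], observed)  # disagrees
print(aligned < reversed_)
print(rank_metrics(names, [0.1, 0.7, 0.3]))
```

Because the target distribution is a softmax over observed performance, mistakes near the top of the list cost more than mistakes near the bottom, which matches the stated goal of prioritizing the top-performing metrics.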

[85] Structure-Aware Prototype Guided Trusted Multi-View Classification

Haojian Huang,Jiahao Shi,Zhe Liu,Harold Haodong Chen,Han Fang,Hao Sun,Zhongjiang He

Main category: cs.CV

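TL;DR: Proposes a prototype-based trusted multi-view classification framework that simplifies the learning of intra-view neighbor relations and dynamically aligns intra- and inter-view structures, improving cross-view consistency and classification reliability.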

Details Motivation: Existing methods rely on globally dense neighbor relationships, which is computationally expensive and makes inter-view consistency hard to guarantee; they also aggregate evidence with manually assigned weights, with no guarantee that the multi-view structures are consistent within the class space. Method: Introduces prototypes to represent each view's neighbor structure, simplifying the learning of intra-view relations and dynamically aligning intra- and inter-view structures to better discover cross-view consensus. Result: Experiments on multiple public multi-view datasets show downstream performance and robustness that match or exceed prevalent trusted multi-view classification methods. Conclusion: The framework improves the efficiency, consistency, and trustworthiness of multi-view classification and offers a new way to handle heterogeneous, conflicting information. Abstract: Trustworthy multi-view classification (TMVC) addresses the challenge of achieving reliable decision-making in complex scenarios where multi-source information is heterogeneous, inconsistent, or even conflicting. Existing TMVC approaches predominantly rely on globally dense neighbor relationships to model intra-view dependencies, leading to high computational costs and an inability to directly ensure consistency across inter-view relationships. Furthermore, these methods typically aggregate evidence from different views through manually assigned weights, lacking guarantees that the learned multi-view neighbor structures are consistent within the class space, thus undermining the trustworthiness of classification outcomes. To overcome these limitations, we propose a novel TMVC framework that introduces prototypes to represent the neighbor structures of each view. By simplifying the learning of intra-view neighbor relations and enabling dynamic alignment of intra- and inter-view structure, our approach facilitates more efficient and consistent discovery of cross-view consensus. Extensive experiments on multiple public multi-view datasets demonstrate that our method achieves competitive downstream performance and robustness compared to prevalent TMVC methods.

[86] CameraMaster: Unified Camera Semantic-Parameter Control for Photography Retouching

Qirui Yang,Yang Yang,Ying Zeng,Xiaobin Hu,Bo Li,Huanjing Yue,Jingyu Yang,Peng-Tao Jiang

Main category: cs.CV

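TL;DR: This paper proposes CameraMaster, a unified camera-aware image retouching framework that decouples the camera directive and fuses the photographer's intent with precise camera parameters to achieve physically consistent, finely controllable image editing.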

Details Motivation: Existing text-guided diffusion models struggle to control camera parameters such as exposure, white balance, and zoom precisely in image retouching, and lack scalability and sensitivity to multi-parameter combinations and subtle variations. Method: Proposes the CameraMaster framework, which explicitly decouples the camera directive and introduces a camera parameter embedding to modulate both the directive and the content semantics; the modulated directive is injected into content features via cross-attention, and the directive and parameter embeddings are injected into the time embedding as conditioning and gating signals, giving unified layer-wise modulation throughout denoising. Result: A large-scale dataset of 78K image-prompt pairs annotated with camera parameters is built; experiments show that CameraMaster responds monotonically and near-linearly to parameter changes, supports seamless multi-parameter composition, and significantly outperforms existing methods. Conclusion: CameraMaster delivers more precise, predictable, and composable camera parameter control and advances physically consistent image retouching. Abstract: Text-guided diffusion models have greatly advanced image editing and generation. However, achieving physically consistent image retouching with precise parameter control (e.g., exposure, white balance, zoom) remains challenging. Existing methods either rely solely on ambiguous and entangled text prompts, which hinders precise camera control, or train separate heads/weights for parameter adjustment, which compromises scalability, multi-parameter composition, and sensitivity to subtle variations. To address these limitations, we propose CameraMaster, a unified camera-aware framework for image retouching. The key idea is to explicitly decouple the camera directive and then coherently integrate two critical information streams: a directive representation that captures the photographer's intent, and a parameter embedding that encodes precise camera settings. CameraMaster first uses the camera parameter embedding to modulate both the camera directive and the content semantics. The modulated directive is then injected into the content features via cross-attention, yielding a strongly camera-sensitive semantic context. In addition, the directive and camera embeddings are injected as conditioning and gating signals into the time embedding, enabling unified, layer-wise modulation throughout the denoising process and enforcing tight semantic-parameter alignment. To train and evaluate CameraMaster, we construct a large-scale dataset of 78K image-prompt pairs annotated with camera parameters. Extensive experiments show that CameraMaster produces monotonic and near-linear responses to parameter variations, supports seamless multi-parameter composition, and significantly outperforms existing methods.

[87] CaptionQA: Is Your Caption as Useful as the Image Itself?

Shijia Yang,Yunong Liu,Bohan Zhai,Ximeng Sun,Zicheng Liu,Emad Barsoum,Manling Li,Chenfeng Xu

Main category: cs.CV

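TL;DR: This paper proposes CaptionQA, a utility-based benchmark for image captions that measures caption quality by how well captions support downstream tasks. It covers four domains with a large set of multiple-choice questions and finds that models that score similarly on traditional metrics differ substantially in caption utility.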

Details Motivation: Existing caption evaluation does not answer a core question: can captions effectively stand in for images in real downstream tasks? A utility-based evaluation is therefore needed. Method: Proposes the CaptionQA benchmark, with fine-grained taxonomies across four domains and 33,027 densely annotated multiple-choice questions; an LLM answers the questions using captions alone, directly measuring how much visual information captions preserve and how usable it is. Result: Experiments reveal substantial gaps in caption utility among state-of-the-art multimodal large models; some models that perform similarly on traditional image-QA benchmarks drop by up to 32% in caption utility. Conclusion: CaptionQA exposes the performance bottlenecks of captions in practical use, provides a new evaluation standard for improving caption generation, and supports extension to new domains. Abstract: Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains--Natural, Document, E-commerce, and Embodied AI--each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models nearly identical on traditional image-QA benchmarks drop by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.
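
The evaluation protocol described above (answer multiple-choice questions from the caption alone, then score accuracy) can be sketched in a few lines. This is not the CaptionQA pipeline: the record layout, the function names, and the toy keyword matcher that stands in for the downstream LLM are all assumptions made so the example runs without any model.

```python
def caption_utility(questions, captions, answer_fn):
    """Fraction of multiple-choice questions answered correctly from captions alone.

    questions: list of dicts with keys image_id, question, options, answer
               (a hypothetical record layout).
    captions:  dict mapping image_id to the model-generated caption.
    answer_fn: callable(caption, question, options) returning the chosen option;
               in a real run this would be an LLM call.
    """
    correct = 0
    for q in questions:
        caption = captions[q["image_id"]]
        choice = answer_fn(caption, q["question"], q["options"])
        correct += int(choice == q["answer"])
    return correct / len(questions)


def keyword_answerer(caption, question, options):
    """Toy stand-in for the LLM: pick the first option mentioned in the caption."""
    text = caption.lower()
    for option in options:
        if option.lower() in text:
            return option
    return options[0]  # nothing matched, so guess the first option


captions = {
    "img1": "A red mug on a wooden desk.",
    "img2": "A street at night.",
}
questions = [
    {"image_id": "img1", "question": "What color is the mug?",
     "options": ["blue", "red", "green"], "answer": "red"},
    {"image_id": "img1", "question": "What is the desk made of?",
     "options": ["metal", "glass", "wood"], "answer": "wood"},
    {"image_id": "img2", "question": "How many people are visible?",
     "options": ["none", "two", "five"], "answer": "two"},
]

score = caption_utility(questions, captions, keyword_answerer)
print(round(score, 3))
```

The third question fails because the caption for `img2` says nothing about people, which is the kind of information loss a utility-based score is meant to surface.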

[88] FlowerDance: MeanFlow for Efficient and Refined 3D Dance Generation

Kaixing Yang,Xulong Tang,Ziqiao Peng,Xiangyue Zhang,Puwei Wang,Jun He,Hongyan Liu

Main category: cs.CV

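TL;DR: Proposes FlowerDance, an efficient, high-quality music-to-dance generation model that combines MeanFlow with physical consistency constraints and uses a BiMamba architecture with channel-level cross-modal fusion, generating artistically expressive and physically plausible 3D dance motion quickly with only a few sampling steps.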

Details Motivation: Existing music-to-dance methods generate slowly, cannot meet the demands of real-time high-fidelity 3D rendering, and so limit expressiveness in practical use. Method: Combines MeanFlow with physical consistency constraints, uses a BiMamba backbone with channel-level cross-modal fusion to generate dance motion efficiently in a non-autoregressive manner, and supports interactive motion editing. Result: Experiments on AIST++ and FineDance show that FlowerDance reaches SOTA in both motion quality and generation efficiency, markedly improving inference speed and memory utilization. Conclusion: FlowerDance greatly improves generation efficiency while keeping dance motion quality high, making it suitable for practical applications, with good scalability and interactivity. Abstract: Music-to-dance generation aims to translate auditory signals into expressive human motion, with broad applications in virtual reality, choreography, and digital entertainment. Despite promising progress, the limited generation efficiency of existing methods leaves insufficient computational headroom for high-fidelity 3D rendering, thereby constraining the expressiveness of 3D characters during real-world applications. Thus, we propose FlowerDance, which not only generates refined motion with physical plausibility and artistic expressiveness, but also achieves significant generation efficiency in inference speed and memory utilization. Specifically, FlowerDance combines MeanFlow with Physical Consistency Constraints, which enables high-quality motion generation with only a few sampling steps. Moreover, FlowerDance leverages a simple but efficient model architecture with a BiMamba-based backbone and Channel-Level Cross-Modal Fusion, which generates dance in an efficient non-autoregressive manner. Meanwhile, FlowerDance supports motion editing, enabling users to interactively refine dance sequences. Extensive experiments on AIST++ and FineDance show that FlowerDance achieves state-of-the-art results in both motion quality and generation efficiency. Code will be released upon acceptance.

[89] LungNoduleAgent: A Collaborative Multi-Agent System for Precision Diagnosis of Lung Nodules

Cheng Yang,Hui Jin,Xinlei Yu,Zhipeng Wang,Yaoqun Liu,Fenglei Fan,Dajiang Lei,Gangyong Jia,Changmiao Wang,Ruiquan Ge

Main category: cs.CV

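TL;DR: Proposes LungNoduleAgent, a collaborative multi-agent system for analyzing lung CT scans that uses three modules to improve the accuracy of nodule description and malignancy grading, outperforming existing models on multiple datasets.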

Details Motivation: Existing multimodal large language models remain limited in accurately describing nodule morphology and incorporating medical expertise, which affects their reliability in clinical use; the potential of multi-agent systems in pathology is also underexplored. Method: Splits diagnosis into three modules: the Nodule Spotter detects nodules, the Radiologist generates localized image descriptions and writes the CT report, and the Doctor Agent System performs malignancy reasoning using images, reports, a pathology knowledge base, and a multi-agent framework. Result: Experiments on two private datasets and the public LIDC-IDRI dataset show that LungNoduleAgent outperforms mainstream vision-language models, agent systems, and expert models in nodule description and malignancy grading, confirming the importance of region-level semantic alignment and multi-agent collaboration. Conclusion: LungNoduleAgent is a promising foundational tool for clinical analysis of lung nodules and shows the potential of multi-agent collaboration in medical imaging diagnosis. Abstract: Diagnosing lung cancer typically involves physicians identifying lung nodules in computed tomography (CT) scans and generating diagnostic reports based on their morphological features and medical expertise. Although advancements have been made in using multimodal large language models for analyzing lung CT scans, challenges remain in accurately describing nodule morphology and incorporating medical expertise. These limitations affect the reliability and effectiveness of these models in clinical settings. Collaborative multi-agent systems offer a promising strategy for achieving a balance between generality and precision in medical applications, yet their potential in pathology has not been thoroughly explored. To bridge these gaps, we introduce LungNoduleAgent, an innovative collaborative multi-agent system specifically designed for analyzing lung CT scans. LungNoduleAgent streamlines the diagnostic process into sequential components, improving precision in describing nodules and grading malignancy through three primary modules. The first module, the Nodule Spotter, coordinates clinical detection models to accurately identify nodules. The second module, the Radiologist, integrates localized image description techniques to produce comprehensive CT reports. Finally, the Doctor Agent System performs malignancy reasoning by using images and CT reports, supported by a pathology knowledge base and a multi-agent system framework. Extensive testing on two private datasets and the public LIDC-IDRI dataset indicates that LungNoduleAgent surpasses mainstream vision-language models, agent systems, and advanced expert models. These results highlight the importance of region-level semantic alignment and multi-agent collaboration in diagnosing nodules. LungNoduleAgent stands out as a promising foundational tool for supporting clinical analyses of lung nodules.

[90] PG-ControlNet: A Physics-Guided ControlNet for Generative Spatially Varying Image Deblurring

Hakki Motorcu,Mujdat Cetin

Main category: cs.CV

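TL;DR: This paper proposes a new image deblurring framework that combines a powerful generative prior with explicit, dense physical constraints to handle spatially varying blur, balancing physical accuracy and perceptual realism.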

Details Motivation: Learning-based deblurring methods trade physical constraints against perceptual quality: model-based methods are physically accurate but over-smooth textures, while generative models look good but tend to hallucinate details. This paper aims to combine the strengths of both. Method: Models the complex spatially varying degradation field as a dense continuum of high-dimensional compressed kernels and uses this descriptor field to condition the diffusion sampling process under a ControlNet architecture, fusing the generative prior with physical constraints. Result: Experiments show that the method outperforms state-of-the-art model-based and generative deblurring methods in complex, severely blurred scenes, combining physical accuracy with visual realism. Conclusion: The framework reconciles the perceptual strengths of generative models with the accuracy of physical models and performs very well on spatially varying image deblurring. Abstract: Spatially varying image deblurring remains a fundamentally ill-posed problem, especially when degradations arise from complex mixtures of motion and other forms of blur under significant noise. State-of-the-art learning-based approaches generally fall into two paradigms: model-based deep unrolling methods that enforce physical constraints by modeling the degradations, but often produce over-smoothed, artifact-laden textures, and generative models that achieve superior perceptual quality yet hallucinate details due to weak physical constraints. In this paper, we propose a novel framework that uniquely reconciles these paradigms by taming a powerful generative prior with explicit, dense physical constraints. Rather than oversimplifying the degradation field, we model it as a dense continuum of high-dimensional compressed kernels, ensuring that minute variations in motion and other degradation patterns are captured. We leverage this rich descriptor field to condition a ControlNet architecture, strongly guiding the diffusion sampling process. Extensive experiments demonstrate that our method effectively bridges the gap between physical accuracy and perceptual realism, outperforming state-of-the-art model-based methods as well as generative baselines in challenging, severely blurred scenarios.

[91] MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization

Yingjie Xia,Xi Wang,Jinglei Shi,Vicky Kalogeiton,Jian Yang

Main category: cs.CV

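TL;DR: MUSE is the first unified framework for image emotional synthesis, handling both emotional generation and editing without additional training of the diffusion model or specialized datasets; it achieves efficient, accurate emotion control through gradient-based optimization and semantic-similarity guidance.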

Details Motivation: Existing image emotional synthesis methods treat generation and editing as separate tasks, which is inefficient and limits applications such as therapy and storytelling; a unified framework is needed. Method: Proposes the MUSE framework, which follows a strategy akin to Test-Time Scaling (TTS) and uses an off-the-shelf emotion classifier for gradient-based optimization; semantic similarity determines the best moment to introduce emotional guidance, and a multi-emotion loss reduces interference between emotions. Result: Experiments show that MUSE outperforms existing methods on both generation and editing, improving emotional accuracy and semantic diversity while balancing content, text alignment, and realism. Conclusion: MUSE establishes a new paradigm for image emotional synthesis, providing efficient, unified emotion control without additional training, with broad application potential. Abstract: Images evoke emotions that profoundly influence perception, often prioritized over content. Current Image Emotional Synthesis (IES) approaches artificially separate generation and editing tasks, creating inefficiencies and limiting applications where these tasks naturally intertwine, such as therapeutic interventions or storytelling. In this work, we introduce MUSE, the first unified framework capable of both emotional generation and editing. By adopting a strategy conceptually aligned with Test-Time Scaling (TTS), which is widely used in both the LLM and diffusion model communities, it avoids the need for additional updates to the diffusion model and for specialized emotional synthesis datasets. More specifically, MUSE addresses three key questions in emotional synthesis: (1) HOW to stably guide synthesis by leveraging an off-the-shelf emotion classifier with gradient-based optimization of emotional tokens; (2) WHEN to introduce emotional guidance by identifying the optimal timing using semantic similarity as a supervisory signal; and (3) WHICH emotion to guide synthesis through a multi-emotion loss that reduces interference from inherent and similar emotions. Experimental results show that MUSE performs favorably against all methods for both generation and editing, improving emotional accuracy and semantic diversity while maintaining an optimal balance between desired content, adherence to text prompts, and realistic emotional expression. It establishes a new paradigm for emotion synthesis.

[92] Long-Term Alzheimers Disease Prediction: A Novel Image Generation Method Using Temporal Parameter Estimation with Normal Inverse Gamma Distribution on Uneven Time Series

Xin Hong,Xinze Sun,Yinhao Li,Yen-Wei Chen

Main category: cs.CV

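TL;DR: Proposes T-NIG, a model that estimates a temporal parameter within the Normal Inverse Gamma distribution to generate brain images over irregular time intervals and predict Alzheimer's disease progression in the long term, preserving disease-related characteristics while modeling uncertainty.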

Details Motivation: Existing image generation methods struggle to maintain disease-related characteristics over the long term when sequential data has irregular time intervals, which hurts the accuracy of Alzheimer's disease prediction. Method: Proposes the T-NIG model, which uses brain images from two time points to construct intermediate and future images; it extracts features using coordinate neighborhoods and embeds a time parameter into the normal inverse gamma distribution to model feature changes over irregular intervals, and it uses uncertainty estimation to reduce the epistemic and aleatoric uncertainty caused by sparse temporal data. Result: T-NIG reaches state-of-the-art performance on both short-term and long-term prediction tasks, preserves disease-related characteristics, and predicts disease progression accurately even under irregular temporal distributions. Conclusion: By introducing time-aware probabilistic modeling and uncertainty estimation, T-NIG markedly improves the accuracy and robustness of brain image generation and AD progression prediction on irregular time series. Abstract: Image generation can provide physicians with an imaging diagnosis basis in the prediction of Alzheimer's Disease (AD). Recent research has shown that long-term AD predictions by image generation often face difficulties maintaining disease-related characteristics when dealing with irregular time intervals in sequential data. Considering that the time-related aspects of the distribution can reflect changes in disease-related characteristics when images are distributed unevenly, this research proposes a model to estimate the temporal parameter within the Normal Inverse Gamma Distribution (T-NIG) to assist in generating images over the long term. The T-NIG model employs brain images from two different time points to create intermediate brain images, forecast future images, and predict the disease. T-NIG is designed by identifying features using coordinate neighborhoods. It incorporates a time parameter into the normal inverse gamma distribution to understand how features change in brain imaging sequences that have varying time intervals. Additionally, T-NIG utilizes uncertainty estimation to reduce both epistemic and aleatoric uncertainties in the model, which arise from insufficient temporal data. In particular, the T-NIG model demonstrates state-of-the-art performance in both short-term and long-term prediction tasks within the dataset. Experimental results indicate that T-NIG is proficient in forecasting disease progression while maintaining disease-related characteristics, even when faced with an irregular temporal data distribution.
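
For readers unfamiliar with the distribution, the standard Normal Inverse Gamma prior over a Gaussian's mean and variance, and the moments commonly used to separate the two kinds of uncertainty, are given below. The symbols $\gamma, \nu, \alpha, \beta$ are the usual textbook parameters and are an assumption here; the abstract does not state how the time parameter enters them, so that part is left out.

```latex
% Standard Normal Inverse Gamma (NIG) prior; parameter names are the usual
% textbook ones, not taken from the T-NIG paper.
\begin{align}
  (\mu, \sigma^2) &\sim \mathrm{NIG}(\gamma, \nu, \alpha, \beta),
  \qquad
  \mu \mid \sigma^2 \sim \mathcal{N}\!\left(\gamma, \tfrac{\sigma^2}{\nu}\right),
  \quad
  \sigma^2 \sim \Gamma^{-1}(\alpha, \beta), \\
  \mathbb{E}[\mu] &= \gamma
  && \text{(prediction)}, \\
  \mathbb{E}[\sigma^2] &= \frac{\beta}{\alpha - 1}, \quad \alpha > 1
  && \text{(aleatoric uncertainty)}, \\
  \operatorname{Var}[\mu] &= \frac{\beta}{\nu(\alpha - 1)}, \quad \alpha > 1
  && \text{(epistemic uncertainty)}.
\end{align}
```

The epistemic term is the aleatoric term divided by $\nu$, so it shrinks as the evidence parameter $\nu$ grows, while the aleatoric term does not depend on $\nu$ at all.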

[93] MIRA: Multimodal Iterative Reasoning Agent for Image Editing

Ziyun Zeng,Hang Hua,Jiebo Luo

Main category: cs.CV

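TL;DR: This paper proposes MIRA, a lightweight, plug-and-play multimodal reasoning agent that performs instruction-guided image editing through an iterative perception-reasoning-action loop, significantly improving the semantic consistency and visual quality of diffusion models under complex instructions.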

Details Motivation: Diffusion-based image editing methods have trouble understanding complex natural-language instructions (compositional relationships, contextual cues, referring expressions), which often leads to semantic drift or failed edits; the authors aim to improve how well models understand and carry out such instructions. Method: Proposes MIRA (Multimodal Iterative Reasoning Agent), which uses an iterative perception-reasoning-action framework to predict atomic edit instructions step by step, with visual feedback informing each decision; builds MIRA-Editing, a 150K-sample multimodal tool-use dataset, and trains with a two-stage SFT + GRPO pipeline. Result: Paired with open-source image editing models (such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit), MIRA significantly improves semantic consistency and perceptual quality, matching or exceeding proprietary systems such as GPT-Image and Nano-Banana. Conclusion: By simulating multi-turn human interaction, MIRA addresses semantic drift in image editing under complex instructions, generalizes well, and offers an efficient, practical new paradigm for instruction-guided image editing. Abstract: Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.
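
The control flow of an iterative perception-reasoning-action loop can be sketched as follows. This is only the loop skeleton under stated assumptions: `propose_step` and `apply_edit` are hypothetical callables standing in for the reasoning agent and the image editing model, and the toy versions below operate on a dict of attributes rather than on pixels.

```python
def iterative_edit(image, instruction, propose_step, apply_edit, max_steps=5):
    """Run a propose-then-apply loop until the agent reports it is done.

    propose_step(image, instruction, history) returns the next atomic edit,
    or None when no further edit is needed (hypothetical interface).
    apply_edit(image, step) returns the edited image (hypothetical interface).
    """
    history = []
    for _ in range(max_steps):
        step = propose_step(image, instruction, history)  # perceive and reason
        if step is None:
            break
        image = apply_edit(image, step)                    # act
        history.append(step)                               # feedback for next round
    return image, history


def toy_propose(image, instruction, history):
    """Toy reasoner: return the first attribute that still differs from the goal."""
    for key, value in instruction.items():
        if image.get(key) != value:
            return (key, value)
    return None


def toy_apply(image, step):
    """Toy editor: set one attribute on a copy of the image description."""
    key, value = step
    updated = dict(image)
    updated[key] = value
    return updated


start = {"sky": "day", "car": "red"}
goal = {"sky": "sunset", "car": "blue"}
final, steps = iterative_edit(start, goal, toy_propose, toy_apply)
print(final)
print(steps)
```

Each round re-inspects the current image before choosing the next atomic edit, which is the difference from issuing a single prompt or a fixed plan up front.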

[94] CLRecogEye : Curriculum Learning towards exploiting convolution features for Dynamic Iris Recognition

Geetanjali Sharma,Gaurav Jaswal,Aditya Nigam,Raghavendra Ramachandra

Main category: cs.CV

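TL;DR: Proposes a new iris recognition framework based on a 3D-CNN and curriculum learning that models spatio-spatial-temporal features to improve robustness and generalization under rotation, scale variation, specular reflections, and blur.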

Details Motivation: Most existing iris recognition methods rely on point-to-point distance comparisons, fail to exploit the spatio-spatial-temporal structure of iris patterns, and are not robust enough to rotation, scale variation, specular reflections, and defocus blur. Method: Splits each iris image along one dimension into a sequence of sub-images that is fed to a 3D-CNN to capture spatial and spatio-spatial-temporal features; trains the model with a curriculum learning strategy, optimized end-to-end with triplet and ArcFace losses to make the features more discriminative. Result: The method markedly improves the robustness and accuracy of iris recognition under complex perturbations, and the learned embeddings are more discriminative and better model spatio-temporal dependencies. Conclusion: By introducing a 3D-CNN and curriculum learning, the framework exploits the spatio-spatial-temporal structure of iris features and yields a more robust, generalizable iris authentication solution. Abstract: Iris authentication algorithms have achieved impressive recognition performance, making them highly promising for real-world applications such as border control, citizen identification, and both criminal investigations and commercial systems. However, their robustness is still challenged by variations in rotation, scale, specular reflections, and defocus blur. In addition, most existing approaches rely on straightforward point-to-point comparisons, typically using cosine or L2 distance, without effectively leveraging the spatio-spatial-temporal structure of iris patterns. To address these limitations, we propose a novel and generalized matching pipeline that learns rich spatio-spatial-temporal representations of iris features. Our approach first splits each iris image along one dimension, generating a sequence of sub-images that serve as input to a 3D-CNN, enabling the network to capture both spatial and spatio-spatial-temporal cues. To further enhance the modeling of spatio-spatial-temporal feature dynamics, we train the model in a curriculum manner. This design allows the network to embed temporal dependencies directly into the feature space, improving discriminability in the deep metric domain. The framework is trained end-to-end with triplet and ArcFace loss in a curriculum manner, enforcing highly discriminative embeddings despite challenges like rotation, scale, reflections, and blur. This design yields a robust and generalizable solution for iris authentication. GitHub code: https://github.com/GeetanjaliGTZ/CLRecogEye

[95] Pygmalion Effect in Vision: Image-to-Clay Translation for Reflective Geometry Reconstruction

Gayoung Lee,Junho Kim,Jin-Hwa Kim,Junmo Kim

Main category: cs.CV

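TL;DR: This paper proposes the "Pygmalion Effect" framework, inspired by the myth of Pygmalion, which suppresses specular reflections through image-to-clay translation to enable robust 3D reconstruction of objects from multi-view images containing complex reflections.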

Details Motivation: View-dependent reflections entangle appearance and geometry, so understanding reflection has long been a challenge in 3D reconstruction; existing methods struggle to handle complex surface reflections while preserving geometric consistency. Method: Proposes a dual-branch network with a BRDF-based reflective branch and a clay-guided branch; synthesized reflection-free clay-like images serve as the supervision signal, and the two branches are trained jointly to stabilize geometry and refine surface normals. Result: Normal accuracy and mesh completeness improve substantially on both synthetic and real datasets, surpassing existing reflection-handling methods. Conclusion: "Seeing by unshining", that is, translating radiance into a neutral representation, can serve as an effective inductive bias for learning the geometry of reflective objects and offers a new direction for 3D reconstruction. Abstract: Understanding reflection remains a long-standing challenge in 3D reconstruction due to the entanglement of appearance and geometry under view-dependent reflections. In this work, we present the Pygmalion Effect in Vision, a novel framework that metaphorically "sculpts" reflective objects into clay-like forms through image-to-clay translation. Inspired by the myth of Pygmalion, our method learns to suppress specular cues while preserving intrinsic geometric consistency, enabling robust reconstruction from multi-view images containing complex reflections. Specifically, we introduce a dual-branch network in which a BRDF-based reflective branch is complemented by a clay-guided branch that stabilizes geometry and refines surface normals. The two branches are trained jointly using the synthesized clay-like images, which provide a neutral, reflection-free supervision signal that complements the reflective views. Experiments on both synthetic and real datasets demonstrate substantial improvement in normal accuracy and mesh completeness over existing reflection-handling methods. Beyond technical gains, our framework reveals that seeing by unshining, translating radiance into neutrality, can serve as a powerful inductive bias for reflective object geometry learning.

[96] Scaling Foundation Models for Radar Scene Understanding

Pushkal Mishra,Kshitiz Bansal,Dinesh Bharadia

Main category: cs.CV

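TL;DR: This paper proposes RadarFM, a radar foundation model trained with structured spatial language supervision that learns unified scene-level representations for cross-task transfer, addressing the fragmented, task-specific nature of existing radar methods.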

Details Motivation: Existing radar perception methods are mostly task-specific with scattered architectures and do not transfer across tasks; radar has also rarely been combined with emerging foundation models, which limits its potential in complex environments. Method: Proposes a structured caption framework that encodes vehicle distributions in native radar coordinates, and a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, supporting fine-grained spatial reasoning; uses the CARLA simulator to generate large-scale annotated datasets. Result: Builds RadarFM, a radar foundation model that learns unified scene representations, and proposes localization-aware metrics that assess spatial accuracy beyond traditional detection measures. Conclusion: Through structured spatial language supervision and a new contrastive objective, RadarFM enables fine-grained spatial understanding and cross-task transfer in radar perception and points to a new direction for radar foundation models. Abstract: Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions. Recent advances in foundation models have transformed visual and language understanding, yet their integration with radar sensing remains largely underexplored. Existing radar approaches are fragmented and task-specific; each downstream task employs distinct architectures and training objectives, preventing transfer across tasks. In this work, we introduce RadarFM: a radar foundation model that learns unified scene-level representations through structured spatial language supervision. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in native radar coordinates, and (2) a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, enabling fine-grained spatial reasoning. Leveraging the CARLA simulator, we generate large-scale, well-annotated radar datasets across diverse driving scenarios. We also propose localization-aware metrics that assess spatial accuracy beyond traditional detection measures.

[97] EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens

Ze Feng,Sen Yang,Boqiang Duan,Wankou Yang,Jingdong Wang

Main category: cs.CV

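TL;DR: This paper proposes EM-KD, a new knowledge distillation paradigm for improving the visual comprehension of efficient multimodal large language models (MLLMs). It aligns teacher and student vision tokens using Manhattan distance and Hungarian matching and introduces two distillation strategies, Vision-Language Affinity Distillation (VLAD) and Vision Semantic Distillation (VSD), significantly outperforming existing methods on multiple benchmarks.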

Details Motivation: Efficient MLLMs lose information when they compress vision tokens, which hurts visual comprehension; conventional knowledge distillation overlooks the differences in fine-grained understanding caused by the unbalanced number of vision tokens between teacher and student, so a more effective distillation mechanism is needed. Method: First computes the Manhattan distance between teacher and student vision logits and aligns the vision tokens in the spatial dimension with the Hungarian algorithm; then applies two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD), which minimizes the smooth L1 distance between the teacher and student affinity matrices of text and vision tokens; 2) Vision Semantic Distillation (VSD), which uses the reverse KL divergence to measure the difference between the probability distributions of the aligned vision logits over the vocabulary space. Result: On multiple benchmarks, models trained with EM-KD significantly outperform prior efficient MLLMs in both accuracy and efficiency; EM-KD also beats existing distillation methods even when they are given the proposed vision token matching strategy. Conclusion: EM-KD addresses the information loss caused by vision token compression in efficient MLLMs; with its alignment mechanism and dual distillation strategies, it markedly improves fine-grained visual comprehension and offers a new solution for distilling efficient multimodal models. Abstract: Efficient Multimodal Large Language Models (MLLMs) compress vision tokens to reduce resource consumption, but the loss of visual information can degrade comprehension capabilities. Although some prior works introduce Knowledge Distillation to enhance student models, they overlook the fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between the efficient student and vanilla teacher. In this paper, we propose EM-KD, a novel paradigm that enhances Efficient MLLMs with Knowledge Distillation. To overcome the challenge of unbalanced vision tokens, we first calculate the Manhattan distance between the vision logits of teacher and student, and then align them in the spatial dimension with the Hungarian matching algorithm. After alignment, EM-KD introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD) and 2) Vision Semantic Distillation (VSD). Specifically, VLAD calculates the affinity matrix between text tokens and aligned vision tokens, and minimizes the smooth L1 distance of the student and the teacher affinity matrices. Considering the semantic richness of vision logits in the final layer, VSD employs the reverse KL divergence to measure the discrete probability distributions of the aligned vision logits over the vocabulary space. Comprehensive evaluation on diverse benchmarks demonstrates that the EM-KD-trained model outperforms prior Efficient MLLMs on both accuracy and efficiency by a large margin, validating its effectiveness. Compared with previous distillation methods, which are equipped with our proposed vision token matching strategy for fair comparison, EM-KD also achieves better performance.
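
The alignment and VSD steps described above can be illustrated on toy data. The sketch below is not the authors' implementation: the function names are hypothetical, the logits are three-dimensional toy vectors, and the optimal assignment is found by brute force over permutations as a stand-in for the Hungarian algorithm (both return a minimum-cost one-to-one matching, but brute force is only practical for tiny inputs). Reverse KL is taken in the usual distillation convention, KL(student || teacher).

```python
import math
from itertools import permutations


def manhattan(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_tokens(student, teacher):
    """Give each student token a distinct teacher token with minimal total L1 cost.

    Brute-force stand-in for Hungarian matching; assumes
    len(student) <= len(teacher), i.e. the student has fewer vision tokens.
    Returns a list where entry i is the teacher index matched to student token i.
    """
    cost = [[manhattan(s, t) for t in teacher] for s in student]
    best = min(
        permutations(range(len(teacher)), len(student)),
        key=lambda perm: sum(cost[i][j] for i, j in enumerate(perm)),
    )
    return list(best)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) between the softmax distributions of two logit vectors."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return sum(ps * math.log(ps / pt) for ps, pt in zip(p_s, p_t))


# Two student tokens, three teacher tokens (toy vision logits).
student = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[0.0, 0.9, 0.1], [0.0, 0.0, 1.0], [0.9, 0.1, 0.0]]

assignment = match_tokens(student, teacher)
print(assignment)

# Semantic distillation loss averaged over the matched pairs.
vsd = sum(reverse_kl(student[i], teacher[j])
          for i, j in enumerate(assignment)) / len(student)
print(vsd >= 0.0)
```

For realistic token counts, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment`, which accepts rectangular cost matrices, would replace the brute-force search.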

[98] FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain

YuAn Wang,Xiaofan Li,Chi Huang,Wenhao Zhang,Hao Li,Bosheng Wang,Xun Sun,Jun Wang

Main category: cs.CV

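TL;DR: This paper proposes FaithFusion, a framework that fuses 3DGS with diffusion models and uses pixel-wise Expected Information Gain (EIG) to achieve controllable driving-scene reconstruction with both geometric fidelity and visual realism; it performs well under large viewpoint changes and reaches SOTA on multiple metrics.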

Details Motivation: When fusing geometry-driven 3DGS with appearance-driven diffusion models, existing methods lack pixel-wise, 3D-consistent editing criteria, which leads to over-restoration and geometric drift and makes it hard to achieve both geometric fidelity and visual quality. Method: Proposes the FaithFusion framework, which uses pixel-wise Expected Information Gain (EIG) as a unified policy: EIG serves as a spatial prior that guides the diffusion model to refine high-uncertainty regions, and pixel-level weighting distills the edits back into 3DGS, forming a plug-and-play system that needs no extra priors or structural modifications. Result: Experiments on the Waymo dataset show state-of-the-art results on NTA-IoU, NTL-IoU, and FID, with an FID of 107.47 maintained even at a 6-meter lane shift. Conclusion: Through EIG, FaithFusion fuses 3DGS and diffusion models effectively, balancing geometric consistency and visual realism, and provides an efficient, general solution for controllable driving-scene generation. Abstract: In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce FaithFusion, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications. Extensive experiments on the Waymo dataset demonstrate that our approach attains SOTA performance across NTA-IoU, NTL-IoU, and FID, maintaining an FID of 107.47 even at a 6-meter lane shift. Our code is available at https://github.com/wangyuanbiubiubiu/FaithFusion.

[99] Deformation-aware Temporal Generation for Early Prediction of Alzheimers Disease

Xin Honga,Jie Lin,Minghui Wang

Main category: cs.CV

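TL;DR: This paper proposes the Deformation-Aware Temporal Generative Network (DATGN), a new method for early prediction of Alzheimer's disease (AD) that automatically generates and analyzes morphological changes in brain MRI images. It handles the missing data common in temporal sequences and generates future MRI images consistent with disease progression, which improves classification accuracy.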

Details Motivation: Alzheimer's disease is a degenerative brain disease, and early prediction can help slow its progression. Most existing prediction methods rely on manually extracted morphological features from brain images, which is inefficient and struggles to capture complex dynamic changes. An automated method is therefore needed to learn the brain structural changes associated with disease progression. Method: DATGN first interpolates incomplete temporal MRI sequences to fill in missing values; a bidirectional temporal deformation-aware module then guides the network to generate future MRI images that follow AD progression; finally, the generated synthetic data is fed to classifiers such as SVM, CNN, and 3DCNN to improve classification performance. Result: On the ADNI dataset, DATGN's ability to generate future MRI sequences was validated, with good PSNR and MMSE scores; adding the generated data to classification improved AD vs. NC accuracy by 6.21% to 16% and AD vs. MCI vs. NC accuracy by 7.34% to 21.25%; visualizations show that the generated images follow the AD-related brain atrophy trend. Conclusion: DATGN effectively models how brain morphology changes over time and generates pathologically consistent MRI images, providing a reliable and feasible automated framework for early AD prediction and markedly improving existing classifiers. Abstract: Alzheimer's disease (AD), a degenerative brain condition, can benefit from early prediction to slow its progression. As the disease progresses, patients typically undergo brain atrophy. Current prediction methods for Alzheimer's disease largely involve analyzing morphological changes in brain images through manual feature extraction. This paper proposes a novel method, the Deformation-Aware Temporal Generative Network (DATGN), to automate the learning of morphological changes in brain images related to disease progression for early prediction. Given the common occurrence of missing data in the temporal sequences of MRI images, DATGN initially interpolates incomplete sequences. Subsequently, a bidirectional temporal deformation-aware module guides the network in generating future MRI images that adhere to the disease's progression, facilitating early prediction of Alzheimer's disease. DATGN was tested for the generation of temporal sequences of future MRI images using the ADNI dataset, and the experimental results are competitive in terms of PSNR and MMSE image quality metrics. Furthermore, when DATGN-generated synthetic data was integrated into SVM-, CNN-, and 3DCNN-based classification methods, significant improvements were achieved, from 6.21% to 16% in AD vs. NC classification accuracy and from 7.34% to 21.25% in AD vs. MCI vs. NC classification accuracy.
The qualitative visualization results indicate that DATGN produces MRI images consistent with the brain atrophy trend in Alzheimer's disease, enabling early disease prediction.
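
The DATGN abstract reports image quality in terms of PSNR. As a reminder of what that metric measures, here is a minimal NumPy sketch of the standard PSNR formula. The function name `psnr` and the `data_range` parameter are illustrative assumptions, not taken from the paper, and intensities are assumed to be scaled to `[0, data_range]`.

```python
import numpy as np

def psnr(reference, generated, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally shaped images.

    Hypothetical helper for illustration; assumes intensities lie in
    [0, data_range].
    """
    reference = np.asarray(reference, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    # Mean squared error over all pixels (or voxels, for 3D MRI volumes).
    mse = np.mean((reference - generated) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)
```

Higher values mean the generated image is closer to the reference; the same function works on 2D slices and 3D volumes because it only averages over elements.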

[100] AnchorOPT: Towards Optimizing Dynamic Anchors for Adaptive Prompt Learning

Zheng Li,Yibing Song,Xin Zhang,Lei Luo,Xiang Li,Jian Yang

Main category: cs.CV

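TL;DR: Proposes AnchorOPT, a dynamic anchor-based prompt learning framework that learns anchor values from task data and adaptively optimizes the positional relationship between anchors and soft tokens, improving the generalization of CLIP models.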

Details Motivation: Existing CLIP-based prompt learning methods use static textual tokens as anchors, which lack flexibility across tasks and training stages and limit adaptability and generalization. Method: AnchorOPT introduces dynamism along two dimensions: anchor values are learned from task-specific data rather than handcrafted, and the positional relationship between anchors and soft tokens is adaptively optimized through a learnable position matrix conditioned on the training stage and task context. Training has two stages: anchors are learned and frozen first, then the soft tokens and position matrix are optimized. Result: Experiments show that with only a simple learnable anchor and position matrix, AnchorOPT matches or exceeds some methods that add extra modules or regularization, with consistent gains across multiple datasets. Conclusion: As a plug-and-play module, AnchorOPT effectively improves existing frameworks and is general and practical. Abstract: Existing prompt learning methods, which are built upon CLIP models, leverage textual tokens as anchors to guide the learnable soft tokens. This guidance improves CLIP's generalization. However, these anchors, static in both value and position, lack cross-task and stage-adaptive flexibility. To address this limitation, we propose AnchorOPT, a dynamic anchor-based prompt learning framework. Specifically, AnchorOPT introduces dynamism in two key dimensions: (i) anchor values eschew handcrafted explicit textual tokens (e.g., "shape", "color"), instead learning dynamically from task-specific data; and (ii) the positional relationship between anchor and soft tokens is no longer fixed but adaptively optimized via a learnable position matrix conditioned on the training stage and task context. Training occurs in two stages: we first learn the anchor tokens, then freeze and transfer them to the second stage for optimization of soft tokens and the position matrix. Extensive experiments demonstrate that using only a simple learnable anchor and position matrix achieves performance comparable to or exceeding some methods incorporating additional learnable modules or regularization techniques. As a plug-and-play module, AnchorOPT integrates seamlessly into existing frameworks, yielding consistent performance gains across diverse datasets. Code is publicly available at https://github.com/zhengli97/ATPrompt.

[101] Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models

Changlin Li,Jiawei Zhang,Zeyi Shi,Zongxin Yang,Zhihui Li,Xiaojun Chang

Main category: cs.CV

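TL;DR: This paper proposes EntPruner, an entropy-guided automatic progressive pruning framework for efficiently compressing diffusion and flow models, achieving up to 2.22x inference speedup while maintaining generation quality.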

Details Motivation: Large-scale vision generative models show significant parameter redundancy when transferred to downstream tasks, calling for a pruning method that is adaptive and dynamic while preserving generation diversity and condition fidelity. Method: An entropy-guided pruning strategy uses the data-dependent Conditional Entropy Deviation (CED) as the block-importance metric, and a zero-shot adaptive pruning framework dynamically decides when and how much to prune during training. Result: Extensive experiments on DiT and SiT models show that EntPruner achieves up to 2.22x inference speedup on ImageNet and three downstream datasets while keeping generation quality competitive. Conclusion: EntPruner addresses the parameter redundancy of generative models on downstream tasks; its entropy guidance and progressive pruning achieve strong compression while avoiding mode collapse. Abstract: Large-scale vision generative models, including diffusion and flow models, have demonstrated remarkable performance in visual generation tasks. However, transferring these pre-trained models to downstream tasks often results in significant parameter redundancy. In this paper, we propose EntPruner, an entropy-guided automatic progressive pruning framework for diffusion and flow models. First, we introduce entropy-guided pruning, a block-level importance assessment strategy specifically designed for generative models. Unlike discriminative models, generative models require preserving the diversity and condition-fidelity of the output distribution. As the importance of each module can vary significantly across downstream tasks, EntPruner prioritizes pruning of less important blocks using data-dependent Conditional Entropy Deviation (CED) as a guiding metric. CED quantifies how much the distribution diverges from the learned conditional data distribution after removing a block. Second, we propose a zero-shot adaptive pruning framework to automatically determine when and how much to prune during training. This dynamic strategy avoids the pitfalls of one-shot pruning, mitigating mode collapse, and preserving model performance. Extensive experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22$\times$ inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets.

[102] Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis

Jiyun Bae,Hyunjong Ok,Sangwoo Mo,Jaeho Lee

Main category: cs.CV

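TL;DR: By introducing Idis, a visual question-answering dataset with distractors, this study examines how visual distractors affect test-time scaling in vision-language models and finds that visual distractors differ fundamentally from textual ones.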

Details Motivation: To investigate how distracting information affects reasoning quality and length in multimodal settings, particularly vision-language models, extending prior inverse-scaling studies on language models. Method: Builds Idis, a visual question-answering dataset whose distractors can be varied systematically along semantic, numerical, and spatial dimensions; analyzes the effect of visual distractors on reasoning length and accuracy; and tracks attribute counts within reasoning traces. Result: Visual distractors lower accuracy without increasing reasoning length, an inverse-scaling behavior that differs from that of textual distractors; the trend also holds on visual bias benchmarks such as Waterbirds. Conclusion: Visual distractors affect VLMs through a different mechanism than textual distractors; a simple prompting strategy is proposed to mitigate bias-driven predictions. Abstract: How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.

[103] CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion

Dianbing Xi,Jiepeng Wang,Yuanzhi Liang,Xi Qiu,Jialun Liu,Hao Pan,Yuchi Huo,Rui Wang,Haibin Huang,Chi Zhang,Xuelong Li

Main category: cs.CV

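TL;DR: This paper proposes CtrlVDiff, a unified diffusion model that brings in graphics-based multimodal cues (such as intrinsics and semantics) to improve both video understanding and controllable video generation, addressing the insufficiency of geometry-only cues and the challenges of multimodal fusion.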

Details Motivation: Existing methods based on geometric cues (such as depth and edges) under-constrain appearance, materials, and illumination in video understanding and generation, which limits editing and often causes temporal drift. Richer modalities are needed for physically plausible controllable generation. Method: CtrlVDiff combines multimodal inputs, including depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), uses a Hybrid Modality Control Strategy (HMCS) to route and fuse features, and is supported by the MMVideo dataset, which provides cross-modally aligned training data. Result: On video understanding and generation tasks, CtrlVDiff outperforms state-of-the-art methods in controllability, generation quality, and temporal consistency, supports layer-wise edits (such as relighting, material swaps, and object insertion), and stays robust when some modalities are missing. Conclusion: Adding more graphics-related modalities and a flexible fusion architecture improves the understanding and generation abilities of video diffusion models and enables more precise, physically plausible video editing. Abstract: We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold. First, geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Second, enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations. We then propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions.
Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.

[104] G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

Wenbo Hu,Jingli Lin,Yilin Long,Yunlong Ran,Lihan Jiang,Yifan Wang,Chenming Zhu,Runsen Xu,Tai Wang,Jiangmiao Pang

Main category: cs.CV

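TL;DR: G$^2$VLM is a geometry-grounded vision-language model that improves spatial understanding and reasoning by incorporating 3D visual geometry features; trained on multi-view image and video data without extra annotations, it performs strongly on 3D reconstruction and spatial reasoning tasks.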

Details Motivation: Existing vision-language models are weak at spatial intelligence, mainly because they lack a geometry learning process that reconstructs 3D space from 2D images, which limits spatial understanding and reasoning. Method: G$^2$VLM integrates 3D visual geometry features into the vision-language framework and uses in-context learning and interleaved reasoning to directly predict 3D attributes and enhance spatial understanding tasks; it trains on multi-view image and video data while incorporating 3D visual priors that are otherwise hard to obtain. Result: Experiments show that G$^2$VLM achieves performance comparable to state-of-the-art feed-forward models on 3D reconstruction and better or competitive results on multiple spatial understanding and reasoning tasks. Conclusion: G$^2$VLM unifies a semantically strong vision-language model with low-level 3D vision tasks, gives the community a strong baseline for spatial understanding, and may enable future applications such as 3D scene editing. Abstract: Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.

[105] DeepRFTv2: Kernel-level Learning for Image Deblurring

Xintian Mao,Haofei Song,Yin-Nian Liu,Qingli Li,Yan Wang

Main category: cs.CV

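TL;DR: Proposes the Fourier Kernel Estimator (FKE), which turns the convolution problem into a multiplication problem in the frequency domain, enabling low-complexity kernel-level learning of the blur process without additional supervision, and combines it with a decoupled multi-scale architecture to improve image deblurring.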

Details Motivation: Existing deep networks are confined to pixel-level learning and cannot truly understand the blur process, so a method that learns the blur process at the kernel level is needed to improve deblurring. Method: The Fourier Kernel Estimator (FKE) performs the activation operation in Fourier space and converts spatial-domain convolution into frequency-domain multiplication; the object of convolution is changed from the image to semantically rich network features, and a decoupled multi-scale architecture with a reversible strategy improves feature-extraction efficiency and multi-scale representation. Result: The method reaches SOTA on motion deblurring, the kernel estimator learns physically meaningful blur kernels, and it shows potential for other kernel-related problems. Conclusion: FKE enables efficient kernel-level blur modeling without additional supervision; frequency-domain processing and feature-level convolution give the network a deeper understanding of blur and markedly improve deblurring. Abstract: It is well-known that if a network aims to learn how to deblur, it should understand the blur process. Blurring is naturally caused by the convolution of the sharp image with the blur kernel. Thus, allowing the network to learn the blur process at the kernel level can significantly improve the image deblurring performance. However, current deep networks are still at the pixel-level learning stage, either performing end-to-end pixel-level restoration or stage-wise pseudo kernel-level restoration, failing to enable the deblur model to understand the essence of the blur. To this end, we propose Fourier Kernel Estimator (FKE), which considers the activation operation in Fourier space and converts the convolution problem in the spatial domain to a multiplication problem in Fourier space. Our FKE, jointly optimized with the deblur model, enables the network to learn the kernel-level blur process with low complexity and without any additional supervision. Furthermore, we change the convolution object of the kernel from "image" to network-extracted "feature", whose rich semantic and structural information is more suitable for blur process learning. With the convolution of the feature and the estimated kernel, our model can learn the essence of blur at the kernel level. To further improve the efficiency of feature extraction, we design a decoupled multi-scale architecture with multiple hierarchical sub-unets with a reversible strategy, which allows better multi-scale encoding and decoding in low training memory.
Extensive experiments indicate that our method achieves state-of-the-art motion deblurring results and shows potential for handling other kernel-related problems. Analysis also shows our kernel estimator is able to learn physically meaningful kernels. The code will be available at https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur.
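
The DeepRFTv2 abstract leans on the convolution theorem: blurring is a convolution in the spatial domain, which becomes an element-wise multiplication in Fourier space. The NumPy sketch below checks that identity numerically for circular convolution. It is not the FKE module itself; the function names `blur_spatial` and `blur_fourier` are hypothetical, and circular boundary handling is an assumption made so that the two results match exactly.

```python
import numpy as np

def blur_spatial(image, kernel):
    """Circular 2D convolution computed directly in the spatial domain."""
    kh, kw = kernel.shape
    out = np.zeros(image.shape, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            # np.roll(image, (i, j))[x, y] equals image[x - i, y - j] (wrapped).
            out += kernel[i, j] * np.roll(image, shift=(i, j), axis=(0, 1))
    return out

def blur_fourier(image, kernel):
    """The same circular convolution as a product of 2D FFTs."""
    # Zero-pad the kernel to the image size before transforming.
    kernel_f = np.fft.fft2(kernel, s=image.shape)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * kernel_f))

rng = np.random.default_rng(0)
sharp = rng.random((8, 8))
box_kernel = np.ones((3, 3)) / 9.0  # simple box blur, for illustration only
same = np.allclose(blur_spatial(sharp, box_kernel), blur_fourier(sharp, box_kernel))
```

The spatial version costs one shifted addition per kernel entry, while the Fourier version costs a fixed number of FFTs regardless of kernel size, which is one reason kernel estimation in the frequency domain can be cheap.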

[106] Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning

Changlin Li,Jiawei Zhang,Shuhao Liu,Sihao Lin,Zeyi Shi,Zhihui Li,Xiaojun Chang

Main category: cs.CV

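TL;DR: Proposes Ent-Prog, an efficient training framework for diffusion models in human video generation that uses entropy-guided prioritized training and an adaptive progressive schedule to substantially cut training time and GPU memory while maintaining generation performance.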

Details Motivation: Training diffusion models on high-resolution, multi-frame data incurs high computational cost and heavy GPU memory consumption, so a more efficient training method is needed. Method: Entropy-Guided Prioritized Progressive Learning (Ent-Prog) has two key parts: Conditional Entropy Inflation (CEI), which assesses the importance of model components to enable prioritized training, and an adaptive progressive training schedule that dynamically increases computational complexity according to convergence efficiency. Result: Experiments on three datasets show up to 2.2x training speedup and 2.4x GPU memory reduction without hurting generation performance. Conclusion: Ent-Prog is an efficient training framework for human video generation diffusion models that lowers resource consumption while maintaining generation quality, and it is practical and extensible. Abstract: Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that adaptively increases computational complexity during training by measuring the convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets demonstrate the effectiveness of Ent-Prog, achieving up to 2.2$\times$ training speedup and 2.4$\times$ GPU memory reduction without compromising generative performance.

[107] Referring Video Object Segmentation with Cross-Modality Proxy Queries

Baoli Sun,Xinzhu Ma,Ning Wang,Zhihui Wang,Zhiyong Wang

Main category: cs.CV

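TL;DR: This paper proposes ProxyFormer, a new architecture for referring video object segmentation (RVOS) that introduces proxy queries to strengthen the alignment of visual and language semantics and propagates their updates across multiple stages, improving inter-frame dependency modeling and target tracking accuracy.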

Details Motivation: Existing methods lack inter-frame dependency modeling in cross-modality alignment and integrate textual constraints too late, making it hard to track the target object accurately. Method: ProxyFormer introduces learnable proxy queries that are dynamically updated and propagated across multiple stages of the video feature encoder to deeply fuse visual and textual semantics; cross-modality interactions are decoupled into temporal and spatial dimensions to reduce computational cost, and a Joint Semantic Consistency (JSC) training strategy is designed. Result: Experiments on four mainstream RVOS benchmarks show that ProxyFormer outperforms existing state-of-the-art methods. Conclusion: Through its proxy query mechanism, ProxyFormer strengthens cross-modality alignment and inter-frame dependency modeling, markedly improving the accuracy and consistency of RVOS. Abstract: Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism built upon transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features to focus on the non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of the video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs.
Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of our ProxyFormer over state-of-the-art methods.

[108] TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models

Jiaming He,Guanyu Hou,Hongwei Li,Zhicong Huang,Kangjie Chen,Yi Yu,Wenbo Jiang,Guowen Xu,Tianwei Zhang

Main category: cs.CV

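TL;DR: This paper proposes TEAR, a temporal-aware automated red-teaming framework for uncovering safety risks in text-to-video (T2V) models that are tied to temporal dynamics.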

Details Motivation: Existing safety evaluation methods target static image and text generation and cannot capture the complex temporal dynamics of video generation, so the temporal safety of T2V models needs dedicated evaluation. Method: TEAR uses a temporal-aware test generator optimized in two stages, initial generator training followed by temporal-aware online preference learning, and a refine model that cyclically improves prompt stealthiness and adversarial effectiveness. Result: Extensive experiments on open-source and commercial T2V systems show that TEAR achieves an attack success rate above 80%, a significant improvement over the previous best of 57%. Conclusion: TEAR effectively exposes safety risks related to temporal dynamics in T2V models and provides new methodological support for the safety evaluation of future video generation models. Abstract: Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose a TEmporal-aware Automated Red-teaming framework, named TEAR, an automated framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach: initial generator training and temporal-aware online preference learning, to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. In addition, a refine model is adopted to cyclically improve prompt stealthiness and adversarial effectiveness. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems with over 80% attack success rate, a significant boost from the prior best result of 57%.

[109] LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs

Shichu Sun,Yichen Zhang,Haolin Song,Zonghao Guo,Chi Chen,Yidan Zhang,Yuan Yao,Zhiyuan Liu,Maosong Sun

Main category: cs.CV

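TL;DR: This paper proposes LLaVA-UHD v3, a multimodal large language model built around Progressive Visual Compression (PVC), which substantially reduces inference latency while maintaining strong performance.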

Details Motivation: The authors aim to address the high computational overhead of global native-resolution visual encoding and explore a more efficient way to process visual tokens. Method: The PVC method comprises refined patch embedding and hierarchical windowed token compression modules, and can be integrated into a standard ViT for efficient encoding. Result: ViT-UHD performs well on multiple benchmarks and reduces TTFT by 2.4x compared with MoonViT; LLaVA-UHD v3 performs on par with Qwen2-VL while reducing TTFT by 1.9x. Conclusion: PVC effectively balances efficiency and performance in multimodal models and offers a viable path toward efficient MLLMs. Abstract: Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Many recent MLLMs increasingly favor global native-resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual modeling, and (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates competitive performance with MoonViT while reducing TTFT (time-to-first-token) by 2.4x, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves competitive performance to Qwen2-VL, while further reducing TTFT by 1.9x. We will release all code and checkpoints to support future research on efficient MLLMs.
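
To make the idea of windowed token compression in the LLaVA-UHD v3 entry concrete, the sketch below merges each non-overlapping 2x2 window of a ViT token grid into one token. The abstract does not say how tokens inside a window are aggregated, so plain average pooling is an assumption here, and `window_pool_tokens` is a hypothetical name rather than the paper's module.

```python
import numpy as np

def window_pool_tokens(tokens, grid_h, grid_w, window=2):
    """Average tokens inside non-overlapping window x window patches.

    tokens: array of shape (grid_h * grid_w, dim), in row-major grid order.
    Returns an array of shape ((grid_h // window) * (grid_w // window), dim).
    Averaging is an illustrative assumption, not the paper's operator.
    """
    n, dim = tokens.shape
    if n != grid_h * grid_w:
        raise ValueError("token count does not match the grid size")
    if grid_h % window or grid_w % window:
        raise ValueError("grid size must be divisible by the window size")
    # Split rows and columns into (block index, offset inside block).
    grid = tokens.reshape(grid_h // window, window, grid_w // window, window, dim)
    pooled = grid.mean(axis=(1, 3))  # average over the two offset axes
    return pooled.reshape(-1, dim)
```

Each application divides the token count by `window ** 2`, so applying it at several depths of the encoder shrinks the sequence progressively, which is the general effect the abstract attributes to hierarchical deployment across ViT layers.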

[110] Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation

Joonhyung Park,Hyeongwon Jang,Joowon Kim,Eunho Yang

Main category: cs.CV

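TL;DR: Proposes GridAR, a framework that uses grid-partitioned progressive generation and layout-specified prompt reformulation to improve the quality and efficiency of visual autoregressive models under test-time compute scaling.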

Details Motivation: Test-time compute scaling has not yet been explored for visual autoregressive models, and erroneous generation trajectories together with the lack of a blueprint of the whole canvas limit its benefits. Method: GridAR generates progressively over a partitioned grid, prunes infeasible candidates early, and fixes viable ones as anchors to guide subsequent decoding; it is paired with a layout-specified prompt reformulation strategy that infers a feasible layout from partial views to guide generation. Result: With N=4, GridAR outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%; on PIE-Bench image editing it improves semantic preservation by 13.9%. Conclusion: GridAR improves the generation quality and efficiency of visual autoregressive models under limited compute, outperforms conventional large-scale sampling, and extends to image editing. Abstract: Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, GridAR achieves higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%.
It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.

[111] Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding

Yutao Tang,Cheng Zhao,Gaurav Mittal,Rohith Kukkala,Rama Chellappa,Cheng Peng,Mei Chen

Main category: cs.CV

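TL;DR: This paper proposes NDTokenizer3D, a generalist 3D vision-language model based on a multi-scale NDT representation, whose novel three-stage scene tokenization pipeline enables fine-grained, unified 3D scene understanding and strong results across multiple tasks.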

Details Motivation: Existing 3D vision-language models still struggle to tokenize 3D scenes into holistic scene tokens and apply them across diverse tasks, and they lack a unified architecture that supports human interaction. Method: NDTokenizer3D consists of a multi-scale NDT representation builder and a Multi-Scale NDT Decoder (MSDec); it first builds a multi-scale NDT representation from raw point clouds, then uses MSDec to progressively fuse cross-scale features into scene tokens consumable by an LLM, and reuses MSDec for interactive prompting and segmentation-mask decoding. Result: The model achieves significant improvements on 3D referring segmentation, 3D visual question answering, and 3D dense captioning, validating its fine-grained understanding and generality. Conclusion: With a unified multi-scale NDT framework, NDTokenizer3D delivers efficient, general-purpose 3D scene understanding, supports multiple tasks and human interaction, and offers a new direction for 3D vision-language modeling. Abstract: Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.

[112] When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models

Hui Lu,Yi Yu,Yiming Yang,Chenyu Yi,Qixin Zhang,Bingquan Shen,Alex C. Kot,Xudong Jiang

Main category: cs.CV

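TL;DR: Proposes UPA-RFAS, a framework for universal, transferable adversarial patch attacks on vision-language-action (VLA) models, with strong transfer across models, tasks, and viewpoints.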

Details Motivation: Most existing adversarial patches overfit to a single model and perform poorly in black-box settings, and universal, transferable attacks on VLA models remain underexplored. Method: UPA-RFAS comprises (1) learning in a shared feature space, combining an $\ell_1$ deviation prior with a repulsive InfoNCE loss; (2) a robustness-augmented two-phase min-max optimization; and (3) two VLA-specific losses, Patch Attention Dominance and Patch Semantic Misalignment, which attack text-to-vision attention and image-text matching without labels. Result: Experiments across multiple VLA models, manipulation environments, and real physical executions confirm the effectiveness of UPA-RFAS, showing consistent transfer across models, tasks, and viewpoints. Conclusion: UPA-RFAS exposes a practical patch-based attack risk for VLA-driven robots and provides a strong baseline for future defense research. Abstract: Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of universal, transferable adversarial patches against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce UPA-RFAS (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an $\ell_1$ deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text$\to$vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.

[113] You Can Trust Your Clustering Model: A Parameter-free Self-Boosting Plug-in for Deep Clustering

Hanyang Li,Yuheng Jia,Hui Liu,Junhui Hou

Main category: cs.CV

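TL;DR: Proposes DCBoost, a parameter-free plug-in that uses local structural consistency to strengthen the global feature structure in deep clustering, significantly improving the clustering performance of existing models.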

Details Motivation: Existing deep clustering methods suffer from a mismatch between global and local feature structures, leading to intertwined cluster boundaries and poor separation. Method: High-confidence samples are selected through adaptive k-nearest-neighbor consistency filtering to serve as reliable anchors, and a discriminative loss built on these samples optimizes the network for intra-class compactness and inter-class separability. Result: The method significantly improves various deep clustering models on multiple benchmark datasets, improving on current state-of-the-art baselines (e.g., ProPos) by more than 3% and raising the silhouette coefficient by over 7x. Conclusion: DCBoost effectively improves the global structure quality of deep clustering models and is a general, efficient enhancement module. Abstract: Recent deep clustering models have produced impressive clustering performance. However, a common issue with existing methods is the disparity between global and local feature structures. While local structures typically show strong consistency and compactness within class samples, global features often present intertwined boundaries and poorly separated clusters. Motivated by this observation, we propose DCBoost, a parameter-free plug-in designed to enhance the global feature structures of current deep clustering models. By harnessing reliable local structural cues, our method aims to elevate clustering performance effectively. Specifically, we first identify high-confidence samples through adaptive $k$-nearest neighbors-based consistency filtering, aiming to select a sufficient number of samples with high label reliability to serve as trustworthy anchors for self-supervision. Subsequently, these samples are utilized to compute a discriminative loss, which promotes both intra-class compactness and inter-class separability, to guide network optimization. Extensive experiments across various benchmark datasets showcase that our DCBoost significantly improves the clustering performance of diverse existing deep clustering models. Notably, our method improves the performance of current state-of-the-art baselines (e.g., ProPos) by more than 3% and amplifies the silhouette coefficient by over $7\times$. Code is available at .
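
The DCBoost entry selects high-confidence samples by checking whether a sample's nearest neighbors agree with its own cluster assignment. The sketch below shows that general idea with a fixed `k` and a fixed agreement threshold. The paper describes an adaptive, parameter-free variant whose details are not given in the abstract, so `knn_consistent_mask`, `k`, and `min_agreement` are all illustrative assumptions.

```python
import numpy as np

def knn_consistent_mask(features, labels, k=5, min_agreement=1.0):
    """Flag samples whose k nearest neighbors share their pseudo-label.

    Hypothetical helper: fixed k and threshold are simplifying assumptions,
    not the adaptive rule used by DCBoost.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(features ** 2, axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Fraction of neighbors whose label matches the sample's own label.
    agreement = (labels[neighbors] == labels[:, None]).mean(axis=1)
    return agreement >= min_agreement
```

Samples that pass the filter could then act as anchors for a discriminative loss, while samples near a mislabeled point are held back because their neighborhoods are no longer unanimous.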

[114] BotaCLIP: Contrastive Learning for Botany-Aware Representation of Earth Observation Data

Selene Cerna,Sara Si-Moussi,Wilfried Thuiller,Hadrien Hendrikx,Vincent Miele

Main category: cs.CV

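TL;DR: This paper proposes BotaCLIP, a lightweight multimodal contrastive framework that injects domain knowledge (botanical survey data) into a pre-trained Earth Observation foundation model (DOFA), using contrastive learning with a regularization strategy to mitigate catastrophic forgetting; it clearly outperforms the original DOFA and supervised baselines on ecological tasks such as plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation.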

Details Motivation: In data-scarce domains such as biodiversity modeling, a key challenge is injecting expert knowledge into pre-trained foundation models at low cost to improve their performance on domain-specific downstream tasks. Method: BotaCLIP uses lightweight multimodal contrastive learning to align high-resolution aerial imagery with botanical relevés, adapting the pre-trained Earth Observation foundation model (DOFA), and adds a regularization strategy to prevent catastrophic forgetting. Result: On three ecological downstream tasks, the embeddings produced by BotaCLIP consistently outperform the original DOFA model and supervised baselines, confirming the method's effectiveness. Conclusion: Domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings and enable efficient, frugal representation learning, offering a viable path for specialized applications. Abstract: Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relevés. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the resulting embeddings serve as transferable representations for downstream predictors. Motivated by real-world applications in biodiversity modeling, we evaluated BotaCLIP representations in three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation. The results showed consistent improvements over those derived from DOFA and supervised baselines. More broadly, this work illustrates how domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings, enabling frugal representation learning.
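
BotaCLIP aligns aerial imagery with botanical relevés through contrastive learning. The abstract does not state the exact objective, so the sketch below uses the CLIP-style symmetric InfoNCE loss as an assumed stand-in and leaves out the paper's anti-forgetting regularizer. The names `symmetric_info_nce`, `image_emb`, `releve_emb`, and the temperature value are all hypothetical.

```python
import numpy as np

def _log_softmax(logits):
    """Row-wise log-softmax with max subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_info_nce(image_emb, releve_emb, temperature=0.07):
    """CLIP-style contrastive loss; row i of each input forms a positive pair.

    Assumed objective for illustration, not confirmed by the abstract.
    """
    a = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    b = releve_emb / np.linalg.norm(releve_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # cosine similarities, scaled
    idx = np.arange(len(a))
    # Cross-entropy with the matching index as the target, in both directions.
    loss_image_to_releve = -_log_softmax(logits)[idx, idx].mean()
    loss_releve_to_image = -_log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_image_to_releve + loss_releve_to_image)
```

The loss is near zero when each image embedding is most similar to its own relevé embedding and grows when pairs are mismatched, which is what pushes the two modalities into a shared space.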

[115] Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition

Baoli Sun,Yihan Wang,Xinzhu Ma,Zhihui Wang,Kun Lu,Zhiyong Wang

Main category: cs.CV

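TL;DR: This paper proposes ART, a new framework for fine-grained action recognition that uses a query-response mechanism to discover and track the dynamics of local regions in video, and uses text-constrained semantic queries and multi-level contrastive learning to better distinguish similar actions.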

Details Motivation: Existing action recognition methods struggle to capture the subtle differences between fine-grained action categories, especially fine dynamic changes in local spatio-temporal regions. Method: The Action-Region Tracking (ART) framework incorporates a region-specific semantic activation module that uses discriminative and text-constrained semantics as queries to capture the most relevant region responses in each frame and builds action tracklets from them; a multi-level tracklet contrastive constraint and a task-specific fine-tuning mechanism are introduced to optimize the representations. Result: Extensive experiments on several mainstream action recognition benchmarks show that the method outperforms previous state-of-the-art baselines. Conclusion: The ART framework effectively improves fine-grained action recognition by explicitly modeling local region dynamics and incorporating fine-grained semantics extracted by VLMs, enabling precise discrimination of similar actions. Abstract: Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions evolving over time. In this work, we introduce the Action-Region Tracking (ART) framework, a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details, enabling effective distinction of similar actions. Specifically, we propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction among spatial and temporal dimensions with corresponding video features. The captured region responses are organized into action tracklets, which characterize region-based action dynamics by linking related responses across video frames in a coherent sequence. The text-constrained queries encode nuanced semantic representations derived from textual descriptions of action labels extracted by language branches within Visual Language Models (VLMs). To optimize the action tracklets, we design a multi-level tracklet contrastive constraint among region responses at spatial and temporal levels, enabling effective discrimination within each frame and correlation between adjacent frames. Additionally, a task-specific fine-tuning mechanism refines textual semantics such that semantic representations encoded by VLMs are preserved while optimized for task preferences. Comprehensive experiments on widely used action recognition benchmarks demonstrate its superiority over previous state-of-the-art baselines.

[116] From Diffusion to One-Step Generation: A Comparative Study of Flow-Based Models with Application to Image Inpainting

Umang Agarwal,Rudraksh Sangore,Sumit Laddha

Main category: cs.CV

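TL;DR: This paper compares three generative modeling paradigms (DDPM, CFM, and MeanFlow) under a unified small network architecture. CFM performs best with 50 sampling steps, MeanFlow supports one-step generation and substantially speeds up inference, and CFM is successfully extended to image inpainting.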

Details Motivation: To systematically compare different generative models under identical conditions and to explore how efficient sampling methods combine with practical applications such as image inpainting. Method: DDPM, CFM, and MeanFlow are implemented and compared on CIFAR-10 with a unified TinyUNet architecture; generation quality is evaluated with FID, and CFM is applied to image inpainting with a mask-guided sampling strategy. Result: CFM reaches an FID of 24.15 at 50 steps, far better than DDPM (FID 402.98); MeanFlow reaches an FID of 29.15 with single-step generation, with inference 50 times faster; for inpainting with CFM, PSNR and SSIM improve by 73% and 45%, respectively. Conclusion: CFM excels in generation quality, MeanFlow has a clear advantage in inference efficiency, and CFM extends effectively to practical tasks such as image inpainting, confirming its flexibility and practicality. Abstract: We present a comprehensive comparative study of three generative modeling paradigms: Denoising Diffusion Probabilistic Models (DDPM), Conditional Flow Matching (CFM), and MeanFlow. While DDPM and CFM require iterative sampling, MeanFlow enables direct one-step generation by modeling the average velocity over time intervals. We implement all three methods using a unified TinyUNet architecture (<1.5M parameters) on CIFAR-10, demonstrating that CFM achieves an FID of 24.15 with 50 steps, significantly outperforming DDPM (FID 402.98). MeanFlow achieves FID 29.15 with single-step sampling -- a 50X reduction in inference time. We further extend CFM to image inpainting, implementing mask-guided sampling with four mask types (center, random bbox, irregular, half). Our fine-tuned inpainting model achieves substantial improvements: PSNR increases from 4.95 to 8.57 dB on center masks (+73%), and SSIM improves from 0.289 to 0.418 (+45%), demonstrating the effectiveness of inpainting-aware training.
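
The contrast between iterative CFM sampling and one-step MeanFlow sampling can be illustrated with a toy sketch. It is not the paper's code: the velocity fields below are closed-form stand-ins for trained networks, the time direction (t=0 noise, t=1 data) is an assumed convention, and all function names are hypothetical. The helper `relative_gain` simply reproduces the percentage arithmetic quoted in the abstract.

```python
import numpy as np

TARGET = np.array([2.0, -1.0])  # toy "data" point; an assumption for illustration

def toy_velocity(x, t):
    """Instantaneous velocity of the straight path from x0 to TARGET (stand-in for a CFM network)."""
    return (TARGET - x) / (1.0 - t)

def toy_mean_velocity(x, r, t):
    """Average velocity over [r, t] for the same straight path (stand-in for a MeanFlow network)."""
    return (TARGET - x) / (1.0 - r)

def euler_sample(x0, velocity, num_steps):
    """Iterative sampling: integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
    return x

def one_step_sample(x0, mean_velocity):
    """One-step sampling: a single jump using the average velocity over the whole interval."""
    x0 = np.asarray(x0, dtype=float)
    return x0 + (1.0 - 0.0) * mean_velocity(x0, 0.0, 1.0)

def relative_gain(before, after):
    """Relative improvement, e.g. 0.73 for +73%."""
    return (after - before) / before

x0 = np.array([0.3, 0.7])
print(euler_sample(x0, toy_velocity, 50))       # 50 network evaluations
print(one_step_sample(x0, toy_mean_velocity))   # 1 network evaluation
print(relative_gain(4.95, 8.57), relative_gain(0.289, 0.418))
```

Both samplers reach the same endpoint in this toy setting; the difference is the number of function evaluations (50 versus 1), which is where the reported 50X reduction in inference time comes from.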

[117] T3-Tracer: A Tri-level Temporal-Aware Framework for Audio Forgery Detection and Localization

Shuhan Xia,Xuannan Liu,Xing Cui,Peipei Li

Main category: cs.CV

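TL;DR: This paper proposes T3-Tracer, the first framework that jointly analyzes audio at the frame, segment, and audio levels to detect partial audio forgery, achieving fine-grained forgery detection and boundary identification through its FA-FAM and SMDAM modules.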

Details Motivation: Existing methods lack multi-level temporal modeling and struggle to capture both transient and sustained anomalies in partial audio forgeries, so a more comprehensive hierarchical detection mechanism is needed. Method: The T3-Tracer framework comprises the FA-FAM module, which fuses frame-level and audio-level features to detect the authenticity of each frame, and the SMDAM module, which identifies segment-level forgery boundaries through dual-branch multi-scale modeling. Result: Experiments on three challenging datasets show that the method achieves state-of-the-art performance on partial audio forgery detection. Conclusion: Through joint multi-level analysis, T3-Tracer effectively improves partial audio forgery detection, performing especially well at localizing forgery boundaries and capturing semantic inconsistencies. Abstract: Recently, partial audio forgery has emerged as a new form of audio manipulation. Attackers selectively modify partial but semantically critical frames while preserving the overall perceptual authenticity, making such forgeries particularly difficult to detect. Existing methods focus on independently detecting whether a single frame is forged, lacking the hierarchical structure to capture both transient and sustained anomalies across different temporal levels. To address these limitations, we identify three key levels relevant to partial audio forgery detection and present T3-Tracer, the first framework that jointly analyzes audio at the frame, segment, and audio levels to comprehensively detect forgery traces. T3-Tracer consists of two complementary core modules: the Frame-Audio Feature Aggregation Module (FA-FAM) and the Segment-level Multi-Scale Discrepancy-Aware Module (SMDAM). FA-FAM is designed to detect the authenticity of each audio frame. It combines both frame-level and audio-level temporal information to detect intra-frame forgery cues and global semantic inconsistencies. To further refine and correct frame detection, we introduce SMDAM to detect forgery boundaries at the segment level. It adopts a dual-branch architecture that jointly models frame features and inter-frame differences across multi-scale temporal windows, effectively identifying abrupt anomalies that appeared on the forged boundaries. Extensive experiments conducted on three challenging datasets demonstrate that our approach achieves state-of-the-art performance.

[118] FIELDS: Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision

Chen Ling,Henglin Shi,Hedvig Kjellström

Main category: cs.CV

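TL;DR: This paper proposes FIELDS, which introduces direct supervision on 3D expression parameters and an emotion recognition branch to address the loss of subtle emotional detail in existing 3D face reconstruction caused by reliance on 2D supervision.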

Details Motivation: Existing 3D face reconstruction methods often miss subtle emotional information when handling spontaneous facial expressions, mainly because of the lack of real 3D annotations and over-reliance on 2D image supervision. Method: FIELDS combines self-supervised 2D image consistency constraints, real 3D expression parameter supervision from 4D facial scans, and an intensity-aware emotion recognition loss to improve the realism and emotional accuracy of 3D expression modeling. Result: FIELDS generates highly realistic, emotion-rich 3D face models from a single image and significantly improves in-the-wild facial expression recognition while preserving naturalness. Conclusion: The method effectively bridges the gap between the 2D and 3D domains, mitigates expression-intensity bias, and achieves high-fidelity reconstruction of subtle emotional cues. Abstract: Facial expressions convey the bulk of emotional information in human communication, yet existing 3D face reconstruction methods often miss subtle affective details due to reliance on 2D supervision and lack of 3D ground truth. We propose FIELDS (Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision) to address these limitations by extending self-supervised 2D image consistency cues with direct 3D expression parameter supervision and an auxiliary emotion recognition branch. Our encoder is guided by authentic expression parameters from spontaneous 4D facial scans, while an intensity-aware emotion loss encourages the 3D expression parameters to capture genuine emotion content without exaggeration. This dual-supervision strategy bridges the 2D/3D domain gap and mitigates expression-intensity bias, yielding high-fidelity 3D reconstructions that preserve subtle emotional cues. From a single image, FIELDS produces emotion-rich face models with highly realistic expressions, significantly improving in-the-wild facial expression recognition performance without sacrificing naturalness.

[119] Shift-Equivariant Complex-Valued Convolutional Neural Networks

Quentin Gabot,Teck-Yian Lim,Jérémy Fix,Joana Frontera-Pons,Chengfang Ren,Jean-Philippe Ovarlez

Main category: cs.CV

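TL;DR: This paper extends Learnable Polyphase Sampling (LPS) to complex-valued neural networks and proposes a projection layer from the complex domain to the real domain, combined with the Gumbel Softmax. Its effectiveness for shift invariance and equivariance is validated on classification, reconstruction, and semantic segmentation of polarimetric SAR images.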

Details Motivation: Conventional convolutional neural networks lose shift equivariance and invariance because of their downsampling and upsampling operations and offer no systematic structural guarantee; data augmentation can partly compensate, but a construction with theoretical guarantees is still needed. Method: LPS is generalized to complex-valued neural networks, and a projection layer from the complex domain (ℂ) to the real domain (ℝ) is designed to process complex features before the Gumbel Softmax, thereby preserving shift equivariance/invariance. Result: The method is validated on several vision tasks (classification, reconstruction, semantic segmentation) on polarimetric SAR images, showing advantages in both theoretical properties and practical performance. Conclusion: Building a complex-valued LPS module with theoretical guarantees effectively achieves shift equivariance and invariance in neural networks and opens a new direction for complex-valued networks in remote sensing image analysis. Abstract: Convolutional neural networks have shown remarkable performance in recent years on various computer vision problems. However, the traditional convolutional neural network architecture lacks a critical property: shift equivariance and invariance, broken by downsampling and upsampling operations. Although data augmentation techniques can help the model learn the latter property empirically, a consistent and systematic way to achieve this goal is by designing downsampling and upsampling layers that theoretically guarantee these properties by construction. Adaptive Polyphase Sampling (APS) introduced the cornerstone for shift invariance, later extended to shift equivariance with Learnable Polyphase up/downsampling (LPS) applied to real-valued neural networks. In this paper, we extend the work on LPS to complex-valued neural networks both from a theoretical perspective and with a novel building block of a projection layer from $\mathbb{C}$ to $\mathbb{R}$ before the Gumbel Softmax. We finally evaluate this extension on several computer vision problems, specifically for either the invariance property in classification tasks or the equivariance property in both reconstruction and semantic segmentation problems, using polarimetric Synthetic Aperture Radar images.
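
To make the polyphase idea behind APS and LPS concrete, here is a minimal 1D sketch on complex-valued data. It is an illustration under assumptions, not the paper's layer: the component is chosen by its largest modulus norm (a fixed rule standing in for the learned selection with a $\mathbb{C}$ to $\mathbb{R}$ projection and Gumbel Softmax), the signal length is assumed divisible by the stride, and shifts are circular.

```python
import numpy as np

def aps_downsample(x, stride=2):
    """Keep the polyphase component with the largest norm instead of always x[0::stride]."""
    x = np.asarray(x)
    components = [x[k::stride] for k in range(stride)]
    # np.linalg.norm uses |z| for complex entries, giving a real-valued score per component.
    norms = [np.linalg.norm(c) for c in components]
    return components[int(np.argmax(norms))]

rng = np.random.default_rng(0)
signal = rng.normal(size=8) + 1j * rng.normal(size=8)  # toy complex-valued signal

out = aps_downsample(signal)
out_shifted = aps_downsample(np.roll(signal, 1))

# Plain strided sampling picks different samples after a one-sample shift,
# whereas the norm-based choice returns the same samples up to a circular shift.
print(np.allclose(np.sort_complex(out), np.sort_complex(out_shifted)))
```

A one-sample shift swaps the even and odd components, so a selection rule that depends only on the content of each component picks the same samples either way.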

[120] AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs

Shuhan Xia,Peipei Li,Xuannan Liu,Dongsen Zhang,Xinyu Guo,Zekun Li

Main category: cs.CV

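TL;DR: AVFakeBench is the first comprehensive audio-video forgery detection benchmark. It covers rich forgery semantics with multi-level annotations, supports multi-task evaluation, and reveals the weaknesses of existing audio-video large models in fine-grained perception and reasoning.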

Details Motivation: Existing forgery detection benchmarks are limited to DeepFake forgeries and single-granularity annotations and cannot reflect the complex, diverse forgery scenarios of the real world, so a more comprehensive and diverse benchmark is urgently needed. Method: AVFakeBench contains 12K carefully curated audio-video questions covering seven forgery types and four levels of annotation; a multi-stage hybrid forgery framework combines proprietary models for task planning with expert generative models to produce high-quality, diverse forgeries; a multi-task evaluation framework covers binary judgment, forgery type classification, detail selection, and explanatory reasoning. Result: Eleven audio-video large language models and two prevalent detection methods are evaluated on AVFakeBench; the results show that AV-LMMs have potential as emerging forgery detectors but are weak in fine-grained perception and reasoning. Conclusion: AVFakeBench provides a more comprehensive and realistic evaluation platform for audio-video forgery detection, pushes the field toward more complex, multi-semantic settings, and exposes the limits of current models in fine-grained analysis. Abstract: The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human subject and general subject. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery types classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.

[121] LaGen: Towards Autoregressive LiDAR Scene Generation

Sizhuo Zhou,Xiaosong Jia,Fanrui Zhang,Junjie Li,Juyong Zhang,Yukang Feng,Jianwen Sun,Songbur Wong,Junqi You,Junchi Yan

Main category: cs.CV

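TL;DR: This paper proposes LaGen, the first framework capable of long-horizon, frame-by-frame autoregressive generation of LiDAR scenes. It generates high-fidelity 4D point clouds from a single input frame conditioned on bounding boxes, and uses scene decoupling estimation and noise modulation modules to improve interactivity and reduce error accumulation.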

Details Motivation: Existing LiDAR generation methods support only single-frame generation, while prediction methods lack interactivity and cannot generate long horizons frame by frame, so neither meets the need for interactive world models in autonomous driving. Method: The LaGen framework generates LiDAR sequences autoregressively, frame by frame; a scene decoupling estimation module strengthens interactive generation of object-level content, and a noise modulation module suppresses error accumulation over long horizons; an evaluation protocol is built on the nuScenes dataset. Result: Experiments show that LaGen outperforms existing generation and prediction models on long-horizon LiDAR scene generation, especially in the quality of later frames. Conclusion: LaGen is the first framework to support long-horizon interactive LiDAR scene generation, offers a new direction for generative world models in autonomous driving, and makes notable progress in generation quality and interactivity. Abstract: Generative world models for autonomous driving (AD) have become a trending topic. Unlike the widely studied image modality, in this work we explore generative world models for LiDAR data. Existing generation methods for LiDAR data only support single frame generation, while existing prediction approaches require multiple frames of historical input and can only deterministically predict multiple frames at once, lacking interactivity. Both paradigms fail to support long-horizon interactive generation. To this end, we introduce LaGen, which to the best of our knowledge is the first framework capable of frame-by-frame autoregressive generation of long-horizon LiDAR scenes. LaGen is able to take a single-frame LiDAR input as a starting point and effectively utilize bounding box information as conditions to generate high-fidelity 4D scene point clouds. In addition, we introduce a scene decoupling estimation module to enhance the model's interactive generation capability for object-level content, as well as a noise modulation module to mitigate error accumulation during long-horizon generation. We construct a protocol based on nuScenes for evaluating long-horizon LiDAR scene generation. Experimental results comprehensively demonstrate LaGen outperforms state-of-the-art LiDAR generation and prediction models, especially on the later frames.

[122] Unlocking Zero-shot Potential of Semi-dense Image Matching via Gaussian Splatting

Juncheng Chen,Chao Xu,Yanjun Cao

Main category: cs.CV

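TL;DR: This paper proposes MatchGS, the first framework to use 3D Gaussian Splatting (3DGS) for robust zero-shot image matching; geometric correction and a 2D-3D representation alignment strategy significantly improve matching performance.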

Details Motivation: Learning-based image matching depends on high-quality training data, but existing 3DGS suffers from inaccurate geometry and biased depth rendering, which makes it hard to generate reliable correspondence labels. Method: A two-part approach: (1) a geometrically faithful data generation pipeline that refines 3DGS geometry to produce precise correspondence labels; (2) a 2D-3D representation alignment strategy that injects 3D prior knowledge into the 2D matcher to improve cross-view consistency. Result: The generated ground-truth correspondences reduce epipolar error by up to 40 times, enable supervision under extreme viewpoint changes, and provide self-supervisory signals from Gaussian attributes; state-of-the-art matchers trained only on this data improve zero-shot performance on public benchmarks by up to 17.7%. Conclusion: With proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, structurally rich data source and drive a new generation of robust zero-shot image matchers. Abstract: Learning-based image matching critically depends on large-scale, diverse, and geometrically accurate training data. 3D Gaussian Splatting (3DGS) enables photorealistic novel-view synthesis and thus is attractive for data generation. However, its geometric inaccuracies and biased depth rendering currently prevent robust correspondence labeling. To address this, we introduce MatchGS, the first framework designed to systematically correct and leverage 3DGS for robust, zero-shot image matching. Our approach is twofold: (1) a geometrically-faithful data generation pipeline that refines 3DGS geometry to produce highly precise correspondence labels, enabling the synthesis of a vast and diverse range of viewpoints without compromising rendering fidelity; and (2) a 2D-3D representation alignment strategy that infuses 3DGS' explicit 3D knowledge into the 2D matcher, guiding 2D semi-dense matchers to learn viewpoint-invariant 3D representations. Our generated ground-truth correspondences reduce the epipolar error by up to 40 times compared to existing datasets, enable supervision under extreme viewpoint changes, and provide self-supervisory signals through Gaussian attributes. Consequently, state-of-the-art matchers trained solely on our data achieve significant zero-shot performance gains on public benchmarks, with improvements of up to 17.7%. Our work demonstrates that with proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, and structurally-rich data source, paving the way for a new generation of robust zero-shot image matchers.
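
The epipolar error used above to judge correspondence labels can be computed in several ways, and the abstract does not say which one the authors use. The sketch below assumes one common choice, the point-to-epipolar-line distance in the second image; the function name and the toy fundamental matrix are illustrative only.

```python
import numpy as np

def epipolar_distance(F, pts1, pts2):
    """Distance in pixels from each point in image 2 to the epipolar line of its match in image 1."""
    pts1 = np.asarray(pts1, dtype=float)
    pts2 = np.asarray(pts2, dtype=float)
    ones = np.ones((len(pts1), 1))
    x1 = np.hstack([pts1, ones])          # homogeneous coordinates, shape (N, 3)
    x2 = np.hstack([pts2, ones])
    lines2 = x1 @ F.T                     # row i is the line l = F @ x1[i] in image 2
    numerator = np.abs(np.sum(lines2 * x2, axis=1))
    denominator = np.hypot(lines2[:, 0], lines2[:, 1])
    return numerator / denominator

# Toy case: identity intrinsics and a pure sideways translation, so matches must share a row.
F_toy = np.array([[0.0, 0.0, 0.0],
                  [0.0, 0.0, -1.0],
                  [0.0, 1.0, 0.0]])
pts_a = [[10.0, 20.0], [30.0, 40.0]]
pts_b = [[15.0, 20.0], [33.0, 43.0]]      # second match is 3 pixels off its epipolar line
print(epipolar_distance(F_toy, pts_a, pts_b))
```

A perfectly labeled correspondence has zero distance, so the mean of this quantity over a dataset gives a simple measure of label quality.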

[123] Co-Training Vision Language Models for Remote Sensing Multi-task Learning

Qingyun Li,Shuran Ma,Junwei Luo,Yi Yu,Yue Zhou,Fengxiang Wang,Xudong Lu,Xiaoxing Wang,Xin He,Yushi Chen,Xue Yang,Junchi Yan

Main category: cs.CV

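TL;DR: This paper proposes RSCoVLM, a simple yet flexible vision language model (VLM) baseline for remote sensing that supports multi-task learning (MTL). With a unified text interface and a dynamic-resolution strategy, it achieves state-of-the-art performance on a range of remote sensing tasks, and all tools and data are open-sourced.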

Details Motivation: Following the success of Transformers on individual remote sensing tasks, researchers want a unified model that performs well on many tasks at once. Multi-task learning (MTL) offers better generalization, scalability, and practicality, but existing methods still face challenges in handling complex remote sensing data, multi-scale imagery, and fair evaluation. A unified, efficient VLM baseline is therefore needed to advance general-purpose remote sensing models. Method: RSCoVLM comprises: (1) a data curation engine for data acquisition, offline processing, and online loading and weighting; (2) a unified dynamic-resolution strategy that adapts to different image scales; (3) the Zoom-in Chain mechanism for ultra-high-resolution (UHR) images, with its accompanying dataset LRS-VQA-Zoom; (4) enhanced object detection capability and a new evaluation protocol for fair comparison with conventional detection models. Result: RSCoVLM reaches state-of-the-art results on multiple remote sensing tasks, outperforming existing remote sensing VLMs and even rivaling specialized expert models. The proposed strategies reduce the computational burden and improve object detection. All tools, model weights, and datasets are open-sourced. Conclusion: As a simple and flexible VLM baseline, RSCoVLM effectively supports multi-task learning in remote sensing and advances general-purpose remote sensing models; its open-source resources support reproducibility and further research. Abstract: With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning, respectively. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. Firstly, we create the data curation engine, including data acquisition, offline processing and integrating, as well as online loading and weighting. This data engine effectively addresses complex RS data environment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. The strategies are flexible and effectively mitigate the computational burdens. Additionally, we significantly enhance the model's object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All the training and evaluating tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.

[124] PathMamba: A Hybrid Mamba-Transformer for Topologically Coherent Road Segmentation in Satellite Imagery

Jules Decaestecker,Nicolas Vigne

Main category: cs.CV

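TL;DR: This paper proposes PathMamba, a novel hybrid architecture that combines Mamba's state space modeling with the Transformer's global reasoning for road segmentation in satellite imagery. It significantly improves topological continuity while maintaining high accuracy and linear computational efficiency.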

Details Motivation: Existing Vision Transformer based methods achieve high accuracy in road segmentation but have quadratic computational complexity, which hinders efficient deployment; road networks are long, continuous structures that call for a more efficient way of modeling. Method: PathMamba combines Mamba blocks, which capture the continuous topology of roads, with Transformer blocks, which bring in global context, to form a complementary hybrid architecture. Result: The model achieves state-of-the-art performance on the DeepGlobe and Massachusetts Roads datasets, with a notable gain in topological continuity as measured by APLS, while keeping computational cost low. Conclusion: By combining Mamba's linear efficiency with the Transformer's global modeling, PathMamba delivers accurate, topologically consistent road segmentation suited to deployment in resource-constrained settings. Abstract: Achieving both high accuracy and topological continuity in road segmentation from satellite imagery is a critical goal for applications ranging from urban planning to disaster response. State-of-the-art methods often rely on Vision Transformers, which excel at capturing global context, yet their quadratic complexity is a significant barrier to efficient deployment, particularly for on-board processing in resource-constrained platforms. In contrast, emerging State Space Models like Mamba offer linear-time efficiency and are inherently suited to modeling long, continuous structures. We posit that these architectures have complementary strengths. To this end, we introduce PathMamba, a novel hybrid architecture that integrates Mamba's sequential modeling with the Transformer's global reasoning. Our design strategically uses Mamba blocks to trace the continuous nature of road networks, preserving topological structure, while integrating Transformer blocks to refine features with global context. This approach yields topologically superior segmentation maps without the prohibitive scaling costs of pure attention-based models. Our experiments on the DeepGlobe Road Extraction and Massachusetts Roads datasets demonstrate that PathMamba sets a new state-of-the-art. Notably, it significantly improves topological continuity, as measured by the APLS metric, setting a new benchmark while remaining computationally competitive.

[125] CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation

Chenyu Liu,Hongze Chen,Jingzhi Bao,Lingting Zhu,Runze Zhang,Weikai Chen,Zeyu Hu,Yingda Yin,Keyang Luo,Xin Wang

Main category: cs.CV

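TL;DR: This paper proposes CaliTex, a 3D texture generation framework based on geometry-calibrated attention that resolves cross-view inconsistency through structured attention mechanisms.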

Details Motivation: Existing diffusion models for 3D texture generation suffer from cross-view inconsistency, which stems from attention ambiguity that makes the coupling between geometry and appearance unstable. Method: The geometry-calibrated attention framework CaliTex includes Part-Aligned Attention, which spatially aligns semantically matched parts, and Condition-Routed Attention, which passes appearance information through geometry-conditioned pathways; a two-stage diffusion transformer further improves geometric consistency. Result: CaliTex produces visually seamless, view-consistent textures and outperforms existing methods on both open-source and commercial baselines. Conclusion: Explicitly aligning attention with 3D structure effectively improves the geometric coherence of generated textures, making consistency an inherent property of the network rather than a byproduct of optimization. Abstract: Despite major advances brought by diffusion-based models, current 3D texture generation systems remain hindered by cross-view inconsistency -- textures that appear convincing from one viewpoint often fail to align across others. We find that this issue arises from attention ambiguity, where unstructured full attention is applied indiscriminately across tokens and modalities, causing geometric confusion and unstable appearance-structure coupling. To address this, we introduce CaliTex, a framework of geometry-calibrated attention that explicitly aligns attention with 3D structure. It introduces two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention which routes appearance information through geometry-conditioned pathways to maintain spatial fidelity. Coupled with a two-stage diffusion transformer, CaliTex makes geometric coherence an inherent behavior of the network rather than a byproduct of optimization. Empirically, CaliTex produces seamless and view-consistent textures and outperforms both open-source and commercial baselines.

[126] HTTM: Head-wise Temporal Token Merging for Faster VGGT

Weitian Wang,Lukas Meiner,Rai Shubham,Cecilia De La Parra,Akash Kumar

Main category: cs.CV

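TL;DR: The paper proposes HTTM, a training-free 3D token merging method that accelerates the VGGT model, achieving up to 7x faster inference while preserving performance.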

Details Motivation: VGGT requires global attention for 3D scene reconstruction, which causes high inference latency on large scenes, so an acceleration method is needed. Method: Head-wise temporal merging (HTTM) merges tokens at the granularity of individual attention heads and exploits head-level spatial locality and temporal correspondence to reach higher merging ratios at lower computational cost. Result: HTTM achieves up to 7x acceleration in GPU inference with negligible performance loss. Conclusion: Through fine-grained head-wise token merging, HTTM effectively relieves VGGT's computational bottleneck and is an efficient, plug-and-play acceleration scheme. Abstract: The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. Thus, HTTM achieves up to 7x acceleration with negligible performance drops in a GPU-based inference.

[127] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Stefanos Koutoupis,Michaela Areti Zervou,Konstantinos Kontras,Maarten De Vos,Panagiotis Tsakalides,Grigorios Tsagatakis

Main category: cs.CV

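TL;DR: This paper proposes the Contrastive Fusion (ConFu) framework, which extends the conventional contrastive objective to jointly embed individual modalities and their fused representations, capturing higher-order dependencies among modalities while preserving good pairwise alignment.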

Details Motivation: Existing multimodal representation methods focus mainly on pairwise alignment, struggle to model higher-order interactions, and perform poorly on single-modality tasks, so a method is needed that captures higher-order dependencies while preserving pairwise relationships. Method: ConFu adds a fused-modality contrastive term to conventional pairwise contrastive learning, embeds individual modalities and their fused combinations in a unified representation space, and aligns modalities with their fused counterparts. Result: ConFu is validated on synthetic and real-world multimodal benchmarks, where it better exploits cross-modal complementarity, captures higher-order dependencies, and supports unified one-to-one and two-to-one retrieval. Conclusion: ConFu jointly models higher-order relationships and pairwise alignment in multimodal data, performs well across tasks, and provides a unified, scalable solution for multimodal representation learning. Abstract: Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
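
The shape of an objective that adds a fused-modality term to pairwise contrastive terms can be sketched as follows. This is a guess at the general structure, not ConFu's actual loss: the one-directional InfoNCE form, the temperature, the `weight` parameter, and the additive `fuse` function are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(z):
    """Scale each row to unit length so dot products are cosine similarities."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def info_nce(u, v, temperature=0.1):
    """One-directional InfoNCE: row i of u should match row i of v among all rows of v."""
    logits = l2_normalize(u) @ l2_normalize(v).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def fused_contrastive_loss(a, b, c, fuse, weight=1.0):
    """Pairwise terms over three modalities plus a term aligning fuse(a, b) with c."""
    pairwise = info_nce(a, b) + info_nce(a, c) + info_nce(b, c)
    fused = info_nce(fuse(a, b), c)
    return pairwise + weight * fused

rng = np.random.default_rng(0)
a, b, c = (rng.normal(size=(8, 16)) for _ in range(3))       # toy embeddings, 8 samples each
loss = fused_contrastive_loss(a, b, c, fuse=lambda x, y: x + y)  # additive fusion is hypothetical
print(loss)
```

The fused term is what allows two-to-one retrieval: a query built from two modalities is scored against the third in the same embedding space used for the one-to-one pairwise terms.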

[128] Hybrid SIFT-SNN for Efficient Anomaly Detection of Traffic Flow-Control Infrastructure

Munish Rathee,Boris Bačić,Maryam Doborjeh

Main category: cs.CV

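TL;DR: The paper proposes the SIFT-SNN framework for real-time detection of structural anomalies in transport infrastructure, combining SIFT feature extraction with spiking neural network classification to enable accurate, low-latency, low-power edge deployment.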

Details Motivation: Real-time, low-power structural safety monitoring of transport infrastructure (such as movable concrete barriers) in complex environments requires overcoming the limitations of conventional CNN methods in latency, power consumption, and interpretability. Method: The SIFT-SNN framework uses SIFT for spatial feature encoding, followed by a latency-based spike conversion layer and a Leaky Integrate-and-Fire (LIF) spiking neural network (SNN) for classification, and is validated through deployment on an embedded system. Result: On the Auckland Harbour Bridge dataset of 6,000 frames, the system reaches 92.3% classification accuracy (± 0.8%) with a per-frame inference time of 9.5 ms and sparse spike activity of 8.1%, achieving sub-10 ms latency. Conclusion: The SIFT-SNN framework delivers accurate, low-latency, low-power real-time anomaly detection with good interpretability and embedded deployability, and is suited to monitoring traffic-control infrastructure in many cities worldwide, although generalisation to unseen field conditions still needs to be validated. Abstract: This paper presents the SIFT-SNN framework, a low-latency neuromorphic signal-processing pipeline for real-time detection of structural anomalies in transport infrastructure. The proposed approach integrates Scale-Invariant Feature Transform (SIFT) for spatial feature encoding with a latency-driven spike conversion layer and a Leaky Integrate-and-Fire (LIF) Spiking Neural Network (SNN) for classification. The Auckland Harbour Bridge dataset is recorded under various weather and lighting conditions, comprising 6,000 labelled frames that include both real and synthetically augmented unsafe cases. The presented system achieves a classification accuracy of 92.3% (± 0.8%) with a per-frame inference time of 9.5 ms. Achieved sub-10 millisecond latency, combined with sparse spike activity (8.1%), enables real-time, low-power edge deployment. Unlike conventional CNN-based approaches, the hybrid SIFT-SNN pipeline explicitly preserves spatial feature grounding, enhances interpretability, supports transparent decision-making, and operates efficiently on embedded hardware. Although synthetic augmentation improved robustness, generalisation to unseen field conditions remains to be validated. The SIFT-SNN framework is validated through a working prototype deployed on a consumer-grade system and framed as a generalisable case study in structural safety monitoring for movable concrete barriers, which, as a traffic flow-control infrastructure, is deployed in over 20 cities worldwide.
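
The two generic building blocks named in the abstract, latency-driven spike conversion and a Leaky Integrate-and-Fire neuron, can be sketched in a few lines. This is a textbook-style illustration, not the authors' pipeline: features are assumed already normalised to [0, 1], and the decay `beta`, the threshold, and the reset-to-zero rule are assumed values.

```python
import numpy as np

def latency_encode(features, num_steps):
    """Latency coding: each feature emits one spike, and stronger features spike earlier."""
    f = np.clip(np.asarray(features, dtype=float), 0.0, 1.0)
    spike_times = np.round((1.0 - f) * (num_steps - 1)).astype(int)
    spikes = np.zeros((num_steps, f.size))
    spikes[spike_times, np.arange(f.size)] = 1.0
    return spikes                                  # shape (time, feature)

def lif_neuron(input_current, beta=0.9, threshold=1.0):
    """Leaky Integrate-and-Fire: leak, integrate the input, fire and reset at the threshold."""
    v = 0.0
    output = []
    for current in input_current:
        v = beta * v + current                     # leaky integration of the membrane potential
        fired = v >= threshold
        output.append(1.0 if fired else 0.0)
        if fired:
            v = 0.0                                # hard reset after a spike
    return np.array(output)

spike_train = latency_encode([1.0, 0.5, 0.0], num_steps=5)
print(spike_train.argmax(axis=0))                  # spike time per feature
print(lif_neuron(np.full(6, 0.3)))                 # constant sub-threshold drive accumulates
```

Because every feature contributes at most one spike and neurons are silent most of the time, activity stays sparse, which is the property the abstract links to low-power edge deployment.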

[129] SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

Tae-Min Choi,Tae Kyeong Jeong,Garam Kim,Jaemin Lee,Yeongyoon Koh,In Cheul Choi,Jae-Ho Chung,Jong Woong Park,Juyoun Park

Main category: cs.CV

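TL;DR: SurgMLLMBench is a unified multimodal benchmark for developing and evaluating interactive multimodal large language models for surgical scene understanding. It includes pixel-level segmentation masks and structured VQA annotations, and introduces the new MAVIS dataset.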

Details Motivation: Existing surgical datasets mostly use a visual question answering format with heterogeneous taxonomies and lack pixel-level segmentation support, which limits consistent evaluation and application of multimodal models. Method: The unified multimodal benchmark SurgMLLMBench integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical scenes, and introduces the new MAVIS dataset, supporting more comprehensive evaluation under a unified taxonomy. Result: A single model trained on SurgMLLMBench performs consistently across surgical domains and generalizes effectively to unseen datasets. Conclusion: SurgMLLMBench provides a public, robust resource for multimodal surgical AI research and supports reproducible evaluation and the development of interactive surgical reasoning models. Abstract: Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.

[130] PFF-Net: Patch Feature Fitting for Point Cloud Normal Estimation

Qing Li,Huifang Feng,Kanle Shi,Yue Gao,Yi Fang,Yu-Shen Liu,Zhizhong Han

Main category: cs.CV

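TL;DR: The paper proposes a new point cloud normal estimation method based on multi-scale feature fusion, which achieves robust and efficient normal estimation through multi-scale feature aggregation and cross-scale feature compensation.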

Details Motivation: Existing methods have difficulty choosing a suitable local neighborhood size for different data or geometries, and they are parameter-heavy and inefficient, which makes accurate and efficient normal prediction hard. Method: A multi-scale feature fusion strategy is proposed with a patch feature fitting (PFF) model, which contains a multi-scale feature aggregation module (progressively aggregating multi-scale features toward the patch center while shrinking the patch) and a cross-scale feature compensation module (improving the reuse of large-scale features and the association of information across scales). Result: The method achieves state-of-the-art performance on both synthetic and real-world datasets with fewer network parameters and lower running time. Conclusion: The method adapts effectively to local patches of different scales, provides an optimal feature description, and significantly improves the accuracy and efficiency of point cloud normal estimation. Abstract: Estimating the normal of a point requires constructing a local patch to provide center-surrounding context, but determining the appropriate neighborhood size is difficult when dealing with different data or geometries. Existing methods commonly employ various parameter-heavy strategies to extract a full feature description from the input patch. However, they still have difficulties in accurately and efficiently predicting normals for various point clouds. In this work, we present a new idea of feature extraction for robust normal estimation of point clouds. We use the fusion of multi-scale features from different neighborhood sizes to address the issue of selecting reasonable patch sizes for various data or geometries. We seek to model a patch feature fitting (PFF) based on multi-scale features to approximate the optimal geometric description for normal estimation and implement the approximation process via multi-scale feature aggregation and cross-scale feature compensation. The feature aggregation module progressively aggregates the patch features of different scales to the center of the patch and shrinks the patch size by removing points far from the center. It not only enables the network to precisely capture the structure characteristic in a wide range, but also describes highly detailed geometries. The feature compensation module ensures the reusability of features from earlier layers of large scales and reveals associated information in different patch sizes. Our approximation strategy based on aggregating the features of multiple scales enables the model to achieve scale adaptation of varying local patches and deliver the optimal feature description. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets with fewer network parameters and lower running time.

[131] Endo-G$^{2}$T: Geometry-Guided & Temporally Aware Time-Embedded 4DGS For Endoscopic Scenes

Yangle Liu,Fengze Li,Kan Liu,Jieming Ma

Main category: cs.CV

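TL;DR: Proposes Endo-G²T, a geometry-guided, temporally aware training framework for 4D Gaussian splatting in dynamic endoscopic scenes. Through geometric prior distillation, a time-embedded Gaussian field, and keyframe-constrained streaming optimization, it achieves state-of-the-art geometric stability and efficiency in monocular reconstruction.

Details Motivation: Endoscopic video exhibits strong view-dependent effects (specularities, wet reflections, and occlusions). Pure photometric supervision tends to cause early geometric drift, after which erroneous shapes are hard to correct, so geometry needs to be anchored early while temporal consistency and efficiency are maintained. Method: (1) Geometry-guided prior distillation: confidence-gated monocular depth is converted into supervision through scale-invariant depth and depth-gradient losses, with a warm-up-to-cap schedule to avoid early overfitting; (2) Time-embedded Gaussian field: dynamics are modeled in XYZT with a rotor-like rotation parameterization, supported by lightweight regularization for temporal consistency; (3) Keyframe-constrained streaming training: optimization focuses on keyframes under a max-points budget while non-keyframes receive lightweight updates, improving efficiency and long-horizon stability. Result: On the EndoNeRF and StereoMIS-P1 datasets, Endo-G²T achieves state-of-the-art performance among monocular reconstruction methods and clearly outperforms existing 4DGS baselines, with stronger geometric stability and temporal coherence. Conclusion: By combining geometric priors, temporally aware modeling, and an efficient streaming strategy, Endo-G²T mitigates the geometric drift caused by strong view-dependent effects in endoscopic video and offers a practical route to high-quality 4D reconstruction of dynamic endoscopic scenes. Abstract: Endoscopic (endo) video exhibits strong view-dependent effects such as specularities, wet reflections, and occlusions. Pure photometric supervision misaligns with geometry and triggers early geometric drift, where erroneous shapes are reinforced during densification and become hard to correct. We ask how to anchor geometry early for 4D Gaussian splatting (4DGS) while maintaining temporal consistency and efficiency in dynamic endoscopic scenes. Thus, we present Endo-G$^{2}$T, a geometry-guided and temporally aware training scheme for time-embedded 4DGS. First, geo-guided prior distillation converts confidence-gated monocular depth into supervision with scale-invariant depth and depth-gradient losses, using a warm-up-to-cap schedule to inject priors softly and avoid early overfitting. Second, a time-embedded Gaussian field represents dynamics in XYZT with a rotor-like rotation parameterization, yielding temporally coherent geometry with lightweight regularization that favors smooth motion and crisp opacity boundaries. Third, keyframe-constrained streaming improves efficiency and long-horizon stability through keyframe-focused optimization under a max-points budget, while non-keyframes advance with lightweight updates. Across EndoNeRF and StereoMIS-P1 datasets, Endo-G$^{2}$T achieves state-of-the-art results among monocular reconstruction baselines.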
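
The abstract names a scale-invariant depth loss over confidence-gated depth and a warm-up-to-cap schedule but does not give their exact form. The sketch below is one plausible reading, not the authors' implementation: it assumes the widely used log-depth formulation of a scale-invariant loss, a hard confidence threshold, and a linear ramp. The function names and the parameters `conf_threshold`, `lam`, `warmup_steps`, and `cap` are hypothetical.

```python
import math


def scale_invariant_depth_loss(pred, target, conf, conf_threshold=0.5, lam=1.0):
    """Scale-invariant log-depth loss over confidence-gated pixels (assumed form)."""
    # Keep only pixels whose prior confidence passes the gate.
    diffs = [
        math.log(p) - math.log(t)
        for p, t, c in zip(pred, target, conf)
        if c >= conf_threshold
    ]
    if not diffs:
        return 0.0
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    mean = sum(diffs) / n
    # With lam = 1 this is the variance of the log-ratio, so a global
    # scale factor between prediction and prior cancels out.
    return mean_sq - lam * mean * mean


def warmup_to_cap(step, warmup_steps=1000, cap=0.1):
    """Prior weight that ramps linearly from 0 to `cap`, then stays at `cap`."""
    return cap * min(1.0, step / warmup_steps)
```

In a training loop the total loss would then be something like `photometric + warmup_to_cap(step) * scale_invariant_depth_loss(...)`, so the depth prior is injected softly at first and never dominates later.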

[132] Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning

Xin Gu,Haoji Zhang,Qihang Fan,Jingxuan Niu,Zhipeng Zhang,Libo Zhang,Guang Chen,Fan Chen,Longyin Wen,Sijie Zhu

Main category: cs.CV

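TL;DR: Proposes STVG-o1, the first framework that lets multimodal large language models (MLLMs) reach state-of-the-art spatio-temporal video grounding performance without architectural changes. A bounding-box chain-of-thought mechanism and a multi-dimensional reinforcement reward function substantially improve fine-grained alignment and give leading results on several benchmarks.

Details Motivation: Existing MLLMs perform poorly on spatio-temporal video grounding (STVG), mainly because of misaligned training objectives and weak fine-grained region-word alignment in visual encoders, so a new approach is needed for this kind of precise understanding task. Method: The STVG-o1 framework introduces a bounding-box chain-of-thought mechanism that reasons explicitly about spatio-temporal locations before predicting, and a multi-dimensional reinforcement learning reward made up of format, consistency, temporal, spatial, and think rewards, which supplies geometry-aware supervision through reinforcement fine-tuning. Result: Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 beats the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matches specialized models on VidSTG, and surpasses existing MLLM-based methods by large margins, showing strong open-vocabulary generalization across datasets. Conclusion: STVG-o1 shows for the first time that an off-the-shelf MLLM can reach state-of-the-art STVG performance without structural modification, offering an effective path for applying MLLMs to fine-grained spatio-temporal grounding. Abstract: Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions. Despite their strong language understanding, multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. To address this, we propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without any architectural modifications. Our method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations in an intermediate step before producing the final prediction. We further design a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning. Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matching specialized models on VidSTG, and surpassing all existing MLLM-based approaches by large margins. It also demonstrates strong open-vocabulary generalization across datasets, establishing MLLMs as viable and powerful backbones for precise spatio-temporal grounding. Our code and models will be released.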
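
The temporal and spatial rewards mentioned above are geometry-aware, and the most common geometric overlap measure for this task is intersection-over-union. The sketch below shows how such rewards could be computed; it is an illustration only, since the abstract does not define the reward terms. The function names, the `(start, end)` and `(x1, y1, x2, y2)` conventions, and the weights `w_t` and `w_s` are assumptions.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred_span, gt_span, pred_boxes, gt_boxes, w_t=0.5, w_s=0.5):
    """Weighted temporal + spatial reward (weights are hypothetical).

    pred_boxes and gt_boxes map a frame index to a box. Ground-truth frames
    with no predicted box contribute zero spatial overlap.
    """
    if gt_boxes:
        shared = [f for f in gt_boxes if f in pred_boxes]
        spatial = sum(box_iou(pred_boxes[f], gt_boxes[f]) for f in shared) / len(gt_boxes)
    else:
        spatial = 0.0
    return w_t * temporal_iou(pred_span, gt_span) + w_s * spatial
```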

[133] Monet: Reasoning in Latent Visual Space Beyond Images and Language

Qixun Wang,Yang Shi,Yifei Wang,Yuanxing Zhang,Pengfei Wan,Kun Gai,Xianghua Ying,Yisen Wang

Main category: cs.CV

TL;DR: Proposes Monet, a framework in which multimodal large language models generate continuous embeddings in the latent visual space as intermediate visual thoughts, enabling abstract visual reasoning that is closer to human thinking.

Details Motivation: Existing methods are limited by external tools and cannot achieve human-like abstract visual thinking; latent embeddings lack effective supervision and training is inefficient. Method: A three-stage distillation-based supervised fine-tuning (SFT) pipeline, plus VLPO (Visual-latent Policy Optimization), a reinforcement learning method that incorporates latent embeddings into policy gradient updates; the authors also build Monet-SFT-125K, a high-quality interleaved text-image CoT dataset with 125K samples. Result: Monet-7B performs well on real-world perception and reasoning benchmarks, generalizes strongly out of distribution, and clearly outperforms existing methods on abstract visual reasoning tasks. Conclusion: The work advances visual reasoning in latent space for multimodal models, supplies an effective training framework and data, and offers practical guidance for future research on visual latent reasoning. Abstract: "Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.

[134] DiverseVAR: Balancing Diversity and Quality of Next-Scale Visual Autoregressive Models

Mingue Park,Prin Phunyaphibarn,Phillip Y. Lee,Minhyuk Sung

Main category: cs.CV

TL;DR: Proposes DiverseVAR, a framework that needs no retraining or fine-tuning: at test time it injects noise into the text embedding and applies a new scale-travel technique, substantially increasing the generation diversity of visual autoregressive models while preserving image quality.

Details Motivation: Visual autoregressive (VAR) models generate images well but often produce highly similar images for the same prompt; this lack of diversity has been overlooked in current research. Method: Noise is first injected into the text embedding to raise diversity; scale-travel then uses a multi-scale autoencoder to extract coarse-scale tokens and resume generation at intermediate stages, which maintains image quality. Result: Experiments show markedly higher diversity on multiple metrics with minimal loss of image quality, reaching a new Pareto frontier between diversity and quality. Conclusion: DiverseVAR gives VAR models an efficient, plug-and-play way to increase diversity without extra training, balancing the trade-off between generation diversity and image quality. Abstract: We introduce DiverseVAR, a framework that enhances the diversity of text-conditioned visual autoregressive models (VAR) at test time without requiring retraining, fine-tuning, or substantial computational overhead. While VAR models have recently emerged as strong competitors to diffusion and flow models for image generation, they suffer from a critical limitation in diversity, often producing nearly identical images even for simple prompts. This issue has largely gone unnoticed amid the predominant focus on image quality. We address this limitation at test time in two stages. First, inspired by diversity enhancement techniques in diffusion models, we propose injecting noise into the text embedding. This introduces a trade-off between diversity and image quality: as diversity increases, the image quality sharply declines. To preserve quality, we propose scale-travel: a novel latent refinement technique inspired by time-travel strategies in diffusion models. Specifically, we use a multi-scale autoencoder to extract coarse-scale tokens that enable us to resume generation at intermediate stages. Extensive experiments show that combining text-embedding noise injection with our scale-travel refinement significantly enhances diversity while minimizing image-quality degradation, achieving a new Pareto frontier in the diversity-quality trade-off.
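
The first stage, noise injection into the text embedding, can be illustrated in a few lines. This is a minimal sketch assuming isotropic Gaussian noise with a single strength parameter; the abstract does not state the noise distribution or schedule, and `inject_noise`, `sigma`, and `seed` are hypothetical names. The scale-travel refinement is not sketched because it depends on the model's multi-scale autoencoder.

```python
import random


def inject_noise(text_embedding, sigma=0.1, seed=None):
    """Return a perturbed copy of a text embedding (list of floats).

    Larger `sigma` pushes samples further apart (more diversity) but, per the
    abstract, degrades image quality, which is what the refinement stage
    is meant to repair.
    """
    rng = random.Random(seed)
    return [x + sigma * rng.gauss(0.0, 1.0) for x in text_embedding]
```

Calling the function with different seeds on the same prompt embedding yields different conditioning vectors, which is the source of the added diversity.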

[135] SAM Guided Semantic and Motion Changed Region Mining for Remote Sensing Change Captioning

Futian Wang,Mengqi Wang,Xiao Wang,Haowen Wang,Jin Tang

Main category: cs.CV

TL;DR: Proposes a remote sensing change captioning method that combines the SAM foundation model with a knowledge graph, fusing global visual features, semantic- and motion-level change regions, and object knowledge to produce more accurate natural language descriptions of changes.

Details Motivation: Existing methods have weak region awareness and limited temporal alignment; they lack a fine-grained characterization of changed regions and do not bring in knowledge about objects of interest. Method: A CNN/Transformer extracts global features, SAM segments semantic- and motion-level change regions, and a constructed knowledge graph supplies information about objects of interest; the sources are fused by cross-attention and a Transformer decoder generates the caption. Result: State-of-the-art performance on multiple mainstream remote sensing change captioning datasets. Conclusion: Introducing SAM and a knowledge graph improves the accuracy and interpretability of remote sensing change captioning and provides a new multimodal fusion framework for the task. Abstract: Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed between two remote sensing images captured at different times. Existing methods typically employ CNNs/Transformers to extract visual representations from the given images or incorporate auxiliary tasks to enhance the final results, with weak region awareness and limited temporal alignment. To address these issues, this paper explores the use of the SAM (Segment Anything Model) foundation model to extract region-level representations and inject region-of-interest knowledge into the captioning framework. Specifically, we employ a CNN/Transformer model to extract global-level vision features, leverage the SAM foundation model to delineate semantic- and motion-level change regions, and utilize a specially constructed knowledge graph to provide information about objects of interest. These heterogeneous sources of information are then fused via cross-attention, and a Transformer decoder is used to generate the final natural language description of the observed changes. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple widely used benchmark datasets. The source code of this paper will be released on https://github.com/Event-AHU/SAM_ChangeCaptioning

[136] E-M3RF: An Equivariant Multimodal 3D Re-assembly Framework

Adeela Islam,Stefano Fiorini,Manuel Lecha,Theodore Tsesmelis,Stuart James,Pietro Morerio,Alessio Del Bue

Main category: cs.CV

TL;DR: Proposes E-M3RF, an equivariant multimodal 3D reassembly framework that combines geometric and color features and predicts fragment transformations with SE(3) flow matching, substantially improving reassembly accuracy on cultural heritage datasets.

Details Motivation: Existing deep-learning 3D reassembly methods rely mainly on geometric features, perform poorly when geometry is insufficient or ambiguous (small, eroded, or symmetric fragments), and lack physical constraints that prevent overlap. Method: E-M3RF uses a rotation-equivariant encoder to extract geometric features from point positions, encodes per-point color with a transformer, fuses the two into a multimodal representation, and predicts rigid transformations with SE(3) flow matching. Result: Experiments on four datasets (Breaking Bad, Fantastic Breaks, RePAIR, and Presious) show that on RePAIR, E-M3RF reduces rotation error by 23.1%, translation error by 13.2%, and Chamfer Distance by 18.4% relative to existing methods. Conclusion: By fusing geometry and color in a multimodal representation with equivariant modeling, E-M3RF improves 3D reassembly of complex fragments and is especially suited to practical settings such as cultural heritage restoration. Abstract: 3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been challenged by deep learning methods rather than classical optimization. While learning approaches have shown promising results, most still rely primarily on geometric features to assemble a whole from its parts. As a result, methods struggle when geometry alone is insufficient or ambiguous, for example, for small, eroded, or symmetric fragments. Additionally, solutions do not impose physical constraints that explicitly prevent overlapping assemblies. To address these limitations, we introduce E-M3RF, an equivariant multimodal 3D reassembly framework that takes as input the point clouds, containing both point positions and colors of fractured fragments, and predicts the transformations required to reassemble them using SE(3) flow matching. Each fragment is represented by both geometric and color features: i) 3D point positions are encoded as rotation-consistent geometric features using a rotation-equivariant encoder, ii) the colors at each 3D point are encoded with a transformer. The two feature sets are then combined to form a multimodal representation. We experimented on four datasets: two synthetic datasets, Breaking Bad and Fantastic Breaks, and two real-world cultural heritage datasets, RePAIR and Presious, demonstrating that E-M3RF on the RePAIR dataset reduces rotation error by 23.1% and translation error by 13.2%, while Chamfer Distance decreases by 18.4% compared to competing methods.
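
Chamfer Distance, one of the reported metrics, measures how far each point in one cloud is from its nearest neighbor in the other. The sketch below uses the symmetric, squared-distance, mean-reduced variant; definitions differ between papers (squared or unsquared, sum or mean), and the abstract does not say which one is used, so treat this choice as an assumption.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point lists (squared, mean-reduced)."""

    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    def one_way(src, dst):
        # Average squared distance from each source point to its nearest target point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)
```

This brute-force version is quadratic in the number of points; real evaluations usually rely on a spatial index or a GPU implementation.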

[137] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings

Jiajie Zhang,Sören Schwertfeger,Alexander Kleiner

Main category: cs.CV

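TL;DR: Proposes an unsupervised framework that automatically extracts and organizes Vision-Language-Action (VLA) pre-training data from continuous industrial video streams, the first end-to-end approach to mining unlabeled human demonstration data.

Details Motivation: Large amounts of unlabeled human demonstration data in industrial video are hard to exploit because there is no automated way to extract structured action data suitable for VLA pre-training. Method: A lightweight motion tokenizer is first trained to encode motion dynamics; an unsupervised action segmenter based on a new "Latent Action Energy" metric then discovers and segments semantically coherent action primitives, outputting segmented clips with their latent action sequences. Result: Effective segmentation of key tasks is demonstrated on public benchmarks and a proprietary electric motor assembly dataset, and clustering and quantitative assessment with a Vision-Language Model confirm the semantic coherence of the discovered primitives. Conclusion: This is the first fully automated end-to-end system for extracting VLA pre-training data from unstructured industrial video, offering a scalable solution for integrating embodied AI into manufacturing. Abstract: We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.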

[138] EvRainDrop: HyperGraph-guided Completion for Effective Frame and Event Stream Aggregation

Futian Wang,Fan Zhang,Xiao Wang,Mengqi Wang,Dexing Huang,Jin Tang

Main category: cs.CV

TL;DR: Proposes a hypergraph-based spatio-temporal event stream completion mechanism that connects event tokens across time and space through hypergraphs and passes contextual information to compensate for spatial sparsity, while also supporting RGB tokens for multi-modal feature learning and fusion.

Details Motivation: Event cameras produce asynchronous streams that are spatially sparse but temporally dense, and existing event representation learning methods struggle with the undersampling that spatial sparsity causes. Method: A hypergraph-guided spatio-temporal completion mechanism builds event tokens into a hypergraph and completes sparse events by message passing; RGB tokens are added as hypergraph nodes for multi-modal fusion, and self-attention aggregates node information across time steps to learn multi-modal features. Result: Extensive experiments on single-label and multi-label event classification validate the framework, which outperforms mainstream methods. Conclusion: The method alleviates the spatial sparsity of event data, yields a more complete event representation, and improves event classification through multi-modal fusion. Abstract: Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically use event frames, voxels, or tensors as input. Although these approaches have achieved notable progress, they struggle to address the undersampling problem caused by spatial sparsity. In this paper, we propose a novel hypergraph-guided spatio-temporal event stream completion mechanism, which connects event tokens across different times and spatial locations via hypergraphs and leverages contextual information message passing to complete these sparse events. The proposed method can flexibly incorporate RGB tokens as nodes in the hypergraph within this completion framework, enabling multi-modal hypergraph-based information completion. Subsequently, we aggregate hypergraph node information across different time steps through self-attention, enabling effective learning and fusion of multi-modal features. Extensive experiments on both single- and multi-label event classification tasks fully validated the effectiveness of our proposed framework. The source code of this paper will be released on https://github.com/Event-AHU/EvRainDrop.
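
To make the idea of hypergraph message passing concrete, here is a toy sketch of one round of node-to-hyperedge-to-node mean aggregation. It is a generic illustration, not the paper's completion module: the unweighted mean, the absence of learned parameters, and the function name are all assumptions.

```python
def hypergraph_message_passing(node_feats, hyperedges):
    """One round of mean aggregation: nodes -> hyperedges -> nodes.

    node_feats: list of feature vectors (lists of floats), one per token.
    hyperedges: list of lists of node indices; each hyperedge may join many tokens.
    """
    dim = len(node_feats[0])
    # Each hyperedge summarizes the tokens it connects.
    edge_feats = [
        [sum(node_feats[i][d] for i in edge) / len(edge) for d in range(dim)]
        for edge in hyperedges
    ]
    updated = []
    for i, feat in enumerate(node_feats):
        incident = [edge_feats[k] for k, edge in enumerate(hyperedges) if i in edge]
        if not incident:
            updated.append(list(feat))  # isolated token: leave unchanged
            continue
        updated.append(
            [sum(ef[d] for ef in incident) / len(incident) for d in range(dim)]
        )
    return updated
```

An empty (all-zero) token that shares a hyperedge with informative tokens receives a non-zero feature after one round, which is the sense in which context "completes" sparse events.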

[139] MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices

Shuai Zhang,Bao Tang,Siyuan Yu,Yueting Zhu,Jingfeng Yao,Ya Zou,Shanglin Yuan,Li Yu,Wenyu Liu,Xinggang Wang

Main category: cs.CV

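TL;DR: Proposes MobileI2V, a lightweight 270M-parameter diffusion model that for the first time enables fast, high-quality 720p image-to-video generation on mobile devices. Its core components are a linear hybrid architecture denoiser, a time-step distillation strategy, and mobile-specific attention optimizations, which greatly increase generation speed while preserving quality.

Details Motivation: The high computational complexity and slow generation speed of diffusion models make real-time, high-resolution image-to-video generation impractical on resource-constrained mobile devices, so an efficient, lightweight solution is needed. Method: (1) A linear hybrid architecture denoiser that balances generation efficiency and quality on mobile devices; (2) a time-step distillation strategy that compresses sampling from more than 20 steps to only two; (3) mobile-specific attention optimizations that speed up inference. Result: Time-step distillation gives roughly a 10-fold increase in generation speed and the attention optimizations give a 2-fold speed-up for attention operations; under one-step conditions each 720p frame is generated in under 100 ms, with quality comparable to existing models. Conclusion: MobileI2V is the first to achieve fast, high-resolution image-to-video generation on mobile devices, providing an efficient and practical solution for on-device video generation. Abstract: Recently, video generation has witnessed rapid advancements, drawing increasing attention to image-to-video (I2V) synthesis on mobile devices. However, the substantial computational complexity and slow generation speed of diffusion models pose significant challenges for real-time, high-resolution video generation on resource-constrained mobile devices. In this work, we propose MobileI2V, a 270M lightweight diffusion model for real-time image-to-video generation on mobile devices. The core lies in: (1) We analyzed the performance of linear attention modules and softmax attention modules on mobile devices, and proposed a linear hybrid architecture denoiser that balances generation efficiency and quality. (2) We design a time-step distillation strategy that compresses the I2V sampling steps from more than 20 to only two without significant quality loss, resulting in a 10-fold increase in generation speed. (3) We apply mobile-specific attention optimizations that yield a 2-fold speed-up for attention operations during on-device inference. MobileI2V enables, for the first time, fast 720p image-to-video generation on mobile devices, with quality comparable to existing models. Under one-step conditions, the generation speed of each frame of 720p video is less than 100 ms. Our code is available at: https://github.com/hustvl/MobileI2V.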
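
The efficiency argument for linear attention is that keys and values can be summarized once, so the cost grows linearly with the number of tokens instead of quadratically. The sketch below shows the standard kernelized form with the feature map elu(x) + 1; the abstract does not describe MobileI2V's actual attention modules, so the feature map and all names here are assumptions for illustration.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1, a common choice for linear attention."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """Kernelized attention computed in O(N) with respect to sequence length."""
    d, dv = len(Q[0]), len(V[0])
    # Summaries shared by every query: S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j).
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom for b in range(dv)])
    return out
```

Softmax attention instead forms an N-by-N score matrix, which is the part that becomes expensive at 720p token counts; a hybrid design mixes the two kinds of layer.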

[140] Frequency-Aware Token Reduction for Efficient Vision Transformer

Dong-Jae Lee,Jiwan Hur,Jaehyun Choi,Jaemyung Yu,Junmo Kim

Main category: cs.CV

TL;DR: Proposes a frequency-aware token reduction strategy that separates high-frequency and low-frequency tokens and aggregates the low-frequency components to mitigate rank collapse, lowering computational cost while improving accuracy.

Details Motivation: Existing token reduction methods for Vision Transformers overlook the frequency characteristics of self-attention, such as rank collapse and over-smoothing, which degrades performance. Method: Tokens are partitioned into high-frequency and low-frequency groups; high-frequency tokens are selectively preserved and low-frequency tokens are aggregated into a compact direct current (DC) token that retains the essential low-frequency information. Result: Experiments show significantly lower computational overhead, reduced rank collapse and over-smoothing, and higher accuracy; the implicit frequency characteristics of prior methods are also analyzed. Conclusion: The frequency-aware strategy balances computational efficiency and model performance and highlights the role of frequency characteristics in Vision Transformers. Abstract: Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity concerning token length remains a significant challenge. To address this, token reduction methods have been widely explored. However, existing approaches often overlook the frequency characteristics of self-attention, such as rank collapsing and over-smoothing. In this paper, we propose a frequency-aware token reduction strategy that improves computational efficiency while preserving performance by mitigating rank collapsing. Our method partitions tokens into high-frequency tokens and low-frequency tokens. High-frequency tokens are selectively preserved, while low-frequency tokens are aggregated into a compact direct current token to retain essential low-frequency components. Through extensive experiments and analysis, we demonstrate that our approach significantly improves accuracy while reducing computational overhead and mitigating rank collapsing and over-smoothing. Furthermore, we analyze the previous methods, shedding light on their implicit frequency characteristics and limitations.
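
The partition-and-aggregate step can be sketched as follows. The abstract does not say how tokens are scored as high- or low-frequency, so this example makes a simple assumption: the DC component is the mean token, and a token's high-frequency energy is its squared distance from that mean. The function name and the `keep` parameter are hypothetical.

```python
def reduce_tokens(tokens, keep):
    """Keep the `keep` most high-frequency tokens and merge the rest into one DC token.

    Returns (reduced_tokens, kept_indices). The scoring rule is an assumed
    stand-in for the paper's criterion.
    """
    n, d = len(tokens), len(tokens[0])
    dc = [sum(t[j] for t in tokens) / n for j in range(d)]  # mean (DC) component
    energy = [sum((t[j] - dc[j]) ** 2 for j in range(d)) for t in tokens]
    order = sorted(range(n), key=lambda i: energy[i], reverse=True)
    high = sorted(order[:keep])  # preserve original token order
    low = order[keep:]
    reduced = [tokens[i] for i in high]
    if low:
        # Aggregate all low-frequency tokens into a single compact token.
        reduced.append([sum(tokens[i][j] for i in low) / len(low) for j in range(d)])
    return reduced, high
```

With N input tokens the next layer sees at most `keep + 1` tokens, which is where the compute saving comes from.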

[141] Merge and Bound: Direct Manipulations on Weights for Class Incremental Learning

Taehoon Kim,Donghwan Jang,Bohyung Han

Main category: cs.CV

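TL;DR: Proposes Merge-and-Bound (M&B), a new training approach for class incremental learning that operates directly on model weights in parameter space. Inter-task and intra-task weight merging combined with a bounded update strategy reduces catastrophic forgetting and outperforms existing methods on standard benchmarks.

Details Motivation: To address catastrophic forgetting in class incremental learning with a more effective optimization approach that does not depend on architectural changes or redefined objectives. Method: Merge-and-Bound (M&B) combines inter-task weight merging (averaging model weights from previous stages) with intra-task weight merging (combining parameters within the current stage) and uses a bounded update technique that minimizes cumulative updates to preserve old knowledge. Result: Extensive evaluation on several standard class incremental learning benchmarks shows M&B clearly outperforming state-of-the-art methods. Conclusion: By manipulating weights directly in parameter space, M&B provides an efficient, plug-and-play optimization framework for class incremental learning that mitigates catastrophic forgetting without changing the model architecture or learning objectives. Abstract: We present a novel training approach, named Merge-and-Bound (M&B), for Class Incremental Learning (CIL), which directly manipulates model weights in the parameter space for optimization. Our algorithm involves two types of weight merging: inter-task weight merging and intra-task weight merging. Inter-task weight merging unifies previous models by averaging the weights of models from all previous stages. On the other hand, intra-task weight merging facilitates the learning of the current task by combining the model parameters within the current stage. For reliable weight merging, we also propose a bounded update technique that aims to optimize the target model with minimal cumulative updates and preserve knowledge from previous tasks; this strategy reveals that it is possible to effectively obtain new models near old ones, reducing catastrophic forgetting. M&B is seamlessly integrated into existing CIL methods without modifying architecture components or revising learning objectives. We extensively evaluate our algorithm on standard CIL benchmarks and demonstrate superior performance compared to state-of-the-art methods.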
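
Both ingredients can be illustrated on flat weight vectors. The sketch assumes uniform averaging for merging and an L2-ball projection around the old weights for the bounded update; the abstract confirms averaging for inter-task merging but does not specify how the bound is enforced, so the projection and the `radius` parameter are assumptions.

```python
import math


def merge_weights(models):
    """Element-wise average of several weight vectors (lists of floats)."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def bounded_update(old, new, radius):
    """Keep `new` within an L2 ball of `radius` around `old` (assumed bound)."""
    delta = [b - a for a, b in zip(old, new)]
    norm = math.sqrt(sum(x * x for x in delta))
    if norm <= radius:
        return list(new)
    scale = radius / norm
    # Move from the old weights toward the new ones, but only as far as the bound allows.
    return [a + scale * x for a, x in zip(old, delta)]
```

Staying close to the previous weights is what the abstract credits for retaining old knowledge: the new model is found near the old one rather than anywhere in parameter space.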

[142] CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation

Shizhe Sun,Wataru Ohyama

Main category: cs.CV

TL;DR: Proposes CanKD, a cross-attention-based non-local knowledge distillation method that strengthens the modeling of pixel-wise feature relationships and significantly improves knowledge transfer in object detection and image segmentation.

Details Motivation: Traditional self-attention-based distillation methods align teacher and student feature maps independently, which makes it hard to capture long-range dependencies across spatial positions and limits transfer efficiency. Method: A cross-attention mechanism lets each pixel of the student feature map attend dynamically to all pixels of the teacher feature map, achieving non-local knowledge transfer, with a new loss function designed to optimize the process. Result: Clearly outperforms existing feature and hybrid distillation methods on object detection and image segmentation. Conclusion: CanKD offers a new paradigm for attention-guided knowledge distillation with strong generality and performance. Abstract: We propose Cross-Attention-based Non-local Knowledge Distillation (CanKD), a novel feature-based knowledge distillation framework that leverages cross-attention mechanisms to enhance the knowledge transfer process. Unlike traditional self-attention-based distillation methods that align teacher and student feature maps independently, CanKD enables each pixel in the student feature map to dynamically consider all pixels in the teacher feature map. This non-local knowledge transfer more thoroughly captures pixel-wise relationships, improving feature representation learning. Our method introduces only an additional loss function to achieve superior performance compared with existing attention-guided distillation methods. Extensive experiments on object detection and image segmentation tasks demonstrate that CanKD outperforms state-of-the-art feature and hybrid distillation methods. These experimental results highlight CanKD's potential as a new paradigm for attention-guided distillation in computer vision tasks. Code is available at https://github.com/tori-hotaru/CanKD
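
The phrase "each pixel in the student feature map considers all pixels in the teacher feature map" describes cross-attention with student features as queries and teacher features as keys and values. The sketch below shows that operation on flattened feature maps using standard scaled dot-product attention; it omits any learned projections and the distillation loss itself, neither of which the abstract specifies, so it should be read as an illustration rather than CanKD's actual module.

```python
import math


def cross_attention(student, teacher):
    """Student pixels (queries) attend over all teacher pixels (keys and values).

    student, teacher: lists of feature vectors with the same channel count.
    """
    d = len(student[0])
    out = []
    for q in student:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in teacher]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * k[j] for w, k in zip(weights, teacher)) / total for j in range(d)]
        )
    return out
```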

[143] Generalized Design Choices for Deepfake Detectors

Lorenzo Pellegrini,Serafino Pandolfini,Davide Maltoni,Matteo Ferrara,Marco Prati,Marco Ramilli

Main category: cs.CV

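TL;DR: Systematically studies how design choices in training, inference, and incremental updates affect the accuracy and generalization of deepfake detection models, aiming to establish architecture-agnostic best practices that improve detection and reach state-of-the-art results on the AI-GenBench benchmark.

Details Motivation: The performance of deepfake detection methods is often strongly influenced by implementation details (data preprocessing, augmentation strategies, and optimization techniques), which makes fair comparison difficult and obscures the factors that matter. Method: The impact of individual design factors is isolated to evaluate systematically how different training, inference, and model update strategies affect detection performance. Result: The experiments identify a set of design choices that consistently improve deepfake detection and achieve state-of-the-art performance on AI-GenBench. Conclusion: The paper proposes robust, architecture-agnostic best practices for deepfake detection to support the design and development of future detection systems. Abstract: The effectiveness of deepfake detection methods often depends less on their core design and more on implementation details such as data preprocessing, augmentation strategies, and optimization techniques. These factors make it difficult to fairly compare detectors and to understand which factors truly contribute to their performance. To address this, we systematically investigate how different design choices influence the accuracy and generalization capabilities of deepfake detection models, focusing on aspects related to training, inference, and incremental updates. By isolating the impact of individual factors, we aim to establish robust, architecture-agnostic best practices for the design and development of future deepfake detection systems. Our experiments identify a set of design choices that consistently improve deepfake detection and enable state-of-the-art performance on the AI-GenBench benchmark.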

[144] Self-Paced Learning for Images of Antinuclear Antibodies

Yiyang Jiang,Guangwu Qian,Jiaxin Wu,Qi Huang,Qing Li,Yongkang Wu,Xiao-Yong Wei

Main category: cs.CV

TL;DR: Proposes a new framework for antinuclear antibody (ANA) detection that handles the complexity of multi-instance, multi-label learning through instance sampling, pseudo-label assignment, and self-paced learning, achieving state-of-the-art performance on multiple datasets.

Details Motivation: Manual ANA detection is slow, labor-intensive, and requires extensive training; more than 100 antibody types produce complex combinations of fluorescent patterns, and existing machine learning methods struggle with the multi-instance, multi-label (MIML) nature of real clinical settings. Method: An end-to-end deep learning framework with three task-specific components: an instance sampler that suppresses low-confidence instances, a probabilistic pseudo-label dispatcher that assigns labels adaptively based on distinguishability, and a self-paced weighting mechanism that adjusts training according to label observations; it trains directly on raw microscope images. Result: On the ANA dataset, gains of up to +7.0% F1-Macro and +12.6% mAP over the best prior method; top-2 ranking on public medical MIML datasets, with Hamming loss and one-error reduced by up to 18.2% and 26.9%. Conclusion: The framework addresses the MIML problem in ANA detection, achieves high-performance automated detection without manual preprocessing, shows strong clinical potential, and offers a new approach to MIML tasks in medical imaging. Abstract: Antinuclear antibody (ANA) testing is a crucial method for diagnosing autoimmune disorders, including lupus, Sjögren's syndrome, and scleroderma. Despite its importance, manual ANA detection is slow, labor-intensive, and demands years of training. ANA detection is complicated by over 100 coexisting antibody types, resulting in vast fluorescent pattern combinations. Although machine learning and deep learning have enabled automation, ANA detection in real-world clinical settings presents unique challenges as it involves multi-instance, multi-label (MIML) learning. In this paper, a novel framework for ANA detection is proposed that handles the complexities of MIML tasks using unaltered microscope images without manual preprocessing. Inspired by human labeling logic, it identifies consistent ANA sub-regions and assigns aggregated labels accordingly. These steps are implemented using three task-specific components: an instance sampler, a probabilistic pseudo-label dispatcher, and self-paced weight learning rate coefficients. The instance sampler suppresses low-confidence instances by modeling pattern confidence, while the dispatcher adaptively assigns labels based on instance distinguishability. Self-paced learning adjusts training according to empirical label observations. Our framework overcomes limitations of traditional MIML methods and supports end-to-end optimization. Extensive experiments on one ANA dataset and three public medical MIML benchmarks demonstrate the superiority of our framework. On the ANA dataset, our model achieves up to +7.0% F1-Macro and +12.6% mAP gains over the best prior method, setting new state-of-the-art results. It also ranks top-2 across all key metrics on public datasets, reducing Hamming loss and one-error by up to 18.2% and 26.9%, respectively. The source code can be accessed at https://github.com/fletcherjiang/ANA-SelfPacedLearning.
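
Hamming loss and one-error are standard multi-label metrics, and both are "lower is better". The sketch below gives their usual textbook definitions for readers unfamiliar with them; it is not taken from the paper's code, and the function names and the 0/1 label encoding are assumptions.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots predicted incorrectly.

    y_true, y_pred: lists of equal-length 0/1 label vectors, one per sample.
    """
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return wrong / total


def one_error(y_true, scores):
    """Fraction of samples whose top-scored label is not a true label."""
    misses = 0
    for labels, row in zip(y_true, scores):
        top = max(range(len(row)), key=lambda j: row[j])
        if labels[top] == 0:
            misses += 1
    return misses / len(y_true)
```

Read this way, a reduction in one-error of up to 26.9% means the model's single most confident label is wrong noticeably less often.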

[145] EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?

Pierre Adorni,Minh-Tan Pham,Stéphane May,Sébastien Lefèvre

Main category: cs.CV

TL;DR: Proposes an efficient Ensemble-of-Specialists framework for Remote Sensing Foundation Models (RSFMs), using lightweight, reusable task-specific specialist modules to improve efficiency, interpretability, and extensibility, with support for federated training and continuous integration.

Details Motivation: Existing remote sensing foundation models depend on large models and large datasets, consume heavy computational resources, work against sustainability, and are hard to adopt widely; an efficient, environmentally responsible, and collaborative alternative is needed. Method: An Ensemble-of-Specialists framework decomposes training into multiple lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused, supporting federated training, pruning, and continuous specialist integration. Result: A more efficient, modular, and resource-friendly way to train remote sensing foundation models, with good extensibility and potential for collaboration. Conclusion: The framework sets a new direction for building sustainable, scalable remote sensing foundation models and is particularly suited to resource-constrained and collaborative settings. Abstract: Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need for training separate models for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to only a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast with the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs.

[146] The Age-specific Alzheimer's Disease Prediction with Characteristic Constraints in Nonuniform Time Span

Xin Hong,Kaifeng Huang

Main category: cs.CV

TL;DR: Proposes a sequential image generation method guided by quantitative metrics and an age-scaling factor for long-term prediction of Alzheimer's disease, significantly improving the accuracy and similarity of synthesized MRI images.

Details Motivation: Input sequences are captured at irregular time intervals, which makes it hard to represent disease characteristics accurately, so an image generation method is needed that preserves the key features of disease progression. Method: A sequential image generation method guided by quantitative metrics, with an age-scaling factor to produce age-specific MRI images and an age-scaled pixel loss to optimize iterative image generation. Result: The ablation study shows that quantitative metrics significantly improve the accuracy of MRI image synthesis and that the age-scaled pixel loss improves generation; for long-term disease prediction the Structural Similarity Index reached 0.882, indicating a high degree of similarity in the synthesized images. Conclusion: The method generates age-specific MRI images that reflect the progression of Alzheimer's disease, helping to improve long-term prediction and support personalized treatment. Abstract: Alzheimer's disease is a debilitating disorder marked by a decline in cognitive function. Timely identification of the disease is essential for the development of personalized treatment strategies that aim to mitigate its progression. The application of generated images for the prediction of Alzheimer's disease poses challenges, particularly in accurately representing the disease's characteristics when input sequences are captured at irregular time intervals. This study presents an innovative methodology for sequential image generation, guided by quantitative metrics, to maintain the essential features indicative of disease progression. Furthermore, an age-scaling factor is integrated into the process to produce age-specific MRI images, facilitating the prediction of advanced stages of the disease. The results obtained from the ablation study suggest that the inclusion of quantitative metrics significantly improves the accuracy of MRI image synthesis. Furthermore, the application of age-scaled pixel loss contributed to the enhanced iterative generation of MRI images. In terms of long-term disease prognosis, the Structural Similarity Index reached a peak value of 0.882, indicating a substantial degree of similarity in the synthesized images.

[147] Video Generation Models Are Good Latent Reward Models

Xiaoyue Mi,Wenqing Yu,Jiesong Lian,Shibo Jie,Ruizhe Zhong,Zijun Liu,Guozhen Zhang,Zixiang Zhou,Zhiyong Xu,Yuan Zhou,Qinglin Lu,Fan Tang

Main category: cs.CV

TL;DR: Proposes PRFL, a new framework that performs preference optimization for video generation in latent space, avoiding the expensive VAE decoding and pixel-space computation of conventional methods and achieving more efficient and better alignment with human preferences.

Details Motivation: Existing video reward models rely on vision-language models designed for pixel-space inputs, which brings heavy computational cost and confines optimization to late denoising stages, making it hard to improve motion dynamics and structural coherence. Method: The approach exploits the natural suitability of pre-trained video generation models for the noisy latent space, performing reward modeling and preference optimization directly in latent space without any VAE decoding and backpropagating gradients through the full denoising chain. Result: Experiments show that PRFL significantly improves alignment with human preferences while substantially reducing memory consumption and training time compared to RGB ReFL. Conclusion: PRFL provides an efficient and effective alternative for reward feedback learning in video generation and advances preference learning in latent space. Abstract: Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.

[148] UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes

Kang Du,Xue Liao,Junpeng Xia,Chaozheng Guo,Yi Gu,Yirui Guan,Duotun Wang,ShengHuang,Zeyu Wang

Main category: cs.CV

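TL;DR: UAVLight is a new benchmark dataset for the robustness of 3D reconstruction under illumination changes. UAVs capture the same scene from multiple views at different times of day, yielding real scenes with consistent geometry but diverse lighting for evaluating and improving existing reconstruction methods under complex outdoor illumination.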

Details Motivation: Inconsistent illumination severely degrades multi-view 3D reconstruction, yet existing datasets cannot provide sufficient lighting variation while keeping geometry fixed, which makes it hard to evaluate the illumination robustness of algorithms. Method: Builds UAVLight, a controlled real-world dataset that uses repeatable, geo-referenced flight paths to capture multi-view images of the same scene at multiple fixed times of day, keeping geometry, calibration, and viewpoints consistent and introducing only natural lighting variation, together with standardized evaluation protocols. Result: Provides multi-time, multi-view image data with rich lighting variation under fixed geometry, supporting systematic evaluation of the illumination robustness of MVS, SfM, and neural rendering methods. Conclusion: UAVLight provides a reliable and practical benchmark for illumination-robust 3D reconstruction and advances reconstruction methods that are consistent, faithful, and relightable in real outdoor environments. Abstract: Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable. However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness. We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints. With standardized evaluation protocols across lighting conditions, UAVLight provides a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments.

[149] Multimodal Robust Prompt Distillation for 3D Point Cloud Models

Xiang Gu,Liming Lu,Xu Zheng,Anan Du,Yongbin Zhou,Shuchao Pang

Main category: cs.CV

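TL;DR: Proposes Multimodal Robust Prompt Distillation (MRPD), an efficient framework that strengthens 3D point cloud models against adversarial attacks by distilling multimodal knowledge in a teacher-student setup during training, with no extra cost at inference.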

Details Motivation: Existing 3D point cloud defenses suffer from high computational overhead and poor generalization across attack types, so an efficient and general defense mechanism is needed. Method: Designs the teacher-student framework MRPD, which uses a vision model (on depth projections), a high-performance 3D model, and a text encoder as multimodal teachers and learns lightweight prompts through feature alignment; a confidence-gated mechanism dynamically fuses the multimodal information, and knowledge distillation takes place only during training. Result: MRPD substantially outperforms existing defenses under a range of white-box and black-box attacks, also improves performance on clean data, and adds no computational cost at inference. Conclusion: MRPD offers a practical new paradigm for building robust 3D vision systems by efficiently exploiting multimodal knowledge, combining high performance with low inference overhead. Abstract: Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applications. Existing defense methods often suffer from (1) high computational overhead and (2) poor generalization ability across diverse attack types. To bridge these gaps, we propose a novel yet efficient teacher-student framework, namely Multimodal Robust Prompt Distillation (MRPD), for distilling robust 3D point cloud models. It learns lightweight prompts by aligning the student point cloud model's features with robust embeddings from three distinct teachers: a vision model processing depth projections, a high-performance 3D model, and a text encoder. To ensure a reliable knowledge transfer, this distillation is guided by a confidence-gated mechanism which dynamically balances the contribution of all input modalities. Notably, since the distillation happens entirely during the training stage, there is no additional computational cost at inference. Extensive experiments demonstrate that MRPD substantially outperforms state-of-the-art defense methods against a wide range of white-box and black-box attacks, while even achieving better performance on clean data. Our work presents a new, practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge.

[150] Enhanced Landmark Detection Model in Pelvic Fluoroscopy using 2D/3D Registration Loss

Chou Mo,Yehyun Suh,J. Ryan Martin,Daniel Moyer

Main category: cs.CV

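TL;DR: Proposes a U-Net framework that incorporates 2D/3D landmark registration to improve the accuracy of anatomical landmark detection in intra-operative pelvic fluoroscopic X-ray images under variable patient pose.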

Details Motivation: Most existing landmark detection methods for pelvic X-ray images assume a fixed antero-posterior view and cannot handle changes in intra-operative imaging angle and patient pose. Method: Incorporates 2D/3D landmark registration information into U-Net training, introduces a Pose Estimation Loss, and trains and fine-tunes under variable-pose conditions. Result: Compared with the baseline U-Net, models trained or fine-tuned with the Pose Estimation Loss show higher landmark detection accuracy under realistic intra-operative conditions with variable pose. Conclusion: The proposed framework effectively improves the robustness and accuracy of landmark detection under non-standard views and changing patient pose, and has strong potential for clinical application. Abstract: Automated landmark detection offers an efficient approach for medical professionals to understand patient anatomic structure and positioning using intra-operative imaging. While current detection methods for pelvic fluoroscopy demonstrate promising accuracy, most assume a fixed Antero-Posterior view of the pelvis. However, orientation often deviates from this standard view, either due to repositioning of the imaging unit or of the target structure itself. To address this limitation, we propose a novel framework that incorporates 2D/3D landmark registration into the training of a U-Net landmark prediction model. We analyze the performance difference by comparing landmark detection accuracy between the baseline U-Net, U-Net trained with Pose Estimation Loss, and U-Net fine-tuned with Pose Estimation Loss under realistic intra-operative conditions where patient pose is variable.

[151] Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

Teng Hu,Zhentao Yu,Guozhen Zhang,Zihan Su,Zhengguang Zhou,Youliang Zhang,Yuan Zhou,Qinglin Lu,Ran Yi

Main category: cs.CV

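TL;DR: This paper proposes Harmony, a framework that achieves finer-grained synchronized audio-video generation by addressing three key problems: correspondence drift, inefficient global attention, and intra-modal bias.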

Details Motivation: Existing open-source generative models fall short on audio-video alignment, mainly because of correspondence drift in the joint diffusion process, inefficient attention mechanisms, and the intra-modal bias of classifier-free guidance. Method: Proposes the Harmony framework: (1) a Cross-Task Synergy training paradigm to reduce drift; (2) a Global-Local Decoupled Interaction Module to improve temporal alignment precision; (3) a Synchronization-Enhanced CFG (SyncCFG) that amplifies the alignment signal during inference. Result: Experiments show that Harmony significantly outperforms existing methods in both generation quality and fine-grained audio-video synchronization, reaching a new state of the art. Conclusion: Through mechanistic design, Harmony effectively addresses the core challenges of synchronized audio-video generation and offers a new direction for multimodal generation. Abstract: The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
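For orientation, conventional Classifier-Free Guidance, which the abstract names as the source of intra-modal bias, combines a conditional and an unconditional noise prediction. The sketch below shows only that standard rule on plain Python lists. The function name `cfg_combine` and the toy values are assumptions for illustration, and it does not reproduce SyncCFG, whose formulation the abstract does not give.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance: start from the unconditional
    prediction and move along the (conditional - unconditional) direction."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions (hypothetical values, not from the paper).
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=2.0)
```

A scale of 1.0 returns the conditional prediction unchanged, and larger scales exaggerate whatever the condition contributes, which is why guidance on a single modality's condition need not strengthen cross-modal alignment.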

[152] Deep Learning-Based Multiclass Classification of Oral Lesions with Stratified Augmentation

Joy Naoum,Revana Salama,Ali Hamdi

Main category: cs.CV

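TL;DR: This study proposes a deep learning multiclass classifier for 16 types of oral lesions, using stratified data splitting, data augmentation, and oversampling to cope with limited and imbalanced data. It reports better accuracy, precision, and recall than existing methods, showing potential for early oral cancer detection.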

Details Motivation: Because benign and malignant oral lesions are hard to tell apart visually at an early stage, oral cancer is often diagnosed late, so a reliable computer-aided diagnosis system is needed to improve early detection. Method: Builds a deep learning multiclass classifier and combines stratified data splitting, advanced data augmentation, and oversampling to handle a limited and imbalanced dataset. Result: Experiments reach 83.33% accuracy, 89.12% precision, and 77.31% recall, which the authors report as significantly better than current state-of-the-art methods, with particularly strong results on minority classes. Conclusion: The proposed framework is effective at improving oral lesion classification and is an important step toward clinically trustworthy computer-aided diagnosis systems for early oral cancer detection. Abstract: Oral cancer is highly common across the globe and is mostly diagnosed during the later stages due to the close visual similarity among benign, precancerous, and malignant lesions in the oral cavity. Implementing computer aided diagnosis systems early on has the potential to greatly improve clinical outcomes. This research intends to use deep learning to build a multiclass classifier for sixteen different oral lesions. To overcome the challenges of limited and imbalanced datasets, the proposed technique combines stratified data splitting and advanced data augmentation and oversampling to perform the classification. The experimental results, which achieved 83.33 percent accuracy, 89.12 percent precision, and 77.31 percent recall, demonstrate the superiority of the suggested model over state of the art methods now in use. The suggested model demonstrates the effectiveness of oversampling and augmentation strategies in situations where minority class classification performance matters. As a first step toward trustworthy computer aided diagnostic systems for the early detection of oral cancer in clinical settings, the suggested framework shows promise.
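The combination of stratified splitting and oversampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helper names `stratified_split` and `oversample`, the 20% test fraction, and the use of plain random duplication as the oversampling rule are all hypothetical, since the abstract does not specify them.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        if len(indices) > 1:
            # At least one test sample, but never the whole class.
            n_test = min(len(indices) - 1, max(1, round(len(indices) * test_fraction)))
        else:
            n_test = 0  # a single sample stays in the training set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def oversample(train_idx, labels, seed=0):
    """Randomly duplicate minority-class indices until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for label in sorted(by_class):
        indices = by_class[label]
        balanced.extend(indices)
        balanced.extend(rng.choice(indices) for _ in range(target - len(indices)))
    return balanced


# Toy imbalanced label list (hypothetical data).
labels = ["a"] * 10 + ["b"] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
balanced = oversample(train_idx, labels)
```

Oversampling is applied only to the training indices here, so duplicated minority samples cannot leak into the held-out split.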

[153] MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training

Haotian Xue,Qi Chen,Zhonghao Wang,Xun Huang,Eli Shechtman,Jinrong Xie,Yongxin Chen

Main category: cs.CV

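TL;DR: MoGAN is a motion-centric post-training framework that needs no reward model or human preference data. Using a DiT-based optical-flow discriminator and a distribution-matching regularizer, it significantly improves the motion realism and temporal consistency of video diffusion models while preserving visual fidelity.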

Details Motivation: Existing video diffusion models perform well on frame-level fidelity but lack direct supervision of temporal consistency, so generated videos show jitter, ghosting, or implausible dynamics. Method: On top of a 3-step distilled video diffusion model, builds a DiT-based optical-flow discriminator to distinguish real from generated motion and adds a distribution-matching regularizer to preserve visual quality. Result: Experiments on Wan2.1-T2V-1.3B show that on VBench MoGAN raises the motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model; on VideoJAM-Bench the gains are +7.4% and +8.8% respectively, while aesthetic and image-quality scores are maintained or improved. A human study also shows a preference for MoGAN on motion quality. Conclusion: MoGAN effectively improves motion realism in video generation without sacrificing visual quality or inference efficiency, offering a practical path toward fast, high-quality video generation. Abstract: Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, allowing models to achieve low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Built atop a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity. With experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage is: https://xavihart.github.io/mogan.

[154] ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images

M. Naseer Subhani

Main category: cs.CV

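TL;DR: Proposes a self-prompting, point-supervised framework that improves SAM's segmentation performance on remote sensing images through a Refine-Requery-Reinforce loop.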

Details Motivation: Existing interactive segmentation models such as SAM perform well on natural images but poorly on remote sensing images because of domain shift and the scarcity of dense annotations. Method: Uses a Refine-Requery-Reinforce loop: coarse pseudo-masks are generated from initial points (Refine), the results are improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Result: Outperforms pretrained SAM and recent point-supervised methods on three remote sensing benchmarks: WHU, HRSID, and NWPU VHR-10. Conclusion: Self-prompting and semantic alignment provide an effective path for scalable adaptation of foundation segmentation models to remote sensing applications using point-level annotations. Abstract: Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting, point-supervised framework that adapts SAM to RSIs using only sparse point annotations. Our method employs a Refine-Requery-Reinforce loop, where coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM's segmentation quality and domain robustness through self-guided prompt adaptation. We evaluate our proposed method on three RSI benchmark datasets, including WHU, HRSID, and NWPU VHR-10, showing that our method consistently surpasses pretrained SAM and recent point-supervised segmentation methods. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.
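One plausible reading of the Requery step is that a box prompt is derived from the coarse pseudo-mask. The sketch below shows that single operation, turning a binary mask into its tight bounding box. The helper name `mask_to_box` and the `(x_min, y_min, x_max, y_max)` convention are assumptions for illustration; the abstract does not say how the paper constructs its boxes.

```python
def mask_to_box(mask):
    """Return (x_min, y_min, x_max, y_max) enclosing all nonzero pixels, or None."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None  # empty mask: no box prompt can be built
    return (min(cols), rows[0], max(cols), rows[-1])


# Toy pseudo-mask as a list of rows (hypothetical data).
pseudo_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_box(pseudo_mask)
```

In a loop of this kind, the resulting box would be fed back to the segmenter as a prompt, and the refined mask would replace the pseudo-mask for the next iteration.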

[155] Active Learning for GCN-based Action Recognition

Hichem Sahbi

Main category: cs.CV

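TL;DR: Proposes a label-efficient graph convolutional network (GCN) model that, through a novel acquisition function and bidirectional, stable GCN architectures, significantly improves skeleton-based action recognition when labeled data is scarce.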

Details Motivation: Existing GCNs for skeleton-based action recognition depend on large amounts of labeled data, which are often scarce in practice, so label efficiency needs to improve. Method: Designs a novel acquisition function based on an adversarial strategy to select key exemplars that are representative, diverse, and uncertain, and introduces bidirectional and stable GCN architectures to better model the mapping between the ambient and latent spaces. Result: Extensive evaluation on two challenging skeleton-based action recognition benchmarks shows significant improvements over prior work. Conclusion: The proposed label-efficient GCN model effectively reduces the dependence on labeled data while achieving better action recognition performance on multiple benchmarks. Abstract: Despite the notable success of graph convolutional networks (GCNs) in skeleton-based action recognition, their performance often depends on large volumes of labeled data, which are frequently scarce in practical settings. To address this limitation, we propose a novel label-efficient GCN model. Our work makes two primary contributions. First, we develop a novel acquisition function that employs an adversarial strategy to identify a compact set of informative exemplars for labeling. This selection process balances representativeness, diversity, and uncertainty. Second, we introduce bidirectional and stable GCN architectures. These enhanced networks facilitate a more effective mapping between the ambient and latent data spaces, enabling a better understanding of the learned exemplar distribution. Extensive evaluations on two challenging skeleton-based action recognition benchmarks reveal significant improvements achieved by our label-efficient GCNs compared to prior work.

[156] Qwen3-VL Technical Report

Shuai Bai,Yuxuan Cai,Ruizhe Chen,Keqin Chen,Xionghui Chen,Zesen Cheng,Lianghao Deng,Wei Ding,Chang Gao,Chunjiang Ge,Wenbin Ge,Zhifang Guo,Qidong Huang,Jie Huang,Fei Huang,Binyuan Hui,Shutong Jiang,Zhaohai Li,Mingsheng Li,Mei Li,Kaixin Li,Zicheng Lin,Junyang Lin,Xuejing Liu,Jiawei Liu,Chenglong Liu,Yang Liu,Dayiheng Liu,Shixuan Liu,Dunjie Lu,Ruilin Luo,Chenxu Lv,Rui Men,Lingchen Meng,Xuancheng Ren,Xingzhang Ren,Sibo Song,Yuchong Sun,Jun Tang,Jianhong Tu,Jianqiang Wan,Peng Wang,Pengfei Wang,Qiuyue Wang,Yuxuan Wang,Tianbao Xie,Yiheng Xu,Haiyang Xu,Jin Xu,Zhibo Yang,Mingkun Yang,Jianxin Yang,An Yang,Bowen Yu,Fei Zhang,Hang Zhang,Xi Zhang,Bo Zheng,Humen Zhong,Jingren Zhou,Fan Zhou,Jing Zhou,Yuanzhi Zhu,Ke Zhu

Main category: cs.CV

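TL;DR: Qwen3-VL is the most capable vision-language model in the Qwen series, supporting interleaved contexts of up to 256K tokens spanning text, images, and video, and leading on a range of multimodal benchmarks.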

Details Motivation: To improve vision-language models on long-context understanding, cross-modal alignment, and complex reasoning, meeting the demand for high-quality multimodal processing in real applications. Method: Introduces an enhanced interleaved-MRoPE, DeepStack integration, and text-based time alignment, and builds a model family with both dense and mixture-of-experts (MoE) architectures. Result: Significantly outperforms comparable models on pure-text understanding, long-context processing, and multimodal reasoning (e.g., MMMU, MathVista), supports a native 256K-token context window, and achieves precise temporal grounding. Conclusion: Qwen3-VL shows excellent performance across scales and architectures and is positioned as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code understanding in real-world scenarios. Abstract: We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures.
We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

[157] Continual Error Correction on Low-Resource Devices

Kirill Paramonov,Mete Ozay,Aristeidis Mystakidis,Nikolaos Tsalikidis,Dimitrios Sotos,Anastasios Drosou,Dimitrios Tzovaras,Hyunjun Kim,Kiseok Chang,Sangdok Mo,Namwoong Kim,Woojong Yoo,Jijoong Moon,Umberto Michieli

Main category: cs.CV

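TL;DR: Proposes a lightweight AI error correction system based on prototype updates, combining server-side knowledge distillation with on-device prototype adaptation to achieve efficient, low-overhead few-shot error correction on resource-constrained devices.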

Details Motivation: Existing AI error detection methods lack efficient correction mechanisms for resource-constrained devices, which degrades user experience. Method: Uses a server-side foundation model for knowledge distillation to train a lightweight device model, and performs few-shot error correction on the device through prototype updates, avoiding full model retraining. Result: Achieves over 50% error correction in one-shot scenarios on the Food-101 and Flowers-102 datasets, with forgetting below 0.02% and very low computational overhead, and validates practicality with an Android app. Conclusion: The system significantly improves the maintainability and user experience of AI models on edge devices while keeping storage and compute costs low, making it suitable for real-world deployment. Abstract: The proliferation of AI models in everyday devices has highlighted a critical challenge: prediction errors that degrade user experience. While existing solutions focus on error detection, they rarely provide efficient correction mechanisms, especially for resource-constrained devices. We present a novel system enabling users to correct AI misclassifications through few-shot learning, requiring minimal computational resources and storage. Our approach combines server-side foundation model training with on-device prototype-based classification, enabling efficient error correction through prototype updates rather than model retraining. The system consists of two key components: (1) a server-side pipeline that leverages knowledge distillation to transfer robust feature representations from foundation models to device-compatible architectures, and (2) a device-side mechanism that enables ultra-efficient error correction through prototype adaptation. We demonstrate our system's effectiveness on both image classification and object detection tasks, achieving over 50% error correction in one-shot scenarios on Food-101 and Flowers-102 datasets while maintaining minimal forgetting (less than 0.02%) and negligible computational overhead. Our implementation, validated through an Android demonstration app, proves the system's practicality in real-world scenarios.
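The idea of correcting errors through prototype updates rather than retraining can be illustrated with a nearest-prototype classifier. The sketch below is a generic version under stated assumptions: the class `PrototypeClassifier`, the interpolation step of 0.5, and the 2-D toy embeddings are hypothetical, and the paper's actual update rule is not given in the abstract.

```python
import math


class PrototypeClassifier:
    """Nearest-prototype classifier over fixed feature embeddings."""

    def __init__(self, prototypes):
        # prototypes: dict mapping class name -> embedding (list of floats)
        self.prototypes = {name: list(vec) for name, vec in prototypes.items()}

    def predict(self, embedding):
        # Pick the class whose prototype is closest in Euclidean distance.
        return min(self.prototypes, key=lambda c: math.dist(self.prototypes[c], embedding))

    def correct(self, embedding, true_label, step=0.5):
        """Move (or create) the true class prototype toward a misclassified sample."""
        if true_label not in self.prototypes:
            self.prototypes[true_label] = list(embedding)
            return
        proto = self.prototypes[true_label]
        self.prototypes[true_label] = [p + step * (e - p) for p, e in zip(proto, embedding)]


# Toy embeddings (hypothetical values); only the prototypes change, never the encoder.
clf = PrototypeClassifier({"pizza": [0.0, 0.0], "salad": [4.0, 0.0]})
sample = [1.8, 0.0]
before = clf.predict(sample)
clf.correct(sample, "salad")
after = clf.predict(sample)
```

A single user correction touches one small vector, which is why this style of update costs almost nothing in compute or storage compared with fine-tuning the network.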

[158] CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow

Ruisheng Han,Kanglei Zhou,Shuang Chen,Amir Atapour-Abarghouei,Hubert P. H. Shum

Main category: cs.CV

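TL;DR: Proposes CaFlow, a framework that combines counterfactual de-confounding with bidirectional time-conditioned flow for long-term action quality assessment, achieving state-of-the-art performance.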

Details Motivation: Existing methods rely on costly annotations or unidirectional temporal modeling, are vulnerable to spurious correlations and contextual confounders, and struggle to model the complex dynamics of long-term actions. Method: Designs the CaFlow framework, which includes a Causal Counterfactual Regularization (CCR) module that separates causal from confounding features in a self-supervised manner and strengthens robustness through counterfactual interventions, and a BiT-Flow module that models forward and backward dynamics under a cycle-consistency constraint to produce smoother, more coherent representations. Result: Experiments on multiple long-term AQA benchmarks show that CaFlow significantly outperforms existing methods and achieves state-of-the-art performance. Conclusion: Through bidirectional temporal modeling and counterfactual de-confounding, CaFlow effectively improves the accuracy and stability of long-term action quality assessment. Abstract: Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation. Long-term AQA, as in figure skating or rhythmic gymnastics, is especially challenging since it requires modeling extended temporal dynamics while remaining robust to contextual confounders. Existing approaches either depend on costly annotations or rely on unidirectional temporal modeling, making them vulnerable to spurious correlations and unstable long-term representations. To this end, we propose CaFlow, a unified framework that integrates counterfactual de-confounding with bidirectional time-conditioned flow. The Causal Counterfactual Regularization (CCR) module disentangles causal and confounding features in a self-supervised manner and enforces causal robustness through counterfactual interventions, while the BiT-Flow module models forward and backward dynamics with a cycle-consistency constraint to produce smoother and more coherent representations. Extensive experiments on multiple long-term AQA benchmarks demonstrate that CaFlow achieves state-of-the-art performance. Code is available at https://github.com/Harrison21/CaFlow

[159] Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

Tianyi Xiong,Yi Ge,Ming Li,Zuolong Zhang,Pranav Kulkarni,Kaishen Wang,Qi He,Zeying Zhu,Chenxi Liu,Ruibo Chen,Tong Zheng,Yanshuo Chen,Xiyao Wang,Renrui Zhang,Wenhu Chen,Heng Huang

Main category: cs.CV

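TL;DR: Multi-Crit is a benchmark for evaluating how well multimodal models judge under diverse, fine-grained evaluation criteria, revealing that current large models fall short in following pluralistic criteria and in flexibly switching between criteria.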

Details Motivation: To explore how well large multimodal models, when used as judges, follow diverse, fine-grained evaluation criteria, which current research has not studied sufficiently. Method: Builds the Multi-Crit benchmark, covering open-ended generation and verifiable reasoning tasks, collects challenging samples with multi-criterion human annotations through a rigorous data curation pipeline, and proposes three new metrics for pluralistic adherence, criterion-switching flexibility, and recognition of preference conflicts. Result: Analysis of 25 large models shows that (1) proprietary models still struggle to adhere consistently to pluralistic criteria, especially in open-ended evaluation; (2) open-source models lag further behind in flexibly following diverse criteria; and (3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Conclusion: Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation systems and is a pioneering study in this area. Abstract: Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria--especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.

[160] Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models

Naifu Zhang,Wei Tao,Xi Xiao,Qianpu Sun,Yuxin Zheng,Wentao Mo,Peiqiang Wang,Nan Zhang

Main category: cs.CV

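TL;DR: This paper proposes ADVLA, a new framework that applies adversarial perturbations directly to the features projected from the visual encoder into the textual feature space, disrupting the action predictions of Vision-Language-Action (VLA) models efficiently and stealthily while avoiding the high cost and conspicuous perturbations of conventional patch-based attacks.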

Details Motivation: Existing adversarial attacks on VLA models require costly end-to-end training, and the perturbation patches they generate are usually clearly visible, which limits practical use. A more efficient, low-amplitude, and sparse attack is therefore needed. Method: The ADVLA framework applies adversarial perturbations directly in the textual feature space of the visual encoder's output and introduces attention guidance along with three strategies: enhancing sensitivity, enforcing sparsity, and concentrating perturbations; combined with Top-K masking, it modifies fewer than 10% of image patches under an L∞=4/255 constraint. Result: Under the L∞=4/255 constraint, ADVLA modifies fewer than 10% of the patches while reaching an attack success rate of nearly 100%; the perturbations concentrate on critical regions and are almost imperceptible, and a single-step iteration takes about 0.06 seconds, significantly outperforming conventional patch attacks. Conclusion: ADVLA effectively weakens the downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoids the high training cost and conspicuous perturbations of traditional methods, and shows distinctive effectiveness and practical value for attacking VLA feature spaces. Abstract: In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies less than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.
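The two constraints named above, Top-K patch sparsity and an $L_{\infty}$ bound of 4/255, can be illustrated on toy data. The sketch below only shows how such a mask and clip might be applied to an already computed perturbation. The function name `sparse_linf_perturbation`, the use of raw attention scores for ranking, and the `keep_ratio` parameter are assumptions for illustration, not ADVLA's actual procedure.

```python
def sparse_linf_perturbation(patch_deltas, attention, keep_ratio=0.1, eps=4 / 255):
    """Keep perturbations only on the top-K attended patches and clip them to [-eps, eps]."""
    k = max(1, int(len(attention) * keep_ratio))
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    top_k = set(ranked[:k])
    masked = []
    for i, delta in enumerate(patch_deltas):
        if i in top_k:
            # L-infinity constraint: every component stays within eps.
            masked.append([max(-eps, min(eps, d)) for d in delta])
        else:
            # Patches outside the top-K set are left unperturbed.
            masked.append([0.0] * len(delta))
    return masked


# Toy attention scores and raw perturbations for 10 patches (hypothetical values).
attention = [0.01, 0.02, 0.30, 0.05, 0.02, 0.01, 0.03, 0.40, 0.10, 0.06]
patch_deltas = [[0.1, -0.1] for _ in attention]
result = sparse_linf_perturbation(patch_deltas, attention, keep_ratio=0.2)
```

With `keep_ratio=0.2`, only the two most attended patches (indices 7 and 2) carry any perturbation, and each component is bounded by 4/255.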

[161] Revolutionizing Glioma Segmentation & Grading Using 3D MRI - Guided Hybrid Deep Learning Models

Pandiyaraju V,Sreya Mynampati,Abishek Karthik,Poovarasan L,D. Saraswathi

Main category: cs.CV

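TL;DR: Proposes a hybrid deep learning model that combines U-Net segmentation with DenseNet-VGG classification and adds multi-head attention and spatial-channel attention for high-accuracy 3D segmentation and classification of glioma MRI data. Experiments report a Dice coefficient of 98% and classification accuracy of 99%, significantly outperforming traditional CNN models.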

Details Motivation: Gliomas have a high mortality rate, and early, accurate diagnosis is critical for treatment; existing methods are limited in feature extraction and model performance, so a more efficient and interpretable automated diagnostic model is needed. Method: Builds a hybrid framework that uses U-Net for tumor segmentation in 3D MRI, a dual-branch network combining DenseNet and VGG for classification, and multi-head attention and spatial-channel attention mechanisms; normalization, resampling, and data augmentation are applied to the high-dimensional 3D MRI data. Result: Segmentation reaches a Dice coefficient of 98% with a high Mean IoU; classification reaches 99% accuracy with correspondingly high precision, recall, and F1-score, outperforming traditional CNNs and attention-free methods. Conclusion: The hybrid deep learning framework performs strongly on glioma segmentation and classification, improves focus on clinically relevant features and model interpretability, and has the potential to help clinicians reach fast, reliable diagnosis and grading, supporting better treatment planning. Abstract: Gliomas are brain tumor types that have a high mortality rate which means early and accurate diagnosis is important for therapeutic intervention for the tumors. To address this difficulty, the proposed research will develop a hybrid deep learning model which integrates U-Net based segmentation and a hybrid DenseNet-VGG classification network with multihead attention and spatial-channel attention capabilities. The segmentation model will precisely demarcate the tumors in a 3D volume of MRI data guided by spatial and contextual information. The classification network which combines a branch of both DenseNet and VGG, will incorporate the demarcated tumor on which features with attention mechanisms would be focused on clinically relevant features. High-dimensional 3D MRI data could successfully be utilized in the model through preprocessing steps which are normalization, resampling, and data augmentation. Through a variety of measures the framework is evaluated: measures of performance in segmentation are Dice coefficient and Mean Intersection over Union (IoU) and measures of performance in classification are accuracy, precision, recall, and F1-score. The hybrid framework that has been proposed has demonstrated through physical testing that it has the capability of obtaining a Dice coefficient of 98% in tumor segmentation, and 99% on classification accuracy, outperforming traditional CNN models and attention-free methods. Utilizing multi-head attention mechanisms enhances notions of priority in aspects of the tumor that are clinically significant, and enhances interpretability and accuracy.
The results suggest that the framework has great potential to facilitate timely and reliable diagnosis and grading of glioma by clinicians, allowing for better planning of patient treatment.
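The two segmentation measures named in the abstract, the Dice coefficient and Intersection over Union, have standard definitions that can be computed directly from binary masks. The sketch below uses flattened Python lists. The function name `dice_and_iou` and the convention of scoring two empty masks as perfect agreement are assumptions for illustration, and the toy masks are not data from the paper.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two flat binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - intersection
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated here as perfect agreement
    dice = 2 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou


# Toy flattened masks (hypothetical values): 2 voxels overlap out of 3 predicted and 3 true.
pred = [1, 1, 1, 0, 0, 0]
target = [1, 1, 0, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
```

For the same pair of masks Dice is never lower than IoU (here 2/3 versus 1/2), so the two figures are not interchangeable when comparing reported results.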

[162] Seeing without Pixels: Perception from Camera Trajectories

Zihui Xue,Kristen Grauman,Dima Damen,Andrew Zisserman,Tengda Han

Main category: cs.CV

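TL;DR: This paper presents the first systematic study of whether video content can be perceived from the camera trajectory alone rather than from video pixels. It proposes CamFormer, which maps camera pose trajectories into a joint embedding space aligned with natural language, and shows that camera trajectory is a lightweight, robust, and versatile modality for perceiving video content.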

Details Motivation: To explore whether video content can be understood from camera motion trajectories alone, without relying on video pixels, and to exploit this seemingly simple but potentially information-rich signal. Method: Proposes a contrastive learning framework to train an encoder named CamFormer, which projects camera pose trajectories into a joint embedding space aligned with natural language, and validates it on a variety of downstream tasks. Result: Experiments show that camera trajectories effectively reveal video content (such as "what you are doing" or "what you are observing"), and that CamFormer performs well on cross-modal alignment, classification, and temporal analysis while remaining robust across different camera pose estimation methods. Conclusion: Camera trajectory is a lightweight, robust, and versatile modality that effectively reflects video content and offers a new perspective on video understanding. Abstract: Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
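Contrastive alignment between two embedding spaces is commonly trained with an InfoNCE-style loss, in which each trajectory embedding should score highest against its own paired text embedding. The sketch below is a generic, one-directional version in plain Python. The abstract does not state which contrastive objective CamFormer uses, so the loss form, the function name `info_nce`, and the temperature of 0.07 are all assumptions for illustration.

```python
import math


def info_nce(traj_embs, text_embs, temperature=0.07):
    """Mean cross-entropy of matching each trajectory to its paired text (row i with row i)."""

    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    traj = [normalize(v) for v in traj_embs]
    text = [normalize(v) for v in text_embs]
    loss = 0.0
    for i, t in enumerate(traj):
        # Cosine similarity to every text embedding, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(t, s)) / temperature for s in text]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(traj)


# Toy 2-D embeddings (hypothetical values): aligned pairs versus swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Correctly paired embeddings give a loss near zero, while swapped pairs are penalized heavily. A symmetric variant would add the same term computed from text to trajectory.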

[163] Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Yusuf Dalva,Guocheng Gordon Qian,Maya Goldenberg,Tsai-Shien Chen,Kfir Aberman,Sergey Tulyakov,Pinar Yanardag,Kuan-Chieh Jackson Wang

Main category: cs.CV

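TL;DR: Proposes Canvas-to-Image, a framework that consolidates multiple heterogeneous control signals into a unified canvas interface for high-fidelity image generation.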

Details Motivation: Existing diffusion models struggle to follow user intent precisely under multimodal, compositional control (text, reference images, pose, layout, and so on). Method: Encodes the various control signals into a single composite canvas image and adopts a Multi-Task Canvas Training strategy that jointly optimizes the model's understanding and integration of heterogeneous controls within a unified paradigm. Result: Significantly outperforms existing methods on multi-person composition, pose control, layout constraints, and multi-control generation, with particularly strong identity preservation and control adherence. Conclusion: Canvas-to-Image reproduces complex user intent with high fidelity and supports flexible, unified multimodal control of image generation. Abstract: While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.