cs.CL

[1] Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability

Hen-Hsen Huang

Main category: cs.CL

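TL;DR: This paper argues that, for settings with limited resources and expertise, research should shift from sophistication at scale to robust simplicity, achieving efficient and sustainable LLM deployment through retrofitted pretrained architectures, lightweight fine-tuning, and related methods.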

Details Motivation: Existing LLM efficiency methods (such as MoE, speculative decoding, and complex RAG) mainly suit large technology companies with vast infrastructure; in other, resource-constrained settings they instead add overhead and environmental burden, leading to technological inequality. Method: Proposes a new research agenda: making pretrained model architectures more efficient without retraining, developing lightweight fine-tuning that preserves alignment, making reasoning economical, enabling dynamic knowledge management without heavy RAG pipelines, and adopting Overhead-Aware Efficiency (OAE) as a new benchmark. Result: Offers non-hyperscale organizations a viable path to efficient LLM deployment, lowering adoption cost, carbon emissions, and technical barriers. Conclusion: Redefining efficiency to include sustainability, fairness, and accessibility helps democratize LLM technology and avoids amplifying inequality and resource waste. Abstract: Large language models (LLMs) have become indispensable, but the most celebrated efficiency methods -- mixture-of-experts (MoE), speculative decoding, and complex retrieval-augmented generation (RAG) -- were built for hyperscale providers with vast infrastructure and elite teams. Outside that context, their benefits collapse into overhead, fragility, and wasted carbon. The result is that a handful of Big Tech companies benefit, while thousands of hospitals, schools, governments, and enterprises are left without viable options. We argue that the next frontier is not greater sophistication at scale, but robust simplicity: efficiency that thrives under modest resources and minimal expertise. We propose a new research agenda: retrofitting pretrained models with more efficient architectures without retraining, inventing lightweight fine-tuning that preserves alignment, making reasoning economical despite long chains of thought, enabling dynamic knowledge management without heavy RAG pipelines, and adopting Overhead-Aware Efficiency (OAE) as a standard benchmark. By redefining efficiency to include adoption cost, sustainability, and fairness, we can democratize LLM deployment -- ensuring that optimization reduces inequality and carbon waste rather than amplifying them.

[2] Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology

Tcharlies Schmitz

Main category: cs.CL

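TL;DR: This paper proposes Harmonic Token Projection (HTP), a reversible and deterministic framework for generating text embeddings without training, vocabularies, or stochastic parameters. It encodes each token as a harmonic trajectory derived from its Unicode integer representation, and experiments on semantic similarity tasks show efficient, low-latency performance that is stable across languages.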

Details Motivation: To provide a transparent, efficient, training-free text embedding method that avoids the dependence of conventional neural embeddings on data statistics and optimization. Method: HTP takes each token's Unicode integer representation and encodes it as a harmonic trajectory, establishing a bijective mapping between discrete symbols and continuous vector space, and estimates semantic similarity through geometric alignment. Result: On STS-B and its multilingual extension, HTP reaches a Spearman correlation of ρ = 0.68 in English and remains stable across ten languages, with very low computational cost and latency below one millisecond per sentence pair. Conclusion: HTP shows that meaningful semantic relations can be produced by deterministic geometric structure alone, offering a transparent and efficient alternative to data-driven embeddings. Abstract: This paper introduces the Harmonic Token Projection (HTP), a reversible and deterministic framework for generating text embeddings without training, vocabularies, or stochastic parameters. Unlike neural embeddings that rely on statistical co-occurrence or optimization, HTP encodes each token analytically as a harmonic trajectory derived from its Unicode integer representation, establishing a bijective and interpretable mapping between discrete symbols and continuous vector space. The harmonic formulation provides phase-coherent projections that preserve both structure and reversibility, enabling semantic similarity estimation from purely geometric alignment. Experimental evaluation on the Semantic Textual Similarity Benchmark (STS-B) and its multilingual extension shows that HTP achieves a Spearman correlation of ρ = 0.68 in English, maintaining stable performance across ten languages with negligible computational cost and sub-millisecond latency per sentence pair. This demonstrates that meaningful semantic relations can emerge from deterministic geometry, offering a transparent and efficient alternative to data-driven embeddings. Keywords: Harmonic Token Projection, reversible embedding, deterministic encoding, semantic similarity, multilingual representation.

[3] A centroid based framework for text classification in ITSM environments

Hossein Mohanna,Ali Ait-Bachir

Main category: cs.CL

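TL;DR: Proposes a dual-embedding centroid-based text classification framework for hierarchical classification in IT Service Management, combining efficiency, interpretability, and fast incremental updates.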

Details Motivation: In IT Service Management, support tickets must be assigned to categories in a tree-structured taxonomy, and conventional methods struggle to balance efficiency with interpretability. Method: Maintains dual (semantic and lexical) centroid representations per category and combines the two at inference time through reciprocal rank fusion, giving efficient and interpretable classification. Result: Evaluated on 8,968 tickets across 123 categories, performance is on par with SVMs (hierarchical F1: 0.731 vs 0.727), training is 5.9 times faster, incremental updates are up to 152 times faster, and batch inference is 8.6-8.8 times faster when embedding computation is excluded. Conclusion: The method keeps competitive accuracy while markedly improving training and update efficiency and remaining interpretable, which suits production ITSM systems that prioritize operational efficiency. Abstract: Text classification with hierarchical taxonomies is a fundamental requirement in IT Service Management (ITSM) systems, where support tickets must be categorized into tree-structured taxonomies. We present a dual-embedding centroid-based classification framework that maintains separate semantic and lexical centroid representations per category, combining them through reciprocal rank fusion at inference time. The framework achieves performance competitive with Support Vector Machines (hierarchical F1: 0.731 vs 0.727) while providing interpretability through centroid representations. Evaluated on 8,968 ITSM tickets across 123 categories, this method achieves 5.9 times faster training and up to 152 times faster incremental updates, with an 8.6-8.8 times speedup across batch sizes (100-1000 samples) when excluding embedding computation. These results make the method suitable for production ITSM environments prioritizing interpretability and operational efficiency.
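
The abstract above combines a semantic and a lexical ranking through reciprocal rank fusion. As a rough illustration of that fusion step only, here is a minimal sketch. The function name, the example category labels, and the smoothing constant k = 60 (a commonly used default) are assumptions for illustration; the abstract does not state which constant or tie-breaking rule the authors use.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of labels into one ranking.

    Each label receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. k = 60 is an assumed default, not a value
    taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, label in enumerate(ranking, start=1):
            scores[label] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically so the
    # result is deterministic.
    return sorted(scores, key=lambda label: (-scores[label], label))


# Hypothetical rankings from a semantic and a lexical centroid matcher.
semantic_ranking = ["network", "hardware", "software"]
lexical_ranking = ["network", "access", "hardware"]
fused = reciprocal_rank_fusion([semantic_ranking, lexical_ranking])
print(fused)
```

Because the fusion uses ranks rather than raw similarity scores, the two centroid spaces do not need to be calibrated against each other, which is one plausible reason to prefer it over score averaging.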

[4] PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

Yongfu Xue

Main category: cs.CL

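TL;DR: PIRA is a new training paradigm that improves the data efficiency of reward models and mitigates overoptimization by reformulating instructions, aggregating rewards across tasks, and stabilizing value-head outputs.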

Details Motivation: Traditional discriminative reward models suffer from low data efficiency and are vulnerable to reward overoptimization, so more effective training methods are needed to improve alignment. Method: Proposes the PIRA framework: (1) reformulating question-answer pairs into preference-based instructions; (2) aggregating rewards from different preference tasks; (3) averaging value-head outputs under different dropout rates. Result: Extensive experiments indicate that PIRA is effective at improving data efficiency, reducing bias, and stabilizing reward outputs. Conclusion: PIRA addresses data efficiency and overoptimization in reward models and strengthens robustness and alignment performance. Abstract: Reward models are crucial for aligning Large Language Models (LLMs) with human preferences but face two representative challenges. First, traditional discriminative reward models usually concatenate questions and responses directly as input, resulting in low data efficiency. Second, reward models are vulnerable to reward overoptimization. We propose PIRA, a training paradigm addressing these issues through three strategies: (1) reformulating question-answer pairs into preference-based instructions for clearer and more explicit task specification, (2) aggregating rewards from diverse preference tasks to reduce bias and improve robustness, and (3) averaging value-head outputs under varying dropout rates to stabilize rewards. Extensive experiments have demonstrated the effectiveness of PIRA.
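
Strategy (3) in the abstract above, averaging value-head outputs under varying dropout rates, can be pictured with a toy example. This is only a sketch under stated assumptions: the linear value head, the inverted-dropout formulation, the specific dropout rates, and all names are hypothetical and are not taken from the PIRA paper.

```python
import random


def value_head(hidden, weights, dropout_rate, rng):
    """Toy linear value head with inverted dropout on the hidden state.

    Each hidden unit is kept with probability (1 - dropout_rate) and
    rescaled by 1 / (1 - dropout_rate) so the expected output is unchanged.
    """
    keep = 1.0 - dropout_rate
    total = 0.0
    for h, w in zip(hidden, weights):
        if rng.random() < keep:
            total += (h / keep) * w
    return total


def averaged_reward(hidden, weights, dropout_rates, seed=0):
    """Average the value-head output over several dropout rates."""
    rng = random.Random(seed)
    outputs = [value_head(hidden, weights, p, rng) for p in dropout_rates]
    return sum(outputs) / len(outputs)


# Hypothetical hidden state and value-head weights.
hidden = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 2.0]
print(averaged_reward(hidden, weights, dropout_rates=[0.1, 0.3, 0.5]))
```

Averaging over several stochastic passes reduces the variance of any single pass, which is the intuition behind using it to stabilize rewards.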

Mann Khatri,Mirza Yusuf,Rajiv Ratn Shah,Ponnurangam Kumaraguru

Main category: cs.CL

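TL;DR: This paper studies how restructuring legal documents, defining rhetorical roles, and emulating court reasoning can improve the zero-shot performance of large language models in the legal domain; experiments show that these methods raise F1 scores.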

Details Motivation: Large language models perform well in general domains but struggle in specialized areas such as law because they lack domain-specific pretraining, and legal texts are usually long and intricate, making them hard to process effectively. Method: Runs zero-shot experiments on three Indian legal judgment prediction datasets, exploring three approaches: reorganizing documents by rhetorical role, defining rhetorical roles to introduce legal terminology, and emulating the step-by-step reasoning of courts. Result: Organizing the data or explaining key legal terms significantly improves model performance, with F1 gains over the baseline ranging from about 1.5% to 4.36%. Conclusion: Presenting information in a structured way and introducing legal terminology can strengthen LLM understanding and reasoning in the legal domain without additional domain fine-tuning. Abstract: Large Language Models (LLMs), trained on extensive datasets from the web, exhibit remarkable general reasoning skills. Despite this, they often struggle in specialized areas like law, mainly because they lack domain-specific pretraining. The legal field presents unique challenges, as legal documents are generally long and intricate, making it hard for models to process the full text efficiently. Previous studies have examined in-context approaches to address the knowledge gap, boosting model performance in new domains without full domain alignment. In our paper, we analyze model behavior on legal tasks by conducting experiments in three areas: (i) reorganizing documents based on rhetorical roles to assess how structured information affects long context processing and model decisions, (ii) defining rhetorical roles to familiarize the model with legal terminology, and (iii) emulating the step-by-step reasoning of courts regarding rhetorical roles to enhance model reasoning. These experiments are conducted in a zero-shot setting across three Indian legal judgment prediction datasets. Our results reveal that organizing data or explaining key legal terms significantly boosts model performance, with a minimum increase of ~1.5% and a maximum improvement of 4.36% in F1 score compared to the baseline.

[6] MindSET: Advancing Mental Health Benchmarking through Large-Scale Social Media Data

Saad Mankarious,Ayah Zirikly,Daniel Wiechmann,Elma Kerz,Edward Kempa,Yu Qiao

Main category: cs.CL

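TL;DR: This paper presents MindSET, a new benchmark dataset for mental health analysis built from self-reported diagnoses on Reddit. It contains over 13 million annotated posts covering seven mental health conditions, and both data quality and model performance exceed those of existing benchmarks.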

Details Motivation: Existing benchmark datasets for mental health research are outdated, limited in size, insufficiently cleaned, and poorly equipped to handle the diversity of social media content (such as multilingual and harmful material), so they need to be updated and improved. Method: Collects self-reported diagnosis data from Reddit and applies rigorous preprocessing (language filtering and removal of NSFW and duplicate content) to build the large annotated dataset MindSET; performs a linguistic analysis with LIWC; and validates the dataset through binary classification experiments with fine-tuned language models and Bag-of-Words features. Result: MindSET contains over 13 million annotated posts, more than twice the size of previous benchmarks; F1 for autism detection improves by up to 18 points; and language models trained on MindSET consistently outperform those trained on earlier benchmarks. Conclusion: MindSET provides a stronger, higher-quality data foundation for research at the intersection of social media and mental health, supporting early risk detection and deeper analysis of psychological trends. Abstract: Social media data has become a vital resource for studying mental health, offering real-time insights into thoughts, emotions, and behaviors that traditional methods often miss. Progress in this area has been facilitated by benchmark datasets for mental health analysis; however, most existing benchmarks have become outdated due to limited data availability, inadequate cleaning, and the inherently diverse nature of social media content (e.g., multilingual and harmful material). We present a new benchmark dataset, MindSET, curated from Reddit using self-reported diagnoses to address these limitations. The dataset contains over 13M annotated posts across seven mental health conditions, more than twice the size of previous benchmarks. To ensure data quality, we applied rigorous preprocessing steps, including language filtering and removal of Not Safe for Work (NSFW) and duplicate content. We further performed a linguistic analysis using LIWC to examine psychological term frequencies across the eight groups represented in the dataset. To demonstrate the dataset utility, we conducted binary classification experiments for diagnosis detection using both fine-tuned language models and Bag-of-Words (BoW) features. Models trained on MindSET consistently outperformed those trained on previous benchmarks, achieving up to an 18-point improvement in F1 for Autism detection. Overall, MindSET provides a robust foundation for researchers exploring the intersection of social media and mental health, supporting both early risk detection and deeper analysis of emerging psychological trends.

[7] Semantics Meet Signals: Dual Codebook Representation Learning for Generative Recommendation

Zheng Hui,Xiaokai Wei,Reza Shirkavand,Chen Wang,Weizhi Zhang,Alejandro Peláez,Michelle Gong

Main category: cs.CL

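TL;DR: Proposes FlexCode, a popularity-aware generative recommendation framework that improves long-tail item recommendation by adaptively allocating tokens between a collaborative filtering codebook and a semantic codebook.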

Details Motivation: Existing generative recommendation methods encode all items with a single codebook, ignoring the difference between popular items, which are rich in collaborative signals, and long-tail items, which depend on semantics; this limits representational efficiency and generalization. Method: Designs the FlexCode framework with a dual-codebook structure (a collaborative filtering codebook and a semantic codebook), uses a lightweight MoE to dynamically allocate a fixed token budget, and adds alignment and smoothness objectives to keep representations coherent across the popularity spectrum. Result: Experiments on public and industrial-scale datasets show that FlexCode outperforms strong baselines, improving both overall accuracy and robustness on long-tail items. Conclusion: FlexCode offers a new mechanism for token representation in generative recommendation, balancing memorization and generalization in token-based recommendation models. Abstract: Generative recommendation has recently emerged as a powerful paradigm that unifies retrieval and generation, representing items as discrete semantic tokens and enabling flexible sequence modeling with autoregressive models. Despite its success, existing approaches rely on a single, uniform codebook to encode all items, overlooking the inherent imbalance between popular items rich in collaborative signals and long-tail items that depend on semantic understanding. We argue that this uniform treatment limits representational efficiency and hinders generalization. To address this, we introduce FlexCode, a popularity-aware framework that adaptively allocates a fixed token budget between a collaborative filtering (CF) codebook and a semantic codebook. A lightweight MoE dynamically balances CF-specific precision and semantic generalization, while an alignment and smoothness objective maintains coherence across the popularity spectrum. We perform experiments on both public and industrial-scale datasets, showing that FlexCode consistently outperforms strong baselines. FlexCode provides a new mechanism for token representation in generative recommenders, achieving stronger accuracy and tail robustness, and offering a new perspective on balancing memorization and generalization in token-based recommendation models.

[8] Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic

Saleh Almohaimeed,May Alsofyani,Saad Almohaimeed,Mansour Al Ghanim,Liqiang Wang

Main category: cs.CL

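TL;DR: This paper introduces Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset, runs systematic experiments with large language models and a range of prompt engineering techniques, and proposes a new GAT corrector method that improves Arabic text-to-SQL performance.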

Details Motivation: Arabic lacks research and datasets for cross-domain, context-dependent text-to-SQL, which limits natural language interfaces to databases in that language; a dedicated dataset and effective solutions are therefore needed. Method: Builds the Ar-SParC dataset with 3,450 question sequences (10,225 questions in total); runs 40 experiments with two large models, GPT-3.5-turbo and GPT-4.5-turbo, combined with four question representation methods and six in-context learning techniques; and proposes the GAT corrector to improve the accuracy of generated SQL, analyzing its effectiveness through an ablation study. Result: The GAT corrector yields average gains of 1.9% in execution accuracy (EX) and 1.9% in interaction accuracy (IX) under zero-shot settings, and 1.72% EX and 0.92% IX under in-context learning settings. Conclusion: Ar-SParC fills the gap for Arabic in context-dependent text-to-SQL, and the effectiveness of the GAT corrector suggests that post-processing designed around language characteristics can improve performance, offering a direction for text-to-SQL research in languages other than English. Abstract: In recent years, the task of cross-domain, context-dependent text-to-SQL has received significant attention. It enables users with no prior knowledge of SQL to have a conversation with databases using natural language. However, most of the available datasets and research have been conducted in English, along with some work in Chinese. To date, no effort has been made to address this task in the Arabic language. In this paper, we introduce Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset. The dataset consists of 3,450 sequences of interrelated questions, each sequence containing an average of approximately three questions, which results in a total of 10,225 questions along with their corresponding SQL queries. We conducted 40 experiments on the Ar-SParC dataset using two large language models, GPT-3.5-turbo and GPT-4.5-turbo, applying 10 different prompt engineering techniques, including four question representation methods and six in-context learning techniques. Furthermore, we developed a novel approach named GAT corrector, which enhanced the performance across all 40 experiments, yielding an average improvement of 1.9% in execution accuracy (EX) and 1.9% in interaction accuracy (IX) under zero-shot settings, and an average increase of 1.72% EX and 0.92% IX under in-context learning settings. Finally, we conducted an ablation study with two more experiments to explain why the GAT corrector outperformed the previous GAT verifier technique, particularly for the Arabic language.

[9] Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes

Matthew W. Kenaston,Umair Ayub,Mihir Parmar,Muhammad Umair Anjum,Syed Arsalan Ahmed Naqvi,Priya Kumar,Samarth Rawal,Aadel A. Chaudhuri,Yousef Zakharia,Elizabeth I. Heath,Tanios S. Bekaii-Saab,Cui Tao,Eliezer M. Van Allen,Ben Zhou,YooJung Choi,Chitta Baral,Irbaz Bin Riaz

Main category: cs.CL

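TL;DR: This study develops a hierarchical taxonomy for identifying GPT-4 reasoning errors on real oncology notes. It finds reasoning errors in 23% of interpretations, with confirmation bias and anchoring bias most common, and these errors are associated with guideline-discordant and potentially harmful clinical recommendations.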

Details Motivation: Although large language models score well on clinical benchmarks, they may reach correct conclusions through faulty reasoning; this poses a safety risk for oncology decision support that accuracy-based evaluation cannot capture. Method: Uses a two-cohort retrospective design: annotates 600 reasoning traces on breast and pancreatic cancer notes from the CORAL dataset to build a three-tier taxonomy, then validates the taxonomy on 822 responses to prostate cancer consult notes covering localized through metastatic disease and multiple tasks. Result: Reasoning errors occurred in 23% of interpretations and were the dominant error type; confirmation bias and anchoring bias were most common and were associated with guideline-discordant and potentially harmful recommendations; state-of-the-art automated evaluators could detect that an error was present but could not reliably classify its subtype. Conclusion: Large language models may produce fluent but clinically unsafe recommendations when their reasoning is flawed; the proposed taxonomy offers a generalizable framework for evaluating and improving reasoning fidelity and should be applied before clinical deployment. Abstract: Despite high performance on clinical benchmarks, large language models may reach correct conclusions through faulty reasoning, a failure mode with safety implications for oncology decision support that is not captured by accuracy-based evaluation. In this two-cohort retrospective study, we developed a hierarchical taxonomy of reasoning errors from GPT-4 chain-of-thought responses to real oncology notes and tested its clinical relevance. Using breast and pancreatic cancer notes from the CORAL dataset, we annotated 600 reasoning traces to define a three-tier taxonomy mapping computational failures to cognitive bias frameworks. We validated the taxonomy on 822 responses from prostate cancer consult notes spanning localized through metastatic disease, simulating extraction, analysis, and clinical recommendation tasks. Reasoning errors occurred in 23 percent of interpretations and dominated overall errors, with confirmation bias and anchoring bias most common. Reasoning failures were associated with guideline-discordant and potentially harmful recommendations, particularly in advanced disease management. Automated evaluators using state-of-the-art language models detected error presence but could not reliably classify subtypes. These findings show that large language models may provide fluent but clinically unsafe recommendations when reasoning is flawed. The taxonomy provides a generalizable framework for evaluating and improving reasoning fidelity before clinical deployment.

[10] Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

Bharadwaj Yadavalli

Main category: cs.CL

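TL;DR: This paper proposes Dynamic Template Selection (DTS), which adaptively matches response templates to query complexity, substantially reducing output token cost in LLM deployments without harming response quality.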

Details Motivation: Uniform prompting strategies are inefficient across queries of differing complexity, and because output tokens cost far more than input tokens, the result is wasted resources. Method: Proposes Dynamic Template Selection (DTS), which uses two routing mechanisms, an MLP and RoBERTa, to choose a suitable response template based on query complexity, and evaluates it on the MMLU dataset. Result: The MLP router reaches 90.5% routing accuracy on held-out test data, slightly above RoBERTa's 89.5% and with fewer parameters; routing decisions generalize across three major LLM providers (GPT-4, Gemini, Claude), with token consumption reduced by 32.6% to 33.9%. Conclusion: DTS delivers token savings that generalize across platforms, offering a practical way to lower LLM deployment cost. Abstract: Contemporary large language model deployments typically employ uniform prompting strategies across diverse query types, applying verbose response patterns to both complex analytical tasks and straightforward factual questions. This one-size-fits-all methodology leads to substantial token inefficiency, a concern amplified by the significant cost differential between input and output tokens--the latter commanding 4-8x higher prices across major providers. We present Dynamic Template Selection (DTS), which adaptively matches response templates to query complexity, achieving significant cost reductions without compromising response quality. We compared two routing approaches: a simple MLP that uses pre-computed embeddings and a more complex fine-tuned RoBERTa transformer. Through comprehensive evaluation on 1,000 MMLU questions, we find that the MLP router achieves 90.5% routing accuracy on held-out test data, marginally exceeding RoBERTa's performance (89.5%) despite utilizing 125M fewer parameters. Notably, our empirical analysis reveals provider-agnostic behavior in template selection--routing decisions generalize effectively across 3 major LLM providers (OpenAI GPT-4, Google Gemini, and Anthropic Claude), as validated through 9,000 production API calls. While routing accuracy remains consistent at 90.5% across providers, observed token reductions vary from 32.6% to 33.9%, reflecting provider-specific generation characteristics. This work contributes several key elements: formal problem formulation with theoretical grounding in machine learning, four algorithms with corresponding complexity analyses, and extensive empirical validation across production systems.

[11] LLMs-Powered Accurate Extraction, Querying and Intelligent Management of Literature derived 2D Materials Data

Lijun Shang,Yadong Yu,Wenqiang Kang,Jian Zhou,Dongyue Gao,Pan Xiang,Zhe Liu,Mengyan Dai,Zhonglu Guo,Zhimei Sun

Main category: cs.CL

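TL;DR: This paper discusses the use of two-dimensional materials in energy storage and conversion and stresses the importance of extracting key information from published papers.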

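Details Motivation: Two-dimensional materials are widely used in energy applications because of their unique physicochemical and electronic properties, but information about their synthesis methods and properties is scattered across a large body of literature and is hard to obtain efficiently. Method: Analyzes published research papers to extract key information about the properties and preparation methods of 2D materials. Result: Important information on 2D materials can be organized systematically, helping to speed up materials research and development. Conclusion: Systematic extraction of information from the literature is an effective route to advancing the use of 2D materials in energy applications. Abstract: Two-dimensional (2D) materials have shown widespread applications in energy storage and conversion owing to their unique physicochemical and electronic properties. Most of the valuable information for the materials, such as their properties and preparation methods, is included in the published research papers. However, due to the dispersion of synthe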

[12] Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

Trung Cuong Dang,David Mohaisen

Main category: cs.CL

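TL;DR: Proposes a new multi-prefix memorization framework for detecting memorized training data in large language models, identifying data leakage more comprehensively by measuring the diversity of retrieval paths to a memorized sequence.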

Details Motivation: Existing definitions of memorization fall short of capturing the phenomenon in aligned models, so a more comprehensive way is needed to assess verbatim memorization of training data and the privacy and copyright risks it brings. Method: Defines a sequence as memorized if an external adversarial search can find a sufficient number of distinct prefixes that elicit it, and uses multi-prefix extractability to measure the robustness and encoding depth of a memory. Result: Experiments on open-source and aligned chat models show that the method distinguishes memorized from non-memorized content and is more robust than single-prefix approaches. Conclusion: The multi-prefix memorization framework provides a more reliable and practical tool for auditing data leakage in large language models. Abstract: Large language models, trained on massive corpora, are prone to verbatim memorization of training data, creating significant privacy and copyright risks. While previous works have proposed various definitions for memorization, many exhibit shortcomings in comprehensively capturing this phenomenon, especially in aligned models. To address this, we introduce a novel framework: multi-prefix memorization. Our core insight is that memorized sequences are deeply encoded and thus retrievable via a significantly larger number of distinct prefixes than non-memorized content. We formalize this by defining a sequence as memorized if an external adversarial search can identify a target count of distinct prefixes that elicit it. This framework shifts the focus from single-path extraction to quantifying the robustness of a memory, measured by the diversity of its retrieval paths. Through experiments on open-source and aligned chat models, we demonstrate that our multi-prefix definition reliably distinguishes memorized from non-memorized data, providing a robust and practical tool for auditing data leakage in LLMs.
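
The definition in the abstract above (a sequence counts as memorized if a search finds a target number of distinct prefixes that elicit it) can be sketched as a simple counting check. This is a toy illustration under loud assumptions: the `generate` callable, the dictionary stand-in for a model, and the fixed candidate list are hypothetical, and the paper's adversarial prefix search is replaced here by iterating over prefixes that are already given.

```python
def is_memorized(target, candidate_prefixes, generate, min_prefixes):
    """Return True if at least `min_prefixes` distinct prefixes elicit `target`.

    `generate` maps a prefix string to the model's continuation. In this
    sketch the candidate prefixes are supplied up front; a real audit
    would search for them.
    """
    eliciting = set()
    for prefix in candidate_prefixes:
        if prefix in eliciting:
            continue  # count each distinct prefix once
        if generate(prefix).startswith(target):
            eliciting.add(prefix)
            if len(eliciting) >= min_prefixes:
                return True
    return False


# Hypothetical stand-in for a language model: a fixed prefix -> continuation map.
toy_model = {
    "Call me": "Ishmael.",
    "The narrator is": "Ishmael.",
    "Moby-Dick opens with": "Ishmael.",
    "Hello": "world",
}


def generate(prefix):
    return toy_model.get(prefix, "")


print(is_memorized("Ishmael.", list(toy_model), generate, min_prefixes=3))
print(is_memorized("world", list(toy_model), generate, min_prefixes=3))
```

The threshold `min_prefixes` plays the role of the "target count" in the abstract: raising it demands more independent retrieval paths before a sequence is flagged.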

[13] SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

Jiaojiao Han,Wujiang Xu,Mingyu Jin,Mengnan Du

Main category: cs.CL

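TL;DR: SAGE is an agent-based framework that explains features extracted by sparse autoencoders (SAEs) through an active, explanation-driven process, significantly improving generative and predictive accuracy.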

Details Motivation: The internal mechanisms of large language models are opaque; SAEs help decompose representations into interpretable features, but explaining those features remains difficult. Method: Proposes the SAGE framework, which systematically formulates multiple explanations for each feature, designs targeted experiments to test them, and iteratively refines the explanations based on activation feedback. Result: Experiments on SAE features from several language models show that SAGE significantly outperforms state-of-the-art baselines in generative and predictive accuracy. Conclusion: Through active reasoning and empirical feedback, SAGE improves the interpretability of SAE features in LLMs. Abstract: Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE AGentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.

[14] Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

Asad Aali,Muhammad Ahmed Mohsin,Vasiliki Bikia,Arnav Singhvi,Richard Gaus,Suhana Bedi,Hejie Cui,Miguel Fuentes,Alyssa Unell,Yifan Mai,Jordan Cahoon,Michael Pfeffer,Roxana Daneshjou,Sanmi Koyejo,Emily Alsentzer,Percy Liang,Christopher Potts,Nigam H. Shah,Akshay S. Chaudhari

Main category: cs.CL

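TL;DR: This paper presents a reproducible framework combining DSPy and HELM that uses structured prompting (in particular, prompts that elicit chain-of-thought reasoning) to evaluate large language models more accurately. It finds that conventional fixed prompts underestimate model performance and distort leaderboard rankings, whereas structured prompting approaches each model's performance ceiling more consistently; the tools are open-sourced.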

Details Motivation: Existing language model evaluation frameworks such as HELM rely on fixed prompts that do not generalize across models, giving inaccurate performance estimates that may understate true capability; a method that adapts to each model and systematically optimizes prompts is needed for more reliable evaluation. Method: Builds a reproducible integrated DSPy+HELM framework and uses four structured prompting methods (including chain-of-thought reasoning) to evaluate four frontier LLMs on seven general and medical benchmarks, comparing against the original HELM baseline results. Result: The study finds that, without structured prompting: (i) HELM underestimates model performance by 4% on average; (ii) performance estimates vary more across benchmarks (+2% standard deviation); (iii) leaderboard rankings flip on 3 of 7 benchmarks; and that (iv) introducing reasoning markedly reduces sensitivity to prompt design. Structured prompting comes closer to each model's true performance ceiling. Conclusion: Structured prompting, especially prompting that supports reasoning, is important for accurate language model evaluation and yields benchmark results that are more useful for decisions; the authors describe this as the first large-scale empirical study of model behavior across benchmarks and prompting methods, advancing scalable, optimizable evaluation. Abstract: As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. While frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, they often rely on fixed prompts that fail to generalize across LMs, yielding unrepresentative performance estimates. Unless we estimate each LM's ceiling (maximum achievable via changes to the prompt), we risk underestimating performance. Declarative prompting frameworks, such as DSPy, offer a scalable alternative to manual prompt engineering by crafting structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated across established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs across seven benchmarks (general/medical domain) against existing HELM baseline scores. We find that without structured prompting: (i) HELM underestimates LM performance (by 4% average), (ii) performance estimates vary more across benchmarks (+2% standard deviation), (iii) performance gaps are misrepresented (leaderboard rankings flip on 3/7 benchmarks), and (iv) introducing reasoning (chain-of-thought) reduces LM sensitivity to prompt design (smaller Δ across prompts). To our knowledge, this is the first large-scale benchmarking study to empirically characterize LM behavior across benchmarks and prompting methods, showing that scalable performance ceiling estimation enables more decision-useful benchmarks. We open-source (i) DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).

[15] Length-MAX Tokenizer for Language Models

Dong Dong,Weijie Su

Main category: cs.CL

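TL;DR: This paper proposes a new tokenizer, Length-MAX, which minimizes the average number of tokens per character to reduce token counts in language model training and inference, outperforming conventional BPE on several metrics.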

Details Motivation: Conventional tokenization methods such as BPE merge mainly by frequency and do not fully account for length efficiency; this paper aims to improve training and inference efficiency by optimizing the average tokens-per-character ratio. Method: Casts the length objective as a graph partitioning problem and designs a greedy approximation algorithm to build the vocabulary, yielding the Length-MAX tokenizer. Result: On FineWeb and other domains, it produces 14%-18% fewer tokens than BPE (13% fewer at a 64K vocabulary); GPT-2 models reach a fixed validation loss in 17.2%-18.5% fewer training steps, inference latency falls by 12.7%-13.7%, throughput rises by 16%, and downstream tasks improve, with LAMBADA perplexity down 11.7% and HellaSwag accuracy up 4.3%; vocabulary coverage is 99.62% and the OOV rate is 0.12%. Conclusion: Optimizing average token length rather than frequency alone improves language model efficiency without sacrificing, and often while improving, downstream performance, and the tokenizer is usable in production settings. Abstract: We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14-18% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5%, 17.2%, and 18.5% fewer steps, respectively, to reach a fixed validation loss, and 13.7%, 12.7%, and 13.7% lower inference latency, together with a 16% throughput gain at 124M, while consistently improving on downstream tasks including reducing LAMBADA perplexity by 11.7% and enhancing HellaSwag accuracy by 4.3%. Moreover, the Length-MAX tokenizer achieves 99.62% vocabulary coverage and the out-of-vocabulary rate remains low at 0.12% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing -- and often improving -- downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18% at inference.
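
The quantity the abstract above optimizes, average tokens per character, is straightforward to measure for any tokenizer. The following is a minimal sketch of that measurement and of the relative token reduction between two tokenizers. It does not implement the Length-MAX algorithm; the function names are hypothetical, and the character-level and whitespace tokenizers are toy stand-ins chosen only so the example runs without external libraries.

```python
def tokens_per_character(texts, tokenize):
    """Average number of tokens emitted per input character over a corpus."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_chars = sum(len(text) for text in texts)
    return total_tokens / total_chars


def relative_token_reduction(texts, baseline_tokenize, candidate_tokenize):
    """Fraction of tokens saved by the candidate relative to the baseline."""
    baseline = sum(len(baseline_tokenize(text)) for text in texts)
    candidate = sum(len(candidate_tokenize(text)) for text in texts)
    return 1.0 - candidate / baseline


# Toy stand-ins (assumptions): a character-level tokenizer as the baseline
# and a whitespace tokenizer as the candidate with longer tokens.
corpus = ["the cat sat", "on the mat"]
char_level = list
whitespace = str.split

print(tokens_per_character(corpus, char_level))
print(tokens_per_character(corpus, whitespace))
print(relative_token_reduction(corpus, char_level, whitespace))
```

A figure such as "14-18% fewer tokens than BPE" corresponds to `relative_token_reduction` evaluated with BPE as the baseline on the evaluation corpus.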

[16] Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Tianxin Wei,Noveen Sachdeva,Benjamin Coleman,Zhankui He,Yuanchen Bei,Xuying Ning,Mengting Ai,Yunzhe Li,Jingrui He,Ed H. Chi,Chi Wang,Shuo Chen,Fernando Pereira,Wang-Cheng Kang,Derek Zhiyuan Cheng

Main category: cs.CL

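TL;DR: This paper introduces Evo-Memory, a benchmark and framework for evaluating the self-evolving memory of LLM agents over continuous task streams. It emphasizes the dynamic accumulation and reuse of memory and introduces the ExpRAG and ReMem methods to improve experience reuse and continual learning.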

Details Motivation: Existing memory evaluations focus on static conversational settings and overlook the ability to accumulate and reuse experience across dynamic task streams; in practice LLMs need to keep learning from interactions, which calls for test-time evolution. Method: Builds a streaming benchmark, Evo-Memory, that organizes datasets into sequential task streams and requires LLMs to search, adapt, and update memory after each interaction; unifies and implements more than ten representative memory modules and evaluates them on 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets; and proposes the ExpRAG baseline and the ReMem (action-think-memory refine) pipeline. Result: The Evo-Memory framework is used to assess how different memory modules behave in dynamic settings; ReMem integrates reasoning, actions, and memory updates to achieve continual improvement over conventional static memory methods. Conclusion: Evo-Memory fills a gap in evaluating memory evolution for LLM agents on dynamic task streams, shows the importance of continual memory updates for long-term planning and problem-solving, and advances agents with self-evolving memory. Abstract: Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights. This limitation calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.

[17] Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation

Ali Jahan,Masood Ghayoomi,Annette Hautli-Janisz

Main category: cs.CL

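TL;DR: This paper proposes a cross-lingual approach to argument mining for low-resource languages, constructing three training scenarios and evaluating them on English and Persian data; the results show that a lightweight cross-lingual model outperforms LLM-based augmentation.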

Details Motivation: Argument mining for low-resource languages suffers from data scarcity, motivating the search for effective cross-lingual transfer methods. Method: Designs three training scenarios: zero-shot transfer, English-only training augmented with synthetic examples generated by large language models, and cross-lingual training that mixes English and Persian data. Result: The zero-shot model reaches an F1 of 50.7% on the Persian test set; the LLM-augmented model raises this to 69.3%; and the cross-lingual model raises F1 further to 74.8%. Conclusion: A lightweight cross-lingual data blend outperforms more complex, resource-intensive augmentation, offering an efficient and practical route to argument mining for low-resource languages. Abstract: Argument mining is a subfield of natural language processing that identifies and extracts the argument components, like premises and conclusions, within a text and recognizes the relations between them. It reveals the logical structure of texts to be used in tasks like knowledge extraction. This paper aims at utilizing a cross-lingual approach to argument mining for low-resource languages, by constructing three training scenarios. We examine the models on English, as a high-resource language, and Persian, as a low-resource language. To this end, we evaluate the models based on the English Microtext corpus (Peldszus and Stede, 2015), and its parallel Persian translation. The learning scenarios are as follows: (i) zero-shot transfer, where the model is trained solely with the English data, (ii) English-only training enhanced by synthetic examples generated by Large Language Models (LLMs), and (iii) a cross-lingual model that combines the original English data with manually translated Persian sentences. The zero-shot transfer model attains F1 scores of 50.2% on the English test set and 50.7% on the Persian test set. The LLM-based augmentation model improves the performance up to 59.2% on English and 69.3% on Persian. The cross-lingual model, trained on both languages but evaluated solely on the Persian test set, surpasses the LLM-based variant by achieving an F1 of 74.8%. Results indicate that a lightweight cross-lingual blend can considerably outperform the more resource-intensive augmentation pipelines, and it offers a practical pathway for the argument mining task to overcome data resource shortage on low-resource languages.

[18] Emergence and Localisation of Semantic Role Circuits in LLMs

Nura Aljaafari,Danilo S. Carvalho,André Freitas

Main category: cs.CL

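TL;DR: The paper proposes a method combining role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study how large language models implement semantic roles, finding highly concentrated circuits, gradual structural refinement, and partial conservation across scales.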

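Details Motivation: Large language models display semantic competence, but the internal mechanisms that support abstract semantic structure remain unclear, and a more systematic method is needed to expose them. Method: An integrated framework of role-cross minimal pairs, temporal emergence analysis, and cross-model comparison is introduced to identify and analyse the mechanisms that implement semantic roles in LLMs. Result: The analysis finds highly concentrated circuits (89-94% of attribution within 28 nodes), gradual structural refinement rather than abrupt change, larger models that sometimes bypass localised circuits, 24-59% component overlap across scales, and high spectral similarity. Conclusion: LLMs form compact, causally isolated mechanisms for abstract semantic structure, and these mechanisms transfer partially across scales and architectures. Abstract: Despite displaying semantic competence, large language models' internal mechanisms that ground abstract semantic structure remain insufficiently characterised. We propose a method integrating role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study how LLMs implement semantic roles. Our analysis uncovers: (i) highly concentrated circuits (89-94% attribution within 28 nodes); (ii) gradual structural refinement rather than phase transitions, with larger models sometimes bypassing localised circuits; and (iii) moderate cross-scale conservation (24-59% component overlap) alongside high spectral similarity. These findings suggest that LLMs form compact, causally isolated mechanisms for abstract semantic structure, and these mechanisms exhibit partial transfer across scales and architectures.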

[19] Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs

Reham Omar,Abdelghny Orogat,Ibrahim Abdelaziz,Omij Mangukiya,Panos Kalnis,Essam Mansour

Main category: cs.CL

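TL;DR: Chatty-KG is a modular multi-agent system that combines retrieval-augmented generation with structured execution to deliver efficient, accurate multi-turn conversational question answering over knowledge graphs.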

Details Motivation: Existing methods fall short on multi-turn conversational KGQA: LLMs lack direct access to private, dynamic knowledge graphs; RAG systems often lose graph structure and struggle to maintain multi-turn context; traditional KGQA systems are mostly single-turn, have high latency, and handle coreference and context tracking poorly. Method: Chatty-KG uses task-specialized LLM agents that cooperate to translate natural-language questions into SPARQL queries. The system combines RAG-style retrieval with structured execution, with agents responsible for contextual interpretation, dialogue tracking, entity and relation linking, and query planning. Result: Experiments on large and diverse knowledge graphs show that Chatty-KG significantly outperforms state-of-the-art baselines in both single-turn and multi-turn settings, with higher F1 and P@1 scores; it works with commercial and open-weight LLMs with stable performance; its modular design adapts to evolving knowledge graphs without fine-tuning or pre-processing. Conclusion: Chatty-KG unifies conversational flexibility with the structured grounding of knowledge graphs, offering a scalable and extensible approach to reliable multi-turn KGQA. Abstract: Conversational Question Answering over Knowledge Graphs (KGs) combines the factual grounding of KG-based QA with the interactive nature of dialogue systems. KGs are widely used in enterprise and domain applications to provide structured, evolving, and reliable knowledge. Large language models (LLMs) enable natural and context-aware conversations, but lack direct access to private and dynamic KGs. Retrieval-augmented generation (RAG) systems can retrieve graph content but often serialize structure, struggle with multi-turn context, and require heavy indexing. Traditional KGQA systems preserve structure but typically support only single-turn QA, incur high latency, and struggle with coreference and context tracking. To address these limitations, we propose Chatty-KG, a modular multi-agent system for conversational QA over KGs. Chatty-KG combines RAG-style retrieval with structured execution by generating SPARQL queries through task-specialized LLM agents. These agents collaborate for contextual interpretation, dialogue tracking, entity and relation linking, and efficient query planning, enabling accurate and low-latency translation of natural questions into executable queries. Experiments on large and diverse KGs show that Chatty-KG significantly outperforms state-of-the-art baselines in both single-turn and multi-turn settings, achieving higher F1 and P@1 scores. Its modular design preserves dialogue coherence and supports evolving KGs without fine-tuning or pre-processing. Evaluations with commercial (e.g., GPT-4o, Gemini-2.0) and open-weight (e.g., Phi-4, Gemma 3) LLMs confirm broad compatibility and stable performance. Overall, Chatty-KG unifies conversational flexibility with structured KG grounding, offering a scalable and extensible approach for reliable multi-turn KGQA.

[20] TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models

Ioana Buhnila,Aman Sinha,Mathieu Constant

Main category: cs.CL

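TL;DR: Using the TrackList analysis pipeline and the RefoMed-EN dataset, the study evaluates LLM performance across different types of linguistic queries, finding that models do best on definition-type questions and worst on exemplification-type questions, and that they paraphrase frequent, popular knowledge more readily than low-frequency specialised content.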

Details Motivation: The study investigates why LLM performance drops on non-definition queries (such as examples and paraphrases) and analyses how pre-training data affects model output. Method: TrackList, a fine-grained linguistic and statistical analysis pipeline, is combined with the newly built English medical dataset RefoMed-EN (6170 human-annotated medical terms with their definitions, denominations, exemplifications, explanations, or paraphrases) to analyse how concept frequency (head vs. tail) affects performance; output quality is assessed with syntactic and semantic similarity metrics, statistical correlations, and embeddings. Result: LLMs perform best on definition-type questions and worst on exemplification-type questions; they also paraphrase frequent, common knowledge more than low-frequency technical knowledge, especially in expert texts. Conclusion: LLMs show marked limits on diverse linguistic tasks, particularly when generating examples and handling low-frequency technical terms, which reflects a dependence on the frequency distribution of the training data; better support for long-tail knowledge and diverse output types is needed. Abstract: Large Language Models (LLMs) have proven efficient in giving definition-type answers to user input queries. While for humans giving various types of answers, such as examples and paraphrases, is an easy task, LLMs struggle to provide correct answers for queries other than definition-type ones. In this study, we evaluated this drop in performance using TrackList, a fine-grained linguistic and statistical analysis pipeline to investigate the impact of the pre-training data on LLMs' answers to diverse linguistic queries. We also introduce RefoMed-EN, an English dataset consisting of 6170 human-annotated medical terms alongside their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We studied whether the high frequency of a concept (head) or low frequency (tail) impacts the language model's performance. We evaluated the quality of the LLM's output using syntactic and semantic similarity metrics, statistical correlations and embeddings. Results showed that the LLM's task performance for definition-type questions is the highest, while for the exemplification type it is the lowest. Additionally, we showed that for definition-type questions, large language models are prone to paraphrase more on popular and frequent knowledge and less on tail and technical knowledge, especially in the expert texts.

[21] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels

Anantha Padmanaban Krishna Kumar

Main category: cs.CL

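TL;DR: The paper studies whether in-context learning (ICL) can override the label semantics of pre-trained models or merely refines them. By contrasting demonstrations with natural and inverted labels, it finds that ICL relies mainly on stable semantic directions formed during pre-training and struggles to truly invert semantics, supporting a semantic anchor view.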

Details Motivation: The work probes how flexibly ICL can change the semantics of a pre-trained model, in order to understand its mechanism and limits. Method: LLMs are treated as prompt-induced classifiers; three alignment metrics (truth, prior, and prompt alignment) and a semantic override rate are introduced, with experiments on eight classification tasks and eight open-source LLMs. Result: With natural demonstrations, ICL improves accuracy while keeping strong prior alignment; with inverted demonstrations, models cannot build a coherent anti-semantic classifier, and the semantic override rate is zero. Conclusion: ICL mainly adjusts how inputs map onto pre-trained semantic directions rather than flexibly remapping label meanings, so at the current scale ICL alone cannot override semantics and additional interventions are needed. Abstract: Can in-context learning (ICL) override pre-trained label semantics, or does it merely refine an existing semantic backbone? We address this question by treating LLMs as prompt-induced classifiers and contrasting their behavior under natural demonstrations (with correct labels) and inverted demonstrations (systematically flipping label meanings). We decompose ICL behavior into three alignment metrics (truth, prior, and prompt alignment) and introduce a semantic override rate, defined as correctness under flipped semantics. Across eight classification tasks and eight open-source LLMs (1-12B parameters), we find consistent evidence for a semantic anchor view. With natural demonstrations, ICL improves accuracy while maintaining strong prior alignment; most correct predictions coincide with zero-shot behavior, even when the prior is weak. With inverted demonstrations, models cannot learn coherent anti-semantic classifiers: prompt alignment increases only by sacrificing accuracy, and semantic override rates remain exactly zero in our few-shot 1-12B setting. Rather than flexibly remapping label meanings, ICL primarily adjusts how inputs project onto stable semantic directions learned during pre-training, clarifying fundamental limits of few-shot prompting and suggesting that overriding label semantics at these scales requires interventions beyond ICL. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/semantic-anchors-icl.
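The three alignment metrics can be illustrated with a small sketch. It assumes each metric is a simple agreement rate between the ICL prediction and, respectively, the gold label, the zero-shot prediction, and the label implied by the demonstrations; the function name `alignment_metrics` and the `flip` mapping are hypothetical, and the paper's exact definitions (including the semantic override rate, which is left out here) may differ.

```python
from typing import Dict, List, Optional


def alignment_metrics(
    gold: List[str],
    zero_shot: List[str],
    icl: List[str],
    flip: Optional[Dict[str, str]] = None,
) -> Dict[str, float]:
    """Agreement rates of ICL predictions with three reference labelings.

    gold      -- ground-truth labels
    zero_shot -- predictions without demonstrations (the model's prior)
    icl       -- predictions with demonstrations in the prompt
    flip      -- label mapping used by inverted demonstrations, or None
                 for natural demonstrations (hypothetical parameter)
    """
    n = len(gold)
    if n == 0 or not (n == len(zero_shot) == len(icl)):
        raise ValueError("inputs must be non-empty and of equal length")

    # The label the demonstrations imply for each input: the gold label
    # for natural demonstrations, the flipped label for inverted ones.
    prompt_target = [flip[g] if flip else g for g in gold]

    def agreement(reference: List[str]) -> float:
        return sum(p == r for p, r in zip(icl, reference)) / n

    return {
        "truth": agreement(gold),
        "prior": agreement(zero_shot),
        "prompt": agreement(prompt_target),
    }


metrics = alignment_metrics(
    gold=["pos", "neg", "pos", "neg"],
    zero_shot=["pos", "neg", "neg", "neg"],
    icl=["pos", "neg", "neg", "pos"],
    flip={"pos": "neg", "neg": "pos"},
)
```

With a binary label flip, truth and prompt alignment sum to one under this definition, which matches the abstract's observation that prompt alignment rises only at the cost of accuracy.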

[22] Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection

Michael Iskandardinata,William Christian,Derwin Suhartono

Main category: cs.CL

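TL;DR: The paper proposes a retrieval-aware sarcasm detection method that combines external retrieval with the model's own knowledge to strengthen contextual understanding, improving LLM-based sarcasm detection on several datasets.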

Details Motivation: Existing pre-trained language models and LLMs perform poorly on sarcastic text that is linguistically diverse and culturally varied, and they are unreliable on words that need extra background knowledge. Method: Building on Pragmatic Metacognitive Prompting (PMP), two context-enrichment strategies are introduced: adding non-parametric knowledge through web retrieval, and eliciting the model's own internal knowledge for self-knowledge awareness. Result: On Twitter Indonesia Sarcastic, non-parametric retrieval improves macro-F1 by 9.87%; on SemEval-2018 and MUStARD, self-knowledge retrieval improves macro-F1 by 3.29% and 4.08% respectively. Conclusion: Contextual information, especially culturally specific expressions and terms unknown to the model, is essential for LLM performance on sarcasm detection; future work will optimise retrieval quality and study its effect on performance. Abstract: Detecting sarcasm remains a challenging task in Natural Language Processing (NLP) despite recent advances in neural network approaches. Currently, Pre-trained Language Models (PLMs) and Large Language Models (LLMs) are the preferred approach for sarcasm detection. However, the complexity of sarcastic text, combined with linguistic diversity and cultural variation across communities, has made the task more difficult even for PLMs and LLMs. Beyond that, those models also exhibit unreliable detection of words or tokens that require extra grounding for analysis. Building on a state-of-the-art prompting method in LLMs for sarcasm detection called Pragmatic Metacognitive Prompting (PMP), we introduce a retrieval-aware approach that incorporates retrieved contextual information for each target text. Our pipeline explores two complementary ways to provide context: adding non-parametric knowledge using web-based retrieval when the model lacks necessary background, and eliciting the model's own internal knowledge for a self-knowledge awareness strategy. We evaluated our approach on three datasets: Twitter Indonesia Sarcastic, SemEval-2018 Task 3, and MUStARD. Non-parametric retrieval resulted in a significant 9.87% macro-F1 improvement on Twitter Indonesia Sarcastic compared to the original PMP method. Self-knowledge retrieval improves macro-F1 by 3.29% on SemEval and by 4.08% on MUStARD. These findings highlight the importance of context in enhancing LLM performance on the sarcasm detection task, particularly the involvement of culturally specific slang, references, or terms unknown to the LLMs. Future work will focus on optimizing the retrieval of relevant contextual information and examining how retrieval quality affects performance. The experiment code is available at: https://github.com/wllchrst/sarcasm-detection_pmp_knowledge-base.

[23] Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

Thura Aung,Eaint Kay Khaing Kyaw,Ye Kyaw Thu,Thazin Myint Oo,Thepchai Supnithi

Main category: cs.CL

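TL;DR: The study explores Kolmogorov-Arnold Networks (KANs) as classification heads in place of conventional Multi-Layer Perceptrons (MLPs) for classification in low-resource languages such as Burmese, validating them across several embedding types.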

Details Motivation: Conventional MLP heads have a fixed non-linearity that limits expressiveness and raises computational cost, and in low-resource settings usually only the classification layer is fine-tuned, so a more efficient and more expressive alternative is needed. Method: Three KAN variants (FourierKAN, EfficientKAN, and FasterKAN) are evaluated for classification on TF-IDF, fastText, and multilingual Transformer (mBERT, Distil-mBERT) embeddings and compared with MLPs. Result: KAN-based heads match or beat MLPs in most cases: EfficientKAN with fastText achieves the highest F1 (0.928); FasterKAN gives the best balance of speed and accuracy; on mBERT, EfficientKAN reaches 0.917 F1, on par with or slightly better than the MLP. Conclusion: KANs are a promising, more expressive and efficient alternative to MLPs for text classification in low-resource languages. Abstract: In low-resource languages like Burmese, classification tasks often fine-tune only the final classification layer, keeping pre-trained encoder weights frozen. While Multi-Layer Perceptrons (MLPs) are commonly used, their fixed non-linearity can limit expressiveness and increase computational cost. This work explores Kolmogorov-Arnold Networks (KANs) as alternative classification heads, evaluating Fourier-based FourierKAN, Spline-based EfficientKAN, and Grid-based FasterKAN across diverse embeddings including TF-IDF, fastText, and multilingual transformers (mBERT, Distil-mBERT). Experimental results show that KAN-based heads are competitive with or superior to MLPs. EfficientKAN with fastText achieved the highest F1-score (0.928), while FasterKAN offered the best trade-off between speed and accuracy. On transformer embeddings, EfficientKAN matched or slightly outperformed MLPs with mBERT (0.917 F1). These findings highlight KANs as expressive, efficient alternatives to MLPs for low-resource language classification.

[24] Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

Bryan E. Tuck,Rakesh M. Verma

Main category: cs.CL

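TL;DR: The study evaluates 28 LLM configurations on 58 word puzzles that require character-level constraint satisfaction. Architectural differences affect performance far more than parameter scale, and high-capacity models improve with a larger thinking budget while mid-sized models saturate or degrade. Models also fail systematically on common words with unusual spelling that humans solve easily, which indicates over-reliance on statistical regularities over valid orthographic patterns and suggests that dedicated architectural innovation is needed to improve constraint satisfaction.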

Details Motivation: LLMs must satisfy strict orthographic constraints in controlled text generation, yet systematic cross-architecture evaluation is lacking. The study examines how model families differ on character-level constraint tasks and what drives the differences. Method: 28 configurations across three model families (Qwen3, Claude Haiku-4.5, and GPT-5-mini) are evaluated on 58 word puzzles, and difficulty ratings from human solvers (10,000 per puzzle) are used to analyse F1 and calibration under different parameter scales and thinking budgets. Result: The performance gap from architectural differences (F1=0.761 vs. 0.343, a factor of 2.0-2.2) is far larger than the gain from eightfold parameter scaling within a family (83%); high-capacity models improve markedly with more thinking budget (+0.102 to +0.136 F1), while mid-sized models saturate or degrade; all models have very high miss rates (89-96%) on common words with unusual spelling (such as 'data', 'poop', and 'loll') that humans solve at high rates (86-95%). Conclusion: Character-level constraint satisfaction depends not only on scaling parameters or compute but also on architecture and training objectives; current models over-rely on distributional priors and struggle with orthographically atypical but valid words, suggesting that dedicated architectural mechanisms are needed for reliable generation under hard constraints. Abstract: Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-architecture evaluation remains limited. We evaluate 28 configurations spanning three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Architectural differences produce substantially larger performance gaps (2.0-2.2x, F1=0.761 vs. 0.343) than parameter scaling within families (83% gain from eightfold scaling), suggesting that constraint satisfaction may require specialized architectural features or training objectives beyond standard language model scaling. Thinking budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade. These patterns are inconsistent with uniform compute benefits. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but consistent calibration (r=0.24-0.38) across all families, yet identify systematic failures on common words with unusual orthography ("data", "poop", "loll": 86-95% human success, 89-96% model miss rate). These failures reveal over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns, suggesting architectural innovations may be required beyond simply scaling parameters or computational budgets.
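To make "character-level constraint satisfaction" concrete, here is a minimal sketch under a loud assumption: the abstract does not describe the puzzle rules, so the rule set below (an allowed letter set, one required letter, a minimum length) is purely hypothetical, as are the names `satisfies` and `puzzle_f1`. It shows how a candidate word could be validated and how a set-based F1 over a model's answers could be computed.

```python
from typing import Iterable, Set


def satisfies(word: str, allowed: Set[str], required: str, min_len: int = 4) -> bool:
    """Check a word against hypothetical puzzle rules: long enough,
    contains the required letter, and uses only allowed letters."""
    w = word.lower()
    return len(w) >= min_len and required in w and set(w) <= allowed


def puzzle_f1(predicted: Iterable[str], solutions: Iterable[str]) -> float:
    """Set-based F1 between a model's answers and the valid solutions."""
    pred = {w.lower() for w in predicted}
    gold = {w.lower() for w in solutions}
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


allowed_letters = set("datlo")
data_ok = satisfies("data", allowed_letters, required="a")  # repeated letters are fine
loll_ok = satisfies("loll", allowed_letters, required="a")  # lacks the required letter
score = puzzle_f1(["data", "told"], ["data", "dolt", "load"])
```

Nothing in a check like this depends on how frequent or typical a spelling is, which is the contrast the abstract draws with models that penalize atypical but valid words.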

[25] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

Ye Bhone Lin,Thura Aung,Ye Kyaw Thu,Thazin Myint Oo

Main category: cs.CL

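TL;DR: The paper studies automatic speech recognition (ASR) error correction for low-resource Burmese with sequence-to-sequence Transformer models, focusing on feature integration strategies that include the International Phonetic Alphabet (IPA) and alignment information. To the authors' knowledge, it is the first study of ASR error correction for Burmese. Evaluated on five ASR backbones, the proposed ASR Error Correction (AEC) approaches improve word- and character-level accuracy over baseline outputs. The AEC model combining IPA and alignment features reduces the average WER from 51.56 to 39.82 before augmentation (51.56 to 43.59 after augmentation) and raises chrF++ from 0.5864 to 0.627, a consistent gain over the baseline without AEC. The results underline the robustness of AEC and the importance of feature design for ASR in low-resource settings.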

Details Motivation: Burmese is a low-resource language, existing ASR systems have high error rates, and no prior work targets ASR error correction for it, so effective correction methods are needed to raise recognition accuracy. Method: A sequence-to-sequence Transformer performs ASR error correction with several feature integration strategies, including IPA representations and boundary and error-type labels produced by forced alignment, evaluated on five different ASR backbones. Result: The proposed AEC model substantially reduces word error rate (WER): from 51.56 to 39.82 before augmentation and from 51.56 to 43.59 after augmentation; chrF++ rises from 0.5864 to 0.627, beating all baseline ASR systems on word- and character-level metrics. Conclusion: In low-resource conditions, ASR error correction that combines IPA and alignment information improves Burmese speech recognition accuracy; feature design is central to the gain, which supports the robustness and practicality of the AEC framework. Abstract: This paper investigates sequence-to-sequence Transformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IPA and alignment information. To our knowledge, this is the first study addressing ASR error correction specifically for Burmese. We evaluate five ASR backbones and show that our ASR Error Correction (AEC) approaches consistently improve word- and character-level accuracy over baseline outputs. The proposed AEC model, combining IPA and alignment features, reduced the average WER of ASR models from 51.56 to 39.82 before augmentation (and 51.56 to 43.59 after augmentation) and improved chrF++ scores from 0.5864 to 0.627, demonstrating consistent gains over the baseline ASR outputs without AEC. Our results highlight the robustness of AEC and the importance of feature design for improving ASR outputs in low-resource settings.
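For readers less familiar with the metric, the sketch below shows the standard word error rate computation: word-level edit distance divided by reference length, reported as a percentage. It assumes both strings are already segmented into whitespace-separated words (Burmese text normally needs a word segmenter first, and the paper's own evaluation tooling is not described in the abstract), and the function name `wer` is illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


example = wer("a b c d", "a x c")  # one substitution plus one deletion

# Relative WER reduction implied by the reported averages (51.56 -> 39.82).
relative_reduction = 100.0 * (51.56 - 39.82) / 51.56
```

On the reported averages, the drop from 51.56 to 39.82 corresponds to a relative WER reduction of roughly 23%.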

[26] MortgageLLM: Domain-Adaptive Pretraining with Residual Instruction Transfer, Alignment Tuning, and Task-Specific Routing

Manish Jain,Satheesh Kumar Ponnambalam,Salman Faroz,Chandrakanth Lns,Vinay Sharma

Main category: cs.CL

TL;DR: MortgageLLM 是一个针对抵押贷款金融领域的双专家大语言模型,通过指令残差技术在保持指令遵循能力的同时实现领域专业化,显著优于基线模型。

Details Motivation: In specialised domains such as mortgage finance, general-purpose LLMs lack sufficient domain knowledge, and single multi-task fine-tuning forces a trade-off between structured-task performance and conversational ability, so a method that serves both is needed. Method: A dual-track specialisation framework derives two expert models from one base model (LLaMA-3.1-8B): one for conversational Q&A and one for structured tasks such as classification and summarization; the instruction residual technique restores instruction following after domain adaptation, and an intelligent task router based on few-shot classification is added. Result: On domain benchmarks, MortgageLLM (MLM v2) clearly outperforms the baseline: summarization 4.58 vs. 3.99, Q&A 4.09 vs. 4.0, and classification 2.6 vs. 1.2; it also leads on BERTScore throughout. Conclusion: The dual-expert architecture combined with the instruction residual technique and intelligent routing resolves the conflict between domain adaptation and instruction following, offering an efficient and practical recipe for domain-specific LLMs. Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities across general domains, yet their application to specialized sectors such as mortgage finance requires domain-specific knowledge augmentation while preserving instruction-following fidelity. We present MortgageLLM, a novel domain-specific large language model that addresses this dual challenge. It is developed using a dual-track specialization framework from a single base model (LLaMA-3.1-8B). We opted for this dual-expert approach as a single multi-task model suffers from performance trade-offs, where optimizing for structured tasks (via SFT) degrades conversational fidelity (via DPO). Our dual-track method solves this by creating two specialists, allowing each to be optimally trained for its distinct capability. Our approach applies the instruction residual technique to restore instruction-following capabilities post-domain adaptation without supervised fine-tuning. We contribute: (1) application of this residual technique to the highly specialized mortgage finance domain; (2) a dual-expert architecture combining a conversational Q&A model and a structured task model for classification and summarization; and (3) an intelligent task routing mechanism using few-shot classification performed by one of the expert models itself. We validate our approach on domain-specific benchmarks, where our final model (MLM v2) significantly outperforms the base LLaMA-3.1-8B-Instruct, achieving an LLM-as-a-Judge summarization score of 4.58 (vs. 3.99), a Q&A score of 4.09 (vs. 4.0), and a classification score of 2.6 (vs. 1.2). On semantic similarity, our model achieved a BERTScore of 0.77 for summarization (vs. 0.74), 0.68 for Q&A (vs. 0.58), and 0.75 for classification (vs. 0.73), substantially outperforming baseline approaches.
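The instruction residual idea can be sketched in a few lines. This assumes the common formulation in which the residual is the element-wise weight difference between an instruction-tuned checkpoint and its base model, and that residual is added to the domain-adapted base weights; the abstract does not spell out the arithmetic, and the tiny dictionaries below stand in for real checkpoints. All names here are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def instruction_residual(instruct: Weights, base: Weights) -> Weights:
    """Residual = instruction-tuned weights minus base weights."""
    return {
        name: [i - b for i, b in zip(instruct[name], base[name])]
        for name in base
    }


def apply_residual(domain_adapted: Weights, residual: Weights) -> Weights:
    """Add the residual to a domain-adapted copy of the same base model."""
    return {
        name: [d + r for d, r in zip(domain_adapted[name], residual[name])]
        for name in domain_adapted
    }


# Toy stand-ins for three checkpoints that share one architecture.
base = {"layer.weight": [1.0, 2.0]}
instruct = {"layer.weight": [1.5, 1.0]}
domain_adapted = {"layer.weight": [2.0, 2.0]}

residual = instruction_residual(instruct, base)
merged = apply_residual(domain_adapted, residual)
```

The appeal of this design is that no supervised fine-tuning run is needed after domain adaptation: the merge is plain parameter arithmetic, provided all three checkpoints share parameter names and shapes.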

[27] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

Yuhang Wang,Yanxu Zhu,Dongyuan Lu,Jitao Sang

Main category: cs.CL

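TL;DR: The paper proposes the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which uses model-generated safety guidelines to strengthen defence against adversarial jailbreak prompts while reducing false refusals of benign requests.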

Details Motivation: Adversarial jailbreak prompts are covert and deceptive, often bypass built-in safety mechanisms, and lead to harmful output, so a safety alignment method that adaptively reinforces defences is needed. Method: SGASA has two stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts, and Alignment Fine-tuning, which uses Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed the guidelines into the model. Result: Experiments on multiple datasets show that SGASA markedly improves model safety, strengthens robustness against harmful adversarial prompts, and reduces unnecessary refusals of benign requests. Conclusion: SGASA is a scalable, adaptive safety alignment method that lets models reinforce their own defences, improving the safety of reasoning models in complex settings. Abstract: Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Due to the covert and deceptive nature of such prompts, they can often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen models' robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts; and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.

[28] Can Finetuing LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

Steven Wang,Kyle Hunt,Shaojie Tang,Kenneth Joseph

Main category: cs.CL

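TL;DR: The study asks whether fine-tuning large language models (LLMs) on small human survey samples makes them simulate human behaviour more realistically. Fine-tuning improves heterogeneity, subgroup alignment, and belief-action coherence, but the models still fail to reproduce the regression coefficients of the original study, so LLM-generated data remain unsuitable as a substitute for human participants in formal inferential analyses.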

Details Motivation: Recent work has tried using LLMs to simulate human participants in surveys and experiments, but the simulations diverge from real human behaviour: limited diversity, clear bias for minority subgroups, insufficient within-group variance, and a gap between beliefs and actions. The paper asks whether fine-tuning on a small human sample can ease these problems. Method: Using a behavioural experiment on information disclosure, human and LLM-generated responses are compared on distributional divergence, subgroup alignment, belief-action coherence, and recovery of regression coefficients, before and after fine-tuning on small-scale human data. Result: Fine-tuning substantially improves response heterogeneity, subgroup alignment, and belief-action coherence, bringing the models closer to the human data; however, none of the fine-tuned models accurately reproduces the regression coefficients of the original study. Conclusion: Although fine-tuning on small samples markedly improves how well LLMs simulate human behaviour, they cannot reliably recover statistical inference results, so LLM-generated data are not yet a full substitute for human participants, especially in studies that require rigorous causal inference. Abstract: There is ongoing debate about whether large language models (LLMs) can serve as substitutes for human participants in survey and experimental research. While recent work in fields such as marketing and psychology has explored the potential of LLM-based simulation, a growing body of evidence cautions against this practice: LLMs often fail to align with real human behavior, exhibiting limited diversity, systematic misalignment for minority subgroups, insufficient within-group variance, and discrepancies between stated beliefs and actions. This study examines an important and distinct question in this domain: whether fine-tuning on a small subset of human survey data, such as that obtainable from a pilot study, can mitigate these issues and yield realistic simulated outcomes. Using a behavioral experiment on information disclosure, we compare human and LLM-generated responses across multiple dimensions, including distributional divergence, subgroup alignment, belief-action coherence, and the recovery of regression coefficients. We find that fine-tuning on small human samples substantially improves heterogeneity, alignment, and belief-action coherence relative to the base model. However, even the best-performing fine-tuned models fail to reproduce the regression coefficients of the original study, suggesting that LLM-generated data remain unsuitable for replacing human participants in formal inferential analyses.

[29] Developing an Open Conversational Speech Corpus for the Isan Language

Adisai Na-Thalang,Chanakan Wittayasakpan,Kritsadha Phatcharoen,Supakit Buakaw

Main category: cs.CL

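TL;DR: The paper presents the first open conversational speech dataset for the Isan language, designed to capture natural linguistic phenomena and advance technology research on minority languages.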

Details Motivation: Existing speech corpora are mostly based on read or scripted speech and lack coverage of real conversation, and Isan has no standardized orthography, which complicates transcription; an open dataset is needed that reflects real language use while meeting computational needs. Method: An Isan speech dataset built from natural conversation is developed, with transcription guidelines that balance linguistic authenticity and computability and address writing inconsistencies caused by tonal differences between Thai and Isan. Result: The first openly accessible Isan conversational speech dataset is built and released, covering spoken features such as colloquial expressions, spontaneous prosody, disfluencies, and code-switching with Thai. Conclusion: The dataset supports inclusive AI development and research on endangered or neglected languages, and provides a basis for modelling the linguistic and technical challenges of natural conversational speech. Abstract: This paper introduces the development of the first open conversational speech dataset for the Isan language, the most widely spoken regional dialect in Thailand. Unlike existing speech corpora that are primarily based on read or scripted speech, this dataset consists of natural speech, thereby capturing authentic linguistic phenomena such as colloquialisms, spontaneous prosody, disfluencies, and frequent code-switching with central Thai. A key challenge in building this resource lies in the lack of a standardized orthography for Isan. Current writing practices vary considerably, due to the different lexical tones between Thai and Isan. This variability complicates the design of transcription guidelines and poses questions regarding consistency, usability, and linguistic authenticity. To address these issues, we establish practical transcription protocols that balance the need for representational accuracy with the requirements of computational processing. By releasing this dataset as an open resource, we aim to contribute to inclusive AI development, support research on underrepresented languages, and provide a basis for addressing the linguistic and technical challenges inherent in modeling conversational speech.

[30] PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

Robert Belanec,Branislav Pecher,Ivan Srba,Maria Bielikova

Main category: cs.CL

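TL;DR: The paper presents PEFT-Bench, a unified end-to-end benchmark for evaluating parameter-efficient fine-tuning (PEFT) methods on autoregressive LLMs, and introduces the PSCP metric, which jointly accounts for trainable parameters, inference speed, and training memory usage.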

Details Motivation: LLMs perform well on many tasks, but their size brings high computational and environmental costs that limit accessibility; existing PEFT evaluations are narrow in scope and hard to reproduce, so a unified, reproducible benchmark is needed. Method: PEFT-Bench, a comprehensive benchmark covering 27 NLP datasets and 6 PEFT methods, is built, together with a new metric, PSCP, that jointly measures parameter efficiency, inference speed, and training memory. Result: The paper demonstrates PEFT-Bench across multiple datasets and methods and uses PSCP to expose the trade-offs between PEFT methods in practical deployment. Conclusion: PEFT-Bench offers a standardized tool for systematically evaluating and comparing PEFT methods, supporting the development and practical use of efficient fine-tuning. Abstract: Despite the state-of-the-art performance that Large Language Models (LLMs) achieve on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility. Parameter-efficient fine-tuning (PEFT) methods address this challenge by reducing the number of trainable parameters while maintaining strong downstream performance. Despite the increased development in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce. To bridge this gap, we introduce PEFT-Bench, a unified end-to-end benchmark for evaluating diverse PEFT methods on autoregressive LLMs. We demonstrate its usage across 27 NLP datasets and 6 PEFT methods. To account for different PEFT training and inference factors, we also introduce the PEFT Soft Score Penalties (PSCP) metric, which takes trainable parameters, inference speed, and training memory usage into account.

[31] Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text

Kai Kugler

Main category: cs.CL

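TL;DR: The first systematic study of Martin's Law (the relationship between word frequency and polysemy) in text generated by neural language models during training, finding a non-monotonic developmental trajectory with an optimal semantic window.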

Details Motivation: The work examines whether and how neural language models follow Martin's Law, a statistical regularity of human language, during training, in order to understand how lexical polysemy evolves in language models. Method: DBSCAN clustering of contextualized embeddings is used to identify word senses, and the frequency-polysemy relationship is analysed for four Pythia models of different sizes (70M-1B) across 30 training checkpoints. Result: Martin's Law emerges around checkpoint 100, peaks at checkpoint 104 (r > 0.6), and then declines quickly; small models show catastrophic semantic collapse at late checkpoints while large models degrade gracefully; the frequency-specificity trade-off stays stable throughout training (r ≈ -0.3). Conclusion: A language model's conformity with linguistic regularities does not increase monotonically with training; instead there is an optimal stage of semantic development, which reveals a balanced trajectory during training. Abstract: We present the first systematic investigation of Martin's Law - the empirical relationship between word frequency and polysemy - in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin's Law emerges around checkpoint 100, reaches peak correlation (r > 0.6) at checkpoint 104, then degrades by checkpoint 105. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable (r ≈ -0.3) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text is not monotonically increasing with training, but instead follows a balanced trajectory with an optimal semantic window. This work establishes a novel methodology for evaluating emergent linguistic structure in neural language models.

[32] Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model

Joshua Fonseca Rivera

Main category: cs.CL

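TL;DR: Through fine-tuning, a 7B-parameter language model is trained to reliably detect and report injected single-token "thoughts", performing well on the three criteria of accuracy, grounding, and internality; this demonstrates trainable introspective behaviour and improves AI transparency.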

Details Motivation: Lindsey (2025) found that language models identify injected activation patterns introspectively only unreliably (about 20% success); this paper asks whether the ability can be raised by direct training rather than waiting for it to emerge. Method: A 7B language model is fine-tuned on transient single-token injection data to detect and report the injected semantic content, and is evaluated on accuracy, false positive rate, and generalization. Result: The model improves from near-complete failure (0.4% accuracy) to 85% accuracy (α=40) with a 0% false positive rate; it meets the accuracy, grounding, and internality criteria; it generalizes to unseen concepts to some degree (a 7.5 percentage point gap). Conclusion: At least one introspective behaviour can be induced directly through training, which shows that introspective ability in AI can be built, offers a viable path to built-in AI transparency, and answers Lindsey's open question of whether training can eliminate cross-model differences. Abstract: Lindsey (2025) investigates introspective awareness in language models through four experiments, finding that models can sometimes detect and identify injected activation patterns -- but unreliably (~20% success in the best model). We focus on the first of these experiments -- self-report of injected "thoughts" -- and ask whether this capability can be directly trained rather than waiting for emergence. Through fine-tuning on transient single-token injections, we transform a 7B parameter model from near-complete failure (0.4% accuracy, 6.7% false positive rate) to reliable detection (85% accuracy on held-out concepts at α=40, 0% false positives). Our model detects fleeting "thoughts" injected at a single token position, retains that information, and reports the semantic content across subsequent generation steps. On this task, our trained model satisfies three of Lindsey's criteria: accuracy (correct identification), grounding (0/60 false positives), and internality (detection precedes verbalization). Generalization to unseen concept vectors (7.5pp gap) demonstrates the model learns a transferable skill rather than memorizing specific vectors, though this does not establish metacognitive representation in Lindsey's sense. These results address an open question raised by Lindsey: whether "training for introspection would help eliminate cross-model differences." We show that at least one component of introspective behavior can be directly induced, offering a pathway to built-in AI transparency.

[33] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

Antonín Jarolím,Martin Fajčík,Lucia Makaiová

Main category: cs.CL

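TL;DR: The paper studies fine-grained evidence extraction for Czech and Slovak comments, builds a new annotated dataset, and evaluates several large language models on the task, finding that some smaller models align better with human annotations.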

Details Motivation: Misinformation spreads frequently in online news comments, so effective methods are needed to detect factual errors, in particular to find the exact text spans that support or refute a claim. Method: A fine-grained evidence dataset with two-way annotation by paid annotators is created, and several LLMs are evaluated on it for agreement with the human annotations. Result: LLMs often fail to copy evidence verbatim from the source, which makes their output invalid; llama3.1:8b achieves a high rate of correct outputs despite its small size, while gpt-oss-120b underperforms; qwen3:14b, deepseek-r1:32b, and gpt-oss:20b balance model size and alignment with human annotations well. Conclusion: For fine-grained evidence extraction, parameter count is not the deciding factor; some small and mid-sized models align better with human annotations and have practical potential. Abstract: Misinformation frequently spreads in user comments under online news articles, highlighting the need for effective methods to detect factually incorrect information. To strongly support or refute claims extracted from such comments, it is necessary to identify relevant documents and pinpoint the exact text spans that justify or contradict each claim. This paper focuses on the latter task -- fine-grained evidence extraction for Czech and Slovak claims. We create a new dataset, containing two-way annotated fine-grained evidence created by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations. The results reveal that LLMs often fail to copy evidence verbatim from the source text, leading to invalid outputs. Error-rate analysis shows that the llama3.1:8b model achieves a high proportion of correct outputs despite its relatively small size, while the gpt-oss-120b model underperforms despite having many more parameters. Furthermore, the models qwen3:14b, deepseek-r1:32b, and gpt-oss:20b demonstrate an effective balance between model size and alignment with human annotations.
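The failure mode described here, evidence that is not copied verbatim, is straightforward to check mechanically. The sketch below assumes an output counts as invalid when any extracted span is not an exact substring of the source after whitespace normalisation; the paper's actual validity criterion may be stricter or looser, and the function names are hypothetical.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not cause mismatches."""
    return re.sub(r"\s+", " ", text).strip()


def is_verbatim(span: str, source: str) -> bool:
    """True if the span is non-empty and appears word for word in the source."""
    cleaned = normalize(span)
    return bool(cleaned) and cleaned in normalize(source)


def invalid_rate(spans: List[str], source: str) -> float:
    """Share of extracted spans that are not verbatim copies of the source."""
    if not spans:
        return 0.0
    return sum(not is_verbatim(s, source) for s in spans) / len(spans)


source_text = "The bridge opened in 1998.\nIt was closed for repairs in 2015."
spans = [
    "opened in 1998",             # exact copy
    "closed for repairs\nin 2015",  # copy with different line wrapping
    "was shut down in 2015",      # paraphrase, so not verbatim
]
rate = invalid_rate(spans, source_text)
```

A check of this kind only measures format validity; agreement with the human-annotated spans would need a separate overlap measure.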

[34] Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation

Zhifeng Hao,Qibin Song,Ruichu Cai,Boyan Xu

Main category: cs.CL

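TL;DR: DSR-SQL is a dual-state reasoning framework that improves LLM Text-to-SQL performance on complex databases by modelling the interaction between a context state and a generation state, achieving good results without post-training or in-context examples.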

Details Motivation: Existing Chain-of-Thought-based methods struggle to maintain coherent reasoning on complex enterprise databases because of limited context capacity, unreliable schema linking, and weak semantic grounding. Method: The DSR-SQL framework comprises an adaptive context state, which compresses the schema and selects relevant structures, and a progressive generation state, which performs SQL generation and self-correction through feedback-guided state transitions. Result: DSR-SQL reaches 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set, which is competitive. Conclusion: DSR-SQL effectively improves LLM Text-to-SQL capability on complex databases without additional training or in-context examples. Abstract: Recent divide-and-conquer reasoning approaches, particularly those based on Chain-of-Thought (CoT), have substantially improved the Text-to-SQL capabilities of Large Language Models (LLMs). However, when applied to complex enterprise databases, such methods struggle to maintain coherent reasoning due to limited context capacity, unreliable schema linking, and weak grounding in database semantics. To overcome these issues, we introduce DSR-SQL, a Dual-State Reasoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. The first constructs a compact, semantically faithful environment by refining large schemas and selecting relevant structures, while the second formalizes SQL synthesis as feedback-guided state transitions, enabling the model to self-correct and align with user intent. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set. Our implementation will be open-sourced at: https://github.com/DMIRLAB-Group/DSR-SQL.

[35] Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning

Kaifeng Hong,Yinglong Zhang,Xiaoying Hong,Xuewen Xia,Xing Xu

Main category: cs.CL

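TL;DR: This paper proposes Odin, a new architecture that injects graph structure at selected Transformer layers through an oriented dual-module mechanism, effectively fusing text and graph structure. Unlike GNNs that rely on multi-hop diffusion, Odin aggregates over the global [CLS] representation, which avoids over-smoothing, and its expressive power strictly contains that of pure Transformers and GNNs. For efficiency, a lightweight variant, Light Odin, is also proposed; Odin reaches state-of-the-art accuracy on several text-rich graph benchmarks, while Light Odin stays competitive at much lower computational cost.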

Details Motivation: Existing methods for text-attributed graphs have limitations: GNNs are constrained by over-smoothing and hop-dependent diffusion, while Transformers ignore graph topology. A new model is needed that combines strong textual understanding with structural reasoning. Method: The Odin architecture injects graph structure at selected Transformer depths through an oriented dual-module mechanism. Multi-hop structural information is integrated at different layers, producing low-, mid-, and high-level structural abstractions aligned with the semantic hierarchy. Aggregation operates on the global [CLS] representation, which avoids over-smoothing. A lightweight variant, Light Odin, is designed for efficiency. Result: On several text-rich graph benchmarks, Odin achieves state-of-the-art accuracy, while Light Odin remains competitive with significantly lower computational overhead. Conclusion: Odin and Light Odin form a unified, hop-free framework for structure-text fusion that addresses the inherent problems of GNNs and Transformers on text graphs, combining high performance with high efficiency. Abstract: Text-attributed graphs require models to effectively combine strong textual understanding with structurally informed reasoning. Existing approaches either rely on GNNs--limited by over-smoothing and hop-dependent diffusion--or employ Transformers that overlook graph topology and treat nodes as isolated sequences. We propose Odin (Oriented Dual-module INtegration), a new architecture that injects graph structure into Transformers at selected depths through an oriented dual-module mechanism. Unlike message-passing GNNs, Odin does not rely on multi-hop diffusion; instead, multi-hop structures are integrated at specific Transformer layers, yielding low-, mid-, and high-level structural abstraction aligned with the model's semantic hierarchy. Because aggregation operates on the global [CLS] representation, Odin fundamentally avoids over-smoothing and decouples structural abstraction from neighborhood size or graph topology. We further establish that Odin's expressive power strictly contains that of both pure Transformers and GNNs. To make the design efficient in large-scale or low-resource settings, we introduce Light Odin, a lightweight variant that preserves the same layer-aligned structural abstraction for faster training and inference. Experiments on multiple text-rich graph benchmarks show that Odin achieves state-of-the-art accuracy, while Light Odin delivers competitive performance with significantly reduced computational cost. Together, Odin and Light Odin form a unified, hop-free framework for principled structure-text integration.
The source code of this model has been released at https://github.com/hongkaifeng/Odin.

[36] A Systematic Study of Model Merging Techniques in Large Language Models

Oğuz Kağan Hitit,Leander Girrbach,Zeynep Akata

Main category: cs.CL

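TL;DR: A large-scale systematic evaluation of six mainstream model merging methods on large language models shows that the simplest method, Task Arithmetic, performs best, while other more advanced methods often degrade performance. Existing merging techniques therefore do not transfer directly to modern LLMs, and LLM-specific merging algorithms are needed.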

Details Motivation: To examine whether model merging methods that work on small models and classifiers generalize to large language models (LLMs), and to systematically assess how effective existing methods are on LLMs. Method: Six state-of-the-art merging methods (including subspace methods) are evaluated at scale across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard benchmarks, using standardized benchmarks to measure the merged model's gains relative to the base model and to the best individual checkpoint. Result: Task Arithmetic is the only merging method that reliably improves LLM performance; other interference-aware and subspace merging methods usually cause significant performance drops. Conclusion: Current model merging techniques do not apply directly to modern LLMs; LLM-specific merging algorithms and merging-aware fine-tuning methods are needed. Abstract: Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.
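
For readers unfamiliar with Task Arithmetic, the following sketch shows the usual formulation: each fine-tuned checkpoint contributes a task vector (checkpoint minus base), and the scaled sum of task vectors is added back to the base weights. The abstract does not spell out the formula or the scaling used in the study, so the `scale` parameter, the plain-list weight format, and the toy values here are assumptions for illustration.

```python
def task_arithmetic(base, checkpoints, scale=1.0):
    """Merge checkpoints by adding scaled task vectors to the base weights.

    base and each checkpoint map parameter names to flat lists of floats.
    """
    merged = {}
    for name, base_w in base.items():
        task_sum = [0.0] * len(base_w)
        for ckpt in checkpoints:
            for i, w in enumerate(ckpt[name]):
                task_sum[i] += w - base_w[i]  # task vector component
        merged[name] = [b + scale * t for b, t in zip(base_w, task_sum)]
    return merged


# Toy weights; real checkpoints would hold tensors for every layer.
base = {"layer.weight": [1.0, 2.0]}
ft_a = {"layer.weight": [1.5, 2.0]}
ft_b = {"layer.weight": [1.0, 1.0]}
merged = task_arithmetic(base, [ft_a, ft_b], scale=0.5)
print(merged)
```

No training is involved: the merge is a single pass of element-wise arithmetic, which is part of why the method is described as the oldest and simplest.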

[37] Hierarchical Ranking Neural Network for Long Document Readability Assessment

Yurui Zheng,Yijun Chen,Shaohong Zhang

Main category: cs.CL

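TL;DR: This paper proposes a bidirectional readability assessment mechanism and a pairwise sorting algorithm to address two shortcomings of existing deep learning methods: handling text length and modeling the ordinal relationship of readability labels. By capturing contextual information and modeling the order among labels, the model outperforms baselines on Chinese and English datasets.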

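Details Motivation: Most existing deep learning methods for readability assessment do not adequately account for text length or the ordinal relationship among readability labels, which limits prediction accuracy. A new model is needed that captures both sentence-level semantic information and the order among labels. Method: A bidirectional readability assessment mechanism uses contextual information to identify semantically rich regions of the text and predict sentence-level readability; these predictions then assist the document-level readability judgment. A pairwise sorting algorithm based on label subtraction models the ordinal relationship among readability levels. Result: Experiments on Chinese and English datasets show that the proposed model is competitive and outperforms other baseline models. Conclusion: The method improves the accuracy of text readability assessment, with particular strength in combining sentence-level prediction with ordinal-relationship modeling, and is suitable for readability analysis in multilingual settings. Abstract: Readability assessment aims to evaluate the reading difficulty of a text. In recent years, while deep learning technology has been gradually applied to readability assessment, most approaches fail to consider either the length of the text or the ordinal relationship of readability labels. This paper proposes a bidirectional readability assessment mechanism that captures contextual information to identify regions with rich semantic information in the text, thereby predicting the readability level of individual sentences. These sentence-level labels are then used to assist in predicting the overall readability level of the document. Additionally, a pairwise sorting algorithm is introduced to model the ordinal relationship between readability levels through label subtraction. Experimental results on Chinese and English datasets demonstrate that the proposed model achieves competitive performance and outperforms other baseline models.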

[38] Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

Lina Conti,Dennis Fucci,Marco Gaido,Matteo Negri,Guillaume Wisniewski,Luisa Bentivogli

Main category: cs.CL

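TL;DR: This study examines how speech translation (ST) models use acoustic cues and training-data patterns to assign gender to gendered references, and reveals a new mechanism that links gendered terms to the speaker through first-person pronouns.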

Details Motivation: Because speech carries information such as the speaker's gender, translating from a language without grammatical gender marking into one with grammatical gender risks misgendering based on vocal characteristics, and the mechanisms behind this bias are poorly understood. Method: For three language pairs (en-es/fr/it), the study analyzes the interaction among training-data patterns, internal language model (ILM) bias, and acoustic information, using contrastive feature attribution to identify the key features in spectrograms. Result: Models do not simply replicate the gender associations in the training data but learn a broader pattern of masculine prevalence. Although the ILM is strongly biased toward the masculine, models can adjust based on acoustic input. The higher-accuracy model uses first-person pronouns to link gender information to the speaker and draws gender information from across the frequency spectrum rather than from pitch alone. Conclusion: ST models assign gender through complex mechanisms that rely on contextual pronouns and distributed acoustic features, which offers a new perspective for mitigating modality-specific bias. Abstract: Unlike text, speech conveys information about the speaker, such as gender, through acoustic cues like pitch. This gives rise to modality-specific bias concerns. For example, in speech translation (ST), when translating from languages with notional gender, such as English, into languages where gender-ambiguous terms referring to the speaker are assigned grammatical gender, the speaker's vocal characteristics may play a role in gender assignment. This risks misgendering speakers, whether through masculine defaults or vocal-based assumptions. Yet, how ST models make these decisions remains poorly understood. We investigate the mechanisms ST models use to assign gender to speaker-referring terms across three language pairs (en-es/fr/it), examining how training data patterns, internal language model (ILM) biases, and acoustic information interact. We find that models do not simply replicate term-specific gender associations from training data, but learn broader patterns of masculine prevalence. While the ILM exhibits strong masculine bias, models can override these preferences based on acoustic input. Using contrastive feature attribution on spectrograms, we reveal that the model with higher gender accuracy relies on a previously unknown mechanism: using first-person pronouns to link gendered terms back to the speaker, accessing gender information distributed across the frequency spectrum rather than concentrated in pitch.

[39] Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects

Husne Ara Rubaiyeat,Hasan Mahmud,Md Kamrul Hasan

Main category: cs.CL

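TL;DR: This paper introduces IsharaKhobor, a Bangla Sign Language Translation (BdSLT) dataset, together with its subsets, to advance research on AI-assisted tools for a low-resource language. The authors discuss the challenges of building the dataset, benchmark it with landmark-based and embedding-based methods, and run ablations on vocabulary restriction and canonicalization that yield two smaller versions of the dataset.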

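Details Motivation: Bangla Sign Language resources are extremely scarce and there is no standard sentence-level dataset, which severely limits AI assistive technology for deaf and hard-of-hearing people. A high-quality, publicly available BdSLT dataset is therefore urgently needed. Method: The authors build the IsharaKhobor dataset and its subsets, benchmark with landmark-based raw features and RQE embeddings, and run ablation studies on vocabulary restriction and canonicalization, producing two reduced versions, IsharaKhobor_small and IsharaKhobor_canonical_small. Result: The IsharaKhobor dataset and its two subsets are released with benchmark results, and the ablations examine how vocabulary canonicalization and size reduction affect model training, providing a baseline resource and reference for later work. Conclusion: The release of IsharaKhobor fills a gap in Bangla Sign Language translation and supports low-resource sign language AI research; translation performance can be improved in future by continuing to extend and refine the data. Abstract: Bangla Sign Language Translation (BdSLT) has been severely constrained so far, as the language itself is very low-resource. Creating a standard sentence-level dataset for BdSLT is of immense importance for developing AI-based assistive tools for deaf and hard-of-hearing people in the Bangla-speaking community. In this paper, we present a dataset, IsharaKhobor, and two subsets of it to enable research. We also present the challenges of developing the dataset and suggest a way forward by benchmarking with landmark-based raw and RQE embeddings. We run ablations on vocabulary restriction and canonicalization within the dataset, which resulted in two more datasets, IsharaKhobor_small and IsharaKhobor_canonical_small. The dataset is publicly available at: www.kaggle.com/datasets/hasanssl/isharakhobor [1].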

[40] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

Minjoon Choi

Main category: cs.CL

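TL;DR: This paper proposes the RoParQ benchmark and the XParaCon metric for measuring how consistently large language models answer paraphrased questions, and uses a reasoning-based fine-tuning strategy to improve the model's grasp of semantic invariance; experiments show that the method significantly improves robustness.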

Details Motivation: Large language models often answer inconsistently when a question is paraphrased, which suggests that they rely on surface patterns rather than genuine semantic understanding. More effective evaluation tools and training methods are needed to improve consistency. Method: The RoParQ benchmark is built by generating paraphrases of standard datasets with proprietary models and keeping the examples that elicit inconsistent confidence from a judge model. The XParaCon metric quantifies robustness by computing the standard deviation of accuracies across question variants. A reasoning-based, paraphrase-aware supervised fine-tuning strategy is used to align models. Result: Lightweight models fine-tuned with the proposed SFT strategy show markedly better cross-paraphrase consistency, reaching levels comparable to much larger pre-trained models. Conclusion: The RoParQ benchmark, the XParaCon metric, and the fine-tuning strategy improve the model's handling of semantic invariance in question answering, reduce superficial memorization, and increase reliability and robustness. Abstract: Large Language Models (LLMs) often exhibit inconsistent behavior when answering paraphrased questions, suggesting a reliance on surface-level patterns rather than true semantic understanding. To address this limitation, we introduce RoParQ, a benchmark specifically constructed to evaluate cross-paraphrase consistency in closed-book multiple-choice QA. This benchmark is derived from standard datasets by generating paraphrases via proprietary models and selectively retaining examples that elicit inconsistent confidence from a judge model. We further propose XParaCon, a novel evaluation metric that quantifies a model's robustness by measuring the standard deviation of accuracies across question variants. Additionally, we implement a reasoning-based, paraphrase-aware Supervised Fine-Tuning (SFT) strategy designed to align models toward semantic invariance. Our experiments demonstrate that this targeted alignment significantly enhances robustness. Notably, fine-tuned lightweight models achieved consistency levels comparable to much larger pre-trained models. These results highlight the efficacy of our approach in mitigating superficial memorization and fostering more robust, reliable LLMs.
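
The core quantity behind XParaCon, the standard deviation of accuracies across question variants, can be sketched as follows. This is one reading of the abstract, not the paper's exact definition: any rescaling or transformation XParaCon applies on top of the standard deviation is not stated there, and the data layout and function names are assumptions.

```python
from statistics import pstdev


def variant_accuracies(correct):
    """correct[q][v] is True when variant v of question q was answered correctly."""
    n_variants = len(correct[0])
    return [sum(row[v] for row in correct) / len(correct) for v in range(n_variants)]


def accuracy_std(correct):
    """Population standard deviation of the per-variant accuracies."""
    return pstdev(variant_accuracies(correct))


# Hypothetical results: 4 questions, each asked in 3 paraphrased variants.
results = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [True, True, False],
]
print(variant_accuracies(results))
print(accuracy_std(results))
```

A model that answers every paraphrase of a question the same way yields identical per-variant accuracies and therefore a standard deviation of zero; larger values indicate sensitivity to surface wording.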

[41] Auxiliary Metrics Help Decoding Skill Neurons in the Wild

Yixiu Zhao,Xiaozhi Wang,Zijun Yao,Lei Hou,Juanzi Li

Main category: cs.CL

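TL;DR: This paper proposes a lightweight, general method that identifies neurons encoding specific skills in large language models by correlating neuron activations with auxiliary metrics such as external labels or the model's own confidence, uncovering interpretable behaviors and reasoning shortcuts in complex tasks without manual token aggregation.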

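Details Motivation: Large language models are highly capable, but their internal mechanisms are opaque, and existing methods are mostly limited to classification tasks and do not extend well to complex multi-skill scenarios. A more general, automated method for probing skill neurons is needed. Method: Building on the idea of soft prompt training, the method correlates neuron activations with auxiliary metrics (such as external labels and model confidence) to locate neurons associated with specific skills, avoiding the need for manual token aggregation. Result: The method is validated on tasks including open-ended generation and natural language inference; it detects neurons that drive known skills and uncovers previously unknown shortcut behavior in arithmetic reasoning on BigBench. Conclusion: The method is a simple, broadly applicable technique that reveals skill-related neurons in large language models and improves interpretability, especially for complex tasks involving multiple skills. Abstract: Large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, yet their internal mechanisms remain largely opaque. In this paper, we introduce a simple, lightweight, and broadly applicable method with a focus on isolating neurons that encode specific skills. Building upon prior work that identified "skill neurons" via soft prompt training on classification tasks, our approach extends the analysis to complex scenarios involving multiple skills. We correlate neuron activations with auxiliary metrics -- such as external labels and the model's own confidence score -- thereby uncovering interpretable and task-specific behaviors without the need for manual token aggregation. We empirically validate our method on tasks spanning open-ended text generation and natural language inference, demonstrating its ability to detect neurons that not only drive known skills but also reveal previously unidentified shortcuts in arithmetic reasoning on BigBench.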
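
The idea of correlating neuron activations with an auxiliary metric can be sketched with a plain Pearson correlation. The abstract does not say which correlation statistic or ranking rule the authors use, so the choice of Pearson, the ranking by absolute value, and the toy activations below are all assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_neurons(activations, metric):
    """Order neuron ids by |correlation| with the auxiliary metric, strongest first."""
    scores = {nid: pearson(acts, metric) for nid, acts in activations.items()}
    return sorted(scores, key=lambda nid: abs(scores[nid]), reverse=True)


# Hypothetical per-example activations for three neurons and a confidence score.
activations = {
    "n0": [0.1, 0.9, 0.2, 0.8],
    "n1": [0.5, 0.5, 0.4, 0.6],
    "n2": [0.9, 0.1, 0.8, 0.2],
}
confidence = [0.2, 0.9, 0.3, 0.8]
print(rank_neurons(activations, confidence))
```

Ranking by absolute value keeps neurons that fire when the metric is low (here `n2`) alongside those that fire when it is high, since both carry information about the skill.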

[42] Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

Dongyang Fan,Diba Hashemi,Sai Praneeth Karimireddy,Martin Jaggi

Main category: cs.CL

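TL;DR: This study explores using several kinds of metadata (such as fine-grained document quality indicators) to accelerate large language model pretraining. It finds that metadata encoded at finer granularity is more effective, and proposes metadata appending as an auxiliary task and learnable meta-tokens to improve training efficiency.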

Details Motivation: Prior work used only one metadata signal, URLs, to accelerate LLM pretraining; this paper asks whether other types of metadata can bring greater gains. Method: The study examines several metadata types, analyzes their effect as prefixes and suffixes during pretraining, introduces learnable meta-tokens trained with a masked loss, and uses probing to analyze how metadata affects representation learning. Result: Metadata such as fine-grained document quality indicators effectively accelerates pretraining; metadata appending as an auxiliary task and learnable meta-tokens both improve training efficiency; probing shows that metadata induces quality-aware latent structure. Conclusion: Different forms of metadata, especially those encoded at fine granularity, can markedly improve the efficiency and effectiveness of LLM pretraining, yielding practical guidelines for integrating metadata. Abstract: Incorporating metadata in Large Language Model (LLM) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that other types of metadata, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.
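
The difference between prepending and appending metadata can be made concrete with a sketch of how one training sequence and its loss mask might be assembled. Everything specific here is an assumption for illustration: the `<quality:high>` tag is hypothetical, and excluding prepended metadata from the loss is a choice of this sketch, since the abstract mentions masked loss only in connection with learnable meta-tokens.

```python
def build_example(doc_tokens, meta_tokens, mode):
    """Return (tokens, loss_mask) for one training sequence.

    mode="prepend": metadata is conditioning context; its positions are
                    excluded from the loss (an assumption of this sketch).
    mode="append":  metadata follows the document and is itself a prediction
                    target, which acts as the auxiliary task.
    """
    if mode == "prepend":
        tokens = meta_tokens + doc_tokens
        loss_mask = [0] * len(meta_tokens) + [1] * len(doc_tokens)
    elif mode == "append":
        tokens = doc_tokens + meta_tokens
        loss_mask = [1] * len(tokens)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return tokens, loss_mask


def masked_mean_loss(token_losses, loss_mask):
    """Average the per-token losses over positions where the mask is 1."""
    kept = [loss for loss, m in zip(token_losses, loss_mask) if m]
    return sum(kept) / len(kept)


doc = ["The", "cat", "sat"]
meta = ["<quality:high>"]  # hypothetical fine-grained quality tag
tokens, mask = build_example(doc, meta, "prepend")
print(tokens, mask)
print(masked_mean_loss([5.0, 1.0, 2.0, 3.0], mask))
```

In the appended case the model must predict the metadata from the document, so the same tag becomes a supervision signal rather than a hint.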

[43] The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry

Anna Marklová,Ondřej Vinš,Martina Vokáčová,Jiří Milička

Main category: cs.CL

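TL;DR: This study examines how well native speakers can tell Czech AI-generated poetry from human-written poetry and how they judge it aesthetically. The two are hard to distinguish from the text alone, but readers' beliefs about authorship significantly affect their aesthetic judgments.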

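Details Motivation: Most research on AI-generated poetry focuses on English, while low-resource languages such as Czech are overlooked. This paper investigates the quality and reception of AI-generated poetry in a morphologically complex Slavic language with less training data. Method: Czech native speakers judged the authorship of poems (AI or human) and rated them aesthetically; recognition accuracy and rating tendencies were analyzed, and a logistic regression model was used to explore the factors influencing judgments. Result: Participants identified the author correctly only 45.8% of the time on average, indicating that AI-generated poems are hard to distinguish from human ones. Poems believed to be AI-written received lower ratings, showing a clear authorship bias, yet AI poems were in fact rated equal to or higher than human ones. The more a participant liked a poem, the less likely they were to identify its author correctly. Familiarity with poetry or a literary background did not affect recognition ability. Conclusion: AI can generate human-level poetry in Czech, a low-resource, morphologically complex language, and readers' beliefs about authorship are closely tied to their aesthetic evaluation, highlighting the role of cognitive bias in the reception of AI art. Abstract: Large language models are increasingly capable of producing creative texts, yet most studies on AI-generated poetry focus on English -- a language that dominates training data. In this paper, we examine the perception of AI- and human-written Czech poetry. We ask if Czech native speakers are able to identify it and how they aesthetically judge it. Participants performed at chance level when guessing authorship (45.8% correct on average), indicating that Czech AI-generated poems were largely indistinguishable from human-written ones. Aesthetic evaluations revealed a strong authorship bias: when participants believed a poem was AI-generated, they rated it less favorably, even though AI poems were in fact rated equally or more favorably than human ones on average. The logistic regression model showed that the more people liked a poem, the less likely they were to assign its authorship accurately. Familiarity with poetry or literary background had no effect on recognition accuracy. Our findings show that AI can convincingly produce poetry even in a morphologically complex, low-resource (with respect to the training data of AI models) Slavic language such as Czech. The results suggest that readers' beliefs about authorship and the aesthetic evaluation of the poem are interconnected.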

[44] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

Dong Wang,Yang Li,Ansong Ni,Ching-Feng Yeh,Youssef Emad,Xinjie Lei,Liam Robbins,Karthik Padthe,Hu Xu,Xian Li,Asli Celikyilmaz,Ramya Raghavendra,Lifei Huang,Carole-Jean Wu,Shang-Wen Li

Main category: cs.CL

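TL;DR: This paper proposes Matrix, a decentralized framework for large-scale, flexible, and efficient multi-agent synthetic data generation, which removes the central orchestrator through distributed message passing and significantly increases data generation throughput.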

Details Motivation: Existing synthetic data generation frameworks either depend on a centralized orchestrator, which scales poorly, or are hardcoded for specific domains, which limits flexibility, so they cannot meet diverse, large-scale data generation needs. Method: Matrix is a decentralized framework that represents control flow and data flow as serialized messages passed through distributed queues. It uses a peer-to-peer architecture in which lightweight agents advance each task independently, compute-intensive operations are handled by distributed services, and high concurrency and modular configuration are implemented on Ray. Result: Across several synthesis scenarios (multi-agent dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service), Matrix achieves 2 to 15 times higher data generation throughput on the same hardware without sacrificing output quality. Conclusion: Matrix is an efficient, scalable, and flexible framework for multi-agent synthetic data generation; its decentralized design overcomes the bottleneck of traditional approaches and suits a wide range of data generation tasks. Abstract: Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present Matrix, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves 2--15× higher data generation throughput under identical hardware resources, without compromising output quality.

[45] ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

Hongjin Su,Shizhe Diao,Ximing Lu,Mingjie Liu,Jiacheng Xu,Xin Dong,Yonggan Fu,Peter Belcak,Hanrong Ye,Hongxu Yin,Yi Dong,Evelina Bakhturina,Tao Yu,Yejin Choi,Jan Kautz,Pavlo Molchanov

Main category: cs.CL

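TL;DR: This paper proposes ToolOrchestra, a method for training a small orchestration model (Orchestrator) that coordinates multiple tools via reinforcement learning, achieving higher accuracy and efficiency than GPT-5 on complex tasks, notably on benchmarks such as Humanity's Last Exam.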

Details Motivation: Large language models are powerful but costly and inefficient on complex tasks, so a more efficient and intelligent tool-coordination mechanism is needed to improve performance and practicality. Method: ToolOrchestra trains a lightweight 8B orchestration model with reinforcement learning that accounts for outcome, efficiency, and user preference, and the model dynamically schedules different tools to complete complex tasks. Result: Orchestrator scores 37.1% on HLE, surpassing GPT-5 at 35.1% while being 2.5x more efficient; on tau2-Bench and FRAMES it outperforms GPT-5 at about 30% of the cost, and it generalizes to unseen tools. Conclusion: Combining a lightweight orchestration model with diverse tools achieves the best trade-off between performance and cost, offering a viable path toward scalable tool-augmented reasoning systems. Abstract: Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.

[46] Revisiting Generalization Across Difficulty Levels: It's Not So Easy

Yeganeh Kordi,Nihal V. Nayak,Max Zuo,Ilana Nguyen,Stephen H. Bach

Main category: cs.CL

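TL;DR: Through a large-scale, fine-grained analysis that ranks examples in six datasets by difficulty using thousands of large language models and Item Response Theory (IRT), this study systematically evaluates how well LLMs generalize across task difficulty. Cross-difficulty generalization turns out to be limited: training only on easy or only on hard data does not yield consistent gains at all difficulty levels, which underlines the importance of covering a range of difficulties in both training and evaluation.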

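Details Motivation: Existing research disagrees on how the difficulty of training data affects LLM generalization to tasks of different difficulty, and lacks an objective, fine-grained way to assess difficulty. This paper aims to study cross-difficulty generalization systematically with a more objective, larger-scale method. Method: Item Response Theory (IRT) is used to compute example difficulty scores automatically from the outputs of thousands of different LLMs on six datasets, excluding subjective human judgment; cross-difficulty generalization is then evaluated systematically over combinations of training and test data at different difficulty levels. Result: Cross-difficulty generalization is usually limited. Training only on easy or only on hard data does not produce consistent improvements across the full difficulty range; a difficulty bias in the training data tends to improve performance on test data of the matching difficulty at the expense of generalization to other difficulties. Conclusion: To ensure good generalization, both training and evaluation data should cover a wide range of task difficulties; relying on data of a single difficulty is risky and does not represent a model's true performance. Abstract: We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs' generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.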
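
To show how IRT turns model outputs into a difficulty score, here is a sketch using the one-parameter (Rasch) model, in which the probability of a correct answer is a logistic function of ability minus difficulty. The abstract does not state which IRT variant or fitting procedure the authors use, so the Rasch form, the fixed abilities, the gradient-ascent fit, and the toy responses are all assumptions.

```python
from math import exp


def p_correct(ability, difficulty):
    """Rasch model: probability that a model with this ability answers correctly."""
    return 1.0 / (1.0 + exp(-(ability - difficulty)))


def fit_difficulty(abilities, responses, steps=200, lr=0.1):
    """Estimate one item's difficulty by gradient ascent on the log-likelihood.

    Abilities are held fixed for simplicity; a full IRT fit would estimate
    abilities and difficulties jointly.
    """
    b = 0.0
    for _ in range(steps):
        # d(log-likelihood)/d(b) = sum(p - y): misses push the difficulty up.
        grad = sum(p_correct(a, b) - y for a, y in zip(abilities, responses))
        b += lr * grad
    return b


# Hypothetical abilities of four models and their 0/1 results on two items.
abilities = [-1.0, 0.0, 1.0, 2.0]
easy_item = [0, 1, 1, 1]   # only the weakest model fails
hard_item = [0, 0, 0, 1]   # only the strongest model succeeds
print(fit_difficulty(abilities, easy_item), fit_difficulty(abilities, hard_item))
```

Because difficulty is inferred only from which models succeed, the resulting ranking reflects the abilities of the model population rather than human opinion, which is the property the abstract emphasizes.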

cs.CV [Back]

[47] Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?

David Amebley,Sayanton Dibbo

Main category: cs.CV

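TL;DR: This paper proposes a neuroscience-inspired topological regularization (tau) framework to strengthen the privacy resilience of multi-modal vision-language models (VLMs) against black-box membership inference attacks (MIA), and validates it on several models and datasets.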

Details Motivation: As multi-modal models (MMs) are widely deployed, they face new privacy attack vectors, especially membership inference attacks (MIA). Existing research focuses mainly on unimodal systems, and the resilience of multi-modal models, in particular neuroscience-inspired ones, to privacy attacks remains unexplored. Method: A neuroscience-inspired topological regularization (tau) framework is proposed to build NEURO-VLM variants (tau > 0) with stronger privacy resilience. Membership inference attack experiments are run on three VLMs (BLIP, PaliGemma 2, and ViT-GPT2) with three datasets (COCO, CC3M, and NoCaps) to evaluate the trade-off between privacy resilience and model utility (MPNet, ROUGE-2). Result: On BLIP with COCO, MIA attack success against the NEURO VLM drops by 24% in mean ROC-AUC while generated-text quality stays comparable to the baseline; results on the other models and datasets are consistent, indicating that the neuro-inspired structure improves privacy resilience without significantly sacrificing performance. Conclusion: Neuroscience-inspired topological structure can strengthen the resilience of multi-modal models against membership inference attacks, offering a viable path toward more secure multi-modal AI systems. Abstract: In the age of agentic AI, the growing deployment of multi-modal models (MMs) has introduced new attack vectors that can leak sensitive training data in MMs, causing privacy leakage. This paper investigates a black-box privacy attack, i.e., membership inference attack (MIA) on multi-modal vision-language models (VLMs). State-of-the-art research analyzes privacy attacks primarily on unimodal AI-ML systems, while recent studies indicate MMs can also be vulnerable to privacy attacks. While researchers have demonstrated that biologically inspired neural network representations can improve unimodal model resilience against adversarial attacks, it remains unexplored whether neuro-inspired MMs are resilient against privacy attacks. In this work, we introduce a systematic neuroscience-inspired topological regularization (tau) framework to analyze MM VLMs' resilience against image-text-based inference privacy attacks. We examine this phenomenon using three VLMs: BLIP, PaliGemma 2, and ViT-GPT2, across three benchmark datasets: COCO, CC3M, and NoCaps. Our experiments compare the resilience of baseline and neuro VLMs (with topological regularization), where the tau > 0 configuration defines the NEURO variant of VLM. Our results on the BLIP model using the COCO dataset illustrate that MIA attack success in NEURO VLMs drops by 24% mean ROC-AUC, while achieving similar model utility (similarities between generated and reference captions) in terms of MPNet and ROUGE-2 metrics.
This shows neuro VLMs are comparatively more resilient against privacy attacks, while not significantly compromising model utility. Our extensive evaluation with PaliGemma 2 and ViT-GPT2 models, on two additional datasets: CC3M and NoCaps, further validates the consistency of the findings. This work contributes to the growing understanding of privacy risks in MMs and provides evidence on neuro VLMs' resilience to privacy threats.
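
Attack success in membership inference is commonly summarized by ROC-AUC over the attacker's membership scores. The sketch below computes it by direct pairwise comparison; the score values are invented for illustration, and how the paper derives its scores or aggregates the mean ROC-AUC is not given in the abstract.

```python
def roc_auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attacker scores: higher means "more likely a training member".
members = [0.9, 0.8, 0.7]
nonmembers = [0.6, 0.4, 0.75]
print(roc_auc(members, nonmembers))
```

An AUC near 0.5 means the attacker does no better than guessing, so a defense is judged by how far it pulls the attack's AUC toward that value.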

[48] Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

Inferix Team,Tianyu Feng,Yizeng Han,Jiahao He,Yuanyu He,Xi Lin,Teng Liu,Hanfeng Lu,Jiasheng Tang,Wei Wang,Zhiyuan Wang,Jichao Wu,Mingyang Yang,Yinghao Yu,Zeyu Zhang,Bohan Zhuang

Main category: cs.CV

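TL;DR: Inferix is a next-generation inference engine designed for immersive world synthesis. By optimizing the semi-autoregressive decoding process, it supports efficient, high-quality long video generation, and it adds interactive video streaming and the fine-grained evaluation benchmark LV-Bench to advance world models.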

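Details Motivation: Existing video generation models are limited in producing long, coherent, interactive, high-quality videos, especially for applications such as world simulation and embodied AI. Standard video diffusion models lack an effective context management mechanism, which makes stable, scalable generation difficult. A new inference engine is therefore needed that supports semi-autoregressive decoding with KV Cache management. Method: Inferix adopts the semi-autoregressive (block-diffusion) decoding paradigm: video tokens are generated by diffusion within each block while conditioning on previous blocks, and LLM-style KV Cache management is reintroduced for efficient, variable-length, high-quality generation. The system is optimized specifically for world simulation and supports interactive video streaming, real-time dynamic simulation, and LV-Bench integration for fine-grained performance evaluation. Result: Inferix delivers more coherent and stable long video generation with real-time interaction and efficient inference, which sets it apart from classic video diffusion models and from high-concurrency LLM serving systems. Its integration with LV-Bench also enables precise evaluation of minute-long video generation. Conclusion: As an inference engine built for world models, Inferix uses semi-autoregressive decoding and KV Cache management to advance the potential of visual perception, understanding, and reasoning, marking a shift from current LLM-centric vision foundation models toward a new world-simulation paradigm. Abstract: World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in blocks, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics.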
Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.

[49] Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection?

Kun Guo,Yun Shen,Xijun Wang,Chaoqun You,Yun Rui,Tony Q. S. Quek

Main category: cs.CV

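TL;DR: The paper proposes LTED-Ada, an adaptive algorithm based on deep reinforcement learning that optimizes the choice between detection and tracking for video object recognition in edge computing environments, supports single-device and multi-device scenarios, and uses federated learning to improve generalization.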

Details Motivation: Fast, accurate video object recognition on resource-constrained devices is challenging, and existing hybrid approaches lack an effective strategy for switching between detection and tracking. Method: Long-term optimization problems are formulated for the single-device and multi-device cases, and the LTED-Ada algorithm uses deep reinforcement learning to choose adaptively between local tracking and edge detection; in the multi-device case, federated learning is introduced for collaborative policy training. Result: Hardware-in-the-loop experiments show that LTED-Ada outperforms baseline methods across different frame rates and dynamic network conditions, and federated learning improves the model's adaptability to unseen scenarios. Conclusion: LTED-Ada effectively balances recognition accuracy, delay, and computational overhead, and the federated version further improves generalization and scalability in multi-device environments. Abstract: Fast and accurate video object recognition, which relies on frame-by-frame video analytics, remains a challenge for resource-constrained devices such as traffic cameras. Recent advances in mobile edge computing have made it possible to offload computation-intensive object detection to edge servers equipped with high-accuracy neural networks, while lightweight and fast object tracking algorithms run locally on devices. This hybrid approach offers a promising solution but introduces a new challenge: deciding when to perform edge detection versus local tracking. To address this, we formulate two long-term optimization problems for both single-device and multi-device scenarios, taking into account the temporal correlation of consecutive frames and the dynamic conditions of mobile edge networks. Based on the formulation, we propose LTED-Ada for the single-device setting, a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection according to the frame rate as well as the recognition accuracy and delay requirements. In the multi-device setting, we further enhance LTED-Ada using federated learning to enable collaborative policy training across devices, thereby improving its generalization to unseen frame rates and performance requirements. Finally, we conduct extensive hardware-in-the-loop experiments using multiple Raspberry Pi 4B devices and a personal computer as the edge server, demonstrating the superiority of LTED-Ada.

[50] DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

Haibo HU,Lianming Huang,Nan Guan,Chun Jason Xue

Main category: cs.CV

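TL;DR: DeeAD is a training-free, action-guided early-exit framework that accelerates inference in Vision-Language Action (VLA) models by evaluating the physical feasibility of intermediate trajectories, significantly reducing latency in autonomous driving.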

Details Motivation: VLA models incur significant inference latency in autonomous driving because of their deep transformer stacks, which hurts real-time operation, so an efficient acceleration method that needs no retraining is required. Method: Proposes DeeAD, which uses lightweight planning priors (e.g., navigation or low-precision planning) to check whether a predicted trajectory lies within an acceptable deviation (<2m) and exits inference early when it does; a multi-hop controller adaptively skips redundant layers to improve efficiency. Result: On the Bench2Drive benchmark, DeeAD reaches up to 28% transformer-layer sparsity and a 29% latency reduction while preserving planning quality and safety. Conclusion: DeeAD integrates into existing VLA models (such as ORION) and significantly speeds up inference without retraining, offering an efficient and safe option for autonomous driving. Abstract: Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., Navigation or Low-precision Planning) within a tolerable deviation (<2m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.
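The exit rule described in the abstract (stop once an intermediate trajectory agrees with a planning prior to within 2m) can be illustrated with a minimal sketch. The deviation metric is an assumption here: the abstract does not say how deviation is measured, so the sketch uses the largest pointwise Euclidean distance. The function names `max_deviation` and `first_exit_layer` are hypothetical, and the multi-hop layer-skipping controller is not modeled.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) waypoint in metres


def max_deviation(traj: Sequence[Point], prior: Sequence[Point]) -> float:
    """Largest pointwise Euclidean distance between two equal-length trajectories.

    Assumption: this metric is chosen for illustration only.
    """
    if len(traj) != len(prior) or not traj:
        raise ValueError("trajectories must be non-empty and of equal length")
    return max(math.dist(p, q) for p, q in zip(traj, prior))


def first_exit_layer(
    per_layer_trajs: List[Sequence[Point]],
    prior: Sequence[Point],
    tol_m: float = 2.0,
) -> Optional[int]:
    """Return the index of the first layer whose decoded trajectory stays
    within tol_m of the prior, or None if no layer qualifies."""
    for layer, traj in enumerate(per_layer_trajs):
        if max_deviation(traj, prior) < tol_m:
            return layer
    return None


if __name__ == "__main__":
    prior = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    layers = [
        [(0.0, 5.0), (1.0, 5.0), (2.0, 5.0)],  # far from the prior
        [(0.0, 2.5), (1.0, 2.5), (2.0, 2.5)],  # still outside 2 m
        [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # within 2 m: exit here
    ]
    print(first_exit_layer(layers, prior))
```

In a real model each entry of `per_layer_trajs` would come from decoding an intermediate hidden state, so the loop would be interleaved with the forward pass rather than run over a precomputed list.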

[51] Foundry: Distilling 3D Foundation Models for the Edge

Guillaume Letellier,Siddharth Srivastava,Frédéric Jurie,Gaurav Sharma

Main category: cs.CV

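TL;DR: This paper proposes Foundation Model Distillation (FMD), a new paradigm for compressing large self-supervised foundation models while retaining their general-purpose representational power. Foundry, the first implementation for 3D point clouds, learns compact SuperTokens that reconstruct the teacher's token-level representations, yielding an efficient distilled model that generalizes well.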

Details Motivation: Large foundation models are powerful but computationally expensive and hard to deploy on edge devices; existing compression methods such as standard knowledge distillation sacrifice generality and cannot preserve downstream-agnostic generalization. Method: Proposes the FMD framework, in which a student learns a compact set of SuperTokens that reconstruct the teacher's token-level representations, capturing a compressed basis of its latent space. The method focuses on preserving the foundation model's general-purpose features. Result: Foundry transfers well across diverse downstream tasks (classification, part segmentation, and few-shot learning), approaching the original foundation model's performance while using significantly fewer tokens and FLOPs. Conclusion: FMD is an effective new paradigm for compressing foundation models: it sharply reduces model size and compute while keeping the model's core strength as a general-purpose feature extractor, making deployment on resource-constrained devices more practical. Abstract: Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient 'specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach, Foundry, trains a student to learn a compressed set of SuperTokens that reconstruct the teacher's token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks (classification, part segmentation, and few-shot scenarios), approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.

[52] DinoLizer: Learning from the Best for Generative Inpainting Localization

Minh Thong Doi,Jan Butora,Vincent Itier,Jérémie Boulanger,Patrick Bas

Main category: cs.CV

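TL;DR: DinoLizer is a DINOv2-based model for localizing manipulated regions in generative inpainting. It adds a classification head on top of the ViT patch embeddings, uses a sliding-window strategy for large images, and outperforms existing methods across multiple datasets and post-processing conditions.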

Details Motivation: Existing local manipulation detectors perform poorly on generative inpainting and are sensitive to post-processing, so a more robust and precise detector is needed. Method: Builds on DINOv2 and adds a linear classification head that predicts manipulated regions at 14x14 patch resolution; a sliding-window strategy handles large images, and the output heatmaps are post-processed into binary manipulation masks. Result: DinoLizer surpasses state-of-the-art detectors on several generative inpainting datasets, with a 12% higher average IoU, and remains robust after resizing, noise addition, and JPEG compression; ablation studies indicate an advantage for DINOv2 over DINOv3 on this task. Conclusion: DinoLizer exploits DINOv2's strong visual representations to localize manipulated regions in generative inpainting with high accuracy and robustness, showing the potential of ViTs for this kind of task. Abstract: We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. We add a linear classification head on top of the Vision Transformer's patch embeddings to predict manipulations at a $14\times 14$ patch resolution. The head is trained to focus on semantically altered regions, treating non-semantic edits as part of the original content. Because the ViT accepts only fixed-size inputs, we use a sliding-window strategy to aggregate predictions over larger images; the resulting heatmaps are post-processed to refine the estimated binary manipulation masks. Empirical results show that DinoLizer surpasses state-of-the-art local manipulation detectors on a range of inpainting datasets derived from different generative models. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. On average, DinoLizer achieves a 12\% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing. Our experiments with off-the-shelf DINOv2 demonstrate the strong representational power of Vision Transformers for this task. Finally, extensive ablation studies comparing DINOv2 and its successor, DINOv3, in deepfake localization confirm DinoLizer's superiority. The code will be publicly available upon acceptance of the paper.
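The sliding-window aggregation mentioned in the abstract can be sketched as follows. This is a generic illustration, not the authors' implementation: the abstract only says predictions are aggregated over larger images, so averaging overlapping windows is an assumption, as are the names `window_starts` and `aggregate_heatmap` and the `predict` callback standing in for the model.

```python
from typing import Callable, List

Grid = List[List[float]]


def window_starts(size: int, win: int, stride: int) -> List[int]:
    """Start offsets of windows of length `win` that cover [0, size).

    A final window flush with the border is added when the stride does not
    land there exactly. Requires win <= size and stride <= win (no gaps).
    """
    if win > size or not 0 < stride <= win:
        raise ValueError("need win <= size and 0 < stride <= win")
    starts = list(range(0, size - win + 1, stride))
    if starts[-1] != size - win:
        starts.append(size - win)
    return starts


def aggregate_heatmap(
    height: int,
    width: int,
    win: int,
    stride: int,
    predict: Callable[[int, int], Grid],
) -> Grid:
    """Average per-window score grids into one full-size heatmap.

    predict(top, left) must return a win x win grid of scores for the window
    whose top-left corner is (top, left). Averaging overlaps is an assumption.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top in window_starts(height, win, stride):
        for left in window_starts(width, win, stride):
            scores = predict(top, left)
            for i in range(win):
                for j in range(win):
                    total[top + i][left + j] += scores[i][j]
                    count[top + i][left + j] += 1
    return [[total[i][j] / count[i][j] for j in range(width)] for i in range(height)]
```

Thresholding the averaged heatmap would then give a binary mask; the post-processing the paper applies at that stage is not described in the abstract and is left out here.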

[53] CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design

Daeheon Jeong,Seoyeon Byun,Kihoon Son,Dae Hyun Kim,Juho Kim

Main category: cs.CV

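TL;DR: This paper introduces CANVAS, the first benchmark for evaluating vision-language models (VLMs) on tool-based user interface design. Its 598 tasks measure a model's ability to edit a UI step by step by invoking design-software tools, and the paper analyzes model performance and common errors.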

Details Motivation: VLMs can already operate design software through tool invocation, but no benchmark evaluates their iterative ability in a real UI design environment, so a standardized evaluation is needed to measure and improve their potential as design collaborators. Method: Proposes the CANVAS benchmark of 598 tool-based UI design tasks drawn from 3.3K mobile UI designs across 30 function-based categories; tasks are split into design replication and design modification, require the model to update the UI step by step through context-aware tool invocations, and come with ground-truth references for evaluation. Result: Experiments show that leading VLMs make more strategic tool invocations, which improves design quality; the study also identifies common error patterns, such as incorrect operation order and inaccurate component placement. Conclusion: CANVAS provides an effective benchmark for evaluating how VLMs interact with and edit designs in real design software, revealing both the potential and the shortcomings of current models' tool use and pointing the way for future work on VLM-assisted UI design. Abstract: User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision language models (VLMs) with tool invocation suggest these models can operate design software to edit a UI design through iteration. Understanding and enhancing this capacity is important, as it highlights VLMs' potential to collaborate with designers within conventional software. However, as no existing benchmark evaluates tool-based design performance, the capacity remains unknown. To address this, we introduce CANVAS, a benchmark for VLMs on tool-based user interface design. Our benchmark contains 598 tool-based design tasks paired with ground-truth references sampled from 3.3K mobile UI designs across 30 function-based categories (e.g., onboarding, messaging). In each task, a VLM updates the design step-by-step through context-based tool invocations (e.g., create a rectangle as a button background), linked to design software. Specifically, CANVAS incorporates two task types: (i) design replication evaluates the ability to reproduce a whole UI screen; (ii) design modification evaluates the ability to modify a specific part of an existing screen. Results suggest that leading models exhibit more strategic tool invocations, improving design quality. Furthermore, we identify common error patterns models exhibit, guiding future work in enhancing tool-based design capabilities.

[54] Text-Guided Semantic Image Encoder

Raghuveer Thirukovalluru,Xiaochuang Han,Bhuwan Dhingra,Emily Dinan,Maha Elbayad

Main category: cs.CV

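TL;DR: This paper proposes the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on a text query, improving vision-language models on several image-to-text tasks and markedly improving inference efficiency.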

Details Motivation: Conventional image encoders are pretrained independently of the downstream task and text query, which limits how well they adapt to specific tasks. This paper introduces text guidance so that the encoder produces image representations targeted at the given query. Method: Proposes the Text-Guided Semantic Image Encoder (TIE), which conditions image encoding on the input text query so that the representation better matches the text's semantics, and compares it with standard encoders on multiple vision-language tasks. Result: VLMs equipped with TIE gain +1.5 and +1.3 points on average across nine image-to-text benchmarks at the 1B and 3B scales, with gains of up to 6 points on tasks such as DocVQA and InfoVQA; they achieve better performance with only half as many image tiles, markedly improving inference efficiency; visualizations show that TIE attends to query-relevant image regions. Conclusion: Text-guided image encoding makes image representations more semantically relevant and task-adaptive, outperforming conventional methods in performance, efficiency, and interpretability, and it generalizes well. Abstract: Image encoders, a fundamental component of vision-language models (VLMs), are typically pretrained independently before being aligned with a language model. This standard paradigm results in encoders that process images agnostically, without regard to the specific downstream task or text query. To address this limitation, we propose the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on the input text query. VLMs equipped with TIE outperform their conventional counterparts by +1.5 and +1.3 points on average across nine image-to-text benchmarks at the 1B and 3B scales, respectively, with gains reaching up to 6 points on tasks such as DocVQA and InfoVQA. Moreover, TIE-based VLMs attain superior performance while utilizing only half as many image tiles (tokens), resulting in notably improved inference efficiency. TIE also generalizes well with generic queries, indicating that text-conditioned training effectively optimizes the encoder to capture key visual features. Qualitative analysis confirms that TIE consistently attends to query-relevant regions, enhancing both interpretability and query-specific grounding.

[55] One Patch is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues

Sindhuja Penchala,Gavin Money,Gabriel Marques,Samuel Wood,Jessica Kirschman,Travis Atkison,Shahram Rahimi,Noorbakhsh Amiri Golilarz

Main category: cs.CV

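TL;DR: This paper proposes SMARC, a unified model that reconstructs and classifies surface materials from a single contiguous patch covering only 10% of the image. It combines a partial convolutional U-Net with a classification head and achieves state-of-the-art RGB surface inpainting and material recognition under extremely sparse observation.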

Details Motivation: Existing methods rely on dense or full-scene observations and struggle when the field of view is restricted or only partially observed, so a method that understands material surfaces from sparse visual cues is needed. Method: Proposes SMARC, which combines a partial convolutional U-Net with a classification head and uses a single 10% contiguous image patch for surface reconstruction and material classification. Result: On the Touch and Go dataset, SMARC reaches a PSNR of 17.55 dB and a material classification accuracy of 85.10%, outperforming five models including ViT, MAE, and Swin Transformer. Conclusion: Partial convolution offers superior spatial reasoning under missing data, and SMARC lays a solid foundation for surface understanding under minimal visual input. Abstract: Understanding material surfaces from sparse visual cues is critical for applications in robotics, simulation, and material perception. However, most existing methods rely on dense or full-scene observations, limiting their effectiveness in constrained or partial-view environments. To address this challenge, we introduce SMARC, a unified model for Surface MAterial Reconstruction and Classification from minimal visual input. Given only a single 10% contiguous patch of the image, SMARC recognizes and reconstructs the full RGB surface while simultaneously classifying the material category. Our architecture combines a Partial Convolutional U-Net with a classification head, enabling both spatial inpainting and semantic understanding under extreme observation sparsity. We compared SMARC against five models including convolutional autoencoders [17], Vision Transformer (ViT) [13], Masked Autoencoder (MAE) [5], Swin Transformer [9], and DETR [2] using the Touch and Go dataset [16] of real-world surface textures. SMARC achieves state-of-the-art results with a PSNR of 17.55 dB and a material classification accuracy of 85.10%. Our findings highlight the advantages of partial convolution in spatial reasoning under missing data and establish a strong foundation for minimal-vision surface understanding.
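As context for the reported reconstruction score, PSNR is conventionally defined as 10 log10(MAX^2 / MSE). The sketch below computes it for flattened pixel lists, assuming 8-bit values (MAX = 255); the function name `psnr` is illustrative, and the abstract does not state how the paper averages over channels or images.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    Assumes values lie in [0, max_val]; returns infinity for identical inputs.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Because the scale is logarithmic, every 20 dB corresponds to a tenfold reduction in root-mean-square error relative to the peak value.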

[56] LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Zuhao Yang,Sudong Wang,Kaichen Zhang,Keming Wu,Sicong Leng,Yifan Zhang,Chengwei Qin,Shijian Lu,Xingxuan Li,Lidong Bing

Main category: cs.CV

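TL;DR: This paper proposes LongVT, an end-to-end agentic framework that reasons over long videos via multimodal Chain-of-Tool-Thought, using the model's temporal grounding ability for global-to-local video understanding, and releases a new dataset, VideoSIAH, to support training and evaluation.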

Details Motivation: Existing large multimodal models are prone to hallucination on long videos, especially when evidence is sparse and temporally dispersed, so a more reliable approach to long-video understanding and reasoning is needed. Method: Proposes the LongVT framework, which uses the LMM's temporal grounding ability as a native video cropping tool for iterative reasoning that moves from global skimming to local inspection; it designs a three-stage training strategy of tool-integrated cold-start fine-tuning, agentic reinforcement learning, and agentic reinforcement fine-tuning. Result: LongVT consistently outperforms strong existing baselines on four challenging long-video understanding and reasoning benchmarks; the new VideoSIAH dataset provides training and evaluation data, with an evaluation set of 1,280 human-validated QA pairs. Conclusion: By mimicking how humans watch long videos, LongVT achieves more reliable reasoning grounded in visual evidence and offers an effective new paradigm for long-video understanding. Abstract: Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks.
Our codes, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .

[57] Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models

Souradeep Dutta,Keshav Bulia,Neena S Nair

Main category: cs.CV

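TL;DR: This paper revisits the KRISP model from Facebook AI Research and presents a lightweight reproduction with fewer parameters, studying the scalability and effectiveness of knowledge-enhanced VQA architectures under resource constraints.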

Details Motivation: The original KRISP model is computationally expensive and depends on a large backbone, which makes it hard to deploy on resource-constrained devices, and it has design flaws and practical problems that were not fully disclosed. Method: Builds a lightweight KRISP reproduction through systematic ablation studies and validates it on synthetic VQA data and the DAQUAR dataset, restricting the external knowledge graph domain to bound the output range. Result: The reproduction reaches about 75% of the original model's performance, prevents AI hallucinations by confining outputs to the specified knowledge domain, and can run on edge devices such as smartphones and AR/VR headsets. Conclusion: Lightweight knowledge-enhanced VQA models retain relatively high performance while being practical and flexible to deploy, offering a feasible route to offline visual reasoning. Abstract: Facebook AI Research introduced KRISP [4], which integrates structured external knowledge into pipelines for vision-language reasoning. Despite its effectiveness, the original model was developed for industrial-scale training, is computationally demanding, and is tightly coupled to a large backbone. In this work, we reexamine KRISP from a different angle and offer a lightweight reproduction with significantly fewer parameters. Even though our replicated model reaches only about 75% of the original's performance, the replication process uncovers a number of design flaws, real-world pitfalls, and implicit problems that were not fully covered in the original paper. We offer insights into the scalability and efficacy of knowledge-enhanced VQA architectures under resource constraints through systematic ablation studies, which include a proof-of-concept on synthetic VQA data and evaluation on the DAQUAR dataset. Our model, configured with a low parameter setup and constrained by the external knowledge graph domain, prevents AI hallucinations and generates outputs solely within that domain. Minimal parameters allow us to function on edge devices like smartphones and AR/VR devices, further improving offline visual reasoning.

[58] Intriguing Properties of Dynamic Sampling Networks

Dario Morle,Reid Zaffino

Main category: cs.CV

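TL;DR: This paper proposes a new operator called "warping" that unifies the dynamic sampling methods used in deep learning and analyzes it theoretically, revealing an asymmetry between the forward and backward passes and a fundamental difference from conventional convolution operators. It also examines the conditions for stable training of dynamic sampling networks and the effects of discretization, and introduces a loss landscape visualization that uses gradient update information.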

Details Motivation: Dynamic sampling mechanisms perform well in many computer vision models but lack a unified theoretical framework. The authors aim to build a general formal tool that connects and explains the different dynamic sampling methods. Method: Proposes and analyzes a generalized operator called "warping", which can reconstruct several existing architectures (such as deformable convolutions, active convolutional units, and spatial transformer networks), and analyzes it statistically by modeling the inputs as IID variables and as homogeneous random fields; introduces a loss landscape visualization based on gradient updates. Result: Shows that warping operators form a class orthogonal to, and distinct from, the traditional translationally invariant convolution operators; finds a unique asymmetry between the forward and backward passes; identifies the conditions for stable training of dynamic sampling networks; analyzes the statistical effects of discretization; and proposes a new loss landscape visualization technique. Conclusion: Dynamic sampling mechanisms represent an entirely different class of operators that call for a new theoretical perspective, and the formal framework in this paper lays a foundation for designing and analyzing future models of this kind. Abstract: Dynamic sampling mechanisms in deep learning architectures have demonstrated utility across many computer vision models, though the theoretical analysis of these structures has not yet been unified. In this paper we connect the various dynamic sampling methods by developing and analyzing a novel operator which generalizes existing methods, which we term "warping". Warping provides a minimal implementation of dynamic sampling which is amenable to analysis, and can be used to reconstruct existing architectures including deformable convolutions, active convolutional units, and spatial transformer networks. Using our formalism, we provide statistical analysis of the operator by modeling the inputs as both IID variables and homogeneous random fields. Extending this analysis, we discover a unique asymmetry between the forward and backward pass of the model training. We demonstrate that these mechanisms represent an entirely different class of orthogonal operators to the traditional translationally invariant operators defined by convolutions. With a combination of theoretical analysis and empirical investigation, we find the conditions necessary to ensure stable training of dynamic sampling networks. In addition, a statistical analysis of discretization effects is presented. Finally, we introduce a novel loss landscape visualization which utilizes gradient update information directly, to better understand learning behavior.

[59] $Δ$-NeRF: Incremental Refinement of Neural Radiance Fields through Residual Control and Knowledge Transfer

Kriti Ghosh,Devjyoti Chakraborty,Lakshmish Ramaswamy,Suchendra M. Bhandarkar,In Kee Kim,Nancy O'Hare,Deepak Mishra

Main category: cs.CV

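TL;DR: This paper proposes Δ-NeRF, a modular residual framework for incremental NeRF refinement in settings where data arrives sequentially (such as satellite remote sensing). Using a residual controller, an uncertainty-aware gating mechanism, and a view selection strategy, it updates efficiently without retraining and without forgetting past information, and it applies knowledge distillation to compress the model, substantially cutting training time while maintaining strong performance.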

Details Motivation: Existing NeRF methods usually need retraining when new views are added incrementally and are prone to catastrophic forgetting, which makes them a poor fit for applications such as satellite observation where data arrives sequentially; an incremental NeRF framework that supports continual learning is therefore needed. Method: Proposes Δ-NeRF, which pairs a frozen base NeRF with a residual controller that injects per-layer corrections; an uncertainty-aware gating mechanism adaptively combines base and refined predictions; a view selection strategy reduces the amount of training data; and knowledge distillation compresses the enhanced model into a small student network. Result: Experiments on satellite imagery show that Δ-NeRF matches joint training while reducing training time by 30-42%; it improves PSNR by up to 43.5% over naive fine-tuning and surpasses joint training on some metrics; the model can be compressed to 20% of its original size. Conclusion: Δ-NeRF addresses catastrophic forgetting when updating NeRFs incrementally, enabling efficient, compact, high-performing continual 3D scene modeling that is especially suited to long time-series observation tasks such as terrain monitoring. Abstract: Neural Radiance Fields (NeRFs) have demonstrated remarkable capabilities in 3D reconstruction and novel view synthesis. However, most existing NeRF frameworks require complete retraining when new views are introduced incrementally, limiting their applicability in domains where data arrives sequentially. This limitation is particularly problematic in satellite-based terrain analysis, where regions are repeatedly observed over time. Incremental refinement of NeRFs remains underexplored, and naive approaches suffer from catastrophic forgetting when past data is unavailable. We propose $Δ$-NeRF, a unique modular residual framework for incremental NeRF refinement. $Δ$-NeRF introduces several novel techniques including: (1) a residual controller that injects per-layer corrections into a frozen base NeRF, enabling refinement without access to past data; (2) an uncertainty-aware gating mechanism that prevents overcorrection by adaptively combining base and refined predictions; and (3) a view selection strategy that reduces training data by up to 47\% while maintaining performance. Additionally, we employ knowledge distillation to compress the enhanced model into a compact student network (20\% of original size). Experiments on satellite imagery demonstrate that $Δ$-NeRF achieves performance comparable to joint training while reducing training time by 30-42\%. $Δ$-NeRF consistently outperforms existing baselines, achieving an improvement of up to 43.5\% in PSNR over naive fine-tuning and surpassing joint training on some metrics.

[60] Layer-Aware Video Composition via Split-then-Merge

Ozgur Kara,Yujia Chen,Ming-Hsuan Yang,James M. Rehg,Wen-Sheng Chu,Du Tran

Main category: cs.CV

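TL;DR: Proposes the Split-then-Merge (StM) framework, which decomposes and recomposes unlabeled videos to improve control in generative video composition and make better use of data.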

Details Motivation: To address generative video composition's dependence on annotated data or handcrafted rules, and its data scarcity problem. Method: Splits a large corpus of unlabeled videos into dynamic foreground and background layers and self-composes them to learn how dynamic subjects interact with scenes; introduces a transformation-aware training pipeline, multi-layer fusion and augmentation, and an identity-preservation loss. Result: Outperforms state-of-the-art methods on quantitative benchmarks and in human and VLLM-based qualitative evaluations. Conclusion: StM effectively learns complex compositional dynamics in video, enabling more realistic and controllable video generation. Abstract: We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods in both quantitative benchmarks and human/VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io

[61] SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Md Tanvirul Alam,Saksham Aggarwal,Justin Yang Chae,Nidhi Rastogi

Main category: cs.CV

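TL;DR: Sphinx is a synthetic environment for visual perception and reasoning with 25 task types. Evaluation shows that current large models perform far below humans, while reinforcement learning with verifiable rewards substantially improves performance.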

Details Motivation: To build a visual reasoning environment that targets core cognitive abilities and provides verifiable ground-truth solutions, supporting precise evaluation and large-scale dataset construction. Method: Procedurally generates puzzles from motifs, tiles, charts, icons, and geometric primitives, defines 25 types of visual reasoning tasks, and applies reinforcement learning with verifiable rewards (RLVR) to improve model performance. Result: State-of-the-art LVLMs such as GPT-5 reach only 51.1% accuracy on the benchmark, far below human level; RLVR substantially improves model accuracy and also yields gains on external visual reasoning benchmarks. Conclusion: Sphinx provides a challenging testbed for visual reasoning, and RLVR is a promising way to improve multimodal reasoning. Abstract: We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.

[62] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

Samuele Dell'Erba,Andrew D. Bagdanov

Main category: cs.CV

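TL;DR: This paper proposes OVI, a training-free and data-free optimization method that replaces the expensive diffusion prior network in text-to-image generation, and introduces two new constraints to improve generation quality; experiments show performance comparable to existing state-of-the-art methods.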

Details Motivation: Existing text-to-image diffusion models rely on prior networks that are computationally expensive and need large amounts of training data; this paper challenges that necessity and explores a more efficient alternative. Method: Proposes Optimization-based Visual Inversion (OVI), which initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize cosine similarity with the text prompt embedding; two new constraints, a Mahalanobis-based loss and a Nearest-Neighbor loss, regularize the optimization. Result: Experiments on Kandinsky 2.2 show that OVI can replace a traditional prior; the analysis finds a flaw in current benchmarks such as T2I-CompBench++, where using the text embedding alone as the prior already scores highly; the proposed constraints, especially the Nearest-Neighbor loss, noticeably improve visual fidelity, with quantitative scores matching or exceeding the state-of-the-art data-efficient prior. Conclusion: OVI is a promising training-free alternative to a trained prior, current evaluation benchmarks need to be re-examined, and the direction merits further study. Abstract: Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with the input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
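The core loop described in the abstract (random initialization, then iterative maximization of cosine similarity with the text embedding) can be sketched with plain gradient ascent. Everything beyond that loop is an assumption made for illustration: the vectors are treated as flat lists in a shared space, the optimizer, step size, and step count are arbitrary, the names `cosine`, `cosine_grad`, and `optimize_latent` are hypothetical, and the Mahalanobis and Nearest-Neighbor regularizers are omitted.

```python
import math
import random
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_grad(v: Sequence[float], t: Sequence[float]) -> List[float]:
    """Gradient of cosine(v, t) with respect to v:
    t / (|v||t|) - cosine(v, t) * v / |v|^2."""
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_t = math.sqrt(sum(b * b for b in t))
    c = cosine(v, t)
    return [b / (norm_v * norm_t) - c * a / (norm_v * norm_v) for a, b in zip(v, t)]


def optimize_latent(
    text_emb: Sequence[float], steps: int = 500, lr: float = 0.5, seed: int = 0
) -> List[float]:
    """Gradient ascent on cosine similarity, starting from a random vector
    (standing in for the random pseudo-token initialization)."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in text_emb]
    for _ in range(steps):
        grad = cosine_grad(v, text_emb)
        v = [a + lr * g for a, g in zip(v, grad)]
    return v
```

With no regularizer, this objective simply drives the latent toward the direction of the text embedding, which is consistent with the abstract's observation that the text embedding used directly as a prior scores well on benchmarks despite lower perceptual quality.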

[63] RefTr: Recurrent Refinement of Confluent Trajectories for 3D Vascular Tree Centerline Graphs

Roman Naeem,David Hagerman,Jennifer Alvén,Fredrik Kahl

Main category: cs.CV

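TL;DR: RefTr is a Transformer-based 3D image-to-graph model for generating vascular tree centerlines; it recurrently refines confluent trajectories to achieve high recall and an accurate tree topology.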

Details Motivation: Accurately detecting the centerlines of tubular trees with correct tree topology is critical for clinical diagnosis and surgical navigation; high recall matters in particular, because missing small branches can lead to fatal mistakes. Method: Proposes RefTr, a Producer-Refiner architecture built on a Transformer decoder: the Producer generates initial confluent trajectories and the Refiner refines them over several iterations; a confluent trajectory representation maintains a valid tree topology, and an efficient non-maximum suppression algorithm merges duplicate branches. Result: On several public datasets, RefTr achieves higher recall than existing methods with comparable precision, faster inference, and 2.4x fewer decoder parameters. Conclusion: RefTr performs strongly on vascular tree analysis in 3D medical images and has the potential to become a new state-of-the-art framework. Abstract: Tubular trees, such as blood vessels and lung airways, are essential for material transport within the human body. Accurately detecting their centerlines with correct tree topology is critical for clinical tasks such as diagnosis, treatment planning, and surgical navigation. In these applications, maintaining high recall is crucial, as missing small branches can result in fatal mistakes caused by incomplete assessments or undetected abnormalities. We present RefTr, a 3D image-to-graph model for centerline generation of vascular trees via recurrent refinement of confluent trajectories. RefTr uses a Producer-Refiner architecture based on a Transformer decoder, where the Producer proposes a set of initial confluent trajectories that are recurrently refined by the Refiner to produce final trajectories, which form the centerline graph. The confluent trajectory representation enables refinement of complete trajectories while explicitly enforcing a valid tree topology. The recurrent refinement scheme improves precision and reuses the same Refiner block across multiple steps, yielding a 2.4x reduction in decoder parameters compared to previous SOTA. We also introduce an efficient non-maximum suppression algorithm for spatial tree graphs to merge duplicate branches and boost precision. Across multiple public centerline datasets, RefTr achieves superior recall and comparable precision to previous SOTA, while offering faster inference and substantially fewer parameters, demonstrating its potential as a new state-of-the-art framework for vascular tree analysis in 3D medical imaging.

[64] MODEST: Multi-Optics Depth-of-Field Stereo Dataset

Nisarg K. Trivedi,Vinayak A. Belludi,Li-Yun Wang,Pardis Taghavi,Dante Lok

Main category: cs.CV

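TL;DR: This paper presents the first high-resolution stereo DSLR dataset, with 18000 real-scene images that systematically vary focal length and aperture, aimed at improving how depth estimation, shallow depth-of-field rendering, and related tasks generalize to real optical conditions.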

Details Motivation: Depth estimation research is constrained by the lack of large-scale, high-fidelity real stereo DSLR datasets, so models perform poorly under real optical conditions and show a marked realism gap when transferred from synthetic data. Method: Captures 9 complex scenes with two identical camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), covering 50 optical configurations with 2000 images per scene, for a total of 18000 images at 5472×3648px; calibration data supports evaluation of intrinsic and extrinsic calibration. Result: The dataset includes challenging elements such as multi-scale optical illusions, reflective surfaces, transparent glass, fine-grained details, and lighting variations, and experiments show that current state-of-the-art monocular and stereo depth estimation and depth-of-field methods still struggle on it. Conclusion: The dataset helps bridge the realism gap between synthetic training data and real camera optics and provides a high-fidelity, controlled real-world benchmark for depth estimation and related tasks, supporting research on generalization to real optics. Abstract: Reliable depth estimation under real optical conditions remains a core challenge for camera vision in systems such as autonomous robotics and augmented reality. Despite recent progress in depth estimation and depth-of-field rendering, research remains constrained by the lack of large-scale, high-fidelity, real stereo DSLR datasets, limiting real-world generalization and evaluation of models trained on synthetic data, as shown extensively in the literature. We present the first high-resolution (5472$\times$3648px) stereo DSLR dataset with 18000 images, systematically varying focal length and aperture across complex real scenes and capturing the optical realism and complexity of professional camera systems. For 9 scenes with varying scene complexity, lighting and background, images are captured with two identical camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), spanning 50 optical configurations in 2000 images per scene. This full-range optics coverage enables controlled analysis of geometric and optical effects for monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D scene reconstruction and novel view synthesis. Each focal configuration has a dedicated calibration image set, supporting evaluation of classical and learning-based methods for intrinsic and extrinsic calibration. The dataset features challenging visual elements such as multi-scale optical illusions, reflective surfaces, mirrors, transparent glass walls, fine-grained details, and natural / artificial ambient light variations.
This work attempts to bridge the realism gap between synthetic training data and real camera optics, and demonstrates challenges with the current state-of-the-art monocular, stereo depth and depth-of-field methods. We release the dataset, calibration files, and evaluation code to support reproducible research on real-world optical generalization.

[65] Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

Sree Bhattacharyya,Yaman Kumar Singla,Sudhir Yarram,Somesh Kumar Singh,Harini S,James Z. Wang

Main category: cs.CV

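TL;DR: This paper introduces a large-scale unsupervised dataset for modeling memorability signals in visual content, containing over 82,000 videos with recall descriptions. It draws on tip-of-the-tongue (ToT) retrieval queries from platforms such as Reddit to advance visual memorability research.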

Details Motivation: Existing datasets are limited by costly human annotation and low diversity, and most provide only aggregate memorability scores that miss the fine-grained signals in natural recall, so a richer, more scalable way to model visual memorability is needed. Method: Collects tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit to build a large-scale unsupervised dataset of more than 82,000 videos with corresponding recall descriptions, trains a multimodal ToT retrieval model with a contrastive learning strategy, and fine-tunes large vision-language models to generate open-ended recall descriptions. Result: On recall generation, vision-language models fine-tuned on the dataset outperform state-of-the-art models such as GPT-4o; the work also presents the first model capable of multimodal ToT retrieval. Conclusion: The dataset and models open a new direction for visual memorability research, improving performance on recall generation and ToT retrieval and enabling more scalable, finer-grained analysis in the field. Abstract: Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that our unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. We also employ a contrastive training strategy to create the first model capable of performing multimodal ToT retrieval. Our dataset and models present a novel direction, facilitating progress in visual content memorability research.

[66] Estimating Fog Parameters from a Sequence of Stereo Images

Yining Ding,João F. C. Mota,Andrew M. Wallace,Sen Wang

Main category: cs.CV

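TL;DR: Proposes a method for dynamically estimating fog parameters from a sequence of stereo foggy images, using joint optimisation for more accurate estimates, and releases SDIRF, the first dataset that combines real foggy scenes with calibrated photometric parameters.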

Details Motivation: Existing methods estimate fog parameters sequentially, which is prone to error propagation, and they struggle with real fog that is globally inhomogeneous. A method that estimates all parameters simultaneously and updates them dynamically is needed to make vision systems more robust in fog. Method: A new joint optimisation algorithm estimates all parameters of the fog model simultaneously, assuming fog is only locally homogeneous in order to cope with inhomogeneous real-world fog. The method is designed as an add-on module for existing visual SLAM or odometry systems, and the SDIRF dataset, containing real foggy stereo images, photometric calibration parameters, and counterpart clear-weather data, is built for validation. Result: The method outperforms prior methods on both synthetic data and real SDIRF data, with more accurate parameter estimates and better adaptation to real fog. The released SDIRF dataset contains over 40 minutes and more than 34,000 frames of high-quality stereo images with accurate calibration information. Conclusion: The proposed method improves visual perception in fog, the joint optimisation strategy outperforms sequential estimation, and SDIRF provides an important resource for follow-up research on visual perception in fog. Abstract: We propose a method which, given a sequence of stereo foggy images, estimates the parameters of a fog model and updates them dynamically. In contrast with previous approaches, which estimate the parameters sequentially and thus are prone to error propagation, our algorithm estimates all the parameters simultaneously by solving a novel optimisation problem. By assuming that fog is only locally homogeneous, our method effectively handles real-world fog, which is often globally inhomogeneous. The proposed algorithm can be easily used as an add-on module in existing visual Simultaneous Localisation and Mapping (SLAM) or odometry systems in the presence of fog. In order to assess our method, we also created a new dataset, the Stereo Driving In Real Fog (SDIRF), consisting of high-quality, consecutive stereo frames of real, foggy road scenes under a variety of visibility conditions, totalling over 40 minutes and 34k frames. As a first-of-its-kind, SDIRF contains the camera's photometric parameters calibrated in a lab environment, which is a prerequisite for correctly applying the atmospheric scattering model to foggy images. The dataset also includes the counterpart clear data of the same routes recorded in overcast weather, which is useful for companion work in image defogging and depth reconstruction. We conducted extensive experiments using both synthetic foggy data and real foggy sequences from SDIRF to demonstrate the superiority of the proposed algorithm over prior methods.
Our method not only produces the most accurate estimates on synthetic data, but also adapts better to real fog. We make our code and SDIRF publicly available (https://github.com/SenseRoboticsLab/estimating-fog-parameters) to the community with the aim of advancing the research on visual perception in fog.
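The abstract refers to the atmospheric scattering model without writing it out. As a rough illustration, the commonly used form is I = J·t + A·(1 − t) with transmission t = exp(−β·d), where β is the scattering coefficient and A the atmospheric light. The sketch below applies that model to a single pixel and inverts it for β. The function names are hypothetical, and this single-pixel inversion is only an illustration of the model, not the joint optimisation the authors propose.

```python
import math


def transmission(beta, depth_m):
    """Fraction of scene radiance that survives a path of depth_m metres."""
    return math.exp(-beta * depth_m)


def foggy_intensity(clear, depth_m, beta, airlight):
    """Atmospheric scattering model: I = J * t + A * (1 - t)."""
    t = transmission(beta, depth_m)
    return clear * t + airlight * (1.0 - t)


def estimate_beta(observed, clear, depth_m, airlight):
    """Invert the model for one pixel (assumes clear != airlight and 0 < t <= 1)."""
    t = (observed - airlight) / (clear - airlight)
    return -math.log(t) / depth_m


# A dark surface 50 m away, seen through fog with beta = 0.02 per metre.
observed = foggy_intensity(clear=0.2, depth_m=50.0, beta=0.02, airlight=0.9)
recovered = estimate_beta(observed, clear=0.2, depth_m=50.0, airlight=0.9)
```

In practice the clear intensity, depth, and atmospheric light are themselves uncertain, which is presumably why estimating the parameters one after another propagates error.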

[67] V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

Jiancheng Pan,Runze Wang,Tianwen Qian,Mohammad Mahdi,Yanwei Fu,Xiangyang Xue,Xiaomeng Huang,Luc Van Gool,Danda Pani Paudel,Yuqian Fu

Main category: cs.CV

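TL;DR: This paper presents V^2-SAM, a unified cross-view object correspondence framework that extends SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators, achieving state-of-the-art performance on multiple benchmarks.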

Details Motivation: Because of drastic viewpoint and appearance variations, existing segmentation models such as SAM2 are hard to apply directly to cross-view object correspondence, so a new method that can handle these challenges is needed. Method: The V^2-SAM framework contains two prompt generators. The Cross-View Anchor Prompt Generator (V^2-Anchor), built on DINOv3 features, establishes geometry-aware correspondences and enables coordinate-based prompting. The Cross-View Visual Prompt Generator (V^2-Visual) aligns ego-exo representations from feature and structural perspectives through a new visual prompt matcher. In addition, a multi-expert design and a Post-hoc Cyclic Consistency Selector (PCCS) adaptively select the most reliable expert. Result: Extensive experiments on Ego-Exo4D, DAVIS-2017, and HANDAL-X validate the effectiveness of V^2-SAM and set new state-of-the-art performance. Conclusion: V^2-SAM successfully extends SAM2 to cross-view object correspondence and, by combining geometric and appearance cues with an adaptive selection mechanism, performs strongly across a range of application scenarios. Abstract: Cross-view object correspondence, exemplified by the representative task of ego-exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V^2-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V^2-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V^2-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego-exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V^2-SAM, achieving new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).

[68] Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation

Taehoon Kim,Henry Gouk,Timothy Hospedales

Main category: cs.CV

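TL;DR: Proposes Null-TTA, which performs test-time alignment of diffusion models by optimising the unconditional text embedding, avoiding reward hacking, preserving semantic coherence, and improving both target alignment and cross-reward generalisation.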

Details Motivation: Existing test-time alignment methods tend to under-optimise or over-optimise (reward hack) and lack a mechanism for aligning on a semantically coherent manifold. Method: Optimise the unconditional text embedding in classifier-free guidance, rather than latent or noise variables, exploiting the structured semantic nature of the text embedding space. Result: Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation and effectively preventing reward hacking. Conclusion: Semantic-space optimisation is an effective and principled new paradigm for test-time alignment. Abstract: Test-time alignment (TTA) aims to adapt models to specific rewards during inference. However, existing methods tend to either under-optimise or over-optimise (reward hack) the target reward function. We propose Null-Text Test-Time Alignment (Null-TTA), which aligns diffusion models by optimising the unconditional embedding in classifier-free guidance, rather than manipulating latent or noise variables. Due to the structured semantic nature of the text embedding space, this ensures alignment occurs on a semantically coherent manifold and prevents reward hacking (exploiting non-semantic noise patterns to improve the reward). Since the unconditional embedding in classifier-free guidance serves as the anchor for the model's generative distribution, Null-TTA directly steers the model's generative distribution towards the target reward rather than just adjusting the samples, even without updating model parameters. Thanks to these desirable properties, we show that Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation. This establishes semantic-space optimisation as an effective and principled novel paradigm for TTA.
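To make the role of the unconditional branch concrete, the sketch below shows the standard classifier-free guidance combination, eps = eps_uncond + w·(eps_cond − eps_uncond). The function name and the toy vectors are assumptions for illustration; the Null-TTA optimisation loop itself is not shown. The point is only that the unconditional prediction acts as the anchor, so changing it shifts the guided output even when the conditional branch is untouched.

```python
def cfg_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    towards the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_cond = [1.0, 1.0]

# Same conditional prediction, two different unconditional anchors.
guided_a = cfg_noise([0.0, 1.0], eps_cond, scale=2.0)
guided_b = cfg_noise([0.5, 1.0], eps_cond, scale=2.0)
```

With `scale=1.0` the result equals the conditional prediction; larger scales push further away from the anchor.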

[69] GaINeR: Geometry-Aware Implicit Network Representation

Weronika Jakubowska,Mikołaj Zieliński,Rafał Tobiasz,Krzysztof Byrski,Maciej Zięba,Dominik Belter,Przemysław Spurek

Main category: cs.CV

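TL;DR: Proposes GaINeR, a geometry-aware implicit representation that combines trainable Gaussian distributions with a neural network, providing continuous representation of 2D images, interpretable geometric structure, and flexible local editing.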

Details Motivation: Traditional implicit neural representations (INRs) lack explicit geometric structure and offer limited support for local editing and physical simulation, which restricts their use in dynamic or interactive settings. Method: For a given image coordinate, the model retrieves the K nearest Gaussians, aggregates distance-weighted embeddings, and predicts the RGB value with a neural network. Result: The approach delivers high-quality image reconstruction with explicit geometric structure and good local editing capability, supporting physically aware and interactive image manipulation. Conclusion: GaINeR retains the high-fidelity expressiveness of INRs while improving geometric interpretability and editing flexibility, broadening their potential in dynamic and interactive settings. Abstract: Implicit Neural Representations (INRs) have become an essential tool for modeling continuous 2D images, enabling high-fidelity reconstruction, super-resolution, and compression. Popular architectures such as SIREN, WIRE, and FINER demonstrate the potential of INR for capturing fine-grained image details. However, traditional INRs often lack explicit geometric structure and have limited capabilities for local editing or integration with physical simulation, restricting their applicability in dynamic or interactive settings. To address these limitations, we propose GaINeR: Geometry-Aware Implicit Network Representation, a novel framework for 2D images that combines trainable Gaussian distributions with a neural network-based INR. For a given image coordinate, the model retrieves the K nearest Gaussians, aggregates distance-weighted embeddings, and predicts the RGB value via a neural network. This design enables continuous image representation, interpretable geometric structure, and flexible local editing, providing a foundation for physically aware and interactive image manipulation. The official implementation of our method is publicly available at https://github.com/WJakubowska/GaINeR.
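The lookup step described above (retrieve the K nearest Gaussians, then aggregate distance-weighted embeddings) can be sketched as follows. This is a minimal illustration, not the official implementation: the isotropic Gaussian kernel, the `sigma` parameter, and the function name are assumptions, and the neural network that maps the aggregated feature to RGB is omitted.

```python
import math


def knn_gaussian_feature(coord, gaussians, k=2, sigma=1.0):
    """Aggregate the embeddings of the k Gaussians nearest to `coord`.

    gaussians: list of (center, embedding) pairs, where center is an (x, y)
    tuple and embedding is a list of floats. Weights follow an isotropic
    Gaussian kernel of the distance (an assumed weighting scheme).
    """
    nearest = sorted(gaussians, key=lambda g: math.dist(coord, g[0]))[:k]
    weights = [
        math.exp(-math.dist(coord, center) ** 2 / (2.0 * sigma ** 2))
        for center, _ in nearest
    ]
    total = sum(weights)
    dim = len(nearest[0][1])
    return [
        sum(w * emb[i] for w, (_, emb) in zip(weights, nearest)) / total
        for i in range(dim)
    ]


gaussians = [
    ((0.0, 0.0), [1.0, 0.0]),
    ((2.0, 0.0), [0.0, 1.0]),
    ((10.0, 10.0), [5.0, 5.0]),  # far away, ignored when k=2
]
feature = knn_gaussian_feature((1.0, 0.0), gaussians, k=2)
```

Because each query depends only on nearby Gaussians, moving or editing one Gaussian changes the image locally, which is one way to read the local-editing claim.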

[70] A deep learning model to reduce agent dose for contrast-enhanced MRI of the cerebellopontine angle cistern

Yunjie Chen,Rianne A. Weber,Olaf M. Neve,Stephan R. Romeijn,Erik F. Hensen,Jelmer M. Wolterink,Qian Tao,Marius Staring,Berit M. Verbist

Main category: cs.CV

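TL;DR: This study develops a deep learning model that restores standard-quality images from T1-weighted MRI acquired with a very low contrast agent dose (10%-30%), significantly improving image quality and tumor segmentation performance and helping to reduce clinical contrast agent use.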

Details Motivation: To reduce the contrast agent dose used in contrast-enhanced MRI of the cerebellopontine angle (CPA) cistern, lowering patient risk and improving safety while preserving diagnostic image quality. Method: Using multi-center retrospective data, T1 and T1ce MRI from vestibular schwannoma patients were used to simulate low-dose images at varying dose reductions. Deep learning models were trained to restore standard-dose T1ce from these low-dose inputs, and image quality and segmentation performance were evaluated. Result: At 10% input dose, DL-restored images raised the segmentation Dice score from 0.673 to 0.734, with clear improvements in Hausdorff distance and surface distance. The structural similarity index and peak signal-to-noise ratio rose as the input dose increased. A radiologist rated restored images from both 10% and 30% doses as excellent, with the latter more informative. Conclusion: The deep learning model can restore the image quality of MRI acquired with a very low contrast agent dose (10%-30%), supporting a large reduction in contrast agent use for lesion detection and diagnosis in the CPA cistern without sacrificing diagnostic value. Abstract: Objectives: To evaluate a deep learning (DL) model for reducing the agent dose of contrast-enhanced T1-weighted MRI (T1ce) of the cerebellopontine angle (CPA) cistern. Materials and methods: In this multi-center retrospective study, T1 and T1ce of vestibular schwannoma (VS) patients were used to simulate low-dose T1ce with varying reductions of contrast agent dose. DL models were trained to restore standard-dose T1ce from the low-dose simulation. The image quality and segmentation performance of the DL-restored T1ce were evaluated. A head and neck radiologist was asked to rate DL-restored images in multiple aspects, including image quality and diagnostic characterization. Results: 203 MRI studies from 72 VS patients (mean age, 58.51 ± 14.73, 39 men) were evaluated. As the input dose increased, the structural similarity index measure of the restored T1ce increased from 0.639 ± 0.113 to 0.993 ± 0.009, and the peak signal-to-noise ratio increased from 21.6 ± 3.73 dB to 41.4 ± 4.84 dB. At 10% input dose, using DL-restored T1ce for segmentation improved the Dice from 0.673 to 0.734, the 95% Hausdorff distance from 2.38 mm to 2.07 mm, and the average surface distance from 1.00 mm to 0.59 mm. Both DL-restored T1ce from 10% and 30% input doses showed excellent images, with the latter being considered more informative. Conclusion: The DL model improved the image quality of low-dose MRI of the CPA cistern, which makes lesion detection and diagnostic characterization possible with 10% - 30% of the standard dose.
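For readers less familiar with the reported metrics, the sketch below computes a Dice score on flattened binary masks and a peak signal-to-noise ratio on flattened intensity arrays, using their standard definitions. The function names and the toy inputs are illustrative assumptions and have no connection to the study's data or evaluation code.

```python
import math


def dice(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    return 1.0 if size == 0 else 2.0 * intersection / size


def psnr(restored, reference, max_value=1.0):
    """PSNR in dB = 10 * log10(max_value^2 / MSE) for flattened images."""
    mse = sum((a - b) ** 2 for a, b in zip(restored, reference)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


overlap = dice([1, 1, 0, 0], [1, 0, 1, 0])   # one shared voxel out of 2 + 2
quality_db = psnr([0.0, 0.0], [0.1, 0.1])    # MSE of 0.01 on a [0, 1] scale
```

Dice rewards overlap between predicted and reference tumor masks, while PSNR measures pixel-wise fidelity of the restored image; the two can move independently, which is why the abstract reports both.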

[71] Smooth regularization for efficient video recognition

Gil Goldman,Raja Giryes,Mahadev Satyanarayanan

Main category: cs.CV

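TL;DR: Proposes a smooth regularization method based on a Gaussian Random Walk that strengthens the temporal inductive bias of video recognition models and significantly improves lightweight models on Kinetics-600.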

Details Motivation: Lightweight video models struggle to capture complex temporal dynamics and do not model the natural temporal coherence of videos. Method: Changes in the intermediate-layer embeddings of consecutive frames are modeled as a Gaussian Random Walk (GRW), which encourages smooth representational change, suppresses abrupt shifts, and promotes low-acceleration temporal representations. Result: On Kinetics-600, the MoViNets family improves by 3.8%-6.1%, and MobileNetV3 and MoViNets-Stream improve by 4.9%-6.4%, each reaching the state of the art under its FLOP or memory constraint. Conclusion: The proposed smooth regularization strengthens the temporal modeling of lightweight models, significantly improving performance across several architectures and setting new state-of-the-art results. Abstract: We propose a smooth regularization technique that instills a strong temporal inductive bias in video recognition models, particularly benefiting lightweight architectures. Our method encourages smoothness in the intermediate-layer embeddings of consecutive frames by modeling their changes as a Gaussian Random Walk (GRW). This penalizes abrupt representational shifts, thereby promoting low-acceleration solutions that better align with the natural temporal coherence inherent in videos. By leveraging this enforced smoothness, lightweight models can more effectively capture complex temporal dynamics. Applied to such models, our technique yields a 3.8% to 6.4% accuracy improvement on Kinetics-600. Notably, the MoViNets model family trained with our smooth regularization improves the current state of the art by 3.8% to 6.1% within their respective FLOP constraints, while MobileNetV3 and the MoViNets-Stream family achieve gains of 4.9% to 6.4% over prior state-of-the-art models with comparable memory footprints. Our code and models are available at https://github.com/gilgoldm/grw-smoothing.
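One plausible reading of a "low-acceleration" smoothness term is a penalty on the second temporal difference of per-frame embeddings. The sketch below implements that reading only. It is an assumption made for illustration: the abstract does not give the loss, and the authors' GRW formulation may differ (for example in how increments are normalised or which layers are used). The function name is hypothetical.

```python
def smoothness_penalty(embeddings):
    """Mean squared second difference over a sequence of per-frame embeddings.

    embeddings: list of equal-length float lists, one per frame. A trajectory
    that changes at a constant rate has zero penalty; abrupt reversals are
    penalised heavily.
    """
    if len(embeddings) < 3:
        return 0.0
    total = 0.0
    for prev, cur, nxt in zip(embeddings, embeddings[1:], embeddings[2:]):
        total += sum((n - 2.0 * c + p) ** 2 for p, c, n in zip(prev, cur, nxt))
    return total / (len(embeddings) - 2)


steady = smoothness_penalty([[0.0], [1.0], [2.0]])   # constant velocity
abrupt = smoothness_penalty([[0.0], [1.0], [0.0]])   # sudden reversal
```

In training, a term like this would typically be added to the classification loss with a small weight, so that the model is nudged towards temporally coherent features without being forced to ignore real scene changes.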

[72] Open Vocabulary Compositional Explanations for Neuron Alignment

Biagio La Rosa,Leilani H. Gilpin

Main category: cs.CV

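TL;DR: This paper proposes an open vocabulary compositional explanation framework for the vision domain that probes neuron responses to arbitrary concepts using masks generated by open vocabulary semantic segmentation, overcoming the reliance of earlier methods on human-annotated data.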

Details Motivation: Existing compositional explanation methods rely on human-annotated datasets, which restricts them to specific domains and predefined concepts. A more flexible and scalable way to understand how neurons encode information is needed. Method: The framework has three steps: specifying arbitrary concepts, generating semantic segmentation masks with open vocabulary models, and deriving compositional explanations from those masks. It uses open vocabulary semantic segmentation models to produce annotations automatically, without human intervention. Result: Experiments show that the framework performs comparably to or better than previous methods on quantitative metrics and human interpretability. The paper also analyzes how explanations change when moving from human annotations to model annotations, and demonstrates the framework's flexibility and extensibility across tasks and properties. Conclusion: The proposed framework removes the dependence of traditional compositional explanations on human annotation, supports neuron analysis on arbitrary datasets and concepts, and significantly broadens the applicability and flexibility of explainable AI methods. Abstract: Neurons are the fundamental building blocks of deep neural networks, and their interconnections allow AI to achieve unprecedented results. Motivated by the goal of understanding how neurons encode information, compositional explanations leverage logical relationships between concepts to express the spatial alignment between neuron activations and human knowledge. However, these explanations rely on human-annotated datasets, restricting their applicability to specific domains and predefined concepts. This paper addresses this limitation by introducing a framework for the vision domain that allows users to probe neurons for arbitrary concepts and datasets. Specifically, the framework leverages masks generated by open vocabulary semantic segmentation to compute open vocabulary compositional explanations. The proposed framework consists of three steps: specifying arbitrary concepts, generating semantic segmentation masks using open vocabulary models, and deriving compositional explanations from these masks. The paper compares the proposed framework with previous methods for computing compositional explanations both in terms of quantitative metrics and human interpretability, analyzes the differences in explanations when shifting from human-annotated data to model-annotated data, and showcases the additional capabilities provided by the framework in terms of flexibility of the explanations with respect to the tasks and properties of interest.

[73] UruDendro4: A Benchmark Dataset for Automatic Tree-Ring Detection in Cross-Section Images of Pinus taeda L

Henry Marichal,Joaquin Blanco,Diego Passarella,Gregory Randall

Main category: cs.CV

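TL;DR: This paper introduces the UruDendro4 dataset, which contains 102 wood cross-section images of loblolly pine (Pinus taeda L.) taken at different stem heights, with manually annotated annual rings. It supports image-based automatic tree-ring detection and volumetric growth modeling, provides performance baselines for deep learning methods on the dataset, and shows that the dataset improves model generalization.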

Details Motivation: Existing wood cross-section datasets are scarce and mostly limited to slices at a single height, which makes it hard to support accurate automatic tree-ring detection and volumetric growth modeling. A more comprehensive, finely annotated dataset with samples distributed along the stem is needed to advance this research. Method: The authors build a new dataset, UruDendro4, containing 102 cross-section images of Pinus taeda L. sampled at multiple stem heights, each with manually annotated ring boundaries. They run the state-of-the-art deep learning model DeepCS-TRD for ring detection and evaluate it with mean Average Precision (mAP), mean Average Recall (mAR), and Adapted Rand Error. They also run ablation experiments to tune the parameter configuration and test the dataset's effect on model generalization. Result: DeepCS-TRD achieved the best performance on UruDendro4, with a mean Average Precision of 0.838, a mean Average Recall of 0.782, and an Adapted Rand Error of 0.084. Ablation experiments confirmed the model configuration, and adding UruDendro4 to training significantly improved cross-dataset generalization. Conclusion: UruDendro4 is a high-quality tree-ring image dataset sampled at multiple heights that fills a gap in the structural diversity of existing data. It supports accurate automatic ring detection and provides a basis for image-based volumetric modeling of annual growth, and the experiments indicate that it helps develop and evaluate related algorithms. Abstract: Tree-ring growth represents the annual wood increment for a tree, and quantifying it allows researchers to assess which silvicultural practices are best suited for each species. Manual measurement of this growth is time-consuming and often imprecise, as it is typically performed along 4 to 8 radial directions on a cross-sectional disc. In recent years, automated algorithms and datasets have emerged to enhance accuracy and automate the delineation of annual rings in cross-sectional images. To address the scarcity of wood cross-section data, we introduce the UruDendro4 dataset, a collection of 102 image samples of Pinus taeda L., each manually annotated with annual growth rings. Unlike existing public datasets, UruDendro4 includes samples extracted at multiple heights along the stem, allowing for the volumetric modeling of annual growth using manually delineated rings. This dataset (images and annotations) allows the development of volumetric models for annual wood estimation based on cross-sectional imagery. Additionally, we provide a performance baseline for automatic ring detection on this dataset using state-of-the-art methods. The highest performance was achieved by the DeepCS-TRD method, with a mean Average Precision of 0.838, a mean Average Recall of 0.782, and an Adapted Rand Error score of 0.084.
A series of ablation experiments were conducted to empirically validate the final parameter configuration. Furthermore, we empirically demonstrate that training a learning model including this dataset improves the model's generalization in the tree-ring detection task.

[74] BUSTR: Breast Ultrasound Text Reporting with a Descriptor-Aware Vision-Language Model

Rawa Mohammed,Mina Attin,Bryar Shareef

Main category: cs.CV

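TL;DR: Proposes BUSTR, a multitask vision-language framework for automatic breast ultrasound report generation that needs no paired image-report supervision and combines token-level and alignment losses to improve generation quality and clinical efficacy.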

Details Motivation: Automatic report generation for breast ultrasound is limited by the lack of paired image-report datasets and by hallucinations from large language models. Method: BUSTR constructs reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features, learns descriptor-aware visual representations with a multi-head Swin encoder, and aligns visual and textual tokens through a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss. Result: On two public datasets, BrEaST and BUS-BRA, BUSTR consistently improves standard natural language generation metrics and clinical efficacy metrics, particularly for BI-RADS category and pathology. Conclusion: A descriptor-aware vision model trained with the combined loss can improve report generation quality and clinical usefulness without paired image-report data. Abstract: Automated radiology report generation (RRG) for breast ultrasound (BUS) is limited by the lack of paired image-report datasets and the risk of hallucinations from large language models. We propose BUSTR, a multitask vision-language framework that generates BUS reports without requiring paired image-report supervision. BUSTR constructs reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features, learns descriptor-aware visual representations with a multi-head Swin encoder trained using a multitask loss over dataset-specific descriptor sets, and aligns visual and textual tokens via a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss between input and output representations. We evaluate BUSTR on two public BUS datasets, BrEaST and BUS-BRA, which differ in size and available descriptors. Across both datasets, BUSTR consistently improves standard natural language generation metrics and clinical efficacy metrics, particularly for key targets such as BI-RADS category and pathology. Our results show that this descriptor-aware vision model, trained with a combined token-level and alignment loss, improves both automatic report metrics and clinical efficacy without requiring paired image-report data. The source code can be found at https://github.com/AAR-UNLV/BUSTR

[75] Beyond Realism: Learning the Art of Expressive Composition with StickerNet

Haoming Lu,David Kocharian,Humphrey Shi

Main category: cs.CV

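TL;DR: This paper defines the expressive composition task, which reflects the diverse styles and looser placement logic that users apply when editing images on real creative platforms, and proposes StickerNet, a two-stage framework that predicts sticker parameters such as opacity, location, and scale. The dataset is built from 1.8 million real editing actions collected on an anonymous online platform, emphasizing learning from real user behavior, and experiments show that StickerNet matches human placement behavior better than baseline models.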

Details Motivation: Traditional image composition focuses on visual realism and semantic plausibility, but in practice many users care more about artistry, playfulness, and social engagement than about realism. A new composition paradigm is needed to reflect this expressive editing. Method: StickerNet is a two-stage framework: the first stage identifies the composition type, and the second stage predicts the sticker's placement parameters (such as opacity, mask, location, and scale) accordingly. The dataset is built from 1.8 million editing actions collected on a real online platform and directly reflects placement decisions validated by the user community. Result: User studies and quantitative evaluations show that StickerNet outperforms common baselines at producing sticker placements that match human preferences and closely mimics real user behavior, despite the inherent ambiguity of the task. Conclusion: This work opens a new direction in image composition centered on expressiveness and user intent, underlines the importance of learning from real-world editing behavior, and offers a perspective on visual understanding that goes beyond realism. Abstract: As a widely used operation in image editing workflows, image composition has traditionally been studied with a focus on achieving visual realism and semantic plausibility. However, in practical editing scenarios of the modern content creation landscape, many compositions are not intended to preserve realism. Instead, users of online platforms motivated by gaining community recognition often aim to create content that is more artistic, playful, or socially engaging. Taking inspiration from this observation, we define the expressive composition task, a new formulation of image composition that embraces stylistic diversity and looser placement logic, reflecting how users edit images on real-world creative platforms. To address this underexplored problem, we present StickerNet, a two-stage framework that first determines the composition type, then predicts placement parameters such as opacity, mask, location, and scale accordingly. Unlike prior work that constructs datasets by simulating object placements on real images, we directly build our dataset from 1.8 million editing actions collected on an anonymous online visual creation and editing platform, each reflecting user-community validated placement decisions. This grounding in authentic editing behavior ensures strong alignment between task definition and training supervision.
User studies and quantitative evaluations show that StickerNet outperforms common baselines and closely matches human placement behavior, demonstrating the effectiveness of learning from real-world editing patterns despite the inherent ambiguity of the task. This work introduces a new direction in visual understanding that emphasizes expressiveness and user intent over realism.

[76] TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

Md Adnan Arefeen,Biplob Debnath,Srimat Chakradhar

Main category: cs.CV

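TL;DR: Proposes TrafficLens, an algorithm for video analysis at multi-camera traffic intersections that uses sequential processing and iterative application of vision-language models to significantly reduce video-to-text conversion time while maintaining information accuracy.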

Details Motivation: Multi-camera traffic video produces large volumes of data, and conventional approaches take a long time to convert it into text for analysis, which delays insights and incident response. A more efficient analysis method is needed. Method: TrafficLens uses a sequential strategy that exploits the overlapping coverage areas of cameras, iteratively applying vision-language models (VLMs) with different token limits and feeding the output of one stage into the next. It also introduces an object-level similarity detector to skip redundant VLM calls. Result: Experiments on real-world datasets show that TrafficLens reduces video-to-text conversion time by up to 4 times while maintaining high information accuracy. Conclusion: TrafficLens offers an effective solution for efficient analysis of multi-camera traffic video, significantly increasing processing speed and suiting real-time traffic management and incident investigation. Abstract: Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (LLMs) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing overlapping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to $4\times$ while maintaining information accuracy.
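The idea of skipping redundant VLM calls with an object-level similarity check can be sketched as below. Everything here is a hypothetical illustration: the abstract does not specify the similarity measure, so Jaccard overlap of detected object labels is assumed, the `describe` callable stands in for a VLM, and the threshold value is arbitrary.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of object labels."""
    a, b = set(a), set(b)
    return 1.0 if not a and not b else len(a & b) / len(a | b)


def describe_clips(clips, describe, threshold=0.8):
    """Describe each clip, reusing the last description when the detected
    objects are nearly unchanged.

    clips: list of object-label lists, one per clip (assumed to come from a
    cheap detector). describe: stand-in for an expensive VLM call.
    """
    descriptions = []
    last_objects, last_text = None, None
    for objects in clips:
        if last_objects is not None and jaccard(objects, last_objects) >= threshold:
            descriptions.append(last_text)  # skip the VLM call
            continue
        last_text = describe(objects)
        last_objects = objects
        descriptions.append(last_text)
    return descriptions


calls = []


def stub_vlm(objects):
    """Toy stand-in that records each call and returns a sorted label string."""
    calls.append(objects)
    return ",".join(sorted(objects))


result = describe_clips([["car", "bus"], ["bus", "car"], ["truck"]], stub_vlm)
```

Here the second clip contains the same objects as the first, so only two of the three clips reach the stand-in VLM. The saving in a real system depends on how often consecutive views are near-duplicates and on how cheap the detector is relative to the VLM.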

[77] Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI

Al Amin,Kamrul Hasan,Liang Hong,Sharif Ullah

Main category: cs.CV

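TL;DR: This paper proposes a privacy-preserving federated learning framework that combines Vision Transformers with homomorphic encryption for histopathology image classification across medical institutions. Encrypting the ViT CLS token enables efficient and secure aggregation, significantly reducing communication overhead and resisting reconstruction attacks.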

Details Motivation: Privacy regulations such as HIPAA restrict the sharing of medical data, and gradients in conventional federated learning remain vulnerable to reconstruction attacks, so a more secure and efficient privacy-preserving scheme is needed. Method: A Vision Transformer extracts the CLS token as a compact feature representation, the CLS token is encrypted with CKKS homomorphic encryption before transmission, and inference and aggregation are carried out directly on ciphertexts. Result: Communication is reduced 30-fold compared with gradient encryption, with only 326 KB of encrypted data per round. Conventional gradients are vulnerable to attack (PSNR: 52.26 dB, SSIM: 0.999), whereas the proposed approach defends against it. Global classification accuracy reaches 96.12% in the unencrypted domain and 90.02% in the encrypted domain. Conclusion: The framework achieves efficient cross-institution histopathology classification under strong privacy guarantees, balancing security, communication efficiency, and model performance. Abstract: Collaborative machine learning across healthcare institutions promises improved diagnostic accuracy by leveraging diverse datasets, yet privacy regulations such as HIPAA prohibit direct patient data sharing. While federated learning (FL) enables decentralized training without raw data exchange, recent studies show that model gradients in conventional FL remain vulnerable to reconstruction attacks, potentially exposing sensitive medical information. This paper presents a privacy-preserving federated learning framework combining Vision Transformers (ViT) with homomorphic encryption (HE) for secure multi-institutional histopathology classification. The approach leverages the ViT CLS token as a compact 768-dimensional feature representation for secure aggregation, encrypting these tokens using CKKS homomorphic encryption before transmission to the server. We demonstrate that encrypting CLS tokens achieves a 30-fold communication reduction compared to gradient encryption while maintaining strong privacy guarantees. Through evaluation on a three-client federated setup for lung cancer histopathology classification, we show that gradients are highly susceptible to model inversion attacks (PSNR: 52.26 dB, SSIM: 0.999, NMI: 0.741), enabling near-perfect image reconstruction. In contrast, the proposed CLS-protected HE approach prevents such attacks while enabling encrypted inference directly on ciphertexts, requiring only 326 KB of encrypted data transmission per aggregation round. The framework achieves 96.12 percent global classification accuracy in the unencrypted domain and 90.02 percent in the encrypted domain.

[78] Inversion-Free Style Transfer with Dual Rectified Flows

Yingying Deng,Xiangyu He,Fan Tang,Weiming Dong,Xucheng Yin

Main category: cs.CV

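TL;DR: Proposes an inversion-free style transfer framework based on dual rectified flows that fuses content and style efficiently using only forward passes.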

Details Motivation: Existing training-free diffusion-based methods depend on computationally expensive inversion, and inaccurate inversion causes visual distortions, hurting both efficiency and quality. Method: A dual rectified flow framework predicts content and style trajectories in parallel and fuses their velocity fields through dynamic midpoint interpolation, with an attention injection mechanism guiding style integration. Result: The method achieves efficient, high-quality style transfer across diverse style and content combinations, preserving content structure and improving visual fidelity without any inversion step. Conclusion: The method offers an efficient, robust, inversion-free approach to style transfer that balances generation quality and computational cost. Abstract: Style transfer, a pivotal task in image processing, synthesizes visually compelling images by seamlessly blending realistic content with artistic styles, enabling applications in photo editing and creative design. While mainstream training-free diffusion-based methods have greatly advanced style transfer in recent years, their reliance on computationally expensive inversion processes compromises efficiency and introduces visual distortions when inversion is inaccurate. To address these limitations, we propose a novel inversion-free style transfer framework based on dual rectified flows, which tackles the challenge of finding an unknown stylized distribution from two distinct inputs (content and style images) with only a forward pass. Our approach predicts content and style trajectories in parallel, then fuses them through a dynamic midpoint interpolation that integrates velocities from both paths while adapting to the evolving stylized image. By jointly modeling the content, style, and stylized distributions, our velocity field design achieves robust fusion and avoids the shortcomings of naive overlays. Attention injection further guides style integration, enhancing visual fidelity, content preservation, and computational efficiency. Extensive experiments demonstrate generalization across diverse styles and content, providing an effective and efficient pipeline for style transfer.

[79] RefOnce: Distilling References into a Prototype Memory for Referring Camouflaged Object Detection

Yu-Huan Wu,Zi-Xuan Zhu,Yan Wang,Liangli Zhen,Deng-Ping Fan

Main category: cs.CV

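TL;DR: Proposes a new Ref-COD framework that distills reference image information into a class-prototype memory during training, so that a guidance vector can be generated at inference without reference images, giving an efficient and simple route to referring camouflaged object detection.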

Details Motivation: Existing Ref-COD methods depend on reference images at test time, which makes deployment harder, adds latency, and increases the data-collection burden. A method that needs no test-time references would be more practical and efficient. Method: A class-prototype memory maintains one prototype per category during training, updated by exponential moving average (EMA). At inference, mixture weights over the prototypes are predicted from the query image to synthesize a reference vector. A bidirectional attention alignment module narrows the representation gap between reference statistics and camouflaged query features. Result: Evaluated on the large-scale R2C7K benchmark, the method performs on par with or better than recent methods without any test-time reference images. Conclusion: The framework achieves Ref-COD without test-time reference images, offers good deployability and efficiency, and provides a simple yet effective solution for referring camouflaged object detection. Abstract: Referring Camouflaged Object Detection (Ref-COD) segments specified camouflaged objects in a scene by leveraging a small set of referring images. Though effective, current systems adopt a dual-branch design that requires reference images at test time, which limits deployability and adds latency and data-collection burden. We introduce a Ref-COD framework that distills references into a class-prototype memory during training and synthesizes a reference vector at inference via a query-conditioned mixture of prototypes. Concretely, we maintain an EMA-updated prototype per category and predict mixture weights from the query to produce a guidance vector without any test-time references. To bridge the representation gap between reference statistics and camouflaged query features, we propose a bidirectional attention alignment module that adapts both the query features and the class representation. Thus, our approach yields a simple, efficient path to Ref-COD without mandatory references. We evaluate the proposed method on the large-scale R2C7K benchmark. Extensive experiments demonstrate competitive or superior performance of the proposed method compared with recent state-of-the-art methods. Code is available at https://github.com/yuhuan-wu/RefOnce.
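The two mechanisms named above, an EMA-updated prototype per category and a query-conditioned mixture of prototypes, can be sketched in a few lines. This is a simplified illustration under assumptions: the momentum value, the use of a softmax over logits for the mixture weights, and both function names are not taken from the paper, and the network that would predict the logits from the query is omitted.

```python
import math


def ema_update(prototype, feature, momentum=0.9):
    """Training-time update: blend a new reference feature into the prototype."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(prototype, feature)]


def mix_prototypes(prototypes, logits):
    """Inference-time guidance vector: softmax-weighted sum of class prototypes.

    logits: one score per prototype, assumed to be predicted from the query image.
    """
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(prototypes[0])
    return [sum(w * p[i] for w, p in zip(weights, prototypes)) for i in range(dim)]


updated = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.5)
guidance = mix_prototypes([[1.0, 0.0], [0.0, 1.0]], logits=[0.0, 0.0])
```

Because the prototypes are stored after training, inference needs only the query image and the memory, which is what removes the test-time reference branch.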

[80] Wavefront-Constrained Passive Obscured Object Detection

Zhiwen Zheng,Yiwei Ouyang,Zhao Huang,Tao Zhang,Xiaoshuai Zhang,Huiyu Zhou,Wenwen Tang,Shaowei Jiang,Jin Liu,Xingru Huang

Main category: cs.CV

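TL;DR: Proposes WavePCNet, a physics-driven network that simulates wavefront propagation to enhance the perception of obscured objects, significantly improving localization and segmentation in complex environments.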

Details Motivation: When working with faint light patterns from beyond the field of view, existing methods fail to capture the physics of coherent light propagation accurately and tend to produce non-physical solutions at low signal-to-noise ratios, harming stability and reliability. Method: WavePCNet combines Tri-Phase Wavefront Complex-Propagation Reprojection (TriWCP), which models complex amplitude transfer, with a momentum memory mechanism that suppresses the accumulation of perturbations, and a High-frequency Cross-layer Compensation Enhancement module that uses multi-scale receptive fields and frequency-selective pathways to better model structural consistency. Result: Experiments on four physically collected datasets show that WavePCNet outperforms existing state-of-the-art methods in both accuracy and robustness. Conclusion: By tightly integrating physical models with deep learning, WavePCNet addresses non-line-of-sight imaging under multiple scattering and medium-induced perturbations and improves the perception of obscured objects. Abstract: Accurately localizing and segmenting obscured objects from faint light patterns beyond the field of view is highly challenging due to multiple scattering and medium-induced perturbations. Most existing methods, based on real-valued modeling or local convolutional operations, are inadequate for capturing the underlying physics of coherent light propagation. Moreover, under low signal-to-noise conditions, these methods often converge to non-physical solutions, severely compromising the stability and reliability of the observation. To address these challenges, we propose a novel physics-driven Wavefront Propagating Compensation Network (WavePCNet) to simulate wavefront propagation and enhance the perception of obscured objects. This WavePCNet integrates the Tri-Phase Wavefront Complex-Propagation Reprojection (TriWCP) to incorporate complex amplitude transfer operators to precisely constrain coherent propagation behavior, along with a momentum memory mechanism to effectively suppress the accumulation of perturbations. Additionally, a High-frequency Cross-layer Compensation Enhancement is introduced to construct frequency-selective pathways with multi-scale receptive fields and dynamically model structural consistency across layers, further boosting the model's robustness and interpretability under complex environmental conditions. Extensive experiments conducted on four physically collected datasets demonstrate that WavePCNet consistently outperforms state-of-the-art methods across both accuracy and robustness.

[81] GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision

Yuxiao Xiang,Junchi Chen,Zhenchao Jin,Changtao Miao,Haojie Yuan,Qi Chu,Tao Gong,Nenghai Yu

Main category: cs.CV

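TL;DR: This paper presents GuardTrace-VL, a vision-aware safety auditor for multimodal large reasoning models that monitors the full question-thinking-answer pipeline through joint image-text analysis and detects unsafe content that emerges during reasoning.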

Details Motivation: Existing multimodal safety guards focus on the input question and the final answer and overlook harmful content that may appear in the intermediate reasoning (such as biased inferences or policy-violating use of visual context), which creates deployment risk. A safety mechanism that monitors the full reasoning chain is needed. Method: GuardTrace-VL jointly analyzes images and text to monitor the full Question-Thinking-Answer (QTA) pipeline. The GuardTrace dataset is generated with diverse prompting strategies and refined through MLRM- and human-based voting and verification. A three-stage progressive training scheme, combined with the data refinement process, lets the model learn nuanced safety preferences across risk levels. Result: On a test set covering in-domain and out-of-domain scenarios, GuardTrace-VL achieves an F1 score of 93.1% on unsafe reasoning detection, a 13.5% improvement over the previous strongest multimodal safety defense methods. Conclusion: GuardTrace-VL identifies potentially harmful content in multimodal reasoning, significantly improves the safety of multimodal large models, and shows good generalization and practical promise. Abstract: Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via an MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, the GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The code will be made publicly available.

[82] From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition

Jingxi Chen,Yixiao Zhang,Xiaoye Qian,Zongxia Li,Cornelia Fermuller,Caren Chen,Yiannis Aloimonos

Main category: cs.CV

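TL;DR: Proposes a lightweight fine-tuning approach built on a diffusion model for image layer decomposition, combined with a multi-modal context fusion module and trained on synthetic data, achieving superior object removal and occlusion recovery.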

Details Motivation: Decomposing a single image into layers remains challenging because effective methods and data are lacking. The authors aim to address this by exploiting the connection between inpainting and layer decomposition. Method: A diffusion-based inpainting model is extended with a novel multi-modal context fusion module with linear attention complexity and trained via lightweight fine-tuning on a purely synthetic dataset. Result: The model performs strongly on object removal and occlusion recovery, outperforming existing methods. Conclusion: The proposed method effectively decomposes images into layers, opening new possibilities for content editing and creative applications. Abstract: Images can be viewed as layered compositions, foreground objects over background, with potential occlusions. This layered representation enables independent editing of elements, offering greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightweight finetuning. To further preserve detail in the latent space, we introduce a novel multi-modal context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.

[83] Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning

Xiaoxing You,Qiang Huang,Lingyu Li,Chi Zhang,Xiaopeng Liu,Min Zhang,Jun Yu

Main category: cs.CV

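TL;DR: This paper proposes MERGE, a multimodal entity-aware retrieval-augmented generation framework for news image captioning that builds an entity-centric multimodal knowledge base and a dynamic retrieval mechanism, significantly outperforming existing methods on multiple datasets.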

Details Motivation: Existing news image captioning methods suffer from incomplete information coverage, weak cross-modal alignment, and suboptimal visual-entity grounding. Method: MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, improves cross-modal alignment with a multistage hypothesis-caption strategy, and strengthens visual-entity matching through image-guided dynamic retrieval. Result: On GoodNews and NYTimes800k, CIDEr improves by +6.84 and +1.16, and the named entity recognition F1-score improves by +4.14 and +2.64; on the unseen Visual News dataset, CIDEr improves by +20.17 and F1-score by +6.22. Conclusion: MERGE effectively addresses the key challenges in news image captioning and shows strong robustness and domain adaptability. Abstract: News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.

[84] MetaRank: Task-Aware Metric Selection for Model Transferability Estimation

Yuhang Liu,Wenjie Zhao,Yunhui Guo

Main category: cs.CV

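TL;DR: This paper proposes MetaRank, a meta-learning framework for automatic, task-aware selection of Model Transferability Estimation (MTE) metrics. It embeds textual descriptions of datasets and MTE metrics into a shared semantic space and trains a meta-predictor with a listwise objective, so that MTE metrics can be ranked efficiently for a new target dataset.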

Details Motivation: Existing approaches to choosing an MTE metric usually rely on average historical performance and lack task adaptivity, and no single metric is optimal across all tasks, so an automatic, task-aware method for MTE metric selection is needed. Method: MTE metric selection is formulated as a learning-to-rank problem; a pretrained language model encodes textual descriptions of datasets and MTE metrics into a shared semantic space; a meta-predictor is trained offline on diverse meta-tasks with a listwise loss that optimizes its ranking of the top-performing MTE metrics. Result: Experiments on 11 pretrained models and 11 target datasets show that MetaRank identifies the MTE metric best suited to a given task and clearly outperforms fixed or heuristic selection. Conclusion: MetaRank achieves task-adaptive MTE metric selection, improving the efficiency and accuracy of source-model assessment in transfer learning and providing reliable support for model selection in practice. Abstract: Selecting an appropriate pre-trained source model is a critical, yet computationally expensive, task in transfer learning. Model Transferability Estimation (MTE) methods address this by providing efficient proxy metrics to rank models without full fine-tuning. In practice, the choice of which MTE metric to use is often ad hoc or guided simply by a metric's average historical performance. However, we observe that the effectiveness of MTE metrics is highly task-dependent and no single metric is universally optimal across all target datasets. To address this gap, we introduce MetaRank, a meta-learning framework for automatic, task-aware MTE metric selection. We formulate metric selection as a learning-to-rank problem. Rather than relying on conventional meta-features, MetaRank encodes textual descriptions of both datasets and MTE metrics using a pretrained language model, embedding them into a shared semantic space. A meta-predictor is then trained offline on diverse meta-tasks to learn the intricate relationship between dataset characteristics and metric mechanisms, optimized with a listwise objective that prioritizes correctly ranking the top-performing metrics. During the subsequent online phase, MetaRank efficiently ranks the candidate MTE metrics for a new, unseen target dataset based on its textual description, enabling practitioners to select the most appropriate metric a priori. Extensive experiments across 11 pretrained models and 11 target datasets demonstrate the strong effectiveness of our approach.
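
To make the "listwise objective" idea concrete, here is a minimal sketch of one common listwise loss, the ListNet top-1 cross-entropy. The abstract does not say which listwise loss MetaRank uses, so this is a stand-in chosen for illustration; the function names and the toy scores are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D list of scores."""
    x = np.asarray(x, dtype=float)
    z = np.exp(x - x.max())
    return z / z.sum()

def listnet_top1_loss(pred_scores, true_scores):
    """ListNet-style loss: cross-entropy between the top-1 probability
    distributions induced by the true and the predicted scores.
    (Assumed stand-in; not necessarily the loss used by MetaRank.)"""
    p_true = softmax(true_scores)
    p_pred = softmax(pred_scores)
    return float(-(p_true * np.log(p_pred + 1e-12)).sum())

# Hypothetical data: observed quality of 4 candidate MTE metrics on one dataset.
true_quality = [0.9, 0.2, 0.5, 0.1]
good_pred = [2.0, 0.1, 1.0, 0.0]  # puts the truly best metric (index 0) first
bad_pred = [0.0, 2.0, 0.1, 1.0]   # puts a weak metric (index 1) first

loss_good = listnet_top1_loss(good_pred, true_quality)
loss_bad = listnet_top1_loss(bad_pred, true_quality)
```

Because the loss compares whole score distributions rather than individual pairs, a prediction that places the truly best metric first (`good_pred`) receives a lower loss than one that promotes a weak metric (`bad_pred`).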

[85] Structure-Aware Prototype Guided Trusted Multi-View Classification

Haojian Huang,Jiahao Shi,Zhe Liu,Harold Haodong Chen,Han Fang,Hao Sun,Zhongjiang He

Main category: cs.CV

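TL;DR: Proposes a new multi-view classification framework that uses prototypes to represent each view's neighbor structure, simplifying intra-view relation learning and enabling dynamic alignment, which improves cross-view consistency and classification trustworthiness.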

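Details Motivation: Existing trusted multi-view classification methods rely on globally dense neighbor relationships, which are computationally expensive and cannot readily guarantee consistency across inter-view relationships. They also fuse evidence with manually assigned weights, with no guarantee that the multi-view neighbor structures are consistent within the class space, which undermines the trustworthiness of the classification results. Method: Prototypes are introduced to represent each view's neighbor structure, simplifying the modeling of intra-view dependencies, and a dynamic alignment mechanism jointly optimizes intra- and inter-view structures to improve the consistency and efficiency of cross-view consensus learning. Result: Experiments on multiple public multi-view datasets show that the method outperforms or matches prevalent trusted multi-view classification methods in downstream performance and robustness. Conclusion: The proposed prototype-based multi-view classification framework improves the consistency of cross-view neighbor structures and learning efficiency, strengthens the trustworthiness of classification results, and has good practical potential. Abstract: Trustworthy multi-view classification (TMVC) addresses the challenge of achieving reliable decision-making in complex scenarios where multi-source information is heterogeneous, inconsistent, or even conflicting. Existing TMVC approaches predominantly rely on globally dense neighbor relationships to model intra-view dependencies, leading to high computational costs and an inability to directly ensure consistency across inter-view relationships. Furthermore, these methods typically aggregate evidence from different views through manually assigned weights, lacking guarantees that the learned multi-view neighbor structures are consistent within the class space, thus undermining the trustworthiness of classification outcomes. To overcome these limitations, we propose a novel TMVC framework that introduces prototypes to represent the neighbor structures of each view. By simplifying the learning of intra-view neighbor relations and enabling dynamic alignment of intra- and inter-view structure, our approach facilitates more efficient and consistent discovery of cross-view consensus. Extensive experiments on multiple public multi-view datasets demonstrate that our method achieves competitive downstream performance and robustness compared to prevalent TMVC methods.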

[86] CameraMaster: Unified Camera Semantic-Parameter Control for Photography Retouching

Qirui Yang,Yang Yang,Ying Zeng,Xiaobin Hu,Bo Li,Huanjing Yue,Jingyu Yang,Peng-Tao Jiang

Main category: cs.CV

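TL;DR: This paper proposes CameraMaster, a unified camera-aware framework that decouples the camera directive and fuses the photographer's intent with precise camera parameter embeddings to achieve physically consistent, parameter-controllable image retouching.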

Details Motivation: Existing text-guided diffusion models struggle to provide precise control over camera parameters (such as exposure, white balance, and zoom) in image editing, and they depend on ambiguous text prompts or separately trained heads, which limits scalability and multi-parameter composition. Method: The CameraMaster framework uses a camera parameter embedding to modulate both the camera directive and the content semantics; the modulated directive is injected via cross-attention, and the directive and parameter embeddings serve as conditioning and gating signals for the time embedding, giving unified layer-wise modulation throughout denoising. Result: A large-scale dataset of 78K image-prompt pairs is constructed; experiments show that CameraMaster responds monotonically and near-linearly to parameter changes, supports seamless multi-parameter composition, and significantly outperforms existing methods. Conclusion: CameraMaster achieves more precise, sensitive, and composable camera parameter control, advancing physically consistent text-guided image retouching. Abstract: Text-guided diffusion models have greatly advanced image editing and generation. However, achieving physically consistent image retouching with precise parameter control (e.g., exposure, white balance, zoom) remains challenging. Existing methods either rely solely on ambiguous and entangled text prompts, which hinders precise camera control, or train separate heads/weights for parameter adjustment, which compromises scalability, multi-parameter composition, and sensitivity to subtle variations. To address these limitations, we propose CameraMaster, a unified camera-aware framework for image retouching. The key idea is to explicitly decouple the camera directive and then coherently integrate two critical information streams: a directive representation that captures the photographer's intent, and a parameter embedding that encodes precise camera settings. CameraMaster first uses the camera parameter embedding to modulate both the camera directive and the content semantics. The modulated directive is then injected into the content features via cross-attention, yielding a strongly camera-sensitive semantic context. In addition, the directive and camera embeddings are injected as conditioning and gating signals into the time embedding, enabling unified, layer-wise modulation throughout the denoising process and enforcing tight semantic-parameter alignment. To train and evaluate CameraMaster, we construct a large-scale dataset of 78K image-prompt pairs annotated with camera parameters. Extensive experiments show that CameraMaster produces monotonic and near-linear responses to parameter variations, supports seamless multi-parameter composition, and significantly outperforms existing methods.

[87] CaptionQA: Is Your Caption as Useful as the Image Itself?

Shijia Yang,Yunong Liu,Bohan Zhai,Ximeng Sun,Zicheng Liu,Emad Barsoum,Manling Li,Chenfeng Xu

Main category: cs.CV

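TL;DR: This paper proposes CaptionQA, a utility-based benchmark for image captions that measures caption quality by how well generated captions support downstream tasks. It covers four domains with a large set of multiple-choice questions and reveals substantial shortfalls in the caption utility of current models.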

Details Motivation: Existing caption evaluation methods fail to answer a core question: can a caption effectively stand in for the image in real downstream tasks? A new evaluation approach is therefore needed to measure the practical utility of captions. Method: CaptionQA is an extensible, domain-dependent benchmark with fine-grained taxonomies across four domains and 33,027 multiple-choice questions that require visual information to answer; an LLM answers the questions from the captions alone, directly measuring how much image-level utility the captions preserve. Result: The evaluation finds that multimodal large models with similar scores on traditional image-QA benchmarks differ markedly in caption utility, with some models dropping by up to 32%. Conclusion: CaptionQA effectively exposes the utility gap of current image captions in downstream tasks, offers a new direction for improving caption generation, and supports extension to new domains. Abstract: Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains--Natural, Document, E-commerce, and Embodied AI--each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models nearly identical on traditional image-QA benchmarks drop by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.
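
The caption-only protocol described above can be sketched as a small scoring loop. This is an illustration under stated assumptions, not the released CaptionQA pipeline: the record layout, `caption_utility`, and the keyword-matching `toy_answer_fn` (a stand-in for the downstream LLM) are all hypothetical.

```python
from typing import Callable, Dict, List

def caption_utility(
    questions: List[dict],
    captions: Dict[str, str],
    answer_fn: Callable[[str, str, List[str]], str],
) -> float:
    """Fraction of multiple-choice questions answered correctly when the
    answerer sees only the caption, never the image."""
    if not questions:
        return 0.0
    correct = 0
    for q in questions:
        caption = captions[q["image_id"]]
        prediction = answer_fn(caption, q["question"], q["choices"])
        correct += int(prediction == q["answer"])
    return correct / len(questions)

def toy_answer_fn(caption: str, question: str, choices: List[str]) -> str:
    """Hypothetical stand-in for an LLM: pick the first choice that occurs
    as a word in the caption, otherwise abstain."""
    words = set(caption.lower().replace(".", "").split())
    for choice in choices:
        if choice.lower() in words:
            return choice
    return "cannot answer"

# Hypothetical toy data: the caption mentions the mug's color but no pens.
captions = {"img1": "A red mug on a wooden desk next to a laptop."}
questions = [
    {"image_id": "img1", "question": "What color is the mug?",
     "choices": ["blue", "red", "green"], "answer": "red"},
    {"image_id": "img1", "question": "How many pens are visible?",
     "choices": ["one", "two", "three"], "answer": "two"},
]
score = caption_utility(questions, captions, toy_answer_fn)
```

In this toy case the caption supports the first question but omits the detail needed for the second, so the score falls below 1.0. That is the kind of gap between image utility and caption utility the benchmark is designed to expose.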

[88] FlowerDance: MeanFlow for Efficient and Refined 3D Dance Generation

Kaixing Yang,Xulong Tang,Ziqiao Peng,Xiangyue Zhang,Puwei Wang,Jun He,Hongyan Liu

Main category: cs.CV

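TL;DR: FlowerDance is an efficient music-to-dance generation method that combines MeanFlow with physical consistency constraints to produce high-quality, physically plausible, and expressive dance in only a few sampling steps. A BiMamba backbone and channel-level cross-modal fusion improve inference speed and memory efficiency, and the method supports interactive motion editing.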

Details Motivation: Existing music-to-dance methods generate motion inefficiently, leaving too little computational headroom for the real-time, high-fidelity 3D rendering that applications require and thereby limiting expressiveness in practice. Method: FlowerDance combines MeanFlow with physical consistency constraints to reduce the number of sampling steps, and adopts a non-autoregressive architecture with a BiMamba backbone and channel-level cross-modal fusion to improve generation efficiency and motion quality. Result: Experiments on the AIST++ and FineDance datasets show that FlowerDance reaches state-of-the-art results in both motion quality and generation efficiency, with fast inference and low memory consumption. Conclusion: FlowerDance substantially improves generation efficiency while maintaining high-quality dance motion, making it suitable for applications that need real-time interaction, such as virtual reality and digital entertainment. Abstract: Music-to-dance generation aims to translate auditory signals into expressive human motion, with broad applications in virtual reality, choreography, and digital entertainment. Despite promising progress, the limited generation efficiency of existing methods leaves insufficient computational headroom for high-fidelity 3D rendering, thereby constraining the expressiveness of 3D characters during real-world applications. Thus, we propose FlowerDance, which not only generates refined motion with physical plausibility and artistic expressiveness, but also achieves significant generation efficiency on inference speed and memory utilization. Specifically, FlowerDance combines MeanFlow with Physical Consistency Constraints, which enables high-quality motion generation with only a few sampling steps. Moreover, FlowerDance leverages a simple but efficient model architecture with a BiMamba-based backbone and Channel-Level Cross-Modal Fusion, which generates dance in an efficient non-autoregressive manner. Meanwhile, FlowerDance supports motion editing, enabling users to interactively refine dance sequences. Extensive experiments on AIST++ and FineDance show that FlowerDance achieves state-of-the-art results in both motion quality and generation efficiency. Code will be released upon acceptance.

[89] LungNoduleAgent: A Collaborative Multi-Agent System for Precision Diagnosis of Lung Nodules

Cheng Yang,Hui Jin,Xinlei Yu,Zhipeng Wang,Yaoqun Liu,Fenglei Fan,Dajiang Lei,Gangyong Jia,Changmiao Wang,Ruiquan Ge

Main category: cs.CV

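TL;DR: Proposes LungNoduleAgent, a collaborative multi-agent system for analyzing lung CT scans that uses three modules to improve the accuracy of nodule description and malignancy grading, outperforming existing models on multiple datasets.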

Details Motivation: Existing vision-language models fall short in accurately describing lung nodule morphology and in incorporating medical expertise, which affects their reliability in clinical settings; the potential of multi-agent systems in pathology has not been fully explored. Method: The diagnostic process is decomposed into three modules: the Nodule Spotter detects nodules; the Radiologist uses localized image description techniques to generate detailed CT reports; and the Doctor Agent System performs malignancy reasoning from the images, the reports, and a pathology knowledge base, with collaboration organized in a multi-agent framework. Result: Experiments on two private datasets and the public LIDC-IDRI dataset show that LungNoduleAgent outperforms mainstream vision-language models, agent systems, and expert models on nodule description and malignancy grading. Conclusion: Region-level semantic alignment and multi-agent collaboration are crucial for lung nodule diagnosis, and LungNoduleAgent is a promising foundational tool for supporting clinical lung nodule analysis. Abstract: Diagnosing lung cancer typically involves physicians identifying lung nodules in computed tomography (CT) scans and generating diagnostic reports based on their morphological features and medical expertise. Although advancements have been made in using multimodal large language models for analyzing lung CT scans, challenges remain in accurately describing nodule morphology and incorporating medical expertise. These limitations affect the reliability and effectiveness of these models in clinical settings. Collaborative multi-agent systems offer a promising strategy for achieving a balance between generality and precision in medical applications, yet their potential in pathology has not been thoroughly explored. To bridge these gaps, we introduce LungNoduleAgent, an innovative collaborative multi-agent system specifically designed for analyzing lung CT scans. LungNoduleAgent streamlines the diagnostic process into sequential components, improving precision in describing nodules and grading malignancy through three primary modules. The first module, the Nodule Spotter, coordinates clinical detection models to accurately identify nodules. The second module, the Radiologist, integrates localized image description techniques to produce comprehensive CT reports. Finally, the Doctor Agent System performs malignancy reasoning by using images and CT reports, supported by a pathology knowledge base and a multi-agent system framework. Extensive testing on two private datasets and the public LIDC-IDRI dataset indicates that LungNoduleAgent surpasses mainstream vision-language models, agent systems, and advanced expert models. These results highlight the importance of region-level semantic alignment and multi-agent collaboration in diagnosing nodules. LungNoduleAgent stands out as a promising foundational tool for supporting clinical analyses of lung nodules.

[90] PG-ControlNet: A Physics-Guided ControlNet for Generative Spatially Varying Image Deblurring

Hakki Motorcu,Mujdat Cetin

Main category: cs.CV

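TL;DR: This paper proposes a new image deblurring framework that combines a powerful generative prior with explicit, dense physical constraints, resolving the tension between physical accuracy and perceptual realism in restoring spatially varying blur.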

Details Motivation: Existing learning-based deblurring methods trade off physical accuracy against visual quality: model-based methods impose strong physical constraints but produce over-smoothed textures, while generative models look better but tend to hallucinate details. This paper aims to unify the two paradigms. Method: The degradation field is modeled as a dense continuum of high-dimensional compressed kernels to capture subtle variations in motion and degradation patterns, and this descriptor field conditions a ControlNet architecture that guides the diffusion sampling process. Result: Experiments show that the method outperforms state-of-the-art model-based and generative baselines in complex, severely blurred scenes, effectively balancing physical accuracy and perceptual realism. Conclusion: The proposed method combines the strengths of model-based and generative approaches, producing high-quality textures while respecting physical constraints, and offers a new solution for complex spatially varying deblurring. Abstract: Spatially varying image deblurring remains a fundamentally ill-posed problem, especially when degradations arise from complex mixtures of motion and other forms of blur under significant noise. State-of-the-art learning-based approaches generally fall into two paradigms: model-based deep unrolling methods that enforce physical constraints by modeling the degradations, but often produce over-smoothed, artifact-laden textures, and generative models that achieve superior perceptual quality yet hallucinate details due to weak physical constraints. In this paper, we propose a novel framework that uniquely reconciles these paradigms by taming a powerful generative prior with explicit, dense physical constraints. Rather than oversimplifying the degradation field, we model it as a dense continuum of high-dimensional compressed kernels, ensuring that minute variations in motion and other degradation patterns are captured. We leverage this rich descriptor field to condition a ControlNet architecture, strongly guiding the diffusion sampling process. Extensive experiments demonstrate that our method effectively bridges the gap between physical accuracy and perceptual realism, outperforming state-of-the-art model-based methods as well as generative baselines in challenging, severely blurred scenarios.

[91] MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization

Yingjie Xia,Xi Wang,Jinglei Shi,Vicky Kalogeiton,Jian Yang

Main category: cs.CV

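TL;DR: MUSE is the first unified framework for image emotional synthesis, handling both emotional generation and editing without further training of the diffusion model or specialized datasets. It uses gradient-based optimization of emotional tokens, semantic similarity to choose when to apply guidance, and a multi-emotion loss to reduce interference, achieving a better balance among emotional accuracy, semantic diversity, and text consistency.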

Details Motivation: Existing image emotional synthesis methods treat generation and editing as separate tasks, which is inefficient and limits applications such as therapy and storytelling, so a unified and efficient framework is needed. Method: MUSE adopts a strategy akin to Test-Time Scaling (TTS), using an off-the-shelf emotion classifier for gradient-based optimization of emotional tokens; it determines the best timing for emotional guidance via semantic similarity; and it uses a multi-emotion loss to reduce interference from inherent and similar emotions. Result: Experiments show that MUSE outperforms existing methods on both generation and editing, improving emotional accuracy and semantic diversity while keeping a good balance among content fidelity, text alignment, and realistic emotional expression. Conclusion: MUSE establishes a new paradigm for image emotional synthesis as the first unified generation-and-editing framework that requires no additional training, with broad application potential. Abstract: Images evoke emotions that profoundly influence perception, often prioritized over content. Current Image Emotional Synthesis (IES) approaches artificially separate generation and editing tasks, creating inefficiencies and limiting applications where these tasks naturally intertwine, such as therapeutic interventions or storytelling. In this work, we introduce MUSE, the first unified framework capable of both emotional generation and editing. By adopting a strategy conceptually aligned with Test-Time Scaling (TTS), which is widely used in both the LLM and diffusion model communities, it avoids the need for additional updates to the diffusion model and for specialized emotional synthesis datasets. More specifically, MUSE addresses three key questions in emotional synthesis: (1) HOW to stably guide synthesis by leveraging an off-the-shelf emotion classifier with gradient-based optimization of emotional tokens; (2) WHEN to introduce emotional guidance by identifying the optimal timing using semantic similarity as a supervisory signal; and (3) WHICH emotion to guide synthesis through a multi-emotion loss that reduces interference from inherent and similar emotions. Experimental results show that MUSE performs favorably against all methods for both generation and editing, improving emotional accuracy and semantic diversity while maintaining an optimal balance between desired content, adherence to text prompts, and realistic emotional expression. It establishes a new paradigm for emotion synthesis.

[92] Long-Term Alzheimers Disease Prediction: A Novel Image Generation Method Using Temporal Parameter Estimation with Normal Inverse Gamma Distribution on Uneven Time Series

Xin Hong,Xinze Sun,Yinhao Li,Yen-Wei Chen

Main category: cs.CV

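TL;DR: Proposes T-NIG, a Normal Inverse Gamma distribution model with a temporal parameter, for generating brain images under irregular time intervals and predicting Alzheimer's Disease (AD) over the long term, reducing uncertainty while preserving disease-related characteristics.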

Details Motivation: When handling sequential data with irregular time intervals, existing methods struggle to preserve disease-related characteristics in long-term image generation, which hurts the accuracy of AD prediction. Method: T-NIG takes brain images from two time points, introduces a time parameter into the normal inverse gamma distribution, and combines coordinate-neighborhood features with uncertainty estimation (reducing epistemic and aleatoric uncertainty) to generate intermediate and future images. Result: T-NIG achieves state-of-the-art performance on both short-term and long-term prediction tasks, forecasting disease progression effectively while preserving disease-related characteristics under irregular temporal distributions. Conclusion: T-NIG copes with the challenge of irregular time intervals and improves the accuracy and robustness of long-term AD prediction, with potential for clinical use. Abstract: Image generation can provide physicians with an imaging diagnosis basis in the prediction of Alzheimer's Disease (AD). Recent research has shown that long-term AD predictions by image generation often face difficulties maintaining disease-related characteristics when dealing with irregular time intervals in sequential data. Considering that the time-related aspects of the distribution can reflect changes in disease-related characteristics when images are distributed unevenly, this research proposes a model to estimate the temporal parameter within the Normal Inverse Gamma Distribution (T-NIG) to assist in generating images over the long term. The T-NIG model employs brain images from two different time points to create intermediate brain images, forecast future images, and predict the disease. T-NIG is designed by identifying features using coordinate neighborhoods. It incorporates a time parameter into the normal inverse gamma distribution to understand how features change in brain imaging sequences that have varying time intervals. Additionally, T-NIG utilizes uncertainty estimation to reduce both epistemic and aleatoric uncertainties in the model, which arise from insufficient temporal data. In particular, the T-NIG model demonstrates state-of-the-art performance in both short-term and long-term prediction tasks within the dataset. Experimental results indicate that T-NIG is proficient in forecasting disease progression while maintaining disease-related characteristics, even when faced with an irregular temporal data distribution.
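
For readers unfamiliar with how a Normal Inverse Gamma (NIG) distribution separates the two kinds of uncertainty named above, the following minimal sketch computes the standard NIG moments used in evidential regression. It assumes the usual parameterization (gamma, nu, alpha, beta) and does not model the temporal parameter that T-NIG adds; the function name and example values are hypothetical.

```python
def nig_uncertainty(gamma: float, nu: float, alpha: float, beta: float) -> dict:
    """Standard moments of a Normal Inverse Gamma prior over (mu, sigma^2):
    mu | sigma^2 ~ Normal(gamma, sigma^2 / nu), sigma^2 ~ InvGamma(alpha, beta).

    - prediction: E[mu]      = gamma
    - aleatoric:  E[sigma^2] = beta / (alpha - 1)         (noise in the data)
    - epistemic:  Var[mu]    = beta / (nu * (alpha - 1))  (uncertainty in the model)
    Both variance terms are finite only for alpha > 1.
    """
    if alpha <= 1.0 or nu <= 0.0 or beta <= 0.0:
        raise ValueError("requires alpha > 1, nu > 0, beta > 0")
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return {"prediction": gamma, "aleatoric": aleatoric, "epistemic": epistemic}

# Hypothetical parameter values for a single predicted voxel intensity.
example = nig_uncertainty(gamma=0.0, nu=2.0, alpha=3.0, beta=4.0)
more_evidence = nig_uncertainty(gamma=0.0, nu=8.0, alpha=3.0, beta=4.0)
```

Raising `nu` (more virtual observations of the mean) shrinks the epistemic term while leaving the aleatoric term unchanged, which is the behavior one would expect when more temporal data becomes available.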

[93] MIRA: Multimodal Iterative Reasoning Agent for Image Editing

Ziyun Zeng,Hang Hua,Jiebo Luo

Main category: cs.CV

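TL;DR: This paper proposes MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play approach to image editing that resolves complex natural language instructions step by step through an iterative perception-reasoning-action loop and uses visual feedback to improve editing accuracy.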

Details Motivation: Diffusion models handle complex natural language instructions poorly (for example compositional relationships, contextual cues, or referring expressions), often producing semantic drift or failed edits. An image editing framework that interprets complex instructions more accurately is therefore needed. Method: MIRA uses an iterative perception-reasoning-action mechanism that generates atomic edit instructions step by step and uses visual feedback to make decisions; the authors build MIRA-Editing, a 150K-sample multimodal tool-use dataset, and train with a two-stage SFT + GRPO strategy. Result: MIRA significantly improves semantic consistency and perceptual quality on several open-source image editing models (such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit), matching or exceeding proprietary systems such as GPT-Image and Nano-Banana. Conclusion: By simulating multi-turn human interaction, MIRA improves the precision and robustness of instruction-driven image editing, with good generality and application potential. Abstract: Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.
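
The perception-reasoning-action loop described above can be outlined as plain control flow. This is a schematic sketch, not MIRA's implementation: `iterative_edit`, `propose_step`, and `apply_edit` are hypothetical names, and the toy stubs stand in for the reasoning agent and the image editing model.

```python
from typing import Any, Callable, List, Optional, Tuple

def iterative_edit(
    image: Any,
    instruction: str,
    propose_step: Callable[[Any, str, List[str]], Optional[str]],
    apply_edit: Callable[[Any, str], Any],
    max_steps: int = 5,
) -> Tuple[Any, List[str]]:
    """Repeat: look at the current image, decide the next atomic edit,
    apply it. Stop when the agent proposes nothing or the budget is spent."""
    history: List[str] = []
    for _ in range(max_steps):
        step = propose_step(image, instruction, history)  # perceive + reason
        if step is None:  # agent judges the instruction fulfilled
            break
        image = apply_edit(image, step)  # act with the editing model
        history.append(step)
    return image, history

# Toy stubs: the "image" is a list of applied edits, and the "agent" simply
# works through the semicolon-separated parts of the instruction in order.
def toy_propose(image: list, instruction: str, history: List[str]) -> Optional[str]:
    pending = [s.strip() for s in instruction.split(";") if s.strip() not in history]
    return pending[0] if pending else None

def toy_apply(image: list, step: str) -> list:
    return image + [step]

final_image, steps = iterative_edit([], "add hat; add scarf", toy_propose, toy_apply)
```

The point of the structure is that each proposal is conditioned on the current image and the edit history, in contrast to issuing a single prompt or a static plan up front.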

[94] AnchorOPT: Towards Optimizing Dynamic Anchors for Adaptive Prompt Learning

Zheng Li,Yibing Song,Xin Zhang,Lei Luo,Xiang Li,Jian Yang

Main category: cs.CV

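TL;DR: Proposes AnchorOPT, a dynamic anchor-based prompt learning framework that learns anchor values from task data and adaptively optimizes the positional relationship between anchors and soft tokens, improving the generalization of CLIP models.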

Details Motivation: Existing CLIP-based prompt learning methods use static textual tokens as anchors, which lack flexibility across tasks and training stages and limit generalization. Method: AnchorOPT introduces dynamism in two dimensions: anchor values are learned from task-specific data instead of being handcrafted, and the positional relationship between anchors and soft tokens is adaptively optimized through a learnable position matrix conditioned on the training stage and task context. Training has two stages: the anchors are learned and frozen first, then the soft tokens and position matrix are optimized. Result: Experiments show that with only a simple learnable anchor and position matrix, AnchorOPT matches or exceeds some methods that add extra modules or regularization, and as a plug-and-play module it improves performance across multiple datasets. Conclusion: Through dynamic anchors and position optimization, AnchorOPT makes prompt learning more flexible and adaptive and improves CLIP's performance on downstream tasks. Abstract: Existing prompt learning methods, which are built upon CLIP models, leverage textual tokens as anchors to guide the learnable soft tokens. This guidance improves CLIP generalization. However, these anchors, static in both value and position, lack cross-task and stage-adaptive flexibility. To address this limitation, we propose AnchorOPT, a dynamic anchor-based prompt learning framework. Specifically, AnchorOPT introduces dynamism in two key dimensions: (i) anchor values eschew handcrafted explicit textual tokens (e.g., "shape", "color"), instead learning dynamically from task-specific data; and (ii) the positional relationship between anchor and soft tokens is no longer fixed but adaptively optimized via a learnable position matrix conditioned on the training stage and task context. Training occurs in two stages: we first learn the anchor tokens, then freeze and transfer them to the second stage for optimization of soft tokens and the position matrix. Extensive experiments demonstrate that using only a simple learnable anchor and position matrix achieves performance comparable to or exceeding some methods incorporating additional learnable modules or regularization techniques. As a plug-and-play module, AnchorOPT integrates seamlessly into existing frameworks, yielding consistent performance gains across diverse datasets. Code is publicly available at https://github.com/zhengli97/ATPrompt.

[95] CLRecogEye: Curriculum Learning towards exploiting convolution features for Dynamic Iris Recognition

Geetanjali Sharma,Gaurav Jaswal,Aditya Nigam,Raghavendra Ramachandra

Main category: cs.CV

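TL;DR: Proposes a new iris authentication method based on a 3D-CNN and curriculum learning that models spatio-temporal features to improve robustness to rotation, scale, specular reflections, and blur.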

Details Motivation: Most existing iris recognition methods rely on point-to-point comparison, which makes poor use of the spatio-temporal structure of iris patterns, and they are not robust enough to rotation, scale changes, specular reflections, and blur. Method: Each iris image is split along one dimension into a sequence of sub-images that is fed to a 3D-CNN to capture spatial and spatio-temporal features; the model is trained with a curriculum learning strategy and optimized end-to-end with triplet and ArcFace losses. Result: The proposed method shows stronger discriminability and robustness under a range of disturbances, significantly improving iris recognition performance. Conclusion: By learning rich spatio-temporal feature representations, the framework delivers a more robust and generalizable iris authentication solution. Abstract: Iris authentication algorithms have achieved impressive recognition performance, making them highly promising for real-world applications such as border control, citizen identification, and both criminal investigations and commercial systems. However, their robustness is still challenged by variations in rotation, scale, specular reflections, and defocus blur. In addition, most existing approaches rely on straightforward point-to-point comparisons, typically using cosine or L2 distance, without effectively leveraging the spatio-spatial-temporal structure of iris patterns. To address these limitations, we propose a novel and generalized matching pipeline that learns rich spatio-spatial-temporal representations of iris features. Our approach first splits each iris image along one dimension, generating a sequence of sub-images that serve as input to a 3D-CNN, enabling the network to capture both spatial and spatio-spatial-temporal cues. To further enhance the modeling of spatio-spatial-temporal feature dynamics, we train the model in a curriculum manner. This design allows the network to embed temporal dependencies directly into the feature space, improving discriminability in the deep metric domain. The framework is trained end-to-end with triplet and ArcFace loss in a curriculum manner, enforcing highly discriminative embeddings despite challenges like rotation, scale, reflections, and blur. This design yields a robust and generalizable solution for iris authentication. GitHub code: https://github.com/GeetanjaliGTZ/CLRecogEye
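
Two ingredients mentioned above, splitting an image into a sub-image sequence and the triplet loss, are simple enough to sketch in NumPy. This is an illustration under assumptions, not the authors' code: the 64x512 image size, the strip count, the margin, and the function names are all hypothetical, and the 3D-CNN, ArcFace loss, and curriculum schedule are omitted.

```python
import numpy as np

def split_into_sequence(image: np.ndarray, num_strips: int, axis: int = 1) -> np.ndarray:
    """Cut a 2D image into equal strips along one axis and stack them into
    a (T, H, W) sequence, the kind of input a 3D-CNN can consume."""
    if image.shape[axis] % num_strips != 0:
        raise ValueError("image size must be divisible by num_strips")
    return np.stack(np.split(image, num_strips, axis=axis), axis=0)

def triplet_loss(anchor, positive, negative, margin: float = 0.2) -> float:
    """Hinge-style triplet loss on L2 distances: pull the same-identity
    embedding closer than the different-identity one by at least `margin`."""
    d_pos = np.linalg.norm(np.asarray(anchor) - np.asarray(positive))
    d_neg = np.linalg.norm(np.asarray(anchor) - np.asarray(negative))
    return float(max(d_pos - d_neg + margin, 0.0))

# Hypothetical 64x512 unwrapped iris image, cut into 8 strips of 64x64.
image = np.arange(64 * 512).reshape(64, 512)
sequence = split_into_sequence(image, num_strips=8)

# Toy 2D embeddings: one well-separated triplet and one violated triplet.
easy = triplet_loss([0.0, 0.0], [0.0, 0.1], [1.0, 0.0])
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.1])
```

The well-separated triplet contributes no loss, while the violated one is penalized by the distance gap plus the margin. A curriculum would typically order such examples from easy to hard, though the abstract does not specify the schedule.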

[96] Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis

Jiyun Bae,Hyunjong Ok,Sangwoo Mo,Jaeho Lee

Main category: cs.CV

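TL;DR: This paper studies how visual distractors affect test-time scaling in vision-language models, finds that, unlike textual distractors, visual distractors reduce accuracy without increasing reasoning length, and proposes a simple prompting strategy to mitigate bias-driven predictions.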

Details Motivation: To examine whether visual distractors in multimodal settings cause an inverse scaling effect like the one seen in text models, and to understand how visual and textual distractors differ in their effect on reasoning and accuracy. Method: The authors build Idis, a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions, and study the relationship among distractors, reasoning length, and accuracy by analyzing attribute counts in reasoning traces. Result: Visual distractors cause inverse scaling without increasing reasoning length; distractors reduce accuracy, and attribute counts reveal key patterns in the reasoning process; the phenomenon is confirmed on benchmarks such as Waterbirds. Conclusion: Visual distractors affect VLMs through a different mechanism than textual distractors, showing up only as reduced accuracy, and bias-driven predictions can be mitigated with a simple prompting strategy. Abstract: How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.

[97] Pygmalion Effect in Vision: Image-to-Clay Translation for Reflective Geometry Reconstruction

Gayoung Lee,Junho Kim,Jin-Hwa Kim,Junmo Kim

Main category: cs.CV

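TL;DR: This paper proposes a new framework, inspired by the myth of Pygmalion, that translates reflective images into clay-like forms to suppress specular reflections, enabling robust 3D reconstruction of objects with complex reflections.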

Details Motivation: View-dependent reflections entangle appearance and geometry, so understanding reflection has been a long-standing challenge in 3D reconstruction. Existing methods struggle to keep geometry consistent in the presence of complex reflections. Method: A dual-branch network is proposed, with a BRDF-based reflective branch and a clay-guided branch; synthesized reflection-free clay-like images serve as the supervision signal, and the two branches are trained jointly to stabilize geometry and refine surface normals. Result: Normal accuracy and mesh completeness improve substantially on both synthetic and real datasets, surpassing existing reflection-handling methods. Conclusion: Learning geometry by "unshining" is an effective inductive bias, and "seeing by unshining" can serve as a new paradigm for 3D reconstruction of reflective objects. Abstract: Understanding reflection remains a long-standing challenge in 3D reconstruction due to the entanglement of appearance and geometry under view-dependent reflections. In this work, we present the Pygmalion Effect in Vision, a novel framework that metaphorically "sculpts" reflective objects into clay-like forms through image-to-clay translation. Inspired by the myth of Pygmalion, our method learns to suppress specular cues while preserving intrinsic geometric consistency, enabling robust reconstruction from multi-view images containing complex reflections. Specifically, we introduce a dual-branch network in which a BRDF-based reflective branch is complemented by a clay-guided branch that stabilizes geometry and refines surface normals. The two branches are trained jointly using the synthesized clay-like images, which provide a neutral, reflection-free supervision signal that complements the reflective views. Experiments on both synthetic and real datasets demonstrate substantial improvement in normal accuracy and mesh completeness over existing reflection-handling methods. Beyond technical gains, our framework reveals that seeing by unshining, translating radiance into neutrality, can serve as a powerful inductive bias for reflective object geometry learning.

[98] G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

Wenbo Hu,Jingli Lin,Yilin Long,Yunlong Ran,Lihan Jiang,Yifan Wang,Chenming Zhu,Runsen Xu,Tai Wang,Jiangmiao Pang

Main category: cs.CV

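TL;DR: G$^2$VLM is a geometry-grounded vision-language model that improves spatial understanding and reasoning by incorporating 3D visual geometry features, performing strongly on both 3D reconstruction and spatial tasks.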

Details Motivation: Existing vision-language models fall short in spatial intelligence and lack a geometry learning process that reconstructs 3D space from 2D images. Method: Proposes G$^2$VLM, trained on multi-view image and video data, which incorporates 3D visual priors and unifies 3D reconstruction and spatial understanding through in-context learning and interleaved reasoning. Result: Achieves results comparable to state-of-the-art models on 3D reconstruction, and better or competitive results on spatial understanding and reasoning tasks. Conclusion: G$^2$VLM unifies a semantically strong vision-language model with low-level 3D vision tasks, can serve as a strong baseline for spatial intelligence research, and may enable applications such as 3D scene editing. Abstract: Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.

[99] Scaling Foundation Models for Radar Scene Understanding

Pushkal Mishra,Kshitiz Bansal,Dinesh Bharadia

Main category: cs.CV

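TL;DR: This paper proposes RadarFM, a radar foundation model based on structured spatial language supervision, aimed at learning unified scene-level representations across tasks.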

Details Motivation: Existing radar methods are fragmented and task-specific and lack cross-task transfer, while the integration of radar with foundation models remains underexplored. Method: Proposes a structured caption framework and a hash-aware contrastive learning objective, generates large-scale data with the CARLA simulator, and designs localization-aware evaluation metrics. Result: Enables fine-grained spatial reasoning and improves the transferability and performance of radar perception across tasks. Conclusion: RadarFM provides a unified foundation-model framework for radar perception, advancing its potential for broad use in complex driving scenarios. Abstract: Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions. Recent advances in foundation models have transformed visual and language understanding, yet their integration with radar sensing remains largely underexplored. Existing radar approaches are fragmented and task-specific; each downstream task employs distinct architectures and training objectives, preventing transfer across tasks. In this work, we introduce RadarFM: a radar foundation model that learns unified scene-level representations through structured spatial language supervision. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in native radar coordinates, and (2) a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, enabling fine-grained spatial reasoning. Leveraging the CARLA simulator, we generate large-scale, well-annotated radar datasets across diverse driving scenarios. We also propose localization-aware metrics that assess spatial accuracy beyond traditional detection measures.

[100] EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens

Ze Feng,Sen Yang,Boqiang Duan,Wankou Yang,Jingdong Wang

Main category: cs.CV

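TL;DR: This paper proposes EM-KD, a new knowledge distillation paradigm for enhancing efficient multimodal large language models (MLLMs). It aligns the vision logits of the teacher and student models using the Manhattan distance and the Hungarian matching algorithm, and introduces two distillation strategies (VLAD and VSD) to improve vision-language understanding, significantly outperforming existing methods on multiple benchmarks.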

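Details Motivation: Existing efficient MLLMs lose information when compressing vision tokens, which degrades comprehension; conventional knowledge distillation methods overlook the differences in fine-grained visual understanding caused by unbalanced vision tokens between teacher and student, so a more effective distillation mechanism is needed. Method: First computes the Manhattan distance between the vision logits of the teacher and student models and aligns them in the spatial dimension with the Hungarian algorithm; then introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD), which aligns the affinity matrices between text and vision tokens by minimizing a smooth L1 loss; 2) Vision Semantic Distillation (VSD), which uses the reverse KL divergence to measure the difference between the probability distributions of the aligned vision logits over the vocabulary space. Result: Across multiple benchmarks, models trained with EM-KD significantly outperform prior efficient MLLMs in both accuracy and efficiency; EM-KD also outperforms existing distillation methods even when they are equipped with the proposed vision token alignment strategy. Conclusion: Through an effective vision token alignment mechanism and a dual distillation strategy, EM-KD mitigates the information loss caused by vision token compression and significantly improves the multimodal understanding of efficient MLLMs, offering a new approach to training lightweight models. Abstract: Efficient Multimodal Large Language Models (MLLMs) compress vision tokens to reduce resource consumption, but the loss of visual information can degrade comprehension capabilities. Although some prior works introduce Knowledge Distillation to enhance student models, they overlook the fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between the efficient student and vanilla teacher. In this paper, we propose EM-KD, a novel paradigm that enhances the Efficient MLLMs with Knowledge Distillation. To overcome the challenge of unbalanced vision tokens, we first calculate the Manhattan distance between the vision logits of teacher and student, and then align them in the spatial dimension with the Hungarian matching algorithm. After alignment, EM-KD introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD) and 2) Vision Semantic Distillation (VSD). Specifically, VLAD calculates the affinity matrix between text tokens and aligned vision tokens, and minimizes the smooth L1 distance of the student and the teacher affinity matrices. Considering the semantic richness of vision logits in the final layer, VSD employs the reverse KL divergence to measure the discrete probability distributions of the aligned vision logits over the vocabulary space.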
Comprehensive evaluation on diverse benchmarks demonstrates that EM-KD trained model outperforms prior Efficient MLLMs on both accuracy and efficiency with a large margin, validating its effectiveness. Compared with previous distillation methods, which are equipped with our proposed vision token matching strategy for fair comparison, EM-KD also achieves better performance.
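
The matching-then-distillation steps described in the EM-KD abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the exhaustive search over assignments stands in for a Hungarian solver (it finds the same minimum-cost assignment but only scales to toy sizes), and "reverse KL" is taken to mean KL(student || teacher).

```python
import itertools
import math

def manhattan(a, b):
    """L1 (Manhattan) distance between two logit vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_tokens(student, teacher):
    """For each student token, return the index of its matched teacher token.

    Minimizes the total Manhattan cost with each teacher token used at most
    once. Exhaustive search is an assumption made for brevity; a Hungarian
    solver gives the same assignment in polynomial time.
    """
    best_cost, best = math.inf, None
    for perm in itertools.permutations(range(len(teacher)), len(student)):
        cost = sum(manhattan(s, teacher[j]) for s, j in zip(student, perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return list(best)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the vocabulary dimension (assumed direction)."""
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    return sum(ps * math.log(ps / pt) for ps, pt in zip(p_s, p_t))

def smooth_l1(x, y, beta=1.0):
    """Smooth L1 (Huber-style) distance between two scalars."""
    d = abs(x - y)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

# Toy example: 4 teacher tokens, 2 compressed student tokens, vocabulary of 3.
teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
student = [[0.1, 1.9, 0.0], [1.8, 0.0, 0.2]]
assignment = match_tokens(student, teacher)  # -> [1, 0]
vsd_loss = sum(reverse_kl(s, teacher[j]) for s, j in zip(student, assignment))
```

The unbalanced case (fewer student tokens than teacher tokens) is why the assignment is one-to-one on the student side only: the unmatched teacher tokens simply do not contribute to the loss in this sketch.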

[101] FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain

YuAn Wang,Xiaofan Li,Chi Huang,Wenhao Zhang,Hao Li,Bosheng Wang,Xun Sun,Jun Wang

Main category: cs.CV

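TL;DR: This paper proposes FaithFusion, a framework that fuses 3DGS with diffusion models and uses pixel-wise Expected Information Gain (EIG) to balance geometric fidelity and visual realism in controllable driving-scene reconstruction. It needs no extra prior conditions and maintains strong performance under large viewpoint shifts.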

Details Motivation: In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while generating visually plausible appearance under large viewpoint changes is challenging. Lacking pixel-wise, 3D-consistent editing criteria, existing methods are prone to over-restoration and geometric drift. Method: Proposes the FaithFusion framework, which uses pixel-wise Expected Information Gain (EIG) as a unified policy: EIG guides the diffusion model as a spatial prior to refine high-uncertainty regions, and pixel-level weighting distills the edits back into 3D Gaussian Splatting (3DGS), yielding coherent spatio-temporal synthesis. The framework is plug-and-play and needs no extra priors or structural modifications. Result: Experiments on the Waymo dataset show state-of-the-art results on NTA-IoU, NTL-IoU, and FID, with an FID of 107.47 maintained even at a 6-meter lane shift. Conclusion: FaithFusion uses EIG to fuse 3DGS and diffusion models effectively, significantly improving the geometric consistency and visual quality of driving-scene reconstruction without extra conditions, and is practical and extensible. Abstract: In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce FaithFusion, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications. Extensive experiments on the Waymo dataset demonstrate that our approach attains SOTA performance across NTA-IoU, NTL-IoU, and FID, maintaining an FID of 107.47 even at a 6-meter lane shift. Our code is available at https://github.com/wangyuanbiubiubiu/FaithFusion.

[102] Deformation-aware Temporal Generation for Early Prediction of Alzheimers Disease

Xin Honga,Jie Lin,Minghui Wang

Main category: cs.CV

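TL;DR: This paper proposes the Deformation-Aware Temporal Generative Network (DATGN), which automatically learns the brain morphological changes of Alzheimer's disease (AD) by generating future MRI images, enabling early prediction of the disease. The method handles the missing data common in temporal sequences, generates images consistent with disease progression, and significantly improves accuracy on classification tasks.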

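Details Motivation: Early prediction of Alzheimer's disease is crucial for slowing its progression. Existing methods rely on manually extracted morphological features from brain images and cope poorly with missing data in temporal MRI sequences, so a robust temporal modeling method that automatically learns the brain atrophy patterns of disease progression is needed. Method: Proposes the DATGN model, which first interpolates incomplete temporal MRI sequences, then uses a bidirectional temporal deformation-aware module to guide the network in generating future MRI images that follow AD progression; the generated images can be used to train classifiers and improve prediction. Image generation quality and downstream classification gains are validated on the ADNI dataset. Result: DATGN achieves competitive PSNR and MMSE scores on MRI image generation; when the generated synthetic data is used with SVM, CNN, and 3DCNN classifiers, AD vs. NC classification accuracy improves by 6.21% to 16%, and AD vs. MCI vs. NC multi-class accuracy improves by 7.34% to 21.25%; visualizations show the generated images follow AD-related brain atrophy trends. Conclusion: DATGN effectively models the dynamic brain morphological changes of Alzheimer's disease and generates high-quality future MRI images consistent with pathological progression, helping to alleviate missing data in medical time series and, through data augmentation, significantly improving classification for early diagnosis, with potential for clinical use. Abstract: Alzheimer's disease (AD), a degenerative brain condition, can benefit from early prediction to slow its progression. As the disease progresses, patients typically undergo brain atrophy. Current prediction methods for Alzheimer's disease largely involve analyzing morphological changes in brain images through manual feature extraction. This paper proposes a novel method, the Deformation-Aware Temporal Generative Network (DATGN), to automate the learning of morphological changes in brain images about disease progression for early prediction. Given the common occurrence of missing data in the temporal sequences of MRI images, DATGN initially interpolates incomplete sequences. Subsequently, a bidirectional temporal deformation-aware module guides the network in generating future MRI images that adhere to the disease's progression, facilitating early prediction of Alzheimer's disease. DATGN was tested for the generation of temporal sequences of future MRI images using the ADNI dataset, and the experimental results are competitive in terms of PSNR and MMSE image quality metrics. Furthermore, when DATGN-generated synthetic data was integrated into the SVM-, CNN-, and 3DCNN-based classification methods, significant improvements were achieved, ranging from 6.21% to 16% in AD vs. NC classification accuracy and from 7.34% to 21.25% in AD vs. MCI vs. NC classification accuracy.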
The qualitative visualization results indicate that DATGN produces MRI images consistent with the brain atrophy trend in Alzheimer's disease, enabling early disease prediction.

[103] Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models

Changlin Li,Jiawei Zhang,Zeyi Shi,Zongxin Yang,Zhihui Li,Xiaojun Chang

Main category: cs.CV

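TL;DR: This paper proposes EntPruner, an entropy-guided automatic progressive pruning framework for diffusion and flow generative models. It assesses block importance with Conditional Entropy Deviation (CED) and combines this with a zero-shot adaptive pruning strategy, achieving up to 2.22x inference speedup while maintaining generation quality.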

Details Motivation: Large-scale visual generative models show significant parameter redundancy when transferred to downstream tasks, and because generative models must preserve the diversity and condition fidelity of the output distribution, conventional pruning methods are hard to apply. Method: Proposes an entropy-guided pruning strategy that measures block importance with the data-dependent Conditional Entropy Deviation (CED), and designs a zero-shot adaptive progressive pruning framework that dynamically decides when and how much to prune during training, avoiding mode collapse. Result: Experiments on DiT and SiT models show that EntPruner achieves up to 2.22x inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets. Conclusion: EntPruner effectively balances the trade-off between compression and generation performance, offering a practical route to efficient deployment of diffusion and flow models. Abstract: Large-scale vision generative models, including diffusion and flow models, have demonstrated remarkable performance in visual generation tasks. However, transferring these pre-trained models to downstream tasks often results in significant parameter redundancy. In this paper, we propose EntPruner, an entropy-guided automatic progressive pruning framework for diffusion and flow models. First, we introduce entropy-guided pruning, a block-level importance assessment strategy specifically designed for generative models. Unlike discriminative models, generative models require preserving the diversity and condition-fidelity of the output distribution. As the importance of each module can vary significantly across downstream tasks, EntPruner prioritizes pruning of less important blocks using data-dependent Conditional Entropy Deviation (CED) as a guiding metric. CED quantifies how much the distribution diverges from the learned conditional data distribution after removing a block. Second, we propose a zero-shot adaptive pruning framework to automatically determine when and how much to prune during training. This dynamic strategy avoids the pitfalls of one-shot pruning, mitigating mode collapse, and preserving model performance. Extensive experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22$\times$ inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets.

[104] CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion

Dianbing Xi,Jiepeng Wang,Yuanzhi Liang,Xi Qiu,Jialun Liu,Hao Pan,Yuchi Huo,Rui Wang,Haibin Huang,Chi Zhang,Xuelong Li

Main category: cs.CV

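TL;DR: This paper proposes CtrlVDiff, a unified diffusion model that incorporates multiple graphics modalities (depth, normals, segmentation, edges, albedo, roughness, and others) to achieve high-quality video understanding and controllable generation, addressing the limits of geometry-only cues and incomplete multimodal inputs.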

Details Motivation: Existing video generation methods based on geometric cues are limited in controlling appearance, materials, and illumination, which makes physically plausible edits difficult and often causes temporal drift. Richer modality information is needed to improve understanding and control. Method: Proposes the CtrlVDiff model with a Hybrid Modality Control Strategy (HMCS) that accepts any subset of multimodal inputs and preserves temporal consistency through feature routing and fusion; builds the MMVideo dataset, which combines real and synthetic videos with cross-modality aligned annotations for training. Result: On video understanding and generation tasks, CtrlVDiff surpasses state-of-the-art methods in controllability and generation quality, supports layer-wise edits (such as relighting, material adjustment, and object insertion), and remains robust when some modalities are missing. Conclusion: Introducing more graphics and semantic modalities together with a hybrid training strategy effectively improves the understanding and generation abilities of video diffusion models, enabling more precise, predictable, and temporally coherent video editing. Abstract: We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold: geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations. We then propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions.
Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.

[105] DeepRFTv2: Kernel-level Learning for Image Deblurring

Xintian Mao,Haofei Song,Yin-Nian Liu,Qingli Li,Yan Wang

Main category: cs.CV

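TL;DR: Proposes the Fourier Kernel Estimator (FKE), which converts spatial-domain convolution into multiplication in the frequency domain, enabling low-complexity kernel-level learning of the blur process without additional supervision, and combines it with a multi-scale reversible network structure to improve deblurring performance.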

Details Motivation: Existing deep networks are limited to pixel-level learning, so deblurring models cannot truly understand the underlying mechanism of blur and lack the ability to learn the blur process at the kernel level. Method: Proposes the Fourier Kernel Estimator (FKE), which performs the activation operation in Fourier space and converts spatial-domain convolution into frequency-domain multiplication; changes the convolution object from the image to network features rich in semantic information; and designs a decoupled multi-scale architecture of reversible sub-U-Nets to improve feature extraction efficiency. Result: Achieves state-of-the-art performance on motion deblurring; the kernel estimator learns physically meaningful blur kernels and shows potential for other kernel-related problems. Conclusion: FKE enables efficient kernel-level blur modeling without additional supervision; frequency-domain processing and feature-level convolution let the network capture the essence of blur and significantly improve deblurring. Abstract: It is well-known that if a network aims to learn how to deblur, it should understand the blur process. Blurring is naturally caused by the convolution of the sharp image with the blur kernel. Thus, allowing the network to learn the blur process at the kernel level can significantly improve the image deblurring performance. However, current deep networks are still at the pixel-level learning stage, either performing end-to-end pixel-level restoration or stage-wise pseudo kernel-level restoration, failing to enable the deblur model to understand the essence of the blur. To this end, we propose Fourier Kernel Estimator (FKE), which considers the activation operation in Fourier space and converts the convolution problem in the spatial domain to a multiplication problem in Fourier space. Our FKE, jointly optimized with the deblur model, enables the network to learn the kernel-level blur process with low complexity and without any additional supervision. Furthermore, we change the convolution object of the kernel from "image" to network extracted "feature", whose rich semantic and structural information is more suitable to blur process learning. With the convolution of the feature and the estimated kernel, our model can learn the essence of blur at the kernel level. To further improve the efficiency of feature extraction, we design a decoupled multi-scale architecture with multiple hierarchical sub-unets with a reversible strategy, which allows better multi-scale encoding and decoding in low training memory. Extensive experiments indicate that our method achieves state-of-the-art motion deblurring results and shows potential for handling other kernel-related problems.
Analysis also shows our kernel estimator is able to learn physically meaningful kernels. The code will be available at https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur.
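
The identity the DeepRFTv2 abstract relies on, that convolution in the spatial domain becomes element-wise multiplication in Fourier space, can be checked numerically. The sketch below is a 1D toy with circular convolution and a naive DFT; the function names are hypothetical, and it illustrates only the convolution theorem, not the learned FKE module itself.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform, O(n^2), fine for a toy signal."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def circular_conv(signal, kernel):
    """Direct circular convolution in the spatial domain."""
    n = len(signal)
    return [sum(signal[(t - s) % n] * kernel[s] for s in range(n)) for t in range(n)]

def fourier_conv(signal, kernel):
    """Same convolution, computed as a product of spectra."""
    product = [a * b for a, b in zip(dft(signal), dft(kernel))]
    return [v.real for v in idft(product)]

# A single bright pixel blurred by a 3-tap kernel (zero-padded to length 8).
sharp = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
blur_kernel = [0.5, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]
blurred_spatial = circular_conv(sharp, blur_kernel)
blurred_fourier = fourier_conv(sharp, blur_kernel)
max_gap = max(abs(a - b) for a, b in zip(blurred_spatial, blurred_fourier))
```

Here `max_gap` stays at floating-point noise level, which is the property that lets a kernel be handled as a per-frequency multiplier instead of a sliding window.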

[106] Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning

Changlin Li,Jiawei Zhang,Shuhao Liu,Sihao Lin,Zeyi Shi,Zhihui Li,Xiaojun Chang

Main category: cs.CV

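TL;DR: Proposes Ent-Prog, an efficient training framework for diffusion models in human video generation. Entropy-guided prioritized learning and an adaptive progressive schedule significantly reduce training time and GPU memory consumption while maintaining generation performance.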

Details Motivation: Training diffusion models on high-resolution, multi-frame data incurs high computational cost and large memory consumption, so a more efficient training method is needed. Method: Proposes Conditional Entropy Inflation (CEI) to assess the importance of model components for the conditional generation task, combined with an adaptive progressive training schedule that dynamically adjusts computational complexity during training. Result: Experiments on three datasets show that Ent-Prog achieves up to 2.2x training speedup and 2.4x GPU memory reduction without sacrificing generation performance. Conclusion: Ent-Prog offers an effective solution for efficiently training diffusion models for human video generation, balancing efficiency and model performance. Abstract: Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that adaptively increases computational complexity during training by measuring the convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets demonstrate the effectiveness of Ent-Prog, achieving up to 2.2$\times$ training speedup and 2.4$\times$ GPU memory reduction without compromising generative performance.

[107] Referring Video Object Segmentation with Cross-Modality Proxy Queries

Baoli Sun,Xinzhu Ma,Ning Wang,Zhihui Wang,Zhiyong Wang

Main category: cs.CV

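TL;DR: This paper proposes ProxyFormer, a new architecture for referring video object segmentation (RVOS). It introduces proxy queries that fuse visual and text semantics and are propagated dynamically during video feature encoding, strengthening inter-frame dependencies and semantic alignment and significantly improving segmentation performance.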

Details Motivation: Existing methods do not model inter-frame dependencies during cross-modality alignment and integrate textual constraints too late, which leads to inaccurate target tracking and attention on the wrong objects. Method: Proposes the ProxyFormer model, which introduces proxy queries that are updated and propagated to fuse visual and text semantics across multiple stages of video feature encoding; decouples cross-modality interaction into temporal and spatial dimensions to reduce computational cost; and designs a Joint Semantic Consistency (JSC) training strategy to strengthen semantic alignment. Result: Experiments on four mainstream RVOS benchmarks show that ProxyFormer outperforms existing state-of-the-art methods. Conclusion: With its dynamic proxy query mechanism, ProxyFormer achieves cross-frame semantic propagation and precise vision-text alignment, significantly improving referring video object segmentation while remaining efficient and robust. Abstract: Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism built upon transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features to focus on the non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs.
Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of our ProxyFormer to the state-of-the-art methods.

[108] TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models

Jiaming He,Guanyu Hou,Hongwei Li,Zhicong Huang,Kangjie Chen,Yi Yu,Wenbo Jiang,Guowen Xu,Tianwei Zhang

Main category: cs.CV

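TL;DR: Proposes TEAR, a temporal-aware automated red-teaming framework for uncovering safety risks in text-to-video generation models, particularly those tied to temporal dynamics. A two-stage optimization produces prompts that look innocuous but elicit policy-violating video output, and a refinement model cyclically improves prompt stealthiness and attack effectiveness; experiments show attack success rates above 80% on open-source and commercial systems.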

Details Motivation: Existing safety evaluation methods focus on static image and text generation and cannot adequately capture the complex temporal dynamics of video generation, so a new safety evaluation framework designed for the temporal properties of T2V models is needed. Method: Proposes the TEAR framework, which uses a temporal-aware test generator optimized in two stages (initial generator training and temporal-aware online preference learning) to produce text prompts that trigger policy-violating content, and introduces a refinement model that cyclically improves prompt stealthiness and attack effectiveness. Result: In experiments on multiple open-source and commercial T2V systems, TEAR achieves an attack success rate above 80%, a significant improvement over the previous best of 57%. Conclusion: TEAR effectively exposes safety risks in T2V models that arise from temporal dynamics, providing new ideas and tools for the safety evaluation of future video generation models. Abstract: Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose a TEmporal-aware Automated Red-teaming framework, named TEAR, an automated framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach: initial generator training and temporal-aware online preference learning, to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. In addition, a refinement model is adopted to cyclically improve prompt stealthiness and adversarial effectiveness. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems with over 80% attack success rate, a significant boost from prior best result of 57%.

[109] LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs

Shichu Sun,Yichen Zhang,Haolin Song,Zonghao Guo,Chi Chen,Yidan Zhang,Yuan Yao,Zhiyuan Liu,Maosong Sun

Main category: cs.CV

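TL;DR: This paper proposes LLaVA-UHD v3, a multimodal large language model built around Progressive Visual Compression (PVC), which significantly reduces inference latency while maintaining strong performance.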

Details Motivation: To study how global native-resolution visual encoding differs from slice-based methods, and to address the high computational overhead of global encoding. Method: Proposes the PVC method, which consists of a refined patch embedding and a hierarchical windowed token compression module, integrated into a standard ViT for efficient encoding. Result: ViT-UHD performs well on multiple benchmarks and reduces time-to-first-token (TTFT) by 2.4x compared with MoonViT; LLaVA-UHD v3 matches Qwen2-VL in performance while reducing TTFT by 1.9x. Conclusion: PVC effectively balances efficiency and performance in multimodal models and offers a practical path to building efficient MLLMs. Abstract: Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Many recent MLLMs increasingly favor global native-resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual modeling, and (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates competitive performance with MoonViT while reducing TTFT (time-to-first-token) by 2.4x, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves competitive performance to Qwen2-VL, while further reducing TTFT by 1.9x. We will release all code and checkpoints to support future research on efficient MLLMs.
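
The idea of windowed token compression applied progressively, as described in the LLaVA-UHD v3 abstract, can be illustrated with a small sketch. The abstract does not say how tokens inside a window are aggregated, so mean pooling over non-overlapping windows is an assumption here, and the function name is hypothetical.

```python
def compress_windows(grid, window=2):
    """Merge each non-overlapping window x window block of tokens into one token.

    `grid` is a list of rows, each row a list of token vectors (lists of floats).
    Aggregation by mean is an illustrative assumption, not the paper's operator.
    """
    height, width = len(grid), len(grid[0])
    dim = len(grid[0][0])
    out = []
    for r in range(0, height, window):
        row = []
        for c in range(0, width, window):
            cells = [grid[i][j]
                     for i in range(r, min(r + window, height))
                     for j in range(c, min(c + window, width))]
            row.append([sum(v[d] for v in cells) / len(cells) for d in range(dim)])
        out.append(row)
    return out

# A 4x4 grid of 1-dimensional tokens whose value is the raster index.
tokens = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
stage1 = compress_windows(tokens)   # 4x4 -> 2x2, i.e. 16 tokens -> 4
stage2 = compress_windows(stage1)   # 2x2 -> 1x1, i.e. 4 tokens -> 1
```

Applying the step at two depths cuts the token count by 4x each time, which is the sense in which compression placed at several layers reduces the sequence length gradually instead of all at once.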

[110] Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation

Joonhyung Park,Hyeongwon Jang,Joowon Kim,Eunho Yang

Main category: cs.CV

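TL;DR: Proposes the GridAR framework, which uses grid-partitioned progressive generation and a layout-specified prompt reformulation strategy to improve the generation quality and efficiency of visual autoregressive models under test-time compute scaling.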

Details Motivation: Test-time compute scaling is underexplored for visual autoregressive models; conventional methods such as Best-of-N lack an overall plan of the canvas and cannot correct erroneous trajectories midway, which wastes computation and limits generation quality. Method: Designs the GridAR framework: 1) grid-partitioned progressive generation, which generates multiple candidate pieces for the same position, prunes infeasible ones early, and keeps viable ones as anchors to guide subsequent generation; 2) a layout-specified prompt reformulation strategy, which inspects partial views to infer a feasible layout and guides subsequent generation to compensate for the missing global blueprint. Result: With N=4, GridAR outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing compute cost by 25.6%; on the PIE-Bench image editing task, it improves semantic preservation by 13.9% over larger-N baselines. Conclusion: GridAR effectively improves the generation quality and efficiency of visual autoregressive models under limited test-time compute, addresses the error accumulation and lack of global planning in conventional methods, and generalizes well to both text-to-image generation and image editing. Abstract: Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, GridAR achieves higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%.
It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.

[111] Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding

Yutao Tang,Cheng Zhao,Gaurav Mittal,Rohith Kukkala,Rama Chellappa,Cheng Peng,Mei Chen

Main category: cs.CV

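TL;DR: This paper proposes NDTokenizer3D, a generalist 3D vision-language model built on a multi-scale NDT representation. A three-stage scene tokenization pipeline gives it fine-grained, unified 3D scene understanding, with strong results across several tasks.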

Details Motivation: Existing 3D vision-language models struggle to tokenize 3D scenes holistically and apply the tokens across diverse tasks, and lack a unified framework that combines language reasoning with spatial understanding. Method: Proposes NDTokenizer3D, which builds a multi-scale NDT representation from point clouds, designs the MSDec decoder to fuse cross-scale features into holistic scene tokens, and reuses MSDec for interactive prompting and segmentation decoding, unifying multiple 3D understanding tasks in a single architecture. Result: Achieves notable improvements on 3D referring segmentation, 3D visual question answering, and 3D dense captioning, confirming the model's fine-grained understanding and generality. Conclusion: With its multi-scale NDT tokenization pipeline and unified architecture, NDTokenizer3D bridges language reasoning and 3D spatial understanding, offering an efficient and flexible solution for generalist 3D vision-language models. Abstract: Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.

[112] When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models

Hui Lu,Yi Yu,Yiming Yang,Chenyu Yi,Qixin Zhang,Bingquan Shen,Alex C. Kot,Xudong Jiang

Main category: cs.CV

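TL;DR: Proposes the UPA-RFAS framework, which produces universal, transferable adversarial patch attacks on vision-language-action (VLA) models that transfer across models, tasks, and real physical settings.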

Details Motivation: Most existing adversarial patches overfit to a single model and perform poorly in black-box settings, lacking universality and transferability under unknown architectures, finetuned variants, and sim-to-real shifts. Method: Proposes the UPA-RFAS framework, which combines a feature-space objective in a shared feature space (with an $\ell_1$ deviation prior and a repulsive InfoNCE loss), a robustness-augmented two-phase min-max optimization (an inner loop learns sample-wise perturbations and an outer loop optimizes the universal patch), and two VLA-specific losses, Patch Attention Dominance and Patch Semantic Misalignment, to hijack text-to-vision attention and induce image-text mismatch without labels. Result: UPA-RFAS is validated across diverse VLA models, manipulation tasks, and real physical experiments, showing strong transfer across models, tasks, and viewpoints and succeeding in the physical world. Conclusion: UPA-RFAS exposes a practical patch-based attack threat to VLA systems in real-world settings and provides a strong baseline for future research on defenses. Abstract: Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of universal, transferable adversarial patches against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce UPA-RFAS (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an $\ell_1$ deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text$\to$vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.

[113] You Can Trust Your Clustering Model: A Parameter-free Self-Boosting Plug-in for Deep Clustering

Hanyang Li,Yuheng Jia,Hui Liu,Junhui Hou

Main category: cs.CV

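TL;DR: Proposes DCBoost, a parameter-free plug-in that uses reliable local structural cues to enhance the global feature structure of deep clustering models, significantly improving the clustering performance of a range of existing methods.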

Details Motivation: Existing deep clustering methods suffer from a mismatch between local and global feature structures: local structures are compact, while global structures show intertwined boundaries and poor separation, which hurts clustering quality. Method: High-confidence samples are selected as reliable anchors through adaptive k-nearest-neighbor consistency filtering, and a discriminative loss computed on these samples strengthens intra-class compactness and inter-class separability to guide network optimization. Result: Experiments on multiple benchmark datasets show that DCBoost significantly improves a variety of deep clustering models, improving on current state-of-the-art baselines (e.g., ProPos) by more than 3% and increasing the silhouette coefficient by over 7×. Conclusion: As a plug-and-play, parameter-free module, DCBoost narrows the gap between local and global feature structures and significantly improves deep clustering performance. Abstract: Recent deep clustering models have produced impressive clustering performance. However, a common issue with existing methods is the disparity between global and local feature structures. While local structures typically show strong consistency and compactness within class samples, global features often present intertwined boundaries and poorly separated clusters. Motivated by this observation, we propose DCBoost, a parameter-free plug-in designed to enhance the global feature structures of current deep clustering models. By harnessing reliable local structural cues, our method aims to elevate clustering performance effectively. Specifically, we first identify high-confidence samples through adaptive $k$-nearest neighbors-based consistency filtering, aiming to select a sufficient number of samples with high label reliability to serve as trustworthy anchors for self-supervision. Subsequently, these samples are utilized to compute a discriminative loss, which promotes both intra-class compactness and inter-class separability, to guide network optimization. Extensive experiments across various benchmark datasets showcase that our DCBoost significantly improves the clustering performance of diverse existing deep clustering models. Notably, our method improves the performance of current state-of-the-art baselines (e.g., ProPos) by more than 3% and amplifies the silhouette coefficient by over $7\times$. Code is available at .
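
The k-nearest-neighbor consistency filtering described in this abstract can be illustrated with a minimal sketch. This is not the DCBoost implementation: the function name `knn_consistent_anchors`, the fixed `k`, and the `min_agreement` threshold are assumptions made for illustration (the paper describes an adaptive, parameter-free choice that the abstract does not detail). The idea shown is only that a sample is kept as an anchor when most of its nearest neighbors share its pseudo-label.

```python
import math

def knn_consistent_anchors(features, labels, k=3, min_agreement=0.6):
    """Return indices of samples whose k nearest neighbours mostly share their pseudo-label.

    Hypothetical helper: fixed k and threshold stand in for the adaptive rule in the paper.
    """
    anchors = []
    for i, (fi, li) in enumerate(zip(features, labels)):
        # Euclidean distance from sample i to every other sample, nearest first.
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        agreement = sum(labels[j] == li for j in neighbours) / k
        if agreement >= min_agreement:
            anchors.append(i)
    return anchors

# Two tight clusters plus one point (index 8) that sits inside cluster 0
# but carries the pseudo-label of cluster 1.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
            (0.05, 0.05)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
anchors = knn_consistent_anchors(features, labels)
print(anchors)
```

In this toy example the inconsistent sample is filtered out while the well-embedded samples remain; a discriminative loss would then be computed on the surviving anchors only.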

[114] BotaCLIP: Contrastive Learning for Botany-Aware Representation of Earth Observation Data

Selene Cerna,Sara Si-Moussi,Wilfried Thuiller,Hadrien Hendrikx,Vincent Miele

Main category: cs.CV

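TL;DR: Proposes BotaCLIP, a lightweight multimodal contrastive framework that injects domain knowledge (specifically ecological structure) into a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical survey data, achieving domain adaptation without training from scratch. The resulting embeddings outperform the original DOFA and supervised baselines on several ecological tasks, showing the potential for efficiently injecting expert knowledge in data-scarce settings.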

Details Motivation: Real-world applications such as biodiversity modeling need domain-specific knowledge (such as ecological structure) built into pre-trained foundation models, but retraining is costly and data are limited, so a low-cost, efficient adaptation method is needed. Method: Proposes BotaCLIP, a lightweight multimodal contrastive learning framework that injects ecological knowledge by aligning the embeddings of a pre-trained Earth Observation model (DOFA) with botanical survey data (botanical relevés); a regularization strategy mitigates catastrophic forgetting, enabling domain adaptation without retraining the model from scratch. Result: On three ecological tasks (plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation), BotaCLIP representations consistently outperform the original DOFA representations and supervised baselines. Conclusion: Domain-aware adaptation of foundation models, as in BotaCLIP, can inject expert knowledge in data-scarce settings and yield low-cost, transferable representations, offering a workable frugal-AI approach for ecology and other specialized fields. Abstract: Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relevés. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the resulting embeddings serve as transferable representations for downstream predictors. Motivated by real-world applications in biodiversity modeling, we evaluated BotaCLIP representations in three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation. The results showed consistent improvements over those derived from DOFA and supervised baselines. More broadly, this work illustrates how domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings, enabling frugal representation learning.

[115] Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition

Baoli Sun,Yihan Wang,Xinzhu Ma,Zhihui Wang,Kun Lu,Zhiyong Wang

Main category: cs.CV

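TL;DR: Proposes the Action-Region Tracking (ART) framework, which uses a query-response mechanism to discover and track the local dynamics of fine-grained actions in video, improving fine-grained action recognition.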

Details Motivation: Existing methods struggle to capture the subtle local differences between fine-grained action categories and are largely limited to modeling coarse-grained motion patterns. Method: Designs a region-specific semantic activation module that uses discriminative and text-constrained semantics as queries to capture the most action-related region responses in each frame; builds action tracklets to characterize local action dynamics across frames; applies a multi-level tracklet contrastive constraint to optimize region responses spatially and temporally; and refines the textual semantics of the vision-language model through a task-specific fine-tuning mechanism. Result: Experiments on several widely used action recognition benchmarks show that the method clearly outperforms previous state-of-the-art approaches. Conclusion: By explicitly modeling the spatio-temporal dynamics of local regions and using semantically constrained queries, ART improves fine-grained action recognition and distinguishes similar actions accurately. Abstract: Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions evolving over time. In this work, we introduce the Action-Region Tracking (ART) framework, a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details, enabling effective distinction of similar actions. Specifically, we propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction among spatial and temporal dimensions with corresponding video features. The captured region responses are organized into action tracklets, which characterize region-based action dynamics by linking related responses across video frames in a coherent sequence. The text-constrained queries encode nuanced semantic representations derived from textual descriptions of action labels extracted by language branches within Visual Language Models (VLMs). To optimize the action tracklets, we design a multi-level tracklet contrastive constraint among region responses at spatial and temporal levels, enabling effective discrimination within each frame and correlation between adjacent frames. Additionally, a task-specific fine-tuning mechanism refines textual semantics such that semantic representations encoded by VLMs are preserved while optimized for task preferences. Comprehensive experiments on widely used action recognition benchmarks demonstrate the superiority of ART over previous state-of-the-art baselines.

[116] From Diffusion to One-Step Generation: A Comparative Study of Flow-Based Models with Application to Image Inpainting

Umang Agarwal,Rudraksh Sangore,Sumit Laddha

Main category: cs.CV

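TL;DR: Compares three generative models (DDPM, CFM, and MeanFlow) implemented with a unified TinyUNet architecture on CIFAR-10. CFM reaches an FID of 24.15 with 50 sampling steps, far better than DDPM; MeanFlow generates in a single step with an FID of 29.15 and a 50× reduction in inference time. CFM is also extended to image inpainting, where it performs well across several mask types.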

Details Motivation: To compare different generative models under identical conditions and to explore efficient sampling and the potential for image inpainting. Method: Implements DDPM, CFM, and MeanFlow with a unified TinyUNet architecture (<1.5M parameters) on CIFAR-10 and evaluates their FID; further applies CFM to image inpainting with a mask-guided sampling strategy tested on four mask types. Result: CFM reaches an FID of 24.15 with 50 sampling steps, far better than DDPM's 402.98; MeanFlow achieves single-step generation with an FID of 29.15 and 50× faster inference; for inpainting, PSNR rises from 4.95 to 8.57 dB and SSIM from 0.289 to 0.418. Conclusion: CFM gives the best generation quality, MeanFlow has a clear advantage in inference efficiency, and inpainting-specific training substantially improves CFM on the inpainting task. Abstract: We present a comprehensive comparative study of three generative modeling paradigms: Denoising Diffusion Probabilistic Models (DDPM), Conditional Flow Matching (CFM), and MeanFlow. While DDPM and CFM require iterative sampling, MeanFlow enables direct one-step generation by modeling the average velocity over time intervals. We implement all three methods using a unified TinyUNet architecture (<1.5M parameters) on CIFAR-10, demonstrating that CFM achieves an FID of 24.15 with 50 steps, significantly outperforming DDPM (FID 402.98). MeanFlow achieves FID 29.15 with single-step sampling -- a 50X reduction in inference time. We further extend CFM to image inpainting, implementing mask-guided sampling with four mask types (center, random bbox, irregular, half). Our fine-tuned inpainting model achieves substantial improvements: PSNR increases from 4.95 to 8.57 dB on center masks (+73%), and SSIM improves from 0.289 to 0.418 (+45%), demonstrating the effectiveness of inpainting-aware training.
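
The contrast between iterative CFM sampling and one-step generation can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the paper's TinyUNet pipeline: it assumes the common straight-line path between a noise sample `x0` and a data sample `x1`, and it replaces the learned network with the known constant velocity `x1 - x0` for a single pair. The helper names `cfm_pair` and `euler_sample` are hypothetical.

```python
def cfm_pair(x0, x1, t):
    """Point on the straight path at time t and the velocity regression target.

    Assumes the linear interpolation x_t = (1 - t) * x0 + t * x1.
    """
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return xt, target

def euler_sample(x0, velocity, steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

x0, x1 = [0.0, 0.0], [2.0, -4.0]
xt, target = cfm_pair(x0, x1, t=0.5)

# Stand-in for a trained network: the exact velocity of this single pair.
toy_velocity = lambda x, t: target

many_steps = euler_sample(x0, toy_velocity, steps=50)  # CFM-style iterative sampling
one_step = euler_sample(x0, toy_velocity, steps=1)     # one step with the average velocity
print(xt, many_steps, one_step)
```

Because the toy velocity is constant, the instantaneous and average velocities coincide and a single step lands on the same endpoint as 50 steps. With a learned, non-constant field this no longer holds, which is why the abstract describes MeanFlow as modeling the average velocity over a time interval directly.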

[117] T3-Tracer: A Tri-level Temporal-Aware Framework for Audio Forgery Detection and Localization

Shuhan Xia,Xuannan Liu,Xing Cui,Peipei Li

Main category: cs.CV

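TL;DR: Proposes T3-Tracer, the first framework to jointly analyze audio at the frame, segment, and audio levels for detecting partial audio forgery; two complementary modules provide fine-grained detection and boundary identification, significantly improving detection performance.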

Details Motivation: Existing methods cannot model temporal structure at multiple levels and so struggle to capture both transient and sustained anomalies in partial audio forgery, which limits detection performance. Method: Proposes the T3-Tracer framework, consisting of the Frame-Audio Feature Aggregation Module (FA-FAM) and the Segment-level Multi-Scale Discrepancy-Aware Module (SMDAM), which fuse multi-level temporal information and detect anomalies at the frame level and the segment level, respectively. Result: Experiments on three challenging datasets show state-of-the-art performance on partial audio forgery detection. Conclusion: By jointly modeling frame-, segment-, and audio-level information, T3-Tracer improves partial audio forgery detection, particularly in localizing forgery boundaries and identifying subtle tampering. Abstract: Recently, partial audio forgery has emerged as a new form of audio manipulation. Attackers selectively modify partial but semantically critical frames while preserving the overall perceptual authenticity, making such forgeries particularly difficult to detect. Existing methods focus on independently detecting whether a single frame is forged, lacking the hierarchical structure to capture both transient and sustained anomalies across different temporal levels. To address these limitations, we identify three key levels relevant to partial audio forgery detection and present T3-Tracer, the first framework that jointly analyzes audio at the frame, segment, and audio levels to comprehensively detect forgery traces. T3-Tracer consists of two complementary core modules: the Frame-Audio Feature Aggregation Module (FA-FAM) and the Segment-level Multi-Scale Discrepancy-Aware Module (SMDAM). FA-FAM is designed to detect the authenticity of each audio frame. It combines both frame-level and audio-level temporal information to detect intra-frame forgery cues and global semantic inconsistencies. To further refine and correct frame detection, we introduce SMDAM to detect forgery boundaries at the segment level. It adopts a dual-branch architecture that jointly models frame features and inter-frame differences across multi-scale temporal windows, effectively identifying abrupt anomalies that appear at the forged boundaries. Extensive experiments conducted on three challenging datasets demonstrate that our approach achieves state-of-the-art performance.

[118] FIELDS: Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision

Chen Ling,Henglin Shi,Hedvig Kjellström

Main category: cs.CV

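TL;DR: Proposes FIELDS, which adds direct 3D expression parameter supervision and an emotion recognition branch to address the failure of existing 3D face reconstruction methods to capture subtle affective details, producing high-fidelity, emotion-rich 3D face models from a single image.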

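Details Motivation: Existing 3D face reconstruction methods rely mainly on 2D supervision and lack real 3D annotations, so they struggle to recover subtle expressions and emotional information accurately, especially in in-the-wild settings. A method is needed that combines real 3D expression supervision with preservation of emotional content. Method: Proposes the FIELDS framework, which extends self-supervised 2D image consistency cues with direct 3D expression parameter supervision, using authentic expression parameters obtained from spontaneous 4D facial scans; an intensity-aware emotion loss makes the 3D expression parameters reflect genuine emotional content without exaggeration. Result: From a single input image, FIELDS generates emotion-rich and highly realistic 3D face models, significantly improves in-the-wild facial expression recognition while keeping results natural, bridges the 2D/3D domain gap, and mitigates expression-intensity bias. Conclusion: By combining direct 3D supervision with an emotion-aware loss, FIELDS achieves more realistic and nuanced 3D face reconstruction and improves expression recognition without sacrificing naturalness, providing an effective approach to emotion-driven 3D face modeling. Abstract: Facial expressions convey the bulk of emotional information in human communication, yet existing 3D face reconstruction methods often miss subtle affective details due to reliance on 2D supervision and lack of 3D ground truth. We propose FIELDS (Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision) to address these limitations by extending self-supervised 2D image consistency cues with direct 3D expression parameter supervision and an auxiliary emotion recognition branch. Our encoder is guided by authentic expression parameters from spontaneous 4D facial scans, while an intensity-aware emotion loss encourages the 3D expression parameters to capture genuine emotion content without exaggeration. This dual-supervision strategy bridges the 2D/3D domain gap and mitigates expression-intensity bias, yielding high-fidelity 3D reconstructions that preserve subtle emotional cues. From a single image, FIELDS produces emotion-rich face models with highly realistic expressions, significantly improving in-the-wild facial expression recognition performance without sacrificing naturalness.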

[119] Shift-Equivariant Complex-Valued Convolutional Neural Networks

Quentin Gabot,Teck-Yian Lim,Jérémy Fix,Joana Frontera-Pons,Chengfang Ren,Jean-Philippe Ovarlez

Main category: cs.CV

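TL;DR: Extends Learnable Polyphase Sampling (LPS) to complex-valued neural networks and proposes a projection layer from the complex to the real domain, achieving shift invariance and equivariance in classification, reconstruction, and semantic segmentation, with application to polarimetric SAR images.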

Details Motivation: Downsampling and upsampling operations break shift equivariance and invariance in conventional convolutional neural networks, and there is no systematic mechanism that guarantees these properties theoretically, so sampling layers that guarantee them by construction are needed. Method: Generalizes LPS to complex-valued neural networks and introduces a projection layer from the complex domain to the real domain placed before the Gumbel Softmax, preserving the network's equivariance and invariance properties. Result: The method is validated on several computer vision tasks (classification, reconstruction, and semantic segmentation); experiments on polarimetric SAR images in particular show that it achieves shift invariance and equivariance. Conclusion: The proposed complex-valued extension of LPS, combined with the projection layer, systematically provides shift equivariance and invariance in complex-valued neural networks, offering a theoretically consistent and effective solution for related applications. Abstract: Convolutional neural networks have shown remarkable performance in recent years on various computer vision problems. However, the traditional convolutional neural network architecture lacks a critical property: shift equivariance and invariance, broken by downsampling and upsampling operations. Although data augmentation techniques can help the model learn the latter property empirically, a consistent and systematic way to achieve this goal is by designing downsampling and upsampling layers that theoretically guarantee these properties by construction. Adaptive Polyphase Sampling (APS) introduced the cornerstone for shift invariance, later extended to shift equivariance with Learnable Polyphase up/downsampling (LPS) applied to real-valued neural networks. In this paper, we extend the work on LPS to complex-valued neural networks both from a theoretical perspective and with a novel building block of a projection layer from $\mathbb{C}$ to $\mathbb{R}$ before the Gumbel Softmax. We finally evaluate this extension on several computer vision problems, specifically for either the invariance property in classification tasks or the equivariance property in both reconstruction and semantic segmentation problems, using polarimetric Synthetic Aperture Radar images.
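
The polyphase idea behind APS and LPS can be shown on a 1-D complex signal. The sketch below is a simplified illustration, not the paper's layer: it uses the non-learned APS-style rule (keep the polyphase component with the largest energy), and it assumes squared magnitude as the projection from $\mathbb{C}$ to $\mathbb{R}$, whereas the paper describes a learnable projection followed by a Gumbel Softmax. The function names are hypothetical.

```python
def aps_downsample(signal):
    """Stride-2 downsampling that keeps the polyphase component with the larger energy."""
    even, odd = signal[0::2], signal[1::2]

    def energy(component):
        # Assumed C -> R projection: sum of squared magnitudes.
        return sum(abs(z) ** 2 for z in component)

    return even if energy(even) >= energy(odd) else odd

def roll(signal, shift):
    """Circularly shift a list to the right by `shift` positions."""
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift] if shift else list(signal)

signal = [1 + 2j, 0.1j, 3 - 1j, 0.2, -2 + 0.5j, 0.1, 1j, -0.3j]

out = aps_downsample(signal)
out_shift1 = aps_downsample(roll(signal, 1))   # input shifted by one sample
out_shift2 = aps_downsample(roll(signal, 2))   # input shifted by two samples
naive_shift1 = roll(signal, 1)[0::2]           # plain stride-2 on the shifted input
print(out, out_shift1, out_shift2, naive_shift1)
```

A one-sample shift of the input leaves the selected samples unchanged, and a two-sample shift shifts the output by one position, while plain stride-2 subsampling returns a different set of samples after a one-sample shift.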

[120] AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs

Shuhan Xia,Peipei Li,Xuannan Liu,Dongsen Zhang,Xinyu Guo,Zekun Li

Main category: cs.CV

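TL;DR: Proposes AVFakeBench, the first comprehensive audio-video forgery detection benchmark, covering multiple forgery types for both human and non-human subjects, with a multi-task evaluation framework for assessing the strengths and weaknesses of audio-video large language models in fine-grained perception and reasoning.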

Details Motivation: Existing audio-video forgery detection benchmarks are limited to DeepFakes and single-granularity annotations and fail to reflect the complexity and diversity of real-world forgery, so a more comprehensive and semantically richer benchmark is needed. Method: Proposes AVFakeBench, with 12K audio-video questions covering seven forgery types and four levels of annotation; builds a multi-stage hybrid forgery framework that combines proprietary models for task planning with expert generative models to produce high-quality forged samples; and designs a multi-task evaluation covering binary judgment, forgery type classification, detail selection, and explanatory reasoning. Result: Evaluating 11 audio-video large language models and 2 prevalent detection methods on AVFakeBench shows that AV-LMMs have potential as forgery detectors but remain clearly weak in fine-grained perception and reasoning. Conclusion: AVFakeBench provides a more comprehensive and challenging evaluation platform for audio-video forgery detection, reveals the strengths and limits of current AV-LMMs, and encourages future work on fine-grained understanding and multimodal reasoning. Abstract: The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human subjects and general subjects. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery type classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.

[121] LaGen: Towards Autoregressive LiDAR Scene Generation

Sizhuo Zhou,Xiaosong Jia,Fanrui Zhang,Junjie Li,Juyong Zhang,Yukang Feng,Jianwen Sun,Songbur Wong,Junqi You,Junchi Yan

Main category: cs.CV

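TL;DR: Proposes LaGen, the first framework capable of frame-by-frame autoregressive generation of long-horizon LiDAR scenes; starting from a single LiDAR frame and conditioned on bounding boxes, it generates high-fidelity 4D point clouds, with a scene decoupling estimation module and a noise modulation module that improve interactivity and reduce error accumulation.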

Details Motivation: Existing LiDAR generation methods support only single-frame generation, while prediction methods lack interactivity and cannot generate long horizons frame by frame, which falls short of the interactive world models needed for autonomous driving. Method: Proposes the LaGen framework, which generates long-horizon LiDAR sequences autoregressively; introduces a scene decoupling estimation module to improve interactive generation of object-level content; designs a noise modulation module to reduce error accumulation in long-horizon generation; and builds an evaluation protocol on nuScenes. Result: Experiments show that LaGen outperforms existing generation and prediction models on long-horizon LiDAR scene generation, with clear gains in the quality of later frames. Conclusion: LaGen is the first world model to support long-horizon, interactive LiDAR scene generation, offering a new direction and technical basis for LiDAR-based autonomous driving simulation and prediction. Abstract: Generative world models for autonomous driving (AD) have become a trending topic. Unlike the widely studied image modality, in this work we explore generative world models for LiDAR data. Existing generation methods for LiDAR data only support single frame generation, while existing prediction approaches require multiple frames of historical input and can only deterministically predict multiple frames at once, lacking interactivity. Both paradigms fail to support long-horizon interactive generation. To this end, we introduce LaGen, which to the best of our knowledge is the first framework capable of frame-by-frame autoregressive generation of long-horizon LiDAR scenes. LaGen is able to take a single-frame LiDAR input as a starting point and effectively utilize bounding box information as conditions to generate high-fidelity 4D scene point clouds. In addition, we introduce a scene decoupling estimation module to enhance the model's interactive generation capability for object-level content, as well as a noise modulation module to mitigate error accumulation during long-horizon generation. We construct a protocol based on nuScenes for evaluating long-horizon LiDAR scene generation. Experimental results comprehensively demonstrate that LaGen outperforms state-of-the-art LiDAR generation and prediction models, especially on the later frames.

[122] Unlocking Zero-shot Potential of Semi-dense Image Matching via Gaussian Splatting

Juncheng Chen,Chao Xu,Yanjun Cao

Main category: cs.CV

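TL;DR: Proposes MatchGS, the first framework to use 3D Gaussian Splatting (3DGS) for robust zero-shot image matching, significantly improving matching performance through geometric correction and 2D-3D representation alignment.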

Details Motivation: Learning-based image matching depends on high-quality training data, but data generated with existing 3DGS suffer from inaccurate geometry and biased depth, making reliable correspondence labeling difficult. Method: A two-part approach: (1) a geometrically faithful data generation pipeline that refines 3DGS geometry to produce highly precise correspondence labels; (2) a 2D-3D representation alignment strategy that injects the explicit 3D knowledge of 3DGS into the 2D matcher so that it learns viewpoint-invariant 3D representations. Result: The generated ground-truth correspondences reduce epipolar error by up to 40 times, support supervision under extreme viewpoint changes, and give state-of-the-art matchers zero-shot gains of up to 17.7% on public benchmarks. Conclusion: With proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, and structurally rich data source, advancing a new generation of robust zero-shot image matchers. Abstract: Learning-based image matching critically depends on large-scale, diverse, and geometrically accurate training data. 3D Gaussian Splatting (3DGS) enables photorealistic novel-view synthesis and thus is attractive for data generation. However, its geometric inaccuracies and biased depth rendering currently prevent robust correspondence labeling. To address this, we introduce MatchGS, the first framework designed to systematically correct and leverage 3DGS for robust, zero-shot image matching. Our approach is twofold: (1) a geometrically-faithful data generation pipeline that refines 3DGS geometry to produce highly precise correspondence labels, enabling the synthesis of a vast and diverse range of viewpoints without compromising rendering fidelity; and (2) a 2D-3D representation alignment strategy that infuses 3DGS' explicit 3D knowledge into the 2D matcher, guiding 2D semi-dense matchers to learn viewpoint-invariant 3D representations. Our generated ground-truth correspondences reduce the epipolar error by up to 40 times compared to existing datasets, enable supervision under extreme viewpoint changes, and provide self-supervisory signals through Gaussian attributes. Consequently, state-of-the-art matchers trained solely on our data achieve significant zero-shot performance gains on public benchmarks, with improvements of up to 17.7%. Our work demonstrates that with proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, and structurally-rich data source, paving the way for a new generation of robust zero-shot image matchers.

[123] Co-Training Vision Language Models for Remote Sensing Multi-task Learning

Qingyun Li,Shuran Ma,Junwei Luo,Yi Yu,Yue Zhou,Fengxiang Wang,Xudong Lu,Xiaoxing Wang,Xin He,Yushi Chen,Xue Yang,Junchi Yan

Main category: cs.CV

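TL;DR: Proposes RSCoVLM, a simple yet flexible vision language model (VLM) baseline for remote sensing multi-task learning (MTL); with a unified text interface and a dynamic-resolution strategy, it achieves state-of-the-art performance on a range of remote sensing tasks, and all tools and data are open-sourced to advance general-purpose remote sensing models.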

Details Motivation: With Transformers succeeding on individual remote sensing tasks, a unified model that performs well across multiple tasks is becoming feasible. Existing single-task methods lack generalization and scalability, while multi-task learning and vision language models (VLMs) show promise, but an effective baseline that handles the complexity and diversity of remote sensing data is still missing. Method: Proposes RSCoVLM, which includes: (1) a data curation engine for data acquisition, offline integration, and online loading and weighting, generating flexible vision-language conversations; (2) a unified dynamic-resolution strategy for remote sensing images of different scales; (3) a Zoom-in Chain mechanism for ultra-high-resolution (UHR) images, with the accompanying LRS-VQA-Zoom dataset; (4) enhanced object detection capability and a new evaluation protocol. Result: Experiments show that RSCoVLM reaches state-of-the-art performance across diverse remote sensing tasks, outperforming existing remote sensing VLMs and rivaling specialized expert models. The method reduces the computational burden and is flexible and scalable. Conclusion: As a simple, flexible, high-performing multi-task remote sensing VLM baseline, RSCoVLM advances general-purpose remote sensing models. Its open-sourced models, tools, and datasets improve reproducibility and provide a solid basis for future research. Abstract: With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. Firstly, we create the data curation engine, including data acquisition, offline processing and integrating, as well as online loading and weighting. This data engine effectively addresses the complex RS data environment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. The strategies are flexible and effectively mitigate the computational burdens. Additionally, we significantly enhance the model's object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All the training and evaluating tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.

[124] PathMamba: A Hybrid Mamba-Transformer for Topologically Coherent Road Segmentation in Satellite Imagery

Jules Decaestecker,Nicolas Vigne

Main category: cs.CV

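TL;DR: Proposes PathMamba, a new hybrid architecture that combines Mamba's state space modeling with the Transformer's global reasoning for road segmentation in satellite imagery, significantly improving topological continuity while maintaining high accuracy and linear computational efficiency.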

Details Motivation: Existing Vision Transformer-based methods capture global context, but their quadratic complexity limits deployment on resource-constrained devices; road networks are long, continuous structures that call for efficient modeling. Method: Proposes PathMamba, which uses Mamba blocks to capture the continuity and topology of roads and adds Transformer blocks to fuse global context, forming a complementary hybrid architecture. Result: Achieves state-of-the-art performance on the DeepGlobe and Massachusetts Roads datasets, with a notable gain on the APLS metric that confirms the advantage in topological continuity, while keeping computational overhead low. Conclusion: By combining Mamba's linear efficiency with the Transformer's global modeling, PathMamba achieves accurate, topologically continuous road segmentation and offers an efficient, practical solution for real applications. Abstract: Achieving both high accuracy and topological continuity in road segmentation from satellite imagery is a critical goal for applications ranging from urban planning to disaster response. State-of-the-art methods often rely on Vision Transformers, which excel at capturing global context, yet their quadratic complexity is a significant barrier to efficient deployment, particularly for on-board processing in resource-constrained platforms. In contrast, emerging State Space Models like Mamba offer linear-time efficiency and are inherently suited to modeling long, continuous structures. We posit that these architectures have complementary strengths. To this end, we introduce PathMamba, a novel hybrid architecture that integrates Mamba's sequential modeling with the Transformer's global reasoning. Our design strategically uses Mamba blocks to trace the continuous nature of road networks, preserving topological structure, while integrating Transformer blocks to refine features with global context. This approach yields topologically superior segmentation maps without the prohibitive scaling costs of pure attention-based models. Our experiments on the DeepGlobe Road Extraction and Massachusetts Roads datasets demonstrate that PathMamba sets a new state-of-the-art. Notably, it significantly improves topological continuity, as measured by the APLS metric, setting a new benchmark while remaining computationally competitive.

[125] CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation

Chenyu Liu,Hongze Chen,Jingzhi Bao,Lingting Zhu,Runze Zhang,Weikai Chen,Zeyu Hu,Yingda Yin,Keyang Luo,Xin Wang

Main category: cs.CV

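TL;DR: Proposes CaliTex, a 3D texture generation framework based on geometry-calibrated attention that resolves cross-view inconsistency through structured attention mechanisms.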

Details Motivation: Existing diffusion models produce cross-view inconsistency in 3D texture generation, which stems from attention ambiguity that makes the coupling between geometry and appearance unstable. Method: Introduces geometry-calibrated attention, comprising Part-Aligned Attention and Condition-Routed Attention, combined with a two-stage diffusion transformer so that geometric consistency is built into the network design. Result: Experiments show that CaliTex outperforms open-source and commercial baselines in visual coherence and cross-view consistency. Conclusion: CaliTex makes geometric coherence an inherent behavior of the network, improving the quality and stability of 3D texture generation. Abstract: Despite major advances brought by diffusion-based models, current 3D texture generation systems remain hindered by cross-view inconsistency -- textures that appear convincing from one viewpoint often fail to align across others. We find that this issue arises from attention ambiguity, where unstructured full attention is applied indiscriminately across tokens and modalities, causing geometric confusion and unstable appearance-structure coupling. To address this, we introduce CaliTex, a framework of geometry-calibrated attention that explicitly aligns attention with 3D structure. It introduces two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention which routes appearance information through geometry-conditioned pathways to maintain spatial fidelity. Coupled with a two-stage diffusion transformer, CaliTex makes geometric coherence an inherent behavior of the network rather than a byproduct of optimization. Empirically, CaliTex produces seamless and view-consistent textures and outperforms both open-source and commercial baselines.

[126] HTTM: Head-wise Temporal Token Merging for Faster VGGT

Weitian Wang,Lukas Meiner,Rai Shubham,Cecilia De La Parra,Akash Kumar

Main category: cs.CV

TL;DR: Proposes HTTM, a training-free 3D token merging method for accelerating VGGT, achieving up to 7× faster inference while maintaining performance.

Details Motivation: VGGT requires global attention for 3D scene reconstruction, which creates a significant latency bottleneck on long-sequence inputs. Method: Proposes head-wise temporal merging (HTTM), which merges tokens at multi-head granularity to preserve feature uniqueness and exploits head-level spatio-temporal correlation to reach higher merging ratios at lower computational cost. Result: HTTM achieves up to 7× acceleration in GPU inference with negligible performance drop. Conclusion: HTTM is an efficient, training-free acceleration method that significantly improves VGGT inference efficiency for large-scene 3D reconstruction. Abstract: The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. Thus, HTTM achieves up to 7x acceleration with negligible performance drops in a GPU-based inference.

[127] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Stefanos Koutoupis,Michaela Areti Zervou,Konstantinos Kontras,Maarten De Vos,Panagiotis Tsakalides,Grigorios Tsagatakis

Main category: cs.CV

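TL;DR: Proposes the Contrastive Fusion (ConFu) framework, which extends the traditional pairwise contrastive objective with a fused-modality contrastive term to jointly embed modalities and their combinations, capturing higher-order dependencies while keeping strong pairwise alignment.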

Details Motivation: Existing multimodal representation methods are mostly limited to pairwise alignment, cannot fully model higher-order interactions among several modalities, and often sacrifice single-modality task performance; the key challenge is preserving pairwise relationships while capturing higher-order dependencies. Method: Proposes the ConFu framework, which adds a fused-modality contrastive term to traditional pairwise contrastive learning, embeds individual modalities and their fused combinations in a unified representation space, and aligns modalities with fused representations to model higher-order dependencies (such as XOR-like relationships) while maintaining strong pairwise correspondence. Result: ConFu is validated on synthetic data and real-world multimodal benchmarks; it performs well on retrieval and classification, exploits cross-modal complementarity, captures higher-order dependencies, and supports one-to-one and two-to-one retrieval in a single framework. Conclusion: By jointly modeling contrastive learning over single and fused modalities, ConFu captures higher-order interactions among modalities while keeping good pairwise alignment, providing an effective and scalable approach to unified multimodal representation learning. Abstract: Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
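
The structure of the objective described above, pairwise contrastive terms plus a fused-modality term, can be sketched as follows. This is a rough illustration under assumptions, not the ConFu loss: the one-directional InfoNCE form, the temperature of 0.1, and especially the fusion by element-wise summation are placeholders chosen here (the abstract does not specify the fusion module), and `confu_style_loss` is a hypothetical name.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy of matching (i, i) pairs against in-batch negatives."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    loss = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(q)

def confu_style_loss(a, b, c, temperature=0.1):
    """Pairwise terms plus one fused term aligning fuse(a, b) with the third modality c."""
    pairwise = (info_nce(a, b, temperature)
                + info_nce(a, c, temperature)
                + info_nce(b, c, temperature))
    # Placeholder fusion (assumption): element-wise sum of the two embeddings.
    fused_ab = [[x + y for x, y in zip(u, v)] for u, v in zip(a, b)]
    return pairwise + info_nce(fused_ab, c, temperature)

a = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
c_aligned = [[1.0, 0.0], [0.0, 1.0]]
c_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = confu_style_loss(a, b, c_aligned)
loss_swapped = confu_style_loss(a, b, c_swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when the third modality matches both the individual and the fused embeddings, and large when its samples are swapped. A practical version would use learned encoders, a learned fusion module, and typically a symmetric loss.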

[128] Hybrid SIFT-SNN for Efficient Anomaly Detection of Traffic Flow-Control Infrastructure

Munish Rathee,Boris Bačić,Maryam Doborjeh

Main category: cs.CV

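TL;DR: Proposes the SIFT-SNN framework, which combines SIFT features with a spiking neural network for low-latency, high-accuracy real-time detection of structural anomalies in traffic infrastructure.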

Details Motivation: Conventional CNN methods fall short in real-time performance and interpretability and struggle to meet the need for low-power, efficient inference on edge devices, especially for structural safety monitoring in complex environments. Method: Uses SIFT for spatial feature encoding, followed by a latency-driven spike conversion layer and a LIF spiking neural network for classification, forming an end-to-end low-latency neuromorphic signal-processing pipeline. Result: Reaches 92.3% accuracy on a dataset of 6,000 real and synthetic frames, with a per-frame inference time of 9.5 ms and spike sparsity of 8.1%, supporting real-time edge deployment. Conclusion: The SIFT-SNN framework delivers high performance and low power consumption while keeping spatial features interpretable; it suits safety monitoring of traffic infrastructure such as movable concrete barriers and has broad potential for wider use. Abstract: This paper presents the SIFT-SNN framework, a low-latency neuromorphic signal-processing pipeline for real-time detection of structural anomalies in transport infrastructure. The proposed approach integrates Scale-Invariant Feature Transform (SIFT) for spatial feature encoding with a latency-driven spike conversion layer and a Leaky Integrate-and-Fire (LIF) Spiking Neural Network (SNN) for classification. The Auckland Harbour Bridge dataset is recorded under various weather and lighting conditions, comprising 6,000 labelled frames that include both real and synthetically augmented unsafe cases. The presented system achieves a classification accuracy of 92.3% (± 0.8%) with a per-frame inference time of 9.5 ms. Achieved sub-10 millisecond latency, combined with sparse spike activity (8.1%), enables real-time, low-power edge deployment. Unlike conventional CNN-based approaches, the hybrid SIFT-SNN pipeline explicitly preserves spatial feature grounding, enhances interpretability, supports transparent decision-making, and operates efficiently on embedded hardware. Although synthetic augmentation improved robustness, generalisation to unseen field conditions remains to be validated. The SIFT-SNN framework is validated through a working prototype deployed on a consumer-grade system and framed as a generalisable case study in structural safety monitoring for movable concrete barriers, which, as a traffic flow-control infrastructure, is deployed in over 20 cities worldwide.
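
The two spiking components named in the abstract, latency-based spike conversion and a Leaky Integrate-and-Fire neuron, can be sketched in a few lines. This is a generic textbook-style illustration, not the paper's pipeline: the linear value-to-time mapping, the `threshold` and `leak` values, and the function names are assumptions, and SIFT feature extraction is omitted entirely.

```python
def latency_encode(values, t_max=21):
    """Latency coding: a stronger feature (closer to 1) spikes earlier.

    Maps a value in [0, 1] to a spike time in [0, t_max - 1] (assumed linear mapping).
    """
    clipped = [min(max(v, 0.0), 1.0) for v in values]
    return [round((1.0 - v) * (t_max - 1)) for v in clipped]

def lif_neuron(input_current, threshold=1.0, leak=0.9):
    """Discrete-time leaky integrate-and-fire neuron; returns the output spike train."""
    v, spikes = 0.0, []
    for current in input_current:
        v = leak * v + current      # leak the membrane potential, then integrate input
        if v >= threshold:
            spikes.append(1)
            v = 0.0                 # reset after firing
        else:
            spikes.append(0)
    return spikes

spike_times = latency_encode([1.0, 0.5, 0.0])

# Two inputs close together in time push the neuron over threshold...
close_inputs = lif_neuron([0.6, 0.6, 0.0, 0.0])
# ...while the same inputs spread apart leak away before they can sum.
spread_inputs = lif_neuron([0.6, 0.0, 0.0, 0.0, 0.0, 0.6])
print(spike_times, close_inputs, spread_inputs)
```

The example shows why latency coding pairs naturally with LIF dynamics: inputs that arrive close together sum before the membrane potential leaks away, and most time steps carry no spike, which is the kind of sparse activity the abstract associates with low-power operation.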

[129] SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

Tae-Min Choi,Tae Kyeong Jeong,Garam Kim,Jaemin Lee,Yeongyoon Koh,In Cheul Choi,Jae-Ho Chung,Jong Woong Park,Juyoun Park

Main category: cs.CV

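TL;DR: SurgMLLMBench is a unified multimodal benchmark designed for surgical scene understanding. It integrates pixel-level segmentation masks and structured visual question answering annotations across laparoscopic, robot-assisted, and micro-surgical domains, introduces the new MAVIS dataset, and supports consistent cross-domain evaluation and interactive reasoning with multimodal large language models.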

Details Motivation: Existing surgical datasets mostly use a visual question answering format with heterogeneous taxonomies and lack pixel-level segmentation support, which limits unified evaluation and application of multimodal large language models in surgical understanding. Method: Builds a unified multimodal benchmark, SurgMLLMBench, that includes the newly collected MAVIS dataset, integrates pixel-level instrument segmentation masks and structured VQA annotations, and applies a unified taxonomy across laparoscopic, robot-assisted, and micro-surgical domains to support cross-domain training and evaluation. Result: Baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across surgical domains and generalizes effectively to unseen datasets. Conclusion: SurgMLLMBench will be released as a public resource to advance multimodal surgical AI research, supporting reproducible evaluation and the development of interactive surgical reasoning models. Abstract: Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.

[130] PFF-Net: Patch Feature Fitting for Point Cloud Normal Estimation

Qing Li,Huifang Feng,Kanle Shi,Yue Gao,Yi Fang,Yu-Shen Liu,Zhizhong Han

Main category: cs.CV

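TL;DR: Proposes a point cloud normal estimation method based on multi-scale feature fusion, achieving robust and efficient normal estimation through multi-scale feature aggregation and cross-scale feature compensation.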

Details Motivation: Existing methods struggle to choose a suitable neighborhood size across different data or geometries, and they are parameter-heavy and inefficient, making it hard to balance accuracy and speed. Method: Introduces a multi-scale feature fusion strategy with a multi-scale feature aggregation module and a cross-scale feature compensation module, which progressively aggregate local features of different scales and compensate for large-scale feature information to approximate the optimal geometric description. Result: Achieves state-of-the-art performance on both synthetic and real-world datasets while reducing network parameters and running time. Conclusion: The method adapts effectively to local patches of different scales and provides a better feature description, yielding efficient, accurate, and robust point cloud normal estimation. Abstract: Estimating the normal of a point requires constructing a local patch to provide center-surrounding context, but determining the appropriate neighborhood size is difficult when dealing with different data or geometries. Existing methods commonly employ various parameter-heavy strategies to extract a full feature description from the input patch. However, they still have difficulties in accurately and efficiently predicting normals for various point clouds. In this work, we present a new idea of feature extraction for robust normal estimation of point clouds. We use the fusion of multi-scale features from different neighborhood sizes to address the issue of selecting reasonable patch sizes for various data or geometries. We seek to model a patch feature fitting (PFF) based on multi-scale features to approximate the optimal geometric description for normal estimation and implement the approximation process via multi-scale feature aggregation and cross-scale feature compensation. The feature aggregation module progressively aggregates the patch features of different scales to the center of the patch and shrinks the patch size by removing points far from the center. It not only enables the network to precisely capture the structure characteristic in a wide range, but also describes highly detailed geometries. The feature compensation module ensures the reusability of features from earlier layers of large scales and reveals associated information in different patch sizes. Our approximation strategy based on aggregating the features of multiple scales enables the model to achieve scale adaptation of varying local patches and deliver the optimal feature description. 
Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets with fewer network parameters and a shorter running time.

[131] Endo-G$^{2}$T: Geometry-Guided & Temporally Aware Time-Embedded 4DGS For Endoscopic Scenes

Yangle Liu,Fengze Li,Kan Liu,Jieming Ma

Main category: cs.CV

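TL;DR: Proposes Endo-G²T, a geometry-guided, temporally aware training framework for 4D Gaussian splatting in dynamic endoscopic scenes. Through geometric prior distillation, time-embedded Gaussian field modeling, and keyframe-constrained streaming optimization, it achieves state-of-the-art monocular reconstruction.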

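Details Motivation: Endoscopic video exhibits strong view-dependent effects (specularities, wet reflections, and occlusions). Pure photometric supervision tends to cause early geometric drift whose erroneous shapes are hard to correct, so geometry needs to be anchored early in 4D Gaussian splatting while temporal consistency and efficiency are maintained. Method: 1) Geo-guided prior distillation: confidence-gated monocular depth is converted into scale-invariant depth and depth-gradient losses, with a warm-up-to-cap schedule to avoid early overfitting; 2) Time-embedded Gaussian field: dynamics are modeled in XYZT with a rotor-like rotation parameterization, with lightweight regularization to improve temporal consistency; 3) Keyframe-constrained streaming training: optimization focuses on keyframes under a max-points budget while non-keyframes receive lightweight updates, improving efficiency and long-horizon stability. Result: On the EndoNeRF and StereoMIS-P1 datasets, Endo-G²T reaches the state of the art among monocular reconstruction methods, suppressing geometric drift and improving reconstruction accuracy and temporal coherence. Conclusion: Through early geometric anchoring and temporally aware modeling, Endo-G²T substantially improves 4D Gaussian splatting reconstruction quality in dynamic endoscopic scenes while balancing efficiency and stability, making it suitable for temporal reconstruction in real medical settings. Abstract: Endoscopic (endo) video exhibits strong view-dependent effects such as specularities, wet reflections, and occlusions. Pure photometric supervision misaligns with geometry and triggers early geometric drift, where erroneous shapes are reinforced during densification and become hard to correct. We ask how to anchor geometry early for 4D Gaussian splatting (4DGS) while maintaining temporal consistency and efficiency in dynamic endoscopic scenes. Thus, we present Endo-G$^{2}$T, a geometry-guided and temporally aware training scheme for time-embedded 4DGS. First, geo-guided prior distillation converts confidence-gated monocular depth into supervision with scale-invariant depth and depth-gradient losses, using a warm-up-to-cap schedule to inject priors softly and avoid early overfitting. Second, a time-embedded Gaussian field represents dynamics in XYZT with a rotor-like rotation parameterization, yielding temporally coherent geometry with lightweight regularization that favors smooth motion and crisp opacity boundaries. Third, keyframe-constrained streaming improves efficiency and long-horizon stability through keyframe-focused optimization under a max-points budget, while non-keyframes advance with lightweight updates. Across EndoNeRF and StereoMIS-P1 datasets, Endo-G$^{2}$T achieves state-of-the-art results among monocular reconstruction baselines.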
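The abstract does not give the exact loss, so the following is only a sketch of what a confidence-gated, scale-invariant depth term with a warm-up-to-cap weight might look like. It uses the widely known log-depth formulation of the scale-invariant error; the gating threshold `conf_min`, the linear ramp, and all function names are assumptions for illustration.

```python
import math


def scale_invariant_loss(pred, target, confidence, lam=1.0, conf_min=0.5):
    """Scale-invariant log-depth error over pixels whose confidence passes
    the gate. With lam=1.0 a global rescaling of `pred` leaves the loss unchanged."""
    d = [math.log(p) - math.log(t)
         for p, t, c in zip(pred, target, confidence) if c >= conf_min]
    if not d:
        return 0.0
    n = len(d)
    return sum(x * x for x in d) / n - lam * sum(d) ** 2 / (n * n)


def prior_weight(step, warmup_steps, cap):
    """Assumed warm-up-to-cap schedule: ramp linearly to `cap`, then hold."""
    return cap * min(1.0, step / warmup_steps)


rendered = [1.0, 2.0, 4.0, 9.0]      # hypothetical rendered depths
prior = [2.0, 4.0, 8.0, 1.0]         # monocular prior, off by a global scale
conf = [0.9, 0.8, 0.7, 0.1]          # last pixel is gated out
loss = scale_invariant_loss(rendered, prior, conf)
weighted = prior_weight(step=50, warmup_steps=100, cap=0.5) * loss
```

Because the three trusted pixels differ from the prior only by a constant factor, the loss is zero up to floating-point error, which is the property that makes an unscaled monocular depth prior usable as supervision.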

[132] Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning

Xin Gu,Haoji Zhang,Qihang Fan,Jingxuan Niu,Zhipeng Zhang,Libo Zhang,Guang Chen,Fan Chen,Longyin Wen,Sijie Zhu

Main category: cs.CV

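TL;DR: Proposes STVG-o1, the first plug-and-play framework that lets multimodal large language models reach state-of-the-art spatio-temporal video grounding (STVG) performance without architectural changes. By introducing a bounding-box chain-of-thought mechanism and a multi-dimensional reinforcement reward function, it clearly outperforms existing methods on several benchmarks.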

Details Motivation: Existing multimodal large language models perform poorly on STVG, mainly because of misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. Method: Proposes the STVG-o1 framework, which introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations before the final prediction, and designs a multi-dimensional reinforcement reward function comprising format, consistency, temporal, spatial, and think rewards, providing geometry-aware supervision through reinforcement fine-tuning. Result: Achieves state-of-the-art performance on HCSTVG-v1/v2 and VidSTG: it exceeds the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matches specialized models on VidSTG, clearly outperforms all existing MLLM-based methods, and shows strong cross-dataset open-vocabulary generalization. Conclusion: STVG-o1 shows that multimodal large language models, without architectural modification, can become strong backbones for precise spatio-temporal grounding given a suitable reasoning mechanism and reinforcement learning strategy. Abstract: Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions. Despite their strong language understanding, multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. To address this, we propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without any architectural modifications. Our method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations in an intermediate step before producing the final prediction. We further design a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning. Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matching specialized models on VidSTG, and surpassing all existing MLLM-based approaches by large margins. It also demonstrates strong open-vocabulary generalization across datasets, establishing MLLMs as viable and powerful backbones for precise spatio-temporal grounding. Our code and models will be released.

[133] Monet: Reasoning in Latent Visual Space Beyond Images and Language

Qixun Wang,Yang Shi,Yifei Wang,Yuanxing Zhang,Pengfei Wan,Kun Gai,Xianghua Ying,Yisen Wang

Main category: cs.CV

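TL;DR: Proposes Monet, a framework that lets multimodal large language models reason over images by generating continuous embeddings in the latent visual space as intermediate visual thoughts. To address compute cost and insufficient supervision during training, it designs a three-stage distillation-based fine-tuning pipeline and proposes the VLPO reinforcement learning method to strengthen latent reasoning. Monet-7B, trained on a 125K multi-type CoT dataset, performs well on a range of visual reasoning tasks and generalizes strongly.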

Details Motivation: Existing visual reasoning methods are limited by external tools, lack the flexibility of human abstract visual thinking, and struggle to carry out end-to-end visual thinking in the latent space. Method: Proposes the Monet framework, which uses a three-stage distillation-based supervised fine-tuning (SFT) pipeline to align latent visual and language representations, and introduces VLPO (Visual-latent Policy Optimization), which explicitly incorporates latent embeddings into policy gradient updates during reinforcement learning to improve visual reasoning in the latent space. It also builds Monet-SFT-125K, a high-quality text-image interleaved CoT dataset of 125K samples, for training. Result: Monet-7B shows consistent gains on real-world perception, chart understanding, OCR, and geometric reasoning benchmarks, and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. Ablations confirm the contribution of each training component and analyze the limitation of GRPO in latent reasoning. Conclusion: By generating visual thoughts directly in the latent space, Monet achieves a more human-like form of abstract visual reasoning, advances autonomous visual reasoning in multimodal models without reliance on external tools, and offers a new path for joint vision-language reasoning. Abstract: "Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. 
We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.

[134] DiverseVAR: Balancing Diversity and Quality of Next-Scale Visual Autoregressive Models

Mingue Park,Prin Phunyaphibarn,Phillip Y. Lee,Minhyuk Sung

Main category: cs.CV

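TL;DR: Proposes the DiverseVAR framework, which increases the generation diversity of visual autoregressive models without retraining by injecting noise into the text embedding at test time and applying a scale-travel refinement, while preserving image quality.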

Details Motivation: Visual autoregressive (VAR) models perform well in image generation but lack diversity, often producing nearly identical images for the same prompt; this issue has been overlooked in research focused on image quality. Method: First, noise is injected into the text embedding to increase diversity; then a scale-travel technique uses a multi-scale autoencoder to extract coarse-scale tokens and resume generation at intermediate stages, preserving image quality. Result: Experiments show that combining noise injection with scale-travel substantially increases generation diversity while minimizing image-quality degradation, reaching a new Pareto frontier between diversity and quality. Conclusion: Without retraining or fine-tuning, DiverseVAR addresses the limited diversity of VAR models and offers a better diversity-quality trade-off for autoregressive image generation. Abstract: We introduce DiverseVAR, a framework that enhances the diversity of text-conditioned visual autoregressive models (VAR) at test time without requiring retraining, fine-tuning, or substantial computational overhead. While VAR models have recently emerged as strong competitors to diffusion and flow models for image generation, they suffer from a critical limitation in diversity, often producing nearly identical images even for simple prompts. This issue has largely gone unnoticed amid the predominant focus on image quality. We address this limitation at test time in two stages. First, inspired by diversity enhancement techniques in diffusion models, we propose injecting noise into the text embedding. This introduces a trade-off between diversity and image quality: as diversity increases, the image quality sharply declines. To preserve quality, we propose scale-travel: a novel latent refinement technique inspired by time-travel strategies in diffusion models. Specifically, we use a multi-scale autoencoder to extract coarse-scale tokens that enable us to resume generation at intermediate stages. Extensive experiments show that combining text-embedding noise injection with our scale-travel refinement significantly enhances diversity while minimizing image-quality degradation, achieving a new Pareto frontier in the diversity-quality trade-off.

[135] SAM Guided Semantic and Motion Changed Region Mining for Remote Sensing Change Captioning

Futian Wang,Mengqi Wang,Xiao Wang,Haowen Wang,Jin Tang

Main category: cs.CV

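TL;DR: Proposes a remote sensing change captioning method built on the SAM foundation model. By fusing global visual features, semantic- and motion-level change regions, and object-of-interest information from a knowledge graph, it produces more accurate natural-language change descriptions and reaches state-of-the-art performance on several benchmark datasets.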

Details Motivation: Existing remote sensing change captioning methods have weak region awareness and limited temporal alignment and do not model regions of interest effectively, which limits caption quality. Method: A CNN/Transformer extracts global visual features, the SAM model delineates semantic- and motion-level change regions, and a purpose-built knowledge graph supplies knowledge about objects of interest; these heterogeneous sources are fused via cross-attention, and a Transformer decoder generates the natural-language description. Result: Achieves state-of-the-art performance on several widely used remote sensing change captioning datasets, clearly outperforming existing methods. Conclusion: Combining the SAM foundation model with a knowledge graph is effective for remote sensing change captioning; it strengthens region awareness and semantic understanding and offers a new solution for the task. Abstract: Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed between two remote sensing images captured at different times. Existing methods typically employ CNNs/Transformers to extract visual representations from the given images or incorporate auxiliary tasks to enhance the final results, with weak region awareness and limited temporal alignment. To address these issues, this paper explores the use of the SAM (Segment Anything Model) foundation model to extract region-level representations and inject region-of-interest knowledge into the captioning framework. Specifically, we employ a CNN/Transformer model to extract global-level vision features, leverage the SAM foundation model to delineate semantic- and motion-level change regions, and utilize a specially constructed knowledge graph to provide information about objects of interest. These heterogeneous sources of information are then fused via cross-attention, and a Transformer decoder is used to generate the final natural language description of the observed changes. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple widely used benchmark datasets. The source code of this paper will be released on https://github.com/Event-AHU/SAM_ChangeCaptioning

[136] E-M3RF: An Equivariant Multimodal 3D Re-assembly Framework

Adeela Islam,Stefano Fiorini,Manuel Lecha,Theodore Tsesmelis,Stuart James,Pietro Morerio,Alessio Del Bue

Main category: cs.CV

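TL;DR: Proposes E-M3RF, an equivariant multimodal 3D reassembly framework that combines geometric and color features and uses SE(3) flow matching to predict fragment transformations, substantially improving reassembly accuracy on cultural heritage datasets.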

Details Motivation: Existing deep learning methods for 3D reassembly rely mainly on geometric features, perform poorly when geometry is insufficient or ambiguous, and lack physical constraints that prevent overlapping assemblies. The paper therefore brings in multimodal information such as color and strengthens the model's symmetry consistency. Method: Proposes the E-M3RF framework: a rotation-equivariant encoder extracts geometric features from point positions, a Transformer encodes the color at each point, and the two are fused into a multimodal representation; SE(3) flow matching then predicts the rigid transformation of each fragment for reassembly. Result: Experiments on four datasets (Breaking Bad, Fantastic Breaks, RePAIR, and Presious) show that on RePAIR, E-M3RF reduces rotation error by 23.1%, translation error by 13.2%, and Chamfer Distance by 18.4% relative to competing methods. Conclusion: By fusing geometric and color features in a multimodal representation with SE(3)-equivariant modeling, E-M3RF improves the accuracy and robustness of 3D fragment reassembly, especially in cultural heritage scenarios where geometric information is limited. Abstract: 3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been challenged by deep learning methods rather than classical optimization. While learning approaches have shown promising results, most still rely primarily on geometric features to assemble a whole from its parts. As a result, methods struggle when geometry alone is insufficient or ambiguous, for example, for small, eroded, or symmetric fragments. Additionally, solutions do not impose physical constraints that explicitly prevent overlapping assemblies. To address these limitations, we introduce E-M3RF, an equivariant multimodal 3D reassembly framework that takes as input the point clouds, containing both point positions and colors of fractured fragments, and predicts the transformations required to reassemble them using SE(3) flow matching. Each fragment is represented by both geometric and color features: i) 3D point positions are encoded as rotation-consistent geometric features using a rotation-equivariant encoder, ii) the colors at each 3D point are encoded with a transformer. The two feature sets are then combined to form a multimodal representation. We experimented on four datasets: two synthetic datasets, Breaking Bad and Fantastic Breaks, and two real-world cultural heritage datasets, RePAIR and Presious, demonstrating that E-M3RF on the RePAIR dataset reduces rotation error by 23.1% and translation error by 13.2%, while Chamfer Distance decreases by 18.4% compared to competing methods.

[137] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings

Jiajie Zhang,Sören Schwertfeger,Alexander Kleiner

Main category: cs.CV

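TL;DR: Proposes an unsupervised framework that automatically extracts and organizes Vision-Language-Action (VLA) pre-training data from continuous industrial video streams.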

Details Motivation: To unlock large volumes of unlabeled human demonstration video in support of scalable embodied AI integration in manufacturing. Method: A lightweight motion tokenizer is first trained to encode motion dynamics; an unsupervised action segmenter based on a "Latent Action Energy" metric then discovers and segments semantically coherent action primitives. Result: Effective segmentation of key tasks is demonstrated on public benchmarks and a proprietary electric motor assembly dataset, and clustering plus quantitative assessment with a Vision-Language Model confirm the semantic coherence of the action primitives. Conclusion: This is the first fully automated end-to-end system for extracting VLA pre-training data from unstructured industrial videos, offering a scalable data solution for embodied AI in manufacturing. Abstract: We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.

[138] EvRainDrop: HyperGraph-guided Completion for Effective Frame and Event Stream Aggregation

Futian Wang,Fan Zhang,Xiao Wang,Mengqi Wang,Dexing Huang,Jin Tang

Main category: cs.CV

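TL;DR: Proposes a hypergraph-based spatio-temporal event stream completion mechanism that connects events across times and spatial locations via hypergraphs and completes sparse events through contextual message passing. It supports fusion with the RGB modality, and experiments confirm its effectiveness on single- and multi-label event classification.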

Details Motivation: Existing event representation learning methods suffer from undersampling caused by the spatial sparsity of events and struggle to exploit spatio-temporal information effectively. Method: Designs a hypergraph-guided spatio-temporal event stream completion mechanism that connects event tokens via hypergraphs and performs message passing; adds RGB tokens as hypergraph nodes for multi-modal completion; and aggregates hypergraph node information across time steps through self-attention to fuse multi-modal features. Result: Achieves strong performance on single-label and multi-label event classification tasks, validating the framework. Conclusion: The method mitigates the spatial sparsity of event data, enables efficient multi-modal information completion and fusion, and offers a new approach to event representation learning. Abstract: Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically use event frames, voxels, or tensors as input. Although these approaches have achieved notable progress, they struggle to address the undersampling problem caused by spatial sparsity. In this paper, we propose a novel hypergraph-guided spatio-temporal event stream completion mechanism, which connects event tokens across different times and spatial locations via hypergraphs and leverages contextual information message passing to complete these sparse events. The proposed method can flexibly incorporate RGB tokens as nodes in the hypergraph within this completion framework, enabling multi-modal hypergraph-based information completion. Subsequently, we aggregate hypergraph node information across different time steps through self-attention, enabling effective learning and fusion of multi-modal features. Extensive experiments on both single- and multi-label event classification tasks fully validated the effectiveness of our proposed framework. The source code of this paper will be released on https://github.com/Event-AHU/EvRainDrop.

[139] MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices

Shuai Zhang,Bao Tang,Siyuan Yu,Yueting Zhu,Jingfeng Yao,Ya Zou,Shanglin Yuan,Li Yu,Wenyu Liu,Xinggang Wang

Main category: cs.CV

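TL;DR: Proposes MobileI2V, a lightweight diffusion model that for the first time generates 720p video in real time on mobile devices. A linear hybrid architecture, time-step distillation, and mobile-specific attention optimizations greatly increase speed while maintaining quality.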

Details Motivation: Because diffusion models are computationally expensive and slow to sample, existing methods cannot deliver real-time, high-resolution image-to-video generation on resource-constrained mobile devices, so an efficient, lightweight solution is needed. Method: Proposes the MobileI2V model: 1) a linear hybrid architecture denoiser that balances generation efficiency and quality; 2) a time-step distillation strategy that compresses sampling from more than 20 steps to only 2; 3) mobile-specific attention optimizations that speed up inference. Result: Achieves under 100 ms per frame of 720p video on mobile devices, a 10-fold increase in generation speed, and a 2-fold speed-up of attention operations, with quality comparable to existing models. Conclusion: MobileI2V is the first to achieve real-time, high-quality image-to-video generation on mobile devices, providing a practical option for video generation on edge devices. Abstract: Recently, video generation has witnessed rapid advancements, drawing increasing attention to image-to-video (I2V) synthesis on mobile devices. However, the substantial computational complexity and slow generation speed of diffusion models pose significant challenges for real-time, high-resolution video generation on resource-constrained mobile devices. In this work, we propose MobileI2V, a 270M lightweight diffusion model for real-time image-to-video generation on mobile devices. The core lies in: (1) We analyzed the performance of linear attention modules and softmax attention modules on mobile devices, and proposed a linear hybrid architecture denoiser that balances generation efficiency and quality. (2) We design a time-step distillation strategy that compresses the I2V sampling steps from more than 20 to only two without significant quality loss, resulting in a 10-fold increase in generation speed. (3) We apply mobile-specific attention optimizations that yield a 2-fold speed-up for attention operations during on-device inference. MobileI2V enables, for the first time, fast 720p image-to-video generation on mobile devices, with quality comparable to existing models. Under one-step conditions, the generation speed of each frame of 720p video is less than 100 ms. Our code is available at: https://github.com/hustvl/MobileI2V.

[140] Frequency-Aware Token Reduction for Efficient Vision Transformer

Dong-Jae Lee,Jiwan Hur,Jaehyun Choi,Jaemyung Yu,Junmo Kim

Main category: cs.CV

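TL;DR: Proposes a frequency-aware token reduction strategy that separates high-frequency and low-frequency tokens and aggregates the low-frequency components to mitigate rank collapse, improving Vision Transformer performance while lowering computational cost.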

Details Motivation: Existing token reduction methods ignore the frequency characteristics of self-attention, such as rank collapse and over-smoothing, which degrades performance. Method: Tokens are partitioned into high-frequency and low-frequency groups; high-frequency tokens are selectively preserved, and low-frequency tokens are aggregated into a compact direct current (DC) token that retains essential low-frequency information. Result: Experiments show that the method improves accuracy while reducing computation and mitigates rank collapse and over-smoothing. Conclusion: The frequency-aware token reduction strategy combines frequency-domain analysis with model efficiency optimization and offers a new perspective on Vision Transformer design. Abstract: Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity concerning token length remains a significant challenge. To address this, token reduction methods have been widely explored. However, existing approaches often overlook the frequency characteristics of self-attention, such as the rank collapsing and over-smoothing phenomena. In this paper, we propose a frequency-aware token reduction strategy that improves computational efficiency while preserving performance by mitigating rank collapsing. Our method partitions tokens into high-frequency tokens and low-frequency tokens. High-frequency tokens are selectively preserved, while low-frequency tokens are aggregated into a compact direct current token to retain essential low-frequency components. Through extensive experiments and analysis, we demonstrate that our approach significantly improves accuracy while reducing computational overhead and mitigating rank collapsing and over-smoothing. Furthermore, we analyze the previous methods, shedding light on their implicit frequency characteristics and limitations.
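The partition-and-aggregate step described above can be sketched as follows. The abstract does not say how tokens are scored, so this sketch assumes a simple proxy: the DC component is the mean over tokens, and a token's high-frequency energy is its squared distance from that mean. The function name and the `keep` parameter are hypothetical.

```python
def reduce_tokens(tokens, keep):
    """Keep the `keep` tokens with the largest deviation from the token mean
    (assumed high-frequency proxy) and merge the rest into one averaged token."""
    n, dim = len(tokens), len(tokens[0])
    mean = [sum(t[j] for t in tokens) / n for j in range(dim)]
    energy = [sum((t[j] - mean[j]) ** 2 for j in range(dim)) for t in tokens]

    # rank tokens by energy, highest first
    order = sorted(range(n), key=lambda i: energy[i], reverse=True)
    high_idx = sorted(order[:keep])   # preserve original token order
    low_idx = order[keep:]

    reduced = [tokens[i] for i in high_idx]
    if low_idx:
        # aggregate low-frequency tokens into a single compact token
        dc = [sum(tokens[i][j] for i in low_idx) / len(low_idx) for j in range(dim)]
        reduced.append(dc)
    return reduced


tokens = [[1.0, 1.0], [1.0, 1.0], [5.0, -3.0], [1.0, 1.0]]
reduced = reduce_tokens(tokens, keep=1)
print(reduced)  # [[5.0, -3.0], [1.0, 1.0]]
```

Four tokens become two: the outlier token is kept as-is and the three near-identical tokens collapse into one, so later attention layers operate on a shorter sequence.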

[141] Merge and Bound: Direct Manipulations on Weights for Class Incremental Learning

Taehoon Kim,Donghwan Jang,Bohyung Han

Main category: cs.CV

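TL;DR: Proposes Merge-and-Bound (M&B), a new training approach for class incremental learning that optimizes by directly manipulating model weights in the parameter space. It needs no changes to the architecture or learning objectives, reduces catastrophic forgetting, and shows superior performance on standard benchmarks.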

Details Motivation: To address catastrophic forgetting in class incremental learning and explore new optimization strategies that do not rely on replay or regularization. Method: Proposes two kinds of weight merging, inter-task weight merging (averaging model weights from previous stages) and intra-task weight merging (combining parameters within the current stage), and introduces a bounded update technique that minimizes cumulative updates to preserve old knowledge. Result: Extensive evaluation on several standard CIL benchmarks shows that M&B outperforms existing state-of-the-art methods. Conclusion: M&B can be integrated into existing CIL methods without modifying the model structure or learning objectives, demonstrating that continual learning through constrained weight merging in the parameter space is feasible. Abstract: We present a novel training approach, named Merge-and-Bound (M&B) for Class Incremental Learning (CIL), which directly manipulates model weights in the parameter space for optimization. Our algorithm involves two types of weight merging: inter-task weight merging and intra-task weight merging. Inter-task weight merging unifies previous models by averaging the weights of models from all previous stages. On the other hand, intra-task weight merging facilitates the learning of current task by combining the model parameters within current stage. For reliable weight merging, we also propose a bounded update technique that aims to optimize the target model with minimal cumulative updates and preserve knowledge from previous tasks; this strategy reveals that it is possible to effectively obtain new models near old ones, reducing catastrophic forgetting. M&B is seamlessly integrated into existing CIL methods without modifying architecture components or revising learning objectives. We extensively evaluate our algorithm on standard CIL benchmarks and demonstrate superior performance compared to state-of-the-art methods.
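As a rough illustration of the two operations named above, the sketch below averages flattened weight vectors from earlier stages and then limits how far a newly trained model may move from that merged starting point. The averaging follows the abstract; the norm-clipping rule in `bound_update` is an assumption standing in for the paper's bounded update, and both function names are hypothetical.

```python
import math


def average_weights(models):
    """Inter-task merging: element-wise average of weight vectors from
    previous stages (each model is a flat list of floats)."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def bound_update(base, current, max_norm):
    """Assumed bounded update: rescale the cumulative change (current - base)
    so its L2 norm never exceeds `max_norm`."""
    delta = [c - b for b, c in zip(base, current)]
    norm = math.sqrt(sum(d * d for d in delta))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [b + scale * d for b, d in zip(base, delta)]


stage_models = [[1.0, 2.0], [3.0, 4.0]]        # weights from two earlier stages
merged = average_weights(stage_models)          # [2.0, 3.0]
trained = [5.0, 7.0]                            # hypothetical result of new-task training
bounded = bound_update(merged, trained, max_norm=1.0)
```

Here the new-task update has norm 5, so it is scaled by 0.2 and the bounded model ends up near `[2.6, 3.8]`, close to the merged weights, which is the intuition behind keeping new models near old ones to limit forgetting.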

[142] CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation

Shizhe Sun,Wataru Ohyama

Main category: cs.CV

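TL;DR: Proposes a cross-attention-based non-local knowledge distillation method (CanKD) that strengthens knowledge transfer through pixel-level dynamic interaction between teacher and student feature maps, outperforming existing methods on object detection and segmentation.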

Details Motivation: Conventional self-attention-based distillation methods align teacher and student feature maps independently and fail to capture pixel relationships across the two feature maps, which limits knowledge transfer. Method: Introduces a cross-attention mechanism in which each pixel of the student feature map dynamically attends to all pixels of the teacher feature map, enabling non-local knowledge transfer, and optimizes training by adding only one extra loss function. Result: Extensive experiments on object detection and image segmentation show that CanKD outperforms current state-of-the-art feature and hybrid distillation methods. Conclusion: By modeling pixel-wise relationships more thoroughly through cross-attention, CanKD has potential as a new paradigm for attention-guided distillation in computer vision. Abstract: We propose Cross-Attention-based Non-local Knowledge Distillation (CanKD), a novel feature-based knowledge distillation framework that leverages cross-attention mechanisms to enhance the knowledge transfer process. Unlike traditional self-attention-based distillation methods that align teacher and student feature maps independently, CanKD enables each pixel in the student feature map to dynamically consider all pixels in the teacher feature map. This non-local knowledge transfer more thoroughly captures pixel-wise relationships, improving feature representation learning. Our method introduces only an additional loss function to achieve superior performance compared with existing attention-guided distillation methods. Extensive experiments on object detection and image segmentation tasks demonstrate that CanKD outperforms state-of-the-art feature and hybrid distillation methods. These experimental results highlight CanKD's potential as a new paradigm for attention-guided distillation in computer vision tasks. Code is available at https://github.com/tori-hotaru/CanKD
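The core operation, each student pixel attending to every teacher pixel, is ordinary scaled dot-product cross-attention with student features as queries and teacher features as keys and values. The sketch below shows only that step on flattened feature maps; it omits learned projections, and the distillation loss built on top of it is not specified in the abstract, so none is shown. The function name is hypothetical.

```python
import math


def cross_attention(student, teacher):
    """For each student pixel (query), return a softmax-weighted mix of all
    teacher pixels (keys and values). Inputs are lists of feature vectors."""
    dim = len(student[0])
    attended = []
    for q in student:
        # similarity of this student pixel to every teacher pixel
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in teacher]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]   # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        attended.append([sum(w * v[j] for w, v in zip(weights, teacher))
                         for j in range(dim)])
    return attended


teacher = [[1.0, 0.0], [0.0, 1.0]]     # two teacher pixels
student = [[0.0, 0.0], [4.0, 0.0]]     # two student pixels
mixed = cross_attention(student, teacher)
```

A zero query weights all teacher pixels equally, giving `[0.5, 0.5]`, while the second query is pulled toward the teacher pixel it resembles; every output row is a convex combination of teacher features.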

[143] Generalized Design Choices for Deepfake Detectors

Lorenzo Pellegrini,Serafino Pandolfini,Davide Maltoni,Matteo Ferrara,Marco Prati,Marco Ramilli

Main category: cs.CV

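TL;DR: Systematically studies how different design choices affect the accuracy and generalization of deepfake detection models, aiming to establish architecture-agnostic best practices that improve detection and achieve state-of-the-art results on the AI-GenBench benchmark.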

Details Motivation: The performance of deepfake detection methods often depends on implementation details (such as data preprocessing, augmentation strategies, and optimization techniques), which makes fair comparison difficult and obscures the factors that truly matter. Method: Conducts systematic experiments that isolate the effect of individual design factors in training, inference, and incremental updates. Result: Identifies a set of design choices that consistently improve deepfake detection and reach state-of-the-art performance on the AI-GenBench benchmark. Conclusion: Presents a generalizable, architecture-agnostic set of best practices for deepfake detection. Abstract: The effectiveness of deepfake detection methods often depends less on their core design and more on implementation details such as data preprocessing, augmentation strategies, and optimization techniques. These factors make it difficult to fairly compare detectors and to understand which factors truly contribute to their performance. To address this, we systematically investigate how different design choices influence the accuracy and generalization capabilities of deepfake detection models, focusing on aspects related to training, inference, and incremental updates. By isolating the impact of individual factors, we aim to establish robust, architecture-agnostic best practices for the design and development of future deepfake detection systems. Our experiments identify a set of design choices that consistently improve deepfake detection and enable state-of-the-art performance on the AI-GenBench benchmark.

[144] Self-Paced Learning for Images of Antinuclear Antibodies

Yiyang Jiang,Guangwu Qian,Jiaxin Wu,Qi Huang,Qing Li,Yongkang Wu,Xiao-Yong Wei

Main category: cs.CV

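TL;DR: Proposes a new framework for antinuclear antibody (ANA) detection that handles the complexity of multi-instance, multi-label learning through instance sampling, pseudo-label assignment, and self-paced learning, achieving state-of-the-art performance on several datasets.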

Details Motivation: Manual ANA detection is slow, labor-intensive, and requires years of training, and existing automated methods struggle with the complexity of multi-instance, multi-label (MIML) learning in real clinical settings. Method: Designs an end-to-end framework comprising an instance sampler, a probabilistic pseudo-label dispatcher, and a self-paced weight adjustment mechanism, performing ANA detection directly on raw microscope images. Result: On the ANA dataset, F1-Macro improves by up to 7.0% and mAP by up to 12.6% over the best prior method; on public MIML medical datasets it ranks in the top two, reducing Hamming loss and one-error by up to 18.2% and 26.9%, respectively. Conclusion: The framework addresses the MIML challenges of ANA detection without manual preprocessing, supports end-to-end optimization, improves detection performance, and shows good potential for clinical use. Abstract: Antinuclear antibody (ANA) testing is a crucial method for diagnosing autoimmune disorders, including lupus, Sjögren's syndrome, and scleroderma. Despite its importance, manual ANA detection is slow, labor-intensive, and demands years of training. ANA detection is complicated by over 100 coexisting antibody types, resulting in vast fluorescent pattern combinations. Although machine learning and deep learning have enabled automation, ANA detection in real-world clinical settings presents unique challenges as it involves multi-instance, multi-label (MIML) learning. In this paper, a novel framework for ANA detection is proposed that handles the complexities of MIML tasks using unaltered microscope images without manual preprocessing. Inspired by human labeling logic, it identifies consistent ANA sub-regions and assigns aggregated labels accordingly. These steps are implemented using three task-specific components: an instance sampler, a probabilistic pseudo-label dispatcher, and self-paced weight learning rate coefficients. The instance sampler suppresses low-confidence instances by modeling pattern confidence, while the dispatcher adaptively assigns labels based on instance distinguishability. Self-paced learning adjusts training according to empirical label observations. Our framework overcomes limitations of traditional MIML methods and supports end-to-end optimization. Extensive experiments on one ANA dataset and three public medical MIML benchmarks demonstrate the superiority of our framework. 
On the ANA dataset, our model achieves up to +7.0% F1-Macro and +12.6% mAP gains over the best prior method, setting new state-of-the-art results. It also ranks top-2 across all key metrics on public datasets, reducing Hamming loss and one-error by up to 18.2% and 26.9%, respectively. The source code can be accessed at https://github.com/fletcherjiang/ANA-SelfPacedLearning.
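For readers unfamiliar with the two multi-label metrics reported above, here is a small sketch using their standard definitions: Hamming loss is the fraction of label slots predicted incorrectly, and one-error is the fraction of samples whose top-scored label is not a true label. The toy labels and scores are made up for illustration and have nothing to do with the paper's data.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of (sample, label) slots where the binary prediction is wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for t_row, p_row in zip(y_true, y_pred)
                for t, p in zip(t_row, p_row))
    return wrong / total


def one_error(y_true, scores):
    """Fraction of samples whose highest-scoring label is not a true label."""
    misses = 0
    for t_row, s_row in zip(y_true, scores):
        top = max(range(len(s_row)), key=lambda j: s_row[j])
        if t_row[top] == 0:
            misses += 1
    return misses / len(y_true)


y_true = [[1, 0, 1], [0, 1, 0]]                 # two samples, three labels
y_pred = [[1, 0, 0], [0, 1, 0]]                 # one wrong slot out of six
scores = [[0.2, 0.7, 0.1], [0.1, 0.8, 0.3]]     # per-label confidence scores

print(hamming_loss(y_true, y_pred))  # 0.16666666666666666
print(one_error(y_true, scores))     # 0.5
```

Lower is better for both metrics, which is why the abstract reports reductions rather than gains.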

[145] EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?

Pierre Adorni,Minh-Tan Pham,Stéphane May,Sébastien Lefèvre

Main category: cs.CV

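TL;DR: Proposes an efficient Ensemble-of-Specialists framework for remote sensing foundation models that uses lightweight, reusable, task-specific specialist models to improve efficiency, interpretability, and extensibility, supports federated learning and continuous integration, and advances sustainable AI.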

Details Motivation: Existing remote sensing foundation models depend on very large models and datasets, consume substantial resources, are hard to make widely accessible, and conflict with the principles of sustainable AI. Method: Uses a modular design that decomposes training into multiple lightweight, task-specific ConvNeXtV2 specialist models, supporting freezing, reuse, federated training, pruning, and continuous integration. Result: Delivers an efficient, extensible, and resource-friendly framework for remote sensing foundation models that suits resource-constrained and collaborative settings. Conclusion: The ensemble framework offers a new direction for building sustainable, efficient remote sensing foundation models with good prospects for practical use. Abstract: Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need for training separate models for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to only a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast with the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs.

[146] The Age-specific Alzheimer's Disease Prediction with Characteristic Constraints in Nonuniform Time Span

Xin Hong,Kaifeng Huang

Main category: cs.CV

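TL;DR: Proposes a sequential image generation method guided by quantitative metrics, combined with an age-scaling factor that produces age-specific MRI images, to improve the accuracy of long-term Alzheimer's disease progression prediction.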

Details Motivation: Early identification of Alzheimer's disease is essential for personalized treatment, but generating images that accurately reflect disease characteristics is challenging when input sequences are captured at irregular time intervals. Method: A sequential image generation method guided by quantitative metrics, with an age-scaling factor; an age-scaled pixel loss optimizes the iterative generation of MRI images. Result: Ablation experiments show that introducing quantitative metrics significantly improves the accuracy of MRI image synthesis and that the age-scaled pixel loss improves generation; in long-term disease prediction, the Structural Similarity Index reaches 0.882, indicating high similarity of the synthesized images. Conclusion: The method preserves key features of disease progression and generates high-quality age-specific MRI images, which helps improve long-term prediction of Alzheimer's disease. Abstract: Alzheimer's disease is a debilitating disorder marked by a decline in cognitive function. Timely identification of the disease is essential for the development of personalized treatment strategies that aim to mitigate its progression. The application of generated images for the prediction of Alzheimer's disease poses challenges, particularly in accurately representing the disease's characteristics when input sequences are captured at irregular time intervals. This study presents an innovative methodology for sequential image generation, guided by quantitative metrics, to maintain the essential features indicative of disease progression. Furthermore, an age-scaling factor is integrated into the process to produce age-specific MRI images, facilitating the prediction of advanced stages of the disease. The results obtained from the ablation study suggest that the inclusion of quantitative metrics significantly improves the accuracy of MRI image synthesis. Furthermore, the application of age-scaled pixel loss contributed to the enhanced iterative generation of MRI images. In terms of long-term disease prognosis, the Structural Similarity Index reached a peak value of 0.882, indicating a substantial degree of similarity in the synthesized images.
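For readers interpreting the 0.882 figure, the Structural Similarity Index between two image windows $x$ and $y$ is conventionally defined as below. This is the standard definition from the image-quality literature, given here only as background; the abstract does not state which window size or constants the authors used.

$$
\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)\,(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)}
$$

Here $\mu_x, \mu_y$ are the window means, $\sigma_x^2, \sigma_y^2$ the variances, $\sigma_{xy}$ the covariance, and $C_1, C_2$ small constants that stabilize the division. The value is 1 only when the two windows are identical.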

[147] Video Generation Models Are Good Latent Reward Models

Xiaoyue Mi,Wenqing Yu,Jiesong Lian,Shibo Jie,Ruizhe Zhong,Zijun Liu,Guozhen Zhang,Zixiang Zhou,Zhiyong Xu,Yuan Zhou,Qinglin Lu,Fan Tang

Main category: cs.CV

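TL;DR: Proposes the PRFL framework, which performs preference optimization for video generation in latent space, avoiding VAE decoding overhead, substantially reducing memory use and training time, and improving alignment with human preferences.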

Details Motivation: Existing video reward models depend on pixel-space inputs, which makes them computationally expensive and memory-hungry, and they optimize only the late denoising steps, leaving early-stage motion dynamics and structural coherence unsupervised. Method: Exploits the temporal modeling ability of pre-trained video generation models in the noisy latent space to build the reward model directly in latent space, enabling gradient backpropagation through the full denoising process without VAE decoding. Result: PRFL significantly outperforms RGB ReFL across experiments, achieving better alignment with human preferences with less memory and shorter training time. Conclusion: Reward feedback learning in latent space is efficient and effective for video generation, and PRFL offers a new paradigm for preference optimization in video generation. Abstract: Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.

[148] UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes

Kang Du,Xue Liao,Junpeng Xia,Chaozheng Guo,Yi Gu,Yirui Guan,Duotun Wang,ShengHuang,Zeyu Wang

Main category: cs.CV

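TL;DR: UAVLight is a controlled yet real benchmark dataset for illumination-robust 3D reconstruction; data are captured along repeatable flight paths at multiple fixed times of day, giving natural lighting variation under consistent geometry along with standardized evaluation.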

Details Motivation: Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction, and existing datasets cannot separate lighting changes from geometric or semantic changes, which makes it hard to study illumination robustness in isolation. Method: Designs repeatable, geo-referenced UAV flight paths and captures the same scene at multiple fixed times of day, keeping geometry, calibration, and viewpoints consistent while introducing natural lighting variation. Result: The UAVLight dataset, containing multiple scenes with rich lighting variation but stable geometry, together with standardized evaluation protocols across lighting conditions. Conclusion: UAVLight provides a reliable benchmark for 3D reconstruction under outdoor lighting changes and supports evaluation of consistency, fidelity, and relightability. Abstract: Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable. However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness. We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints. With standardized evaluation protocols across lighting conditions, UAVLight provides a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments.

[149] Multimodal Robust Prompt Distillation for 3D Point Cloud Models

Xiang Gu,Liming Lu,Xu Zheng,Anan Du,Yongbin Zhou,Shuchao Pang

Main category: cs.CV

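TL;DR: Proposes an efficient Multimodal Robust Prompt Distillation (MRPD) framework for defending 3D point cloud models against adversarial attacks; robust features are distilled during training, with no extra cost at inference.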

Details Motivation: Existing defenses suffer from high computational overhead and poor generalization across attack types, so an efficient and general defense mechanism is needed. Method: A teacher-student framework, MRPD, in which three teachers (a vision model processing depth projections, a high-performance 3D model, and a text encoder) produce robust embeddings; the student is trained by multimodal knowledge distillation using lightweight prompt learning and a confidence-gated mechanism. Result: Experiments show that MRPD substantially outperforms state-of-the-art defenses under a range of white-box and black-box attacks and also performs better on clean data. Conclusion: MRPD offers a practical new paradigm for building robust 3D vision systems by efficiently fusing multimodal knowledge. Abstract: Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applications. Existing defense methods often suffer from (1) high computational overhead and (2) poor generalization ability across diverse attack types. To bridge these gaps, we propose a novel yet efficient teacher-student framework, namely Multimodal Robust Prompt Distillation (MRPD), for distilling robust 3D point cloud models. It learns lightweight prompts by aligning the student point cloud model's features with robust embeddings from three distinct teachers: a vision model processing depth projections, a high-performance 3D model, and a text encoder. To ensure reliable knowledge transfer, this distillation is guided by a confidence-gated mechanism which dynamically balances the contribution of all input modalities. Notably, since the distillation happens entirely during the training stage, there is no additional computational cost at inference. Extensive experiments demonstrate that MRPD substantially outperforms state-of-the-art defense methods against a wide range of white-box and black-box attacks, while even achieving better performance on clean data. Our work presents a new, practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge.

[150] Enhanced Landmark Detection Model in Pelvic Fluoroscopy using 2D/3D Registration Loss

Chou Mo,Yehyun Suh,J. Ryan Martin,Daniel Moyer

Main category: cs.CV

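TL;DR: Proposes a U-Net framework that incorporates 2D/3D landmark registration to improve landmark detection accuracy in intra-operative pelvic imaging under variable patient pose.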

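Details Motivation: Most existing landmark detection methods for pelvic X-ray assume a fixed antero-posterior view and struggle with pose deviations caused by repositioning of the imaging unit or the patient during surgery. Method: Incorporates 2D/3D landmark registration into U-Net training through a Pose Estimation Loss, and compares the baseline U-Net, a U-Net trained with the loss, and a U-Net fine-tuned with the loss under variable-pose conditions. Result: The proposed method significantly improves landmark detection accuracy under realistic intra-operative variable-pose conditions, outperforming the baseline U-Net and the versions only trained or fine-tuned with the pose loss. Conclusion: A U-Net framework that integrates 2D/3D registration information can improve the robustness and accuracy of landmark detection in complex intra-operative settings and has potential for clinical use. Abstract: Automated landmark detection offers an efficient approach for medical professionals to understand patient anatomic structure and positioning using intra-operative imaging. While current detection methods for pelvic fluoroscopy demonstrate promising accuracy, most assume a fixed Antero-Posterior view of the pelvis. However, orientation often deviates from this standard view, either due to repositioning of the imaging unit or of the target structure itself. To address this limitation, we propose a novel framework that incorporates 2D/3D landmark registration into the training of a U-Net landmark prediction model. We analyze the performance difference by comparing landmark detection accuracy between the baseline U-Net, U-Net trained with Pose Estimation Loss, and U-Net fine-tuned with Pose Estimation Loss under realistic intra-operative conditions where patient pose is variable.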

[151] Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

Teng Hu,Zhentao Yu,Guozhen Zhang,Zihan Su,Zhengguang Zhou,Youliang Zhang,Yuan Zhou,Qinglin Lu,Ran Yi

Main category: cs.CV

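TL;DR: This paper proposes Harmony, a framework that addresses three challenges of audio-video synchronization in generative AI through cross-task synergy training, a global-local decoupled interaction module, and synchronization-enhanced CFG, significantly improving audio-video alignment.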

Details Motivation: Existing open-source models suffer from unstable alignment, insufficient capture of fine temporal detail, and modality bias in audio-video synchronization, problems rooted in fundamental flaws of the joint diffusion process. Method: The Harmony framework: (1) cross-task synergy training to reduce correspondence drift; (2) a global-local decoupled interaction module for more precise temporal alignment; (3) SyncCFG, which amplifies the synchronization signal during inference. Result: Experiments show that Harmony significantly outperforms existing methods in both generation quality and fine-grained audio-video synchronization, reaching state-of-the-art performance. Conclusion: Through mechanistic design, Harmony addresses the key challenges of audio-video synchronization and offers a new approach to multimodal generation. Abstract: The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
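As background for the CFG discussion, the conventional Classifier-Free Guidance rule that the abstract contrasts against can be sketched as follows. This shows only the standard formulation; the abstract does not give the SyncCFG equations, so nothing here should be read as the authors' method, and the toy vectors are illustrative assumptions.

```python
def classifier_free_guidance(uncond, cond, scale):
    """Standard CFG: extrapolate from the unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy noise predictions (hypothetical values, one entry per latent dimension).
uncond = [0.0, 1.0, -1.0]
cond = [1.0, 1.0, 0.0]

guided = classifier_free_guidance(uncond, cond, scale=3.0)
print(guided)
```

With `scale=1.0` the rule returns the conditional prediction unchanged, and larger scales push further along the conditional direction, which is the "enhances conditionality" behavior the abstract refers to.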

[152] Deep Learning-Based Multiclass Classification of Oral Lesions with Stratified Augmentation

Joy Naoum,Revana Salama,Ali Hamdi

Main category: cs.CV

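TL;DR: This study proposes a deep learning multiclass classifier for early identification of 16 different oral lesions, using stratified data splitting, data augmentation, and oversampling to handle limited and imbalanced data, and outperforms existing methods in accuracy, precision, and recall.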

Details Motivation: Because benign and malignant lesions are hard to distinguish visually at an early stage, oral cancer is often diagnosed late, so effective computer-aided diagnosis systems are needed to improve early detection. Method: A deep learning multiclass classifier combined with stratified data splitting, advanced data augmentation, and oversampling to handle small, class-imbalanced datasets. Result: The experiments reach 83.33% accuracy, 89.12% precision, and 77.31% recall, significantly better than current mainstream methods, with notably strong performance on minority classes. Conclusion: The framework shows potential for reliable computer-aided early diagnosis of oral cancer in clinical settings and is an important step toward practical use. Abstract: Oral cancer is highly common across the globe and is mostly diagnosed during the later stages due to the close visual similarity among benign, precancerous, and malignant lesions in the oral cavity. Implementing computer-aided diagnosis systems early on has the potential to greatly improve clinical outcomes. This research uses deep learning to build a multiclass classifier for sixteen different oral lesions. To overcome the challenges of limited and imbalanced datasets, the proposed technique combines stratified data splitting with advanced data augmentation and oversampling. The experimental results, which achieved 83.33 percent accuracy, 89.12 percent precision, and 77.31 percent recall, demonstrate the superiority of the suggested model over state-of-the-art methods now in use. The results also indicate that oversampling and augmentation strategies are effective, with noteworthy classification performance on minority classes. As a first step toward trustworthy computer-aided diagnostic systems for the early detection of oral cancer in clinical settings, the suggested framework shows promise.
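The two data-handling steps named in the abstract, stratified splitting and oversampling, can be illustrated with a minimal standard-library sketch. The function names, the 80/20 ratio, the toy labels, and the choice of random duplication are all assumptions made for illustration; the abstract does not specify the authors' pipeline.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with every class split at roughly the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out at least one sample per class when the class has more than one.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def oversample(train_idx, labels, seed=0):
    """Duplicate minority-class indices at random until each class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    return balanced


# Hypothetical imbalanced label list with three lesion classes.
labels = ["a"] * 10 + ["b"] * 5 + ["c"] * 2
train_idx, test_idx = stratified_split(labels)
balanced = oversample(train_idx, labels)
print(Counter(labels[i] for i in balanced))
```

Oversampling is applied to the training indices only, so duplicated samples never leak into the held-out split.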

[153] MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training

Haotian Xue,Qi Chen,Zhonghao Wang,Xun Huang,Eli Shechtman,Jinrong Xie,Yongxin Chen

Main category: cs.CV

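TL;DR: MoGAN is a motion-centric post-training framework that needs no reward model or human preference data; using a DiT-based optical-flow discriminator and a distribution-matching regularizer, it significantly improves the motion realism and temporal consistency of video diffusion models while preserving visual fidelity.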

Details Motivation: Existing video diffusion models achieve good frame-level fidelity but lack direct supervision on temporal consistency, so generated videos show jitter, ghosting, or implausible dynamics. Method: On top of a 3-step distilled video diffusion model, trains a DiT-based optical-flow discriminator to distinguish real from generated motion and adds a distribution-matching regularizer to preserve visual quality. Result: In experiments on Wan2.1-T2V-1.3B, MoGAN improves the VBench motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model; on VideoJAM-Bench the gains are +7.4% and +8.8%, respectively, with aesthetic and image-quality scores maintained or improved. A human preference study also favors MoGAN on motion quality. Conclusion: MoGAN improves motion realism and coherence in video generation without sacrificing visual quality or inference efficiency, offering a practical path to fast, high-quality video generation. Abstract: Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, allowing models to achieve low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Built atop a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity. With experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage is: https://xavihart.github.io/mogan.

[154] ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images

M. Naseer Subhani

Main category: cs.CV

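TL;DR: Proposes a self-prompting, point-supervised framework that improves SAM's segmentation performance on remote sensing images through a Refine-Requery-Reinforce loop.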

Details Motivation: Because of severe domain shift and the lack of dense annotations, existing interactive segmentation models such as SAM perform poorly on remote sensing images. Method: A Refine-Requery-Reinforce loop: coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Result: On three remote sensing benchmarks (WHU, HRSID, and NWPU VHR-10), the method consistently surpasses pretrained SAM and recent point-supervised methods. Conclusion: Self-prompting and semantic alignment provide an effective path to scalable adaptation of foundation segmentation models to remote sensing using point-level annotations. Abstract: Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting, point-supervised framework that adapts SAM to RSIs using only sparse point annotations. Our method employs a Refine-Requery-Reinforce loop, where coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM's segmentation quality and domain robustness through self-guided prompt adaptation. We evaluate our proposed method on three RSI benchmark datasets, including WHU, HRSID, and NWPU VHR-10, showing that our method consistently surpasses pretrained SAM and recent point-supervised segmentation methods. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.
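One plausible way to build a box prompt from a coarse pseudo-mask, as in the Requery step, is to take the tight bounding box of the foreground pixels. The sketch below assumes exactly that; the abstract does not say how the box prompts are constructed, and `mask_to_box` is a hypothetical helper name.

```python
def mask_to_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the foreground pixels, or None if the mask is empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(cols), rows[0], max(cols), rows[-1])


# Toy 4x5 pseudo-mask (1 = foreground), standing in for a coarse mask predicted from a point.
pseudo_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

box_prompt = mask_to_box(pseudo_mask)
print(box_prompt)
```

The resulting box could then be passed back to the segmentation model as a prompt for the next iteration; padding or shrinking the box is a design choice this sketch leaves out.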

[155] Active Learning for GCN-based Action Recognition

Hichem Sahbi

Main category: cs.CV

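TL;DR: Proposes a label-efficient graph convolutional network (GCN) model that, through a novel sampling strategy and bidirectional, stable architectures, significantly improves skeleton-based action recognition with few labeled samples.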

Details Motivation: Existing GCNs for skeleton-based action recognition depend on large amounts of labeled data, which are often scarce in practice, limiting their application. Method: A novel acquisition function based on an adversarial strategy selects key exemplars for labeling that are representative, diverse, and uncertain; bidirectional, stable GCN architectures are introduced to better model the mapping between the ambient and latent spaces. Result: Extensive experiments on two challenging skeleton-based action recognition benchmarks show significant gains over prior work in the few-label setting. Conclusion: The proposed label-efficient GCN makes effective use of limited labeled data and offers a more practical, robust solution for skeleton-based action recognition. Abstract: Despite the notable success of graph convolutional networks (GCNs) in skeleton-based action recognition, their performance often depends on large volumes of labeled data, which are frequently scarce in practical settings. To address this limitation, we propose a novel label-efficient GCN model. Our work makes two primary contributions. First, we develop a novel acquisition function that employs an adversarial strategy to identify a compact set of informative exemplars for labeling. This selection process balances representativeness, diversity, and uncertainty. Second, we introduce bidirectional and stable GCN architectures. These enhanced networks facilitate a more effective mapping between the ambient and latent data spaces, enabling a better understanding of the learned exemplar distribution. Extensive evaluations on two challenging skeleton-based action recognition benchmarks reveal significant improvements achieved by our label-efficient GCNs compared to prior work.
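To make the idea of an acquisition function concrete, here is a deliberately simple greedy selector that scores each unlabeled sample by predictive entropy (uncertainty) plus distance to the samples already chosen (diversity). It is a generic active-learning baseline, not the adversarial acquisition function the paper proposes, and it omits the representativeness term; all names and values are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(features, probs, budget, diversity_weight=1.0):
    """Greedily pick `budget` samples that are uncertain and far from earlier picks."""
    selected = []
    candidates = list(range(len(features)))
    while candidates and len(selected) < budget:
        def score(i):
            uncertainty = entropy(probs[i])
            # Distance to the nearest already-selected sample; 0 for the first pick.
            diversity = min(
                (math.dist(features[i], features[j]) for j in selected),
                default=0.0,
            )
            return uncertainty + diversity_weight * diversity

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical 2-D embeddings and softmax outputs for four unlabeled samples.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
probs = [[0.5, 0.5], [0.55, 0.45], [0.6, 0.4], [0.9, 0.1]]

picked = select_for_labeling(features, probs, budget=2)
print(picked)
```

The first pick is the most uncertain sample; the second comes from the distant cluster because the diversity term outweighs the small entropy gap.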

[156] Qwen3-VL Technical Report

Shuai Bai,Yuxuan Cai,Ruizhe Chen,Keqin Chen,Xionghui Chen,Zesen Cheng,Lianghao Deng,Wei Ding,Chang Gao,Chunjiang Ge,Wenbin Ge,Zhifang Guo,Qidong Huang,Jie Huang,Fei Huang,Binyuan Hui,Shutong Jiang,Zhaohai Li,Mingsheng Li,Mei Li,Kaixin Li,Zicheng Lin,Junyang Lin,Xuejing Liu,Jiawei Liu,Chenglong Liu,Yang Liu,Dayiheng Liu,Shixuan Liu,Dunjie Lu,Ruilin Luo,Chenxu Lv,Rui Men,Lingchen Meng,Xuancheng Ren,Xingzhang Ren,Sibo Song,Yuchong Sun,Jun Tang,Jianhong Tu,Jianqiang Wan,Peng Wang,Pengfei Wang,Qiuyue Wang,Yuxuan Wang,Tianbao Xie,Yiheng Xu,Haiyang Xu,Jin Xu,Zhibo Yang,Mingkun Yang,Jianxin Yang,An Yang,Bowen Yu,Fei Zhang,Hang Zhang,Xi Zhang,Bo Zheng,Humen Zhong,Jingren Zhou,Fan Zhou,Jing Zhou,Yuanzhi Zhu,Ke Zhu

Main category: cs.CV

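TL;DR: Qwen3-VL is the most capable vision-language model in the Qwen series to date; it supports interleaved text-image-video contexts of up to 256K tokens, performs strongly in pure-text understanding, long-context modeling, and multimodal reasoning, and improves spatial-temporal modeling and cross-modal alignment through three architectural upgrades.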

Details Motivation: To improve the multimodal understanding and reasoning of vision-language models in real-world scenarios, especially the bottlenecks in long contexts, complex visual inputs (such as multiple images and video), and cross-modal alignment. Method: The Qwen3-VL model family, with dense and MoE variants; it introduces an enhanced interleaved-MRoPE, DeepStack integration to fuse multi-level ViT features, and text-based timestamp alignment in place of T-RoPE for more precise temporal grounding in video. Result: Leading results on several established multimodal benchmarks, including MMMU, MathVista, and MathVision, with strong pure-text understanding, 256K long-context handling, and advanced reasoning across single-image, multi-image, and video inputs. Conclusion: Qwen3-VL achieves strong multimodal performance across scales and architectures and could serve as a foundation model for practical applications in image-grounded reasoning, agentic decision-making, and multimodal code understanding. Abstract: We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures.
We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

[157] Continual Error Correction on Low-Resource Devices

Kirill Paramonov,Mete Ozay,Aristeidis Mystakidis,Nikolaos Tsalikidis,Dimitrios Sotos,Anastasios Drosou,Dimitrios Tzovaras,Hyunjun Kim,Kiseok Chang,Sangdok Mo,Namwoong Kim,Woojong Yoo,Jijoong Moon,Umberto Michieli

Main category: cs.CV

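TL;DR: Proposes an efficient AI error-correction system based on prototype updates, combining server-side knowledge distillation with on-device prototype adaptation to achieve low-overhead, accurate few-shot error correction on resource-constrained devices.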

Details Motivation: Existing AI error detection methods lack efficient correction mechanisms, and model retraining is especially impractical on resource-constrained devices. A real-time correction scheme with low compute and storage cost is needed to improve user experience. Method: A server-side foundation model is distilled into a lightweight device model; on the device, a prototype learning mechanism updates class prototypes from a few samples instead of retraining the model, enabling fast error correction. Result: In one-shot scenarios on Food-101 and Flowers-102, the error correction rate exceeds 50% with forgetting below 0.02% and very low computational overhead; an Android app validates practical usability. Conclusion: The system achieves efficient, sustainable AI error correction on resource-constrained devices, balancing performance, storage, and user experience, with good prospects for real-world deployment. Abstract: The proliferation of AI models in everyday devices has highlighted a critical challenge: prediction errors that degrade user experience. While existing solutions focus on error detection, they rarely provide efficient correction mechanisms, especially for resource-constrained devices. We present a novel system enabling users to correct AI misclassifications through few-shot learning, requiring minimal computational resources and storage. Our approach combines server-side foundation model training with on-device prototype-based classification, enabling efficient error correction through prototype updates rather than model retraining. The system consists of two key components: (1) a server-side pipeline that leverages knowledge distillation to transfer robust feature representations from foundation models to device-compatible architectures, and (2) a device-side mechanism that enables ultra-efficient error correction through prototype adaptation. We demonstrate our system's effectiveness on both image classification and object detection tasks, achieving over 50% error correction in one-shot scenarios on Food-101 and Flowers-102 datasets while maintaining minimal forgetting (less than 0.02%) and negligible computational overhead. Our implementation, validated through an Android demonstration app, proves the system's practicality in real-world scenarios.
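The device-side idea, correcting errors by updating class prototypes instead of retraining, can be sketched with a nearest-prototype classifier whose prototypes are running means of embeddings. The running-mean update rule, the class name, and the 2-D toy embeddings are assumptions for illustration; the abstract does not specify the adaptation rule.

```python
import math


class PrototypeClassifier:
    """Nearest-prototype classifier over fixed feature embeddings."""

    def __init__(self):
        self.prototypes = {}  # label -> (mean embedding, number of samples seen)

    def update(self, label, embedding):
        """Fold one embedding into the running mean of its class prototype."""
        if label not in self.prototypes:
            self.prototypes[label] = (list(embedding), 1)
            return
        mean, count = self.prototypes[label]
        count += 1
        mean = [m + (x - m) / count for m, x in zip(mean, embedding)]
        self.prototypes[label] = (mean, count)

    def predict(self, embedding):
        """Return the label whose prototype is closest in Euclidean distance."""
        return min(
            self.prototypes,
            key=lambda label: math.dist(self.prototypes[label][0], embedding),
        )


clf = PrototypeClassifier()
clf.update("pizza", [1.0, 0.0])
clf.update("salad", [0.0, 1.0])

query = [0.4, 0.6]               # embedding of an image the user knows is pizza
before = clf.predict(query)      # misclassified as the nearer "salad" prototype
clf.update("pizza", [0.2, 0.8])  # one-shot correction: a single labeled example
after = clf.predict(query)
print(before, after)
```

Each correction costs one vector update and no gradient step, which is why this style of adaptation suits devices with little compute or storage.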

[158] CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow

Ruisheng Han,Kanglei Zhou,Shuang Chen,Amir Atapour-Abarghouei,Hubert P. H. Shum

Main category: cs.CV

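TL;DR: Proposes CaFlow, a framework that combines counterfactual de-confounding with bidirectional time-conditioned flow for long-term action quality assessment, achieving state-of-the-art performance.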

Details Motivation: Existing methods rely on costly annotations or unidirectional temporal modeling, struggle to model long-term dynamics, and are vulnerable to confounders, which makes predictions unstable. Method: A Causal Counterfactual Regularization (CCR) module disentangles causal and confounding features in a self-supervised manner and strengthens robustness through counterfactual interventions; a BiT-Flow module models bidirectional temporal dynamics under a cycle-consistency constraint. Result: State-of-the-art performance on multiple long-term AQA benchmarks, validating the method's effectiveness in de-confounding and temporal modeling. Conclusion: CaFlow improves the accuracy and stability of long-term action quality assessment through causal reasoning and bidirectional temporal modeling, and generalizes well. Abstract: Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation. Long-term AQA, as in figure skating or rhythmic gymnastics, is especially challenging since it requires modeling extended temporal dynamics while remaining robust to contextual confounders. Existing approaches either depend on costly annotations or rely on unidirectional temporal modeling, making them vulnerable to spurious correlations and unstable long-term representations. To this end, we propose CaFlow, a unified framework that integrates counterfactual de-confounding with bidirectional time-conditioned flow. The Causal Counterfactual Regularization (CCR) module disentangles causal and confounding features in a self-supervised manner and enforces causal robustness through counterfactual interventions, while the BiT-Flow module models forward and backward dynamics with a cycle-consistency constraint to produce smoother and more coherent representations. Extensive experiments on multiple long-term AQA benchmarks demonstrate that CaFlow achieves state-of-the-art performance. Code is available at https://github.com/Harrison21/CaFlow

[159] Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

Tianyi Xiong,Yi Ge,Ming Li,Zuolong Zhang,Pranav Kulkarni,Kaishen Wang,Qi He,Zeying Zhu,Chenxi Liu,Ruibo Chen,Tong Zheng,Yanshuo Chen,Xiyao Wang,Renrui Zhang,Wenhu Chen,Heng Huang

Main category: cs.CV

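TL;DR: Multi-Crit is a benchmark for evaluating how well multimodal models judge under diverse, fine-grained evaluation criteria; it reveals shortcomings of current large models in following pluralistic criteria and making criterion-level judgments.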

Details Motivation: To explore how well large models acting as multimodal judges follow diverse, fine-grained evaluation criteria, an ability that remains underexplored. Method: Builds the Multi-Crit benchmark, covering open-ended generation and verifiable reasoning tasks; a rigorous data curation pipeline collects challenging response pairs with multi-criterion human annotations, and three new metrics assess pluralistic adherence, criterion-switching flexibility, and recognition of preference conflicts. Result: Analysis of 25 large models shows that (1) proprietary models still struggle to stay consistent across pluralistic criteria, especially in open-ended evaluation; (2) open-source models lag further behind in flexibly following diverse criteria; and (3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Conclusion: Multi-Crit lays the groundwork for reliable and steerable multimodal AI evaluation and exposes the limitations of current multimodal judges and directions for improvement. Abstract: Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria--especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.

[160] Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models

Naifu Zhang,Wei Tao,Xi Xiao,Qianpu Sun,Yuxin Zheng,Wentao Mo,Peiqiang Wang,Nan Zhang

Main category: cs.CV

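TL;DR: Proposes ADVLA, a framework that applies adversarial perturbations directly to features projected from the visual encoder into the textual feature space, yielding efficient, low-amplitude, sparse, and hard-to-perceive attacks that significantly outperform conventional patch-based adversarial attacks.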

Details Motivation: Existing adversarial attacks on VLA models require costly end-to-end training and produce perturbation patches that are usually clearly visible, which limits their practical use. A more efficient, stealthy, and low-cost attack is therefore needed. Method: ADVLA applies adversarial perturbations directly in the textual feature space output by the visual encoder, and introduces attention guidance and three strategies (enhancing sensitivity, enforcing sparsity, concentrating perturbations), combined with Top-K masking for locally sparse perturbations. Result: Under an $L_{\infty}=4/255$ constraint, ADVLA modifies fewer than 10% of image patches with an attack success rate close to 100%; perturbations concentrate on critical regions and are almost imperceptible, and a single-step iteration takes about 0.06 seconds, significantly better than conventional methods. Conclusion: ADVLA weakens the action predictions of VLA models under low-amplitude, locally sparse conditions while avoiding high training costs and conspicuous perturbations, showing the effectiveness and practical value of attacking VLA feature spaces. Abstract: In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies fewer than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.
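The interaction between the $L_{\infty}$ budget and Top-K masking can be illustrated with a small sketch: rank patches by an attention score, keep the top K, and take one signed step of size $\epsilon$ on those patches only. The signed-gradient (FGSM-style) step, the function names, and the toy scores and gradients are assumptions; the abstract does not describe ADVLA's actual update rule.

```python
EPSILON = 4 / 255  # L-infinity budget quoted in the abstract


def sign(value):
    """Return -1, 0, or 1 according to the sign of `value`."""
    return (value > 0) - (value < 0)


def top_k_mask(scores, k):
    """Return a 0/1 mask selecting the k highest-scoring patches."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def sparse_perturbation(gradients, scores, k, epsilon=EPSILON):
    """One signed step of size epsilon, applied only to the selected patches."""
    mask = top_k_mask(scores, k)
    return [
        [m * epsilon * sign(g) for g in patch]
        for m, patch in zip(mask, gradients)
    ]


# Hypothetical per-patch attention scores and loss gradients (two values per patch).
attention = [0.05, 0.40, 0.10, 0.30, 0.15]
gradients = [[0.2, -0.1], [-0.5, 0.3], [0.0, 0.1], [0.7, -0.7], [0.1, 0.1]]

delta = sparse_perturbation(gradients, attention, k=2)
print(delta)
```

Every entry of `delta` has magnitude at most `EPSILON`, and only the two highest-attention patches are touched, which mirrors the "low-amplitude and locally sparse" setting described above.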

[161] Revolutionizing Glioma Segmentation & Grading Using 3D MRI - Guided Hybrid Deep Learning Models

Pandiyaraju V,Sreya Mynampati,Abishek Karthik,Poovarasan L,D. Saraswathi

Main category: cs.CV

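TL;DR: Proposes a hybrid deep learning model that combines U-Net segmentation with DenseNet-VGG classification and uses multi-head attention and spatial-channel attention for high-accuracy 3D MRI segmentation and classification of gliomas; experiments report a Dice coefficient of 98% and classification accuracy of 99%.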

Details Motivation: Gliomas have a high mortality rate, and early, accurate diagnosis is critical for treatment, but traditional methods fall short in accuracy and interpretability, so a more efficient automated diagnostic approach is needed. Method: U-Net performs tumor segmentation and a hybrid DenseNet-VGG network performs classification, with multi-head attention and spatial-channel attention; the 3D MRI data are preprocessed by normalization, resampling, and data augmentation. Result: Segmentation reaches a Dice coefficient of 98% with a high mean IoU; classification reaches 99% accuracy and outperforms traditional CNNs and attention-free models in precision, recall, and F1-score. Conclusion: The hybrid framework performs strongly in automatic glioma segmentation and classification, helping clinicians diagnose and grade gliomas promptly and reliably and improving treatment planning. Abstract: Gliomas are brain tumors with a high mortality rate, so early and accurate diagnosis is important for therapeutic intervention. To address this difficulty, the proposed research develops a hybrid deep learning model that integrates U-Net-based segmentation with a hybrid DenseNet-VGG classification network equipped with multi-head attention and spatial-channel attention. The segmentation model precisely demarcates tumors in a 3D MRI volume, guided by spatial and contextual information. The classification network, which combines DenseNet and VGG branches, takes the demarcated tumor as input and uses attention mechanisms to focus on clinically relevant features. High-dimensional 3D MRI data are made usable by the model through preprocessing steps: normalization, resampling, and data augmentation. The framework is evaluated with several measures: Dice coefficient and mean Intersection over Union (IoU) for segmentation, and accuracy, precision, recall, and F1-score for classification. In testing, the proposed hybrid framework obtained a Dice coefficient of 98% in tumor segmentation and 99% classification accuracy, outperforming traditional CNN models and attention-free methods. Multi-head attention mechanisms prioritize the clinically significant aspects of the tumor and enhance interpretability and accuracy.
The results suggest that the framework has great potential to facilitate timely and reliable diagnosis and grading of glioma by clinicians, allowing for better planning of patient treatment.
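For reference, the two segmentation metrics named in the abstract can be computed on binary masks as in the sketch below. It uses flat Python lists and toy masks for clarity; treating two empty masks as perfect agreement is a convention chosen here, not something the abstract states.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two flat binary masks of equal length."""
    intersection = sum(p and t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    if union == 0:  # both masks empty: treat as perfect agreement
        return 1.0, 1.0
    dice = 2 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou


# Toy flattened masks (1 = tumor voxel), overlapping on two of four labeled voxels.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]

dice, iou = dice_and_iou(pred, target)
print(round(dice, 4), iou)
```

The two metrics are monotonically related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU for the same pair of masks.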

[162] Seeing without Pixels: Perception from Camera Trajectories

Zihui Xue,Kristen Grauman,Dima Damen,Andrew Zisserman,Tengda Han

Main category: cs.CV

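TL;DR: This paper is the first systematic study of whether video content can be perceived from camera trajectories alone, without pixels. It proposes CamFormer, a contrastive learning framework that maps camera pose sequences into a joint embedding space aligned with natural language. The results show that camera trajectories are a highly informative signal that reveals the actions or scene content in a video, with potential applications in cross-modal alignment, classification, and temporal analysis.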

Details Motivation: To explore whether video content can be understood from camera motion trajectories alone, without relying on video pixels, challenging the conventional visual understanding paradigm and tapping the semantic potential of non-appearance cues. Method: Proposes CamFormer, a contrastive-learning-based encoder framework that encodes camera pose trajectories (such as sequences of translations and rotations) into embeddings and aligns them with the corresponding text descriptions, enabling cross-modal semantic matching. Result: CamFormer performs well on a range of downstream tasks, including cross-modal retrieval, action classification, and temporal analysis; its performance remains robust across camera pose estimation methods (high-fidelity multi-sensor and standard RGB-based), validating the camera trajectory as a lightweight and robust modality. Conclusion: Camera motion trajectories carry rich semantic information sufficient to support understanding of video content, and can serve as a lightweight, general, and robust new modality for video analysis tasks. Abstract: Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
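
The abstract does not state which contrastive objective is used. As a rough illustration of aligning trajectory embeddings with text embeddings in a joint space, the sketch below assumes a CLIP-style symmetric InfoNCE loss. The function names, the temperature value of 0.07, and the toy 2-D embeddings are all hypothetical; in practice the vectors would come from a trajectory encoder and a text encoder.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cross_entropy_diagonal(logits: List[Vector]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(traj: List[Vector], text: List[Vector],
                       temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch where traj[i] is paired with text[i]."""
    traj = [l2_normalize(v) for v in traj]
    text = [l2_normalize(v) for v in text]
    # logits[i][j] = scaled cosine similarity of trajectory i and caption j.
    logits = [[sum(a * b for a, b in zip(t, s)) / temperature for s in text]
              for t in traj]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-trajectory view
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch: matching pairs give a low loss, swapped pairs a high one.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # prints "True"
```

Minimizing a loss of this kind pulls each trajectory embedding toward its paired caption and pushes it away from the other captions in the batch, which is what makes retrieval by nearest neighbor in the joint space possible.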

[163] Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Yusuf Dalva,Guocheng Gordon Qian,Maya Goldenberg,Tsai-Shien Chen,Kfir Aberman,Sergey Tulyakov,Pinar Yanardag,Kuan-Chieh Jackson Wang

Main category: cs.CV

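TL;DR: This paper proposes Canvas-to-Image, a unified framework that encodes multiple heterogeneous control signals (such as text prompts, reference images, and spatial layouts) into a single composite canvas image, enabling high-fidelity, multimodal image generation.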

Details Motivation: Existing diffusion models struggle to keep generated images faithful and compositional when handling multiple control signals at once, such as text, reference images, poses, and layouts. Method: Multiple control signals are merged into a single canvas, and a multi-task canvas training strategy jointly optimizes the model's understanding and fusion of heterogeneous controls within a unified learning paradigm. Result: Experiments show that the method significantly outperforms state-of-the-art methods on multi-task benchmarks (such as multi-person composition, pose control, layout constraints, and multi-control generation), particularly in identity preservation and control adherence. Conclusion: Canvas-to-Image models diverse user intent precisely, supports high-fidelity image generation in complex scenes, and generalizes well. Abstract: While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.