cs.CL

[1] The Overlooked Repetitive Lengthening Form in Sentiment Analysis

Lei Wang, Eduard Dragut

Main category: cs.CL

TL;DR: This paper examines the importance of the Repetitive Lengthening Form (RLF) in sentiment analysis. It introduces Lengthening, the first multi-domain dataset focused on RLF, and ExpInstruct, a two-stage instruction tuning framework that improves how well large language models understand RLF and explain their predictions.

Details

Motivation: RLF, a distinctive and emphatic informal style, has long been overlooked in sentiment analysis. The paper asks how important it is and whether language models can understand it.

Method: Curate Lengthening, the first multi-domain dataset focused on RLF (850k samples); propose ExpInstruct, a two-stage instruction tuning framework; design a unified approach to quantify how well language models understand informal expressions.

Result: RLF sentences are strongly expressive and can serve as signatures of document-level sentiment. Fine-tuned pre-trained language models surpass zero-shot GPT-4 in performance on RLF but not in explanation. With limited samples, ExpInstruct brings open-source LLMs up to zero-shot GPT-4 in both performance and explainability.

Conclusion: RLF is an informal style that sentiment analysis should not ignore. ExpInstruct improves both the understanding and the explainability of models on RLF, and RLF has potential value for online content analysis.

Abstract: Individuals engaging in online communication frequently express personal opinions with informal styles (e.g., memes and emojis). While Language Models (LMs) with informal communications have been widely discussed, a unique and emphatic style, the Repetitive Lengthening Form (RLF), has been overlooked for years. In this paper, we explore answers to two research questions: 1) Is RLF important for sentiment analysis (SA)? 2) Can LMs understand RLF? Inspired by previous linguistic research, we curate Lengthening, the first multi-domain dataset with 850k samples focused on RLF for SA. Moreover, we introduce Explainable Instruction Tuning (ExpInstruct), a two-stage instruction tuning framework aimed to improve both performance and explainability of LLMs for RLF. We further propose a novel unified approach to quantify LMs' understanding of informal expressions. We show that RLF sentences are expressive expressions and can serve as signatures of document-level sentiment. Additionally, RLF has potential value for online content analysis. Our results show that fine-tuned Pre-trained Language Models (PLMs) can surpass zero-shot GPT-4 in performance but not in explanation for RLF. Finally, we show ExpInstruct can improve the open-sourced LLMs to match zero-shot GPT-4 in performance and explainability for RLF with limited samples. Code and sample data are available at https://github.com/Tom-Owl/OverlookedRLF
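As a rough illustration of what an RLF token looks like in raw text, the sketch below flags words that contain a letter repeated three or more times in a row. The three-repeat threshold, the restriction to letters, and the helper name `find_rlf_tokens` are assumptions made for this example; the abstract does not state the paper's operational definition of RLF.

```python
import re

# Assumption: a letter repeated 3+ times in a row signals lengthening.
# [^\W\d_] matches letters only (no digits, no underscore).
_RUN = re.compile(r"([^\W\d_])\1{2,}", re.IGNORECASE)


def find_rlf_tokens(text):
    """Return whitespace-separated tokens that contain a lengthened run."""
    return [tok for tok in text.split() if _RUN.search(tok)]


if __name__ == "__main__":
    print(find_rlf_tokens("That movie was sooooo goooood"))
```

Digits are excluded on purpose so that numbers such as `1000` are not flagged; a real pipeline would also need to handle genuine triple letters and punctuation runs.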

[2] Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming

Qianfan Zhang, Tianyu Guo, Xuandi Ren, Jiale Chen, Ming Ding, Ran Xin, Xia Xiao

Main category: cs.CL

TL;DR: This paper scales the reasoning token budget for competitive programming through training-time reinforcement learning (RL) and test-time parallel thinking, substantially improving performance on hard problems.

Details

Motivation: Competitive programming demands large reasoning token budgets, but scaling single-generation reasoning quickly becomes expensive under full attention, so tokens need to be used more efficiently.

Method: (1) At training time, apply verification RL warmup and randomized clipping to shift the accuracy-versus-tokens trajectory. (2) At test time, use a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement, and train the model end-to-end on that pipeline.

Result: Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model's oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard AetherCode problems.

Conclusion: Jointly optimizing how reasoning tokens are allocated at training and test time improves performance on complex reasoning tasks efficiently, and supports parallel thinking as an effective structure for code reasoning.

Abstract: We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time structure. Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model's oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard competitive programming problems from AetherCode.
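The comparison between pass@1 and oracle pass@16 can be made concrete with the widely used unbiased pass@k estimator, which gives the probability that at least one of k samples drawn from n generated solutions is correct. This is a minimal sketch; whether the paper computes its numbers with this exact estimator is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `k == n` the estimator reduces to the oracle view: it is 1.0 if any of the n samples is correct and 0.0 otherwise.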

[3] M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency

Abolfazl Ansari, Delvin Ce Zhang, Zhuoyang Zou, Wenpeng Yin, Dongwon Lee

Main category: cs.CL

TL;DR: This paper introduces M2-Verify, a large-scale, multimodal, multi-domain dataset for checking the consistency of scientific claims against complex visual and textual evidence. Experiments show that current state-of-the-art models struggle on this task, especially in high-complexity cases, and tend to hallucinate.

Details

Motivation: Existing benchmarks lack the scale, domain diversity, and visual complexity needed to evaluate realistically whether a scientific claim is consistent with its multimodal evidence.

Method: Build M2-Verify from PubMed and arXiv, with over 469K instances across 16 domains, validated through expert audits; run baseline experiments and expert evaluations of model behavior.

Result: Top models reach up to 85.8% Micro-F1 on low-complexity medical perturbations but drop to 61.6% on high-complexity challenges such as anatomical shifts. Expert evaluations find hallucinations in the scientific explanations that models generate.

Conclusion: M2-Verify fills a gap in benchmarks for multimodal scientific claim verification, exposes clear limits of current models in strict consistency reasoning, and comes with usage guidelines for future work.

Abstract: Evaluating scientific arguments requires assessing the strict consistency between a claim and its underlying multimodal evidence. However, existing benchmarks lack the scale, domain diversity, and visual complexity needed to evaluate this alignment realistically. To address this gap, we introduce M2-Verify, a large-scale multimodal dataset for checking scientific claim consistency. Sourced from PubMed and arXiv, M2-Verify provides over 469K instances across 16 domains, rigorously validated through expert audits. Extensive baseline experiments show that state-of-the-art models struggle to maintain robust consistency. While top models achieve up to 85.8% Micro-F1 on low-complexity medical perturbations, performance drops to 61.6% on high-complexity challenges like anatomical shifts. Furthermore, expert evaluations expose hallucinations when models generate scientific explanations for their alignment decisions. Finally, we demonstrate our dataset's utility and provide comprehensive usage guidelines.

[4] Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences

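Simona-Vasilica Oprea, Adela Bâra

Main category: cs.CL

TL;DR: This paper examines the difficulty of learning human preferences in language models and proposes a feature-augmented reward modeling framework. Adding interpretable features (response length, refusal signals, toxicity scores, and semantic similarity) raises ROC AUC on the HHRLHF dataset to as much as 0.84 and improves pairwise accuracy. SHAP and LIME analyses show that decisions depend on contextualized safety and supportive framing rather than isolated keywords.

Details

Motivation: Reward modeling relies on subtle, subjective preference comparisons rather than clear-cut labels. This caps baseline performance (ROC AUC below 0.74) and makes it hard to capture the multidimensional nature of human judgment.

Method: Evaluate ten LLMs on the HHRLHF dataset under a standard pairwise preference setting; enrich the textual representation with interpretable features (response length, refusal indicators, toxicity scores, prompt-response semantic similarity) in a hybrid framework; use SHAP and LIME for interpretability; analyze how feature interactions contribute to bias amplification.

Result: The hybrid approach improves all models, reaching up to 0.84 ROC AUC with significantly higher pairwise accuracy; DeBERTav3Large performs best. Model decisions depend on contextualized safety and supportive framing. Individual features have weak marginal effects, while their interactions influence preference learning.

Conclusion: Feature augmentation improves both the performance and the interpretability of reward models, and human judgment should be modeled as multidimensional and interactive rather than from text representations alone.

Abstract: Learning human preferences in language models remains fundamentally challenging, as reward modeling relies on subtle, subjective comparisons or shades of gray rather than clear-cut labels. This study investigates the limits of current approaches and proposes a feature-augmented framework to better capture the multidimensional nature of human judgment. Using the Anthropic HHRLHF dataset, we evaluate ten diverse large language models (LLMs) under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task. To address this, we enrich textual representations with interpretable signals: response length, refusal indicators, toxicity scores and prompt-response semantic similarity, enabling models to explicitly capture key aspects of helpfulness, safety and relevance. The proposed hybrid approach yields consistent improvements across all models, achieving up to 0.84 ROC AUC and significantly higher pairwise accuracy, with DeBERTav3Large demonstrating the best performance. Beyond accuracy, we integrate SHAP and LIME to provide fine-grained interpretability, revealing that model decisions depend on contextualized safety and supportive framing rather than isolated keywords. We further analyze bias amplification, showing that while individual features have weak marginal effects, their interactions influence preference learning.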
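To make the interpretable signals concrete, here is a minimal sketch of three of them plus the pairwise accuracy metric. Everything in it is an assumption for illustration: the refusal marker list is hypothetical, Jaccard token overlap stands in for the embedding-based semantic similarity, and the toxicity score is omitted because it requires a separate classifier.

```python
# Hypothetical marker list; a real system would use a broader, validated set.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def interpretable_features(prompt, response):
    """Return simple, human-readable features for one prompt/response pair."""
    p_tokens = set(prompt.lower().split())
    r_tokens = set(response.lower().split())
    union = p_tokens | r_tokens
    return {
        "length": len(response.split()),
        "refusal": int(any(m in response.lower() for m in REFUSAL_MARKERS)),
        # Jaccard token overlap: a crude stand-in for semantic similarity.
        "overlap": len(p_tokens & r_tokens) / len(union) if union else 0.0,
    }


def pairwise_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs where the chosen response outscores the rejected one."""
    pairs = list(zip(chosen_scores, rejected_scores))
    if not pairs:
        raise ValueError("no pairs to evaluate")
    return sum(c > r for c, r in pairs) / len(pairs)
```

In a hybrid setup of the kind the abstract describes, such features would be concatenated with or fed alongside the text representation before scoring each response.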

[5] Procedural Knowledge at Scale Improves Reasoning

Di Wu, Devendra Singh Sachan, Wen-tau Yih, Mingda Chen

Main category: cs.CL

TL;DR: This paper proposes Reasoning Memory, a retrieval-augmented generation (RAG) framework that extracts procedural knowledge (such as problem reframing, approach selection, and verification or backtracking) from a large corpus of reasoning trajectories and retrieves relevant subroutines at inference time to improve language models on math, science, and coding tasks.

Details

Motivation: Most test-time scaling methods treat each problem in isolation and do not systematically reuse the procedural knowledge in prior reasoning trajectories: how to reframe a problem, choose an approach, and verify or backtrack.

Method: (1) Decompose existing step-by-step reasoning trajectories into 32 million self-contained subquestion-subroutine pairs that form a procedural knowledge datastore. (2) At inference time, use a lightweight in-thought prompt so the model verbalizes the core subquestion, retrieves matching subroutines, and reasons with them as implicit procedural priors.

Result: Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9%. Ablations attribute the gains to the broad procedural coverage of the source trajectories and to the decomposition and retrieval design.

Conclusion: Explicitly modeling and reusing procedural knowledge is an effective way to improve complex reasoning in large models, and Reasoning Memory offers a scalable, retrievable, and reusable approach to test-time reasoning.

Abstract: Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.

[6] No Attacker Needed: Unintentional Cross-User Contamination in Shared-State LLM Agents

Tiankai Yang, Jiate Li, Yi Nian, Shen Dong, Ruiyao Xu, Ryan Rossi, Kaize Ding, Yue Zhao

Main category: cs.CL

TL;DR: This paper identifies and formalizes unintentional cross-user contamination (UCC), a failure mode in which a shared-state LLM agent serving multiple users reapplies scope-bound information from benign interactions and degrades another user's outcome. Experiments find contamination rates of 57-71%, and text-level sanitization alone is not enough: defenses must also cover executable artifacts.

Details

Motivation: When an LLM agent serves multiple users over a shared persistent knowledge layer, benign interactions can interfere across users because the agent ignores the scope of stored information. Prior work has not systematically identified or modeled this non-adversarial, silent failure.

Method: Define UCC and formalize it through a controlled evaluation protocol, introduce a taxonomy of three contamination types, and evaluate two shared-state mechanisms (raw shared state and shared state with write-time sanitization).

Result: Under raw shared state, contamination rates reach 57-71%. Write-time text sanitization is effective when shared state is conversational, but leaves substantial residual risk when shared state includes executable artifacts, where contamination often shows up as silent wrong answers.

Conclusion: Shared-state LLM agents need artifact-level defenses beyond text-level sanitization to prevent silent cross-user failures.

Abstract: LLM-based agents increasingly operate across repeated sessions, maintaining task states to ensure continuity. In many deployments, a single agent serves multiple users within a team or organization, reusing a shared knowledge layer across user identities. This shared persistence expands the failure surface: information that is locally valid for one user can silently degrade another user's outcome when the agent reapplies it without regard for scope. We refer to this failure mode as unintentional cross-user contamination (UCC). Unlike adversarial memory poisoning, UCC requires no attacker; it arises from benign interactions whose scope-bound artifacts persist and are later misapplied. We formalize UCC through a controlled evaluation protocol, introduce a taxonomy of three contamination types, and evaluate the problem in two shared-state mechanisms. Under raw shared state, benign interactions alone produce contamination rates of 57-71%. A write-time sanitization is effective when shared state is conversational, but leaves substantial residual risk when shared state includes executable artifacts, with contamination often manifesting as silent wrong answers. These results indicate that shared-state agents need artifact-level defenses beyond text-level sanitization to prevent silent cross-user failures.
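The failure mode can be illustrated with a toy shared memory in which one user's locally valid setting is later returned to another user. The class and method names are hypothetical, and the scope tag shown here is only an illustrative mitigation for this toy case, not the defense the paper evaluates or proposes.

```python
class SharedMemory:
    """Toy shared knowledge layer used by one agent serving several users."""

    def __init__(self):
        self._entries = {}  # key -> (value, owner, scope)

    def write(self, key, value, owner, scope="global"):
        self._entries[key] = (value, owner, scope)

    def read_raw(self, key):
        """Scope-blind read: returns whatever was stored, whoever asks."""
        entry = self._entries.get(key)
        return entry[0] if entry else None

    def read_scoped(self, key, user):
        """Return user-scoped entries only to the user who wrote them."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, owner, scope = entry
        if scope == "user" and owner != user:
            return None
        return value


if __name__ == "__main__":
    mem = SharedMemory()
    mem.write("date_format", "DD/MM/YYYY", owner="alice", scope="user")
    print(mem.read_raw("date_format"))            # leaks Alice's preference
    print(mem.read_scoped("date_format", "bob"))  # withheld from Bob
```

No attacker is involved: Alice's write is benign, and the damage comes only from the scope-blind read on Bob's behalf.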

[7] Open-Domain Safety Policy Construction

Di Wu, Siyue Liu, Zixiang Ji, Ya-Liang Chang, Zhe-Yu Liu, Andrew Pleffer, Kai-Wei Chang

Main category: cs.CL

TL;DR: This paper presents Deep Policy Research (DPR), a minimal, task-specific agentic system that drafts a full content moderation policy from only human-written seed domain information, using iterative web search, distillation, and structured organization. On several benchmarks DPR outperforms definition-only and in-context learning baselines and is competitive with expert-written policy sections.

Details

Motivation: Drafting and maintaining domain-specific safety policies is costly, so automated support for policy drafting is needed.

Method: DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill diverse web sources into policy rules, and organize the rules into an indexed document.

Result: On the OpenAI undesired content benchmark (five domains) and an in-house multimodal advertisement moderation benchmark, DPR consistently outperforms definition-only and in-context learning baselines. In the end-to-end setting it is competitive with expert-written policy sections in several domains, and it outperforms a general-purpose deep research system.

Conclusion: A task-specific, structured research loop can be more effective than generic web research for policy drafting, and DPR offers a practical route to low-cost policy generation.

Abstract: Moderation layers are increasingly a core component of many products built on user- or model-generated content. However, drafting and maintaining domain-specific safety policies remains costly. We present Deep Policy Research (DPR), a minimal agentic system that drafts a full content moderation policy based on only human-written seed domain information. DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill diverse web sources into policy rules, and organize rules into an indexed document. We evaluate DPR on (1) the OpenAI undesired content benchmark across five domains with two compact reader LLMs and (2) an in-house multimodal advertisement moderation benchmark. DPR consistently outperforms definition-only and in-context learning baselines, and in our end-to-end setting it is competitive with expert-written policy sections in several domains. Moreover, under the same seed specification and evaluation protocol, DPR outperforms a general-purpose deep research system, suggesting that a task-specific, structured research loop can be more effective than generic web research for policy drafting. We release our experiment code at https://github.com/xiaowu0162/deep-policy-research.

[8] Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

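Itay Yona, Dan Barzilay, Michael Karasik, Mor Geva

Main category: cs.CL

TL;DR: This paper studies the internal mechanisms behind entity-centric factual question answering in language models. By localizing entity-selective MLP neurons and applying causal interventions, it finds that these neurons concentrate in early layers and that activating a single neuron is often enough to recover entity-consistent predictions, which supports a canonicalization interpretation.

Details

Motivation: Language models can answer many entity-centric factual questions, but the internal mechanisms involved remain unclear.

Method: Localize entity-selective MLP neurons with templated prompts about each entity, then validate them with causal interventions on PopQA-based QA examples.

Result: On a curated set of 200 entities, localized neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, and controlled injection at a placeholder token improves answer retrieval. For many entities, activating a single localized neuron is sufficient to recover entity-consistent predictions. The effect is robust to aliases, acronyms, misspellings, and multilingual forms, though it is not universal and coverage is higher for popular entities.

Conclusion: Language models contain sparse, causally actionable access points that can be used to analyze and modulate entity-conditioned factual behavior.

Abstract: Language models can answer many entity-centric factual questions, but it remains unclear which internal mechanisms are involved in this process. We study this question across multiple language models. We localize entity-selective MLP neurons using templated prompts about each entity, and then validate them with causal interventions on PopQA-based QA examples. On a curated set of 200 entities drawn from PopQA, localized neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, while controlled injection at a placeholder token improves answer retrieval relative to mean-entity and wrong-cell controls. For many entities, activating a single localized neuron is sufficient to recover entity-consistent predictions once the context is initialized, consistent with compact entity retrieval rather than purely gradual enrichment across depth. Robustness to aliases, acronyms, misspellings, and multilingual forms supports a canonicalization interpretation. The effect is strong but not universal: not every entity admits a reliable single-neuron handle, and coverage is higher for popular entities. Overall, these results identify sparse, causally actionable access points for analyzing and modulating entity-conditioned factual behavior.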

[9] Assessing Pause Thresholds for empirical Translation Process Research

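Devi Sri Bandaru, Michael Carl, Xinyue Ren

Main category: cs.CL

TL;DR: This paper compares three approaches for computing typing pause thresholds and proposes and evaluates a new method for computing Production Unit Breaks, with the aim of separating automated from reflective translation processes more accurately.

Details

Motivation: There is a long-running debate about how to determine the pause thresholds that separate automated from reflective translation processes, which calls for a systematic comparison and a better method.

Method: Compare three recent approaches for computing pause thresholds, and propose and evaluate a novel method for computing Production Unit Breaks.

Result: A new method for identifying Production Unit Breaks in translation is proposed and its effectiveness is evaluated.

Conclusion: The proposed method shows potential for distinguishing types of translation processes and offers more reliable support for keystroke-logging-based translation process research.

Abstract: Text production (and translations) proceeds in the form of stretches of typing, interrupted by keystroke pauses. It is often assumed that fast typing reflects unchallenged/automated translation production while long(er) typing pauses are indicative of translation problems, hurdles or difficulties. Building on a long discussion concerning the determination of pause thresholds that separate automated from presumably reflective translation processes (O'Brien, 2006; Alves and Vale, 2009; Timarova et al., 2011; Dragsted and Carl, 2013; Lacruz et al., 2014; Kumpulainen, 2015; Heilmann and Neumann, 2016), this paper compares three recent approaches for computing these pause thresholds, and suggests and evaluates a novel method for computing Production Unit Breaks.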

[10] Adaptive Stopping for Multi-Turn LLM Reasoning

Xiaofan Zhou, Huy Nguyen, Bo Yu, Chenxi Liu, Lu Cheng

Main category: cs.CL

TL;DR: This paper proposes MiCP, the first conformal prediction (CP) framework for multi-turn reasoning. By allocating different error budgets across turns, it enables adaptive stopping while maintaining an overall coverage guarantee, and it reduces inference cost and prediction set size.

Details

Motivation: Existing multi-turn reasoning methods for LLMs lack a formal stopping criterion. In high-stakes domains, stopping too early risks incorrect decisions, while unnecessary turns increase cost and latency.

Method: Propose Multi-Turn Language Models with Conformal Prediction (MiCP), which allocates error budgets across turns in adaptive RAG and ReAct pipelines to support early stopping with an overall coverage guarantee.

Result: MiCP achieves the target coverage on single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. The paper also introduces a new metric that jointly evaluates coverage validity and answering efficiency.

Conclusion: MiCP extends conformal prediction to multi-turn language model reasoning for the first time and gives high-reliability AI systems an adaptive stopping mechanism with statistical guarantees.

Abstract: Large Language Models (LLMs) increasingly rely on multi-turn reasoning and interaction, such as adaptive retrieval-augmented generation (RAG) and ReAct-style agents, to answer difficult questions. These methods improve accuracy by iteratively retrieving information, reasoning, or acting, but introduce a key challenge: When should the model stop? Existing approaches rely on heuristic stopping rules or fixed turn budgets and provide no formal guarantees that the final prediction still contains the correct answer. This limitation is particularly problematic in high-stakes domains such as finance and healthcare, where unnecessary turns increase cost and latency, while stopping too early risks incorrect decisions. Conformal prediction (CP) provides formal coverage guarantees, but existing LLM-CP methods only apply to a single model output and cannot handle multi-turn pipelines with adaptive stopping. To address this gap, we propose Multi-Turn Language Models with Conformal Prediction (MiCP), the first CP framework for multi-turn reasoning. MiCP allocates different error budgets across turns, enabling the model to stop early while maintaining an overall coverage guarantee. We demonstrate MiCP on adaptive RAG and ReAct, where it achieves the target coverage on both single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. We further introduce a new metric that jointly evaluates coverage validity and answering efficiency.
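The idea of spending an error budget across turns can be sketched with standard split conformal prediction. The threshold function below is the usual finite-sample quantile; the budget split is a simple union-bound style allocation chosen for illustration. MiCP's actual allocation rule is not given in the abstract, so `split_error_budget` and its weights are assumptions.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[rank - 1]


def split_error_budget(alpha, weights):
    """Divide a total miscoverage budget alpha across turns by weight.

    Assumption: per-turn budgets that sum to alpha keep the overall
    miscoverage at most alpha by a union bound.
    """
    total = sum(weights)
    return [alpha * w / total for w in weights]
```

A per-turn threshold would then be obtained by calling `conformal_threshold` with that turn's calibration scores and its share of the budget, and the pipeline could stop at the first turn whose prediction set meets the stopping condition.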

[11] Cost-Efficient Estimation of General Abilities Across Benchmarks

Michael Krumdick, Adam Wiemerslage, Seth Ebner, Charles Lovering, Chris Tanner

Main category: cs.CL

TL;DR: This paper proposes an efficient LLM evaluation framework grounded in predictive validity. Using the WILD dataset and a modified multidimensional IRT model with adaptive item selection, it predicts performance on 112 unseen tasks with MAE below 7% after observing only 16 items, and cost-aware selection reaches 7% MAE with 22,000 tokens.

Details

Motivation: LLM benchmarks are numerous but redundant, since performance is often explained by a small set of latent abilities. A comparable evaluation framework is needed, judged by how efficiently it predicts performance on unseen tasks.

Method: Collect WILD, a large item-level dataset (65 models, 109,564 items, 163 tasks); propose a modified multidimensional item response theory (IRT) model; select items adaptively using optimal experimental design; add cost-aware discount factors to reduce token consumption.

Result: Performance on 112 held-out tasks is predicted with mean absolute error (MAE) below 7% after observing only 16 items. Cost-aware selection cuts the tokens needed to reach 7% MAE from 141,000 to 22,000, an 85% reduction.

Conclusion: Grounding benchmark evaluation in predictive validity makes it more efficient and quantifiable. Combining psychometric models with adaptive sampling can greatly reduce LLM evaluation cost and suggests a direction for future benchmark design.

Abstract: Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the "Wide-scale Item Level Dataset" (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model's performance on a large, diverse collection of unseen tasks under different budget constraints. We demonstrate that combining a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design can predict performance on 112 held-out benchmark tasks with a mean absolute error (MAE) of less than 7%, and can do so after observing only 16 items. We further demonstrate that incorporating cost-aware discount factors into our selection criteria can reduce the total tokens needed to reach 7% MAE from 141,000 tokens to only 22,000, an 85% reduction in evaluation cost.
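A simplified view of adaptive, cost-aware item selection uses the unidimensional two-parameter logistic (2PL) IRT model, which is swapped in here for the paper's modified multidimensional model. Items are ranked by Fisher information at the current ability estimate, optionally discounted by token cost. The `cost_exponent` discount and the item tuple layout are hypothetical choices for this sketch.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function: ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Information an item provides about theta under the 2PL model."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def pick_item(theta, items, cost_exponent=0.0):
    """Pick the most useful item; items are (name, a, b, token_cost) tuples.

    cost_exponent=0 ignores cost; larger values favor cheaper items.
    """
    def utility(item):
        _, a, b, cost = item
        return fisher_information(theta, a, b) / (cost ** cost_exponent)

    return max(items, key=utility)[0]
```

With cost ignored the most informative item wins; with a cost discount, a slightly less informative but much cheaper item can be preferred, which is the trade-off the token savings in the abstract point to.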

[12] The power of context: Random Forest classification of near synonyms. A case study in Modern Hindi

Jacek Bąkowski

Main category: cs.CL

TL;DR: Using distributional semantics (word embeddings) and a Random Forest classifier, this study finds that Hindi synonyms of Sanskrit and Perso-Arabic origin can be distinguished by their contextual usage patterns, which provides quantitative evidence that synonyms carry different perspectives and cultural associations.

Details

Motivation: Test whether synonyms really reflect different perspectives or cultural associations, and in particular whether the origin of the Sanskrit and Perso-Arabic synonym pairs that arose in Hindi through prolonged contact with Persian is still traceable in modern usage.

Method: Train a Random Forest classifier on word embeddings of Hindi synonyms to predict origin (Sanskrit vs. Perso-Arabic), independently of semantic content, to test whether distributional context alone encodes etymology.

Result: The classifier successfully distinguishes words by origin even when they are semantically unrelated, so contextual distribution carries an etymological signal that a learned model can detect.

Conclusion: Synonymy is not redundancy; it may reflect subtle but systematic distinctions linked to origin. Etymologically related words may form distinct conceptual subspaces, creating a new type of semantic frame shaped by historical origin.

Abstract: Synonymy is a widespread yet puzzling linguistic phenomenon. Absolute synonyms theoretically should not exist, as they do not expand language's expressive potential. However, it was suggested that even if synonyms denote the same concept, they may reflect different perspectives or carry distinct cultural associations, claims that have rarely been tested quantitatively. In Hindi, prolonged contact with Persian produced many Perso-Arabic loanwords coexisting with their Sanskrit counterparts, forming numerous synonym pairs. This study investigates whether centuries after these borrowings appeared in the Subcontinent their origin can still be distinguished using distributional data alone and regardless of their semantic content. A Random Forest trained on word embeddings of Hindi synonyms successfully classified words by Sanskrit or Perso-Arabic origin, even when they were semantically unrelated, suggesting that usage patterns preserve traces of etymology. These findings provide quantitative evidence that context encodes etymological signals and that synonymy may reflect subtle but systematic distinctions linked to origin. They support the idea that synonymous words can offer different perspectives and that etymologically related words may form distinct conceptual subspaces, creating a new type of semantic frame shaped by historical origin. Overall, the results highlight the power of context in capturing nuanced distinctions beyond traditional semantic similarity.

[13] Are Finer Citations Always Better? Rethinking Granularity for Attributed Generation

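Hexuan Wang, Jingyu Zhang, Benjamin Van Durme, Daniel Khashabi

Main category: cs.CL

TL;DR: This paper studies how citation granularity (sentence, paragraph, or document level) affects attributed generation. Intermediate (paragraph-level) granularity gives the best balance of attribution quality and answer correctness, while citations that are too fine (sentence-level) or too coarse (multi-paragraph) hurt performance, and the size of the effect varies non-monotonically with model scale.

Details

Motivation: Fine-grained citations help human verification, but their effect on model attribution performance is unclear. The interaction between citation granularity and model scale needs systematic analysis.

Method: Across four model scales (8B-120B), measure how sentence-level, paragraph-level, and multi-paragraph citations affect attribution quality and answer correctness, and analyze the role of semantic dependencies and information synthesis.

Result: Attribution quality consistently peaks at paragraph-level granularity. Enforcing fine-grained citations degrades attribution quality by 16-276% relative to the best-performing granularity, and fine-grained constraints penalize larger models disproportionately. The citation-optimal granularity improves attribution quality while preserving or even improving answer correctness.

Conclusion: Citation granularity should be aligned with the model's natural semantic scope rather than pushed toward the finest human-verifiable unit. Optimizing solely for human verification compromises attribution faithfulness and generation reliability.

Abstract: Citation granularity (whether to cite individual sentences, paragraphs, or documents) is a critical design choice in attributed generation. While fine-grained citations are often preferred for precise human verification, their impact on model performance remains under-explored. We analyze four model scales (8B-120B) and demonstrate that enforcing fine-grained citations degrades attribution quality by 16-276% compared to the best-performing granularity. We observe a consistent performance pattern where attribution quality peaks at intermediate granularities (paragraph-level). Our analysis suggests that fine-grained (sentence-level) citations disrupt necessary semantic dependencies for attributing evidence to answer claims, while excessively coarse citations (multi-paragraph) introduce distracting noise. Importantly, the magnitude of this performance gap varies non-monotonically with model scale: fine-grained constraints disproportionately penalize larger models, suggesting that atomic citation units disrupt the multi-sentence information synthesis at which these models excel. Strikingly, citation-optimal granularity leads to substantial gains in attribution quality while preserving or even improving answer correctness. Overall, our findings demonstrate that optimizing solely for human verification via fine-grained citation disregards model constraints, compromising both attribution faithfulness and generation reliability. Instead, effective attribution requires aligning citation granularity with the model's natural semantic scope.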

[14] Wired for Overconfidence: A Mechanistic Perspective on Inflated Verbalized Confidence in LLMs

Tianyi Zhao, Yinhan He, Wendy Zheng, Yujie Zhang, Chen Chen

Main category: cs.CL

TL;DR: Through circuit-level mechanistic analysis, this paper explains why large language models (LLMs) verbalize overly high confidence even when their answers are wrong. It finds that specific MLP blocks and attention heads in middle-to-late layers drive the effect, and that targeted inference-time interventions substantially improve calibration.

Details

Motivation: LLMs are often confidently wrong: they verbalize high confidence when producing factually incorrect answers. This misleads users and weakens confidence as an uncertainty signal, yet the internal mechanism is poorly understood.

Method: A circuit-level mechanistic analysis organized around three axes: (1) capture verbalized confidence as a differentiable internal signal; (2) identify the circuits that causally inflate it; (3) use these findings for targeted inference-time recalibration. The analysis covers two instruction-tuned LLMs and three datasets.

Result: A compact set of MLP blocks and attention heads, concentrated in middle-to-late Transformer layers, consistently writes the confidence-inflation signal at the final token position. Targeted inference-time interventions on these circuits substantially improve calibration.

Conclusion: Verbalized overconfidence in LLMs is driven by identifiable internal circuits and can be mitigated through targeted intervention, which opens a path to more reliable models.

Abstract: Large language models are often not just wrong, but confidently wrong: when they produce factually incorrect answers, they tend to verbalize overly high confidence rather than signal uncertainty. Such verbalized overconfidence can mislead users and weaken confidence scores as a reliable uncertainty signal, yet its internal mechanisms remain poorly understood. We present a circuit-level mechanistic analysis of this inflated verbalized confidence in LLMs, organized around three axes: capturing verbalized confidence as a differentiable internal signal, identifying the circuits that causally inflate it, and leveraging these insights for targeted inference-time recalibration. Across two instruction-tuned LLMs on three datasets, we find that a compact set of MLP blocks and attention heads, concentrated in middle-to-late layers, consistently writes the confidence-inflation signal at the final token position. We further show that targeted inference-time interventions on these circuits substantially improve calibration. Together, our results suggest that verbalized overconfidence in LLMs is driven by identifiable internal circuits and can be mitigated through targeted intervention.
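Calibration of verbalized confidence is commonly summarized with expected calibration error (ECE), which compares average stated confidence with observed accuracy inside confidence bins. The sketch below shows that computation; the abstract does not say which calibration metric the paper reports, so using ECE here is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins; confidences in [0, 1], correct in {0, 1}."""
    n = len(confidences)
    if n == 0:
        raise ValueError("no predictions to evaluate")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece
```

A model that always states full confidence but is right half the time gets an ECE of 0.5, which is the "confidently wrong" pattern in numeric form.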

[15] A Dynamic Atlas of Persian Poetic Symbolism: Families, Fields, and the Historical Rewiring of Meaning

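Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar

Main category: cs.CL

TL;DR: Using a corpus of 129,451 Persian poems, this paper identifies and tracks families of symbols (such as Night, Day, and Earth), separates imagistic material from sacred and courtly reference, and builds a multi-layer relation graph. It shows that Persian poetic symbolism is a dynamic structure: the symbolic core is sparse and stable, the referential component is dense and changing, modularity rises and cross-scope linkage declines across centuries, some symbols (such as the Sufi Robe) gain prominence late, and the Wine Cup stays central throughout.

Details

Motivation: Computational work tends to flatten Persian poetic symbols into isolated words or broad document semantics, which misses a practical unit of organization in Persian poetics: related forms that recur as families and gain force through recurring relations.

Method: From a corpus of 129,451 Persian poems, consolidate recurrent forms into traceable families, separate imagistic material from sacred and courtly reference, and map their relations in a multi-layer graph. Divide the corpus into 11 Hijri-century bins and analyze how graph structure changes (modularity, cross-scope linkage, bridge strength, hub positions).

Result: The symbolic core is sparse and stable (Shab, Ruz, and Khaak remain widely distributed), while the referential component is denser and weighted differently over time: wine vessels, garden space, flame, and lyric sound strengthen later, and prestige-coded and heroic-courtly vocabulary is weighted earlier. Century-specific graphs show modularity rising, cross-scope linkage declining, courtly bridges weakening, and sacred bridges strengthening. Hub positions shift: Kherqe gains late prominence, Farkhondeh and Banafsheh recede, and Saaghar stays central.

Conclusion: Persian poetic symbolism is not a static repertory but a long-lived dynamic system whose internal weights and connections change over time.

Abstract: Persian poetry is often remembered through recurrent symbols before it is remembered through plot. Wine vessels, gardens, flames, sacred titles, bodily beauty, and courtly names return across centuries, yet computational work still tends to flatten this material into isolated words or broad document semantics. That misses a practical unit of organization in Persian poetics: related forms travel as families and gain force through recurring relations. Using a corpus of 129,451 poems, we consolidate recurrent forms into traceable families, separate imagistic material from sacred and courtly reference, and map their relations in a multi-layer graph. The symbolic core is relatively sparse, the referential component much denser, and the attachment zone between them selective rather than diffuse. Across 11 Hijri-century bins, some families remain widely distributed, especially Shab (Night), Ruz (Day), and Khaak (Earth). Wine vessels, garden space, flame, and lyric sound strengthen later, while prestige-coded and heroic-courtly vocabulary is weighted earlier. Century-specific graphs show change in arrangement as well as membership. Modularity rises, cross-scope linkage declines, courtly bridges weaken, and sacred bridges strengthen. Hub positions shift too: Kherqe (Sufi Robe) gains late prominence, Farkhondeh (Blessed) and Banafsheh (Violet) recede, and Saaghar (Wine Cup) stays central across the chronology. In this corpus, Persian symbolism appears less as a fixed repertory than as a long-lived system whose internal weights and connections change over time.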
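The statement that modularity rises across centuries refers to a graph statistic that can be computed per century bin. The sketch below implements Newman's modularity for an undirected, unweighted graph with a given partition. The paper's graphs are multi-layer and may be weighted, so this simplified form and the toy node names are assumptions for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q for an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both ends in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    degree_sum = defaultdict(int)  # total degree per community
    for node, d in degree.items():
        degree_sum[community[node]] += d
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )
```

Higher values mean that edges fall within communities more often than expected by chance, so a rising value over time is consistent with symbol families becoming more compartmentalized.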

[16] Magic, Madness, Heaven, Sin: LLM Output Diversity is Everything, Everywhere, All at Once

Harnoor Dhingra

Main category: cs.CL

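TL;DR: This paper proposes the Magic, Madness, Heaven, Sin framework, which starts from a task's normative objective and analyzes LLM output variation in four contexts (epistemic, interactional, societal, and safety). It argues that variation should be evaluated in context rather than treated as an intrinsic property of the model.

Details

Motivation: Terminology around "diversity" in LLM research is fragmented, largely because the normative objectives underlying tasks are rarely made explicit.

Method: Build a four-part framework (Magic, Madness, Heaven, Sin) that places output variation on a homogeneity-heterogeneity axis and maps it to four normative contexts (epistemic/factuality, interactional/user utility, societal/representation, safety/robustness). For each context, review the failure modes and vocabulary (such as hallucination, mode collapse, bias, and erasure), and analyze all pairwise cross-context interactions.

Result: Optimizing for one objective, such as safety, can inadvertently harm others, such as demographic representation or creative diversity. Evaluation therefore needs to be context-aware and driven by the task objective.

Conclusion: Output variation is not an intrinsic trait of the model but a property shaped by the task and its normative objective. The paper argues for context-aware evaluation in place of generic "diversity" metrics.

Abstract: Research on Large Language Models (LLMs) studies output variation across generation, reasoning, alignment, and representational analysis, often under the umbrella of "diversity." Yet the terminology remains fragmented, largely because the normative objectives underlying tasks are rarely made explicit. We introduce the Magic, Madness, Heaven, Sin framework, which models output variation along a homogeneity-heterogeneity axis, where valuation is determined by the task and its normative objective. We organize tasks into four normative contexts: epistemic (factuality), interactional (user utility), societal (representation), and safety (robustness). For each, we examine the failure modes and vocabulary such as hallucination, mode collapse, bias, and erasure through which variation is studied. We apply the framework to analyze all pairwise cross-contextual interactions, revealing that optimizing for one objective, such as improving safety, can inadvertently harm demographic representation or creative diversity. We argue for context-aware evaluation of output variation, reframing it as a property shaped by task objectives rather than a model's intrinsic trait.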

[17] Why Instruction-Based Unlearning Fails in Diffusion Models?

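Zeliang Zhang, Rui Sun, Jiani Liu, Qi Wu, Chenliang Xu

Main category: cs.CL

TL;DR: This paper studies whether instruction-based unlearning works in diffusion models. It finds that natural-language instructions alone cannot suppress targeted concepts, which reveals a fundamental limitation of prompt-level instruction in diffusion models.

Details

Motivation: Determine whether the instruction-based unlearning paradigm, which works for LLMs at inference time, extends to other generative models such as diffusion models.

Method: Run controlled experiments across multiple concepts and prompt variants, and analyze the CLIP text encoder and cross-attention dynamics during denoising.

Result: Diffusion models systematically fail to suppress targeted concepts when guided only by natural-language unlearning instructions. The instructions do not induce sustained reductions in attention to the targeted concept tokens, so the concept representations persist throughout generation.

Conclusion: Prompt-level instruction has a fundamental limitation in diffusion models, and effective unlearning requires interventions beyond inference-time language control.

Abstract: Instruction-based unlearning has proven effective for modifying the behavior of large language models at inference time, but whether this paradigm extends to other generative models remains unclear. In this work, we investigate instruction-based unlearning in diffusion-based image generation models and show, through controlled experiments across multiple concepts and prompt variants, that diffusion models systematically fail to suppress targeted concepts when guided solely by natural-language unlearning instructions. By analyzing both the CLIP text encoder and cross-attention dynamics during the denoising process, we find that unlearning instructions do not induce sustained reductions in attention to the targeted concept tokens, causing the targeted concept representations to persist throughout generation. These results reveal a fundamental limitation of prompt-level instruction in diffusion models and suggest that effective unlearning requires interventions beyond inference-time language control.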

[18] Read More, Think More: Revisiting Observation Reduction for Web Agents

Masafumi Enomoto,Ryoma Obara,Haochen Zhang,Masafumi Oyamada

Main category: cs.CL

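TL;DR: This paper revisits the trend toward reducing HTML observations for web agents and finds that the optimal observation representation depends on model capability and thinking token budget: lower-capability models do better with compact accessibility trees, whereas higher-capability models benefit more from detailed HTML, and additional thinking tokens further amplify HTML's advantage. Incorporating observation history (especially a diff-based representation) also improves performance for most models.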

Details Motivation: Prior work treats the verbosity of HTML as an obstacle to performance and routinely applies observation reduction; this paper re-evaluates that assumption and asks what determines the best observation representation at different model capability levels. Method: Compares accessibility trees and HTML as observation inputs across models of different capability, and runs systematic experiments and error analysis on the effect of thinking token budget, error types (e.g., hallucination), observation history, and diff-based representations. Result: 1) Higher-capability models perform better with HTML input and benefit from more thinking tokens, while lower-capability models are better served by accessibility trees; 2) higher-capability models use layout information in HTML for better action grounding, while lower-capability models hallucinate more under longer inputs; 3) observation history and diff-based representations both improve performance. Conclusion: Observation representations should be selected adaptively based on model capability and thinking token budget, and diff-based observation history is recommended to balance performance and efficiency. Abstract: Web agents based on large language models (LLMs) rely on observations of web pages -- commonly represented as HTML -- as the basis for identifying available actions and planning subsequent steps. Prior work has treated the verbosity of HTML as an obstacle to performance and adopted observation reduction as a standard practice. We revisit this trend and demonstrate that the optimal observation representation depends on model capability and thinking token budget: (1) compact observations (accessibility trees) are preferable for lower-capability models, while detailed observations (HTML) are advantageous for higher-capability models; moreover, increasing thinking tokens further amplifies the benefit of HTML. (2) Our error analysis suggests that higher-capability models exploit layout information in HTML for better action grounding, while lower-capability models suffer from increased hallucination under longer inputs. We also find that incorporating observation history improves performance across most models and settings, and a diff-based representation offers a token-efficient alternative. Based on these findings, we suggest practical guidelines: adaptively select observation representations based on model capability and thinking token budget, and incorporate observation history using diff-based representations.
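
To make the idea of a diff-based observation history concrete, here is a minimal sketch using Python's standard `difflib`. The function name `diff_observation` and the toy HTML snapshots are assumptions for illustration; the abstract does not specify how the authors compute their diffs.

```python
import difflib


def diff_observation(prev_html: str, curr_html: str) -> str:
    """Return a unified diff between two page snapshots ("" if unchanged)."""
    diff = difflib.unified_diff(
        prev_html.splitlines(),
        curr_html.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
        n=0,  # no context lines: only changed lines are sent to the agent
    )
    return "\n".join(diff)


# Toy snapshots (hypothetical): only the cart counter changes between steps.
prev_page = "<ul>\n<li>Cart (0)</li>\n<li>Login</li>\n</ul>"
curr_page = "<ul>\n<li>Cart (1)</li>\n<li>Login</li>\n</ul>"
observation = diff_observation(prev_page, curr_page)
print(observation)
```

Only the changed lines appear in the diff, which is why such a representation can be far shorter than repeating the full page at every step.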

[19] Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging

Mengxian Lyu,Cheng Peng,Ziyi Chen,Mengyuan Zhang,Jieting Li Lu,Yonghui Wu

Main category: cs.CL

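TL;DR: This paper proposes a model merging framework that merges a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct) via interpolation, improving medical task performance while retaining instruction-following ability. It mitigates catastrophic forgetting during medical fine-tuning of LLMs and is training-efficient and scalable under limited resources.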

Details Motivation: When LLMs are fine-tuned for the medical domain, they often severely "forget" their instruction-following ability, which hinders clinical deployment. Method: Uses interpolation-based weight-space merging to combine the clinical foundation model GatorTronLlama with the general instruct model Llama-3.1-8B-Instruct, yielding a domain-adapted model with both clinical expertise and instruction-following ability. Result: Across multiple medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization), the merged models substantially mitigate catastrophic forgetting while preserving clinical domain performance and instruction-following ability; under constrained supervision (e.g., 64-shot vs. 256-shot), they perform on par with fully fine-tuned baselines. Conclusion: Weight-space merging is an efficient and scalable way to adapt open-source LLMs to medicine, suitable for resource-constrained healthcare environments. Abstract: Large language models have been adopted in the medical domain for clinical documentation to reduce clinician burden. However, studies have reported that LLMs often "forget" a significant amount of instruction-following ability when fine-tuned using a task-specific medical dataset, a critical challenge in adopting general-purpose LLMs for clinical applications. This study presents a model merging framework to efficiently adapt general-purpose LLMs to the medical domain by countering this forgetting issue. By merging a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct) via interpolation-based merge methods, we seek to derive a domain-adapted model with strong performance on clinical tasks while retaining instruction-following ability. Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models can effectively mitigate catastrophic forgetting, preserve clinical domain expertise, and retain instruction-following ability. In addition, our model merging strategies demonstrate training efficiency, achieving performance on par with fully fine-tuned baselines under severely constrained supervision (e.g., 64-shot vs. 256-shot). Consequently, weight-space merging constitutes a highly scalable solution for adapting open-source LLMs to clinical applications, facilitating broader deployment in resource-constrained healthcare environments.
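
The simplest member of the interpolation-based merge family is plain linear interpolation of matching parameters. The sketch below assumes both models share the same architecture and parameter names, and represents weights as plain Python lists; the abstract does not state which interpolation variant or mixing coefficient the authors use, so `interpolate_weights` and `alpha` are illustrative only.

```python
def interpolate_weights(clinical, instruct, alpha):
    """Merge two state dicts: alpha * clinical + (1 - alpha) * instruct."""
    if clinical.keys() != instruct.keys():
        raise ValueError("models must share the same parameter names")
    merged = {}
    for name in clinical:
        a, b = clinical[name], instruct[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]
    return merged


# Toy "models" with a single two-element parameter (hypothetical values).
clinical_model = {"layer.weight": [1.0, 2.0]}
instruct_model = {"layer.weight": [3.0, 4.0]}
merged_model = interpolate_weights(clinical_model, instruct_model, alpha=0.5)
print(merged_model)
```

Setting `alpha` closer to 1 keeps more of the clinical model, and closer to 0 keeps more of the instruct model; no gradient updates are involved, which is where the training efficiency of merging comes from.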

[20] DeltaMem: Towards Agentic Memory Management via Reinforcement Learning

Qi Zhang,Shen Huang,Chu Liu,Shouqing Yang,Junbo Zhao,Haobo Wang,Pengjun Xie

Main category: cs.CL

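TL;DR: This paper proposes DeltaMem, a single-agent persona-centric memory management system. Drawing on the evolution of human memory, it synthesizes a dialogue dataset with memory-updating labels and introduces a Memory-based Levenshtein Distance and a reinforcement learning framework, outperforming existing baselines on several long-term memory benchmarks.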

Details Motivation: Existing multi-agent approaches to persona-centric memory management suffer from information loss and poor robustness across scenarios, leading to suboptimal performance. Method: Proposes DeltaMem, a single-agent memory management system; synthesizes a user-assistant dialogue dataset with operation-level memory-updating labels; designs a Memory-based Levenshtein Distance as the memory-updating reward; and proposes a tailored reinforcement learning framework to optimize memory management. Result: Both training-free and RL-trained DeltaMem outperform all product-level baselines on long-term memory benchmarks including LoCoMo, HaluMem, and PersonaMem. Conclusion: By simplifying the architecture and combining a cognitively inspired reward with reinforcement learning, DeltaMem markedly improves the accuracy and robustness of persona-centric memory management and supports the single-agent paradigm for this task. Abstract: Recent advances in persona-centric memory have revealed the powerful capability of multi-agent systems in managing persona memory, especially in conversational scenarios. However, these complex frameworks often suffer from information loss and are fragile across varying scenarios, resulting in suboptimal performance. In this paper, we propose DeltaMem, an agentic memory management system that formulates persona-centric memory management as an end-to-end task within a single-agent setting. To further improve the performance of our agentic memory manager, we draw inspiration from the evolution of human memory and synthesize a user-assistant dialogue dataset along with corresponding operation-level memory updating labels. Building on this, we introduce a novel Memory-based Levenshtein Distance to formalize the memory updating reward, and propose a tailored reinforcement learning framework to further enhance the management capabilities of DeltaMem. Extensive experiments show that both training-free and RL-trained DeltaMem outperform all product-level baselines across diverse long-term memory benchmarks, including LoCoMo, HaluMem, and PersonaMem.

[21] Swift-SVD: Theoretical Optimality Meets Practical Efficiency in Low-Rank LLM Compression

Ruoling Qi,Yirui Liu,Xuaner Wu,Xiangyu Wang,Ming Li,Chen Chen,Jian Chen,Yin Chen,Qizhen Weng

Main category: cs.CL

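TL;DR: This paper proposes Swift-SVD, an activation-aware, closed-form SVD compression framework that combines theoretical optimality, practical efficiency, and numerical stability, substantially improving the speed and accuracy of compressing LLM weights and KV cache.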

Details Motivation: LLM deployment is constrained by the memory and bandwidth demands of static weights and the dynamic KV cache; existing SVD-based compression methods fall short in either reconstruction error or computational efficiency. Method: Proposes Swift-SVD, which incrementally aggregates the covariance of output activations over batches of inputs and performs a single eigenvalue decomposition, giving a training-free, fast, and optimal layer-wise low-rank approximation; uses effective rank to analyze layer-wise compressibility and designs a dynamic rank allocation strategy that accounts for both local reconstruction loss and end-to-end layer importance. Result: Experiments on six LLMs and eight datasets show that Swift-SVD achieves optimal compression accuracy with 3-70X faster end-to-end compression than state-of-the-art methods. Conclusion: Swift-SVD is an efficient, stable, and theoretically optimal approach to compressing LLM weights and KV cache, combining practicality with generalization. Abstract: The deployment of Large Language Models is constrained by the memory and bandwidth demands of static weights and dynamic Key-Value cache. SVD-based compression provides a hardware-friendly solution to reduce these costs. However, existing methods suffer from two key limitations: some are suboptimal in reconstruction error, while others are theoretically optimal but practically inefficient. In this paper, we propose Swift-SVD, an activation-aware, closed-form compression framework that simultaneously guarantees theoretical optimum, practical efficiency and numerical stability. Swift-SVD incrementally aggregates covariance of output activations given a batch of inputs and performs a single eigenvalue decomposition after aggregation, enabling training-free, fast, and optimal layer-wise low-rank approximation. We employ effective rank to analyze local layer-wise compressibility and design a dynamic rank allocation strategy that jointly accounts for local reconstruction loss and end-to-end layer importance. Extensive experiments across six LLMs and eight datasets demonstrate that Swift-SVD outperforms state-of-the-art baselines, achieving optimal compression accuracy while delivering 3-70X speedups in end-to-end compression time. Our code will be released upon acceptance.
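
The abstract's description (aggregate the covariance of output activations, then run one eigenvalue decomposition) matches a standard closed-form result, sketched below. The notation is assumed for illustration and may differ from the paper's own formulation: $W$ is a layer weight, $X_b$ is the $b$-th batch of input activations, $Y_b = W X_b$, and $X$ is the concatenation of all batches.

```latex
% Assumed notation: W \in \mathbb{R}^{m \times n}, X_b \in \mathbb{R}^{n \times N_b}, Y_b = W X_b.
\begin{align}
C &= \sum_{b} Y_b Y_b^{\top} = U \Lambda U^{\top},
  \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m \ge 0, \\
W_r &= U_r \left( U_r^{\top} W \right),
  \qquad U_r = [u_1, \dots, u_r], \\
\lVert W X - W_r X \rVert_F^2
  &= \operatorname{tr}\!\left( (I - U_r U_r^{\top})\, C \right)
   = \sum_{i > r} \lambda_i .
\end{align}
```

By the Eckart-Young theorem, no matrix of rank at most $r$ approximates $Y = WX$ with a smaller Frobenius error than $U_r U_r^{\top} Y$, and $W' X$ has rank at most $r$ for any rank-$r$ candidate $W'$, so $W_r$ attains the optimum. Because $C$ is a running sum over batches, it can be accumulated incrementally before the single decomposition.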

[22] Grounding AI-in-Education Development in Teachers' Voices: Findings from a National Survey in Indonesia

Nurul Aisyah,Muhammad Dehan Al Kautsar,Arif Hidayat,Fajri Koto

Main category: cs.CL

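TL;DR: Through a nationwide survey of 349 K-12 teachers in Indonesia, this study describes the current state, disparities, and challenges of AI use in teaching practice: elementary teachers use AI more consistently and senior high teachers less; mid-career teachers place more importance on AI; teachers in Eastern Indonesia perceive greater value. The main use is reducing preparation workload, but generic outputs, infrastructure, and local adaptation remain barriers.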

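Details Motivation: To fill the gap in large-scale, teacher-centred empirical evidence on how AI is actually used in Indonesian classrooms and what support teachers need, in order to inform context-appropriate AI systems and policy. Method: Surveys 349 K-12 teachers across elementary, junior high, and senior high schools nationwide, analyzing frequency of AI use, use cases, group differences, and barriers. Result: AI use for pedagogy, content development, and teaching media is increasing but uneven: elementary teachers report more consistent use and senior high teachers engage less; mid-career teachers assign higher importance to AI; teachers in Eastern Indonesia perceive greater value; the main use is reducing instructional preparation workload (e.g., assessment, lesson planning, and material development); the main barriers are generic outputs, infrastructure constraints, and limited contextual alignment. Conclusion: AI tools should be better fitted to the Indonesian educational context, support localized content generation, and work with weak infrastructure, accompanied by targeted teacher training and policy support to promote effective classroom integration. Abstract: Despite emerging use in Indonesian classrooms, there is limited large-scale, teacher-centred evidence on how AI is used in practice and what support teachers need, hindering the development of context-appropriate AI systems and policies. To address this gap, we conduct a nationwide survey of 349 K-12 teachers across elementary, junior high, and senior high schools. We find increasing use of AI for pedagogy, content development, and teaching media, although adoption remains uneven. Elementary teachers report more consistent use, while senior high teachers engage less; mid-career teachers assign higher importance to AI, and teachers in Eastern Indonesia perceive greater value. Across levels, teachers primarily use AI to reduce instructional preparation workload (e.g., assessment, lesson planning, and material development). However, generic outputs, infrastructure constraints, and limited contextual alignment continue to hinder effective classroom integration.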

[23] Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations

Shou-Tzu Han,Rodrigue Rizk,KC Santosh

Main category: cs.CL

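TL;DR: This paper studies the fragility of large language models to surface perturbations in mathematical reasoning tasks, proposes the MPD framework to analyze failure mechanisms, and derives a failure taxonomy and repair strategies from the diagnostics.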

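Details Motivation: Large language models perform strongly on mathematical reasoning benchmarks yet are surprisingly fragile to meaning-preserving surface perturbations, so their failure mechanisms need closer study. Method: Systematically evaluates three open-weight LLMs (Mistral-7B, Llama-3-8B, and Qwen2.5-7B) on 677 GSM8K problems and their semantically equivalent variants (name substitution and number format paraphrasing); proposes the Mechanistic Perturbation Diagnostics (MPD) framework, which combines logit lens analysis, activation patching, component ablation, and a new metric, the Cascading Amplification Index (CAI). Result: All three models show answer-flip rates of 28.8%-45.1%, with number paraphrasing more disruptive than name swaps; CAI outperforms first divergence layer as a failure predictor on two of the three models (AUC up to 0.679); logit lens shows that flipped samples diverge from correct predictions at earlier layers; activation patching shows that Llama-3 failures are locally recoverable, whereas Mistral and Qwen failures are broadly distributed; on this basis a "localized / distributed / entangled" failure taxonomy is built and differences in repair effectiveness are validated. Conclusion: Robustness deficits under surface perturbations stem from failure mechanisms that differ fundamentally across architectures; the MPD framework can diagnose them and guide targeted repair, offering an interpretable path to more robust mathematical reasoning. Abstract: Large language models demonstrate strong performance on mathematical reasoning benchmarks, yet remain surprisingly fragile to meaning-preserving surface perturbations. We systematically evaluate three open-weight LLMs, Mistral-7B, Llama-3-8B, and Qwen2.5-7B, on 677 GSM8K problems paired with semantically equivalent variants generated through name substitution and number format paraphrasing. All three models exhibit substantial answer-flip rates (28.8%-45.1%), with number paraphrasing consistently more disruptive than name swaps. To trace the mechanistic basis of these failures, we introduce the Mechanistic Perturbation Diagnostics (MPD) framework, combining logit lens analysis, activation patching, component ablation, and the Cascading Amplification Index (CAI) into a unified diagnostic pipeline. CAI, a novel metric quantifying layer-wise divergence amplification, outperforms first divergence layer as a failure predictor for two of three architectures (AUC up to 0.679). Logit lens reveals that flipped samples diverge from correct predictions at significantly earlier layers than stable samples. Activation patching reveals a stark architectural divide in failure localizability: Llama-3 failures are recoverable by patching at specific layers (43/60 samples), while Mistral and Qwen failures are broadly distributed (3/60 and 0/60).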
Based on these diagnostic signals, we propose a mechanistic failure taxonomy (localized, distributed, and entangled) and validate it through targeted repair experiments: steering vectors and layer fine-tuning recover 12.2% of localized failures (Llama-3) but only 7.2% of entangled (Qwen) and 5.2% of distributed (Mistral) failures.

[24] What Do Claim Verification Datasets Actually Test? A Reasoning Trace Analysis

Delip Rao,Chris Callison-Burch

Main category: cs.CL

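TL;DR: This paper systematically analyzes the reasoning patterns of 24K claim-verification examples and finds that existing benchmarks mainly test direct evidence extraction and lexical matching, with little coverage of higher-order skills such as multi-sentence synthesis and numerical reasoning. Error types differ markedly by domain, and high scores reflect retrieval-plus-entailment ability more than deep reasoning.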

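Details Motivation: There is no systematic understanding of which reasoning abilities claim-verification benchmarks actually exercise; the coverage biases and blind spots of current evaluation need to be exposed. Method: Uses GPT-4o-mini to generate structured reasoning traces for 24K examples across 9 datasets, and uses a 1B-parameter reasoning verifier to analyze how five error types are distributed across general, scientific, and mathematical domains. Result: Direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented; datasets show stark biases (some test almost only lexical matching, others require information synthesis in roughly half of cases); error types vary by domain: lexical overlap bias in the general domain, overcautiousness in the scientific domain, and arithmetic reasoning failures in the mathematical domain. Conclusion: High benchmark scores today mainly reflect retrieval-plus-entailment ability rather than complex reasoning; more challenging evaluation suites are needed to assess the reasoning capabilities of verification systems fully. Abstract: Despite rapid progress in claim verification, we lack a systematic understanding of what reasoning these benchmarks actually exercise. We generate structured reasoning traces for 24K claim-verification examples across 9 datasets using GPT-4o-mini and find that direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented. A dataset-level breakdown reveals stark biases: some datasets almost exclusively test lexical matching, while others require information synthesis in roughly half of cases. Using a compact 1B-parameter reasoning verifier, we further characterize five error types and show that error profiles vary dramatically by domain -- general-domain verification is dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures. Our findings suggest that high benchmark scores primarily reflect retrieval-plus-entailment ability. We outline recommendations for building more challenging evaluation suites that better test the reasoning capabilities verification systems need.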

[25] PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation

Yanxin Luo,Xiaoyu Zhang,Jing Li,Yan Gao,Donghong Han

Main category: cs.CL

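TL;DR: This paper proposes the PRCCF framework, which uses persona-guided retrieval and a causality-aware cognitive filtering mechanism to improve contextual understanding and empathetic response generation in emotional support conversation.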

Details Motivation: Existing emotional support conversation methods fall short in deep contextual understanding and struggle to model the causal relationship between user emotions and external knowledge. Method: Proposes the PRCCF framework, comprising persona-guided retrieval (jointly modeling semantic compatibility and persona alignment) and causality-aware cognitive filtering (selecting causally relevant external knowledge to strengthen the cognitive understanding needed for emotional reasoning). Result: On the ESConv dataset, PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluation. Conclusion: Jointly modeling persona guidance and causal awareness can markedly improve the empathy and depth of contextual understanding of emotional support dialogue systems. Abstract: Emotional Support Conversation (ESC) aims to alleviate individual emotional distress by generating empathetic responses. However, existing methods face challenges in effectively supporting deep contextual understanding. To address this issue, we propose PRCCF, a Persona-guided Retrieval and Causality-aware Cognitive Filtering framework. Specifically, the framework incorporates a persona-guided retrieval mechanism that jointly models semantic compatibility and persona alignment to enhance response generation. Furthermore, it employs a causality-aware cognitive filtering module to prioritize causally relevant external knowledge, thereby improving contextual cognitive understanding for emotional reasoning. Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations. Our code is publicly available at: https://github.com/YancyLyx/PRCCF.

[26] PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment

Chenning Xu,Mao Zheng,Mingyang Song

Main category: cs.CL

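TL;DR: This paper proposes the PRISM framework, which introduces sentence-level factuality risk labels and inter-sentence dependency annotations and applies differentiated learning at fact-critical positions during supervised fine-tuning, to reduce hallucination in LLM generation.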

Details Motivation: Supervised fine-tuning (SFT) with token-level hard labels tends to make models imitate factually unsupported targets with overconfidence, propagating hallucinations in multi-sentence generation. Method: Proposes PRISM, a differentiable risk-gated framework that adds a lightweight, model-aware probability reallocation objective to standard SFT: it penalizes high-confidence predictions on risk-marked tokens, with its scope controlled by span-level risk weights and model-aware gating. Result: On hallucination-sensitive factual benchmarks and general evaluations, PRISM improves factual metrics across multiple backbones while keeping overall capability competitive; ablations show that the auxiliary signal should be used conservatively, and that knowledge masking and model-aware reallocation work together to balance factual correction and capability preservation. Conclusion: Introducing structured factuality risk signals and applying position-sensitive, model-aware optimization is an effective, low-overhead way to mitigate LLM hallucination. Abstract: Supervised fine-tuning (SFT) with token-level hard labels can amplify overconfident imitation of factually unsupported targets, causing hallucinations that propagate in multi-sentence generation. We study an augmented SFT setting in which training instances include coarse sentence-level factuality risk labels and inter-sentence dependency annotations, providing structured signals about where factual commitments are weakly supported. We propose PRISM, a differentiable risk-gated framework that modifies learning only at fact-critical positions. PRISM augments standard SFT with a lightweight, model-aware probability reallocation objective that penalizes high-confidence predictions on risky target tokens, with its scope controlled by span-level risk weights and model-aware gating. Experiments on hallucination-sensitive factual benchmarks and general evaluations show that PRISM improves factual aggregates across backbones while maintaining a competitive overall capability profile. Ablations further show that the auxiliary signal is most effective when used conservatively, and that knowledge masking and model-aware reallocation play complementary roles in balancing factual correction and capability preservation.

[27] On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning

Zhaoyi Li,Xiangyu Xi,Zhengyu Chen,Wei Wang,Gangwei Jiang,Ranran Shen,Linqi Song,Ying Wei,Defu Lian

Main category: cs.CL

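TL;DR: This paper studies how verified chain-of-thought (CoT) trajectories from different sources affect LLM generalization and finds that the data with lower training loss (DeepSeek-R1-0528) leads to worse generalization. The analysis attributes this to differences in reasoning patterns: DeepSeek-R1 favors divergent, branch-heavy exploration and tends to get stuck on redundant paths. Based on this, a simple method that filters out frequently branching trajectories improves performance on several reasoning benchmarks.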

Details Motivation: Although supervised fine-tuning (SFT) on long chain-of-thought (CoT) trajectories has become a key stage in building strong reasoning models, it remains unclear how high-quality CoT trajectories from different sources affect generalization. Method: With the problem set held identical, compares SFT on verified CoT trajectories generated by DeepSeek-R1-0528 and by gpt-oss-120b; performs a multi-faceted analysis (token-level loss, step-level reasoning behavior) to reveal differences in reasoning patterns; then proposes filtering CoT trajectories by branching frequency. Result: DeepSeek-R1 data yields lower training loss yet worse generalization; its trajectories are more divergent and branch-heavy, which leads trained models into redundant exploration; after filtering out high-branching trajectories, performance improves by 3.6% on average across five benchmarks including AIME25 and BeyondAIME, and by up to 5.5%. Conclusion: The quality of CoT trajectories depends not only on answer correctness but also on the soundness of the reasoning structure; overly divergent, branch-heavy reasoning harms generalization; targeted selection of trajectories can improve SFT. Abstract: Supervised Fine-Tuning (SFT) on long Chain-of-Thought (CoT) trajectories has become a pivotal phase in building large reasoning models. However, how CoT trajectories from different sources influence the generalization performance of models remains an open question. In this paper, we conduct a comparative study using two sources of verified CoT trajectories generated by two competing models, DeepSeek-R1-0528 and gpt-oss-120b, with their problem sets controlled to be identical. Despite their comparable performance, we uncover a striking paradox: lower training loss does not translate to better generalization. SFT on DeepSeek-R1-0528 data achieves remarkably lower training loss, yet exhibits significantly worse generalization performance on reasoning benchmarks compared to those trained on gpt-oss-120b. To understand this paradox, we perform a multi-faceted analysis probing token-level SFT loss and step-level reasoning behaviors. Our analysis reveals a difference in reasoning patterns. gpt-oss-120b exhibits highly convergent and deductive trajectories, whereas DeepSeek-R1-0528 favors a divergent and branch-heavy exploration pattern. Consequently, models trained with DeepSeek-R1 data inherit inefficient exploration behaviors, often getting trapped in redundant exploratory branches that hinder them from reaching correct solutions.
Building upon this insight, we propose a simple yet effective remedy of filtering out frequently branching trajectories to improve the generalization of SFT. Experiments show that training on selected DeepSeek-R1-0528 subsets surprisingly improves reasoning performance by up to 5.1% on AIME25, 5.5% on BeyondAIME, and on average 3.6% on five benchmarks.
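
One way to picture "filtering out frequently branching trajectories" is a marker-counting heuristic like the sketch below. Everything here is an assumption for illustration: the marker list, the definition of a step as a non-empty line, and the threshold are hypothetical, and the abstract does not describe how the authors actually detect branching.

```python
import re

# Hypothetical phrases taken to signal a new exploratory branch.
BRANCH_MARKERS = ("alternatively", "wait", "on second thought", "another approach")


def branch_frequency(trajectory: str) -> float:
    """Branch markers per reasoning step (a step is a non-empty line)."""
    steps = [line for line in trajectory.splitlines() if line.strip()]
    if not steps:
        return 0.0
    text = trajectory.lower()
    hits = sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", text))
        for marker in BRANCH_MARKERS
    )
    return hits / len(steps)


def filter_trajectories(trajectories, max_branch_frequency=0.25):
    """Keep trajectories whose branch frequency is at or below the threshold."""
    return [t for t in trajectories if branch_frequency(t) <= max_branch_frequency]


convergent = "Let x = 3.\nThen 2x = 6.\nSo the answer is 6."
divergent = "Let x = 3.\nWait, maybe x = 4.\nAlternatively, try x = 5.\nWait, go back to x = 3."
kept = filter_trajectories([convergent, divergent])
print(len(kept))
```

In practice a learned or model-based step classifier would likely be more reliable than keyword matching; the sketch only shows the shape of a frequency-based filter.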

[28] Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy

Ruijie Yang,Yan Zhu,Peiyao Fu,Te Luo,Zhihua Wang,Xian Yang,Quanlin Li,Pinghong Zhou,Shuo Wang

Main category: cs.CL

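TL;DR: This paper presents EndoASR, a domain-adapted speech recognition system for gastrointestinal endoscopy. Using a two-stage adaptation strategy based on synthetic data, it markedly improves transcription accuracy and medical term accuracy in real clinical settings, offers low latency, a small model, and edge deployability, and demonstrates reliable human-AI teaming across multi-center clinical scenarios.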

Details Motivation: Automatic speech recognition (ASR) is a key interface for human-AI interaction in gastrointestinal endoscopy, but domain-specific terminology and variable acoustic conditions limit the reliability of existing models in real clinical use. Method: Proposes the EndoASR system with a two-stage domain adaptation strategy based on synthetic endoscopy reports, targeting domain language modeling and noise robustness respectively; evaluates generalization, real-time performance (RTF), model size, and downstream integration in a prospective multi-center study. Result: In retrospective evaluation, CER drops from 20.52% to 14.14% and Med ACC rises from 54.30% to 87.59%; in the prospective multi-center study, compared with the Paraformer baseline, CER drops from 16.20% to 14.97% and Med ACC rises from 61.63% to 84.16%; RTF reaches 0.005 (versus 0.055 for Whisper-large-v3) with only 220M parameters; better ASR quality also improves downstream structured information extraction and clinician-AI interaction. Conclusion: As a domain-adapted ASR system, EndoASR can support human-AI teaming stably, efficiently, and reliably in real multi-center endoscopy settings, providing a solid speech interface for clinical AI deployment. Abstract: Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction.
These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.
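
For readers unfamiliar with the two headline metrics, the sketch below computes them in their usual form: character error rate as character-level edit distance divided by reference length, and real-time factor as processing time divided by audio duration. The example strings and timings are made up, and details such as whether whitespace and punctuation are counted are assumptions that may differ from the paper's evaluation protocol.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters (spaces included here)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1 means faster than real time."""
    return processing_seconds / audio_seconds


# Made-up example: one substituted character in a 22-character reference.
cer = character_error_rate("polyp in sigmoid colon", "polyp in sigmoid colan")
rtf = real_time_factor(processing_seconds=0.5, audio_seconds=100.0)
print(f"CER={cer:.4f} RTF={rtf:.3f}")
```

An RTF of 0.005, as reported for EndoASR, corresponds to transcribing 100 seconds of audio in about half a second.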

[29] Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework

Yanchen Wu,Tenghui Lin,Yingli Zhou,Fangyuan Zhang,Qintian Guo,Xun Zhou,Sibo Wang,Xilin Liu,Yuchi Ma,Yixiang Fang

Main category: cs.CL

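TL;DR: This paper systematically surveys memory methods for LLM agents and models them in a unified framework, compares them through comprehensive experiments on two mainstream benchmarks, analyzes their effectiveness, and proposes a new memory method that combines existing modules and outperforms the current state of the art.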

Details Motivation: Memory methods for LLM agents have not been compared systematically and comprehensively under the same experimental settings, making it hard to assess their real effectiveness and limits of applicability. Method: 1) Proposes a unified framework covering all existing memory methods; 2) runs large-scale comparative experiments on representative memory methods over two well-known benchmarks; 3) designs a new hybrid memory method based on the analysis. Result: The experiments reveal the strengths, weaknesses, and suitable scenarios of each memory method; the new method outperforms existing SOTA methods on multiple metrics. Conclusion: The unified framework and empirical analysis provide a solid basis for understanding and improving memory mechanisms and point to future research directions such as scalability, dynamic adaptation, and cognitive modeling. Abstract: Memory emerges as the core module in the large language model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative reasoning and self-evolution. A number of memory methods have been proposed in the literature. However, these methods have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first summarize a unified framework that incorporates all the existing agent memory methods from a high-level perspective. We then extensively compare representative agent memory methods on two well-known benchmarks and examine the effectiveness of all methods, providing a thorough analysis of those methods. As a byproduct of our experimental analysis, we also design a new memory method by exploiting modules in the existing methods, which outperforms the state-of-the-art methods. Finally, based on these findings, we offer promising future research opportunities. We believe that a deeper understanding of the behavior of existing methods can provide valuable new insights for future research.

[30] Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition

Truc Nguyen,Then Tran,Binh Truong,Phuoc Nguyen T. H

Main category: cs.CL

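TL;DR: This paper proposes a human-machine collaborative framework that combines large language model (LLM) reasoning with acoustic feature models, using confidence-based routing and iterative rule refinement to improve Vietnamese speech emotion recognition on ambiguous samples and in low-resource settings.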

Details Motivation: Vietnamese speech emotion recognition faces ambiguous acoustic patterns, scarce annotated data, and unclear emotional boundaries in real-world conditions, and needs human knowledge to improve robustness. Method: Builds a human-machine collaborative framework centered on LLM reasoning: acoustic models provide confidence and feature-level evidence; a confidence-based routing mechanism sends ambiguous samples to an LLM for deeper reasoning guided by structured rules derived from human annotation behavior; an iterative strategy of error analysis and rule updates continuously refines the system. Result: On a Vietnamese dataset of 2,764 samples across three emotion classes (calm, angry, panic) with high inter-annotator agreement (Fleiss Kappa = 0.8574), the method reaches up to 86.59% accuracy and a Macro F1 of around 0.85-0.86, with notably better recognition of ambiguous, hard-to-classify samples. Conclusion: Combining data-driven modeling with human reasoning can address emotion recognition for low-resource languages, and the approach is model-agnostic with strong potential to generalize. Abstract: Vietnamese Speech Emotion Recognition (SER) remains challenging due to ambiguous acoustic patterns and the lack of reliable annotated data, especially in real-world conditions where emotional boundaries are not clearly separable. To address this problem, this paper proposes a human-machine collaborative framework that integrates human knowledge into the learning process rather than relying solely on data-driven models. The proposed framework is centered around LLM-based reasoning, where acoustic feature-based models are used to provide auxiliary signals such as confidence and feature-level evidence. A confidence-based routing mechanism is introduced to distinguish between easy and ambiguous samples, allowing uncertain cases to be delegated to LLMs for deeper reasoning guided by structured rules derived from human annotation behavior. In addition, an iterative refinement strategy is employed to continuously improve system performance through error analysis and rule updates. Experiments are conducted on a Vietnamese speech dataset of 2,764 samples across three emotion classes (calm, angry, panic), with high inter-annotator agreement (Fleiss Kappa = 0.8574), ensuring reliable ground truth. The proposed method achieves strong performance, reaching up to 86.59% accuracy and Macro F1 around 0.85-0.86, demonstrating its effectiveness in handling ambiguous and hard-to-classify cases. Overall, this work highlights the importance of combining data-driven models with human reasoning, providing a robust and model-agnostic approach for speech emotion recognition in low-resource settings.
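
The confidence-based routing step can be pictured with a small sketch like the one below. The function name `route_sample`, the 0.8 threshold, and the probability values are hypothetical; the abstract does not give the actual routing criterion or threshold.

```python
def route_sample(class_probabilities: dict, threshold: float = 0.8):
    """Accept the acoustic prediction if confident, otherwise defer to the LLM.

    Returns (handler, label): label is None when the sample is delegated.
    """
    label = max(class_probabilities, key=class_probabilities.get)
    confidence = class_probabilities[label]
    if confidence >= threshold:
        return ("acoustic", label)
    # Ambiguous case: hand over to rule-guided LLM reasoning (not modeled here).
    return ("llm", None)


# Hypothetical acoustic-model outputs over the three classes in the dataset.
easy = route_sample({"calm": 0.90, "angry": 0.05, "panic": 0.05})
ambiguous = route_sample({"calm": 0.40, "angry": 0.35, "panic": 0.25})
print(easy, ambiguous)
```

The threshold trades cost against accuracy: a higher value sends more samples to the slower LLM path, while a lower value trusts the acoustic model more often.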

[31] Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text

Melania Berbatova,Tsvetoslav Vasev

Main category: cs.CL

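TL;DR: This paper proposes a fine-grained toxic content detection approach for Bulgarian text. It builds a domain ontology and an annotated dataset and trains a BERT model that achieves high classification accuracy (macro F1 of 0.89), identifying toxicity while preserving essential information such as medical terms and text related to minority groups.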

Details Motivation: Existing toxic content detection methods often block important information by mistake (e.g., medical terms and content related to minority groups), so a more fine-grained, language-specific solution is needed. Method: Builds an ontology of potentially toxic words in Bulgarian; manually annotates 4,384 sentences in four categories (toxic, medical, non-toxic, minority-related); trains a BERT-based multi-class classifier. Result: The BERT model reaches a macro F1 score of 0.89 on the four-class task, is ready for practical deployment, and can be integrated into content moderation systems. Conclusion: The approach improves both the accuracy and the semantic inclusiveness of toxic content detection for Bulgarian, and offers a reusable path for content moderation in lower-resource languages and highly sensitive settings. Abstract: Toxic content detection in online communication remains a significant challenge, with current solutions often inadvertently blocking valuable information, including medical terms and text related to minority groups. This paper presents a more nuanced approach to identifying toxicity in Bulgarian text while preserving access to essential information. The research explores two distinct methodologies for detecting toxic content. The developed methodologies have potential applications across diverse online platforms and content moderation systems. First, we propose an ontology that models the potentially toxic words in Bulgarian language. Then, we compose a dataset that comprises 4,384 manually annotated sentences from Bulgarian online forums across four categories: toxic language, medical terminology, non-toxic language, and terms related to minority communities. We then train a BERT-based model for toxic language classification, which reaches a 0.89 F1 macro score. The trained model is directly applicable in a real environment and can be integrated as a component of toxic content detection systems.

[32] LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches

Linyang He,Qiyao Yu,Hanze Dong,Baohao Liao,Xinxing Xu,Micah Goldblum,Jiang Bian,Nima Mesgarani

Main category: cs.CL

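TL;DR: This paper presents LiveMathematicianBench, a dynamic multiple-choice benchmark for mathematical reasoning built from recent arXiv papers, with a logical taxonomy and a substitution-resistant mechanism. It shows that even the strongest current LLMs remain far from adequate at research-level mathematical reasoning.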

Details Motivation: Existing mathematical reasoning benchmarks are overly synthetic and affected by data contamination, so they cannot faithfully assess how well LLMs generalize and understand research-level mathematics. Method: Builds a dynamic multiple-choice benchmark from recent arXiv papers; designs a thirteen-category logical taxonomy of theorem types; uses a proof-sketch-guided pipeline to generate distractors; introduces a substitution-resistant mechanism to separate answer recognition from substantive reasoning. Result: Gemini-3.1-pro-preview reaches only 43.5% under standard evaluation; under substitution-resistant evaluation, GPT-5.4 scores highest at only 30.6% and Gemini falls to 17.6% (below the 20% random baseline); providing proof sketches yields consistent accuracy gains. Conclusion: LiveMathematicianBench is a scalable, contamination-resistant, fine-grained testbed for research-level mathematical reasoning, and confirms that current LLMs still have fundamental limitations in deep mathematical reasoning. Abstract: Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline.
A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.

[33] Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens

Hanna Hubarava,Yingqiang Gao

Main category: cs.CL

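TL;DR: This paper proposes a domain-agnostic framework for Controllable Automatic Text Simplification (CATS) based on instruction fine-tuning with discrete control tokens, aiming to improve control over readability and compression rate, and highlights the limitations of current data and evaluation methods.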

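Details Motivation: Existing Controllable Automatic Text Simplification (CATS) work treats controllability as a decoding problem and evaluates it with metrics that do not reflect the actual degree of control; the authors find that data coverage and evaluation are the key bottlenecks. Method: Proposes a CATS framework based on instruction fine-tuning, with discrete control tokens that explicitly steer open-source language models (Llama/Mistral/Qwen, 1-14B) toward target readability levels (FKGL/ARI/Dale-Chall) and compression rates; runs systematic experiments across four domains and designs error-based control metrics and a stratified sampling strategy. Result: Small models (1-3B) can match larger ones in controllability; readability control is learned consistently, whereas compression control is limited by insufficient variation of that attribute in the training data; standard simplification and similarity metrics cannot measure control accurately, and error-based measures of target-output alignment are needed; naive random splits introduce distributional mismatch that undermines both training and evaluation. Conclusion: Successful controllable ATS depends heavily on sufficient variation of the target attribute in the training data, not just on model scale or decoding strategy; datasets with more attribute diversity, error-based evaluation, and stratified data splits are needed. Abstract: Controllable Automatic Text Simplification (CATS) produces user-tailored outputs, yet controllability is often treated as a decoding problem and evaluated with metrics that do not reflect the degree of control. We observe that controllability in ATS is significantly constrained by data and evaluation. To this end, we introduce a domain-agnostic CATS framework based on instruction fine-tuning with discrete control tokens, steering open-source models to target readability levels and compression rates. Across three model families with different model sizes (Llama, Mistral, Qwen; 1-14B) and four domains (medicine, public administration, news, encyclopedic text), we find that smaller models (1-3B) can be competitive, but reliable controllability strongly depends on whether the training data encodes sufficient variation in the target attribute. Readability control (FKGL, ARI, Dale-Chall) is learned consistently, whereas compression control underperforms due to limited signal variability in the existing corpora. We further show that standard simplification and similarity metrics are insufficient for measuring control, motivating error-based measures for target-output alignment. Finally, our sampling and stratification experiments demonstrate that naive splits can introduce distributional mismatch that undermines both training and evaluation.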
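
The sketch below illustrates two ideas from this entry: prepending discrete control tokens to an instruction, and scoring control by the error between the target and the achieved readability. The FKGL formula is the standard Flesch-Kincaid Grade Level; the token format (`<FKGL_6>`, `<COMP_0.7>`), the prompt wording, and the vowel-group syllable counter are assumptions for illustration, not the paper's actual setup.

```python
import re


def count_syllables(word: str) -> int:
    """Naive estimate: number of vowel groups, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def build_prompt(source: str, target_fkgl: int, target_compression: float) -> str:
    """Prepend hypothetical control tokens to a simplification instruction."""
    return f"<FKGL_{target_fkgl}> <COMP_{target_compression:.1f}> Simplify: {source}"


def control_error(target: float, achieved: float) -> float:
    """Error-based control measure: absolute gap between target and output."""
    return abs(target - achieved)


prompt = build_prompt("The feline reposed upon the rug.", target_fkgl=2, target_compression=0.7)
score = fkgl("The cat sat.")
print(prompt)
print(round(score, 2), round(control_error(2, score), 2))
```

An error-based measure like `control_error` asks directly whether the output hit the requested level, which similarity metrics against a reference simplification cannot tell you.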

[34] DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment

Liang Zhu,Feiteng Fang,Yuelin Bai,Longze Chen,Zhexiang Zhang,Minghuan Tan,Min Yang

Main category: cs.CL

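TL;DR: This paper proposes DEFT, an efficient alignment framework that uses a differential distribution reward for data filtering and distributional guidance, improving the efficiency and generalization of aligning large language models with human values while reducing training time.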

Details Motivation: RLHF methods such as PPO are costly and unstable; alternative methods still require large amounts of preference data and may weaken the model's generalization ability. Method: Distribution-guided Efficient Fine-Tuning (DEFT) computes a differential distribution reward from the language model's output distribution and the discrepancy distribution of the preference data, uses it to filter a small, high-quality subset, and incorporates it into existing alignment methods to guide the model's output distribution. Result: Methods enhanced with DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time. Conclusion: DEFT is an efficient and stable approach to value alignment that preserves generalization and fits into a variety of existing alignment pipelines. Abstract: Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework incorporating data filtering and distributional guidance by calculating the differential distribution reward based on the output distribution of the language model and the discrepancy distribution of preference data. A small yet high-quality subset is filtered from the raw data using a differential distribution reward, which is then incorporated into existing alignment methods to guide the model's output distribution. Experimental results demonstrate that the methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.

[35] PLOT: Enhancing Preference Learning via Optimal Transport

Liang Zhu,Yuelin Bai,Xiankun Ren,Jiaxi Yang,Lei Zhang,Feiteng Fang,Hamid Alinejad-Rokny,Minghuan Tan,Min Yang

Main category: cs.CL

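TL;DR: This paper proposes PLOT, which builds a token-level loss from optimal transport theory to improve the performance, stability, and global semantic modeling of preference learning in large language models.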

Details Motivation: Existing preference learning methods suffer from modest performance gains, high computational cost, hyperparameter sensitivity, and difficulty modeling global token-level relationships. Method: Formulate preference learning as an optimal transport problem and design a token-level loss based on token embeddings that aligns with human preferences while preserving the original LLM distribution. Result: Across seven sub-preferences in two categories, Human Values and Logic & Problem Solving, PLOT consistently improves alignment while maintaining fluency and coherence. Conclusion: Optimal transport provides a theoretical basis and a new perspective for preference learning, and PLOT establishes a principled, theoretically grounded framework for it. Abstract: Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global token-level relationships. We introduce PLOT, which enhances Preference Learning in fine-tuning-based alignment through a token-level loss derived from Optimal Transport. By formulating preference learning as an Optimal Transport Problem, PLOT aligns model outputs with human preferences while preserving the original distribution of LLMs, ensuring stability and robustness. Furthermore, PLOT leverages token embeddings to capture semantic relationships, enabling globally informed optimization. Experiments across two preference categories - Human Values and Logic & Problem Solving - spanning seven subpreferences demonstrate that PLOT consistently improves alignment performance while maintaining fluency and coherence. These results substantiate optimal transport as a principled methodology for preference learning, establishing a theoretically grounded framework that provides new insights for preference learning of LLMs.

[36] From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion

Liang Zhu,Haolin Chen,Lidong Zhao,Xian Wu

Main category: cs.CL

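TL;DR: This paper proposes Adaptive Placeholder Completion (APC), a framework that emits explicit placeholders at high-entropy positions instead of forcing a Hard Completion (HC), in order to lower user editing cost. It proves that APC beats HC above a certain entropy threshold, builds training data from real edit logs with a cost-based reward function, and shows across several LLMs that expected editing cost drops by 19% to 50% while standard completion performance is preserved.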

Details Motivation: Current LLMs follow a Hard Completion (HC) paradigm for code completion and often generate wrong concrete code when context is insufficient; an analysis of 3 million real interactions finds that 61% of suggestions were edited or rejected, indicating that models tend to err at high-uncertainty positions. Method: Adaptive Placeholder Completion (APC) models code completion as cost minimization under uncertainty and derives an entropy threshold theoretically; training data is built from real edit logs, and a cost-based reinforcement learning reward function is used for end-to-end training. Result: Evaluations on models from 1.5B to 14B parameters show that APC reduces expected editing cost by 19% to 50% while preserving standard HC completion performance. Conclusion: APC provides a theoretical foundation and a practical training framework for uncertainty-aware code completion, showing that an adaptive abstention policy (inserting placeholders) can be learned end-to-end without harming conventional completion quality. Abstract: While Large Language Models (LLMs) have demonstrated exceptional proficiency in code completion, they typically adhere to a Hard Completion (HC) paradigm, compelling the generation of fully concrete code even amidst insufficient context. Our analysis of 3 million real-world interactions exposes the limitations of this strategy: 61% of the generated suggestions were either edited after acceptance or rejected despite exhibiting over 80% similarity to the user's subsequent code, suggesting that models frequently make erroneous predictions at specific token positions. Motivated by this observation, we propose Adaptive Placeholder Completion (APC), a collaborative framework that extends HC by strategically outputting explicit placeholders at high-entropy positions, allowing users to fill directly via IDE navigation. Theoretically, we formulate code completion as a cost-minimization problem under uncertainty. Premised on the observation that filling placeholders incurs lower cost than correcting errors, we prove the existence of a critical entropy threshold above which APC achieves strictly lower expected cost than HC. We instantiate this framework by constructing training data from filtered real-world edit logs and design a cost-based reward function for reinforcement learning. Extensive evaluations across 1.5B--14B parameter models demonstrate that APC reduces expected editing costs by 19% to 50% while preserving standard HC performance. 
Our work provides both a theoretical foundation and a practical training framework for uncertainty-aware code completion, demonstrating that adaptive abstention can be learned end-to-end without sacrificing conventional completion quality.
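The cost argument can be made concrete with a toy decision rule. The sketch below is not the paper's formulation: it assumes a simplified cost model in which committing to the most likely token costs `fix_cost` whenever that token is wrong, while emitting a placeholder always costs `fill_cost`. Both cost values and all function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expected_cost_hard(probs, fix_cost):
    """Toy model: the argmax token is wrong with probability 1 - max(probs)."""
    return (1.0 - max(probs)) * fix_cost


def choose_action(probs, fix_cost=3.0, fill_cost=1.0):
    """Emit a placeholder only when hard completion is expected to cost more."""
    if expected_cost_hard(probs, fix_cost) > fill_cost:
        return "placeholder"
    return "complete"


confident = [0.9, 0.05, 0.05]        # low entropy: commit to the token
uncertain = [0.25, 0.25, 0.25, 0.25]  # high entropy: defer to the user
print(entropy(confident), choose_action(confident))
print(entropy(uncertain), choose_action(uncertain))
```

Under this toy model the switch happens once the top-token probability falls below `1 - fill_cost / fix_cost`, which mirrors the idea of a critical uncertainty level when filling a placeholder is cheaper than correcting an error.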

[37] Beyond Detection: Ethical Foundations for Automated Dyslexic Error Attribution

Samuel Rose,Debarati Chakraborty

Main category: cs.CL

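TL;DR: This paper proposes a binary classification method for distinguishing the spelling errors of dyslexic writers from those of non-dyslexic writers. It combines phonological, orthographic, and morphological features in a twin-input neural model that reaches 93.01% accuracy under writer-independent conditions. It also takes an ethics-first stance, analysing fairness, interpretability, and the transparency, informed consent, human oversight, and recourse mechanisms needed for deployment.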

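Details Motivation: Prior work focuses on spelling correction rather than error attribution and overlooks the ethical risks of automatically identifying learners' difficulties, such as harmful labelling, covert screening, algorithmic bias, and institutional misuse. Method: Frame dyslexic error attribution as binary classification (given a misspelt word and its correct form, decide whether the error is characteristic of a dyslexic writer); build a comprehensive feature set covering orthographic, phonological, and morphological properties; design and evaluate a twin-input neural model against traditional machine learning baselines under writer-independent conditions. Result: The neural model achieves 93.01% accuracy and a 94.01% F1-score; phonetically plausible errors and vowel confusions are the strongest attribution signals; the paper also analyses subgroup fairness, interpretability in educational settings, and the conditions for responsible deployment (informed consent, transparency, human oversight, recourse). Conclusion: High-accuracy dyslexic error attribution is technically feasible, but feasibility alone does not justify deployment in high-stakes educational contexts; robust ethical and legal frameworks, clear limits on use, and awareness of misuse risks are required. Abstract: Dyslexic spelling errors exhibit systematic phonological and orthographic patterns that distinguish them from the errors produced by typically developing writers. While this observation has motivated dyslexic-specific spell-checking and assistive writing tools, prior work has focused predominantly on error correction rather than attribution, and has largely neglected the ethical risks. The risk of harmful labelling, covert screening, algorithmic bias, and institutional misuse that automated classification of learners entails requires the development of robust ethical and legal frameworks for research in this area. This paper addresses both gaps. We formulate dyslexic error attribution as a binary classification task: given a misspelt word and its correct target form, determine whether the error pattern is characteristic of a dyslexic or non-dyslexic writer. We develop a comprehensive feature set capturing orthographic, phonological, and morphological properties of each error, and propose a twin-input neural model evaluated against traditional machine learning baselines under writer-independent conditions. The neural model achieves 93.01% accuracy and an F1-score of 94.01%, with phonetically plausible errors and vowel confusions emerging as the strongest attribution signals. 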
We situate these technical results within an explicit ethics-first framework, analysing fairness across subgroups, the interpretability requirements of educational deployment, and the conditions (consent, transparency, human oversight, and recourse) under which a system could be responsibly used. We provide concrete guidelines for ethical deployment and an open discussion of the system's limitations and misuse potential. Our results demonstrate that dyslexic error attribution is feasible at high accuracy while underscoring that feasibility alone is insufficient for deployment in high-stakes educational contexts.

[38] SURE: Synergistic Uncertainty-aware Reasoning for Multimodal Emotion Recognition in Conversations

Yiqiang Cai,Chengyan Wu,Bolei Ma,Bo Chen,Yun Xue,Julia Hirschberg,Ziwei Gong

Main category: cs.CL

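TL;DR: The SURE framework combines an uncertainty-aware Mixture-of-Experts module, an Iterative Reasoning module, and a Transformer Gate module to improve the robustness and contextual modeling of multimodal emotion recognition in conversations, outperforming existing methods on benchmark datasets.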

Details Motivation: Existing methods overemphasize modality fusion and overlook both the uncertainty in noisy features and the need for fine-grained contextual reasoning. Method: SURE comprises an Uncertainty-Aware Mixture-of-Experts module (handling modality-specific noise), an Iterative Reasoning module (multi-turn reasoning over context), and a Transformer Gate module (modeling intra- and inter-modal interactions). Result: SURE consistently outperforms state-of-the-art methods on benchmark MERC datasets, supporting the effectiveness of uncertainty modeling and iterative reasoning. Conclusion: Uncertainty modeling and iterative reasoning are important for improving multimodal emotion recognition in conversational settings. Abstract: Multimodal emotion recognition in conversations (MERC) requires integrating multimodal signals while being robust to noise and modeling contextual reasoning. Existing approaches often emphasize fusion but overlook uncertainty in noisy features and fine-grained reasoning. We propose SURE (Synergistic Uncertainty-aware REasoning) for MERC, a framework that improves robustness and contextual modeling. SURE consists of three components: an Uncertainty-Aware Mixture-of-Experts module to handle modality-specific noise, an Iterative Reasoning module for multi-turn reasoning over context, and a Transformer Gate module to capture intra- and inter-modal interactions. Experiments on benchmark MERC datasets show that SURE consistently outperforms state-of-the-art methods, demonstrating its effectiveness in robust multimodal reasoning. These results highlight the importance of uncertainty modeling and iterative reasoning in advancing emotion recognition in conversational settings.

[39] Is Clinical Text Enough? A Multimodal Study on Mortality Prediction in Heart Failure Patients

Oumaima El Khettari,Virgile Barthet,Guillaume Hocquet,Joconde Weller,Emmanuel Morin,Pierre Zweigenbaum

Main category: cs.CL

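TL;DR: This paper evaluates transformer-based models for short-term mortality prediction on a French heart failure cohort, finding that entity-aware multimodal transformers perform best while current LLM prompting remains limited for clinical decision support.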

Details Motivation: Accurately predicting short-term mortality in heart failure patients is challenging, especially when relying on structured electronic health record data alone. Method: Evaluate transformer-based models (text-only, structured-only, multimodal, and LLM-based) on a French heart failure cohort, comparing modalities and fusion strategies. Result: Entity-level text representations outperform CLS embeddings; supervised multimodal fusion of text and structured variables performs best; LLMs behave inconsistently across modalities and decoding strategies, with text-only prompts outperforming structured or multimodal inputs. Conclusion: Entity-aware multimodal transformers are the most reliable option for short-term heart failure outcome prediction, while current LLM prompting remains limited for clinical decision support. Abstract: Accurate short-term mortality prediction in heart failure (HF) remains challenging, particularly when relying on structured electronic health record (EHR) data alone. We evaluate transformer-based models on a French HF cohort, comparing text-only, structured-only, multimodal, and LLM-based approaches. Our results show that enriching clinical text with entity-level representations improves prediction over CLS embeddings alone, and that supervised multimodal fusion of text and structured variables achieves the best overall performance. In contrast, large language models perform inconsistently across modalities and decoding strategies, with text-only prompts outperforming structured or multimodal inputs. These findings highlight that entity-aware multimodal transformers offer the most reliable solution for short-term HF outcome prediction, while current LLM prompting remains limited for clinical decision support.

[40] ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues

Bhaskara Hanuma Vedula,Darshan Anghan,Ishita Goyal,Ponnurangam Kumaraguru,Abhijnan Chakraborty

Main category: cs.CL

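TL;DR: This paper proposes the ImplicitBBQ benchmark, which uses characteristic-based cues to evaluate implicit bias in large language models across age, gender, region, religion, caste, and socioeconomic status, and finds that current alignment and prompting strategies do little to reduce culturally grounded stereotypic associations.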

Details Motivation: Existing implicit bias tests rely on name-based proxies, which carry weak associations and cover few dimensions (for example, they cannot capture age or socioeconomic status); more reliable, culturally relevant implicit cues are needed for a comprehensive evaluation. Method: Build the ImplicitBBQ QA benchmark, which signals social demographics implicitly through culturally associated characteristic cues rather than names; systematically compare explicit and implicit bias on 11 models and test mitigation strategies including safety prompting, chain-of-thought, and few-shot prompting. Result: In ambiguous contexts, implicit bias is more than six times higher than explicit bias in open-weight models; safety prompting and chain-of-thought help little; few-shot prompting reduces implicit bias by 84%, yet caste bias remains at four times the level of any other dimension. Conclusion: Current alignment and prompting strategies mitigate only surface-level bias and leave culturally grounded stereotypic associations largely unresolved; ImplicitBBQ provides a new benchmark and open resources for future mitigation research. Abstract: Large Language Models increasingly suppress biased outputs when demographic identity is stated explicitly, yet may still exhibit implicit biases when identity is conveyed indirectly. Existing benchmarks use name based proxies to detect implicit biases, which carry weak associations with many social demographics and cannot extend to dimensions like age or socioeconomic status. We introduce ImplicitBBQ, a QA benchmark that evaluates implicit bias through characteristic based cues, culturally associated attributes that signal implicitly, across age, gender, region, religion, caste, and socioeconomic status. Evaluating 11 models, we find that implicit bias in ambiguous contexts is over six times higher than explicit bias in open weight models. Safety prompting and chain-of-thought reasoning fail to substantially close this gap; even few-shot prompting, which reduces implicit bias by 84%, leaves caste bias at four times the level of any other dimension. These findings indicate that current alignment and prompting strategies address the surface of bias evaluation while leaving culturally grounded stereotypic associations largely unresolved. We publicly release our code and dataset for model providers and researchers to benchmark potential mitigation techniques.

[41] Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification

Géraud Faye,Benjamin Icard,Morgane Casanova,Guillaume Gadek,Guillaume Gravier,Wassila Ouerdane,Céline Hudelot,Sylvain Gatepaille,Paul Égré

Main category: cs.CL

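TL;DR: This paper proposes a neurosymbolic method that combines non-contextual text embeddings (fastText) with symbolic conceptual features (such as genre, topic, and persuasion techniques) to improve the robustness and generalization of propagandist news detection; experiments show it outperforms text-only methods.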

Details Motivation: Existing propaganda detection methods based on language models such as BERT tend to overfit because of biases in data collection, which leads to poor generalization to new sources. Method: A neurosymbolic hybrid classifier that fuses fastText text embeddings with symbolic conceptual features (genre, topic, persuasion techniques). Result: The method outperforms equivalent text-only baselines on propagandist news detection; ablation studies and explainability analyses confirm the benefit of the symbolic features. Conclusion: Adding symbolic conceptual features improves robustness and cross-source generalization, making neurosymbolic fusion a viable approach to information disorder. Abstract: Among news disorders, propagandist news is particularly insidious, because it tends to mix oriented messages with factual reports intended to look like reliable news. To detect propaganda, extant approaches based on Language Models such as BERT are promising but often overfit their training datasets, due to biases in data collection. To enhance classification robustness and improve generalization to new sources, we propose a neurosymbolic approach combining non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies as well as explainability analyses confirm the benefits of the added features. Keywords: Information disorder, Fake news, Propaganda, Classification, Topic modeling, Hybrid method, Neurosymbolic model, Ablation, Robustness

[42] How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization

Ramon Ferrer-i-Cancho

Main category: cs.CL

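TL;DR: This paper proposes a framework for measuring how optimal word and gesture orders are with respect to swap distance minimization on the permutohedron, introduces the quadratic assignment problem (QAP) as a unifying modeling tool, and shows that crosslinguistic gesture orders are at least 77% optimal, supporting a general principle of optimal assignment in language systems.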

Details Motivation: Investigate whether word and gesture orders tend to minimize swap distance within the structure of permutations, as a possible mechanism for minimizing cognitive or communicative cost. Method: Build a swap-distance measure on the permutohedron; propose a mathematical framework that quantifies the degree of optimality of word or gesture order; introduce the quadratic assignment problem (QAP) into language research as a unifying optimization model. Result: Crosslinguistic gesture orders are at least 77% optimal, and this level of optimality is unlikely to be due to chance; the work lays theoretical foundations for swap distance minimization in language and gesture systems. Conclusion: Linear order in language and gesture follows a general principle of optimal assignment, of which swap distance minimization is one instance; QAP can serve as an umbrella framework for multiple linguistic optimization phenomena. Abstract: The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least 77% optimal. It is unlikely that the multiple times where crosslinguistic gestures hit optimality are due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.
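The swap distance itself is easy to compute: the shortest path between two permutations in the permutohedron, using swaps of adjacent elements, equals the number of pairs that appear in opposite relative order (the Kendall tau distance). The sketch below illustrates only that distance; the function name is an assumption, and the paper's optimality score is not reproduced here.

```python
def swap_distance(source, target):
    """Minimum number of adjacent swaps turning `source` into `target`.

    Equals the number of inversions of `source` when its items are
    ranked by their position in `target` (Kendall tau distance).
    """
    position = {item: i for i, item in enumerate(target)}
    ranks = [position[item] for item in source]
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


# Orders of subject (S), object (O), and verb (V) relative to a source order SOV.
for order in ["SOV", "SVO", "OSV", "VSO", "OVS", "VOS"]:
    print(order, swap_distance("SOV", order))
```

For three elements the distances range from 0 (the source order itself) to 3 (its mirror image), so orders one swap away from the source are the ones the hypothesis predicts to be most likely.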

[43] Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite

Klaudia Thellmann,Bernhard Stadler,Michael Färber

Main category: cs.CL

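TL;DR: This paper proposes a three-step automated quality assurance approach for assessing and improving the machine-translated EU20 benchmark suite, finds that datasets with lower COMET scores contain more mistranslation errors at span level, and releases cleaned datasets and code.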

Details Motivation: Machine-translated benchmark datasets offer cost and scale advantages, but noise, loss of structure, and uneven quality weaken confidence in them; scalable ways to measure and verify translation reliability are needed. Method: A three-step automated quality assurance pipeline: (i) a structural corpus audit with targeted fixes; (ii) quality profiling with the neural metric COMET (reference-based and reference-free), including a comparison of DeepL, ChatGPT, and Google translation services; (iii) an LLM-based span-level translation error landscape. Result: Datasets with lower COMET scores (notably HellaSwag) show a higher share of accuracy/mistranslation errors at span level; reference-based COMET on MMLU against human-edited samples points in the same direction; ARC is comparatively clean; cleaned EU20 datasets and code are released. Conclusion: Automated quality assurance offers practical, scalable quality indicators that help prioritize human review, complementing rather than replacing human gold standards. Abstract: Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets, and code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review -- complementing, not replacing, human gold standards.

[44] SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning

Daeyong Kwon,Soyoung Yoon,Seung-won Hwang

Main category: cs.CL

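TL;DR: This paper proposes SAFE, a framework that uses a Knowledge Graph (KG)-grounded verification pipeline to eliminate ungrounded reasoning steps in multi-hop QA at both train time and inference time, improving the verifiability and accuracy of model reasoning.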

Details Motivation: Existing multi-hop QA benchmarks often reward large language models for spurious correctness, masking ungrounded or flawed reasoning steps; a more rigorous framework for evaluating reasoning is needed. Method: SAFE is a dynamic benchmarking framework: at train time it builds an atomic error taxonomy and a KG-grounded verification pipeline to identify and remove unanswerable instances; at inference time a feedback model trained on the verified data detects ungrounded steps in real time. Result: At train time SAFE exposes critical flaws in existing benchmarks (up to 14% of instances are unanswerable); at inference time it improves average accuracy by 8.4 percentage points while keeping reasoning trajectories verifiable. Conclusion: SAFE offers a more rigorous, verifiable evaluation paradigm for multi-hop QA, moving models from surface correctness toward genuinely grounded reasoning. Abstract: Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real-time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference-time.

[45] $k$NNProxy: Efficient Training-Free Proxy Alignment for Black-Box Zero-Shot LLM-Generated Text Detection

Kahim Wong,Kemou Li,Haiwei Wu,Jiantao Zhou

Main category: cs.CL

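TL;DR: This paper proposes kNNProxy, a training-free and query-efficient method for zero-shot detection of LLM-generated text that repurposes the kNN-LM retrieval mechanism as a domain adapter for a fixed proxy LLM, and extends it to a mixture of proxies (MoP) for better robustness under domain shift.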

Details Motivation: Zero-shot detectors assume that the proxy LLM is well aligned with the source LLM, which often fails in black-box settings; existing alignment methods need supervised fine-tuning or repeated API calls, which brings high cost, fragility, and poor transfer across domains. Method: kNNProxy builds a lightweight datastore from a target-reflective LGT corpus; at inference, k-nearest-neighbor retrieval yields a token-level predictive distribution from neighboring examples, which is interpolated with the proxy LLM's output to obtain an aligned prediction; the MoP extension routes each input to a domain-specific datastore. Result: Experiments across multiple datasets and source LLMs show strong detection accuracy, cross-domain robustness, and query efficiency for kNNProxy and MoP. Conclusion: kNNProxy is a training-free, low-query, domain-adaptive approach to zero-shot LGT detection that mitigates the proxy alignment problem. Abstract: LLM-generated text (LGT) detection is essential for reliable forensic analysis and for mitigating LLM misuse. Existing LGT detectors can generally be categorized into two broad classes: learning-based approaches and zero-shot methods. Compared with learning-based detectors, zero-shot methods are particularly promising because they eliminate the need to train task-specific classifiers. However, the reliability of zero-shot methods fundamentally relies on the assumption that an off-the-shelf proxy LLM is well aligned with the often unknown source LLM, a premise that rarely holds in real-world black-box scenarios. To address this discrepancy, existing proxy alignment methods typically rely on supervised fine-tuning of the proxy or repeated interactions with commercial APIs, thereby increasing deployment costs, exposing detectors to silent API changes, and limiting robustness under domain shift. Motivated by these limitations, we propose the $k$-nearest neighbor proxy ($k$NNProxy), a training-free and query-efficient proxy alignment framework that repurposes the $k$NN language model ($k$NN-LM) retrieval mechanism as a domain adapter for a fixed proxy LLM. Specifically, a lightweight datastore is constructed once from a target-reflective LGT corpus, either via fixed-budget querying or from existing datasets. During inference, nearest-neighbor evidence induces a token-level predictive distribution that is interpolated with the proxy output, yielding an aligned prediction without proxy fine-tuning or per-token API outputs. 
To improve robustness under domain shift, we extend $k$NNProxy into a mixture of proxies (MoP) that routes each input to a domain-specific datastore for domain-consistent retrieval. Extensive experiments demonstrate strong detection performance of our method.
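The interpolation step follows the general kNN-LM recipe, which can be sketched in a few lines. The code below is an illustration under assumptions, not the authors' implementation: the datastore is a tiny in-memory list of (context vector, next token) pairs, neighbor weights use a softmax over negative squared distances, and `lam`, `k`, and `temperature` are hypothetical hyperparameters.

```python
import math


def knn_distribution(query, keys, next_tokens, vocab_size, k=2, temperature=1.0):
    """Token distribution induced by the k nearest datastore entries."""
    dists = [sum((a - b) ** 2 for a, b in zip(key, query)) for key in keys]
    nearest = sorted(range(len(keys)), key=lambda i: dists[i])[:k]
    weights = [math.exp(-dists[i] / temperature) for i in nearest]
    total = sum(weights)
    p_knn = [0.0] * vocab_size
    for w, i in zip(weights, nearest):
        p_knn[next_tokens[i]] += w / total  # neighbors vote for their stored next token
    return p_knn


def interpolate(p_proxy, p_knn, lam=0.5):
    """Aligned prediction: lam * p_kNN + (1 - lam) * p_proxy."""
    return [lam * q + (1.0 - lam) * p for p, q in zip(p_proxy, p_knn)]


keys = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]  # stored context vectors
next_tokens = [1, 1, 2]                      # token id that followed each context
p_knn = knn_distribution([0.0, 0.0], keys, next_tokens, vocab_size=3)
print(interpolate([0.5, 0.25, 0.25], p_knn))
```

Because both inputs are probability distributions, the interpolated output also sums to one, so it can be plugged into any likelihood-based zero-shot detection score in place of the raw proxy distribution.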

[46] Why Gaussian Diffusion Models Fail on Discrete Data?

Alexander Shabalin,Simon Elistratov,Viacheslav Meshchaninov,Ildus Sadrtdinov,Dmitry Vetrov

Main category: cs.CL

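TL;DR: This paper studies why Gaussian diffusion models with the DDPM solver struggle to sample discrete data, finds that sample quality degrades within a specific noise interval where the density becomes multimodal, and proposes combining self-conditioning with q-sampling to improve generation quality.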

Details Motivation: Gaussian diffusion models excel in continuous domains but are hard to apply to discrete data such as text, code, and protein sequences; during sampling, existing methods tend to enter low-density regions, which produces out-of-distribution inputs and lowers sample quality. Method: Use a Random Hierarchy Model to analyse DDPM on discrete distributions represented as mixtures of delta-distributions and identify the critical noise interval where the density becomes multimodal; apply self-conditioning and a solver the authors term q-sampling, switching from DDPM to q-sampling within that interval. Result: The combined strategy improves generation quality on discrete data and is validated across tasks and domains including text, programming code, and proteins. Conclusion: DDPM's failure on discrete data stems from multimodal density in a critical noise interval; combining self-conditioning with q-sampling alleviates the problem and offers a new direction for discrete diffusion modeling. Abstract: Diffusion models have become a standard approach for generative modeling in continuous domains, yet their application to discrete data remains challenging. We investigate why Gaussian diffusion models with the DDPM solver struggle to sample from discrete distributions that are represented as a mixture of delta-distributions in the continuous space. Using a toy Random Hierarchy Model, we identify a critical sampling interval in which the density of noisified data becomes multimodal. In this regime, DDPM occasionally enters low-density regions between modes producing out-of-distribution inputs for the model and degrading sample quality. We show that existing heuristics, including self-conditioning and a solver we term q-sampling, help alleviate this issue. Furthermore, we demonstrate that combining self-conditioning with switching from DDPM to q-sampling within the critical interval improves generation quality on real data. We validate these findings across conditional and unconditional tasks in multiple domains, including text, programming code, and proteins.

[47] Tracking the emergence of linguistic structure in self-supervised models learning from speech

Marianne de Heer Kloots,Martijn Bentum,Hosein Mohebbi,Charlotte Pouw,Gaofei Shen,Willem Zuidema

Main category: cs.CL

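TL;DR: This paper studies how six Wav2Vec2 and HuBERT models pre-trained on spoken Dutch encode a range of linguistic structures across layers and training checkpoints, finding that structures emerge at different layers and times and are strongly affected by the level at which the pre-training objective is defined (for example, iteratively refined pseudo-labels).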

Details Motivation: Understand when and how linguistic structure emerges during the training of self-supervised speech models. Method: For six Wav2Vec2 and HuBERT models trained on Dutch, systematically analyse the encoding of multiple linguistic structures across network layers and training checkpoints. Result: Different linguistic structures show clearly distinct layerwise patterns and learning trajectories; the degree of abstraction from the acoustic signal and the timescale of information integration shape these dynamics; the level of the pre-training objective, especially higher-order pseudo-label prediction, strongly affects layerwise organization and learning paths. Conclusion: The emergence of linguistic structure in self-supervised speech models is structured and staged, depending not only on architecture but also on pre-training task design, in particular the abstraction level of the targets. Abstract: Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).

[48] BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs

Nicolas Boizard,Théo Deschamps-Berger,Hippolyte Gisserot-Boukhlef,Céline Hudelot,Pierre Colombo

Main category: cs.CL

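TL;DR: This paper proposes an open-source recipe for efficiently converting causal generative language models (such as Gemma3 and Qwen3) into bidirectional encoders (BidirLM). Its key ingredients are a prior masking phase, linear weight merging plus a lightweight multi-domain data mixture to prevent forgetting, and merging with specialized causal models to add multimodal capabilities; the resulting encoders outperform alternatives on text, vision, and audio representation tasks.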

Details Motivation: Existing methods for turning causal generative models into bidirectional encoders lack consensus on training objectives, suffer from catastrophic forgetting, and cannot flexibly integrate specialized generative models. Method: Systematic ablations on the Gemma3 and Qwen3 families identify the critical role of a prior masking phase; a dual strategy that needs no original pre-training data combines linear weight merging with a lightweight multi-domain data mixture; the encoders are further merged with specialized causal models to transfer modality- and domain-specific knowledge. Result: Five BidirLM bidirectional encoders that outperform alternatives on text, vision, and audio representation benchmarks. Conclusion: The open-source recipe is designed for any causal decoder LLM and offers a reproducible, extensible way to build high-performing bidirectional encoders efficiently. Abstract: Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.

[49] Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding

Tao Jin,Phuong Minh Nguyen,Naoya Inoue

Main category: cs.CL

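TL;DR: This paper proposes GOOSE, a framework that builds an adaptive spine tree (an anisotropic tree) to make speculative decoding more efficient, achieving a 1.9-4.3x lossless speedup.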

Details Motivation: Existing training-free speculative decoding methods do not distinguish candidate quality across token sources, which yields inefficient trees; the two common sources, n-gram matches and statistical predictions, differ markedly in acceptance rate (about 6x median gap), so the tree shape should adapt to this quality difference. Method: GOOSE builds a spine tree in which high-acceptance context-matched tokens form a deep main chain and low-acceptance tokens spread as wide branches at each node; the authors prove that, when such a quality gap exists, the optimal tree under a fixed verification budget is anisotropic, and that the number of tokens accepted per step is at least as large as with either source alone. Result: On five LLMs (7B-33B) and five benchmarks, GOOSE achieves a 1.9-4.3x lossless speedup and outperforms balanced-tree baselines by 12-33% under the same budget. Conclusion: When candidate tokens differ substantially in quality, an asymmetric (anisotropic) tree is better than a conventional balanced tree; GOOSE demonstrates the effectiveness of this design for training-free efficient inference. Abstract: Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from prior forward passes - differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.
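A back-of-the-envelope calculation shows why depth pays off only for reliable tokens. The sketch below assumes a single draft chain whose tokens are accepted independently with the same probability `p`, so token k survives only if all k tokens up to it pass; this independence assumption and the two example acceptance rates are illustrative and are not taken from the paper.

```python
def expected_accepted(p: float, depth: int) -> float:
    """Expected number of accepted tokens for a chain of `depth` draft tokens."""
    return sum(p ** k for k in range(1, depth + 1))


def marginal_gain(p: float, depth: int) -> float:
    """Extra expected tokens from extending the chain from `depth` to `depth + 1`."""
    return p ** (depth + 1)


HIGH, LOW = 0.6, 0.1  # hypothetical acceptance rates with a 6x gap
for depth in (1, 2, 4, 8):
    print(depth, round(expected_accepted(HIGH, depth), 3), round(expected_accepted(LOW, depth), 3))
```

With the low-acceptance source the expected gain is almost exhausted after one or two tokens, so spending the verification budget on breadth is more useful there, while the high-acceptance source keeps benefiting from extra depth.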

[50] Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

Yuhang Wu,Xiangqing Shen,Fanfan Wang,Cangqi Zhou,Zhen Wu,Xinyu Dai,Rui Xia

Main category: cs.CL

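TL;DR: This paper proposes ReRanking Preference Optimization (RRPO), a reinforcement learning framework that optimizes rerankers directly from LLM generation-quality feedback, aligning them with the downstream generation task and removing the dependence on human-annotated relevance labels.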

Details Motivation: Existing rerankers are optimized only on static human-annotated relevance labels, decoupled from what the downstream LLM actually needs for generation, so highly relevant documents do not necessarily have high contextual utility. Method: Formulate reranking as a sequential decision-making process, use LLM feedback as the reward signal for preference optimization, and introduce a reference-anchored deterministic baseline to stabilize training. Result: RRPO significantly outperforms strong baselines such as RankZephyr on knowledge-intensive benchmarks; it generalizes across LLM readers, integrates orthogonally with query expansion modules, and is robust to noisy supervision. Conclusion: RRPO bridges the objective gap between retrieval and generation, enabling reranker optimization driven by generation utility without human annotation. Abstract: Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.

[51] Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations

Haitong Sun,Stephen McIntosh,Kwanghee Choi,Eunjung Yeo,Daisuke Saito,Nobuaki Minematsu

Main category: cs.CL

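TL;DR: This paper proposes prosodic ABX, a new method for evaluating how sensitive self-supervised speech models (S3Ms) are to prosodic contrasts (such as stress, tone, and pitch accent), and validates it on English and Japanese minimal-pair datasets built by the authors together with a Mandarin dataset.

Details Motivation: Prior work has examined the sensitivity of S3Ms to phonemic contrasts, but their sensitivity to prosodic contrasts has not been measured directly; an evaluation method is needed that works in low-resource settings without explicit labels. Method: Extends the ABX discrimination task to prosodic ABX, builds English and Japanese prosodic minimal-pair datasets and uses them alongside a Mandarin dataset, and evaluates, with no explicit labels, how well S3M representations discriminate stress, pitch accent, and tone. Result: Experiments show that prosodic ABX can evaluate sensitivity to prosodic contrasts across languages; model and layer rankings are largely consistent across several experimental conditions, making the method suitable for low-resource settings. Conclusion: Prosodic ABX is a feasible, efficient, and generalizable framework for evaluating prosodic contrast, offering a new perspective on the speech representations learned by S3Ms. Abstract: Speech representations from self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been directly measured. The ABX discrimination task has been used to measure phonemic contrast in S3M representations via minimal pairs. We introduce prosodic ABX, an extension of this framework to evaluate prosodic contrast with only a handful of examples and no explicit labels. Also, we build and release a dataset of English and Japanese minimal pairs and use it along with a Mandarin dataset to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, we show that model and layer rankings are often preserved across several experimental conditions, making it practical for low-resource settings.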
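
The ABX discrimination task that this entry builds on can be sketched in a few lines. In each triple, X belongs to the same category as A (for example, the same stress pattern), and the trial counts as correct when X lies closer to A than to B. The sketch assumes each item has already been pooled into one fixed-length vector and uses cosine distance; both choices are assumptions made for brevity (frame-level comparisons are also common), and `abx_error_rate` is a hypothetical name.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def abx_error_rate(triples):
    """Fraction of (a, b, x) triples where x is NOT closer to a than to b.

    Each triple holds three vectors; x shares a's category, b is the
    contrasting member of the minimal pair. Ties count as half an error.
    """
    triples = list(triples)
    if not triples:
        raise ValueError("need at least one (a, b, x) triple")
    errors = 0.0
    for a, b, x in triples:
        d_ax = cosine_distance(a, x)
        d_bx = cosine_distance(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
    return errors / len(triples)
```

An error rate near 0.5 means the representation does not separate the two members of the minimal pair, while a rate near 0 means the contrast is well encoded.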

[52] Reliable Control-Point Selection for Steering Reasoning in Large Language Models

Haomin Zhuang,Hojun Yoo,Xiaonan Luo,Kehan Guo,Xiangliang Zhang

Main category: cs.CL

TL;DR: This paper proposes a method for controlling reasoning behaviors based on stability filtering and content-subspace projection, improving the controllability and generalization of spontaneously emerging reasoning behaviors (such as self-reflection) in large language models.

Details Motivation: Existing methods that detect spontaneous reasoning behaviors (such as self-reflection) via keyword matching assume that every detected boundary corresponds to a genuine behavioral signal, but empirically 93.3% of the detected boundaries are behaviorally unstable and not reproducible, which yields weak steering vectors. Method: Builds a probabilistic model that treats intrinsic reasoning behaviors as context-dependent stochastic events; proposes stability filtering (retaining only boundaries where the behavior is reproducible); and combines it with content-subspace projection to remove question-specific noise. Result: Reaches 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline); the resulting steering vectors transfer directly across models in the same architecture family, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0). Conclusion: The stability of the behavioral signal is a key prerequisite for building high-quality steering vectors; the proposed method balances reliability and generalization and offers a new approach to training-free reasoning control. Abstract: Steering vectors offer a training-free mechanism for controlling reasoning behaviors in large language models, but constructing effective vectors requires identifying genuine behavioral signals in the model's hidden states. For behaviors that can be toggled via prompts, this is straightforward. However, many reasoning behaviors, such as self-reflection, emerge spontaneously and resist prompt-level control. Current methods detect these behaviors through keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. We show that this assumption is overwhelmingly wrong: across 541 keyword-detected boundaries, 93.3% are behaviorally unstable, failing to reproduce the detected behavior under re-generation from the same prefix. We develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and show that unstable boundaries dilute the steering signal. Guided by this analysis, we propose stability filtering, which retains only boundaries where the model consistently reproduces the target behavior. Combined with a content-subspace projection that removes residual question-specific noise, our method achieves 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline). The resulting steering vectors transfer across models in the same architecture family without re-extraction, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0).
Code is available at https://github.com/zhmzm/stability-steering.
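
The stability-filtering idea in this abstract (keep a boundary only if the model reproduces the behavior when re-generating from the same prefix) can be sketched as follows. This is not taken from the linked repository: the callables `regenerate` and `shows_behavior`, the sample count, and the 0.75 threshold are all hypothetical stand-ins chosen for illustration.

```python
def stability_filter(boundaries, regenerate, shows_behavior,
                     n_samples=8, min_rate=0.75):
    """Keep only boundaries whose behavior reproduces under re-generation.

    boundaries: prefixes ending at a keyword-detected boundary.
    regenerate(prefix, i): returns the i-th sampled continuation (assumed).
    shows_behavior(text): True if the target behavior appears (assumed).
    n_samples, min_rate: illustrative values, not taken from the paper.
    """
    kept = []
    for prefix in boundaries:
        hits = sum(
            1 for i in range(n_samples)
            if shows_behavior(regenerate(prefix, i))
        )
        if hits / n_samples >= min_rate:
            kept.append(prefix)
    return kept
```

Steering vectors would then be extracted only from the hidden states at the surviving boundaries, which is the step the abstract credits for a cleaner steering signal.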

[53] GaelEval: Benchmarking LLM Performance for Scottish Gaelic

Peter Devine,William Lamb,Beatrice Alex,Ignatius Ezeani,Dawn Knight,Mícheál J. Ó Meachair,Paul Rayson,Martin Wynne

Main category: cs.CL

TL;DR: This paper introduces GaelEval, the first multi-dimensional benchmark for Scottish Gaelic, to evaluate multilingual LLMs on morphosyntax, translation, and cultural knowledge; results show some models surpass fluent human baselines in grammar tasks, proprietary models outperform open-weight ones, and in-language prompting gives a small but consistent boost.

Details Motivation: Multilingual LLMs show uneven, under-measured performance on unsupported minority languages like Scottish Gaelic, where standard translation benchmarks fail to assess structural linguistic competence. Method: The authors design GaelEval, a new benchmark with three components: (i) expert-authored morphosyntactic multiple-choice QA, (ii) culturally grounded translation, and (iii) large-scale cultural knowledge Q&A. They evaluate 19 LLMs against a human baseline of 30 fluent speakers. Result: Gemini 3 Pro Preview achieves 83.3% accuracy on the morphosyntactic task, exceeding the human baseline (78.1%); proprietary models consistently outperform open-weight ones; Gaelic prompting yields a +2.4% gain on linguistic tasks; on cultural Q&A, top models exceed 90% accuracy but perform worse under Gaelic prompting, and scores are inflated relative to manual evaluation. Conclusion: Frontier LLMs can surpass human-level performance in Gaelic morphosyntax, Gaelic prompting helps marginally, and proprietary models hold a consistent advantage, revealing both promise and limitations in evaluating minority language capabilities. Abstract: Multilingual large language models (LLMs) often exhibit emergent 'shadow' capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is particularly acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark; and (iii) a large-scale cultural knowledge Q&A task. Evaluating 19 LLMs against a fluent-speaker human baseline (n=30), we find that Gemini 3 Pro Preview achieves 83.3% accuracy on the linguistic task, surpassing the human baseline (78.1%).
Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting yields a small but stable advantage (+2.4%). On the cultural task, leading models exceed 90% accuracy, though most systems perform worse under Gaelic prompting and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval reveals that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting and shows a consistent performance gap favouring proprietary over open-weight models.

[54] Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents

Xuan Qi

Main category: cs.CL

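TL;DR: This paper systematically studies how chain-of-thought (CoT) reasoning length affects function-calling language agents, finding that brief reasoning (32 tokens) markedly improves accuracy while long reasoning severely harms it; further analysis shows that the core mechanism is function routing, which motivates FR-CoT, a structured brief-CoT method that maintains accuracy while reducing function hallucination to 0.0%.

Details Motivation: CoT is widely assumed to improve agent performance, but in structured tool-use settings the relationship between reasoning length and accuracy remains unclear and calls for systematic empirical study. Method: On 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark, sweeps six token budgets (0–512) on Qwen2.5-1.5B-Instruct; combines a three-way error decomposition, oracle analysis, and a finer-grained budget sweep; and proposes the structured method Function-Routing CoT (FR-CoT), which templates the reasoning as "Function: [name] / Key args: [...]". Result: Brief CoT (32 tokens) raises accuracy from 44.0% to 64.0% (+45% relative), while long CoT (256 tokens) drops it to 25.0%; error analysis shows that brief CoT sharply reduces wrong function selection (30.5% → 1.5%), whereas long CoT induces a high hallucination rate (18.0%); the oracle shows that 88.6% of solvable tasks need at most 32 tokens, with the optimum at 8–16 tokens; FR-CoT matches the accuracy of free-form brief CoT while reducing function hallucination to 0.0%. Conclusion: In function-calling tasks CoT mainly serves as function routing rather than deep reasoning in the traditional sense; reasoning length has a strongly non-monotonic effect, and too much is harmful; structured brief CoT (such as FR-CoT) can offer greater reliability and stability without budget tuning. Abstract: How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0–512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8–16 tokens.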
Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]," forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.
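
The FR-CoT template quoted above ("Function: [name] / Key args: [...]") lends itself to a simple check that the reasoning commits to a function from the candidate set. The sketch below only validates a finished reasoning string after the fact; how the paper enforces the template during decoding is not described in the abstract, and `parse_fr_cot` and the regular expression are hypothetical.

```python
import re

# Matches "Function: <name> / Key args: <anything>" at the start of reasoning.
FR_COT_PATTERN = re.compile(
    r"^Function:\s*(?P<name>[\w.]+)\s*/\s*Key args:\s*(?P<args>.*)$",
    re.DOTALL,
)


def parse_fr_cot(reasoning, candidate_functions):
    """Return (function_name, key_args_text) or None if the template fails.

    None is returned both when the text does not follow the template and
    when the named function is not among the candidates, which is the
    hallucinated-function case discussed in the abstract.
    """
    match = FR_COT_PATTERN.match(reasoning.strip())
    if match is None:
        return None
    name = match.group("name")
    if name not in candidate_functions:
        return None
    return name, match.group("args").strip()
```

For example, with candidates `{"get_weather", "get_time"}`, the string `Function: get_weather / Key args: city=Paris` parses to `("get_weather", "city=Paris")`, while a reasoning string naming `book_flight` is rejected.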

[55] AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics

Atilla Kaan Alkan,Felix Grezes,Sergi Blanco-Cuaresma,Jennifer Lynn Bartlett,Daniel Chivvis,Anna Kelbert,Kelly Lockhart,Alberto Accomazzi

Main category: cs.CL

TL;DR: This paper introduces the AstroConcepts corpus for studying extreme class imbalance in multi-label classification of astrophysics text, proposes frequency-stratified evaluation, and shows that vocabulary-constrained large language models are competitive on this task and that domain adaptation helps most on rare terms.

Details Motivation: Scientific multi-label text classification faces extreme class imbalance, and existing corpora lack comprehensive controlled vocabularies, making the problem hard to study systematically. Method: Builds the AstroConcepts corpus (abstracts of 21,702 astrophysics papers labeled with 2,367 Unified Astronomy Thesaurus concepts), runs benchmark experiments with traditional models, neural networks, and vocabulary-constrained LLMs, and introduces a frequency-stratified evaluation strategy. Result: Vocabulary-constrained LLMs perform competitively with domain-adapted models; domain adaptation yields relatively larger gains on rare terms, though absolute performance remains limited; frequency-stratified evaluation reveals performance patterns hidden by aggregate metrics. Conclusion: AstroConcepts provides a new benchmark and resource for studying extreme imbalance in scientific NLP, and the proposed evaluation method and findings offer practical guidance for building robust scientific text classification systems. Abstract: Scientific multi-label text classification suffers from extreme class imbalance, where specialized terminology exhibits severe power-law distributions that challenge standard classification approaches. Existing scientific corpora lack comprehensive controlled vocabularies, focusing instead on broad categories and limiting systematic study of extreme imbalance. We introduce AstroConcepts, a corpus of English abstracts from 21,702 published astrophysics papers, labeled with 2,367 concepts from the Unified Astronomy Thesaurus. The corpus exhibits severe label imbalance, with 76% of concepts having fewer than 50 training examples. By releasing this resource, we enable systematic study of extreme class imbalance in scientific domains and establish strong baselines across traditional, neural, and vocabulary-constrained LLM methods. Our evaluation reveals three key patterns that provide new insights into scientific text classification. First, vocabulary-constrained LLMs achieve competitive performance relative to domain-adapted models in astrophysics classification, suggesting a potential for parameter-efficient approaches. Second, domain adaptation yields relatively larger improvements for rare, specialized terminology, although absolute performance remains limited across all methods. Third, we propose frequency-stratified evaluation to reveal performance patterns that are hidden by aggregate scores, thereby making robustness assessment central to scientific multi-label evaluation. These results offer actionable insights for scientific NLP and establish benchmarks for research on extreme imbalance.
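
Frequency-stratified evaluation, as described here, reports a score per label-frequency band instead of a single aggregate. A minimal sketch follows: labels are grouped by their training-set count and macro-F1 is averaged within each band. The band edges (50 and 500) are an assumption; only the 50-example cut is suggested by the abstract, and `stratified_macro_f1` is a hypothetical name.

```python
def per_label_f1(gold, pred, label):
    """F1 for one label over parallel lists of gold and predicted label sets."""
    tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
    fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
    fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# (band name, inclusive lower bound, exclusive upper bound) on training counts.
# These edges are illustrative assumptions.
BINS = (("rare", 0, 50), ("mid", 50, 500), ("frequent", 500, float("inf")))


def stratified_macro_f1(gold, pred, train_counts, bins=BINS):
    """Macro-F1 per frequency band; bands with no labels are omitted."""
    scores = {}
    for name, low, high in bins:
        labels = [lab for lab, c in train_counts.items() if low <= c < high]
        if labels:
            total = sum(per_label_f1(gold, pred, lab) for lab in labels)
            scores[name] = total / len(labels)
    return scores
```

A classifier that only ever predicts frequent concepts can still post a respectable aggregate score; the per-band breakdown makes the collapse on rare concepts visible.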

[56] Do Lexical and Contextual Coreference Resolution Systems Degrade Differently under Mention Noise? An Empirical Study on Scientific Software Mentions

Atilla Kaan Alkan,Felix Grezes,Jennifer Lynn Bartlett,Anna Kelbert,Kelly Lockhart,Alberto Accomazzi

Main category: cs.CL

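TL;DR: This paper reports participation in the SOMD 2026 shared task on cross-document software mention coreference resolution and proposes two fine-tuning-free methods, Fuzzy Matching (FM) and Context Aware Representations (CAR); both perform strongly on all three subtasks (CoNLL F1 of 0.94–0.96), with CAR slightly better, more robust to boundary noise, and more scalable; the study reveals complementary failure modes and recommends choosing a method according to the noise profile of the upstream mention detector and the corpus size.

Details Motivation: To address cross-document software mention coreference resolution, an emerging and underexplored task, and to explore lightweight, efficient fine-tuning-free methods given the high surface regularity of software names. Method: Proposes two fine-tuning-free methods: Fuzzy Matching (FM), based on lexical string similarity, and Context Aware Representations (CAR), which combines mention-level and document-level embeddings; also runs noise-injection experiments and an inference-efficiency analysis. Result: CAR scores about 1 F1 point higher than FM on the official test set; under boundary noise CAR is more robust (F1 drops by only 0.07 vs. 0.20 for FM), while under mention substitution FM degrades more gracefully (0.52 vs. 0.63 for CAR); CAR inference time scales approximately linearly, whereas FM scales superlinearly. Conclusion: For software mention coreference resolution, simple but carefully designed unsupervised or lightweight methods (such as CAR) are already effective; system selection should weigh both the type of noise in the data and the corpus size; the task is well suited to further exploration of low-resource, highly robust modeling approaches. Abstract: We present our participation in the SOMD 2026 shared task on cross-document software mention coreference resolution, where our systems ranked second across all three subtasks. We compare two fine-tuning-free approaches: Fuzzy Matching (FM), a lexical string-similarity method, and Context Aware Representations (CAR), which combines mention-level and document-level embeddings. Both achieve competitive performance across all subtasks (CoNLL F1 of 0.94-0.96), with CAR consistently outperforming FM by 1 point on the official test set, consistent with the high surface regularity of software names, which reduces the need for complex semantic reasoning. A controlled noise-injection study reveals complementary failure modes: as boundary noise increases, CAR loses only 0.07 F1 points from clean to fully corrupted input, compared to 0.20 for FM, whereas under mention substitution, FM degrades more gracefully (0.52 vs. 0.63). Our inference-time analysis shows that FM scales superlinearly with corpus size, whereas CAR scales approximately linearly, making CAR the more efficient choice at large scale. These findings suggest that system selection should be informed by both the noise profile of the upstream mention detector and the scale of the target corpus. We release our code to support future work on this underexplored task.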
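
A lexical string-similarity approach of the kind the abstract calls Fuzzy Matching can be sketched with the standard library. This is not the authors' released system: the use of `difflib.SequenceMatcher`, the greedy first-match clustering, the lowercase normalization, and the 0.85 threshold are all assumptions chosen to keep the example short.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold):
    """True if two mention strings are lexically close after normalization."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold


def cluster_mentions(mentions, threshold=0.85):
    """Greedily group mentions; each cluster is represented by its first member.

    A mention joins the first cluster whose representative it resembles,
    otherwise it starts a new cluster. The threshold is illustrative.
    """
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            if similar(mention, cluster[0], threshold):
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters
```

Because every new mention is compared against every existing cluster, the cost grows faster than linearly with corpus size when many distinct software names are present. A purely lexical method of this kind also depends heavily on clean mention boundaries, which fits the boundary-noise sensitivity reported for FM.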

[57] Adam's Law: Textual Frequency Law on Large Language Models

Hongyuan Adam Lu,Z. L.,Victor Wei,Zefan Zhang,Zhao Hong,Qiqi Xiang,Bowen Cao,Wai Lam

Main category: cs.CL

TL;DR: This paper proposes a new research direction based on textual frequency and builds a framework comprising Textual Frequency Law (TFL), Textual Frequency Distillation (TFD), and Curriculum Textual Frequency Training (CTFT), which estimates and exploits sentence-level textual frequency to improve LLM performance on several tasks.

Details Motivation: Textual frequency has been shown to affect human reading speed, but its effect on large language models (LLMs) has rarely been studied; moreover, LLM training data is often undisclosed, so new methods are needed to estimate textual frequency. Method: Proposes Textual Frequency Law (TFL), estimating sentence-level frequency from online resources; designs an input paraphraser that rewrites inputs into more frequent expressions; proposes Textual Frequency Distillation (TFD), which extends the corpus through LLM story completion to refine the frequency estimates; and finally fine-tunes with Curriculum Textual Frequency Training (CTFT) in increasing order of frequency. Result: Experiments on the curated dataset TFPD (covering math reasoning, machine translation, commonsense reasoning, and agentic tool calling) support the effectiveness of the framework. Conclusion: Textual frequency is an important factor in LLM performance, and the proposed TFL-TFD-CTFT framework systematically exploits frequency information to improve model performance, offering a new approach to LLM training and prompt engineering. Abstract: While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in terms of textual data frequency, which is an understudied topic, to the best of our knowledge. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs are closed-source in their training data, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. Results show the effectiveness of our framework.

[58] The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

Jeremy Herbst,Jae Hee Lee,Stefan Wermter

Main category: cs.CL

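TL;DR: This paper studies how experts in MoE architectures differ from dense feed-forward networks (FFNs) in interpretability, finding that MoE expert neurons are more monosemantic, especially as routing becomes sparser; it then proposes using experts rather than individual neurons as the basic unit of interpretation and shows that they act as fine-grained task experts (for example, closing brackets in LaTeX), indicating that MoEs are naturally interpretable at the expert level.

Details Motivation: To investigate whether the sparsity of MoE architectures makes them more interpretable than dense FFNs, and to settle the debate over whether experts are domain specialists or simple token processors. Method: Uses k-sparse probing to compare the polysemanticity of MoE experts and dense FFNs; performs automated large-scale interpretation with the expert as the unit; analyzes the functional types of experts. Result: MoE expert neurons are consistently less polysemantic, and the gap widens as routing becomes sparser; experts are neither broad domain specialists nor token processors, but fine-grained task experts that carry out specific linguistic or semantic tasks (such as closing brackets in LaTeX). Conclusion: MoE architectures are inherently interpretable at the expert level, and treating the expert as the basic unit of interpretation is more effective than the individual neuron, providing a new path for large-model interpretability. Abstract: Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using k-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis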

[59] Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model

Jaemin Kim,Jae O Lee,Sumyeong Ahn,Seo Yeon Park

Main category: cs.CL

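TL;DR: This paper proposes the Neuro-RIT framework, which uses neuron-level attribution analysis to separate neurons that process relevant versus irrelevant retrieved contexts and applies two-stage instruction tuning for noise suppression and evidence distillation, improving the robustness of RALMs on knowledge-intensive tasks.

Details Motivation: Existing robustness-enhancing methods for RALMs mostly apply coarse-grained parameter updates at the module or layer level, overlooking the inherent neuron-level sparsity of LLMs, and struggle to counter the performance degradation caused by irrelevant or noisy retrieved contexts. Method: Proposes Neuro-RIT: first mines key neurons with attribution methods, disentangling the neurons that process relevant versus irrelevant contexts; then applies two-stage instruction tuning that functionally deactivates neurons responding only to irrelevant contexts (noise suppression) and optimizes targeted layers to distill evidence (evidence distillation). Result: Across multiple QA benchmarks, Neuro-RIT consistently outperforms strong baselines and existing robustness-enhancing methods. Conclusion: Fine-grained neuron-level control is more effective than coarse-grained adaptation, and Neuro-RIT offers a new approach to improving RALM robustness. Abstract: Retrieval-Augmented Language Models (RALMs) have demonstrated significant potential in knowledge-intensive tasks; however, they remain vulnerable to performance degradation when presented with irrelevant or noisy retrieved contexts. Existing approaches to enhance robustness typically operate via coarse-grained parameter updates at the layer or module level, often overlooking the inherent neuron-level sparsity of Large Language Models (LLMs). To address this limitation, we propose Neuro-RIT (Neuron-guided Robust Instruction Tuning), a novel framework that shifts the paradigm from dense adaptation to precision-driven neuron alignment. Our method explicitly disentangles neurons that are responsible for processing relevant versus irrelevant contexts using attribution-based neuron mining. Subsequently, we introduce a two-stage instruction tuning strategy that enforces a dual capability for noise robustness: achieving direct noise suppression by functionally deactivating neurons exclusive to irrelevant contexts, while simultaneously optimizing targeted layers for evidence distillation. Extensive experiments across diverse QA benchmarks demonstrate that Neuro-RIT consistently outperforms strong baselines and robustness-enhancing methods.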

[60] Towards Position-Robust Talent Recommendation via Large Language Models

Silin Du,Hongyan Liu

Main category: cs.CL

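TL;DR: This paper proposes the L3TR framework, which uses a block attention mechanism, local positional encoding, and ID sampling to address the high token consumption of the pointwise paradigm as well as position bias and the lost-in-the-middle problem in LLM-based talent recommendation, enabling more efficient and fairer listwise talent recommendation.

Details Motivation: Existing LLM-based talent recommendation systems mostly follow a pointwise paradigm, which leads to high token consumption and cannot model relationships among candidates; LLMs are also affected by position bias and the lost-in-the-middle problem, making it hard to balance efficiency and effectiveness. Method: Proposes L3TR, a listwise talent recommendation framework comprising a block attention mechanism (to enhance inter-document processing), local positional encoding (to mitigate position bias and concurrent token bias), and ID sampling (to align candidate set sizes between training and inference), along with training-free debiasing methods and evaluation methods for detecting bias. Result: Extensive experiments on two real-world datasets show that L3TR consistently improves on existing baselines in recommendation quality while mitigating position bias and token bias. Conclusion: By implicitly exploiting the LLM's potential output and through targeted structural design, L3TR improves the practicality, fairness, and efficiency of LLMs for talent recommendation and offers a new approach to listwise LLM-based recommendation. Abstract: Talent recruitment is a critical, yet costly process for many industries, with high recruitment costs and long hiring cycles. Existing talent recommendation systems increasingly adopt large language models (LLMs) due to their remarkable language understanding capabilities. However, most prior approaches follow a pointwise paradigm, which requires LLMs to repeatedly process some text and fails to capture the relationships among candidates in the list, resulting in higher token consumption and suboptimal recommendations. Besides, LLMs exhibit position bias and the lost-in-the-middle issue when answering multiple-choice questions and processing multiple long documents. To address these issues, we introduce an implicit strategy to utilize LLM's potential output for the recommendation task and propose L3TR, a novel framework for listwise talent recommendation with LLMs. In this framework, we propose a block attention mechanism and a local positional encoding method to enhance inter-document processing and mitigate the position bias and concurrent token bias issue. We also introduce an ID sampling method for resolving the inconsistency between candidate set sizes in the training phase and the inference phase. We design evaluation methods to detect position bias and token bias and training-free debiasing methods. Extensive experiments on two real-world datasets validated the effectiveness of L3TR, showing consistent improvements over existing baselines.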

[61] CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech

Youssef Saidi,Haroun Elleuch,Fethi Bougares

Main category: cs.CL

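TL;DR: This paper presents CV-18 NER, the first publicly available dataset for end-to-end named entity recognition (NER) from Arabic speech, and compares end-to-end models (Whisper, AraBEST-RQ) with cascaded systems (ASR + text NER) on it; the end-to-end approach performs substantially better, and the paper also analyzes how pretraining strategy and model size affect low-resource Arabic NER.

Details Motivation: Because of its morphological complexity, the absence of short vowels, and scarce annotated resources, Arabic has been under-explored in end-to-end speech NER, and a dedicated dataset and benchmark are needed. Method: Builds CV-18 NER, the first Arabic speech NER dataset (based on Common Voice 18 with the fine-grained Wojood annotation schema), benchmarks end-to-end models (Whisper, AraBEST-RQ) against cascaded systems (ASR + text NER) on it, and analyzes the effect of pretraining strategy (Arabic-specific self-supervision vs. multilingual weak supervision) and model size. Result: End-to-end models substantially outperform the best cascaded system: AraBEST-RQ 300M reaches 37.0% CoER and Whisper-medium reaches 38.0% CVER; Arabic-specific self-supervised pretraining benefits ASR, while multilingual weak supervision transfers better to joint speech-to-entity learning; larger models may be harder to adapt in this low-resource setting. Conclusion: End-to-end methods are better suited to Arabic speech NER; CV-18 NER provides the first open benchmark for the field and supports research on speech NER for low-resource languages. Abstract: End-to-end speech Named Entity Recognition (NER) aims to directly extract entities from speech. Prior work has shown that end-to-end (E2E) approaches can outperform cascaded pipelines for English, French, and Chinese, but Arabic remains under-explored due to its morphological complexity, the absence of short vowels, and limited annotated resources. We introduce CV-18 NER, the first publicly available dataset for NER from Arabic speech, created by augmenting the Arabic Common Voice 18 corpus with manual NER annotations following the fine-grained Wojood schema (21 entity types). We benchmark both pipeline systems (ASR + text NER) and E2E models based on Whisper and AraBEST-RQ. E2E systems substantially outperform the best pipeline configuration on the test set, reaching 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium). Further analysis shows that Arabic-specific self-supervised pretraining yields strong ASR performance, while multilingual weak supervision transfers more effectively to joint speech-to-entity learning, and that larger models may be harder to adapt in this low-resource setting. Our dataset and models are publicly released, providing the first open benchmark for end-to-end named entity recognition from Arabic speech: https://huggingface.co/datasets/Elyadata/CV18-NER.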

[62] No Single Best Model for Diversity: Learning a Router for Sample Diversity

Yuhan Liu,Fangyuan Xu,Vishakh Padmakumar,Daphne Ippolito,Eunsol Choi

Main category: cs.CL

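TL;DR: This paper studies how to generate diverse answers to open-ended questions by drawing on multiple models, proposes the diversity coverage metric, and designs a router that selects the best model for each query, improving answer diversity.

Details Motivation: When an open-ended prompt permits a large number of valid answers, generating them comprehensively is the first step toward satisfying diverse user needs; no single existing model performs best on every prompt, so a mechanism is needed to select the best model for each prompt. Method: Proposes diversity coverage as an evaluation metric, measuring the total quality score of the unique answers in the predicted answer set relative to the best possible answer set of the same size; evaluates 18 LLMs on diverse response generation; and, based on the findings, designs and trains a router that selects the most suitable generation model for each query. Result: On NB-Wildchat the router reaches 26.3% diversity coverage, beating the single best model baseline (23.8%); generalization is also shown on the out-of-domain NB-Curated dataset and under different prompting strategies. Conclusion: No single model universally generates highly diverse answers, but for each prompt there is a best model; dynamically selecting models with a router improves diversity coverage and offers a new approach to generating comprehensive answers with a suite of models. Abstract: When posed with prompts that permit a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce diversity coverage, a metric that measures the total quality scores assigned to each unique answer in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, for each prompt, there exists a model that outperforms all other models significantly at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-Wildchat, our trained router outperforms the single best model baseline (26.3% vs 23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as different answer-generation prompting strategies. Our work lays the foundation for studying generating comprehensive answers when we have access to a suite of models.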
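
The diversity coverage metric defined in this abstract can be written down directly from its description: sum the quality scores of the unique predicted answers and divide by the best total achievable with the same number of answers. The sketch assumes that duplicates are identified by exact match on a canonical answer key and that a quality score is available for every valid answer; the paper may use a different equivalence test and scoring procedure, and `diversity_coverage` is a hypothetical name.

```python
def diversity_coverage(predicted, quality):
    """Quality captured by unique predictions relative to the best same-size set.

    predicted: list of canonical answer keys (duplicates allowed).
    quality: dict mapping every valid answer key to a non-negative score.
    Answers missing from `quality` are treated as invalid and score 0.
    """
    k = len(predicted)
    if k == 0:
        return 0.0
    # Duplicates earn credit only once.
    achieved = sum(quality.get(answer, 0.0) for answer in set(predicted))
    # Best possible total with k distinct answers.
    best = sum(sorted(quality.values(), reverse=True)[:k])
    return achieved / best if best > 0 else 0.0
```

Under these assumptions, a model that repeats its strongest answer scores lower than one that returns the same number of distinct, slightly weaker answers, which is the behavior the metric is meant to reward.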

[63] Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Daiwei Chen,Zhoutong Fu,Chengming Jiang,Haichao Zhang,Ran Zhou,Tan Wang,Chunnan Yao,Guoyao Li,Rui Cai,Yihan Cao,Ruijie Jiang,Fedor Borisyuk,Jianqiang Shen,Jingwei Wu,Ramya Korlakai Vinayak

Main category: cs.CL

TL;DR: This paper finds that mean initialization of newly added vocabulary tokens in language models causes semantic collapse and proposes GTI, a linguistically grounded initialization method that outperforms conventional methods in most generative recommendation settings.

Details Motivation: When extending a language model with new vocabulary, the standard practice is mean initialization, but this can collapse the new token embeddings into a degenerate subspace, damaging fine-grained semantic distinctions and hurting downstream performance. Method: Proposes Grounded Token Initialization (GTI), which, before fine-tuning, uses paired linguistic supervision to map new tokens to distinct, semantically meaningful locations in the pretrained embedding space, providing lightweight semantic grounding. Result: GTI outperforms mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks (including industry-scale and public datasets); analysis shows that its embeddings have richer inter-token structure that persists through fine-tuning. Conclusion: Initialization quality is a key bottleneck when extending a language model's vocabulary; linguistically grounded initialization (GTI) better enables the model to leverage its pretrained knowledge in new-token domains. Abstract: Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets.
Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.

cs.CV

[64] DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code Generation

Xinhao Huang,Jinke Yu,Wenhao Xu,Zeyi Wen,Ying Zhou,Junzhuo Liu,Junhao Ji,Zulong Chen

Main category: cs.CV

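TL;DR: This paper proposes the DOne framework, which improves design-to-code generation quality by decoupling structure understanding from element rendering, and introduces the new HiFi2Code benchmark for evaluation.

Details Motivation: Existing vision language models suffer from a "holistic bottleneck" in design-to-code generation: they struggle to reconcile high-level structural hierarchy with fine-grained visual detail, resulting in layout distortions or generic placeholders. Method: Proposes DOne, an end-to-end framework comprising (1) a learned layout segmentation module to decompose complex designs; (2) a specialized hybrid element retriever to handle the extreme aspect ratios and densities of UI components; and (3) a schema-guided generation paradigm that bridges layout and code; also builds the high-complexity benchmark HiFi2Code. Result: On HiFi2Code, DOne outperforms existing methods in both high-level visual similarity (for example, over 10% in GPT Score) and fine-grained element alignment; human evaluation shows a threefold productivity gain with higher visual fidelity. Conclusion: DOne eases the difficulty VLMs have in reconciling structure and detail in design-to-code tasks, improving generation quality and practicality. Abstract: While Vision Language Models (VLMs) have shown promise in Design-to-Code generation, they suffer from a "holistic bottleneck", failing to reconcile high-level structural hierarchy with fine-grained visual details, often resulting in layout distortions or generic placeholders. To bridge this gap, we propose DOne, an end-to-end framework that decouples structure understanding from element rendering. DOne introduces (1) a learned layout segmentation module to decompose complex designs, avoiding the limitations of heuristic cropping; (2) a specialized hybrid element retriever to handle the extreme aspect ratios and densities of UI components; and (3) a schema-guided generation paradigm that bridges layout and code. To rigorously assess performance, we introduce HiFi2Code, a benchmark featuring significantly higher layout complexity than existing datasets. Extensive evaluations on HiFi2Code demonstrate that DOne outperforms existing methods in both high-level visual similarity (e.g., over 10% in GPT Score) and fine-grained element alignment. Human evaluations confirm a threefold productivity gain with higher visual fidelity.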

[65] CLPIPS: A Personalized Metric for AI-Generated Image Similarity

Khoi Trinh,Jay Rothenberger,Scott Seidenberger,Dimitrios Diochnos,Anindya Maiti

Main category: cs.CV

TL;DR: This paper proposes CLPIPS, a lightweight, human-feedback-driven customized image similarity metric built on LPIPS, which fine-tunes only the LPIPS layer combination weights to improve agreement with human perceptual judgments and strengthen perceptual alignment in human-in-the-loop text-to-image workflows.

Details Motivation: Existing image similarity metrics (such as LPIPS and CLIP) are objective but often disagree with subjective human judgments, particularly in context-specific or user-driven tasks; a customizable, lightweight similarity metric that adapts to human perception is needed. Method: Proposes Customized Learned Perceptual Image Patch Similarity (CLPIPS): on a dataset of human rankings of generated images, only the LPIPS layer combination weights are fine-tuned with a margin ranking loss. Result: CLPIPS outperforms baseline LPIPS on both Spearman rank correlation and the intraclass correlation coefficient (ICC), improving agreement with human rankings; this shows that a small amount of human annotation can improve perceptual alignment. Conclusion: Lightweight, human-augmented fine-tuning can meaningfully improve the alignment of similarity metrics with human perception, making ISMs more reliable adaptive components in human-in-the-loop text-to-image workflows. Abstract: Iterative prompt refinement is central to reproducing target images with text-to-image generative models. Previous studies have incorporated image similarity metrics (ISMs) as additional feedback to human users. Existing ISMs such as LPIPS and CLIP provide objective measures of image likeness but often fail to align with human judgments, particularly in context-specific or user-driven tasks. In this paper, we introduce Customized Learned Perceptual Image Patch Similarity (CLPIPS), a customized extension of LPIPS that adapts a metric's notion of similarity directly to human judgments. We aim to explore whether lightweight, human-augmented fine-tuning can meaningfully improve perceptual alignment, positioning similarity metrics as adaptive components for human-in-the-loop workflows with text-to-image tools. We evaluate CLPIPS on a human-subject dataset in which participants iteratively regenerate target images and rank generated outputs by perceived similarity. Using margin ranking loss on human-ranked image pairs, we fine-tune only the LPIPS layer combination weights and assess alignment via Spearman rank correlation and Intraclass Correlation Coefficient. Our results show that CLPIPS achieves stronger correlation and agreement with human judgments than baseline LPIPS. Rather than optimizing absolute metric performance, our work emphasizes improving alignment consistency between metric predictions and human ranks, demonstrating that even limited human-specific fine-tuning can meaningfully enhance perceptual alignment in human-in-the-loop text-to-image workflows.
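
The training signal described here, a margin ranking loss over human-ranked pairs applied only to layer combination weights, can be illustrated with a toy model. The sketch treats the metric as a weighted sum of precomputed per-layer distances, which is a deliberate simplification of LPIPS, and takes plain gradient steps in pure Python; the function names, learning rate, margin, and non-negativity clamp are assumptions made for illustration.

```python
def weighted_distance(weights, layer_dists):
    """Toy metric: weighted sum of per-layer distances to the target image."""
    return sum(w * d for w, d in zip(weights, layer_dists))


def margin_ranking_loss(weights, closer, farther, margin=0.1):
    """Penalize pairs where the human-preferred image is not closer by a margin.

    closer / farther: per-layer distances of the image humans ranked as more
    similar / less similar to the target.
    """
    gap = weighted_distance(weights, closer) - weighted_distance(weights, farther)
    return max(0.0, margin + gap)


def sgd_step(weights, pairs, lr=0.1, margin=0.1):
    """One gradient step on the mean loss, updating only the layer weights."""
    grad = [0.0] * len(weights)
    for closer, farther in pairs:
        if margin_ranking_loss(weights, closer, farther, margin) > 0:
            for i in range(len(weights)):
                grad[i] += closer[i] - farther[i]
    # Clamp at zero so the weighted sum remains a valid distance.
    return [max(0.0, w - lr * g / len(pairs)) for w, g in zip(weights, grad)]
```

In this toy setting, a pair that the initial weights rank against the human judgment pushes weight away from the layer that disagrees with the humans and toward the layer that agrees, until the margin is satisfied.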

[66] Camouflage-aware Image-Text Retrieval via Expert Collaboration

Yao Jiang,Zhongkuan Mao,Xuan Wu,Keren Fu,Qijun Zhao

Main category: cs.CV

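TL;DR: This paper proposes a new task, camouflage-aware image-text retrieval (CA-ITR), and builds CamoIT, the first dataset dedicated to it. To address the alignment difficulties caused by camouflaged objects and complex scenes, it designs a camouflage-expert collaborative network (CECNet) with a dual-branch visual encoder and a confidence-conditioned graph attention (C²GA) mechanism, substantially improving cross-modal retrieval performance.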

Details Motivation: In existing camouflaged scene understanding (CSU) research, image-text cross-modal alignment is not robust enough, which limits deeper understanding of camouflaged scenes and their applications; a new image-text retrieval task and method that model camouflage properties are needed. Method: Builds the CamoIT dataset with about 10.5K samples and multi-granularity textual annotations; proposes CECNet, which has a holistic representation branch and a branch that injects camouflaged-object representations, and introduces a confidence-conditioned graph attention (C²GA) mechanism to fuse the complementary information of the two branches. Result: On CamoIT, CECNet improves overall CA-ITR accuracy by about 29% and surpasses seven representative retrieval models; experiments confirm that camouflage properties and complex content are the main challenges for current methods. Conclusion: CA-ITR is a challenging new task. By explicitly modeling camouflage properties and coordinating its branches, CECNet improves cross-modal alignment and offers a new approach and practical tools (open dataset and code) for camouflaged scene understanding. Abstract: Camouflaged scene understanding (CSU) has attracted significant attention due to its broad practical implications. However, in this field, robust image-text cross-modal alignment remains under-explored, hindering deeper understanding of camouflaged scenarios and their related applications. To this end, we focus on the typical image-text retrieval task, and formulate a new task dubbed "camouflage-aware image-text retrieval" (CA-ITR). We first construct a dedicated camouflage image-text retrieval dataset (CamoIT), comprising ~10.5K samples with multi-granularity textual annotations. Benchmark experiments conducted on CamoIT reveal the underlying challenges of CA-ITR for existing cutting-edge retrieval techniques, which are mainly caused by objects' camouflage properties as well as complex image content. As a solution, we propose a camouflage-expert collaborative network (CECNet), which features a dual-branch visual encoder: one branch captures holistic image representations, while the other incorporates a dedicated model to inject representations of camouflaged objects. A novel confidence-conditioned graph attention (C²GA) mechanism is incorporated to exploit the complementarity across branches. Comparative experiments show that CECNet achieves a ~29% overall CA-ITR accuracy boost, surpassing seven representative retrieval models. The dataset and code will be available at https://github.com/jiangyao-scu/CA-ITR.

[67] Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models

Marco Morini,Sara Sarto,Marcella Cornia,Lorenzo Baraldi

Main category: cs.CV

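TL;DR: This paper proposes Look Twice (LoT), a training-free framework that improves, at inference time, the ability of multimodal large language models (MLLMs) to reason jointly over visual content and external knowledge. It uses attention to identify and highlight key visual regions and textual evidence, markedly improving knowledge-intensive visual question answering.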

Details Motivation: Existing MLLMs struggle to identify and integrate the relevant visual cues and externally retrieved text in knowledge-intensive visual question answering, especially when the text is noisy or only partially relevant and when fine-grained visual localization is required. Method: Proposes LoT, a training-free inference-time framework that uses the attention patterns of a pretrained MLLM to estimate query-relevant visual regions and text segments, then highlights this evidence with lightweight prompt-level markers so that the model re-attends to the key information while generating the answer. Result: Consistently outperforms zero-shot MLLMs on multiple knowledge-based VQA benchmarks; visual evidence highlighting alone also improves performance on vision-centric tasks without textual context and on hallucination evaluations. Conclusion: LoT is an efficient, general inference-time enhancement that requires no architectural changes or additional training and improves the accuracy and robustness of MLLMs in multimodal evidence integration. Abstract: Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model's attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.

[68] Sparse Spectral LoRA: Routed Experts for Medical VLMs

Omid Nejati Manzari,Hojat Asgariandehkordi,Taha Koleilat,Yiming Xiao,Hassan Rivaz

Main category: cs.CV

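TL;DR: This paper proposes MedQwen, a parameter-efficient medical vision-language model. Using a spectrally routed Mixture-of-Experts (MoE) structure and a theoretically grounded scaling rule, and without changing the base architecture, it markedly improves cross-dataset robustness and continual learning, sharply reduces the number of trainable parameters, and mitigates catastrophic forgetting.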

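Details Motivation: Large vision-language models perform well on general benchmarks but lack robustness in medical imaging. The main obstacles are cross-dataset interference caused by heterogeneous supervision, sensitivity to the data regime, and catastrophic forgetting when clinical data and tasks arrive as a stream. Method: Proposes MedQwen, which uses a spectrally routed MoE structure; initializes each expert from non-overlapping SVD segments; introduces a residual compensation and scaling scheme to stabilize expert specialization and keep routing consistent under distribution shift; and designs a theoretically grounded scaling rule for low-rank updates so that they approximate a full-rank, fully fine-tuned MoE. Result: Validated on 23 medical datasets covering visual question answering, report generation, radiology classification, and hallucination mitigation: zero-shot classification approaches full fine-tuning with only 1/339 of the trainable parameters, and forgetting in sequential learning is about 5%, whereas strong baselines degrade by more than 20-50%. Conclusion: Through structural innovation and a theory-driven parameter-efficient fine-tuning strategy, MedQwen addresses the robustness and scalability problems of medical VLMs under heterogeneous supervision and continual learning, offering a practical new paradigm for clinical deployment. Abstract: Large vision-language models (VLMs) excel on general benchmarks but often lack robustness in medical imaging, where heterogeneous supervision induces cross-dataset interference and sensitivity to data regime (i.e., how the supervisory signals are mixed). In realistic clinical workflows, data and tasks arrive sequentially, so naive continual training further leads to catastrophic forgetting. To address these challenges, we propose MedQwen, a parameter-efficient medical VLM that couples a spectrally routed Mixture-of-Experts (MoE) with a theoretically grounded scaling rule that aligns low-rank updates with a full-rank, fully fine-tuned MoE, without changing the base architecture. Concretely, we initialize each expert from non-overlapping singular value decomposition (SVD) segments of the pretrained weight and introduce a residual compensation and scaling scheme to enable stable expert specialization and consistent routing under distribution shift. Across 23 medical datasets covering visual question answering, report generation, radiology classification, and hallucination mitigation, MedQwen achieves strong, reliable performance: it approaches full fine-tuning on zero-shot classification with 339× fewer trainable parameters, and reduces sequential forgetting to ~5% where strong baselines degrade by >20-50%.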

[69] ViTs for Action Classification in Videos: An Approach to Risky Tackle Detection in American Football Practice Videos

Syed Ahsan Masud Zaidi,William Hsu,Scott Dietrich

Main category: cs.CV

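TL;DR: This paper proposes a method based on a Vision Transformer and imbalance-aware training for detecting risky tackles in American football practice videos, and builds a larger dataset of 733 annotated clips. It outperforms the previous baseline on both risky-class recall and F1.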

Details Motivation: Identifying hazardous actions in contact sports early enables timely intervention and improves athlete safety. Method: Uses a Vision Transformer-based model with a training strategy that addresses class imbalance to classify clips from a newly built, larger dataset of risky tackles in American football (733 single-athlete-dummy tackle clips, temporally localized around first contact and labeled with the SATT-3 strike-zone component). Result: Under cross-validation, obtains a risky-class recall (Risky Recall) of 0.67 and a risky-class F1 (Risky F1) of 0.59, compared with the previous baseline on a smaller dataset (Risky Recall 0.58, Risky F1 0.56); risky recall improves by more than 8 percentage points. Conclusion: Vision Transformers combined with imbalance-aware learning can detect rare but safety-critical actions, offering a viable path toward coach-facing injury prevention tools. Abstract: Early identification of hazardous actions in contact sports enables timely intervention and improves player safety. We present a method for detecting risky tackles in American football practice videos and introduce a substantially expanded dataset for this task. Our work contains 733 single-athlete-dummy tackle clips, each temporally localized around the first point of contact and labeled with a strike zone component of the standardized Assessment for Tackling Technique (SATT-3), extending prior work that reported 178 annotated videos. Using a Vision Transformer-based model with imbalance-aware training, we obtain a risky recall of 0.67 and a Risky F1 of 0.59 under cross-validation. Relative to the previous baseline on a smaller subset (risky recall of 0.58; Risky F1 of 0.56), our approach improves risky recall by more than 8 percentage points on a much larger dataset. These results indicate that Vision Transformer-based video analysis, coupled with careful handling of class imbalance, can reliably detect rare but safety-critical tackling patterns, offering a practical pathway toward coach-centered injury prevention tools.
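
For readers less familiar with the reported metrics, the following minimal sketch shows how recall and F1 for the risky class follow from confusion counts, and how the gap between two recall values is expressed in percentage points. The confusion counts are hypothetical and are not taken from the paper; only the recall values 0.67 and 0.58 come from the abstract.

```python
def recall_and_f1(tp, fp, fn):
    """Recall and F1 for the positive ("risky") class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


def percentage_point_gain(new, old):
    """Absolute difference between two rates, expressed in percentage points."""
    return (new - old) * 100


# Hypothetical counts: 2 risky clips found, 1 false alarm, 1 risky clip missed.
recall, f1 = recall_and_f1(tp=2, fp=1, fn=1)

# Recall values reported in the abstract: 0.67 (this work) vs. 0.58 (baseline).
gain = percentage_point_gain(0.67, 0.58)
```

The gain evaluates to roughly 9 percentage points, consistent with the abstract's "more than 8".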

[70] Human Pose Estimation in Trampoline Gymnastics: Improving Performance Using a New Synthetic Dataset

Léa Drolet-Roy,Victor Nogues,Sylvain Gaudet,Eve Charbonneau,Mickaël Begon,Lama Séoud

Main category: cs.CV

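TL;DR: This paper proposes fine-tuning a ViTPose model on a synthetic dataset (STP) to improve pose estimation for the extreme poses of trampoline gymnastics, markedly improving both 2D and 3D pose estimation.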

Details Motivation: Existing pose estimation models perform poorly in settings such as trampoline gymnastics, which involve extreme body poses and unusual viewpoints. Method: Builds a synthetic trampoline pose dataset (STP) from motion capture data by fitting noisy motion capture recordings to a parametric human model and rendering realistic multi-view images; fine-tunes a ViTPose model on this data and tests it on real multi-view trampoline images. Result: 2D pose estimation reaches state-of-the-art results on this challenging data; 3D MPJPE drops by 12.5 mm (a 19.6% relative improvement). Conclusion: Fine-tuning on high-quality synthetic data can close the performance gap between common and extreme poses, improving the robustness and accuracy of pose estimation in complex sports settings. Abstract: Trampoline gymnastics involves extreme human poses and uncommon viewpoints, on which state-of-the-art pose estimation models tend to under-perform. We demonstrate that this problem can be addressed by fine-tuning a pose estimation model on a dataset of synthetic trampoline poses (STP). STP is generated from motion capture recordings of trampoline routines. We develop a pipeline to fit noisy motion capture data to a parametric human model, then generate realistic multi-view images. We use this data to fine-tune a ViTPose model, and test it on real multi-view trampoline images. The resulting model exhibits accuracy improvements in 2D, which translate to improved 3D triangulation. In 2D, we obtain state-of-the-art results on such challenging data, bridging the performance gap between common and extreme poses. In 3D, we reduce the MPJPE by 12.5 mm with our best model, which represents an improvement of 19.6% compared to the pretrained ViTPose model.
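
MPJPE (mean per-joint position error) is the 3D metric quoted above. The sketch below is a plain definition of the metric, not the authors' evaluation code, and the joint coordinates are made up. It also back-computes the baseline error implied by the two reported figures; that derived value is an inference from the abstract, not a number the paper states.

```python
import math


def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints (same units as input)."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def implied_baseline(absolute_reduction, relative_reduction):
    """Baseline error implied by an absolute and a relative reduction of the same quantity."""
    return absolute_reduction / relative_reduction


# Two hypothetical joints, coordinates in millimetres.
error = mpjpe([(0, 0, 0), (3, 4, 0)], [(0, 0, 0), (0, 0, 0)])

# 12.5 mm being a 19.6% improvement implies a pretrained-model MPJPE of about 63.8 mm.
baseline_mm = implied_baseline(12.5, 0.196)
```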

[71] Regularizing Attention Scores with Bootstrapping

Neo Christopher Chung,Maxim Laletin

Main category: cs.CV

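TL;DR: This paper proposes a bootstrapping-based attention regularization method that quantifies the uncertainty of attention scores in ViTs, removing spurious attention caused by noise and improving the sparsity and interpretability of attention maps.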

Details Motivation: Attention scores in ViTs are almost always non-zero, which makes attention maps noisy and diffuse and limits the interpretability of the model's decision process. Method: Places attention scores in a statistical framework, builds a baseline distribution of attention scores by bootstrap-resampling the input features, and uses it to estimate the significance and posterior probability of each score, yielding regularized attention. Result: On natural and medical images, the method markedly reduces spurious attention and greatly improves the shrinkage and sparsity of attention maps; quantitative evaluations on simulated and real-world datasets support its effectiveness. Conclusion: Bootstrapping is a practical and effective attention regularization tool that improves the interpretability of ViT models when attention scores are used as explanations. Abstract: Vision transformers (ViT) rely on the attention mechanism to weigh input features, and therefore attention scores have naturally been considered as explanations for their decision-making process. However, attention scores are almost always non-zero, resulting in noisy and diffused attention maps and limiting interpretability. Can we quantify uncertainty measures of attention scores and obtain regularized attention scores? To this end, we consider attention scores of ViT in a statistical framework where independent noise would lead to insignificant yet non-zero scores. Leveraging statistical learning techniques, we introduce bootstrapping for attention scores, which generates a baseline distribution of attention scores by resampling input features. Such a bootstrap distribution is then used to estimate significances and posterior probabilities of attention scores. In natural and medical images, the proposed Attention Regularization approach demonstrates a straightforward removal of spurious attention arising from noise, drastically improving shrinkage and sparsity. Quantitative evaluations are conducted using both simulation and real-world datasets. Our study highlights bootstrapping as a practical regularization tool when using attention scores as explanations for ViT. Code available: https://github.com/ncchung/AttentionRegularization
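
The idea of testing attention scores against a bootstrap baseline can be sketched as follows. This is a simplified illustration, not the released AttentionRegularization code: it assumes the baseline scores have already been produced by resampling input features, uses a plain empirical p-value with add-one smoothing, and zeroes out scores above a significance level `alpha`. The function names, the smoothing, and the threshold are all assumptions.

```python
def empirical_p_values(observed, baseline):
    """Share of baseline scores at least as large as each observed score (add-one smoothed)."""
    n = len(baseline)
    return [(1 + sum(b >= o for b in baseline)) / (n + 1) for o in observed]


def regularize_attention(observed, baseline, alpha=0.05):
    """Keep a score only if it is significant against the bootstrap baseline; else set it to 0."""
    p_values = empirical_p_values(observed, baseline)
    return [o if p <= alpha else 0.0 for o, p in zip(observed, p_values)]


# Hypothetical baseline of 100 bootstrap scores spread evenly over [0, 0.99].
baseline_scores = [i / 100 for i in range(100)]

# One score clearly above the baseline, one in the middle of it.
sparse = regularize_attention([1.0, 0.5], baseline_scores)
```

Only the first score survives, which is the shrinkage-to-zero behaviour the abstract describes for noise-level attention.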

[72] Perceptual misalignment of texture representations in convolutional neural networks

Ludovica de Paolis,Fabio Anselmi,Alessio Ansuini,Eugenio Piasini

Main category: cs.CV

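TL;DR: This paper examines how well convolutional neural networks (CNNs) model texture perception. It finds no correlation between a CNN's quality as a model of the visual system (e.g., Brain-Score) and how well its representations capture human texture perception, suggesting that texture perception may rely on mechanisms not covered by CNNs, especially those trained for object recognition, such as the integration of contextual information.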

Details Motivation: To investigate whether the feature-correlation-based texture representations of CNNs used as models of the visual system naturally align with human texture perception, and in particular whether better models of the visual system have more human-like texture representations. Method: Compares the texture representations formed by linear correlations (Gram matrices) between the nonlinear features extracted by a range of CNNs, evaluates how well they match human texture perception, and contrasts this with the models' performance as models of the visual system on Brain-Score. Result: A CNN's standing on conventional visual-system modeling measures such as Brain-Score shows no significant association with how well its texture representations agree with human perception. Conclusion: Human texture perception may rely on mechanisms that current mainstream CNNs, especially those trained for object recognition, fail to model, such as the integration of contextual information, so new modeling approaches are needed. Abstract: Mathematical modeling of visual textures traces back to Julesz's intuition that texture perception in humans is based on local correlations between image features. An influential approach for texture analysis and generation generalizes this notion to linear correlations between the nonlinear features computed by convolutional neural networks (CNNs), compiled into Gram matrices. Given that CNNs are often used as models for the visual system, it is natural to ask whether such "texture representations" spontaneously align with the textures' perceptual content, and in particular whether those CNNs that are regarded as better models for the visual system also possess more human-like texture representations. Here we compare the perceptual content captured by feature correlations computed for a diverse pool of CNNs, and we compare it to the models' perceptual alignment with the mammalian visual system as measured by Brain-Score. Surprisingly, we find that there is no connection between conventional measures of CNN quality as a model of the visual system and its alignment with human texture perception. We conclude that texture perception involves mechanisms that are distinct from those that are commonly modeled using approaches based on CNNs trained on object recognition, possibly depending on the integration of contextual information.
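
The Gram-matrix texture representation mentioned in the abstract can be written down in a few lines. The sketch assumes a feature map flattened to one list of activations per channel and averages over spatial positions; the normalization choice and the toy values are assumptions, and no particular CNN is implied.

```python
def gram_matrix(features):
    """Channel-by-channel inner products of a feature map, averaged over spatial positions.

    `features` is a list of C channels, each a flat list of N activations.
    """
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]


# Toy feature map with 2 channels and 3 spatial positions.
fmap = [[1, 2, 3], [4, 5, 6]]
# The same activations with the spatial positions permuted identically in both channels.
fmap_shuffled = [[3, 1, 2], [6, 4, 5]]

g = gram_matrix(fmap)
g_shuffled = gram_matrix(fmap_shuffled)
```

Both matrices are identical: the representation keeps which features co-occur but discards where they occur, which is why it is used as a summary of texture rather than of object layout.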

[73] IGLOSS: Image Generation for Lidar Open-vocabulary Semantic Segmentation

Nermin Samet,Gilles Puy,Renaud Marlet

Main category: cs.CV

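TL;DR: This paper proposes a new method for zero-shot open-vocabulary semantic segmentation (OVSS) of 3D lidar data. It builds prototypes by generating images from text and uses a 3D network distilled from a 2D vision foundation model to match point-cloud features with prototype image features, reaching state-of-the-art performance on nuScenes and SemanticKITTI.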

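Details Motivation: To circumvent the image-text modality gap inherent in methods based on vision-language models (VLMs) such as CLIP, and to explore an alternative better suited to zero-shot semantic segmentation of 3D lidar data. Method: Uses a text-to-image generation model to create prototype images for each class; distills a 2D vision foundation model (VFM) into a 3D network; performs open-vocabulary segmentation by matching 3D point features with the 2D features of the prototype images. Result: Achieves state-of-the-art zero-shot open-vocabulary semantic segmentation on the nuScenes and SemanticKITTI datasets. Conclusion: Text-generated prototype images bridge the modality gap, and combined with a VFM-distilled 3D network they enable efficient zero-shot open-vocabulary semantic segmentation of 3D point clouds. Abstract: This paper presents a new method for the zero-shot open-vocabulary semantic segmentation (OVSS) of 3D automotive lidar data. To circumvent the recognized image-text modality gap that is intrinsic to approaches based on Vision Language Models (VLMs) such as CLIP, our method relies instead on image generation from text, to create prototype images. Given a 3D network distilled from a 2D Vision Foundation Model (VFM), we then label a point cloud by matching 3D point features with 2D image features of these prototypes. Our method is state-of-the-art for OVSS on nuScenes and SemanticKITTI. Code, pre-trained models, and generated images are available at https://github.com/valeoai/IGLOSS.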
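
The matching step, labeling each point by comparing its feature with prototype image features, can be sketched as a nearest-prototype lookup. This is a deliberately reduced illustration, not the IGLOSS code: it assumes one feature vector per class and cosine similarity as the matching score, whereas the actual method may aggregate several prototype images per class and use a different criterion. All feature values are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def label_points(point_features, prototypes):
    """Assign each point the class whose prototype feature is most similar."""
    return [max(prototypes, key=lambda cls: cosine(f, prototypes[cls]))
            for f in point_features]


# Hypothetical 2-D features: class prototypes (from generated images) and lidar points.
prototypes = {"car": [1.0, 0.0], "road": [0.0, 1.0]}
points = [[0.9, 0.1], [0.2, 0.8]]
labels = label_points(points, prototypes)
```

Because the class set is just the keys of `prototypes`, adding a new class only requires a new prototype feature, which is what makes the scheme open-vocabulary.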

[74] AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction

Aiza Maksutova,Lalithkumar Seenivasan,Hao Ding,Jiru Xu,Chenhao Yu,Chenyan Jing,Yiqing Shen,Mathias Unberath

Main category: cs.CV

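TL;DR: This paper proposes AffordTissue, a framework that predicts tool-action-specific tissue affordance regions (dense heatmaps) during cholecystectomy. It combines a temporal vision encoder, language conditioning, and a DiT-style decoder, and clearly outperforms existing vision-language models on the authors' own benchmark, the first for tissue affordance (15,638 video clips).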

Details Motivation: Existing surgical automation methods face two challenges in clinical deployment: it is hard to predict where instruments will interact with tissue surfaces, and there is no explicit conditioning input to enforce tool-action-specific safe interaction regions. Method: Proposes the multimodal framework AffordTissue, comprising (1) a temporal vision encoder that captures tool motion and tissue dynamics across multiple viewpoints, (2) a language conditioning module that supports generalization across instrument-action pairs, and (3) a DiT-style decoder for dense affordance prediction; also builds the first tissue affordance benchmark (103 cholecystectomy procedures, six tool-action pairs). Result: On the new benchmark, AffordTissue reaches an average symmetric surface distance (ASSD) of 20.6 px, clearly better than Molmo-VLM (60.2 px), showing that a task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction. Conclusion: By predicting tool-action-specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for surgical automation and supports policy-level safety guidance (for example, steering toward appropriate tissue regions and stopping early when an instrument leaves the predicted safe zone). Abstract: Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these have demonstrated success in table-top experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability on where instruments will interact on tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions. Addressing this gap, we introduce AffordTissue, a multimodal framework for predicting tool-action-specific tissue affordance regions as dense heatmaps during cholecystectomy. Our approach combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning enabling generalization across diverse instrument-action pairs, and a DiT-style decoder for dense affordance prediction. We establish the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks: dissection, grasping, clipping, and cutting. Experiments demonstrate substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing that our task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction.
By predicting tool-action specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially unlocking explicit policy guidance toward appropriate tissue regions and early safe stop when instruments deviate outside predicted safe zones.

[75] GRAZE: Grounded Refinement and Motion-Aware Zero-Shot Event Localization

Syed Ahsan Masud Zaidi,Lior Shamir,William Hsu,Scott Dietrich,Talha Zaidi

Main category: cs.CV

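TL;DR: This paper proposes GRAZE, a training-free pipeline that, without labeled data, localizes the frame in which a player first touches the tackle dummy (First Point of Contact, FPOC) in American football practice videos. It combines Grounding DINO, motion-aware temporal reasoning, and SAM2 pixel-level verification, and on 738 videos localizes FPOC within ±10 frames on 77.5% of clips.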

Details Motivation: American football practice videos are long and untrimmed, and the key contact event occupies only a very short window. Reliable biomechanical analysis depends on precise spatiotemporal localization of both the interacting entities and the onset of contact, which existing methods struggle with under camera motion, scene clutter, multiple similarly equipped athletes, and rapid pose changes around impact. Method: GRAZE is a training-free FPOC localization pipeline: it first uses Grounding DINO to discover candidate player-dummy interactions, then refines the candidates with motion-aware temporal reasoning, and finally uses SAM2 for pixel-level contact verification (rather than relying on detection confidence), decoupling candidate discovery from contact confirmation. Result: On 738 real practice videos, GRAZE produces valid outputs for 97.4% of clips and localizes FPOC within ±10 frames on 77.5% of all clips and within ±20 frames on 82.7% of all clips. Conclusion: Contact onset can be localized at frame-level accuracy in real practice footage without task-specific training, supporting the feasibility and robustness of unsupervised or weakly supervised paradigms for sports biomechanics analysis. Abstract: American football practice generates video at scale, yet the interaction of interest occupies only a brief window of each long, untrimmed clip. Reliable biomechanical analysis, therefore, depends on spatiotemporal localization that identifies both the interacting entities and the onset of contact. We study First Point of Contact (FPOC), defined as the first frame in which a player physically touches a tackle dummy, in unconstrained practice footage with camera motion, clutter, multiple similarly equipped athletes, and rapid pose changes around impact. We present GRAZE, a training-free pipeline for FPOC localization that requires no labeled tackle-contact examples. GRAZE uses Grounding DINO to discover candidate player-dummy interactions, refines them with motion-aware temporal reasoning, and uses SAM2 as an explicit pixel-level verifier of contact rather than relying on detection confidence alone. This separation between candidate discovery and contact confirmation makes the approach robust to cluttered scenes and unstable grounding near impact. On 738 tackle-practice videos, GRAZE produces valid outputs for 97.4% of clips and localizes FPOC within ±10 frames on 77.5% of all clips and within ±20 frames on 82.7% of all clips. These results show that frame-accurate contact onset localization in real-world practice footage is feasible without task-specific training.
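
The "within ±k frames on X% of all clips" figures are tolerance-based hit rates. The sketch below shows one plausible way to compute them, under the assumption (consistent with the phrase "of all clips") that a clip with no valid output counts as a miss. The frame indices are invented and the function name is hypothetical.

```python
def within_tolerance_rate(predicted, ground_truth, tolerance):
    """Share of all clips whose predicted contact frame is within `tolerance` frames.

    A prediction of None (no valid output) is counted as a miss.
    """
    hits = sum(1 for p, t in zip(predicted, ground_truth)
               if p is not None and abs(p - t) <= tolerance)
    return hits / len(ground_truth)


# Hypothetical predicted and annotated first-contact frames for four clips.
pred = [100, 130, None, 58]
true = [105, 100, 40, 60]

rate_10 = within_tolerance_rate(pred, true, tolerance=10)
rate_30 = within_tolerance_rate(pred, true, tolerance=30)
```

Widening the tolerance can only keep or raise the rate, matching the ordering of the ±10 and ±20 results in the abstract.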

[76] LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding

Fusang Wang,Nathan Piasco,Moussab Bennehar,Luis Roldão,Dzmitry Tsishkou,Fabien Moutarde

Main category: cs.CV

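TL;DR: This paper proposes a new framework based on Sparse Voxel Rasterization (SVRaster) to address the spatial and semantic ambiguity that overlapping Gaussians and mask pooling introduce in open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). By regularizing SVRaster with monocular depth and normal priors and exploiting the dense alignment properties of the AM-RADIO model, it achieves deterministic, confidence-aware feature registration and preserves fine-grained semantics, reaching state-of-the-art performance on open-vocabulary 3D object retrieval and point cloud understanding.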

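Details Motivation: Existing open-vocabulary 3D scene understanding methods based on 3D Gaussian Splatting (3DGS) have two major flaws: unstructured, overlapping Gaussians cause spatial ambiguity and require probabilistic feature registration, and pooling over object-level masks causes multi-level semantic ambiguity that dilutes fine-grained detail. Method: Proposes a structured, disjoint geometry representation based on Sparse Voxel Rasterization (SVRaster); regularizes SVRaster with monocular depth and normal priors to build a stable geometric foundation; performs deterministic, confidence-aware feature registration; and exploits the dense alignment properties of the AM-RADIO foundation model to avoid the overhead of hierarchical training and suppress semantic bleeding. Result: Reaches state-of-the-art performance on open-vocabulary 3D object retrieval and point cloud understanding benchmarks, with a particularly clear advantage over registration-based methods on fine-grained queries. Conclusion: SVRaster offers a more robust, structured 3D geometry representation; combined with the alignment capability of AM-RADIO, it overcomes the spatial and semantic ambiguity bottleneck of 3DGS in open-vocabulary understanding and opens a new path for fine-grained 3D vision-language understanding. Abstract: Recent advancements in open-vocabulary 3D scene understanding heavily rely on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, we identify two critical limitations in these approaches: the spatial ambiguity arising from unstructured, overlapping Gaussians, which necessitates probabilistic feature registration, and the multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we present a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. By regularizing SVRaster with monocular depth and normal priors, we establish a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS. Furthermore, we resolve multi-level ambiguity by exploiting the emerging dense alignment properties of the foundation model AM-RADIO, avoiding the computational overhead of hierarchical training methods. Our approach achieves state-of-the-art performance on Open Vocabulary 3D Object Retrieval and Point Cloud Understanding benchmarks, particularly excelling on fine-grained queries where registration methods typically fail.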

[77] EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation

Abhishek Saroha,Huajian Zeng,Xingxing Zuo,Daniel Cremers,Xi Wang

Main category: cs.CV

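TL;DR: EgoFlow is a flow-matching framework that generates physically plausible, realistic 6DoF object motion trajectories from egocentric video. It combines a hybrid Mamba-Transformer-Perceiver architecture with differentiable physical constraints, markedly reducing collision rates and improving generalization on real-world datasets.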

Details Motivation: Existing generative models lack explicit physical reasoning and struggle to produce physically consistent 6DoF trajectories under challenges such as occlusion and fast motion. Method: Proposes EgoFlow, which uses a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, and applies differentiable physical constraints (such as collision avoidance and motion smoothness) through gradient-guided inference. Result: On the HD-EPIC, EgoExo4D, and HOT3D datasets, EgoFlow outperforms diffusion-based and Transformer baselines in accuracy, generalization, and physical realism, reduces collision rates by up to 79%, and generalizes strongly to unseen scenes. Conclusion: Flow-matching generative modeling offers a new path toward scalable, physically grounded egocentric motion understanding. Abstract: Understanding and predicting object motion from egocentric video is fundamental to embodied perception and interaction. However, generating physically consistent 6DoF trajectories remains challenging due to occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models. We present EgoFlow, a flow-matching framework that synthesizes realistic and physically plausible trajectories conditioned on multimodal egocentric observations. EgoFlow employs a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, while a gradient-guided inference process enforces differentiable physical constraints such as collision avoidance and motion smoothness. This combination yields coherent and controllable motion generation without post-hoc filtering or additional supervision. Experiments on the real-world datasets HD-EPIC, EgoExo4D, and HOT3D show that EgoFlow outperforms diffusion-based and transformer baselines in accuracy, generalization, and physical realism, reducing collision rates by up to 79% and generalizing strongly to unseen scenes. Our results highlight the promise of flow-based generative modeling for scalable and physically grounded egocentric motion understanding.

[78] Better Rigs, Not Bigger Networks: A Body Model Ablation for Gaussian Avatars

Derek Austin

Main category: cs.CV

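TL;DR: This paper replaces SMPL with the Momentum Human Rig (MHR), estimated via SAM-3D-Body, to build a simpler 3D Gaussian splatting pipeline for human avatars. It achieves the best PSNR and competitive LPIPS/SSIM on PeopleSnapshot and ZJU-MoCap, showing that body model expressiveness is currently a primary bottleneck in avatar reconstruction.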

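Details Motivation: Existing SMPL-based 3D Gaussian splatting methods achieve high visual quality, but their training architectures keep growing more complex; the author questions whether this complexity is necessary and looks for a simpler, effective body representation and estimation scheme. Method: Replaces SMPL with the Momentum Human Rig (MHR), uses SAM-3D-Body for pose and mesh estimation, and builds a minimal Gaussian splatting pipeline with no learned deformations or pose-dependent corrections; two controlled ablations (converting SAM-3D-Body meshes to SMPL-X, and converting SMPL poses to MHR and retraining) separate the effects of pose estimation quality and model representational capacity. Result: The MHR + SAM-3D-Body pipeline achieves the highest PSNR on the PeopleSnapshot and ZJU-MoCap datasets, with LPIPS and SSIM that are competitive or better; the ablations confirm that body model expressiveness (with both mesh representational capacity and pose estimation quality contributing) is a key bottleneck for avatar reconstruction. Conclusion: A simpler pipeline built on MHR with high-quality estimation from SAM-3D-Body can surpass complex SMPL pipelines with learned deformations, indicating that improving body representation and pose estimation matters more than increasing network complexity. Abstract: Recent 3D Gaussian splatting methods built atop SMPL achieve remarkable visual fidelity while continually increasing the complexity of the overall training architecture. We demonstrate that much of this complexity is unnecessary: by replacing SMPL with the Momentum Human Rig (MHR), estimated via SAM-3D-Body, a minimal pipeline with no learned deformations or pose-dependent corrections achieves the highest reported PSNR and competitive or superior LPIPS and SSIM on PeopleSnapshot and ZJU-MoCap. To disentangle pose estimation quality from body model representational capacity, we perform two controlled ablations: translating SAM-3D-Body meshes to SMPL-X, and translating the original dataset's SMPL poses into MHR, both retrained under identical conditions. These ablations confirm that body model expressiveness has been a primary bottleneck in avatar reconstruction, with both mesh representational capacity and pose estimation quality contributing meaningfully to the full pipeline's gains.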

[79] Nonlinear Methods for Analyzing Pose in Behavioral Research

Carter Sale,Margaret C. Macpherson,Gaurav Patil,Kelly Miles,Rachel W. Kallen,Sebastian Wallot,Michael J. Richardson

Main category: cs.CV

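TL;DR: This paper presents a general-purpose analysis pipeline for human pose data that combines preprocessing, dimensionality reduction, and recurrence-based time series analysis to extract the temporal structure of movement dynamics, and demonstrates its flexibility and applicability through several case studies.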

Details Motivation: Pose data are high-dimensional, noisy, and temporally complex, which makes it hard to extract meaningful patterns of coordination and behavioral change; a general, robust analysis framework is needed. Method: Builds a general-purpose analysis pipeline comprising principled preprocessing, dimensionality reduction, and recurrence-based time series analysis, supporting both linear and nonlinear characterizations of movement. Result: Three case studies covering facial and full-body movement, 2D and 3D data, and individual versus multi-agent behavior show that the pipeline adapts to different experimental contexts and extracts theoretically meaningful behavioral insights. Conclusion: The proposed pipeline provides a scalable, reusable methodological foundation for large-scale analysis of human behavior in naturalistic settings. Abstract: Advances in markerless pose estimation have made it possible to capture detailed human movement in naturalistic settings using standard video, enabling new forms of behavioral analysis at scale. However, the high dimensionality, noise, and temporal complexity of pose data raise significant challenges for extracting meaningful patterns of coordination and behavioral change. This paper presents a general-purpose analysis pipeline for human pose data, designed to support both linear and nonlinear characterizations of movement across diverse experimental contexts. The pipeline combines principled preprocessing, dimensionality reduction, and recurrence-based time series analysis to quantify the temporal structure of movement dynamics. To illustrate the pipeline's flexibility, we present three case studies spanning facial and full-body movement, 2D and 3D data, and individual versus multi-agent behavior. Together, these examples demonstrate how the same analytic workflow can be adapted to extract theoretically meaningful insights from complex pose time series.
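
As a concrete example of recurrence-based time series analysis, the sketch below computes a recurrence matrix and the recurrence rate for a one-dimensional signal. It is a stripped-down illustration, not the paper's pipeline: it skips preprocessing, dimensionality reduction, and delay embedding, and the radius and the toy signal are assumptions.

```python
def recurrence_matrix(series, radius):
    """Binary matrix marking pairs of time points whose values lie within `radius`."""
    return [[1 if abs(a - b) <= radius else 0 for b in series] for a in series]


def recurrence_rate(series, radius):
    """Share of recurrent pairs, excluding the trivial main diagonal."""
    n = len(series)
    rm = recurrence_matrix(series, radius)
    recurrent = sum(rm[i][j] for i in range(n) for j in range(n) if i != j)
    return recurrent / (n * (n - 1))


# Toy periodic signal, e.g. one reduced pose component oscillating between two states.
signal = [0, 1, 0, 1]
rr = recurrence_rate(signal, radius=0.1)
```

Here every time point recurs with exactly one other point, giving a rate of one third; in a full analysis, further measures such as determinism would be derived from the diagonal line structures of the same matrix.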

[80] Reinforcing Consistency in Video MLLMs with Structured Rewards

Yihao Quan,Zeru Shi,Jinman Zhao,Ruixiang Tang

Main category: cs.CV

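TL;DR: This paper proposes a structured reward mechanism. It decomposes video captions into factual and temporal claims to run a top-down verifiability audit of video understanding in multimodal large language models (MLLMs), and in reinforcement learning it replaces coarse sentence-level rewards with fine-grained factual and temporal rewards, markedly improving the models' faithfulness in visual and temporal grounding.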

Details Motivation: In video understanding, existing MLLMs often produce hallucinated outputs that look plausible but lack visual and temporal grounding (such as fabricated objects, wrong attributes, or collapsed repeated events), and standard sentence-level supervision and rewards cannot pinpoint specific grounding failures. Method: Proposes a compositional consistency audit based on decomposed captions; designs a structured RL reward comprising (1) an instance-aware scene-graph reward (objects, attributes, relations), (2) a temporal reward (event ordering and repetition), and (3) a video-grounded VQA reward for hierarchical self-verification. Result: On temporal, general video understanding, and hallucination benchmarks, the structured reward yields consistent gains across several open-source backbones and markedly reduces factual and temporal hallucinations. Conclusion: Structured reward shaping is a practical, effective route to more faithful video understanding in MLLMs; fine-grained, decomposable, verifiable supervision signals work better than coarse sentence-level objectives. Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment, standard sentence-level rewards often prove too coarse to accurately localize specific grounding failures. To address this, we replace generic sentence-level rewards with a structured reward built from factual and temporal units. Our training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. Across temporal, general video understanding, and hallucination-oriented benchmarks, this objective yields consistent gains on open-source backbones.
These results suggest that structured reward shaping is a practical route to more faithful video understanding.

[81] Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation

Yunbei Zhang,Chengyi Cai,Feng Liu,Jihun Hamm

Main category: cs.CV

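TL;DR: This paper proposes AReS, which uses a single pass of API calls to lightly fine-tune a local pre-trained encoder and then performs white-box reprogramming on the local model. This avoids repeated calls to an expensive closed-source API, markedly improving performance on multiple datasets while sharply reducing API call costs.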

Details Motivation: Existing methods that reprogram closed-source service models (such as GPT-4o) via Zeroth-Order Optimization (ZOO) need frequent, costly API calls and optimize unstably; moreover, modern large models are insensitive to input perturbations, which makes ZOO ineffective. Method: AReS uses a two-stage strategy. The first stage is a single-pass API interaction that trains only a lightweight layer on top of a local pre-trained encoder to "prime" it for adaptation; the second stage performs white-box reprogramming on this local proxy model with no further API access. Result: On GPT-4o, AReS gains +27.8% over the zero-shot baseline, whereas ZOO methods provide almost no gain; across 10 datasets it outperforms state-of-the-art methods (+2.5% for VLMs, +15.6% for standard VMs) while cutting API calls by more than 99.99%. Conclusion: AReS offers an efficient, stable, low-cost new paradigm for adapting closed-source service models, especially modern large models that are insensitive to perturbations. Abstract: Adapting closed-box service models (i.e., APIs) for target tasks typically relies on reprogramming via Zeroth-Order Optimization (ZOO). However, this standard strategy is known for extensive, costly API calls and often suffers from slow, unstable optimization. Furthermore, we observe that this paradigm faces new challenges with modern APIs (e.g., GPT-4o). These models can be less sensitive to the input perturbations ZOO relies on, thereby hindering performance gains. To address these limitations, we propose an Alternative efficient Reprogramming approach for Service models (AReS). Instead of direct, continuous closed-box optimization, AReS initiates a single-pass interaction with the service API to prime an amenable local pre-trained encoder. This priming stage trains only a lightweight layer on top of the local encoder, making it highly receptive to the subsequent glass-box (white-box) reprogramming stage performed directly on the local model. Consequently, all subsequent adaptation and inference rely solely on this local proxy, eliminating all further API costs. Experiments demonstrate AReS's effectiveness where prior ZOO-based methods struggle: on GPT-4o, AReS achieves a +27.8% gain over the zero-shot baseline, a task where ZOO-based methods provide little to no improvement. Broadly, across ten diverse datasets, AReS outperforms state-of-the-art methods (+2.5% for VLMs, +15.6% for standard VMs) while reducing API calls by over 99.99%. AReS thus provides a robust and practical solution for adapting modern closed-box models.
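
To see why ZOO-based reprogramming is expensive in API calls, consider a generic two-point zeroth-order gradient estimator. The sketch is a textbook-style illustration, not the baseline used in the paper: the black-box function, the number of random directions, and the smoothing step `mu` are assumptions. It counts how many black-box queries one gradient estimate consumes.

```python
import random


def zoo_gradient(f, x, n_dirs=8, mu=1e-3, seed=0):
    """Two-point zeroth-order gradient estimate of black-box `f` at `x`.

    Returns the estimate and the number of queries to `f` that it cost.
    """
    rng = random.Random(seed)
    grad = [0.0] * len(x)
    queries = 0
    for _ in range(n_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in x]          # random probe direction
        f_plus = f([xi + mu * ui for xi, ui in zip(x, u)])
        f_minus = f([xi - mu * ui for xi, ui in zip(x, u)])
        queries += 2                                   # each probe costs two calls
        slope = (f_plus - f_minus) / (2 * mu)          # directional derivative estimate
        grad = [g + slope * ui / n_dirs for g, ui in zip(grad, u)]
    return grad, queries


def black_box(x):
    """Stand-in for a paid API: a simple quadratic whose true gradient is 2x."""
    return sum(xi * xi for xi in x)


estimate, cost = zoo_gradient(black_box, [1.0, 2.0])
```

A single estimate with 8 directions already costs 16 queries, and an optimizer needs one such estimate per step, which is the cost that a single-pass priming stage is designed to avoid.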

[82] UniRecGen: Unifying Multi-View 3D Reconstruction and Generation

Zhisheng Huang,Jiahao Chen,Cheng Lin,Chenyu Hu,Hanzhuo Huang,Zhengming Yu,Mengfei Li,Yuheng Liu,Zekai Gu,Zibo Zhao,Yuan Liu,Xin Li,Wenping Wang

Main category: cs.CV

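TL;DR: UniRecGen is a unified framework that combines the feed-forward reconstruction and diffusion-based generation paradigms. By aligning both models in a shared canonical space and using disentangled cooperative learning, it produces high-fidelity, structurally complete, and multi-view-consistent 3D models from sparse views.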

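Details Motivation: Sparse-view 3D modeling faces a fundamental tension between reconstruction fidelity and generative plausibility: feed-forward reconstruction is efficient but lacks the global priors needed for structural completeness, while diffusion-based generation yields rich detail but is inconsistent across views. Method: Proposes the UniRecGen framework, which aligns the feed-forward reconstruction module and the diffusion generation module in a shared canonical space and uses a disentangled cooperative learning strategy; the reconstruction module provides canonical geometric anchors, and the diffusion module refines and completes the geometry through latent-augmented conditioning. Result: Experiments show that, from sparse-view inputs, UniRecGen produces 3D models with higher fidelity, structural completeness, and multi-view consistency than existing methods. Conclusion: By integrating the reconstruction and generation paradigms cooperatively, UniRecGen eases the trade-off between fidelity and plausibility in sparse-view 3D modeling and offers a new direction for unified modeling frameworks. Abstract: Sparse-view 3D modeling represents a fundamental tension between reconstruction fidelity and generative plausibility. While feed-forward reconstruction excels in efficiency and input alignment, it often lacks the global priors needed for structural completeness. Conversely, diffusion-based generation provides rich geometric details but struggles with multi-view consistency. We present UniRecGen, a unified framework that integrates these two paradigms into a single cooperative system. To overcome inherent conflicts in coordinate spaces, 3D representations, and training objectives, we align both models within a shared canonical space. We employ disentangled cooperative learning, which maintains stable training while enabling seamless collaboration during inference. Specifically, the reconstruction module is adapted to provide canonical geometric anchors, while the diffusion generator leverages latent-augmented conditioning to refine and complete the geometric structure. Experimental results demonstrate that UniRecGen achieves superior fidelity and robustness, outperforming existing methods in creating complete and consistent 3D models from sparse observations.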

[83] Universal computational thermal imaging overcoming the ghosting effect

Hongyi Xu,Du Wang,Chenjun Zhao,Jiashuo Chen,Jiale Lin,Liqin Cao,Yanfei Zhong,Yiyuan She,Fanglin Bao

Main category: cs.CV

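TL;DR: This paper proposes TAG (thermal anti-ghosting), a universal computational thermal imaging framework that addresses the ghosting effect caused by material non-uniformity and enables high-fidelity night vision.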

Details Motivation: Conventional thermal imaging is limited by the ghosting effect, which erases fine texture; the existing HADAR method applies only to scenes with uniform materials, yet material non-uniformity is ubiquitous in the real world, so a universal anti-ghosting solution is needed. Method: The TAG framework uses hyperspectral photon streams for nonparametric texture recovery, enabling universal anti-ghosting thermal imaging of scenes with non-uniform materials. Result: TAG achieves unprecedented expression and texture recovery in heavily ghosted thermal images of human faces; it outperforms HADAR across various scenes; and it demonstrates, for the first time, thermal 3D topological alignment and mood detection. Conclusion: TAG establishes a universal foundation for high-fidelity computational night vision, with potential applications in autonomous navigation, reconnaissance, healthcare, and wildlife monitoring. Abstract: Thermal imaging is crucial for night vision but fundamentally hampered by the ghosting effect, a loss of detailed texture in cluttered photon streams. While conventional ghosting mitigation has relied on data post-processing, the recent breakthrough in heat-assisted detection and ranging (HADAR) opens a promising frontier for hyperspectral computational thermal imaging that produces night vision with day-like visibility. However, universal anti-ghosting imaging remains elusive, as state-of-the-art HADAR applies only to limited scenes with uniform materials, whereas material non-uniformity is ubiquitous in the real world. Here, we propose a universal computational thermal imaging framework, TAG (thermal anti-ghosting), to address material non-uniformity and overcome ghosting for high-fidelity night vision. TAG takes hyperspectral photon streams for nonparametric texture recovery, enabling our experimental demonstration of unprecedented expression recovery in thus-far-elusive ghostly human faces -- the archetypal, long-recognized ghosting phenomenon. Strikingly, TAG not only universally outperforms HADAR across various scenes, but also reveals the influence of material non-uniformity, shedding light on HADAR's effectiveness boundary. We extensively test facial texture and expression recovery across day and night, and demonstrate, for the first time, thermal 3D topological alignment and mood detection. This work establishes a universal foundation for high-fidelity computational night vision, with potential applications in autonomous navigation, reconnaissance, healthcare, and wildlife monitoring.

[84] Prototype-Based Low Altitude UAV Semantic Segmentation

Da Zhang,Gao Junyu,Zhao Zhiyuan

Main category: cs.CV

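TL;DR: This paper proposes PBSeg, an efficient prototype-based framework for semantic segmentation of low-altitude UAV imagery. Using prototype-based cross-attention (PBCA) and a multi-scale feature extraction module that combines deformable convolutions with context-aware modulation, it substantially reduces computational overhead while maintaining high accuracy.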

Details Motivation: Semantic segmentation of low-altitude UAV imagery faces large scale variations, complex boundaries, and limited compute on edge devices; existing transformer-based methods are computationally expensive, while lightweight methods struggle to capture fine detail in high-resolution aerial images. Method: The PBSeg framework has two core parts: 1) prototype-based cross-attention (PBCA), which exploits feature redundancy to reduce computational complexity; 2) an efficient multi-scale feature extraction module that combines deformable convolutions (DConv) with context-aware modulation (CAM). Result: PBSeg reaches 71.86% mIoU on UAVid and 80.92% mIoU on UDD6, which is competitive performance at low computational cost. Conclusion: PBSeg strikes a good balance between accuracy and efficiency and offers a practical solution for semantic segmentation on resource-constrained UAV edge devices. Abstract: Semantic segmentation of low-altitude UAV imagery presents unique challenges due to extreme scale variations, complex object boundaries, and limited computational resources on edge devices. Existing transformer-based segmentation methods achieve remarkable performance but incur high computational overhead, while lightweight approaches struggle to capture fine-grained details in high-resolution aerial scenes. To address these limitations, we propose PBSeg, an efficient prototype-based segmentation framework tailored for UAV applications. PBSeg introduces a novel prototype-based cross-attention (PBCA) that exploits feature redundancy to reduce computational complexity while maintaining segmentation quality. The framework incorporates an efficient multi-scale feature extraction module that combines deformable convolutions (DConv) with context-aware modulation (CAM) to capture both local details and global semantics. Experiments on two challenging UAV datasets demonstrate the effectiveness of the proposed approach. PBSeg achieves 71.86\% mIoU on UAVid and 80.92\% mIoU on UDD6, establishing competitive performance while maintaining computational efficiency. Code is available at https://github.com/zhangda1018/PBSeg.
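For readers unfamiliar with the mIoU figures quoted above, the following is a minimal sketch of how mean Intersection-over-Union is commonly computed from a confusion matrix. It is not taken from the PBSeg code; the function name `mean_iou` and the toy two-class matrix are illustrative assumptions, and the handling of absent classes may differ from the benchmarks' official evaluation scripts.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] is the number of pixels whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                         # true c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp    # predicted c, true other
        denom = tp + fp + fn
        if denom > 0:  # assumption: skip classes absent from both prediction and truth
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example (hypothetical numbers): class 0 IoU = 8/11, class 1 IoU = 9/12.
toy = [[8, 2],
       [1, 9]]
print(round(mean_iou(toy), 4))
```

Per-class IoU is averaged without weighting by class frequency, which is why rare classes influence mIoU as much as common ones.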

[85] Cross-Domain Vessel Segmentation via Latent Similarity Mining and Iterative Co-Optimization

Zhanqiang Guo,Jianjiang Feng,Jie Zhou

Main category: cs.CV

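TL;DR: This paper proposes a new cross-domain retinal vessel segmentation framework based on latent vascular similarity and iterative co-optimization of generation and segmentation networks, which substantially improves segmentation in clinical scenarios with large modality differences.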

Details Motivation: Existing CNN-based methods degrade significantly when there is a domain shift between training and test data, so cross-domain generalization needs to improve. Method: A domain transfer framework is proposed: generation networks are first pre-trained separately for the source and target domains; the source-domain conditional diffusion model then performs deterministic inversion to build domain-agnostic latent prototypes of vascular images for synthesizing target-domain images; finally, the segmentation network and the generative model are co-optimized iteratively through cyclic parameter updates. Result: The framework achieves state-of-the-art performance on cross-domain retinal vessel segmentation, particularly in challenging clinical scenarios with significant modality discrepancies. Conclusion: By jointly optimizing generation and segmentation, the framework mitigates domain shift and offers a new approach to cross-domain medical image analysis. Abstract: Retinal vessel segmentation serves as a critical prerequisite for automated diagnosis of retinal pathologies. While recent advances in Convolutional Neural Networks (CNNs) have demonstrated promising performance in this task, significant performance degradation occurs when domain shifts exist between training and testing data. To address these limitations, we propose a novel domain transfer framework that leverages latent vascular similarity across domains and iterative co-optimization of generation and segmentation networks. Specifically, we first pre-train generation networks for source and target domains. Subsequently, the pretrained source-domain conditional diffusion model performs deterministic inversion to establish intermediate latent representations of vascular images, creating domain-agnostic prototypes for target synthesis. Finally, we develop an iterative refinement strategy where segmentation network and generative model undergo mutual optimization through cyclic parameter updating. This co-evolution process enables simultaneous enhancement of cross-domain image synthesis quality and segmentation accuracy. Experiments demonstrate that our framework achieves state-of-the-art performance in cross-domain retinal vessel segmentation, particularly in challenging clinical scenarios with significant modality discrepancies.

[86] ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction

Yanzhe Liang,Ruijie Zhu,Hanzhi Chang,Zhuoyuan Li,Jiahao Lu,Tianzhu Zhang

Main category: cs.CV

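TL;DR: ReFlow is a monocular dynamic scene reconstruction framework that needs no external optical-flow guidance; a self-correction flow matching mechanism lets it model static and dynamic components separately and perform robust 4D reconstruction.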

Details Motivation: Existing monocular dynamic scene reconstruction methods often suffer from incomplete initialization of dynamic regions, which makes reconstruction and motion estimation unstable; relying on external motion guidance such as pre-computed optical flow adds complexity and propagates errors. Method: ReFlow has three core modules: a Complete Canonical Space Construction module (better initialization of static and dynamic regions), a Separation-Based Dynamic Scene Modeling module (decoupling static and dynamic components for targeted motion supervision), and a self-correction flow matching mechanism (Full Flow Matching aligns 3D scene flow with 2D observations, and Camera Flow Matching enforces multi-view consistency for static objects). Result: Experiments across diverse scenarios show that ReFlow outperforms existing methods in reconstruction quality and robustness, establishing a new self-correction paradigm for monocular 4D reconstruction. Conclusion: Through end-to-end learning and self-correction, ReFlow removes the dependence on external motion priors and markedly improves the accuracy and stability of monocular dynamic scene reconstruction. Abstract: We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, and therefore often resort to external dense motion guidance such as pre-computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation. To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision. The core of ReFlow is a novel self-correction flow matching mechanism, consisting of Full Flow Matching to align 3D scene flow with time-varying 2D observations, and Camera Flow Matching to enforce multi-view consistency for static objects. Together, these modules enable robust and accurate dynamic scene reconstruction. Extensive experiments across diverse scenarios demonstrate that ReFlow achieves superior reconstruction quality and robustness, establishing a novel self-correction paradigm for monocular 4D reconstruction.

[87] VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Jiahao Meng,Tan Yue,Qi Xu,Haochen Wang,Zhongwei Ren,Weisong Liu,Yuhao Wang,Renrui Zhang,Yunhai Tong,Haodong Duan

Main category: cs.CV

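TL;DR: This paper presents VideoZeroBench, a hierarchical benchmark for long-video question answering that strictly verifies spatio-temporal evidence; experiments show that current multimodal large models are severely lacking in genuine spatio-temporal grounding.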

Details Motivation: Current evaluations of video multimodal large language models have two major flaws: inflated scores mask deficiencies in fine-grained visual understanding, and they do not verify whether a model actually locates the precise spatio-temporal evidence supporting its answer. Method: The VideoZeroBench benchmark contains 500 manually annotated questions paired with temporal intervals and spatial bounding boxes as evidence; a five-level evaluation protocol progressively tightens the requirements on answer generation, temporal grounding, and spatial grounding. Result: Gemini-3-Pro answers fewer than 17% of questions correctly under the standard end-to-end QA setting (Level-3); at the strictest Level-5 (correct answer plus accurate spatio-temporal localization), no model exceeds 1% accuracy, and most achieve no correct grounded predictions. Conclusion: There is a large gap between surface-level answer correctness and genuine evidence-based reasoning, and joint spatio-temporal grounding remains a core bottleneck for long-video QA. Abstract: Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark designed for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: No model exceeds 1% accuracy when both correct answering and accurate spatio-temporal localization are required (Level-5), with most failing to achieve any correct grounded predictions. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA.
We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.
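To make the idea of "correct answer plus accurate spatio-temporal localization" concrete, here is a minimal sketch of how such a joint check could be scored with temporal IoU and bounding-box IoU. This is an illustration only: the abstract does not specify the benchmark's actual matching rule, so the function names and the 0.5 thresholds below are assumptions.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def grounded_correct(answer_ok, t_pred, t_gt, b_pred, b_gt, t_thr=0.5, b_thr=0.5):
    """True only if the answer is right AND both localizations are good enough.

    t_thr and b_thr are hypothetical thresholds chosen for illustration.
    """
    return (answer_ok
            and temporal_iou(t_pred, t_gt) >= t_thr
            and box_iou(b_pred, b_gt) >= b_thr)


# A right answer with a poorly localized interval does not count as grounded.
print(grounded_correct(True, (0, 10), (5, 15), (0, 0, 2, 2), (0, 0, 2, 2)))
```

Requiring all three conditions at once is what makes a joint metric much stricter than answer accuracy alone: any single miss zeroes the sample.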

[88] Harmonized Tabular-Image Fusion via Gradient-Aligned Alternating Learning

Longfei Huang,Yang Yang

Main category: cs.CV

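TL;DR: This paper proposes the Gradient-Aligned Alternating Learning (GAAL) paradigm, which alternates unimodal learning with a shared classifier and applies uncertainty-based cross-modal gradient surgery to mitigate gradient conflicts in multimodal (tabular-image) fusion and improve fusion performance.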

Details Motivation: Existing tabular-image multimodal fusion methods are limited by gradient conflicts between modalities, which mislead the optimization of the unimodal learners. Method: The GAAL paradigm 1) alternates unimodal learning with shared-classifier training to decouple the multimodal gradient, and 2) uses uncertainty-based cross-modal gradient surgery to selectively align cross-modal gradients when optimizing the shared parameters. Result: On several widely used datasets, GAAL clearly outperforms a range of SoTA tabular-image fusion baselines and test-time tabular-missing baselines. Conclusion: GAAL provides effective unimodal assistance, improves overall fusion performance, and offers a new approach to cooperative multimodal gradient optimization. Abstract: Multimodal tabular-image fusion is an emerging task that has received increasing attention in various domains. However, existing methods may be hindered by gradient conflicts between modalities, misleading the optimization of the unimodal learner. In this paper, we propose a novel Gradient-Aligned Alternating Learning (GAAL) paradigm to address this issue by aligning modality gradients. Specifically, GAAL adopts an alternating unimodal learning and shared classifier to decouple the multimodal gradient and facilitate interaction. Furthermore, we design uncertainty-based cross-modal gradient surgery to selectively align cross-modal gradients, thereby steering the shared parameters to benefit all modalities. As a result, GAAL can provide effective unimodal assistance and help boost the overall fusion performance. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SoTA) tabular-image fusion baselines and test-time tabular missing baselines. The source code is available at https://github.com/njustkmg/ICME26-GAAL.
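The notion of a "gradient conflict" can be illustrated with the generic projection step used in PCGrad-style gradient surgery: when two gradients have a negative dot product, the conflicting component is removed from one of them. The sketch below shows only this generic step; GAAL's uncertainty-based selection rule is not described in the abstract and is not reproduced here, and the helper names are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def project_if_conflicting(g, g_other):
    """Remove from g the component that opposes g_other.

    If the gradients do not conflict (dot product >= 0), g is returned unchanged.
    Otherwise g is projected onto the plane orthogonal to g_other.
    """
    d = dot(g, g_other)
    if d >= 0:
        return list(g)
    scale = d / dot(g_other, g_other)  # safe: d < 0 implies g_other is non-zero
    return [a - scale * b for a, b in zip(g, g_other)]


# Hypothetical gradients on the shared parameters from the two modalities.
g_image = [1.0, 0.0]
g_tabular = [-1.0, 1.0]
g_fixed = project_if_conflicting(g_image, g_tabular)
print(g_fixed)                  # [0.5, 0.5]
print(dot(g_fixed, g_tabular))  # 0.0, the conflict is gone
```

After projection the update no longer moves against the other modality's descent direction, which is the property any alignment scheme of this kind aims for.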

[89] Satellite-Free Training for Drone-View Geo-Localization

Tao Liu,Yingzhi Zhang,Kan Ren,Xiaoqi Zhao

Main category: cs.CV

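TL;DR: This paper proposes SFT, a drone-view geo-localization (DVGL) framework trained without satellite imagery. It reconstructs 3D scenes from multi-view drone images, generates pseudo-orthophotos, and aggregates features for cross-view retrieval, substantially improving localization when no satellite data is available for training.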

Details Motivation: Existing DVGL methods rely on satellite imagery during training (paired supervision or unsupervised alignment), which limits deployment where satellite data is unavailable or restricted; real applications also often involve multi-view drone sequences rather than a single oblique image. Method: The satellite-free training (SFT) framework 1) reconstructs dense 3D scenes from multi-view drone images with 3D Gaussian splatting; 2) renders the geometry into pseudo-orthophotos via PCA-guided orthographic projection, without camera parameters; 3) applies lightweight geometry-guided inpainting to complete textures; 4) extracts DINOv3 patch features, learns a Fisher vector aggregation model from drone data only, and reuses it at test time to encode satellite tiles. Result: On University-1652 and SUES-200, SFT substantially outperforms existing satellite-free baselines and narrows the gap to methods trained with satellite imagery. Conclusion: SFT shows that cross-view compatible representations can be built from drone imagery alone, offering a new paradigm for practical DVGL in GPS-denied environments. Abstract: Drone-view geo-localization (DVGL) aims to determine the location of drones in GPS-denied environments by retrieving the corresponding geotagged satellite tile from a reference gallery given UAV observations of a location. In many existing formulations, these observations are represented by a single oblique UAV image. In contrast, our satellite-free setting is designed for multi-view UAV sequences, which are used to construct a geometry-normalized UAV-side location representation before cross-view retrieval. Existing approaches rely on satellite imagery during training, either through paired supervision or unsupervised alignment, which limits practical deployment when satellite data are unavailable or restricted. In this paper, we propose a satellite-free training (SFT) framework that converts drone imagery into cross-view compatible representations through three main stages: drone-side 3D scene reconstruction, geometry-based pseudo-orthophoto generation, and satellite-free feature aggregation for retrieval. Specifically, we first reconstruct dense 3D scenes from multi-view drone images using 3D Gaussian splatting and project the reconstructed geometry into pseudo-orthophotos via PCA-guided orthographic projection. This rendering stage operates directly on reconstructed scene geometry without requiring camera parameters at rendering time. Next, we refine these orthophotos with lightweight geometry-guided inpainting to obtain texture-complete drone-side views.
Finally, we extract DINOv3 patch features from the generated orthophotos, learn a Fisher vector aggregation model solely from drone data, and reuse it at test time to encode satellite tiles for cross-view retrieval. Experimental results on University-1652 and SUES-200 show that our SFT framework substantially outperforms satellite-free generalization baselines and narrows the gap to methods trained with satellite imagery.

[90] SHOE: Semantic HOI Open-Vocabulary Evaluation Metric

Maja Noack,Qinqian Lei,Taipeng Tian,Bihan Dong,Robby T. Tan,Yixin Chen,John Young,Saijun Zhang,Bo Wang

Main category: cs.CV

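TL;DR: This paper proposes SHOE, an evaluation framework for open-vocabulary HOI detection that incorporates semantic similarity; it uses LLMs to score the semantic similarity of verb and object components, improving agreement with human judgments.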

Details Motivation: Existing HOI evaluation metrics (e.g., mAP) rely on exact string matching and cannot credit predictions that are semantically close but worded differently (e.g., "lean on couch" vs. "sit on couch"), which makes them poorly suited to open-vocabulary settings. Method: SHOE decomposes each HOI prediction into verb and object parts, estimates their semantic similarity as the average over multiple large language models (LLMs), and fuses the two into an overall similarity score, supporting flexible evaluation on standard datasets such as HICO-DET. Result: SHOE reaches 85.73% agreement with human ratings, clearly better than existing LLM-based and embedding-based baselines, and it can evaluate discriminative and generative HOI models in a unified way. Conclusion: SHOE provides a semantics-driven HOI evaluation paradigm that is closer to human understanding and supports open-vocabulary HOI detection in real-world settings; the evaluation tool will be released to support follow-up research. Abstract: Open-vocabulary human-object interaction (HOI) detection is a step towards building scalable systems that generalize to unseen interactions in real-world scenarios and support grounded multimodal systems that reason about human-object relationships. However, standard evaluation metrics, such as mean Average Precision (mAP), treat HOI classes as discrete categorical labels and fail to credit semantically valid but lexically different predictions (e.g., "lean on couch" vs. "sit on couch"), limiting their applicability for evaluating open-vocabulary predictions that go beyond any predefined set of HOI labels. We introduce SHOE (Semantic HOI Open-Vocabulary Evaluation), a new evaluation framework that incorporates semantic similarity between predicted and ground-truth HOI labels. SHOE decomposes each HOI prediction into its verb and object components, estimates their semantic similarity using the average of multiple large language models (LLMs), and combines them into a similarity score to evaluate alignment beyond exact string match. This enables a flexible and scalable evaluation of both existing HOI detection methods and open-ended generative models using standard benchmarks such as HICO-DET. Experimental results show that SHOE scores align more closely with human judgments than existing metrics, including LLM-based and embedding-based baselines, achieving an agreement of 85.73% with the average human ratings. Our work underscores the need for semantically grounded HOI evaluation that better mirrors human understanding of interactions.
We will release our evaluation metric to the public to facilitate future research.
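The decompose-then-combine idea described above can be sketched as follows. This is a hedged illustration, not the SHOE implementation: the lookup-table "scorers" stand in for LLM similarity judgments, the scores in them are made-up numbers, and combining verb and object similarity by multiplication is an assumption, since the abstract does not state the combination rule.

```python
def make_table_scorer(table):
    """Build a toy similarity scorer from a dict of (phrase, phrase) -> score.

    Stands in for one LLM's similarity judgment; identical phrases score 1.0
    and unknown pairs score 0.0.
    """
    def score(a, b):
        if a == b:
            return 1.0
        return table.get((a, b), table.get((b, a), 0.0))
    return score


def average_similarity(a, b, scorers):
    """Average the similarity of two phrases over several scorers."""
    return sum(s(a, b) for s in scorers) / len(scorers)


def hoi_similarity(pred, gt, scorers):
    """Score a predicted (verb, object) pair against the ground-truth pair.

    Assumption: verb and object similarities are combined by multiplication.
    """
    verb_sim = average_similarity(pred[0], gt[0], scorers)
    obj_sim = average_similarity(pred[1], gt[1], scorers)
    return verb_sim * obj_sim


# Two hypothetical scorers that disagree slightly on the verb pair.
scorer_a = make_table_scorer({("lean on", "sit on"): 0.8})
scorer_b = make_table_scorer({("lean on", "sit on"): 0.6})
score = hoi_similarity(("lean on", "couch"), ("sit on", "couch"), [scorer_a, scorer_b])
print(round(score, 2))
```

Under exact string matching this prediction would receive zero credit; a similarity-based score gives partial credit that reflects how close the verbs are.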

[91] Mitigating the ID-OOD Tradeoff in Open-Set Test-Time Adaptation

Wenjie Zhao,Jia Li,Xin Dong,Yapeng Tian,Yu Xiang,Yunhui Guo

Main category: cs.CV

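TL;DR: This paper proposes ROSETTA, which introduces an angular loss and a feature-norm loss to resolve the inherent conflict between entropy minimization and entropy maximization in open-set test-time adaptation (OSTTA), improving OOD detection while preserving ID classification performance.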

Details Motivation: In open-set test-time adaptation (OSTTA), a model must handle both covariate-shifted ID samples (csID) and covariate-shifted OOD samples (csOOD), but conventional entropy-based strategies conflict with each other, making it hard to achieve both csID classification and csOOD detection. Method: ROSETTA introduces an angular loss to regulate feature norm magnitudes and a feature-norm loss to suppress the logits of csOOD samples, jointly optimizing ID classification and OOD detection. Result: ROSETTA achieves strong OOD detection and high ID classification accuracy on CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C, and ImageNet-C; its effectiveness is also validated on Cityscapes semantic segmentation and the HAC open-set TTA dataset. Conclusion: ROSETTA mitigates the conflict between the entropy objectives and is a robust open-set test-time adaptation method that generalizes well across tasks and datasets. Abstract: Open-set test-time adaptation (OSTTA) addresses the challenge of adapting models to new environments where out-of-distribution (OOD) samples coexist with in-distribution (ID) samples affected by distribution shifts. In such settings, covariate shift (for example, changes in weather conditions such as snow) can alter ID samples, reducing model reliability. Consequently, models must not only correctly classify covariate-shifted ID (csID) samples but also effectively reject covariate-shifted OOD (csOOD) samples. Entropy minimization is a common strategy in test-time adaptation to maintain ID performance under distribution shifts, while entropy maximization is widely applied to enhance OOD detection. Several studies have sought to combine these objectives to tackle the challenges of OSTTA. However, the intrinsic conflict between entropy minimization and maximization inevitably leads to a trade-off between csID classification and csOOD detection. In this paper, we first analyze the limitations of entropy maximization in OSTTA and then introduce an angular loss to regulate feature norm magnitudes, along with a feature-norm loss to suppress csOOD logits, thereby improving OOD detection. These objectives form ROSETTA, a $\underline{r}$obust $\underline{o}$pen-$\underline{se}$t $\underline{t}$est-$\underline{t}$ime $\underline{a}$daptation. Our method achieves strong OOD detection while maintaining high ID classification performance on CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C and ImageNet-C.
Furthermore, experiments on Cityscapes validate the method's effectiveness in real-world semantic segmentation, and results on the HAC dataset demonstrate its applicability across different open-set TTA setups.
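The conflict between the two entropy objectives is easier to see with numbers. The sketch below computes the Shannon entropy of a softmax prediction: entropy minimization pushes predictions toward the confident case, entropy maximization toward the uniform case, so applying both to samples the model cannot cleanly separate pulls in opposite directions. The logits are made-up values, and this sketch does not implement ROSETTA's angular or feature-norm losses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over the logits."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)


confident = [10.0, 0.0, 0.0, 0.0]   # what entropy minimization encourages
uniform = [0.0, 0.0, 0.0, 0.0]      # what entropy maximization encourages

print(entropy(confident) < entropy(uniform))      # True
print(abs(entropy(uniform) - math.log(4)) < 1e-12)  # True: uniform entropy is log(K)
```

The maximum attainable entropy for K classes is log K, reached only by the uniform distribution.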

[92] Riemannian and Symplectic Geometry for Hierarchical Text-Driven Place Recognition

Tianyi Shang,Zhenyu Li

Main category: cs.CV

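TL;DR: This paper proposes SympLoc, a coarse-to-fine framework with multi-level alignment (instance, relation, and global levels) that addresses information loss in text-to-point-cloud localization, improving Top-1 recall@10m by 19% on the KITTI360Pose dataset.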

Details Motivation: Existing methods rely on global descriptors for similarity retrieval, which causes severe information loss and fails to capture discriminative scene structure. Method: In the SympLoc framework, the coarse stage has three alignment levels: 1) instance-level alignment uses Riemannian self-attention in hyperbolic space to establish direct correspondence between point-cloud objects and textual hints; 2) relation-level alignment models spatial relations between objects with the Information-Symplectic Relation Encoder (ISRE), combining the Fisher-Rao metric and Hamiltonian dynamics for uncertainty-aware, geometrically consistent propagation; 3) global-level alignment uses the Spectral Manifold Transform (SMT) to extract graph-spectral structural invariants and produce discriminative global descriptors. Fine-grained localization follows. Result: On the KITTI360Pose dataset, Top-1 recall@10m improves by 19% over the current state of the art. Conclusion: Through multi-level semantic alignment, SympLoc markedly improves the accuracy and robustness of text-to-point-cloud localization and offers a new paradigm for natural-language-driven spatial understanding. Abstract: Text-to-point-cloud localization enables robots to understand spatial positions through natural language descriptions, which is crucial for human-robot collaboration in applications such as autonomous driving and last-mile delivery. However, existing methods employ pooled global descriptors for similarity retrieval, which suffer from severe information loss and fail to capture discriminative scene structures. To address these issues, we propose SympLoc, a novel coarse-to-fine localization framework with multi-level alignment in the coarse stage. Different from previous methods that rely solely on global descriptors, our coarse stage consists of three complementary alignment levels: 1) Instance-level alignment establishes direct correspondence between individual object instances in point clouds and textual hints through Riemannian self-attention in hyperbolic space; 2) Relation-level alignment explicitly models pairwise spatial relationships between objects using the Information-Symplectic Relation Encoder (ISRE), which reformulates relation features through Fisher-Rao metric and Hamiltonian dynamics for uncertainty-aware geometrically consistent propagation; 3) Global-level alignment synthesizes discriminative global descriptors via the Spectral Manifold Transform (SMT) that extracts structural invariants through graph spectral analysis. This hierarchical alignment strategy progressively captures fine-grained to coarse-grained scene semantics, enabling robust cross-modal retrieval.
Extensive experiments on the KITTI360Pose dataset demonstrate that SympLoc achieves a 19% improvement in Top-1 recall@10m compared to existing state-of-the-art approaches.

[93] Towards Minimal Focal Stack in Shape from Focus

Khurram Ashfaq,Muhammad Tariq Mahmood

Main category: cs.CV

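TL;DR: This paper proposes a physics-based augmentation of a two-image focal stack that adds an all-in-focus (AiF) image and Energy-of-Difference (EOD) maps, together with a multi-scale ConvGRU network for iterative refinement, so that Shape from Focus (SFF) methods can estimate depth accurately from just two images.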

Details Motivation: Existing Shape from Focus (SFF) methods depend on large, densely sampled focal stacks, which limits their practicality; the number of input images needs to be reduced without losing precision. Method: A physics-inspired focal stack augmentation generates an all-in-focus (AiF) image from two input images along with the corresponding Energy-of-Difference (EOD) maps; a deep focus volume is then built, and the depth map is refined iteratively with multi-scale convolutional Gated Recurrent Units (ConvGRUs). Result: On synthetic and real-world datasets, the augmentation clearly improves several SFF models, which reach accuracy comparable to using large stacks with only two images, and the proposed approach maintains state-of-the-art performance. Conclusion: The work reduces SFF's dependence on large focal stacks and offers a feasible framework for lightweight, efficient depth reconstruction from two images. Abstract: Shape from Focus (SFF) is a depth reconstruction technique that estimates scene structure from focus variations observed across a focal stack, that is, a sequence of images captured at different focus settings. A key limitation of SFF methods is their reliance on densely sampled, large focal stacks, which limits their practical applicability. In this study, we propose a focal stack augmentation that enables SFF methods to estimate depth using a reduced stack of just two images, without sacrificing precision. We introduce a simple yet effective physics-based focal stack augmentation that enriches the stack with two auxiliary cues: an all-in-focus (AiF) image estimated from two input images, and Energy-of-Difference (EOD) maps, computed as the energy of differences between the AiF and input images. Furthermore, we propose a deep network that computes a deep focus volume from the augmented focal stacks and iteratively refines depth using convolutional Gated Recurrent Units (ConvGRUs) at multiple scales. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed augmentation benefits existing state-of-the-art SFF models, enabling them to achieve comparable accuracy. The results also show that our approach maintains state-of-the-art performance with a minimal stack size.
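As a rough illustration of the Energy-of-Difference cue, the sketch below takes "energy" to mean the per-pixel squared difference between the AiF image and an input image. That reading is an assumption: the abstract does not give the exact definition (for example, whether a local window is summed), and the tiny images are made-up values.

```python
def eod_map(aif, img):
    """Per-pixel squared difference between an all-in-focus image and one input.

    Both images are 2D lists of grayscale intensities with the same shape.
    Assumption: 'energy' is the squared difference, with no local windowing.
    """
    return [[(a - b) ** 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(aif, img)]


# Toy 2x2 example: the near-focused image matches the AiF on the left column,
# the far-focused image matches it on the right column.
aif = [[0.9, 0.2],
       [0.8, 0.1]]
near = [[0.9, 0.5],
        [0.8, 0.4]]
far = [[0.5, 0.2],
       [0.4, 0.1]]

eod_near = eod_map(aif, near)
eod_far = eod_map(aif, far)

# Where the EOD of an image is small, that image is close to in-focus there.
sharper_is_near = [[n < f for n, f in zip(rn, rf)]
                   for rn, rf in zip(eod_near, eod_far)]
print(sharper_is_near)  # [[True, False], [True, False]]
```

A low EOD value marks a region where an input already agrees with the sharp reference, which is the kind of focus cue a depth network can exploit.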

[94] F3DGS: Federated 3D Gaussian Splatting for Decentralized Multi-Agent World Modeling

Morui Zhu,Mohammad Dehghani Tezerjani,Mátyás Szántó,Márton Vaitkus,Song Fu,Qing Yang

Main category: cs.CV

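TL;DR: F3DGS is a federated 3D Gaussian Splatting framework for decentralized multi-agent 3D reconstruction; it initializes from a shared geometric scaffold and uses visibility-aware attribute aggregation to achieve high-quality distributed reconstruction without transmitting raw data.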

Details Motivation: Existing 3D Gaussian Splatting (3DGS) methods assume centralized data access and are hard to apply in distributed robotic settings, while extending them directly to multi-agent systems introduces communication overhead and geometric inconsistency. Method: F3DGS first registers each client's locally merged LiDAR point cloud to build a shared geometric scaffold that initializes the global 3DGS model; during federated optimization, Gaussian positions are fixed to preserve geometric alignment, each client updates only appearance-related attributes (covariance, opacity, spherical harmonic coefficients), and the server fuses the updates with visibility-weighted aggregation. Result: On a self-collected multi-sequence dataset with synchronized LiDAR, RGB, and IMU measurements, F3DGS achieves reconstruction quality comparable to centralized training while supporting distributed optimization across agents. Conclusion: F3DGS addresses communication efficiency, geometric consistency, and partial observability in multi-agent 3D reconstruction, providing a practical federated 3D reconstruction solution for privacy-sensitive and resource-constrained distributed systems. Abstract: We present F3DGS, a federated 3D Gaussian Splatting framework for decentralized multi-agent 3D reconstruction. Existing 3DGS pipelines assume centralized access to all observations, which limits their applicability in distributed robotic settings where agents operate independently, and centralized data aggregation may be restricted. Directly extending centralized training to multi-agent systems introduces communication overhead and geometric inconsistency. F3DGS first constructs a shared geometric scaffold by registering locally merged LiDAR point clouds from multiple clients to initialize a global 3DGS model. During federated optimization, Gaussian positions are fixed to preserve geometric alignment, while each client updates only appearance-related attributes, including covariance, opacity, and spherical harmonic coefficients. The server aggregates these updates using visibility-aware aggregation, weighting each client's contribution by how frequently it observed each Gaussian, resolving the partial-observability challenge inherent to multi-agent exploration. To evaluate decentralized reconstruction, we collect a multi-sequence indoor dataset with synchronized LiDAR, RGB, and IMU measurements. Experiments show that F3DGS achieves reconstruction quality comparable to centralized training while enabling distributed optimization across agents. The dataset, development kit, and source code will be publicly released.
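The visibility-aware aggregation described above, weighting each client's contribution by how frequently it observed each Gaussian, can be sketched as a per-Gaussian weighted average. This is a minimal sketch under stated assumptions: attributes are reduced to a single scalar (such as opacity), visibility is an integer observation count, and Gaussians that no client observed keep their previous global value. None of the names come from the F3DGS code.

```python
def aggregate(global_values, client_values, client_visibility):
    """Visibility-weighted average of one scalar attribute per Gaussian.

    global_values[i]        : current server value for Gaussian i
    client_values[k][i]     : client k's updated value for Gaussian i
    client_visibility[k][i] : how many times client k observed Gaussian i
    """
    n_clients = len(client_values)
    merged = []
    for i, old in enumerate(global_values):
        total = sum(client_visibility[k][i] for k in range(n_clients))
        if total == 0:
            merged.append(old)  # assumption: unseen Gaussians keep the old value
        else:
            weighted = sum(client_visibility[k][i] * client_values[k][i]
                           for k in range(n_clients))
            merged.append(weighted / total)
    return merged


# Two hypothetical clients, two Gaussians. Gaussian 0 was seen 3 times by
# client 0 and once by client 1; Gaussian 1 was seen by nobody.
new_opacity = aggregate(
    global_values=[0.5, 0.5],
    client_values=[[0.2, 0.9], [0.8, 0.1]],
    client_visibility=[[3, 0], [1, 0]],
)
print([round(v, 3) for v in new_opacity])  # [0.35, 0.5]
```

Compared with a plain average across clients, this keeps a client that barely saw a region from overwriting the estimate of a client that observed it many times.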

[95] NEMESIS: Noise-suppressed Efficient MAE with Enhanced Superpatch Integration Strategy

Kyeonghun Kim,Hyeonseok Jung,Youngung Han,Hyunsu Go,Eunseob Choi,Seongbin Park,Junsu Lim,Jiwon Yang,Sumin Lee,Insung Hwang,Ken Ying-Kai Liao,Nam-Joon Kim

Main category: cs.CV

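TL;DR: This paper proposes NEMESIS, a memory-efficient self-supervised learning framework for 3D CT imaging. It combines local superpatch (128×128×128) processing, noise-enhanced reconstruction, Masked Anatomical Transformer Blocks (MATB) with dual masking, and NEMESIS Tokens (NT) for cross-scale aggregation; it substantially improves low-label and self-supervised representation quality, surpasses existing methods on the BTCV benchmark, and greatly reduces computational cost.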

Details Motivation: Annotating 3D CT volumes is expensive, which calls for self-supervised learning; however, full-volume transformers have a high memory cost, and conventional masking strategies do not capture the anisotropic spatial structure of CT data well. Method: NEMESIS is a masked autoencoder framework operating on local 128×128×128 superpatches, with (i) noise-enhanced reconstruction as a pretext task, (ii) Masked Anatomical Transformer Blocks (MATB) that perform parallel plane-wise and axis-wise dual masking, and (iii) NEMESIS Tokens (NT) for cross-scale context aggregation. Result: On BTCV multi-organ classification, a frozen backbone with a linear classifier achieves a mean AUROC of 0.9633, better than SuPreM (0.9493) and VoCo (0.9387); with only 10% of labels the AUROC is still 0.9075; the cost of one forward pass drops to 31.0 GFLOPs (versus 985.8 GFLOPs for the full-volume baseline). Conclusion: NEMESIS preserves anatomical detail while substantially improving memory and label efficiency, offering a scalable and robust paradigm for self-supervised learning on 3D medical images. Abstract: Volumetric CT imaging is essential for clinical diagnosis, yet annotating 3D volumes is expensive and time-consuming, motivating self-supervised learning (SSL) from unlabeled data. However, applying SSL to 3D CT remains challenging due to the high memory cost of full-volume transformers and the anisotropic spatial structure of CT data, which is not well captured by conventional masking strategies. We propose NEMESIS, a masked autoencoder (MAE) framework that operates on local 128x128x128 superpatches, enabling memory-efficient training while preserving anatomical detail. NEMESIS introduces three key components: (i) noise-enhanced reconstruction as a pretext task, (ii) Masked Anatomical Transformer Blocks (MATB) that perform dual-masking through parallel plane-wise and axis-wise token removal, and (iii) NEMESIS Tokens (NT) for cross-scale context aggregation. On the BTCV multi-organ classification benchmark, NEMESIS with a frozen backbone and a linear classifier achieves a mean AUROC of 0.9633, surpassing fully fine-tuned SuPreM (0.9493) and VoCo (0.9387). Under a low-label regime with only 10% of available annotations, it retains an AUROC of 0.9075, demonstrating strong label efficiency. Furthermore, the superpatch-based design reduces computational cost to 31.0 GFLOPs per forward pass, compared to 985.8 GFLOPs for the full-volume baseline, providing a scalable and robust foundation for 3D medical imaging.

[96] Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Jiawei Chen,Simin Huang,Jiawei Du,Shuaihang Chen,Yu Tian,Mingjie Wei,Chao Yu,Zhaoxia Yin

Main category: cs.CV

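TL;DR: This paper proposes Tex3D, the first framework for end-to-end optimization of 3D adversarial textures against vision-language-action (VLA) models. Using Foreground-Background Decoupling (FBD) and Trajectory-Aware Adversarial Optimization (TAAO), it significantly degrades VLA performance in simulation and on real robots (task failure rates up to 96.7%), exposing critical vulnerabilities in the physical world.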

Details Motivation: The robustness of VLA models to physically realizable adversarial attacks is underexplored; 2D visual and language perturbation attacks lack physical realism, while 3D adversarial textures, a more realistic threat, are hard to optimize because simulators are not differentiable. Method: Foreground-Background Decoupling (FBD) enables differentiable texture optimization, and Trajectory-Aware Adversarial Optimization (TAAO) focuses on behaviorally critical frames and stabilizes optimization with a vertex-based parameterization; together they form the Tex3D framework, which optimizes 3D adversarial textures end to end within the VLA simulation environment. Result: In simulation and real-robot experiments, Tex3D sharply reduces success rates across multiple manipulation tasks, with task failure rates of up to 96.7%, confirming that 3D adversarial textures are both effective against VLA systems and physically feasible. Conclusion: VLA systems are vulnerable to 3D adversarial texture attacks in the physical world, revealing serious robustness flaws; robustness-aware training is needed to make real deployments safer. Abstract: Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, as they are naturally attached to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable optimization path from the VLA objective function back to object appearance, making it difficult to optimize in an end-to-end manner. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective across long-horizon and diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment.
Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7\%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.

[97] Automatic Image-Level Morphological Trait Annotation for Organismal Images

Vardaan Pahuja,Samuel Stevens,Alyson East,Sydne Record,Yu Su

Main category: cs.CV

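TL;DR: This paper proposes an automated morphological trait annotation method based on sparse autoencoders and foundation-model features, and builds Bioscan-Traits, a dataset of 80K trait annotations, improving the scalability and biological plausibility of morphological trait analysis in large-scale ecological studies.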

Details Motivation: Extracting morphological traits currently depends on experts and is slow, and the lack of high-quality datasets pairing images with trait annotations holds back large-scale ecological studies. Method: Sparse autoencoders are applied to foundation-model features to obtain monosemantic, spatially grounded neurons; salient-region localization is combined with vision-language prompting to generate interpretable trait descriptions. Result: The Bioscan-Traits dataset is built (19K insect images, 80K trait annotations); human evaluation confirms that the generated descriptions are biologically plausible; ablation studies measure the effect of each design choice. Conclusion: The modular annotation pipeline avoids expensive manual annotation, injects biologically meaningful supervision into foundation models, and helps bridge ecological relevance and machine-learning practicality. Abstract: Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. We assess design sensitivity through a comprehensive ablation study, systematically varying key design choices and measuring their impact on the quality of the resulting trait descriptions. By annotating traits with a modular pipeline rather than prohibitively expensive manual efforts, we offer a scalable way to inject biologically meaningful supervision into foundation models, enable large-scale morphological analyses, and bridge the gap between ecological relevance and machine-learning practicality.

[98] LivingWorld: Interactive 4D World Generation with Environmental Dynamics

Hyeongju Mun,In-Hwan Jin,Sohyeong Kim,Kyeongbo Kong

Main category: cs.CV

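TL;DR: LivingWorld is an interactive framework that generates 4D worlds with environmental dynamics (such as clouds, water, and smoke) from a single image, using a geometry-aware alignment module and a hash-based motion field to model dynamics that are globally coherent and low-latency.

Details Motivation: Existing 3D scene generation methods focus mainly on reconstructing static geometry. They do not model scene-scale environmental dynamics (such as clouds, water, and smoke) and struggle to keep motion consistent and interaction responsive as the scene expands. Method: The method progressively builds a globally coherent motion field; a geometry-aware alignment module resolves directional and scale ambiguities across views; a compact hash-based motion field supports efficient querying and stable propagation of dynamics, and bidirectional motion propagation produces long, temporally coherent 4D sequences. Result: On a single RTX 5090 GPU, each scene expansion step takes 9 seconds and motion alignment plus motion field updates take a further 3 seconds, which supports interactive 4D world generation; the generated dynamics are globally consistent and temporally coherent. Conclusion: LivingWorld generates 4D environmental dynamics from a single image, supports interactive expansion, and keeps motion globally consistent, offering a new approach to building immersive virtual worlds. Abstract: We introduce LivingWorld, an interactive framework for generating 4D worlds with environmental dynamics from a single image. While recent advances in 3D scene generation enable large-scale environment creation, most approaches focus primarily on reconstructing static geometry, leaving scene-scale environmental dynamics such as clouds, water, or smoke largely unexplored. Modeling such dynamics is challenging because motion must remain coherent across an expanding scene while supporting low-latency user feedback. LivingWorld addresses this challenge by progressively constructing a globally coherent motion field as the scene expands. To maintain global consistency during expansion, we introduce a geometry-aware alignment module that resolves directional and scale ambiguities across views. We further represent motion using a compact hash-based motion field, enabling efficient querying and stable propagation of dynamics throughout the scene. This representation also supports bidirectional motion propagation during rendering, producing long and temporally coherent 4D sequences without relying on expensive video-based refinement. On a single RTX 5090 GPU, generating each new scene expansion step requires 9 seconds, followed by 3 seconds for motion alignment and motion field updates, enabling interactive 4D world generation with globally coherent environmental dynamics. Video demonstrations are available at cvsp-lab.github.io/LivingWorld.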

[99] TOL: Textual Localization with OpenStreetMap

Youqi Liao,Shuhao Kang,Jingyu Xu,Olaf Wysocki,Yan Xia,Jianping Li,Zhen Dong,Bisheng Yang,Xieyuanli Chen

Main category: cs.CV

TL;DR: This paper formulates the text-to-OpenStreetMap (T2O) global localization task, builds TOL, a large-scale multi-city benchmark, and designs the coarse-to-fine TOLoc framework, which achieves accurate 2-DoF urban localization without geometric observations or a GNSS-based initial location.

Details Motivation: Existing localization methods rely on dense point clouds or high-resolution imagery, whereas OSM is compact, free, and semantically rich; text-to-OSM localization, however, has remained unexplored. Method: TOLoc is a coarse-to-fine framework: the coarse stage extracts direction-aware features for global retrieval, and the fine stage regresses the 2-DoF pose through a module that aligns textual and map features. The paper also builds TOL, a multi-continent, multi-city benchmark (121K text-map pairs covering 316 km of road trajectories). Result: TOLoc outperforms the best baseline by 6.53%, 9.93%, and 8.31% at the 5 m, 10 m, and 25 m thresholds, respectively, and generalizes well to unseen environments. Conclusion: Text-driven localization on OSM is feasible and efficient; TOLoc offers a new paradigm for natural-language localization on semantic maps and advances the combined use of open geographic data and language models. Abstract: Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O global localization task, which aims to estimate accurate 2 degree-of-freedom (DoF) positions in urban environments from textual scene descriptions without relying on geometric observations or GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and top-1 retrieved tile are jointly processed, where a dedicated alignment module fuses textual descriptor and local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.31% at 5m, 10m, and 25m thresholds, respectively, and shows strong generalization to unseen environments. Dataset, code and models will be publicly available at: https://github.com/WHU-USI3DV/TOL.
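
The thresholded results quoted for TOLoc can be read as "fraction of queries localized within a given distance of the ground truth". The abstract does not spell out the exact metric, so the sketch below is only one plausible reading; the function name `recall_at_thresholds` and the sample coordinates are hypothetical.

```python
import math


def recall_at_thresholds(predicted, ground_truth, thresholds_m=(5.0, 10.0, 25.0)):
    """Fraction of queries whose 2-DoF position error is within each threshold.

    Positions are (x, y) pairs in a local metric frame (metres). This is an
    assumed reading of the 5m/10m/25m thresholds, not the benchmark's code.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need equally sized, non-empty lists of (x, y) positions")
    errors = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return {t: sum(e <= t for e in errors) / len(errors) for t in thresholds_m}


# Hypothetical predictions and ground truth, for illustration only.
pred = [(0.0, 0.0), (3.0, 4.0), (20.0, 0.0), (100.0, 0.0)]
truth = [(0.0, 3.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
print(recall_at_thresholds(pred, truth))
```

The four position errors here are 3 m, 5 m, 20 m, and 100 m, so two of four queries fall within 5 m and 10 m, and three of four within 25 m.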

[100] MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label

Junyoung Jung,Seokwon Kim,Jun Uk Kim

Main category: cs.CV

TL;DR: This paper proposes a new framework for sparsely annotated monocular 3D object detection with two core modules, Road-Aware Patch Augmentation (RAPA) and Prototype-Based Filtering (PBF), to improve detection when annotations are scarce.

Details Motivation: Monocular 3D object detection performs well on densely annotated datasets, but in practice 3D annotation is expensive and only sparse labels are often available, which degrades model performance. Method: Road-Aware Patch Augmentation (RAPA) pastes object patches onto road regions in a geometrically consistent way; Prototype-Based Filtering (PBF) combines prototype similarity with depth uncertainty to produce high-quality pseudo-labels. Result: Experiments show that the method improves detection performance and robustness in the sparsely annotated setting. Conclusion: By combining geometry-preserving data augmentation with prototype-guided pseudo-labeling, the framework mitigates the lack of supervision caused by sparse annotation and offers a workable solution for real-world use. Abstract: Monocular 3D object detection has achieved impressive performance on densely annotated datasets. However, it struggles when only a fraction of objects are labeled due to the high cost of 3D annotation. This sparsely annotated setting is common in real-world scenarios where annotating every object is impractical. To address this, we propose a novel framework for sparsely annotated monocular 3D object detection with two key modules. First, we propose Road-Aware Patch Augmentation (RAPA), which leverages sparse annotations by augmenting segmented object patches onto road regions while preserving 3D geometric consistency. Second, we propose Prototype-Based Filtering (PBF), which generates high-quality pseudo-labels by filtering predictions through prototype similarity and depth uncertainty. It maintains global 2D RoI feature prototypes and selects pseudo-labels that are both feature-consistent with learned prototypes and have reliable depth estimates. Our training strategy combines geometry-preserving augmentation with prototype-guided pseudo-labeling to achieve robust detection under sparse supervision. Extensive experiments demonstrate the effectiveness of the proposed method. The source code is available at https://github.com/VisualAIKHU/MonoSAOD.

[101] Moiré Video Authentication: A Physical Signature Against AI Video Generation

Yuan Qing,Kunyu Zheng,Lingxiao Li,Boqing Gong,Chang Xiao

Main category: cs.CV

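TL;DR: This paper proposes a physics-based authentication signature built on the Moiré effect. It exploits an optical phenomenon that arises naturally when a real camera films a scene, namely the linear coupling between Moiré fringe phase and grating image displacement, to distinguish real videos from AI-generated ones. The coupling is fixed by optical geometry, current generative models cannot reproduce it faithfully, and experiments validate it against multiple state-of-the-art video generators.

Details Motivation: As video generation improves, AI-synthesized content is increasingly hard to tell apart from real footage, which calls for a detection method that is physically verifiable and hard to forge. Method: The paper derives the "Moiré motion invariant": for the Moiré interference fringes produced when a camera views a two-layer grating structure, the fringe phase and the grating image displacement are linearly related through optical geometry. A verifier extracts both signals from a video and tests their correlation. Result: On real-captured videos and videos from multiple state-of-the-art AI generators, real videos show a strong linear correlation, while AI-generated videos show significantly weaker or irregular correlation, so the two can be separated robustly. Conclusion: Deterministic optical phenomena such as the Moiré effect can serve as physically grounded, verifiable anti-forgery signatures, offering a new paradigm for detecting AI-generated video. Abstract: Recent advances in video generation have made AI-synthesized content increasingly difficult to distinguish from real footage. We propose a physics-based authentication signature that real cameras produce naturally, but that generative models cannot faithfully reproduce. Our approach exploits the Moiré effect: the interference fringes formed when a camera views a compact two-layer grating structure. We derive the Moiré motion invariant, showing that fringe phase and grating image displacement are linearly coupled by optical geometry, independent of viewing distance and grating structure. A verifier extracts both signals from video and tests their correlation. We validate the invariant on both real-captured and AI-generated videos from multiple state-of-the-art generators, and find that real and AI-generated videos produce significantly different correlation signatures, suggesting a robust means of differentiating them. Our work demonstrates that deterministic optical phenomena can serve as physically grounded, verifiable signatures against AI-generated video.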
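
The verifier described above tests whether fringe phase and grating displacement move together linearly. A minimal sketch of such a test, assuming both signals have already been extracted per frame, is a Pearson correlation check. The function names, the synthetic signals, and the 0.9 threshold are all hypothetical and are not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant signal carries no linear relationship
    return sxy / math.sqrt(sxx * syy)


def looks_linearly_coupled(phase, displacement, min_abs_r=0.9):
    """Assumed decision rule: accept when |r| reaches a hypothetical threshold."""
    return abs(pearson_r(phase, displacement)) >= min_abs_r


# Synthetic per-frame signals, for illustration only.
displacement = [0.5 * t for t in range(20)]
coupled_phase = [3.0 * d + 0.1 for d in displacement]     # exactly linear in displacement
uncoupled_phase = [math.sin(1.7 * t) for t in range(20)]  # unrelated oscillation

print(looks_linearly_coupled(coupled_phase, displacement))
print(looks_linearly_coupled(uncoupled_phase, displacement))
```

A real verifier would also need the signal extraction step and a principled threshold; both are outside what the abstract specifies.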

[102] DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data

Wonjoon Jin,Jiyun Won,Janghyeok Han,Qi Dai,Chong Luo,Seung-Hwan Baek,Sunghyun Cho

Main category: cs.CV

TL;DR: DynaVid is a two-stage video synthesis framework that trains diffusion models on synthetic optical flow (rather than full synthetic videos) to improve dynamic motion modeling and motion controllability while avoiding the unnatural appearance of synthetic footage.

Details Motivation: Existing video diffusion models perform poorly on highly dynamic motion and on fine-grained motion control, mainly because such examples are scarce in real training data. Method: The DynaVid framework (1) renders synthetic optical flow (not full videos) with computer graphics pipelines as the motion signal, and (2) generates in two stages: a motion generator first synthesizes optical flow, and a motion-guided video generator then produces realistic video frames conditioned on it. Result: On two challenging tasks, vigorous human motion generation and extreme camera motion control, DynaVid improves the dynamic realism and motion controllability of the generated videos. Conclusion: Training on synthetic optical flow instead of synthetic video decouples motion modeling from appearance modeling and strengthens dynamic motion modeling without harming visual realism. Abstract: Despite recent progress, video diffusion models still struggle to synthesize realistic videos involving highly dynamic motions or requiring fine-grained motion controllability. A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data in training, which is represented as optical flow and rendered using computer graphics pipelines. This approach offers two key advantages. First, synthetic motion offers diverse motion patterns and precise control signals that are difficult to obtain from real data. Second, unlike rendered videos with artificial appearances, rendered optical flow encodes only motion and is decoupled from appearance, thereby preventing models from reproducing the unnatural look of synthetic videos. Building on this idea, DynaVid adopts a two-stage generation framework: a motion generator first synthesizes motion, and then a motion-guided video generator produces video frames conditioned on that motion. This decoupled formulation enables the model to learn dynamic motion patterns from synthetic data while preserving visual realism from real-world videos. We validate our framework on two challenging scenarios, vigorous human motion generation and extreme camera motion control, where existing datasets are particularly limited. Extensive experiments demonstrate that DynaVid improves the realism and controllability in dynamic motion generation and camera motion control.

[103] Robust Embodied Perception in Dynamic Environments via Disentangled Weight Fusion

Juncen Guo,Xiaoguang Zhu,Jingyi Wu,Jingyu Zhang,Jingnan Cai,Zhenghao Niu,Liang Song

Main category: cs.CV

TL;DR: This paper proposes an incremental learning framework that needs neither domain IDs nor stored past samples, using disentangled representations and a weight fusion strategy to make embodied multimedia systems adapt robustly and continually in dynamic environments.

Details Motivation: Existing domain-incremental perception methods depend on a domain ID known in advance at test time and tend to overfit scene-specific noise, which leads to poor generalization and catastrophic forgetting. Method: A disentangled representation mechanism removes environmental style interference and focuses on semantic features shared across scenes; a weight fusion strategy dynamically merges knowledge of old and new environments without storing historical data. Result: On multiple standard benchmark datasets, the method significantly reduces catastrophic forgetting and, in a setting with no domain IDs and no exemplars, is more accurate than existing state-of-the-art methods. Conclusion: The framework improves the environmental adaptivity, generalization, and stability of embodied perception systems that interact continuously in open physical spaces. Abstract: Embodied perception systems face severe distribution drift in dynamic environments when they interact continuously in open physical spaces. However, existing domain-incremental perception methods often rely on a domain ID obtained in advance at test time, which limits their practicality in unknown interaction scenarios. At the same time, the model often overfits to context-specific perceptual noise, which leads to insufficient generalization ability and catastrophic forgetting. To address these limitations, we propose a domain-ID-free and exemplar-free incremental learning framework for embodied multimedia systems, which aims to achieve robust continuous environment adaptation. The method designs a disentangled representation mechanism to remove non-essential environmental style interference and to guide the model to focus on extracting intrinsic semantic features shared across scenes, thereby reducing perceptual uncertainty and improving generalization. We further use a weight fusion strategy to dynamically integrate old and new environment knowledge in the parameter space, so that the model adapts to the new distribution without storing historical data while retaining as much of its discrimination ability on old environments as possible. Extensive experiments on multiple standard benchmark datasets show that the proposed method significantly reduces catastrophic forgetting in a completely exemplar-free and domain-ID-free setting, and that its accuracy exceeds that of existing state-of-the-art methods.
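
Weight fusion in parameter space can be illustrated by linearly interpolating old and new parameters. The abstract does not state the actual fusion rule, so the sketch below uses plain interpolation as a stand-in; the function `fuse_weights`, the coefficient `alpha`, and the toy parameter names are hypothetical.

```python
def fuse_weights(old_params, new_params, alpha=0.5):
    """Return (1 - alpha) * old + alpha * new for every named parameter vector.

    Linear interpolation is an assumed stand-in for the paper's fusion rule.
    Parameters are plain lists of floats keyed by name.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if old_params.keys() != new_params.keys():
        raise ValueError("both models must expose the same parameter names")
    fused = {}
    for name, old in old_params.items():
        new = new_params[name]
        if len(old) != len(new):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        fused[name] = [(1.0 - alpha) * o + alpha * n for o, n in zip(old, new)]
    return fused


# Toy parameters for illustration only.
old = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
new = {"layer.weight": [3.0, 0.0], "layer.bias": [1.0]}
fused = fuse_weights(old, new, alpha=0.25)
print(fused)
```

With `alpha=0.25` the fused model stays close to the old environment's weights, which is the side of the trade-off that limits forgetting. Only the fused parameters are kept; no past samples are stored.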

[104] HOT: Harmonic-Constrained Optimal Transport for Remote Photoplethysmography Domain Adaptation

Ba-Thinh Nguyen,Thi-Duyen Ngo,Thanh-Trung Huynh,Thanh-Ha Le,Huy-Hieu Pham

Main category: cs.CV

TL;DR: This paper proposes frequency domain adaptation (FDA) and Harmonic-Constrained Optimal Transport (HOT) to improve the robustness and generalization of remote photoplethysmography (rPPG) models under domain shift.

Details Motivation: Existing deep learning rPPG methods tend to overfit appearance-related factors such as illumination and camera characteristics, so their cross-domain performance drops sharply. Method: Frequency domain adaptation (FDA) models appearance variation, and Harmonic-Constrained Optimal Transport (HOT) uses the harmonic property of cardiac signals to align the original and FDA-transferred representations. Result: Cross-dataset experiments on multiple datasets show that the FDA and HOT framework improves the robustness and generalization of rPPG models. Conclusion: Combining FDA with HOT separates appearance variation from the physiological signal and makes rPPG models more practical in real, variable environments. Abstract: Remote photoplethysmography (rPPG) enables non-contact physiological measurement from facial videos; however, its practical deployment is often hindered by substantial performance degradation under domain shift. While recent deep learning-based rPPG methods have achieved strong performance on individual datasets, they frequently overfit to appearance-related factors, such as illumination, camera characteristics, and color response, that vary significantly across domains. To address this limitation, we introduce frequency domain adaptation (FDA) as a principled strategy for modeling appearance variation in rPPG. By transferring low-frequency spectral components that encode domain-dependent appearance characteristics, FDA encourages rPPG models to learn invariance to appearance variations while retaining cardiac-induced signals. To further support physiologically consistent alignment under such appearance variation, we propose Harmonic-Constrained Optimal Transport (HOT), which leverages the harmonic property of cardiac signals to guide alignment between original and FDA-transferred representations. Extensive cross-dataset experiments demonstrate that the proposed FDA and HOT framework effectively enhances the robustness and generalization of rPPG models across diverse datasets.

[105] GPA: Learning GUI Process Automation from Demonstrations

Zirui Zhao,Jun Hao Liew,Yan Yang,Wenzhuo Yang,Ziyang Luo,Doyen Sahoo,Silvio Savarese,Junnan Li

Main category: cs.CV

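TL;DR: This paper presents GUI Process Automation (GPA), a lightweight, vision-based robotic process automation method. Using Sequential Monte Carlo localization, readiness calibration, and fully local execution, it automates GUI tasks robustly, deterministically, and privately, and in a pilot experiment it outperforms Gemini 3 Pro (with CUA tools).

Details Motivation: The work addresses the fragility of traditional RPA and the non-determinism risk of current vision-language-model GUI agents, and targets the adaptability, robustness, and security that enterprise workflows require. Method: Sequential Monte Carlo-based localization improves robustness, readiness calibration safeguards determinism and reliability, and fast, fully local execution protects privacy; GPA can also be exposed as an MCP/CLI tool for other agents with coding capabilities. Result: In a pilot experiment on long-horizon GUI tasks, GPA achieves a higher success rate than Gemini 3 Pro (with CUA tools) and executes 10 times faster. Conclusion: GPA is an efficient, stable, and secure approach to GUI automation that is practical for enterprise use and can be combined with other agents. Abstract: GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA) method that enables fast and stable process replay from only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision language model-based GUI agents, GPA introduces three core benefits: (1) robustness via Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty; (2) determinism and reliability safeguarded by readiness calibration; and (3) privacy through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. It can also be used as an MCP/CLI tool by other agents with coding capabilities so that the agent only reasons and orchestrates while GPA handles the GUI execution. We conducted a pilot experiment to compare GPA with Gemini 3 Pro (with CUA tools) and found that GPA achieves a higher success rate with 10 times faster execution in finishing long-horizon GUI tasks.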

[106] Director: Instance-aware Gaussian Splatting for Dynamic Scene Modeling and Understanding

Yuheng Jiang,Yiwen Cai,Zihao Wang,Yize Wu,Sicheng Li,Zhuo Su,Shaohui Jiao,Lan Xu

Main category: cs.CV

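TL;DR: This paper presents Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics. With language-aligned semantic supervision and optical-flow-guided motion refinement, it produces 4D reconstructions of dynamic scenes that are stable, segmentable, and open to open-vocabulary queries.

Details Motivation: Existing Gaussian-based volumetric video methods render with high quality but do not model instance-level structure, which makes stable tracking and semantic reasoning difficult. Method: Instance-consistent semantic embeddings are introduced: sentence embeddings produced by multimodal large language models and temporally aligned instance masks supervise the semantic features of each Gaussian; 2D optical flow refines the Gaussian motions to improve temporal stability; geometry-aware SDF constraints and a surface-continuity regularizer improve the temporal coherence of the dynamic foreground. Result: Director keeps rendering fidelity high while producing temporally coherent 4D reconstructions, and supports instance segmentation and open-vocabulary semantic querying. Conclusion: Embedding language-aligned instance semantics and modeling motion with optical-flow guidance improves the structural stability and semantic interpretability of 4D Gaussian representations in dynamic scenes. Abstract: Volumetric video seeks to model dynamic scenes as temporally coherent 4D representations. While recent Gaussian-based approaches achieve impressive rendering fidelity, they primarily emphasize appearance but are largely agnostic to instance-level structure, limiting stable tracking and semantic reasoning in highly dynamic scenarios. In this paper, we present Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics. Our key insight is that embedding instance-consistent semantics naturally complements 4D modeling, enabling more accurate scene decomposition while supporting robust dynamic scene understanding. To this end, we leverage temporally aligned instance masks and sentence embeddings derived from Multimodal Large Language Models to supervise the learnable semantic features of each Gaussian via two MLP decoders, enabling language-aligned 4D representations and enforcing identity consistency over time. To enhance temporal stability, we bridge 2D optical flow with 4D Gaussians and finetune their motions, yielding reliable initialization and reducing drift. For training, we further introduce geometry-aware SDF constraints, along with regularization terms that enforce surface continuity, enhancing temporal coherence in dynamic foreground modeling. Experiments demonstrate that Director achieves temporally coherent 4D reconstructions while simultaneously enabling instance segmentation and open-vocabulary querying.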

[107] BTS-rPPG: Orthogonal Butterfly Temporal Shifting for Remote Photoplethysmography

Ba-Thinh Nguyen,Thi-Duyen Ngo,Thanh-Trung Huynh,Thanh-Ha Le,Huy-Hieu Pham

Main category: cs.CV

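TL;DR: This paper proposes BTS-rPPG, a framework that uses Orthogonal Butterfly Temporal Shifting (BTS) and an orthogonal feature transfer mechanism (OFT) to strengthen long-range temporal modeling in remote photoplethysmography (rPPG) and improve the accuracy of physiological signal estimation.

Details Motivation: Existing deep learning methods for rPPG mostly rely on local temporal operations (such as temporal shifting or convolution), which limits the temporal receptive field and makes relationships between distant frames hard to model. Method: The paper proposes a temporal modeling framework based on Orthogonal Butterfly Temporal Shifting (BTS), which borrows the butterfly communication structure of the FFT and uses an XOR pairing schedule for structured frame interaction; it also introduces an orthogonal feature transfer mechanism (OFT) that filters redundant features before temporal shifting and passes on only the component orthogonal to the target context. Result: Experiments on multiple benchmark datasets show that BTS-rPPG improves long-range modeling of physiological dynamics and consistently outperforms existing temporal modeling methods. Conclusion: Through structured interaction between distant frames and orthogonal feature filtering, BTS-rPPG overcomes the limits of local temporal modeling in rPPG and offers a more robust, more global temporal modeling paradigm for contactless physiological sensing. Abstract: Remote photoplethysmography (rPPG) enables contactless physiological sensing from facial videos by analyzing subtle appearance variations induced by blood circulation. However, modeling the temporal dynamics of these signals remains challenging, as many deep learning methods rely on temporal shifting or convolutional operators that aggregate information primarily from neighboring frames, resulting in predominantly local temporal modeling and limited temporal receptive fields. To address this limitation, we propose BTS-rPPG, a temporal modeling framework based on Orthogonal Butterfly Temporal Shifting (BTS). Inspired by the butterfly communication pattern in the Fast Fourier Transform (FFT), BTS establishes structured frame interactions via an XOR-based butterfly pairing schedule, progressively expanding the temporal receptive field and enabling efficient propagation of information across distant frames. Furthermore, we introduce an orthogonal feature transfer mechanism (OFT) that filters the source feature with respect to the target context before temporal shifting, retaining only the orthogonal component for cross-frame transmission. This reduces redundant feature propagation and encourages complementary temporal interaction. Extensive experiments on multiple benchmark datasets demonstrate that BTS-rPPG improves long-range temporal modeling of physiological dynamics and consistently outperforms existing temporal modeling strategies for rPPG estimation.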
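
The XOR-based butterfly schedule and the orthogonal filtering step can both be sketched in a few lines. The sketch assumes the FFT-style pairing in which frame `i` talks to frame `i XOR 2**stage`, and it assumes "orthogonal component" means the source feature minus its projection onto the target. The abstract confirms neither detail, and all function names are hypothetical.

```python
def butterfly_partners(num_frames, stage):
    """Partner of each frame at a given stage: index XOR 2**stage (assumed schedule)."""
    if num_frames <= 0 or num_frames & (num_frames - 1):
        raise ValueError("num_frames must be a power of two")
    stride = 1 << stage
    if stride >= num_frames:
        raise ValueError("stage is too large for this clip length")
    return [i ^ stride for i in range(num_frames)]


def receptive_fields(num_frames):
    """Set of source frames each frame has information from after log2(T) stages."""
    seen = [{i} for i in range(num_frames)]
    num_stages = num_frames.bit_length() - 1
    for stage in range(num_stages):
        partners = butterfly_partners(num_frames, stage)
        seen = [seen[i] | seen[partners[i]] for i in range(num_frames)]
    return seen


def orthogonal_component(source, target):
    """Source feature minus its projection onto the target feature (assumed OFT step)."""
    target_norm_sq = sum(t * t for t in target)
    if target_norm_sq == 0.0:
        return list(source)  # nothing to project onto
    scale = sum(s * t for s, t in zip(source, target)) / target_norm_sq
    return [s - scale * t for s, t in zip(source, target)]


print(butterfly_partners(8, 0))
print(butterfly_partners(8, 2))
print(all(field == set(range(8)) for field in receptive_fields(8)))
print(orthogonal_component([1.0, 1.0], [1.0, 0.0]))
```

For an 8-frame clip, three stages are enough for every frame to receive information from all eight, which is the receptive-field growth the abstract attributes to the butterfly pattern.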

[108] From Understanding to Erasing: Towards Complete and Stable Video Object Removal

Dingming Liu,Wenjing Wang,Chen Li,Jing Lyu

Main category: cs.CV

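TL;DR: This paper proposes a video object removal method that combines external knowledge distillation with internal context modeling, improving the model's understanding of target objects and their physical effects (such as shadows and reflections) to produce more consistent and cleaner video completion.

Details Motivation: Existing diffusion models for video object removal struggle to remove the side effects that objects cause (such as shadows, reflections, and illumination changes), mainly because they understand too little about the target object and its physical and semantic interactions with the scene. Method: (1) Externally, knowledge distillation transfers the relationships between objects and their induced effects from vision foundation models to video diffusion models; (2) internally, a framewise context cross-attention mechanism makes each denoising block attend to the unmasked context around the target region. Result: The method reaches state-of-the-art performance on multiple metrics, and the paper establishes the first real-world benchmark for video object removal. Conclusion: Combining external and internal understanding improves the physical plausibility and spatio-temporal consistency of video object removal and moves the task toward more practical, robust use. Abstract: Video object removal aims to eliminate target objects from videos while plausibly completing missing regions and preserving spatio-temporal consistency. Although diffusion models have recently advanced this task, it remains challenging to remove object-induced side effects (e.g., shadows, reflections, and illumination changes) without compromising overall coherence. This limitation stems from the insufficient physical and semantic understanding of the target object and its interactions with the scene. In this paper, we propose to introduce understanding into erasing from two complementary perspectives. Externally, we introduce a distillation scheme that transfers the relationships between objects and their induced effects from vision foundation models to video diffusion models. Internally, we propose a framewise context cross-attention mechanism that grounds each denoising block in informative, unmasked context surrounding the target region. External and internal guidance jointly enable our model to understand the target object, its induced effects, and the global background context, resulting in clear and coherent object removal. Extensive experiments demonstrate our state-of-the-art performance, and we establish the first real-world benchmark for video object removal to facilitate future research and community progress. Our code, data, and models are available at: https://github.com/WeChatCV/UnderEraser.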

[109] Can Video Diffusion Models Predict Past Frames? Bidirectional Cycle Consistency for Reversible Interpolation

Lingyu Liu,Yaxiong Wang,Li Zhu,Zhedong Zheng

Main category: cs.CV

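TL;DR: This paper proposes a bidirectional video frame interpolation framework based on temporal cycle consistency. With learnable directional tokens and a curriculum learning strategy, it improves motion consistency and quality for long-sequence interpolation without adding inference cost.

Details Motivation: Existing generative video frame interpolation methods are mostly unidirectional and have no mechanism to self-verify temporal consistency, which leads to motion drift, directional ambiguity, and boundary misalignment, especially in long sequences. Method: The paper proposes a bidirectional cycle-consistent framework: learnable directional tokens explicitly encode temporal direction, and a shared backbone jointly optimizes forward synthesis and backward reconstruction; curriculum learning moves training gradually from short to long sequences; the cycle constraints are used only during training, and inference remains a single forward pass. Result: On both the 37-frame and 73-frame long-sequence tasks, the method achieves state-of-the-art imaging quality, motion smoothness, and dynamic control, outperforming strong baselines with no additional computational overhead. Conclusion: Temporal cycle consistency is an effective regularizer that makes generated motion paths more reversible and stable; the bidirectional framework balances performance and efficiency and offers a new paradigm for long-range video interpolation. Abstract: Video frame interpolation aims to synthesize realistic intermediate frames between given endpoints while adhering to specific motion semantics. While recent generative models have improved visual fidelity, they predominantly operate in a unidirectional manner, lacking mechanisms to self-verify temporal consistency. This often leads to motion drift, directional ambiguity, and boundary misalignment, especially in long-range sequences. Inspired by the principle of temporal cycle-consistency in self-supervised learning, we propose a novel bidirectional framework that enforces symmetry between forward and backward generation trajectories. Our approach introduces learnable directional tokens to explicitly condition a shared backbone on temporal orientation, enabling the model to jointly optimize forward synthesis and backward reconstruction within a single unified architecture. This cycle-consistent supervision acts as a powerful regularizer, ensuring that generated motion paths are logically reversible. Furthermore, we employ a curriculum learning strategy that progressively trains the model from short to long sequences, stabilizing dynamics across varying durations. Crucially, our cyclic constraints are applied only during training; inference requires a single forward pass, maintaining the high efficiency of the base model. Extensive experiments show that our method achieves state-of-the-art performance in imaging quality, motion smoothness, and dynamic control on both 37-frame and 73-frame tasks, outperforming strong baselines while incurring no additional computational overhead.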

[110] Bias mitigation in graph diffusion models

Meng Yu,Kun Zhan

Main category: cs.CV

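TL;DR: This paper proposes a comprehensive approach that mitigates reverse-starting bias and exposure bias in graph diffusion models through a new Langevin sampling algorithm and a score correction mechanism, improving generation quality without modifying the network.

Details Motivation: Existing graph diffusion models suffer from significant bias: the forward diffusion's maximum perturbation distribution deviates from the standard Gaussian, which causes a reverse-starting bias, and diffusion models carry an inherent exposure bias; together these degrade generation quality. Method: To mitigate reverse-starting bias, a newly designed Langevin sampling algorithm aligns the starting point with the forward maximum perturbation distribution; to address exposure bias, a score correction mechanism based on a newly defined score difference is introduced. The approach requires no changes to the network architecture. Result: The method is validated across multiple models, datasets, and tasks and achieves state-of-the-art results. Conclusion: The method mitigates both key biases in graph diffusion models and improves generation quality without increasing model complexity. Abstract: Most existing graph diffusion models have significant bias problems. We observe that the forward diffusion's maximum perturbation distribution in most models deviates from the standard Gaussian distribution, while reverse sampling consistently starts from a standard Gaussian distribution, which results in a reverse-starting bias. Together with the inherent exposure bias of diffusion models, this results in degraded generation quality. This paper proposes a comprehensive approach to mitigate both biases. To mitigate reverse-starting bias, we employ a newly designed Langevin sampling algorithm to align with the forward maximum perturbation distribution, establishing a new reverse-starting point. To address the exposure bias, we introduce a score correction mechanism based on a newly defined score difference. Our approach, which requires no network modifications, is validated across multiple models, datasets, and tasks, achieving state-of-the-art results. Code is at https://github.com/kunzhan/spp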
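
The idea of moving a standard-Gaussian starting point toward a different target distribution with Langevin sampling can be shown on a toy case. The sketch below is generic unadjusted Langevin dynamics, not the paper's newly designed algorithm. The 1D Gaussian target (mean 2.0, standard deviation 0.5) is a hypothetical stand-in for the forward maximum perturbation distribution, and its score is known in closed form here rather than learned.

```python
import math
import random


def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of a 1D Gaussian N(mean, std**2)."""
    return (mean - x) / (std * std)


def langevin_sample(mean, std, steps=300, step_size=0.01, rng=random):
    """One chain of unadjusted Langevin dynamics toward N(mean, std**2).

    The chain starts from a standard Gaussian draw, mirroring the usual
    reverse-sampling start, and drifts toward the assumed target.
    """
    x = rng.gauss(0.0, 1.0)
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step_size * gaussian_score(x, mean, std) + math.sqrt(2.0 * step_size) * noise
    return x


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [langevin_sample(2.0, 0.5, rng=rng) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(round(sample_mean, 2), round(sample_var, 2))
```

After enough steps the samples cluster around the target mean with roughly the target variance (0.25), with a small discretization bias that grows with `step_size`. A graph diffusion model would use its learned score network in place of `gaussian_score`.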

[111] End-to-End Shared Attention Estimation via Group Detection with Feedback Refinement

Chihiro Nakatani,Norimichi Ukita,Jean-Marc Odobez

Main category: cs.CV

TL;DR: This paper proposes a new method for end-to-end shared attention (SA) estimation via group detection, jointly optimizing group detection and SA heatmap generation to improve performance.

Details Motivation: Previous methods either do not explicitly detect the group of people attending to the same target or assume a single SA point per image, which limits practical applicability and performance. Method: A two-step process is used: (i) SA heatmaps are generated from individual gaze heatmaps and group membership scalars; (ii) the initial SA heatmaps are fed back to refine group memberships, and the final SA heatmap is predicted. Result: The method outperforms existing methods on both group detection and shared attention estimation, and additional analyses validate each component. Conclusion: The joint modeling approach overcomes the neglect of group structure in conventional SA estimation and improves both practicality and accuracy. Abstract: This paper proposes an end-to-end shared attention estimation method via group detection. Most previous methods estimate shared attention (SA) without detecting the actual group of people focusing on it, or assume that there is a single SA point in a given image. These issues limit the applicability of SA detection in practice and impact performance. To address them, we propose to simultaneously achieve group detection and shared attention estimation using a two-step process: (i) the generation of SA heatmaps relying on individual gaze attention heatmaps and group membership scalars estimated in a group inference; (ii) a refinement of the initial group memberships that accounts for the initial SA heatmaps, followed by the final prediction of the SA heatmap. Experiments demonstrate that our method outperforms other methods in group detection and shared attention estimation. Additional analyses validate the effectiveness of the proposed components. Code: https://github.com/chihina/sagd-CVPRW2026.

[112] SteerFlow: Steering Rectified Flows for Faithful Inversion-Based Image Editing

Thinh Dao,Zhen Wang,Kien T. Pham,Long Chen

Main category: cs.CV

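TL;DR: This paper proposes SteerFlow, a model-agnostic framework for text-guided image editing. By introducing an Amortized Fixed-Point Solver, Trajectory Interpolation, and an Adaptive Masking mechanism, it improves editing quality and background preservation while keeping the result faithful to the source image, and it supports multi-turn editing.

Details Motivation: Existing flow-based generative models struggle to balance source fidelity and editing flexibility in text-guided image editing: higher-order solvers add computation, truncated inversion limits editability, and feature injection methods do not transfer across architectures. Method: SteerFlow has three parts: (1) in the forward process, an Amortized Fixed-Point Solver implicitly straightens the forward trajectory by enforcing velocity consistency across consecutive timesteps; (2) in the backward process, Trajectory Interpolation adaptively blends target-editing and source-reconstruction velocities; (3) an Adaptive Masking mechanism spatially constrains the edit using concept-guided segmentation and source-target velocity differences. Result: Experiments on FLUX.1-dev and Stable Diffusion 3.5 Medium show that SteerFlow surpasses existing methods in editing quality, source fidelity, and background preservation, and supports multi-turn editing without drift. Conclusion: SteerFlow is a model-agnostic, extensible text-guided image editing framework with strong theoretical guarantees on source fidelity, and it addresses key limits of existing methods in fidelity, flexibility, and transferability. Abstract: Recent advances in flow-based generative models have enabled training-free, text-guided image editing by inverting an image into its latent noise and regenerating it under a new target conditional guidance. However, existing methods struggle to preserve source fidelity: higher-order solvers incur additional model inferences, truncated inversion constrains editability, and feature injection methods lack architectural transferability. To address these limitations, we propose SteerFlow, a model-agnostic editing framework with strong theoretical guarantees on source fidelity. In the forward process, we introduce an Amortized Fixed-Point Solver that implicitly straightens the forward trajectory by enforcing velocity consistency across consecutive timesteps, yielding a high-fidelity inverted latent. In the backward process, we introduce Trajectory Interpolation, which adaptively blends target-editing and source-reconstruction velocities to keep the editing trajectory anchored to the source. To further improve background preservation, we introduce an Adaptive Masking mechanism that spatially constrains the editing signal with concept-guided segmentation and source-target velocity differences. Extensive experiments on FLUX.1-dev and Stable Diffusion 3.5 Medium demonstrate that SteerFlow consistently achieves better editing quality than existing methods. Finally, we show that SteerFlow extends naturally to a complex multi-turn editing paradigm without accumulating drift.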
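
Blending a source-reconstruction velocity with a target-editing velocity can be sketched as a convex combination inside an Euler integration loop. The abstract does not define how the blend weight is chosen adaptively, so `weight_fn` below is a hypothetical placeholder, as are the function names, the time convention (integrating from t = 0 to t = 1), and the constant toy velocity fields.

```python
def blend_velocity(v_source, v_target, weight):
    """Convex combination: weight 0 follows the source, weight 1 follows the target."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [(1.0 - weight) * s + weight * t for s, t in zip(v_source, v_target)]


def euler_edit(latent, source_velocity_fn, target_velocity_fn, weight_fn, num_steps=4):
    """Integrate the blended velocity field with fixed-step Euler updates.

    weight_fn(t) is an assumed schedule standing in for the paper's adaptive rule.
    """
    x = list(latent)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = blend_velocity(source_velocity_fn(x, t), target_velocity_fn(x, t), weight_fn(t))
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy constant velocity fields in a 2D latent space, for illustration only.
def toward_source(x, t):
    return [1.0, 0.0]


def toward_target(x, t):
    return [0.0, 1.0]


start = [0.0, 0.0]
print(euler_edit(start, toward_source, toward_target, lambda t: 0.0))
print(euler_edit(start, toward_source, toward_target, lambda t: 0.5))
print(euler_edit(start, toward_source, toward_target, lambda t: 1.0))
```

A weight of 0 reproduces the source trajectory exactly and a weight of 1 follows the edit alone, so intermediate weights trade editability against anchoring to the source.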

[113] Setup-Independent Full Projector Compensation

Haibo Li,Qingyue Deng,Jijiang Li,Haibin Ling,Bingyao Huang

Main category: cs.CV

TL;DR: This paper proposes SIComp, the first full projector compensation framework that generalizes to new projection setups without fine-tuning or retraining, achieving strong generalization through a large-scale real-world dataset and a co-adaptive design that decouples geometric and photometric compensation.

Details Motivation: Existing projector compensation methods depend heavily on a specific setup, large and diverse datasets are lacking, and geometric correction models generalize poorly to new geometric configurations. Method: A large-scale real-world dataset covering 277 distinct projector-camera setups is built; the SIComp framework adopts a co-adaptive design in which a tailored optical flow module performs online geometric correction and a novel photometric network handles photometric compensation, with intensity-varying surface priors added for robustness to illumination. Result: SIComp consistently produces high-quality compensation across diverse unseen setups and substantially outperforms existing methods in generalization, making it the first generalizable projector compensation solution. Conclusion: SIComp addresses the core problems of setup dependence and poor generalization in projector compensation and provides a reliable, general technical basis for projection in complex real-world scenes. Abstract: Projector compensation seeks to correct geometric and photometric distortions that occur when images are projected onto nonplanar or textured surfaces. However, most existing methods are highly setup-dependent, requiring fine-tuning or retraining whenever the surface, lighting, or projector-camera pose changes. Progress has been limited by two key challenges: (1) the absence of large, diverse training datasets and (2) existing geometric correction models are typically constrained by specific spatial setups; without further retraining or fine-tuning, they often fail to generalize directly to novel geometric configurations. We introduce SIComp, the first Setup-Independent framework for full projector Compensation, capable of generalizing to unseen setups without fine-tuning or retraining. To enable this, we construct a large-scale real-world dataset spanning 277 distinct projector-camera setups. SIComp adopts a co-adaptive design that decouples geometry and photometry: a carefully tailored optical flow module performs online geometric correction, while a novel photometric network handles photometric compensation. To further enhance robustness under varying illumination, we integrate intensity-varying surface priors into the network design. Extensive experiments demonstrate that SIComp consistently produces high-quality compensation across diverse unseen setups, substantially outperforming existing methods in terms of generalization ability and establishing the first generalizable solution to projector compensation. The code and dataset are available on our project page: https://hai-bo-li.github.io/SIComp/

[114] Dense Point-to-Mask Optimization with Reinforced Point Selection for Crowd Instance Segmentation

Hongru Chen,Jiyang Huang,Jia Wan,Antoni B. Chan

Main category: cs.CV

TL;DR: This paper proposes two new methods, DPMO and RPS, which generate instance segmentation masks for dense crowds from point annotations and use reinforcement learning to optimize point selection, improving both crowd counting and segmentation.

Details Motivation: Existing dense crowd datasets mostly carry point annotations and lack accurate region annotations (such as boxes or masks); applying large models such as SAM directly works poorly, so improvements are needed to obtain high-quality masks and better counting accuracy. Method: Dense Point-to-Mask Optimization (DPMO) combines SAM with the NNEC constraint to generate masks from point annotations; a Reinforced Point Selection (RPS) framework then uses Group Relative Policy Optimization (GRPO) to optimize point selection; new loss functions supervised by masks are also designed. Result: The method reaches state-of-the-art instance segmentation performance on four mainstream datasets (ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd) and shows that mask supervision improves the counting accuracy of several models. Conclusion: High-quality mask annotations matter for dense crowd instance segmentation and counting; DPMO and RPS bridge the gap between point annotations and instance segmentation and offer a new paradigm for crowd analysis. Abstract: Crowd instance segmentation is a crucial task with a wide range of applications, including surveillance and transportation. Currently, point labels are common in crowd datasets, while region labels (e.g., boxes) are rare and inaccurate. The masks obtained through segmentation help to improve the accuracy of region labels and resolve the correspondence between individual location coordinates and crowd density maps. However, directly applying currently popular large foundation models such as SAM does not yield ideal results in dense crowds. To this end, we first propose Dense Point-to-Mask Optimization (DPMO), which integrates SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation from point annotations. With DPMO and manual correction, we obtain mask annotations from the existing point annotations for traditional crowd datasets. Then, to predict instance segmentation in dense crowds, we propose a Reinforced Point Selection (RPS) framework trained with Group Relative Policy Optimization (GRPO), which selects the best predicted point from a sampling of the initial point prediction. Through extensive experiments, we achieve state-of-the-art crowd instance segmentation performance on ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd datasets. Furthermore, we design new loss functions supervised by masks that boost counting performance across different models, demonstrating the significant role of mask annotations in enhancing counting accuracy.
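
GRPO scores each sampled candidate relative to the other candidates in its group by normalizing rewards with the group mean and standard deviation. The sketch below shows that normalization for a group of sampled points. The reward values (imagined here as a mask-quality score per candidate point), the use of the population standard deviation, and the function names are assumptions rather than details from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    if len(rewards) < 2:
        raise ValueError("a group needs at least two sampled candidates")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; sample std is an equally valid choice
    return [(r - mean) / (std + eps) for r in rewards]


def pick_best(candidates, rewards):
    """Return the candidate with the highest reward."""
    best_index = max(range(len(candidates)), key=rewards.__getitem__)
    return candidates[best_index]


# Hypothetical candidate points sampled around one initial prediction.
candidate_points = [(10, 12), (11, 12), (10, 14)]
candidate_rewards = [1.0, 2.0, 3.0]

advantages = group_relative_advantages(candidate_rewards)
print(advantages)
print(pick_best(candidate_points, candidate_rewards))
```

Candidates that score above the group average get positive advantages and are reinforced; those below get negative ones. The `eps` term only guards against a group in which every reward is identical.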

[115] Unifying UAV Cross-View Geo-Localization via 3D Geometric Perception

Haoyuan Li,Wen Yang,Fang Xu,Hong Tan,Haijian Zhang,Shengyang Li,Gui-Song Xia

Main category: cs.CV

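TL;DR: This paper proposes a geometry-aware cross-view geo-localization framework for unmanned aerial vehicles (UAVs) that reconstructs a local 3D scene and renders an orthorectified bird's-eye view (BEV) to unify coarse place retrieval and fine-grained pose estimation; it introduces a Satellite-wise Attention Block to handle multiple candidate locations efficiently and releases a recalibrated University-1652 dataset, significantly improving meter-level localization accuracy and generalization in complex urban scenes under GNSS-denied conditions.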

Details Motivation: In GNSS-denied environments, oblique UAV images and orthographic satellite imagery differ severely in geometry; existing methods treat perspective distortion as appearance noise and use a decoupled pipeline (retrieval first, then pose estimation), which makes high-accuracy end-to-end localization difficult. Method: Proposes a geometry-aware framework that 1) reconstructs a local 3D scene from multi-view UAV image sequences using a Visual Geometry Grounded Transformer (VGGT); 2) renders an orthorectified bird's-eye view (BEV) as a geometric intermediary; and 3) uses a Satellite-wise Attention Block that isolates the interaction between each satellite candidate and the UAV scene, avoiding interference while keeping linear computational complexity. Result: On the recalibrated University-1652 and SUES-200 datasets, the method significantly outperforms SOTA methods, achieving robust meter-level localization accuracy and stronger generalization in complex urban environments. Conclusion: Explicitly modeling 3D scene geometry and introducing a BEV intermediate representation with a dedicated attention mechanism can bridge the cross-view geometric gap, offering a new paradigm for accurate autonomous UAV localization under GNSS denial. Abstract: Cross-view geo-localization for Unmanned Aerial Vehicles (UAVs) operating in GNSS-denied environments remains challenging due to the severe geometric discrepancy between oblique UAV imagery and orthogonal satellite maps. Most existing methods address this problem through a decoupled pipeline of place retrieval and pose estimation, implicitly treating perspective distortion as appearance noise rather than an explicit geometric transformation. In this work, we propose a geometry-aware UAV geo-localization framework that explicitly models the 3D scene geometry to unify coarse place recognition and fine-grained pose estimation within a single inference pipeline. Our approach reconstructs a local 3D scene from multi-view UAV image sequences using a Visual Geometry Grounded Transformer (VGGT), and renders a virtual Bird's-Eye View (BEV) representation that orthorectifies the UAV perspective to align with satellite imagery. This BEV serves as a geometric intermediary that enables robust cross-view retrieval and provides spatial priors for accurate 3 Degrees of Freedom (3-DoF) pose regression. To efficiently handle multiple location hypotheses, we introduce a Satellite-wise Attention Block that isolates the interaction between each satellite candidate and the reconstructed UAV scene, preventing inter-candidate interference while maintaining linear computational complexity. 
In addition, we release a recalibrated version of the University-1652 dataset with precise coordinate annotations and spatial overlap analysis, enabling rigorous evaluation of end-to-end localization accuracy. Extensive experiments on the refined University-1652 benchmark and SUES-200 demonstrate that our method significantly outperforms state-of-the-art baselines, achieving robust meter-level localization accuracy and improved generalization in complex urban environments.

[116] Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding

Jiayun Jin,Haolong Chai,Xueying Huang,Xiaoqing Guo,Zengwei Zheng,Zhan Zhou,Junmei Wang,Xinyu Wang,Jie Liu,Binbin Zhou

Main category: cs.CV

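TL;DR: This paper proposes Ultrasound-CLIP, a semantic-aware contrastive learning framework for ultrasound imaging, and builds the large-scale ultrasound image-text dataset US-365K and the Ultrasonographic Diagnostic Taxonomy (UDT), significantly improving performance on downstream tasks such as classification, retrieval, and zero-shot evaluation.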

Details Motivation: Existing vision-language pre-training models (such as CLIP) are hard to apply directly to ultrasound images, which have highly heterogeneous anatomical structures and diverse diagnostic attributes, and suitable high-quality data and domain knowledge are lacking. Method: Builds the US-365K dataset (365k image-text pairs across 52 anatomical categories); proposes the Ultrasonographic Diagnostic Taxonomy (UDT), comprising a hierarchical anatomical taxonomy and a nine-dimension diagnostic attribute framework; and designs the Ultrasound-CLIP framework, which introduces semantic soft labels, a semantic loss, and a heterogeneous graph modality built from diagnostic attributes. Result: Achieves SOTA on classification and retrieval tasks with patient-level splits, and generalizes strongly to zero-shot, linear-probing, and fine-tuning tasks. Conclusion: Multimodal pre-training guided by domain knowledge can effectively improve ultrasound image understanding; UDT and Ultrasound-CLIP offer a scalable paradigm for medical image-language modeling. Abstract: Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models, such as CLIP, are primarily designed for other modalities, and are difficult to directly apply to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To bridge this gap, we construct US-365K, a large-scale ultrasound image-text dataset containing 365k paired samples across 52 anatomical categories. We establish Ultrasonographic Diagnostic Taxonomy (UDT) containing two hierarchical knowledge frameworks. Ultrasonographic Hierarchical Anatomical Taxonomy standardizes anatomical organization, and Ultrasonographic Diagnostic Attribute Framework (UDAF) formalizes nine diagnostic dimensions, including body system, organ, diagnosis, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity. Building upon these foundations, we propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that introduces semantic soft labels and semantic loss to refine sample discrimination. Moreover, we construct a heterogeneous graph modality derived from UDAF's textual representations, enabling structured reasoning over lesion-attribute relations. Extensive experiments with patient-level data splitting demonstrate that our approach achieves state-of-the-art performance on classification and retrieval benchmarks, while also delivering strong generalization to zero-shot, linear probing, and fine-tuning tasks.

[117] Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion

Edoardo A. Dominici,Thomas Deixelberger,Konstantinos Vardis,Markus Steinberger

Main category: cs.CV

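TL;DR: This paper proposes a lightweight architecture and training strategy that uses self-supervised features (such as DINO) as a general conditioning signal for pretrained video diffusion models, decoupling appearance from the other features to be preserved; it supports tasks such as video domain transfer and video-from-3D generation, and compensates for the loss of controllability caused by low spatial resolution by raising feature dimensionality.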

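Details Motivation: Existing video diffusion models are mostly conditioned on perceptual, geometric, or simple semantic signals, while high-dimensional self-supervised visual features (such as DINO) carry rich scene information but are highly entangled, which limits their generative capabilities; this paper explores using them as a general conditioning signal for pretrained video diffusion models to improve controllable generation. Method: Proposes a lightweight network architecture and training strategy that explicitly decouples appearance (style/lighting) from other features such as semantics and geometry, and uses high-dimensional, low-resolution features in place of high-resolution spatial features, maintaining reconstruction quality while improving controllability. Result: Enables video domain transfer (e.g., stylization and relighting) and video-from-3D generation; shows that low spatial resolution can be compensated by higher feature dimensionality, significantly improving the robustness and flexibility of appearance control during generation. Conclusion: Self-supervised visual features can serve as a general and expressive conditioning signal for pretrained video diffusion models; the keys are feature decoupling and the dimensionality-resolution trade-off. The method provides a more flexible and controllable rendering interface for video generation. Abstract: Video models have recently been applied with success to problems in content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer rely on conditioning these models, typically through perceptual, geometric, or simple semantic signals, fundamentally using them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models. The connection between the two has been explored for subject-specific editing, aligning and training video diffusion models, but not in the role of a more general conditioning signal for pretrained video diffusion models. Features obtained through self-supervised learning, like DINO, contain a lot of entangled information about style, lighting and semantics of the scene. This makes them great at reconstruction tasks but limits their generative capabilities. In this paper, we show how we can use the features for tasks such as video domain transfer and video-from-3D generation. We introduce a lightweight architecture and training strategy that decouples appearance from other features that we wish to preserve, enabling robust control for appearance changes such as stylization and relighting. Furthermore, we show that low spatial resolution can be compensated by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.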

[118] Cosine-Normalized Attention for Hyperspectral Image Classification

Muhammad Ahmad,Manuel Mazzara

Main category: cs.CV

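TL;DR: This paper proposes a Transformer method for hyperspectral image classification based on cosine-normalized attention, which emphasizes angular relationships and reduces sensitivity to magnitude variations, and outperforms existing Transformer and Mamba models under extremely limited supervision.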

Details Motivation: The dot-product attention of standard Transformers mixes feature magnitude and orientation, which may be suboptimal for hyperspectral data; hyperspectral data have pronounced angular structure and call for a better-suited similarity measure. Method: Proposes cosine-normalized attention: query and key embeddings are projected onto a unit hypersphere, and attention scores are computed with squared cosine similarity, emphasizing angular relationships and suppressing magnitude effects; the formulation is integrated into a spatial-spectral Transformer. Result: On three benchmark datasets, the method consistently outperforms several recent Transformer- and Mamba-based models under extremely limited supervision while using a lightweight backbone; a controlled analysis shows that cosine-based scoring provides a reliable inductive bias. Conclusion: Reformulating the attention scoring function from a geometric perspective (cosine normalization in particular) better fits the intrinsic structure of hyperspectral data, providing a more robust and efficient representation learning mechanism for HSIC. Abstract: Transformer-based methods have improved hyperspectral image classification (HSIC) by modeling long-range spatial-spectral dependencies; however, their attention mechanisms typically rely on dot-product similarity, which mixes feature magnitude and orientation and may be suboptimal for hyperspectral data. This work revisits attention scoring from a geometric perspective and introduces a cosine-normalized attention formulation that aligns similarity computation with the angular structure of hyperspectral signatures. By projecting query and key embeddings onto a unit hypersphere and applying a squared cosine similarity, the proposed method emphasizes angular relationships while reducing sensitivity to magnitude variations. The formulation is integrated into a spatial-spectral Transformer and evaluated under extremely limited supervision. Experiments on three benchmark datasets demonstrate that the proposed approach consistently achieves higher performance, outperforming several recent Transformer- and Mamba-based models despite using a lightweight backbone. In addition, a controlled analysis of multiple attention score functions shows that cosine-based scoring provides a reliable inductive bias for hyperspectral representation learning.
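
To make the scoring rule in the abstract above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names are hypothetical, and passing the squared cosine scores through a softmax is an assumption, since the abstract does not say how the scores are normalized.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / (norm + eps) for x in vector]


def squared_cosine_attention(queries, keys, values):
    """Attention whose scores are squared cosine similarities.

    Assumption: a softmax turns the scores into weights; the abstract
    only specifies the unit-sphere projection and the squared cosine.
    """
    unit_queries = [l2_normalize(q) for q in queries]
    unit_keys = [l2_normalize(k) for k in keys]
    dim = len(values[0])
    outputs = []
    for q in unit_queries:
        # Dot product of unit vectors is the cosine; square it.
        scores = [sum(a * b for a, b in zip(q, k)) ** 2 for k in unit_keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs
```

Because both queries and keys are normalized before scoring, rescaling a query (for example, a brighter pixel with the same spectral shape) leaves the attention output unchanged, which is the magnitude insensitivity the abstract describes.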

[119] Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning

Seyed Amir Kasaei,Arash Marioriyad,Mahbod Khaleti,MohammadAmin Fazli,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban

Main category: cs.CV

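TL;DR: This paper introduces the RebusBench benchmark for evaluating the neurosymbolic ability of Large Vision-Language Models (LVLMs) to solve puzzles (such as rebus puzzles) that require multi-step reasoning and knowledge integration; experiments show that current SOTA models perform very poorly on such tasks (<10% Exact Match), revealing that they lack a cognitive reasoning mechanism that links perception to prior knowledge.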

Details Motivation: Existing LVLMs excel at explicit visual recognition but show marked cognitive deficits on tasks where the image is only a clue and the answer requires multi-step abstract reasoning (such as rebus puzzles); a dedicated benchmark is needed to evaluate their neurosymbolic reasoning. Method: Builds RebusBench, a benchmark of 1,164 rebus puzzles that requires models to extract visual/textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform cross-modal abstract mapping; systematically evaluates SOTA models including Qwen, InternVL, and LLaVA, and analyzes the effects of model scaling and in-context learning. Result: All tested models score below 10% Exact Match and no more than 20% semantic accuracy; neither larger models nor In-Context Learning brings significant improvement. Conclusion: Current LVLMs have the basic visual and linguistic components but severely lack the cognitive reasoning "glue" that connects them, highlighting neurosymbolic reasoning as a key direction for future work. Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.

[120] DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning

Yang Zhou,Xiaofeng Wang,Hao Shao,Letian Wang,Guosheng Zhao,Jiangnan Shao,Jiagang Zhu,Tingdong Yu,Zheng Zhu,Guan Huang,Steven L. Waslander

Main category: cs.CV

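TL;DR: DriveDreamer-Policy is a unified driving world-action model that integrates depth generation, future video generation, and motion planning; it is geometry-aware and significantly improves closed-loop planning and world generation performance.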

Details Motivation: Existing world-action models (WAMs) are mostly limited to modeling 2D appearance or latent representations and lack the geometric grounding that is essential for the physical world, making it hard to support reliable decisions by embodied systems in real environments. Method: Proposes DriveDreamer-Policy: a large language model processes language instructions, multi-view images, and actions, followed by three lightweight generators that output depth maps, future video frames, and driving actions; a learned geometry-aware world representation guides both future prediction and motion planning. Result: Reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based methods; produces higher-quality future video and depth predictions; ablations show that explicit depth learning improves the consistency of video imagination and planning robustness. Conclusion: Geometry-aware world-action modeling is a key path to better reasoning and control in embodied intelligence (autonomous driving in particular); DriveDreamer-Policy's modular design balances performance, controllable latency, and interpretability. Abstract: Recently, world-action models (WAM) have emerged to bridge vision-language-action (VLA) models and world models, unifying their reasoning and instruction-following capabilities and spatio-temporal world modeling. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. 
Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.

[121] FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation

Taimur Khan,Hannes Feilhauer,Muhammad Jazib Zafar

Main category: cs.CV

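TL;DR: This paper proposes the FSKD framework, which uses knowledge distillation to transfer forest structure information derived from LiDAR (such as CHM, PAI, and FHD) to a lightweight model that uses only RGBI imagery, enabling low-cost, high-resolution, multi-metric forest structure mapping.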

Details Motivation: High-resolution forest structure data are essential for monitoring carbon, biodiversity, and ecosystems, but airborne LiDAR is costly and acquired infrequently, while existing purely optical methods struggle to match its accuracy and multi-metric output. Method: Proposes FSKD, a LiDAR-to-RGBI knowledge distillation framework: a multi-modal cross-attention teacher that fuses LiDAR planar metrics and vertical profile features guides a SegFormer student that takes only RGBI imagery, jointly predicting CHM, PAI, and FHD; trained on 384 km² of data in Saxony, Germany, tested on eight geographically distinct regions, with robustness verified under temporal mismatch (winter LiDAR, summer RGBI). Result: The student achieves SOTA zero-shot CHM prediction (MedAE = 4.17 m, R² = 0.51, IoU = 0.87), with MAE 29 to 46% lower than the HRCHM/DAC baselines; its joint multi-metric prediction is not offered by current monocular CHM methods; PAI/FHD transfer is region-dependent and needs local calibration; cross-season use (winter LiDAR, summer RGBI) is supported. Conclusion: FSKD transfers LiDAR-grade forest structure metrics to RGBI imagery, relaxing the traditional dependence on LiDAR and on strict spatiotemporal co-registration, and offers a feasible technical path for large-scale 20 cm operational monitoring such as Digital Twin Germany. Abstract: Very High Resolution (VHR) forest structure data at individual-tree scale is essential for carbon, biodiversity, and ecosystem monitoring. Still, airborne LiDAR remains costly and infrequent despite being the reference for forest structure metrics like Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD). We propose FSKD: a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework in which a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. Trained on 384 $km^2$ of forests in Saxony, Germany (20 cm ground sampling distance (GSD)) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM performance (MedAE 4.17 m, $R^2$=0.51, IoU 0.87), outperforming HRCHM/DAC baselines by 29--46% in MAE (5.81 m vs. 8.14--10.84 m) with stronger correlation coefficients (0.713 vs. 0.166--0.652). Ablations show that multi-modal fusion improves performance by 10--26% over RGBI-only training, and that asymmetric distillation with appropriate model capacity is critical. The method jointly predicts CHM, PAI, and FHD, a multi-metric capability not provided by current monocular CHM estimators, although PAI/FHD transfer remains region-dependent and benefits from local calibration. 
The framework also remains effective under temporal mismatch (winter LiDAR, summer RGBI), removing strict co-acquisition constraints and enabling scalable 20 cm operational monitoring for workflows such as Digital Twin Germany and national Digital Orthophoto programs.

[122] GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents

Mengtian Li,Fan Yang,Ruixue Xiong,Yiyan Fan,Zhifeng Xie,Zeyu Wang

Main category: cs.CV

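TL;DR: This paper proposes the GardenDesigner framework, which combines the aesthetic principles of Jiangnan gardens with a chain of procedural-modeling agents to generate Jiangnan gardens digitally in a fast, diverse, and aesthetically pleasing way; with the GardenVerse knowledge base and a Unity interactive interface, non-expert users can build a garden from text input within one minute.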

Details Motivation: Jiangnan gardens are valuable digital assets with great potential in film, games, and digital tourism, but manual modeling depends on expert experience and is time-consuming, so an automated solution is needed. Method: Proposes the GardenDesigner framework: based on Jiangnan garden aesthetic rules such as "water as the veins, paths as the bones", it builds a multi-agent procedural modeling pipeline covering terrain distribution, path generation, asset selection, and layout optimization; introduces the expert-annotated GardenVerse knowledge base to improve cultural plausibility; and develops a Unity interactive interface that supports text input and real-time editing. Result: Experiments and human evaluations show that the method efficiently generates diverse Jiangnan garden scenes that follow aesthetic and cultural norms; non-expert users can build a garden within one minute. Conclusion: GardenDesigner systematically encodes traditional garden aesthetics into computable rules, combining cultural heritage with AIGC technology and offering a scalable paradigm for digitizing intangible cultural heritage. Abstract: Jiangnan gardens, a prominent style of Chinese classical gardens, hold great potential as digital assets for film and game production and digital tourism. However, manual modeling of Jiangnan gardens heavily relies on expert experience for layout design and asset creation, making the process time-consuming. To address this gap, we propose GardenDesigner, a novel framework that encodes aesthetic principles for Jiangnan garden construction and integrates a chain of agents based on procedural modeling. The water-centric terrain and explorative pathway rules are applied by terrain distribution and road generation agents. Selection and spatial layout of garden assets follow the aesthetic and cultural constraints. Consequently, we propose asset selection and layout optimization agents to select and arrange objects for each area in the garden. Additionally, we introduce GardenVerse for Jiangnan garden construction, including expert-annotated garden knowledge to enhance the asset arrangement process. To enable interaction and editing, we develop an interactive interface and tools in Unity, in which non-expert users can construct Jiangnan gardens via text input within one minute. Experiments and human evaluations demonstrate that GardenDesigner can generate diverse and aesthetically pleasing Jiangnan gardens. Project page is available at https://monad-cube.github.io/GardenDesigner.

[123] PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency

Leezy Han,Seunggyu Kim,Dongseok Shim,Hyeonbeom Lee

Main category: cs.CV

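TL;DR: This paper proposes a new framework that uses wheel odometry to improve the temporal consistency of monocular depth estimation: sparse depth is estimated by triangulation from optical flow, the scale is updated recursively, and the relative depth output of a pre-trained depth model is rescaled accordingly.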

Details Motivation: Existing monocular depth estimation methods struggle to maintain temporal consistency across consecutive frames, causing jitter and estimation failures when the depth range changes abruptly. Method: Uses wheel odometry together with optical-flow triangulation to estimate camera pose and sparse depth; the sparse depth drives a recursive Bayesian estimate of the metric scale, which is used to rescale the relative depth predicted by a pre-trained depth model. Result: Robustness and accuracy are validated on KITTI, TartanAir, MS2, and a self-collected dataset. Conclusion: The consistency-aware framework significantly improves the temporal stability and accuracy of monocular depth estimation, and is particularly suited to practical settings such as autonomous driving and mobile robots. Abstract: Monocular depth estimation (MDE) has been widely adopted in the perception systems of autonomous vehicles and mobile robots. However, existing approaches often struggle to maintain temporal consistency in depth estimation across consecutive frames. This inconsistency not only causes jitter but can also lead to estimation failures when the depth range changes abruptly. To address these challenges, this paper proposes a consistency-aware monocular depth estimation framework that leverages wheel odometry from a mobile robot to achieve stable and coherent depth predictions over time. Specifically, we estimate camera pose and sparse depth from triangulation using optical flow between consecutive frames. The sparse depth estimates are used to update a recursive Bayesian estimate of the metric scale, which is then applied to rescale the relative depth predicted by a pre-trained depth estimation foundation model. The proposed method is evaluated on the KITTI, TartanAir, MS2, and our own dataset, demonstrating robust and accurate depth estimation performance.
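
The scale-update step described in the abstract above can be sketched as a scalar recursive filter. This is only an illustration under stated assumptions: the abstract does not specify the form of the Bayesian estimator, so a one-dimensional Gaussian (Kalman-style) update is assumed here, the per-frame observation is taken as the median ratio of sparse metric depth to relative depth, and all function and parameter names are hypothetical.

```python
import statistics


def frame_scale_observation(sparse_metric_depths, relative_depths):
    """Per-frame scale observation: median ratio of metric to relative depth.

    Assumption: both lists hold depths for the same triangulated pixels.
    """
    ratios = [
        metric / relative
        for metric, relative in zip(sparse_metric_depths, relative_depths)
        if metric > 0 and relative > 0
    ]
    return statistics.median(ratios)


def update_scale(mean, variance, observation, observation_variance,
                 process_variance=1e-4):
    """One recursive Gaussian update of the metric scale estimate."""
    # Predict: the scale may drift slightly between frames.
    variance = variance + process_variance
    # Correct: blend prediction and observation by their uncertainties.
    gain = variance / (variance + observation_variance)
    mean = mean + gain * (observation - mean)
    variance = (1.0 - gain) * variance
    return mean, variance


def rescale(relative_depth_map, scale):
    """Convert a relative depth map (list of rows) to metric depth."""
    return [[scale * depth for depth in row] for row in relative_depth_map]
```

Filtering the scale instead of recomputing it per frame is what damps frame-to-frame jitter: a single noisy triangulation only moves the estimate by the gain-weighted residual.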

[124] A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection

Arezoo Borji,Gernot Kronreif,Bernhard Angermayr,Francisco Mario Calisto,Wolfgang Birkfellner,Inna Servetnyk,Yinyin Yuan,Sepideh Hatamikia

Main category: cs.CV

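TL;DR: This paper proposes a deep learning framework that combines NSGA-II optimization with Monte Carlo dropout uncertainty estimation to predict PAM50 subtypes directly from H&E-stained whole-slide images, reducing reliance on costly molecular assays. Validated on the TCGA-BRCA and CPTAC-BRCA datasets, it achieves high F1-scores and AUC, indicating potential for clinical use.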

Details Motivation: Reduce reliance on costly molecular assays (such as PAM50 gene expression profiling) by using routine H&E-stained images to predict intrinsic breast cancer subtypes accurately, efficiently, and at scale, advancing pathology-image-driven personalized treatment. Method: Proposes an optimization-driven deep learning framework that jointly optimizes patch informativeness, spatial diversity, uncertainty, and patch count; combines the NSGA-II multi-objective optimization algorithm with Monte Carlo dropout uncertainty estimation; uses ResNet18 for feature extraction and a custom CNN head for classification; and selects only a small number of highly informative patches for the final classification. Result: Achieves F1 = 0.8812 and AUC = 0.9841 on the internal TCGA-BRCA dataset (627 WSIs), and F1 = 0.7952 and AUC = 0.9512 on the external CPTAC-BRCA cohort; improves computational efficiency while maintaining high predictive performance. Conclusion: The uncertainty-aware, optimization-guided patch selection strategy offers a potential replacement for traditional molecular assays and an efficient, generalizable, and clinically feasible approach to PAM50 subtyping from digital pathology. Abstract: Breast cancer is a highly heterogeneous disease with diverse molecular profiles. The PAM50 gene signature is widely recognized as a standard for classifying breast cancer into intrinsic subtypes, enabling more personalized treatment strategies. In this study, we introduce a novel optimization-driven deep learning framework that aims to reduce reliance on costly molecular assays by directly predicting PAM50 subtypes from H&E-stained whole-slide images (WSIs). Our method jointly optimizes patch informativeness, spatial diversity, uncertainty, and patch count by combining the non-dominated sorting genetic algorithm II (NSGA-II) with Monte Carlo dropout-based uncertainty estimation. The proposed method can identify a small but highly informative patch subset for classification. We used a ResNet18 backbone for feature extraction and a custom CNN head for classification. For evaluation, we used the internal TCGA-BRCA dataset as the training cohort and the external CPTAC-BRCA dataset as the test cohort. On the internal dataset, an F1-score of 0.8812 and an AUC of 0.9841 using 627 WSIs from the TCGA-BRCA cohort were achieved. The performance of the proposed approach on the external validation dataset showed an F1-score of 0.7952 and an AUC of 0.9512. 
These findings indicate that the proposed optimization-guided, uncertainty-aware patch selection can achieve high performance and improve the computational efficiency of histopathology-based PAM50 classification compared to existing methods, suggesting a scalable imaging-based replacement that has the potential to support clinical decision-making.
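
The core operation behind NSGA-II, mentioned in the abstract above, is non-dominated sorting of candidate solutions. The sketch below shows only how the first Pareto front is extracted; it is not the authors' pipeline. The objective tuples are hypothetical, and it assumes every objective has been arranged so that smaller is better (for example, negated informativeness and diversity alongside patch count); the paper's actual sign conventions are not given in the abstract.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in one.

    Assumption: all objectives are minimised.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def first_front(points):
    """Indices of candidates that no other candidate dominates."""
    front = []
    for i, candidate in enumerate(points):
        dominated = any(
            dominates(other, candidate)
            for j, other in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical candidate patch subsets scored on two minimised objectives.
candidates = [(1, 5), (2, 2), (3, 3), (5, 1)]
pareto_indices = first_front(candidates)
```

Here `(3, 3)` is dropped because `(2, 2)` is better on both objectives, while the other three candidates each win on at least one objective and stay on the front.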

[125] STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

Emad Bahrami,Olga Zatsarynna,Parth Pathak,Sunando Sengupta,Juergen Gall,Mohsen Fayyaz

Main category: cs.CV

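TL;DR: STRIVE is a spatiotemporal reinforcement learning framework for video question answering that improves reward signal quality by constructing multiple spatiotemporal variants of each video and normalizing jointly across textual generations and visual variants; it introduces an importance-aware sampling mechanism that makes exploration more robust while keeping it semantically relevant, and it significantly outperforms existing reinforcement learning baselines on multiple video reasoning benchmarks.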

Details Motivation: Existing group-based policy optimization methods for multimodal reinforcement learning suffer from low reward variance when responses have similar correctness, leading to weak or unstable advantage estimates. Method: Proposes the STRIVE framework: 1) construct multiple spatiotemporal variants of the input video; 2) normalize jointly across textual generations and visual variants to broaden group comparisons; 3) use an importance-aware sampling mechanism that prioritizes the frames most relevant to the question while preserving temporal coverage. Result: On six video reasoning benchmarks (VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest), STRIVE consistently outperforms strong reinforcement learning baselines across multiple large multimodal models. Conclusion: Structured spatiotemporal exploration is an effective and principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance. Abstract: We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks including VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
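
The low-reward-variance problem and the joint normalization described in the abstract above can be illustrated with a small sketch. It assumes the standard group-relative advantage (reward minus group mean, divided by group standard deviation) and assumes that "joint normalization" means pooling rewards over all visual variants and generations before standardizing; both the pooling rule and the function names are illustrative, not the authors' code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


def joint_advantages(rewards_by_variant, eps=1e-6):
    """Standardise over the pooled rewards of all visual variants.

    Assumption: rewards_by_variant[v][g] is the reward of generation g
    on visual variant v of the same video-question pair.
    """
    pooled = [r for variant in rewards_by_variant for r in variant]
    flat = group_relative_advantages(pooled, eps)
    result, start = [], 0
    for variant in rewards_by_variant:
        result.append(flat[start:start + len(variant)])
        start += len(variant)
    return result


# Every generation is right on variant 0 and wrong on variant 1.
rewards = [[1.0, 1.0], [0.0, 0.0]]
per_variant = [group_relative_advantages(r) for r in rewards]
pooled = joint_advantages(rewards)
```

In this toy case each variant alone yields all-zero advantages, so there is no learning signal, whereas the pooled normalization assigns positive advantages to the first variant's responses and negative ones to the second's.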

[126] SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers

Xiang Yang,Feifei Li,Mi Zhang,Geng Hong,Xiaoyu You,Min Yang

Main category: cs.CV

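TL;DR: This paper proposes SafeRoPE, a lightweight, fine-grained safe generation framework that analyzes the safety-critical subspaces of the attention mechanism in MMDiT and uses Rotary Positional Embedding (RoPE) perturbations to suppress harmful semantics precisely, balancing safety and generation fidelity.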

Details Motivation: Existing T2I models (such as SD3 and FLUX) generate high-quality images but are prone to producing unsafe content when triggered by multi-token interactions; existing mitigation methods are computationally expensive and designed for U-Net, so they are hard to adapt to Transformer architectures such as MMDiT. Method: 1) Analyze the attention heads of MMDiT to identify interpretable low-dimensional subspaces that carry unsafe semantics; 2) construct head-wise unsafe subspaces and define a Latent Risk Score (LRS); 3) design head-wise RoPE perturbations that selectively suppress unsafe features while preserving benign content; 4) combine LRS and RoPE perturbations to perform risk-specific rotation of query/key vectors. Result: SafeRoPE achieves a SOTA safety-utility balance on MMDiT: it significantly lowers the rate of harmful content generation while preserving image quality and generation fidelity; experiments verify its effectiveness, light weight, and robustness across prompts. Conclusion: SafeRoPE shows that unsafe semantics in MMDiT are structured and sparse at the head level, and that RoPE is a key interface for controlled intervention; the framework offers an efficient, interpretable safe generation paradigm for Transformer-based diffusion models that requires no fine-tuning. Abstract: Recent Text-to-Image (T2I) models based on rectified-flow transformers (e.g., SD3, FLUX) achieve high generative fidelity but remain vulnerable to unsafe semantics, especially when triggered by multi-token interactions. Existing mitigation methods largely rely on fine-tuning or attention modulation for concept unlearning; however, their expensive computational overhead and design tailored to U-Net-based denoisers hinder direct adaptation to transformer-based diffusion models (e.g., MMDiT). In this paper, we conduct an in-depth analysis of the attention mechanism in MMDiT and find that unsafe semantics concentrate within interpretable, low-dimensional subspaces at head level, where a finite set of safety-critical heads is responsible for unsafe feature extraction. We further observe that perturbing the Rotary Positional Embedding (RoPE) applied to the query and key vectors can effectively modify some specific concepts in the generated images. Motivated by these insights, we propose SafeRoPE, a lightweight and fine-grained safe generation framework for MMDiT. Specifically, SafeRoPE first constructs head-wise unsafe subspaces by decomposing unsafe embeddings within safety-critical heads, and computes a Latent Risk Score (LRS) for each input vector via projection onto these subspaces. We then introduce head-wise RoPE perturbations that can suppress unsafe semantics without degrading benign content or image quality. 
SafeRoPE combines both head-wise LRS and RoPE perturbations to perform risk-specific head-wise rotation on query and key vector embeddings, enabling precise suppression of unsafe outputs while maintaining generation fidelity. Extensive experiments demonstrate that SafeRoPE achieves SOTA performance in balancing effective harmful content mitigation and utility preservation for safe generation of MMDiT. Codes are available at https://github.com/deng12yx/SafeRoPE.
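
As a rough illustration of the two ingredients named in the abstract above (a projection-based risk score and a rotation of query/key vectors), here is a heavily hedged sketch. It is not the released SafeRoPE code: the score is assumed to be the fraction of a vector's energy lying in an orthonormal "unsafe" basis, the rotation is assumed to act on consecutive coordinate pairs as in standard RoPE, and scaling the extra angle linearly by the score is purely an assumption. All names are hypothetical.

```python
import math


def latent_risk_score(vector, unsafe_basis):
    """Fraction of the vector's energy inside the unsafe subspace.

    Assumption: unsafe_basis is a list of orthonormal vectors.
    """
    energy = sum(x * x for x in vector)
    if energy == 0.0:
        return 0.0
    projected = sum(
        sum(x * b for x, b in zip(vector, basis_vector)) ** 2
        for basis_vector in unsafe_basis
    )
    return projected / energy


def rotate_pairs(vector, angle):
    """Rotate consecutive coordinate pairs by the same angle (RoPE-style)."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    rotated = []
    for i in range(0, len(vector), 2):
        x, y = vector[i], vector[i + 1]
        rotated += [x * cos_a - y * sin_a, x * sin_a + y * cos_a]
    return rotated


def risk_scaled_rotation(vector, unsafe_basis, max_angle):
    """Hypothetical rule: extra rotation angle grows with the risk score."""
    score = latent_risk_score(vector, unsafe_basis)
    return rotate_pairs(vector, max_angle * score)
```

Under these assumptions a vector orthogonal to the unsafe subspace gets a score of 0 and passes through unchanged, while a vector lying entirely inside it receives the full rotation; rotations preserve the vector norm, so only its direction changes.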

[127] Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks

Yaxin Luo,Zhiqiang Shen

Main category: cs.CV

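TL;DR: This paper proposes random label bridge training, a method that needs no manual annotation and lets Large Language Model (LLM) parameters adapt effectively to vision tasks; it also finds that some LLM layers have strong foundational properties and can be used directly for vision tasks without fine-tuning, opening a new path for cross-modality transfer.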

Details Motivation: Prior work commonly assumes that language pre-trained models are unsuitable for downstream vision tasks because their parameter space differs greatly from that of vision models; this paper challenges that assumption and explores direct adaptation between the language and vision modalities. Method: Proposes random label bridge training as a modality adaptation learner that needs only random labels and no manual annotation; also explores partial bridge training, in which only some LLM layers are bridge-trained while the other layers keep their original parameters. Result: Experiments show that bridge training effectively aligns LLM parameters with vision tasks; partial bridge training is often better than full bridge training, and certain LLM layers have strong foundational properties that benefit vision tasks even without fine-tuning. Conclusion: Language pre-trained parameters can be adapted to vision tasks through lightweight bridge training, and some layers generalize across modalities, offering an efficient and practical paradigm for cross-modality transfer. Abstract: The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) adaptation inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, that requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.

[128] Ranking-Guided Semi-Supervised Domain Adaptation for Severity Classification

Shota Harada,Ryoma Bise,Kiyohito Tanaka,Seiichi Uchida

Main category: cs.CV

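TL;DR: This paper proposes a new semi-supervised domain adaptation method for severity classification in medical images, which aligns the rank score distributions of the source and target domains through cross-domain ranking and continuous distribution alignment.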

Details Motivation: Existing semi-supervised domain adaptation methods perform poorly on severity classification in medical images, because severity classes are naturally ordered and have unclear boundaries, which makes domain adaptation difficult. Method: Proposes two modules, Cross-Domain Ranking and Continuous Distribution Alignment, which learn rank scores from the class order and align the rank score distributions of the source and target domains. Result: Experiments on ulcerative colitis and diabetic retinopathy datasets show that the method effectively aligns class-specific rank score distributions and improves severity classification performance. Conclusion: By modeling the ordinal structure, the method significantly improves the applicability and effectiveness of semi-supervised domain adaptation for medical image analysis with ordered classes. Abstract: Semi-supervised domain adaptation leverages a few labeled and many unlabeled target samples, making it promising for addressing domain shifts in medical image analysis. However, existing methods struggle with severity classification due to unclear class boundaries. Severity classification involves naturally ordered class labels, complicating adaptation. We propose a novel method that aligns source and target domains using rank scores learned via ranking with class order. Specifically, Cross-Domain Ranking ranks sample pairs across domains, while Continuous Distribution Alignment aligns rank score distributions. Experiments on ulcerative colitis and diabetic retinopathy classification validate the effectiveness of our approach, demonstrating successful alignment of class-specific rank score distributions.

[129] Semantic Segmentation of Textured Non-manifold 3D Meshes using Transformers

Mohammadreza Heidarianbaei,Max Mehltretter,Franz Rottensteiner

Main category: cs.CV

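TL;DR: This paper proposes a texture-aware Transformer for semantic segmentation of textured 3D meshes, which fuses face-level pixel features with geometric descriptors and uses Two-Stage Transformer Blocks for multi-scale feature aggregation, achieving significant gains on SUM and a new cultural-heritage dataset.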

Details Motivation: Most existing deep-learning methods for semantic segmentation of 3D meshes ignore texture, although texture is essential to understanding a mesh's appearance; methods that impose no geometric constraints can operate directly on meshes but fail to exploit their rich textural cues. Method: Proposes a texture-aware Transformer: 1) a texture branch encodes the raw pixels of each face into a learnable token; 2) the token is fused with geometric descriptors and fed to Two-Stage Transformer Blocks (TSTB), which support both local and global information flow; 3) a hierarchical multi-scale feature aggregation scheme is applied. Result: Reaches 81.9% mF1 and 94.3% OA on the Semantic Urban Meshes (SUM) benchmark, and 49.7% mF1 and 72.8% OA on a newly curated cultural-heritage roof tile dataset (with triangle-level damage annotations), substantially outperforming existing methods. Conclusion: Texture is crucial for semantic segmentation of 3D meshes; the proposed texture-aware Transformer jointly models geometric and textural features, offering a new paradigm for fine-grained segmentation in complex real-world scenes such as urban modeling and cultural heritage preservation. Abstract: Textured 3D meshes jointly represent geometry, topology, and appearance, yet their irregular structure poses significant challenges for deep-learning-based semantic segmentation. While a few recent methods operate directly on meshes without imposing geometric constraints, they typically overlook the rich textural information also provided by such meshes. We introduce a texture-aware transformer that learns directly from raw pixels associated with each mesh face, coupled with a new hierarchical learning scheme for multi-scale feature aggregation. A texture branch summarizes all face-level pixels into a learnable token, which is fused with geometrical descriptors and processed by a stack of Two-Stage Transformer Blocks (TSTB), which allow for both a local and a global information flow. We evaluate our model on the Semantic Urban Meshes (SUM) benchmark and a newly curated cultural-heritage dataset comprising textured roof tiles with triangle-level annotations for damage types. Our method achieves 81.9% mF1 and 94.3% OA on SUM and 49.7% mF1 and 72.8% OA on the new dataset, substantially outperforming existing approaches.

[130] Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images

Jamie S. J. Stirling,Noura Al-Moubayed,Hubert P. H. Shum

Main category: cs.CV

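TL;DR: This paper proposes PI-VQ, a position-free discrete image representation. Constraining latent codes to carry no positional information pushes the model toward global semantic features and enables direct interpolation without a learned prior. A matching quantization algorithm based on optimal bipartite matching raises effective bottleneck capacity by 3.5x and allows interpolation-based sampling in a single forward pass, with competitive generation metrics on CelebA, CelebA-HQ and FFHQ.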

Details Motivation: To examine whether discrete representations of spatially aligned data must depend on positional information, and to challenge the modeling complexity that position dependence brings to VQ-VAE and VQ-GAN (for example, the need for autoregressive or diffusion priors). Method: The permutation-invariant vector-quantized autoencoder (PI-VQ) constrains latent codes to carry no positional information; matching quantization replaces nearest-neighbour quantization with optimal bipartite matching to increase effective bottleneck capacity; the compositional structure of the learned codes is used for interpolation-based sampling. Result: On CelebA, CelebA-HQ and FFHQ, synthesised images obtain competitive precision, density and coverage; direct interpolation without an explicit prior and single-forward-pass generation are both achieved; separability and interpretability of the latent codes are discussed as trade-offs. Conclusion: Positional information is not required for discrete image representations; PI-VQ shows that position-free representations can capture global semantics and support a simple, efficient generation mechanism, and matching quantization eases the capacity bottleneck. Abstract: Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by $3.5\times$ relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
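
To make the contrast between nearest-neighbour and matching quantization concrete, here is a small sketch. It assumes that "matching" means a one-to-one assignment of latent vectors to distinct codebook entries that minimizes the total squared distance; the abstract only says the algorithm is based on optimal bipartite matching, so the exact cost and constraints used by PI-VQ may differ. The brute-force search over permutations is for illustration only (a Hungarian-style solver would be used at realistic sizes), and both function names are hypothetical.

```python
from itertools import permutations


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbour_quantize(latents, codebook):
    """Each latent independently picks its closest code; duplicates are allowed."""
    return [min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
            for z in latents]


def matching_quantize(latents, codebook):
    """Assign distinct codes to latents, minimizing the summed squared distance.

    Brute force over all injective assignments; only viable for tiny examples.
    """
    best_cost, best_assignment = float("inf"), None
    for assignment in permutations(range(len(codebook)), len(latents)):
        cost = sum(sq_dist(z, codebook[k]) for z, k in zip(latents, assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, list(assignment)
    return best_assignment


latents = [(0.0, 0.0), (0.1, 0.0)]
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
nn_codes = nearest_neighbour_quantize(latents, codebook)   # both latents collapse onto code 0
matched_codes = matching_quantize(latents, codebook)       # codes are forced to be distinct
```

In a permutation-invariant representation the set of codes is all that survives, so two latents that collapse onto the same code convey less than two distinct codes would. That is one plausible reading of why a matching-based quantizer increases the effective bottleneck capacity.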

[131] FaCT-GS: Fast and Scalable CT Reconstruction with Gaussian Splatting

Pawel Tomasz Pieta,Rasmus Juul Pedersen,Sina Borgi,Jakob Sauer Jørgensen,Jens Wenzel Andreasen,Vedrana Andersen Dahl

Main category: cs.CV

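TL;DR: This paper proposes FaCT-GS, a fast and flexible CT reconstruction framework based on Gaussian Splatting (GS). An in-depth optimization of the voxelization and rasterization pipelines makes it markedly faster and lets pre-existing volumes serve as priors for reconstruction.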

Details Motivation: Gaussian Splatting (GS) performs well in CT reconstruction, but its advantage over traditional algorithms is not large enough to motivate a transition in practice; existing GS methods remain limited in computational efficiency and scalability. Method: The FaCT-GS framework is built on an in-depth optimization of the voxelization and rasterization pipelines, and supports rapid fitting of Gaussians to pre-existing volumes, either to warm-start reconstruction or as a compressed representation. Result: FaCT-GS is over 4x faster than the state-of-the-art GS CT method on 512x512 projections and over 13x faster on 2k projections, and it scales well with projection size and output volume size. Conclusion: FaCT-GS substantially improves the practicality and efficiency of GS for CT reconstruction, offering an efficient and flexible option for clinical and large-scale CT reconstruction. Abstract: Gaussian Splatting (GS) has emerged as a dominating technique for image rendering and has quickly been adapted for the X-ray Computed Tomography (CT) reconstruction task. However, despite being on par or better than many of its predecessors, the benefits of GS are typically not substantial enough to motivate a transition from well-established reconstruction algorithms. This paper addresses the most significant remaining limitations of the GS-based approach by introducing FaCT-GS, a framework for fast and flexible CT reconstruction. Enabled by an in-depth optimization of the voxelization and rasterization pipelines, our new method is significantly faster than its predecessors and scales well with projection and output volume size. Furthermore, the improved voxelization enables rapid fitting of Gaussians to pre-existing volumes, which can serve as a prior for warm-starting the reconstruction, or simply as an alternative, compressed representation. FaCT-GS is over 4X faster than the State of the Art GS CT reconstruction on standard 512x512 projections, and over 13X faster on 2k projections. Implementation available at: https://github.com/PaPieta/fact-gs.

[132] Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance

Jason Qiu,Zachary Meurer,Xavier Thomas,Deepti Ghadiyaram

Main category: cs.CV

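TL;DR: This paper shows that current vision-language models (VLMs) lack spatial invariance and equivariance under basic geometric transformations such as rotation and scaling. Performance drops sharply when semantic content is sparse, exposing a systematic gap between semantic understanding and spatial reasoning.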

Details Motivation: Modern VLMs excel at semantic tasks but have fundamental weaknesses in basic spatial reasoning, such as robustness to geometric transformations; their missing spatial invariance and equivariance need to be investigated. Method: A systematic evaluation across several visual domains, including symbolic sketches, natural photographs, and abstract art, tests the robustness of VLMs to rotation, scaling, and identity transformations under different architectures, model sizes, and prompting strategies. Result: VLM performance drops sharply under geometric transformations, especially when semantic content is sparse; the effect holds across architectures, model capacities, and prompting strategies. Conclusion: Current VLMs show a systematic disconnect between semantic understanding and spatial reasoning; future multimodal systems need stronger geometric grounding. Abstract: This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: lack of robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.

[133] HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models

Yansong Guo,Chaoyang Zhu,Jiayi Ji,Jianghang Lin,Liujuan Cao

Main category: cs.CV

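TL;DR: HieraVid is a hierarchical video token pruning framework that prunes dynamically at the segment, frame, and layer levels, sharply reducing the number of input video tokens while retaining nearly all of the VideoLLM's performance.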

Details Motivation: Existing methods prune only at the input level and ignore both the structure of videos (such as the segment-frame structure) and the unidirectional propagation of multi-modal information inside large language models, which leads to inefficient pruning and large information loss. Method: The HieraVid hierarchical pruning framework operates at three levels: 1) segment level, where video tokens are temporally segmented and spatially merged; 2) frame level, where similar frames within a segment are jointly pruned to preserve diversity; 3) layer level, where redundancy is progressively reduced as LLM depth increases, without compromising performance. Result: On four widely used video understanding benchmarks, with only 30% of tokens retained, HieraVid outperforms existing methods and sets a new state of the art, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively. Conclusion: A hierarchical, dynamic, structure-aware pruning strategy compresses video input more efficiently while balancing computational cost and model understanding, offering a new approach to lightweight VideoLLM deployment. Abstract: Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Based on two observations, that videos possess the segment-frame structure and that LLMs internally propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; 3) layer-level, where redundancy is progressively reduced as LLM depth increases, without compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
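
The frame-level step, pruning similar frames within a segment while preserving diversity, can be sketched as a greedy filter. This is an illustration under stated assumptions and not HieraVid's actual rule: each frame is reduced to a single feature vector, similarity is cosine similarity, and a frame is kept only if it is not too similar to any frame already kept. The function names and the 0.9 threshold are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune_similar_frames(segment, threshold=0.9):
    """Return indices of frames to keep within one segment.

    A frame is kept only if its cosine similarity to every frame kept so far
    is below `threshold`, so near-duplicates are dropped and diverse frames stay.
    """
    kept = []
    for idx, feat in enumerate(segment):
        if all(cosine(feat, segment[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


segment = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]
kept = prune_similar_frames(segment)  # frame 1 is nearly identical to frame 0 and is dropped
```

A greedy pass like this depends on frame order; the abstract's phrase "jointly pruned" suggests the paper considers the frames of a segment together, which this sketch does not attempt to reproduce.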

[134] Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation

Hinako Mitsuoka,Kazuhiro Hotta

Main category: cs.CV

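TL;DR: This paper proposes a lightweight dual-loss training framework that improves fine-grained temporal action segmentation (TAS) with only one additional output channel and two auxiliary loss terms, requiring minimal architectural modification.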

Details Motivation: Existing TAS methods rely on complex architectures that hinder practical deployment; a lightweight, general, and efficient training strategy is needed to improve boundary localization and within-segment consistency. Method: Two losses are designed: 1) a single-channel boundary-regression loss that improves temporal boundary localization; 2) a segment-level regularization loss based on cumulative distribution functions (CDFs) that encourages coherent structure within predicted segments. The framework is architecture-agnostic and can be plugged into mainstream TAS models. Result: On three benchmark datasets, F1 and Edit scores improve, along with segment-level consistency and boundary quality; frame-wise accuracy remains largely unchanged. Conclusion: Precise temporal action segmentation does not require heavier models or inference-time refinement; it can be achieved through simple, general loss design. Abstract: Recent progress in Temporal Action Segmentation (TAS) has increasingly relied on complex architectures, which can hinder practical deployment. We present a lightweight dual-loss training framework that improves fine-grained segmentation quality with only one additional output channel and two auxiliary loss terms, requiring minimal architectural modification. Our approach combines a boundary-regression loss that promotes accurate temporal localization via a single-channel boundary prediction and a CDF-based segment-level regularization loss that encourages coherent within-segment structure by matching cumulative distributions over predicted and ground-truth segments. The framework is architecture-agnostic and can be integrated into existing TAS models (e.g., MS-TCN, C2F-TCN, FACT) as a training-time loss function. Across three benchmark datasets, the proposed method improves segment-level consistency and boundary quality, yielding higher F1 and Edit scores across three different models. Frame-wise accuracy remains largely unchanged, highlighting that precise segmentation can be achieved through simple loss design rather than heavier architectures or inference-time refinements.
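
The idea of matching cumulative distributions can be sketched as follows. The abstract does not give the loss formula, so this is only one plausible instance: per-frame scores for a class are normalized into a distribution over time, both the prediction and the ground truth are turned into CDFs, and the loss is the mean absolute gap between them. The function names are hypothetical.

```python
from itertools import accumulate


def cdf(values):
    """Normalize non-negative per-frame scores and return their running sum."""
    total = sum(values)
    if total == 0:
        return [0.0] * len(values)
    return [c / total for c in accumulate(values)]


def cdf_matching_loss(pred, target):
    """Mean absolute difference between the predicted and ground-truth CDFs."""
    p, t = cdf(pred), cdf(target)
    return sum(abs(a - b) for a, b in zip(p, t)) / len(p)


# The prediction places the action in the second half, the ground truth in the first half.
loss_shifted = cdf_matching_loss([0, 0, 1, 1], [1, 1, 0, 0])
# A perfect prediction has identical CDFs and therefore zero loss.
loss_perfect = cdf_matching_loss([1, 1, 0, 0], [1, 1, 0, 0])
```

Compared with a frame-wise loss, a CDF gap grows with how far mass is displaced in time, so a segment shifted by many frames is penalized more than one shifted by a single frame. That property is one reason such a term might encourage coherent segments.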

[135] MAR-MAER: Metric-Aware and Ambiguity-Adaptive Autoregressive Image Generation

Kai Dong,Tingting Bai

Main category: cs.CV

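TL;DR: This paper proposes MAR-MAER, a hierarchical autoregressive text-to-image framework. Metric-aware embedding regularization improves consistency with quality metrics, and a conditional variational module adds semantic flexibility for ambiguous prompts; the method outperforms the Hi-MAR baseline.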

Details Motivation: To address two problems of existing autoregressive text-to-image models: generated images do not always match human preferences, and ambiguous prompts are hard to handle. Method: The MAR-MAER framework has two parts: 1) a lightweight projection head trained with an adaptive kernel regression loss, which implements metric-aware embedding regularization (aligning with human-preference metrics such as CLIPScore and HPSv2); 2) a conditional variational module that introduces controlled randomness into hierarchical token generation to support diverse outputs for ambiguous prompts. Result: On COCO and a newly built Ambiguous-Prompt Benchmark, CLIPScore improves by +1.6 and HPSv2 by +5.3 over Hi-MAR; output diversity for ambiguous prompts increases notably, confirmed by both human evaluation and automated metrics. Conclusion: MAR-MAER improves the metric consistency and semantic flexibility of autoregressive image generation and offers a new way to handle open-ended, ambiguous prompts. Abstract: Autoregressive (AR) models have demonstrated significant success in the realm of text-to-image generation. However, they usually face two major challenges. Firstly, the generated images may not always meet the quality standards expected by humans. Furthermore, these models face difficulty when dealing with ambiguous prompts that could be interpreted in several valid ways. To address these issues, we introduce MAR-MAER, an innovative hierarchical autoregressive framework. It combines two main components: a metric-aware embedding regularization method and a probabilistic latent model for handling ambiguous semantics. Our method utilizes a lightweight projection head, which is trained with an adaptive kernel regression loss function. This aligns the model's internal representations with human-preferred quality metrics, such as CLIPScore and HPSv2. As a result, the learned embedding space more accurately reflects human judgment. We also introduce a conditional variational module. This approach incorporates an aspect of controlled randomness within the hierarchical token generation process. This capability allows the model to produce a diverse array of coherent images based on ambiguous or open-ended prompts. We conducted extensive experiments using COCO and a newly developed Ambiguous-Prompt Benchmark. The results show that MAR-MAER achieves excellent performance in both metric consistency and semantic flexibility. It exceeds the baseline Hi-MAR model's performance, showing an improvement of +1.6 in CLIPScore and +5.3 in HPSv2. For unclear inputs, it produces a notably wider range of outputs. These findings have been confirmed through both human evaluation and automated metrics.

[136] GeoAI Agency Primitives

Akram Zaytar,Rohan Sawahn,Caleb Robinson,Gilles Q. Hacheme,Girmaw A. Tadesse,Inbal Becker-Reshef,Rahul Dodhia,Juan Lavista Ferres

Main category: cs.CV

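TL;DR: This paper proposes a set of agency primitives for geospatial AI (GeoAI) assistants, aiming to bridge the gap between foundation-model capabilities and the actual workflows of GIS practitioners. It argues for a human-centered, iteratively collaborative agency layer, and defines 9 core primitives together with a productivity benchmark.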

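Details Motivation: Existing GeoAI models (for satellite image captioning, visual question answering, and similar tasks) have advanced, but they have not raised the productivity of GIS practitioners in practical tasks such as map making and vector layer production; the root cause is the lack of an agency layer that supports iterative human-machine collaboration. Method: A vocabulary of 9 primitives for such an agency layer is proposed (including navigation, perception, geo-referenced memory, and dual modeling), together with a benchmark that uses human productivity as its measure. Result: An implementable, testable, and comparable framework for agentic assistance in GeoAI, giving human-machine collaboration in GIS a structured basis. Conclusion: The agency layer, rather than model capability alone, is the key to raising GIS practitioners' productivity; the work offers a methodology and evaluation standard for moving GeoAI from technical demonstrations to practical use. Abstract: We present ongoing research on agency primitives for GeoAI assistants -- core capabilities that connect foundation models to the artifact-centric, human-in-the-loop workflows where GIS practitioners actually work. Despite advances in satellite image captioning, visual question answering, and promptable segmentation, these capabilities have not translated into productivity gains for practitioners who spend most of their time producing vector layers, raster maps, and cartographic products. The gap is not model capability alone but the absence of an agency layer that supports iterative collaboration. We propose a vocabulary of $9$ primitives for such a layer -- including navigation, perception, geo-referenced memory, and dual modeling -- along with a benchmark that measures human productivity. Our goal is a vocabulary that makes agentic assistance in GIS implementable, testable, and comparable.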

[137] A3R: Agentic Affordance Reasoning via Cross-Dimensional Evidence in 3D Gaussian Scenes

Di Li,Jie Feng,Guanbin Li,Ronghua Shang,Yuhui Zheng,Weisheng Dong,Guangming Shi

Main category: cs.CV

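TL;DR: This paper proposes A3R, which recasts fine-grained affordance reasoning as a sequential evidence acquisition process. Ambiguity is progressively reduced through complementary 3D geometric and 2D semantic evidence, and a GRPO-based policy learning strategy improves efficiency and accuracy; A3R consistently outperforms static one-shot baselines in complex 3D Gaussian scenes.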

Details Motivation: Existing methods treat affordance reasoning as one-shot prediction from static observations, but in complex 3D scenes they often fail because task-relevant evidence is incomplete under fixed observations, not because the model's prediction capacity is weak. Method: A3R is an agentic framework with an MLLM-based policy that iteratively selects evidence acquisition actions and fuses 3D geometric and 2D semantic evidence to update the affordance belief; a GRPO-based policy learning strategy improves decision efficiency and reasoning accuracy. Result: On scene-level benchmarks, A3R consistently surpasses static one-shot baselines on fine-grained affordance reasoning. Conclusion: Sequential, cross-dimensional, active evidence acquisition suits fine-grained affordance reasoning in complex 3D Gaussian scenes better than static one-shot inference, supporting the agentic reasoning paradigm. Abstract: Affordance reasoning in 3D Gaussian scenes aims to identify the region that supports the action specified by a given text instruction in complex environments. Existing methods typically cast this problem as one-shot prediction from static scene observations, assuming sufficient evidence is already available for reasoning. However, in complex 3D scenes, many failure cases arise not from weak prediction capacity, but from incomplete task-relevant evidence under fixed observations. To address this limitation, we reformulate fine-grained affordance reasoning as a sequential evidence acquisition process, where ambiguity is progressively reduced through complementary 3D geometric and 2D semantic evidence. Building on this formulation, we propose A3R, an agentic affordance reasoning framework that enables an MLLM-based policy to iteratively select evidence acquisition actions and update the affordance belief through cross-dimensional evidence acquisition. To optimize such sequential decision making, we further introduce a GRPO-based policy learning strategy that improves evidence acquisition efficiency and reasoning accuracy. Extensive experiments on scene-level benchmarks show that A3R consistently surpasses static one-shot baselines, demonstrating the advantage of agentic cross-dimensional evidence acquisition for fine-grained affordance reasoning in complex 3D Gaussian scenes.

[138] GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting

Xianben Yang,Tao Wang,Yuxuan Li,Yi Jin,Haibin Ling

Main category: cs.CV

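TL;DR: This paper proposes GS^2, which optimizes the spatial distribution of Gaussian points with a graph-based module. It achieves higher PSNR than 3DGS with only about 12.5% of the Gaussian points, improving both rendering quality and memory efficiency.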

Details Motivation: 3D Gaussian Splatting (3DGS) performs very well in novel view synthesis and real-time rendering, but the huge number of Gaussian points makes its memory cost high; existing pruning methods tend to break spatial consistency and introduce rendering artifacts. Method: GS^2, a graph-based framework for optimizing the spatial distribution of Gaussian points, comprises: 1) an adaptive densification strategy based on the evidence lower bound (ELBO); 2) an opacity-aware progressive pruning strategy; 3) a graph-based feature encoding module that adjusts point positions through feature-guided shifting. Result: With only about 12.5% of the Gaussian points, GS^2 reaches higher PSNR than the original 3DGS and outperforms all compared baselines in both rendering quality and memory efficiency. Conclusion: By jointly optimizing the spatial distribution and the number of Gaussian points, GS^2 substantially compresses the representation while maintaining or improving reconstruction quality. Abstract: 3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS^2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing low-opacity Gaussian points. Furthermore, we propose a graph-based feature encoding module to adjust the spatial distribution via feature-guided point shifting. Extensive experiments validate that GS^2 achieves a compact Gaussian representation while delivering superior rendering quality. Compared with 3DGS, it achieves higher PSNR with only about 12.5\% Gaussian points. Furthermore, it outperforms all compared baselines in both rendering quality and memory efficiency.

[139] Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters

Ahmed B Mustafa,Zihan Ye,Yang Lu,Michael P Pound,Shreyank N Gowda

Main category: cs.CV

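TL;DR: This paper exposes the fragility of safety filtering in current text-to-image models. It studies natural-language prompt attacks that need no model access or optimization, introduces a taxonomy of visual jailbreak techniques, and reports attack success rates of up to 74.47%.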

Details Motivation: Text-to-image models rely on safety filters to prevent misuse, but how well these filters actually protect is unclear; their robustness to jailbreak attacks at the natural-language level needs systematic evaluation. Method: Several prompt-based visual jailbreak strategies (such as artistic reframing, material substitution, and pseudo-educational framing) are proposed and systematically studied; they bypass safety filters through natural-language rewording alone, without model access, optimization, or adversarial training, and are evaluated empirically on several state-of-the-art models. Result: The strategies are effective across several mainstream text-to-image models, with an attack success rate (ASR) of up to 74.47%, exposing weaknesses in the semantic understanding of current prompt filtering and visual safety mechanisms. Conclusion: Surface-level prompt filtering is not enough to counter adversarial intent at the semantic level; safety systems need a deeper understanding of context and implicit intent. Abstract: Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy-violating content. In this work we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks across several state-of-the-art text-to-image systems and demonstrate that simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery. Our findings highlight a critical gap between surface-level prompt filtering and the semantic understanding required to detect adversarial intent in generative media systems. Across all tested models and attack categories we observe an attack success rate (ASR) of up to 74.47%.

[140] ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery

Ke Li,Ting Wang,Di Wang,Yongshan Zhu,Yiming Zhang,Tao Lei,Quan Wang

Main category: cs.CV

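TL;DR: This paper proposes ProVG, a framework that decouples language expressions into global context, spatial relations, and object attributes, and uses a progressive cross-modal modulator for coarse-to-fine vision-language alignment, improving the accuracy of language-guided object localization in remote sensing imagery.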

Details Motivation: Existing methods rely on sentence-level vision-language alignment and struggle to exploit fine-grained linguistic cues such as spatial relations and object attributes; these cues play different roles at different grounding stages and should be used accordingly. Method: ProVG decouples language expressions into global context, spatial relations, and object attributes; a progressive cross-modal modulator follows a survey-locate-verify scheme; a cross-scale fusion module and a language-guided calibration decoder are added; a unified multi-task head supports both referring expression comprehension and segmentation. Result: ProVG consistently outperforms existing methods on the RRSIS-D and RISBench benchmarks, reaching new state-of-the-art performance. Conclusion: Fine-grained language decoupling combined with progressive cross-modal alignment improves the accuracy and robustness of remote sensing visual grounding. Abstract: Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit fine-grained linguistic cues, such as \textit{spatial relations} and \textit{object attributes}, that are crucial for distinguishing objects with similar characteristics. Importantly, these cues play distinct roles across different grounding stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose \textbf{ProVG}, a novel RSVG framework that improves localization accuracy by decoupling language expressions into global context, spatial relations, and object attributes. To integrate these linguistic cues, ProVG employs a simple yet effective progressive cross-modal modulator, which dynamically modulates visual attention through a \textit{survey-locate-verify} scheme, enabling coarse-to-fine vision-language alignment. In addition, ProVG incorporates a cross-scale fusion module to mitigate the large-scale variations in remote sensing imagery, along with a language-guided calibration decoder to refine cross-modal alignment during prediction. A unified multi-task head further enables ProVG to support both referring expression comprehension and segmentation tasks. Extensive experiments on two benchmarks, \textit{i.e.}, RRSIS-D and RISBench, demonstrate that ProVG consistently outperforms existing methods, achieving new state-of-the-art performance.

[141] SHARC: Reference point driven Spherical Harmonic Representation for Complex Shapes

Panagiotis Sapoutzoglou,George Terzakis,Maria Pateraki

Main category: cs.CV

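TL;DR: SHARC is a new framework built on Spherical Harmonic (SH) representations of distance fields. It places reference points at optimized locations inside the object and samples the visible distance field from each, synthesizing shapes of arbitrary topology with high fidelity, efficiency, and parsimony.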

Details Motivation: Existing methods struggle to balance reconstruction accuracy, efficiency, and model parsimony, and have limited ability to model complex geometric detail. Method: SHARC selects reference points inside the object with a cost that jointly accounts for sparsity, centrality, and surface visibility; for each point it samples the visible distance field by ray-casting and computes SH coefficients with the Fast Spherical Harmonic Transform (FSHT); it then applies a configurable low-pass filter and a proximity-based local consistency refinement. Result: SHARC outperforms state-of-the-art methods in both reconstruction accuracy and time efficiency while keeping the model parsimonious. Conclusion: SHARC offers a high-fidelity, efficient, and parameter-economical approach to synthesizing shapes of arbitrary topology. Abstract: We propose SHARC, a novel framework that synthesizes arbitrary, genus-agnostic shapes by means of a collection of Spherical Harmonic (SH) representations of distance fields. These distance fields are anchored at optimally placed reference points in the interior volume of the surface in a way that maximizes learning of the finer details of the surface. To achieve this, we employ a cost function that jointly maximizes sparsity and centrality in terms of positioning, as well as visibility of the surface from their location. For each selected reference point, we sample the visible distance field to the surface geometry via ray-casting and compute the SH coefficients using the Fast Spherical Harmonic Transform (FSHT). To enhance geometric fidelity, we apply a configurable low-pass filter to the coefficients and refine the output using a local consistency constraint based on proximity. Evaluation of SHARC against state-of-the-art methods demonstrates that the proposed method outperforms existing approaches in both reconstruction accuracy and time efficiency without sacrificing model parsimony. The source code is available at https://github.com/POSE-Lab/SHARC.
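
A low-pass filter on SH coefficients can be sketched as a per-degree gain. The abstract does not specify SHARC's filter, so the Gaussian-style attenuation exp(-sigma * l * (l + 1)) used here is an assumption (a common choice for smoothing functions on the sphere), and the function name and the coefficient layout are hypothetical.

```python
import math


def lowpass_sh(coeffs_by_degree, sigma):
    """Attenuate SH coefficients by degree.

    coeffs_by_degree[l] holds the 2l+1 coefficients of degree l. Each band is
    scaled by exp(-sigma * l * (l + 1)), so degree 0 is untouched and higher
    degrees (finer angular detail) are damped more strongly.
    """
    filtered = []
    for degree, band in enumerate(coeffs_by_degree):
        gain = math.exp(-sigma * degree * (degree + 1))
        filtered.append([gain * c for c in band])
    return filtered


coeffs = [[1.0], [0.5, -0.5, 0.25]]
smoothed = lowpass_sh(coeffs, sigma=0.1)
```

Here `sigma` plays the role of the configurable knob: `sigma = 0` leaves the coefficients unchanged, and larger values suppress high-degree bands more aggressively.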

[142] FTPFusion: Frequency-Aware Infrared and Visible Video Fusion with Temporal Perturbation

Xilai Li,Chusheng Fang,Xiaosong Li

Main category: cs.CV

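TL;DR: This paper proposes FTPFusion, an infrared and visible video fusion method based on frequency awareness, temporal perturbation, and sparse cross-modal interaction, which aims to improve spatial detail fidelity and temporal stability at the same time.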

Details Motivation: Existing methods struggle to preserve spatial detail while maintaining temporal stability: they either focus on frame-wise enhancement with limited temporal modeling, or rely on heavy spatio-temporal aggregation that sacrifices high-frequency detail. Method: FTPFusion decomposes features into high-frequency and low-frequency components. The high-frequency branch uses sparse cross-modal spatio-temporal interaction to capture motion context and complementary detail; the low-frequency branch introduces a temporal perturbation strategy for robustness against flickering, jitter, and local misalignment; an offset-aware temporal consistency constraint stabilizes cross-frame representations. Result: On multiple public benchmarks, FTPFusion consistently outperforms state-of-the-art methods across metrics of spatial fidelity and temporal consistency. Conclusion: By combining frequency decoupling, temporal perturbation, and sparse interaction, FTPFusion offers an approach to infrared-visible video fusion that balances detail and stability. Abstract: Infrared and visible video fusion plays a critical role in intelligent surveillance and low-light monitoring. However, maintaining temporal stability while preserving spatial detail remains a fundamental challenge. Existing methods either focus on frame-wise enhancement with limited temporal modeling or rely on heavy spatio-temporal aggregation that often sacrifices high-frequency details. In this paper, we propose FTPFusion, a frequency-aware infrared and visible video fusion method based on temporal perturbation and sparse cross-modal interaction. Specifically, FTPFusion decomposes the feature representations into high-frequency and low-frequency components for collaborative modeling. The high-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion-related context and complementary details. The low-frequency branch introduces a temporal perturbation strategy to enhance robustness against complex video variations, such as flickering, jitter, and local misalignment. Furthermore, we design an offset-aware temporal consistency constraint to explicitly stabilize cross-frame representations under temporal disturbances. Extensive experiments on multiple public benchmarks demonstrate that FTPFusion consistently outperforms state-of-the-art methods across multiple metrics in both spatial fidelity and temporal consistency. The source code will be available at https://github.com/ixilai/FTPFusion.
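
The split into low-frequency and high-frequency components can be illustrated on a one-dimensional signal. The abstract does not name the decomposition operator FTPFusion uses, so the moving-average low-pass filter below is a stand-in chosen for simplicity; the function name and window size are hypothetical.

```python
def split_frequencies(signal, window=3):
    """Split a 1D signal into a smooth part and a detail residual.

    low  = moving average over `window` samples (truncated at the borders)
    high = signal - low, so low + high reconstructs the input exactly.
    """
    half = window // 2
    low = []
    for i in range(len(signal)):
        start, stop = max(0, i - half), min(len(signal), i + half + 1)
        low.append(sum(signal[start:stop]) / (stop - start))
    high = [s - l for s, l in zip(signal, low)]
    return low, high


low, high = split_frequencies([1, 1, 4, 1, 1])
# The spike at index 2 mostly ends up in `high`; `low` keeps the smooth trend.
```

Because the residual is defined as the difference, nothing is lost in the split, which lets the two branches be processed separately and recombined afterwards.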

[143] Light-ResKAN: A Parameter-Sharing Lightweight KAN with Gram Polynomials for Efficient SAR Image Recognition

Pan Yi,Weijie Li,Xiaodong Chen,Jiehua Zhang,Li Liu,Yongxiang Liu

Main category: cs.CV

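TL;DR: This paper proposes Light-ResKAN, a lightweight model based on the Kolmogorov-Arnold Network (KAN) for SAR image recognition on resource-constrained edge devices. It replaces the convolutions in ResNet with KAN convolutions, uses Gram polynomials as activations, and shares kernel parameters per channel, achieving a better balance between accuracy and efficiency. It reaches high accuracy on several SAR datasets while greatly reducing FLOPs and parameter count.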

Details Motivation: Large SAR image sizes hinder the deployment of deep learning on edge devices, and existing lightweight models struggle to combine high-precision feature extraction with low computational cost. Method: Light-ResKAN 1) replaces the standard convolutions in ResNet with KAN convolutions; 2) uses Gram polynomials as learnable activations; 3) applies a channel-wise parameter-sharing strategy to reduce parameters and computation. Result: The model reaches 99.09%, 93.01%, and 97.26% accuracy on MSTAR, FUSAR-Ship, and SAR-ACD, respectively; on MSTAR resized to 1024×1024 it reduces FLOPs by 82.90× and parameters by 163.78× compared with VGG16. Conclusion: Light-ResKAN provides an efficient, accurate solution for SAR image recognition at the edge and shows the potential of KAN architectures for remote sensing image processing. Abstract: Synthetic Aperture Radar (SAR) image recognition is vital for disaster monitoring, military reconnaissance, and ocean observation. However, large SAR image sizes hinder deep learning deployment on resource-constrained edge devices, and existing lightweight models struggle to balance high-precision feature extraction with low computational requirements. The emerging Kolmogorov-Arnold Network (KAN) enhances fitting by replacing fixed activations with learnable ones, reducing parameters and computation. Inspired by KAN, we propose Light-ResKAN to achieve a better balance between precision and efficiency. First, Light-ResKAN modifies ResNet by replacing convolutions with KAN convolutions, enabling adaptive feature extraction for SAR images. Second, we use Gram Polynomials as activations, which are well-suited for SAR data to capture complex non-linear relationships. Third, we employ a parameter-sharing strategy: each kernel shares parameters per channel, preserving unique features while reducing parameters and FLOPs. Our model achieves 99.09%, 93.01%, and 97.26% accuracy on MSTAR, FUSAR-Ship, and SAR-ACD datasets, respectively. Experiments on MSTAR resized to $1024 \times 1024$ show that compared to VGG16, our model reduces FLOPs by $82.90 \times$ and parameters by $163.78 \times$. This work establishes an efficient solution for edge SAR image recognition.

[144] Lifting Unlabeled Internet-level Data for 3D Scene Understanding

Yixin Chen,Yaowei Zhang,Huangyue Yu,Junchao He,Yan Wang,Jiangyong Huang,Hongyu Shen,Junfeng Ni,Shaofei Wang,Baoxiong Jia,Song-Chun Zhu,Siyuan Huang

Main category: cs.CV

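TL;DR: This paper uses the large supply of unlabeled web videos, processed by carefully designed data engines, to automatically generate training data for 3D scene understanding. This compensates for the scarcity of annotated data, and the approach is validated on several 3D perception and reasoning tasks.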

Details Motivation: Annotated 3D scene data is scarce and expensive, while unlabeled videos are abundant on the internet; a method is needed to exploit this cheap resource to improve 3D scene understanding models. Method: Data engines for automated data generation are designed and analyzed, the key bottlenecks that determine the efficiency and effectiveness of learning from unlabeled data are identified, and the approach is validated on 3D perception and reasoning tasks of different granularity (3D object detection, instance segmentation, 3D spatial VQA, and VLN). Result: Models trained on the generated data show strong zero-shot performance and improve further after finetuning, which supports the viability of using web video data for 3D scene understanding. Conclusion: Abundant unlabeled web videos can be turned into useful training data by suitable data engines, offering a viable path toward more capable 3D scene understanding systems. Abstract: Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.

[145] Night Eyes: A Reproducible Framework for Constellation-Based Corneal Reflection Matching

Virmarie Maquiling,Yasmeen Abdrabou,Enkelejda Kasneci

Main category: cs.CV

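TL;DR: This paper proposes a multi-glint detection and matching framework driven by 2D geometry and inspired by star identification. It emphasizes reproducibility and clear evaluation, and uses the SLA procedure to obtain robust, identity-preserving correspondences.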

Details Motivation: Corneal reflection (glint) detection is usually handled with hardware-specific heuristics, which makes results hard to reproduce across platforms; a unified, evaluable framework for multi-glint processing is missing. Method: A geometry-driven pipeline based on constellation structure treats glints as a structured set with spatial relations rather than as isolated blobs; the Similarity-Layout Alignment (SLA) procedure combines controlled over-detection, adaptive candidate fallback, appearance-aware scoring, and optional semantic layout priors, and keeps detection and matching explicitly separate. Result: On a public multi-LED eye-tracking dataset, the system provides stable, identity-preserving glint correspondence under noisy conditions; code, presets, and evaluation scripts are released to support transparent replication and annotation. Conclusion: The framework improves the robustness, reproducibility, and evaluability of multi-glint detection, providing a more reliable, modular building block for P-CR eye tracking. Abstract: Corneal reflection (glint) detection plays an important role in pupil-corneal reflection (P-CR) eye tracking, but in practice it is often handled as heuristics embedded within larger systems, making reproducibility difficult across hardware setups. We introduce a 2D geometry-driven, constellation-based pipeline for multi-glint detection and matching, focusing on reproducibility and clear evaluation. Inspired by lost-in-space star identification, we treat glints as structured constellations rather than independent blobs. We propose a Similarity-Layout Alignment (SLA) procedure which adapts constellation matching to the specific constraints of multi-LED eye tracking. The framework brings together controlled over-detection, adaptive candidate fallback, appearance-aware scoring, and optional semantic layout priors while keeping detection and correspondence explicitly separated. Evaluated on a public multi-LED dataset, the system provides stable identity-preserving correspondence under noisy conditions. We release code, presets, and evaluation scripts to enable transparent replication, comparison, and dataset annotation.

[146] Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts

Yifan Gao,Tao Zhou,Yi Zhou,Ke Zou,Yizhe Zhang,Huazhu Fu

Main category: cs.CV

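TL;DR: This paper proposes KnowMVG, a framework that improves the spatial precision of medical visual grounding (MVG) through knowledge-enhanced prompting and a global-local attention mechanism, outperforming existing methods on several benchmarks.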

Details Motivation: Existing vision-language models (VLMs) lack spatial precision in medical visual grounding, mainly because they rely only on latent embeddings and have no explicit localization priors. Method: KnowMVG comprises: 1) a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings; 2) a global-local attention mechanism that combines coarse global information with fine local cues to guide precise localization. Result: On four MVG benchmarks, KnowMVG gains 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods; ablation and qualitative studies validate each component. Conclusion: Explicit knowledge priors and hierarchical attention help bridge high-level semantic understanding and fine-grained visual perception, improving the interpretability and clinical usefulness of MVG. Abstract: Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks sufficient spatial precision, largely due to a lack of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.
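
Both reported metrics rest on intersection over union (IoU). The sketch below assumes axis-aligned bounding boxes in (x1, y1, x2, y2) form; the abstract does not say whether KnowMVG is scored on boxes or masks, and a full AP50 computation also ranks predictions by confidence, which is omitted here. The function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def mean_iou_and_hit_rate(preds, gts, threshold=0.5):
    """Mean IoU over pairs, plus the fraction of pairs with IoU >= threshold."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    hits = sum(1 for v in ious if v >= threshold)
    return sum(ious) / len(ious), hits / len(ious)


preds = [(0, 0, 2, 2), (1, 0, 3, 2)]
gts = [(0, 0, 2, 2), (0, 0, 2, 2)]
miou, hit_rate = mean_iou_and_hit_rate(preds, gts)
```

The 0.5 threshold is what the "50" in AP50 refers to: a localization counts as correct only when it overlaps the reference region by at least half of their union.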

[147] Learning Spatial Structure from Pre-Beamforming Per-Antenna Range-Doppler Radar Data via Visibility-Aware Cross-Modal Supervision

George Sebastian,Philipp Berthold,Bianca Forkel,Leon Pohl,Mirko Maehlisch

Main category: cs.CV

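TL;DR: This paper investigates whether meaningful spatial structure can be learned directly from pre-beamforming per-antenna range-Doppler (RD) data, without relying on conventional beamforming to build angle-domain representations. Experiments use a commodity automotive radar in an end-to-end, data-driven framework with a dual-chirp shared-weight encoder, and use BEV occupancy reconstruction under visibility-aware, cross-modal (LiDAR) supervision as a probe of geometric recoverability. The results indicate that spatial structure can be recovered without explicit angle-domain construction or hand-crafted signal processing.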

Details Motivation: Conventional automotive radar perception pipelines rely on beamforming to build angle-domain representations; this work asks whether that step can be bypassed by learning spatial structure directly from raw per-antenna RD data, thereby simplifying the pipeline and improving robustness and generalization. Method: Uses a commodity CS-FMCW radar with 6 TX and 8 RX (48 virtual antennas) and an A/B chirp sequence that varies the transmit aperture in a controlled way; trains a dual-chirp shared-weight encoder end to end on pre-beamforming per-antenna RD tensors; supervises with LiDAR-derived, visibility-aware, occlusion-modeled BEV occupancy maps; evaluates geometric recoverability through chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines. Result: Pre-beamforming per-antenna RD tensors are sufficient for learning spatial structure; the chirp configuration (e.g., single-TX vs. multi-TX) significantly affects geometric recoverability; BEV occupancy reconstruction quality confirms that effective spatial perception is possible without explicit angle-domain construction. Conclusion: Spatial structure can be learned directly from raw pre-beamforming RD data, without conventional beamforming or hand-crafted signal-processing modules, offering a simpler, data-driven paradigm for radar perception. Abstract: Automotive radar perception pipelines commonly construct angle-domain representations via beamforming before applying learning-based models. This work instead investigates a representational question: can meaningful spatial structure be learned directly from pre-beamforming per-antenna range-Doppler (RD) measurements? Experiments are conducted on a 6-TX x 8-RX (48 virtual antennas) commodity automotive radar employing an A/B chirp-sequence frequency-modulated continuous-wave (CS-FMCW) transmit scheme, in which the effective transmit aperture varies between chirps (single-TX vs. multi-TX), enabling controlled analysis of chirp-dependent transmit configurations. We operate on pre-beamforming per-antenna RD tensors using a dual-chirp shared-weight encoder trained in an end-to-end, fully data-driven manner, and evaluate spatial recoverability using bird's-eye-view (BEV) occupancy as a geometric probe rather than a performance-driven objective. Supervision is visibility-aware and cross-modal, derived from LiDAR with explicit modeling of the radar field-of-view and occlusion-aware LiDAR observability via ray-based visibility. Through chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines, we assess how transmit configurations affect geometric recoverability. The results indicate that spatial structure can be learned directly from pre-beamforming per-antenna RD tensors without explicit angle-domain construction or hand-crafted signal-processing stages.

[148] Rethinking Representations for Cross-Domain Infrared Small Target Detection: A Generalizable Perspective from the Frequency Domain

Yimin Fu,Songbo Wang,Feiyan Wu,Jialin Lyu,Zhunga Liu,Michael K. Ng

Main category: cs.CV

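TL;DR: This paper proposes S²CPNet, a spatial-spectral collaborative perception network for cross-domain infrared small target detection (IRSTD). Frequency-domain analysis reveals phase inconsistency as the primary manifestation of domain discrepancy, and a phase rectification module (PRM), an orthogonal attention mechanism (OAM), and selective style recomposition (SSR) are designed to improve generalization to unseen domains.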

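Details Motivation: Most existing IRSTD methods are confined to same-domain settings and cope poorly with the distribution shift between training and test data caused by changes in observational conditions and environment; in addition, infrared small targets have low signal-to-noise ratio and indistinct features, which encourages overfitting to source-domain patterns and severe degradation in cross-domain scenarios. Method: Proposes the spatial-spectral collaborative perception network S²CPNet: 1) re-models IRSTD representations from a frequency perspective and finds that domain discrepancy mainly appears as spectral phase inconsistency; 2) designs a phase rectification module (PRM) to strengthen generalizable target awareness; 3) introduces an orthogonal attention mechanism (OAM) in the skip connections to preserve positional information while refining features; 4) adopts selective style recomposition (SSR) to reduce bias toward domain-specific patterns. Result: Extensive cross-domain experiments on three IRSTD datasets show state-of-the-art performance across a variety of cross-domain settings. Conclusion: Phase modeling from a frequency perspective, combined with collaborative optimization, effectively improves the cross-domain generalization of IRSTD models, and S²CPNet offers a new approach to distribution shift in real deployments. Abstract: The accurate target-background separation in infrared small target detection (IRSTD) highly depends on the discriminability of extracted representations. However, most existing methods are confined to domain-consistent settings, while overlooking whether such discriminability can generalize to unseen domains. In practice, distribution shifts between training and testing data are inevitable due to variations in observational conditions and environmental factors. Meanwhile, the intrinsic indistinctiveness of infrared small targets aggravates overfitting to domain-specific patterns. Consequently, the detection performance of models trained on source domains can be severely degraded when deployed in unseen domains. To address this challenge, we propose a spatial-spectral collaborative perception network (S$^2$CPNet) for cross-domain IRSTD. Moving beyond conventional spatial learning pipelines, we rethink IRSTD representations from a frequency perspective and reveal inconsistencies in spectral phase as the primary manifestation of domain discrepancies. Based on this insight, we develop a phase rectification module (PRM) to derive generalizable target awareness. Then, we employ an orthogonal attention mechanism (OAM) in skip connections to preserve positional information while refining informative representations. Moreover, the bias toward domain-specific patterns is further mitigated through selective style recomposition (SSR).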
Extensive experiments have been conducted on three IRSTD datasets, and the proposed method consistently achieves state-of-the-art performance under diverse cross-domain settings.
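To make the amplitude/phase split behind the frequency-domain view concrete, here is a minimal 1D sketch using a naive discrete Fourier transform. It is not the paper's PRM. The helper names and the toy one-row "image" are assumptions for illustration; they only show that a signal can be decomposed into an amplitude spectrum and a phase spectrum, and that the phase carries positional information.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part (inputs here come from real signals)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def amplitude_phase(x):
    """Split a signal's spectrum into amplitude and phase components."""
    spectrum = dft(x)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def recompose(amplitude, phase):
    """Rebuild a signal from an amplitude spectrum and a phase spectrum."""
    return idft([cmath.rect(a, p) for a, p in zip(amplitude, phase)])


# Toy example: the same point target at two positions in a row of pixels.
row_a = [0, 0, 1, 0, 0, 0, 0, 0]
row_b = [0, 0, 0, 0, 0, 1, 0, 0]
amp_a, phase_a = amplitude_phase(row_a)
amp_b, phase_b = amplitude_phase(row_b)
# The amplitude spectra coincide; only the phase differs, so pairing the
# amplitude of row_a with the phase of row_b reproduces row_b.
moved = recompose(amp_a, phase_b)
```

In this toy case, shifting the target leaves the amplitude spectrum untouched and changes only the phase. That is one intuition for why phase-level inconsistencies can matter for localizing small targets.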

[149] Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm

Sixing Li,Zhibin Gu,Ziqi Zhang,Weiguo Pan,Bing Li,Ying Wang,Hongzhe Liu

Main category: cs.CV

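TL;DR: This paper introduces the ECAC benchmark dataset and the RSRS hybrid training framework to improve the fine-grained, domain-specific quality of image captioning for early childhood education, and develops the KinderMM-Cap-3B model, which substantially outperforms existing methods on professional metrics such as teaching toy recognition.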

Details Motivation: Existing image captioning methods face two major challenges in early childhood education (ECE): the lack of large-scale, domain-specific datasets leads to generic, imprecise descriptions; and conventional training paradigms (supervised learning or reinforcement learning) struggle to improve the description of professional objects. Method: Builds ECAC, a large-scale ECE image captioning benchmark (about 256K real-world images with expert annotations); designs the domain-oriented metric TTS (Teaching Toy Recognition Score); and proposes the RSRS hybrid training framework, which dynamically switches between reinforcement learning and supervised fine-tuning and reroutes hard samples with zero reward to supervised optimization to mitigate advantage collapse. Result: KinderMM-Cap-3B, developed with ECAC and RSRS, reaches a TTS of 51.06, substantially better than state-of-the-art baselines, while maintaining high caption quality. Conclusion: The ECAC dataset, the TTS evaluation protocol, and the RSRS training framework together advance specialized image captioning for early childhood education and demonstrate the effectiveness and potential of domain-adapted multimodal large models in educational applications. Abstract: Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model's ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, as supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC is further equipped with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), to explicitly measure professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model.
Extensive experiments demonstrate that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications.
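The reward-conditional switch described in the abstract can be pictured as a simple routing step. This is a sketch under assumptions: it supposes several rollouts per sample with group-relative (mean-subtracted) advantages, which the abstract does not spell out, and the names `route_batch` and `group_advantages` are hypothetical.

```python
def group_advantages(rewards):
    """Mean-subtracted advantages for one sample's rollouts (assumed RL setup)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def route_batch(sample_ids, rollout_rewards):
    """Split a batch into an RL subset and a supervised fine-tuning subset.

    A sample whose rollouts all earn zero reward has all-zero advantages and
    gives the RL update nothing to learn from, so it is sent to SFT instead.
    """
    rl_ids, sft_ids = [], []
    for sample_id, rewards in zip(sample_ids, rollout_rewards):
        if max(rewards) <= 0.0:
            sft_ids.append(sample_id)
        else:
            rl_ids.append(sample_id)
    return rl_ids, sft_ids


# Hypothetical batch: "img_a" fails on every rollout, "img_b" succeeds once.
rl_ids, sft_ids = route_batch(["img_a", "img_b"], [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

Under this reading, "advantage collapse" corresponds to `group_advantages` returning all zeros for a hard sample, and the switch keeps such samples contributing through the supervised loss.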

[150] A Self supervised learning framework for imbalanced medical imaging datasets

Yash Kumar Sharma,Charan Ramtej Kodi,Vineet Padmanabhan

Main category: cs.CV

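TL;DR: This paper proposes AMIMV, which builds asymmetric multi-image, multi-view pairs and combines them with self-supervised learning to address scarce labels and class imbalance in medical imaging, and validates its robustness and performance gains on several MedMNIST subsets.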

Details Motivation: Medical image analysis often faces two challenges, scarce labeled data and class imbalance; existing self-supervised learning methods ease data scarcity, but their robustness under class imbalance is under-studied. Method: Extends the authors' earlier MIMV method with a new augmentation strategy to build asymmetric multi-image, multi-view (AMIMV) pairs; evaluates AMIMV and eight representative SSL methods on 11 MedMNIST datasets under long-tailed distributions and limited supervision. Result: On MedMNIST, AMIMV improves over the baseline by 4.25% on retinaMNIST, 1.88% on tissueMNIST, and 3.1% on DermaMNIST; a data analysis confirms its robustness to varying degrees of class imbalance. Conclusion: AMIMV effectively handles data scarcity and class imbalance in medical imaging, clearly improving self-supervised classification under long-tailed distributions and providing new ideas and empirical support for research on SSL robustness in medicine. Abstract: Two problems often plague medical imaging analysis: 1) Non-availability of large quantities of labeled training data, and 2) Dealing with imbalanced data, i.e., abundant data are available for frequent classes, whereas data are highly limited for the rare class. Self supervised learning (SSL) methods have been proposed to deal with the first problem to a certain extent, but the issue of investigating the robustness of SSL to imbalanced data has rarely been addressed in the domain of medical image classification. In this work, we make the following contributions: 1) The MIMV method proposed by us in an earlier work is extended with a new augmentation strategy to construct asymmetric multi-image, multi-view (AMIMV) pairs to address both data scarcity and dataset imbalance in medical image classification. 2) We carry out a data analysis to evaluate the robustness of AMIMV under varying degrees of class imbalance in medical imaging. 3) We evaluate eight representative SSL methods on 11 medical imaging datasets (MedMNIST) under long-tailed distributions and limited supervision. Our experimental results on the MedMNIST dataset show an improvement of 4.25% on retinaMNIST, 1.88% on tissueMNIST, and 3.1% on DermaMNIST.

[151] MAVFusion: Efficient Infrared and Visible Video Fusion via Motion-Aware Sparse Interaction

Xilai Li,Weijun Jiang,Xiaosong Li,Yang Liu,Hongbin Wang,Tao Ye,Huafeng Li,Haishu Tan

Main category: cs.CV

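TL;DR: This paper proposes MAVFusion, an end-to-end infrared and visible video fusion framework whose motion-aware sparse interaction mechanism significantly improves computational efficiency while maintaining high-quality fusion results.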

Details Motivation: Most existing methods are designed for static images and cannot effectively handle frame-to-frame motion in video, while existing video fusion methods improve temporal consistency at a high computational cost. Method: Uses optical flow to identify dynamic regions in multi-modal sequences, adaptively applies costly cross-modal attention to those dynamic regions, and uses a lightweight weak interaction module for static background regions; decoupling the processing of dynamic and static regions balances efficiency and quality. Result: Achieves state-of-the-art performance on multiple infrared-visible video benchmarks, with an inference speed of 14.16 FPS at 640×480 resolution. Conclusion: MAVFusion greatly reduces computational complexity while preserving strong fusion quality and temporal consistency, offering an efficient and practical solution for real-time multi-modal video fusion. Abstract: Infrared and visible video fusion combines the object saliency from infrared images with the texture details from visible images to produce semantically rich fusion results. However, most existing methods are designed for static image fusion and cannot effectively handle frame-to-frame motion in videos. Current video fusion methods improve temporal consistency by introducing interactions across frames, but they often require high computational cost. To mitigate these challenges, we propose MAVFusion, an end-to-end video fusion framework featuring a motion-aware sparse interaction mechanism that enhances efficiency while maintaining superior fusion quality. Specifically, we leverage optical flow to identify dynamic regions in multi-modal sequences, adaptively allocating computationally intensive cross-modal attention to these sparse areas to capture salient transitions and facilitate inter-modal information exchange. For static background regions, a lightweight weak interaction module is employed to maintain structural and appearance integrity. By decoupling the processing of dynamic and static regions, MAVFusion simultaneously preserves temporal consistency and fine-grained details while significantly accelerating inference. Extensive experiments demonstrate that MAVFusion achieves state-of-the-art performance on multiple infrared and visible video benchmarks, achieving a speed of 14.16\,FPS at $640 \times 480$ resolution. The source code will be available at https://github.com/ixilai/MAVFusion.
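The routing idea, expensive attention only where something moves, can be sketched as a per-patch mask computed from optical-flow magnitude. The patch size, the threshold, and the function name below are illustrative assumptions; the abstract does not state how MAVFusion derives its dynamic regions from the flow field.

```python
import math


def dynamic_patch_mask(flow, patch=2, thresh=1.0):
    """Mark patches whose mean optical-flow magnitude exceeds a threshold.

    flow: H x W nested list of (dx, dy) displacement vectors.
    Returns a coarse boolean grid; True patches would be routed to the costly
    cross-modal attention, False patches to a lightweight module.
    """
    height, width = len(flow), len(flow[0])
    mask = []
    for py in range(0, height, patch):
        row = []
        for px in range(0, width, patch):
            magnitudes = [math.hypot(*flow[y][x])
                          for y in range(py, min(py + patch, height))
                          for x in range(px, min(px + patch, width))]
            row.append(sum(magnitudes) / len(magnitudes) > thresh)
        mask.append(row)
    return mask


# Hypothetical 4x4 flow field: only the top-left 2x2 block is moving.
flow = [[(0.0, 0.0)] * 4 for _ in range(4)]
for y in range(2):
    for x in range(2):
        flow[y][x] = (3.0, 4.0)
mask = dynamic_patch_mask(flow)
```

With most of a typical scene static, only a small share of patches ends up `True`, which is where the claimed efficiency gain would come from.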

[152] Automated Prostate Gland Segmentation in MRI Using nnU-Net

Pablo Rodriguez-Belenguer,Gloria Ribas,Javier Aquerreta Escribano,Rafael Moreno-Calatayud,Leonor Cerda-Alberich,Luis Marti-Bonmati

Main category: cs.CV

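TL;DR: This paper proposes a dedicated deep learning method based on the nnU-Net v2 framework that uses multimodal mpMRI (T2WI, DWI, ADC) to segment the prostate gland automatically. It achieves Dice scores of 0.96 in internal cross-validation and 0.82 in external validation, far above the general-purpose tool TotalSegmentator (0.15), and is released as a containerized inference tool.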

Details Motivation: Manual prostate delineation is time-consuming and subject to inter-observer variability, and general-purpose segmentation tools are not accurate enough for prostate-specific tasks. Method: Uses the nnU-Net v2 framework with multimodal learning over T2-weighted imaging, diffusion-weighted imaging (DWI), and apparent diffusion coefficient (ADC) maps; trains on 981 cases from the PI-CAI dataset and validates with 5-fold cross-validation and an external cohort of 54 patients from Hospital La Fe (Spain). Result: The mean Dice score is 0.96±0.00 in cross-validation and 0.82 on the external test set; compared with TotalSegmentator (Dice = 0.15), the method substantially improves segmentation accuracy, in particular by reducing under-segmentation. Conclusion: A task-specific, multimodal deep learning strategy is essential for prostate segmentation; the proposed method has potential for use in clinical research and has been containerized and released to support reproducibility and deployment. Abstract: Accurate segmentation of the prostate gland in multiparametric MRI (mpMRI) is a fundamental step for a wide range of clinical and research applications, including image registration, volume estimation, and radiomic analysis. However, manual delineation is time-consuming and subject to inter-observer variability, while general-purpose segmentation tools often fail to provide sufficient accuracy for prostate-specific tasks. In this work, we propose a dedicated deep learning-based approach for automatic prostate gland segmentation using the nnU-Net v2 framework. The model leverages multimodal mpMRI data, including T2-weighted imaging, diffusion-weighted imaging (DWI), and apparent diffusion coefficient (ADC) maps, to exploit complementary tissue information. Training was performed on 981 cases from the PI-CAI dataset using whole-gland annotations, and model performance was assessed through 5-fold cross-validation and external validation on an independent cohort of 54 patients from Hospital La Fe. The proposed model achieved a mean Dice score of 0.96 +/- 0.00 in cross-validation and 0.82 on the external test set, demonstrating strong generalization despite domain shift. In comparison, a general-purpose approach (TotalSegmentator) showed substantially lower performance, with a Dice score of 0.15, primarily due to under-segmentation of the gland. These results highlight the importance of task-specific, multimodal segmentation strategies and demonstrate the potential of the proposed approach for reliable integration into clinical research workflows.
To facilitate reproducibility and deployment, the model has been fully containerized and is available as a ready-to-use inference tool.
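For readers less familiar with the metric, the Dice score quoted above is twice the overlap divided by the total size of the two masks. The sketch below computes it for flattened binary masks; the toy masks are made up and only illustrate how under-segmentation depresses the score.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary masks (0/1 values)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 2.0 * intersection / total if total else 1.0


# Hypothetical 8-voxel gland: one prediction covers it well, one under-segments.
gland = [1, 1, 1, 1, 1, 1, 1, 1]
good_pred = [1, 1, 1, 1, 1, 1, 1, 0]
under_pred = [1, 0, 0, 0, 0, 0, 0, 0]
good = dice_score(good_pred, gland)
under = dice_score(under_pred, gland)
```

A prediction that captures only a small part of the gland scores low even when every predicted voxel is correct, which matches the abstract's explanation for the general-purpose tool's low Dice score.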

[153] Ego-Grounding for Personalized Question-Answering in Egocentric Videos

Junbin Xiao,Shenglang Zhang,Pengxiang Zhu,Angela Yao

Main category: cs.CV

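TL;DR: This paper presents MyEgo, the first benchmark dataset for personalized question answering in egocentric videos, and systematically evaluates multimodal large language models (MLLMs) on ego-grounding and long-term memory, finding that existing models fall far short of human performance and that neither model scale nor reasoning mechanisms bring consistent gains.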

Details Motivation: Existing MLLMs lack the ability to understand, remember, and reason about the camera wearer (the "I") in egocentric videos, so a dedicated dataset and systematic evaluation are needed to advance personalized question answering. Method: Builds MyEgo, the first egocentric VideoQA dataset of its kind (541 long videos, 5K personalized questions), covering three question types: "my things", "my activities", and "my past"; benchmarks a range of mainstream MLLMs (open-source and proprietary, thinking and non-thinking, different parameter counts) and analyzes factors such as evidence provision and temporal decay. Result: The top proprietary (GPT-5) and open-source (Qwen3-VL) models reach only about 46% and 36% accuracy, well below human level (gaps of roughly 40% to 50%); explicit reasoning and model scaling bring no consistent improvement; providing relevant evidence helps in the short term, but performance drops quickly over time. Conclusion: Ego-grounding and long-range memory are the key bottlenecks for personalized QA in egocentric videos; MyEgo provides an important benchmark and analytical basis for this direction and may advance embodied AI for personal assistance. Abstract: We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only ~46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo

[154] SDesc3D: Towards Layout-Aware 3D Indoor Scene Generation from Short Descriptions

Jie Feng,Jiawei Shen,Junjia Huang,Junpeng Zhang,Mingtao Feng,Weisheng Dong,Guanbin Li

Main category: cs.CV

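TL;DR: This paper proposes the SDesc3D framework, which uses multi-view structural prior augmentation and functionality-aware layout grounding to improve the physical plausibility and detail richness of 3D indoor scene generation guided by short text.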

Details Motivation: Existing text-conditioned 3D scene generation methods show poor physical plausibility and insufficient detail under short-text conditions, mainly because they depend heavily on explicit semantic and spatial-relation cues and lack effective 3D reasoning capabilities such as prior integration and spatial anchoring. Method: Proposes the SDesc3D framework: 1) multi-view scene prior augmentation, which maps sparse text to multi-view relational priors; 2) functionality-aware layout grounding, which uses regional functionality to define spatial anchors implicitly and reasons about layout hierarchically; 3) an iterative reflection-rectification mechanism for progressive refinement of structural plausibility. Result: Clearly outperforms existing methods on short-text conditioned 3D indoor scene generation, improving physical plausibility and the richness of semantic detail. Conclusion: Combining multi-view structural priors with functional semantic grounding effectively strengthens 3D spatial reasoning under sparse textual guidance, providing a more robust and detailed generation approach for interactive 3D environment construction. Abstract: 3D indoor scene generation conditioned on short textual descriptions provides a promising avenue for interactive 3D environment construction without the need for labor-intensive layout specification. Despite recent progress in text-conditioned 3D scene generation, existing works suffer from poor physical plausibility and insufficient detail richness in such semantic condensation cases, largely due to their reliance on explicit semantic cues about compositional objects and their spatial relationships. This limitation highlights the need for enhanced 3D reasoning capabilities, particularly in terms of prior integration and spatial anchoring. Motivated by this, we propose SDesc3D, a short-text conditioned 3D indoor scene generation framework that leverages multi-view structural priors and regional functionality implications to enable 3D layout reasoning under sparse textual guidance. Specifically, we introduce a Multi-view scene prior augmentation that enriches underspecified textual inputs with aggregated multi-view structural knowledge, shifting from inaccessible semantic relation cues to multi-view relational prior aggregation.
Building on this, we design a Functionality-aware layout grounding, employing regional functionality grounding for implicit spatial anchors and conducting hierarchical layout reasoning to enhance scene organization and semantic plausibility. Furthermore, an Iterative reflection-rectification scheme is employed for progressive structural plausibility refinement via self-rectification. Extensive experiments show that our method outperforms existing approaches on short-text conditioned 3D indoor scene generation. Code will be publicly available.

[155] NearID: Identity Representation Learning via Near-identity Distractors

Aleksandar Cvejic,Rameen Abdal,Abdelrahman Eldesokey,Bernard Ghanem,Peter Wonka

Main category: cs.CV

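TL;DR: This paper proposes the NearID framework, which introduces near-identity distractors to disentangle identity from background context, builds the NearID dataset and a strict evaluation protocol, exposes the fragility of existing vision encoders on identity tasks, and substantially improves identity-aware representations through a two-tier contrastive objective.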

Details Motivation: Existing vision encoders entangle object identity with background context in identity-related tasks, making representations and evaluation unreliable. Method: Proposes the NearID principle: place semantically similar but distinct instances on the same background to remove contextual shortcuts; builds the NearID dataset (19K identities, 316K matched-context distractors) and a strict margin-based evaluation protocol; designs a two-tier contrastive objective (same identity > NearID distractor > random negative) to learn identity-aware representations on a frozen backbone. Result: Pre-trained encoders score as low as 30.7% Sample Success Rate (SSR) under the NearID protocol; the proposed method raises SSR to 99.2%, improves part-level discrimination by 28.0%, and aligns better with human judgments on DreamBench++. Conclusion: NearID provides the first principled framework for evaluating and learning identity-aware representations, effectively disentangling identity from background and improving the reliability and evaluability of models in tasks such as personalized generation and editing. Abstract: When evaluating identity-focused tasks such as personalized generation and image editing, existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics. We introduce the first principled framework to address this vulnerability using Near-identity (NearID) distractors, where semantically similar but distinct instances are placed on the exact same background as a reference image, eliminating contextual shortcuts and isolating identity as the sole discriminative signal. Based on this principle, we present the NearID dataset (19K identities, 316K matched-context distractors) together with a strict margin-based evaluation protocol. Under this setting, pre-trained encoders perform poorly, achieving Sample Success Rates (SSR), a strict margin-based identity discrimination metric, as low as 30.7% and often ranking distractors above true cross-view matches. We address this by learning identity-aware representations on a frozen backbone using a two-tier contrastive objective enforcing the hierarchy: same identity > NearID distractor > random negative. This improves SSR to 99.2%, enhances part-level discrimination by 28.0%, and yields stronger alignment with human judgments on DreamBench++, a human-aligned benchmark for personalization. Project page: https://gorluxor.github.io/NearID/
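The hierarchy "same identity > NearID distractor > random negative" and the margin-based success criterion can be sketched with cosine similarities. The exact SSR definition and loss form are not given in the abstract, so both functions below are assumptions: a sample counts as a success only if the true match beats every distractor by a margin, and the two-tier objective is written as a pair of hinge terms.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sample_success(anchor, positive, distractors, margin=0.0):
    """Assumed SSR-style check: the positive must beat every distractor by `margin`."""
    s_pos = cosine(anchor, positive)
    return all(s_pos - cosine(anchor, d) > margin for d in distractors)


def two_tier_hinge(s_pos, s_near, s_rand, margin=0.1):
    """Assumed hinge form of the ordering: s_pos > s_near > s_rand, each by `margin`."""
    return (max(0.0, margin - (s_pos - s_near))
            + max(0.0, margin - (s_near - s_rand)))


# Hypothetical 2D embeddings: a cross-view match and two candidate distractors.
anchor, positive = (1.0, 0.0), (0.9, 0.1)
easy_distractor, hard_distractor = (0.5, 0.5), (1.0, 0.05)
easy_ok = sample_success(anchor, positive, [easy_distractor], margin=0.1)
hard_ok = sample_success(anchor, positive, [easy_distractor, hard_distractor], margin=0.1)
```

A distractor sharing the anchor's background would sit close to it in an entangled embedding space, like `hard_distractor` here, and a single such case is enough to fail the sample under a strict criterion.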

[156] Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation

Yuqing Huang,Guotian Zeng,Zhenqiao Yuan,Zhenyu He,Xin Li,Yaowei Wang,Ming-Hsuan Yang

Main category: cs.CV

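TL;DR: This paper proposes Interactive Tracking, a new paradigm in which natural language commands enable human-machine collaborative tracking, and builds InteractTrack, the first large-scale benchmark, along with an evaluation protocol and the IMAT baseline.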

Details Motivation: Existing visual trackers are non-interactive, adapt poorly to real-world scenarios that need human intervention, and lack human-machine collaboration. Method: Proposes the Interactive Tracking paradigm; builds InteractTrack, a large-scale benchmark of 150 videos with timestamped language instructions; designs a comprehensive evaluation protocol; and proposes IMAT, a baseline with a dynamic memory mechanism that learns from user feedback and updates tracking behavior in real time. Result: An evaluation of 25 mainstream trackers shows that their performance drops significantly in interactive scenarios and that strong results on conventional benchmarks do not transfer; IMAT shows stronger interactive adaptability. Conclusion: The InteractTrack benchmark, the evaluation protocol, and the IMAT baseline together lay a foundation for more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance. Abstract: Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. To support research in this direction, we make three main contributions. First, we present InteractTrack, the first large-scale benchmark for interactive tracking, containing 150 videos with dense bounding box annotations and timestamped language instructions. Second, we propose a comprehensive evaluation protocol and evaluate 25 representative trackers, showing that state-of-the-art methods fail in interactive scenarios; strong performance on conventional benchmarks does not transfer. Third, we introduce Interactive Memory-Augmented Tracking (IMAT), a new baseline that employs a dynamic memory mechanism to learn from user feedback and update tracking behavior accordingly. Our benchmark, protocol, and baseline establish a foundation for developing more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance. The full benchmark, tracking results, and analysis are available at https://github.com/NorahGreen/InteractTrack.git.

[157] Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models

Antoine Saporta,Baptiste Callard,Corentin Dancette,Julien Khlaut,Charles Corbière,Leo Butsanets,Amaury Prat,Pierre Manceron

Main category: cs.CV

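TL;DR: This paper presents Curia-2, a billion-parameter multimodal foundation model optimized for CT and MRI imaging, with an improved pre-training strategy and representation quality, and builds the CuriaBench evaluation benchmark with separate 2D and 3D tracks; it outperforms existing foundation models on vision tasks and is competitive with vision-language models on clinically complex tasks.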

Details Motivation: The rapid growth of medical imaging places an excessive workload on radiologists, and existing foundation models still have room for improvement in handling complex radiological volumes. Method: Building on the Curia framework, proposes Curia-2 with an improved pre-training strategy and better representation quality; scales the architecture to billion-parameter Vision Transformers for the first time; builds the two-track CuriaBench evaluation benchmark (2D slice-based and 3D volumetric). Result: Curia-2 outperforms all existing foundation models on vision-focused tasks and performs close to vision-language models on clinically complex tasks such as finding detection. Conclusion: Curia-2 substantially advances the scale, representational capacity, and evaluation rigor of medical imaging foundation models and pushes multimodal medical AI forward; model weights will be released. Abstract: The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training for CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively against vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.

[158] Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

Boyang Gong,Yu Zheng,Fanye Kong,Jie Zhou,Jiwen Lu

Main category: cs.CV

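TL;DR: This paper proposes Inertia-aware Visual Excitation (IVE), a training-free method that mitigates the inertia of visual attention in multimodal large language models (MLLMs), improving their performance on cognitive reasoning tasks and in particular reducing cognitive hallucinations.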

Details Motivation: Existing hallucination mitigation methods mainly target perceptual hallucinations (such as errors in object existence or attributes) and struggle with cognitive hallucinations that require reasoning about relations between objects; the authors find that visual attention in MLLMs becomes static after early decoding steps ("visual inertia"), which hinders compositional understanding and cognitive inference. Method: Identifies visual inertia through token-wise attention analysis and proposes the training-free IVE method: 1) models cognitive inference as the dynamic responsiveness of attention; 2) selects visual tokens that are dynamically emerging relative to historical attention trends and distinguishes tokens with inertial behavior; 3) introduces an inertia-aware penalty that discourages over-concentration and persistent focus on localized regions. Result: IVE is effective across multiple base MLLMs and several hallucination benchmarks, with especially clear gains on cognitive hallucinations. Conclusion: Visual attention inertia is a key factor limiting cognitive inference in MLLMs; as a lightweight, training-free intervention, IVE breaks this inertia and strengthens the model's understanding of and reasoning about complex visual relations. Abstract: Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
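One way to picture "emerging relative to historical attention trends" is to keep a running average of each visual token's attention and flag tokens whose current attention rises well above it. This is a loose sketch, not the IVE algorithm. The exponential moving average, the `rise` ratio, and the function names are all assumptions made for illustration.

```python
def update_trend(trend, attention, momentum=0.9):
    """Exponential moving average of per-token attention over decoding steps."""
    return [momentum * t + (1.0 - momentum) * a for t, a in zip(trend, attention)]


def split_tokens(trend, attention, rise=1.5):
    """Separate tokens whose attention jumps above its trend from the rest.

    Returns (emerging, inertial) index lists; `rise` is a hypothetical ratio.
    """
    emerging = [i for i, (t, a) in enumerate(zip(trend, attention)) if a > rise * t]
    emerging_set = set(emerging)
    inertial = [i for i in range(len(attention)) if i not in emerging_set]
    return emerging, inertial


# Hypothetical attention over three visual tokens at one decoding step.
trend = [0.5, 0.3, 0.2]
attention = [0.45, 0.15, 0.40]
emerging, inertial = split_tokens(trend, attention)
new_trend = update_trend(trend, attention)
```

In this toy step, the third token's attention doubles relative to its history and is flagged as emerging, while the first token keeps roughly its usual share and is treated as inertial.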

[159] Resonance4D: Frequency-Domain Motion Supervision for Preset-Free Physical Parameter Learning in 4D Dynamic Physical Scene Simulation

Changshe Zhang,Jie Feng,Siyu Chen,Guanbin Li,Ronghua Shang,Junpeng Zhang

Main category: cs.CV

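TL;DR: This paper proposes Resonance4D, a framework that couples 3D Gaussian Splatting with the Material Point Method and introduces Dual-domain Motion Supervision (DMS) and a simulation-guided full-parameter physical recovery strategy. It substantially reduces compute and memory cost while maintaining physical realism and motion consistency, enabling high-quality 4D dynamic simulation on a single consumer-grade GPU.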

Details Motivation: Existing methods rely on costly video diffusion or optical-flow supervision and optimize only part of the material parameters, making it hard to model complex materials and dynamics. Method: Proposes the Resonance4D framework: 1) couples 3D Gaussian Splatting with the Material Point Method; 2) designs Dual-domain Motion Supervision (DMS), which combines spatial structural consistency with frequency-domain spectral consistency; 3) combines zero-shot text-prompted segmentation with simulation-guided initialization to decompose objects at the part level and jointly optimize the full set of material parameters. Result: Experiments on synthetic and real scenes confirm high physical fidelity and motion consistency; peak GPU memory drops from over 35 GB to around 20 GB, allowing the method to run on a single consumer-grade GPU. Conclusion: Resonance4D eases key bottlenecks in physics-driven 4D simulation, namely costly supervision and the loss of realism caused by simplified parameters, and offers a new paradigm for lightweight, high-fidelity dynamic simulation. Abstract: Physics-driven 4D dynamic simulation from static 3D scenes remains constrained by an overlooked contradiction: reliable motion supervision often relies on online video diffusion or optical-flow pipelines whose computational cost exceeds that of the simulator itself. Existing methods further simplify inverse physical modeling by optimizing only partial material parameters, limiting realism in scenes with complex materials and dynamics. We present Resonance4D, a physics-driven 4D dynamic simulation framework that couples 3D Gaussian Splatting with the Material Point Method through lightweight yet physically expressive supervision. Our key insight is that dynamic consistency can be enforced without dense temporal generation by jointly constraining motion in complementary domains. To this end, we introduce Dual-domain Motion Supervision (DMS), which combines spatial structural consistency for local deformation with frequency-domain spectral consistency for oscillatory and global dynamic patterns, substantially reducing training cost and memory overhead while preserving physically meaningful motion cues. To enable stable full-parameter physical recovery, we further combine zero-shot text-prompted segmentation with simulation-guided initialization to automatically decompose Gaussians into object-part-level regions and support joint optimization of full material parameters. Experiments on both synthetic and real scenes show that Resonance4D achieves strong physical fidelity and motion consistency while reducing peak GPU memory from over 35\,GB to around 20\,GB, enabling high-fidelity physics-driven 4D simulation on a single consumer-grade GPU.

[160] MTLSI-Net: A Linear Semantic Interaction Network for Parameter-Efficient Multi-Task Dense Prediction

Chen Liu,Hengyu Man,Xiaopeng Fan,Debin Zhao

Main category: cs.CV

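TL;DR: This paper proposes MTLSI-Net, which uses linear attention to achieve low-complexity global cross-task interaction in multi-task dense prediction and reaches state-of-the-art results on NYUDv2 and PASCAL-Context.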

Details Motivation: Standard self-attention has quadratic complexity on high-resolution features, which makes it hard to model global cross-task interactions in multi-task dense prediction. Method: Proposes MTLSI-Net with three core modules: a Multi-Task Multi-scale Query Linear Fusion Block (models cross-task dependencies with linear complexity using a shared global context matrix), a Semantic Token Distiller (compresses redundant features into compact semantic tokens), and a Cross-Window Integrated attention Block (a dual-branch structure that injects global semantics into local features). Result: Achieves state-of-the-art performance on the NYUDv2 and PASCAL-Context datasets, confirming the method's effectiveness and efficiency. Conclusion: MTLSI-Net models cross-task interactions comprehensively at linear complexity and with fewer parameters, offering an efficient, scalable paradigm for multi-task dense prediction. Abstract: Multi-task dense prediction aims to perform multiple pixel-level tasks simultaneously. However, capturing global cross-task interactions remains non-trivial due to the quadratic complexity of standard self-attention on high-resolution features. To address this limitation, we propose a Multi-Task Linear Semantic Interaction Network (MTLSI-Net), which facilitates cross-task interaction through linear attention. Specifically, MTLSI-Net incorporates three key components: a Multi-Task Multi-scale Query Linear Fusion Block, which captures cross-task dependencies across multiple scales with linear complexity using a shared global context matrix; a Semantic Token Distiller that compresses redundant features into compact semantic tokens, distilling essential cross-task knowledge; and a Cross-Window Integrated attention Block that injects global semantics into local features via a dual-branch architecture, preserving both global consistency and spatial precision. These components collectively enable the network to capture comprehensive cross-task interactions at linear complexity with reduced parameters. Extensive experiments on NYUDv2 and PASCAL-Context demonstrate that MTLSI-Net achieves state-of-the-art performance, validating its effectiveness and efficiency in multi-task learning.
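The complexity argument rests on a reordering of matrix products: with a kernel feature map in place of softmax, keys and values can first be summarized into a small context matrix that every query reuses. The sketch below uses a common generic formulation with the feature map elu(x) + 1, which is an assumption; the abstract does not give MTLSI-Net's exact attention form. It is written in plain Python so that the two orderings can be compared directly.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def feature_map(m):
    """Positive feature map elu(x) + 1, applied elementwise."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]


def linear_attention(q, k, v):
    """Cost grows linearly with the number of tokens N: build K^T V once."""
    qf, kf = feature_map(q), feature_map(k)
    context = matmul(transpose(kf), v)       # d x d_v global context matrix
    key_sum = [sum(col) for col in zip(*kf)]  # length-d normalizer
    numerators = matmul(qf, context)
    return [[x / sum(a * b for a, b in zip(qi, key_sum)) for x in row]
            for qi, row in zip(qf, numerators)]


def quadratic_attention(q, k, v):
    """Same result, computed through the explicit N x N score matrix."""
    qf, kf = feature_map(q), feature_map(k)
    scores = matmul(qf, transpose(kf))
    return [[sum(s * vj[c] for s, vj in zip(row, v)) / sum(row)
             for c in range(len(v[0]))] for row in scores]


# Hypothetical tokens: N = 3, d = 2, d_v = 2.
q = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
k = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out_linear = linear_attention(q, k, v)
out_quadratic = quadratic_attention(q, k, v)
```

Both functions return the same values. The linear version never materializes the N x N score matrix, which is what makes global interaction affordable on high-resolution feature maps.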

[161] ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction

Sirshapan Mitra,Yogesh S. Rawat

Main category: cs.CV

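TL;DR: This paper proposes the ProDiG framework, which uses progressive Gaussian Splatting transformation and diffusion guidance to generate ground-level views and a consistent 3D scene model from aerial-only imagery, without requiring multi-altitude ground-truth data.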

Details Motivation: Under extreme viewpoint changes, missing intermediate observations, and large scale differences, existing methods struggle to produce geometrically consistent ground-level views and 3D models, and they either depend on hard-to-obtain multi-altitude ground truth or rely on post-processing that introduces geometric inconsistency. Method: Proposes ProDiG (Progressive Altitude Gaussian Splatting), which combines diffusion-guided progressive refinement of the Gaussian representation, a geometry-aware causal attention module (injecting epipolar structure), and a distance-adaptive Gaussian module (dynamically adjusting scale and opacity). Result: Significantly outperforms existing methods on synthetic and real-world datasets in visual quality, geometric consistency, and robustness to extreme viewpoint changes. Conclusion: ProDiG achieves progressive, geometrically grounded aerial-to-ground reconstruction without additional ground-truth viewpoints, offering a new paradigm for single-source aerial-to-ground 3D modeling. Abstract: Generating ground-level views and coherent 3D site models from aerial-only imagery is challenging due to extreme viewpoint changes, missing intermediate observations, and large scale variations. Existing methods either refine renderings post-hoc, often producing geometrically inconsistent results, or rely on multi-altitude ground-truth, which is rarely available. Gaussian Splatting and diffusion-based refinements improve fidelity under small variations but fail under wide aerial-to-ground gaps. To address these limitations, we introduce ProDiG (Progressive Altitude Gaussian Splatting), a diffusion-guided framework that progressively transforms aerial 3D representations toward ground-level fidelity. ProDiG synthesizes intermediate-altitude views and refines the Gaussian representation at each stage using a geometry-aware causal attention module that injects epipolar structure into reference-view diffusion. A distance-adaptive Gaussian module dynamically adjusts Gaussian scale and opacity based on camera distance, ensuring stable reconstruction across large viewpoint gaps. Together, these components enable progressive, geometrically grounded refinement without requiring additional ground-truth viewpoints. Extensive experiments on synthetic and real-world datasets demonstrate that ProDiG produces visually realistic ground-level renderings and coherent 3D geometry, significantly outperforming existing approaches in terms of visual quality, geometric consistency, and robustness to extreme viewpoint changes.

[162] Test-Time Adaptation for Height Completion via Self-Supervised ViT Features and Monocular Foundation Models

Osher Rafaeli,Tal Svoray,Ariel Nahlieli

Main category: cs.CV

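TL;DR: This paper proposes Prior2DSM, a training-free framework for digital surface model (DSM) completion. It uses DINOv3 visual features and a monocular depth foundation model to propagate metric height information at test time through semantic feature-space correspondence, and calibrates via test-time adaptation with LoRA and a lightweight MLP, markedly reducing reconstruction error.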

Details Motivation: Large-scale DSMs often contain missing or outdated regions; traditional interpolation fails because it relies on spatial continuity, while existing learning-based methods are limited by supervised training and sensor specificity and therefore generalize poorly. Method: Prior2DSM is a training-free framework that fuses self-supervised ViT features from DINOv3 with a monocular depth foundation model and propagates height priors at test time through semantic feature-space matching; test-time adaptation with LoRA plus a lightweight MLP predicts spatially varying scale and shift parameters that convert relative depth into metric height. Result: Prior2DSM reduces RMSE by up to 46% compared with linear fitting of MDE and outperforms interpolation methods, prior-based rescaling methods, and state-of-the-art monocular depth estimation models, while preserving structural fidelity and supporting DSM updating and coupled RGB-DSM generation. Conclusion: By decoupling semantic understanding from metric calibration and removing the dependence on training, Prior2DSM enables general DSM completion across domains and sensors, offering an efficient and flexible new paradigm for geospatial modeling. Abstract: Accurate digital surface models (DSMs) are essential for many geospatial applications, including urban monitoring, environmental analyses, infrastructure management, and change detection. However, large-scale DSMs frequently contain incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or changes in the built environment. Traditional height completion approaches primarily rely on spatial interpolation, which assumes spatial continuity and therefore fails when objects are missing. Recent learning-based approaches improve reconstruction quality but typically require supervised training on sensor-specific datasets, limiting their generalization across domains and sensing conditions. We propose Prior2DSM, a training-free framework for metric DSM completion that operates entirely at test time by leveraging foundation models. Unlike previous height completion approaches that require task-specific training, the proposed method combines self-supervised Vision Transformer (ViT) features from DINOv3 with monocular depth foundation models to propagate metric information from incomplete height priors through semantic feature-space correspondence. Test-time adaptation (TTA) is performed using parameter-efficient low-rank adaptation (LoRA) together with a lightweight multilayer perceptron (MLP), which predicts spatially varying scale and shift parameters to convert relative depth estimates into metric heights.
Experiments demonstrate consistent improvements over interpolation-based methods, prior-based height rescaling approaches, and state-of-the-art monocular depth estimation models. Prior2DSM reduces reconstruction error while preserving structural fidelity, achieving up to a 46% reduction in RMSE compared to linear fitting of MDE, and further enables DSM updating and coupled RGB-DSM generation.
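
The abstract compares against "linear fitting of MDE", which we read as mapping relative depth to metric height with one global scale and shift. The sketch below illustrates that baseline with ordinary least squares over pixels that have a height prior. The function names and toy numbers are illustrative assumptions, not taken from the paper, and Prior2DSM's spatially varying scale and shift are not implemented here.

```python
def fit_scale_shift(rel_depth, heights):
    """Least-squares s, t minimising sum((s * d + t - h)^2) over known pixels."""
    n = len(rel_depth)
    mean_d = sum(rel_depth) / n
    mean_h = sum(heights) / n
    var_d = sum((d - mean_d) ** 2 for d in rel_depth)
    cov = sum((d - mean_d) * (h - mean_h) for d, h in zip(rel_depth, heights))
    s = cov / var_d          # the sign of s is learned from the data
    t = mean_h - s * mean_d
    return s, t


def rmse(pred, target):
    """Root-mean-square error between two equally long sequences."""
    return (sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)) ** 0.5


# Toy example (made-up values): heights follow 20 * depth + 1 exactly.
rel = [0.1, 0.4, 0.5, 0.9]
known_heights = [3.0, 9.0, 11.0, 19.0]
s, t = fit_scale_shift(rel, known_heights)
metric = [s * d + t for d in rel]
error = rmse(metric, known_heights)
```

A single global pair (s, t) cannot absorb errors that vary across the scene, which is the gap the spatially varying parameters described in the abstract are meant to close.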

[163] Decouple and Rectify: Semantics-Preserving Structural Enhancement for Open-Vocabulary Remote Sensing Segmentation

Jie Feng,Fengze Li,Junpeng Zhang,Siyu Chen,Yuping Liang,Junying Chen,Ronghua Shang

Main category: cs.CV

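TL;DR: This paper proposes DR-Seg, a framework that decouples CLIP features into semantics-dominated and structure-dominated subspaces, applies targeted structural enhancement with DINO features, and adds graph rectification and adaptive fusion modules, substantially improving open-vocabulary semantic segmentation of remote sensing images.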

Details Motivation: CLIP's globally aligned visual representations struggle to capture structural detail, and existing methods that introduce DINO features cannot localize where structural enhancement is needed, which risks damaging CLIP's semantic integrity. Method: The DR-Seg framework (1) decouples CLIP features into semantics-dominated and structure-dominated subspaces; (2) builds a refined branch with a DINO-guided, prior-driven graph rectification module; and (3) dynamically fuses the refined branch with the original CLIP branch through an uncertainty-guided adaptive fusion module. Result: New state-of-the-art performance on eight benchmarks. Conclusion: DR-Seg effectively balances language-aligned semantics with fine-grained spatial delineation and shows that modeling the functional heterogeneity of feature channels matters for open-vocabulary remote sensing segmentation. Abstract: Open-vocabulary semantic segmentation in the remote sensing (RS) field requires both language-aligned recognition and fine-grained spatial delineation. Although CLIP offers robust semantic generalization, its global-aligned visual representations inherently struggle to capture structural details. Recent methods attempt to compensate for this by introducing RS-pretrained DINO features. However, these methods treat CLIP representations as a monolithic semantic space and cannot localize where structural enhancement is required, failing to effectively delineate boundaries while risking the disruption of CLIP's semantic integrity. To address this limitation, we propose DR-Seg, a novel decouple-and-rectify framework in this paper. Our method is motivated by the key observation that CLIP feature channels exhibit distinct functional heterogeneity rather than forming a uniform semantic space. Building on this insight, DR-Seg decouples CLIP features into semantics-dominated and structure-dominated subspaces, enabling targeted structural enhancement by DINO without distorting language-aligned semantics. Subsequently, a prior-driven graph rectification module injects high-fidelity structural priors under DINO guidance to form a refined branch, while an uncertainty-guided adaptive fusion module dynamically integrates this refined branch with the original CLIP branch for final prediction. Comprehensive experiments across eight benchmarks demonstrate that DR-Seg establishes a new state-of-the-art.

[164] Are VLMs Lost Between Sky and Space? LinkS$^2$Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence

Dian Liu,Jie Feng,Di Li,Yuhui Zheng,Guanbin Li,Weisheng Dong,Guangming Shi

Main category: cs.CV

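TL;DR: This paper introduces LinkS$^2$Bench, the first comprehensive benchmark for evaluating the dynamic, wide-area cross-view spatial intelligence of vision-language models (VLMs). It aligns UAV video with high-resolution satellite imagery and contains 17.9k high-quality question-answer pairs. Experiments identify dynamic cross-view alignment as the key bottleneck, and a Cross-View Alignment Adapter is designed that effectively improves performance.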

Details Motivation: Existing benchmarks cover only isolated UAV videos or static satellite imagery and cannot evaluate VLMs on dynamic local-to-global spatial mapping and cross-view reasoning; a new benchmark is needed that reflects the air-space coordination required in real emergency and security scenarios. Method: LinkS$^2$Bench links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$; an LMM-assisted pipeline with careful human annotation yields 17.9k question-answer pairs spanning 12 fine-grained tasks across four dimensions (perception, localization, relation, reasoning); a Cross-View Alignment Adapter is designed, and 18 mainstream VLMs are evaluated alongside ablation experiments. Result: The 18 representative VLMs fall well behind the human baseline on LinkS$^2$Bench, confirming dynamic cross-view alignment as the key bottleneck; the proposed adapter significantly improves performance; fine-tuning experiments show the benchmark can drive VLM adaptation for complex spatial reasoning. Conclusion: LinkS$^2$Bench fills the gap in evaluating VLM spatial intelligence for UAV-satellite cooperation, exposes the core challenge of dynamic cross-view alignment, and provides data, evaluation, and methodological groundwork for future VLMs in emergency response and security operations. Abstract: Synergistic spatial intelligence between UAVs and satellites is indispensable for emergency response and security operations, as it uniquely integrates macro-scale global coverage with dynamic, real-time local perception. However, the capacity of Vision-Language Models (VLMs) to master this complex interplay remains largely unexplored. This gap persists primarily because existing benchmarks are confined to isolated Unmanned Aerial Vehicle (UAV) videos or static satellite imagery, failing to evaluate the dynamic local-to-global spatial mapping essential for comprehensive cross-view reasoning. To bridge this gap, we introduce LinkS$^2$Bench, the first comprehensive benchmark designed to evaluate VLMs' wide-area, dynamic cross-view spatial intelligence. LinkS$^2$Bench links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$. Through an LMM-assisted pipeline and rigorous human annotation, we constructed 17.9k high-quality question-answer pairs comprising 12 fine-grained tasks across four dimensions: perception, localization, relation, and reasoning. Evaluations of 18 representative VLMs reveal a substantial gap compared to human baselines, identifying accurate cross-view dynamic alignment as the critical bottleneck. To alleviate this, we design a Cross-View Alignment Adapter, demonstrating that explicit alignment significantly improves model performance.
Furthermore, fine-tuning experiments underscore the potential of LinkS$^2$Bench in advancing VLM adaptation for complex spatial reasoning.

[165] Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data

Alejandro Castañeda Garcia,Jan van Gemert,Daan Brinks,Nergis Tömen

Main category: cs.CV

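TL;DR: This paper proposes an improved autoencoding method for spatially imbalanced data. A self-entropy loss and a Sample Propagation mechanism improve reconstruction at rare spatial locations, outperforming baselines on simulated data and on real data from several domains.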

Details Motivation: Autoencoders perform poorly on the spatially non-uniformly sampled images common in medical imaging, biology, and physics: because background dominates, the model is biased toward the majority pattern, losing detail and producing blurred reconstructions. Method: Two complementary components are proposed: (i) a self-entropy-based loss that upweights statistically uncommon spatial locations; and (ii) Sample Propagation, a replay mechanism that selectively re-exposes the model to hard-to-reconstruct samples across batches during training. Result: Validated on a controlled simulated dataset and three real-world datasets (physical, biological, astronomical), the method outperforms existing baselines on several reconstruction metrics, especially under spatially imbalanced distributions. Conclusion: Spatial imbalance is a key factor in unsupervised image reconstruction quality; data representation within a batch and rare samples both deserve attention, and the proposed method effectively mitigates the problem. Abstract: Autoencoders can be challenged by spatially non-uniform sampling of image content. This is common in medical imaging, biology, and physics, where informative patterns occur rarely at specific image coordinates, as background dominates these locations in most samples, biasing reconstructions toward the majority appearance. In practice, autoencoders are biased toward dominant patterns, resulting in the loss of fine-grained detail and causing blurred reconstructions for rare spatial inputs, especially under spatial data imbalance. We address spatial imbalance with two complementary components: (i) a self-entropy-based loss that upweights statistically uncommon spatial locations and (ii) Sample Propagation, a replay mechanism that selectively re-exposes the model to hard-to-reconstruct samples across batches during training. We benchmark existing data balancing strategies, originally developed for supervised classification, in the unsupervised reconstruction setting. Drawing on the limitations of these approaches, our method specifically targets spatial imbalance by encouraging models to focus on statistically rare locations, improving reconstruction consistency compared to existing baselines. We validate on a simulated dataset with controlled spatial imbalance conditions, and on three uncontrolled, diverse real-world datasets spanning physical, biological, and astronomical domains. Our approach outperforms baselines on various reconstruction metrics, particularly under spatially imbalanced distributions. These results highlight the importance of data representation in a batch and emphasize rare samples in unsupervised image reconstruction.
We will make all code and related data available.
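
One way to picture the idea of a loss that upweights statistically uncommon spatial locations is to weight each pixel by the self-information of its observed value at that coordinate. This is a minimal sketch under our own assumptions (binary foreground masks, weight `1 - log p`, normalised weighted MSE); the abstract does not give the paper's actual loss, and all names here are hypothetical.

```python
import math


def foreground_frequency(masks):
    """Per-location fraction of samples whose pixel is foreground (value 1)."""
    n = len(masks)
    return [sum(m[i] for m in masks) / n for i in range(len(masks[0]))]


def pixel_weights(sample, freq, eps=1e-6):
    """Weight = 1 + self-information of the value observed at each location."""
    weights = []
    for value, p_fg in zip(sample, freq):
        p_obs = p_fg if value == 1 else 1.0 - p_fg
        weights.append(1.0 - math.log(max(p_obs, eps)))
    return weights


def weighted_mse(x, recon, weights):
    """Squared error averaged with per-pixel weights."""
    total = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, recon))
    return total / sum(weights)


# Toy dataset (made up): location 0 is always foreground, location 1 rarely.
masks = [[1, 0], [1, 0], [1, 1], [1, 0]]
freq = foreground_frequency(masks)
w_rare = pixel_weights([1, 1], freq)    # rare foreground at location 1
w_common = pixel_weights([1, 0], freq)  # usual background at location 1
loss = weighted_mse([1, 1], [1, 0], w_rare)
```

With these weights, missing the rare foreground pixel costs more than it would under a uniform mean squared error, which is the qualitative behaviour the abstract describes.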

[166] IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline

Sebastian-Ion Nae,Radu Moldoveanu,Alexandra Stefania Ghita,Adina Magda Florea

Main category: cs.CV

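TL;DR: This paper introduces IndoorCrowd, a dataset for indoor human detection, instance segmentation, and multi-object tracking, with 31 videos (9,913 frames) across four campus scenes and human-verified segmentation masks. By comparing foundation models such as SAM3 against human annotation and establishing baselines with models such as YOLOv8n, it shows how difficulty varies across scenes with density, scale, and occlusion.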

Details Motivation: Existing datasets rarely capture the complexity of crowd behaviour in real indoor environments at scale, yet understanding such behaviour is essential for surveillance, smart buildings, and human-robot interaction. Method: The multi-scene IndoorCrowd dataset covers four campus locations with 31 videos (9,913 frames at 5 fps) and human-verified per-instance segmentation masks; Cohen's κ, AP, precision, recall, and mask IoU are used to evaluate SAM3, GroundingSAM, and EfficientGroundingSAM as auto-annotators on a 620-frame control subset; a 2,552-frame subset supports multi-object tracking in MOTChallenge format; detection, segmentation, and tracking baselines use YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Result: ACS-EC is the most challenging scene, with 79.3% dense frames and a mean instance scale of only 60.8 px; model performance differs markedly between scenes, reflecting the influence of density, scale, and occlusion on task difficulty. Conclusion: IndoorCrowd fills a gap in datasets for understanding real, complex indoor crowds, providing high-quality annotation, a rigorous evaluation protocol, and strong baselines to advance indoor crowd perception. Abstract: Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises $31$ videos ($9{,}913$ frames at $5$fps) with human-verified, per-instance segmentation masks. A $620$-frame control subset benchmarks three foundation-model auto-annotators: SAM3, GroundingSAM, and EfficientGroundingSAM, against human labels using Cohen's $κ$, AP, precision, recall, and mask IoU. A further $2{,}552$-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with $79.3\%$ dense frames and a mean instance scale of $60.8$px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.
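
Two of the agreement measures named above have compact definitions: Cohen's κ is (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement, and mask IoU is intersection over union of two binary masks. The sketch below uses flattened toy labels; how the paper pairs units for κ (per pixel, per instance, or per frame) is not stated in the abstract, so the inputs here are purely illustrative.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if p_e == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def mask_iou(mask_a, mask_b):
    """Intersection over union of two flattened binary masks."""
    inter = sum(1 for x, y in zip(mask_a, mask_b) if x and y)
    union = sum(1 for x, y in zip(mask_a, mask_b) if x or y)
    return inter / union if union else 1.0


# Toy labels (made up): the auto-annotator misses one of two positives.
human = [1, 1, 0, 0]
auto = [1, 0, 0, 0]
kappa = cohens_kappa(human, auto)
iou = mask_iou(human, auto)
```

In this toy case raw agreement is 0.75 but κ is lower, because half of that agreement would be expected by chance given the label frequencies.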

[167] Efficient Reasoning via Thought Compression for Language Segmentation

Qing Zhou,Shiyu Zhang,Yuyu Jia,Junyu Gao,Weiping Ni,Junzheng Wu,Qi Wang

Main category: cs.CV

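TL;DR: WISE is a new paradigm for efficient multimodal reasoning. Its "think twice" strategy (once for learning, once for speed) sharply compresses reasoning length while maintaining or even improving segmentation performance.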

Details Motivation: Chain-of-thought (CoT) reasoning improves language-guided segmentation, but generating verbose rationales makes it too computationally expensive for practical deployment. Method: WISE trains the model to produce a three-part structured output: a concise rationale, then the final answer, then a detailed explanation. Autoregressive conditioning makes the concise rationale a sufficient summary of the detailed explanation, and a self-distillation objective that jointly rewards semantic fidelity and conciseness reinforces this; at inference only the concise rationale is used, and the WISE-S prompting technique mitigates the resulting distribution shift. Result: State-of-the-art zero-shot performance of 58.3 cIoU on the ReasonSeg benchmark, with average reasoning length cut from 112 to 23 tokens (about 5× compression). Conclusion: WISE shows that detailed reasoning can be internalized into a concise policy, improving efficiency substantially without sacrificing performance and suggesting a new direction for efficient multimodal reasoning. Abstract: Chain-of-thought (CoT) reasoning has significantly improved the performance of large multimodal models in language-guided segmentation, yet its prohibitive computational cost, stemming from generating verbose rationales, limits real-world applicability. We introduce WISE (Wisdom from Internal Self-Exploration), a novel paradigm for efficient reasoning guided by the principle of \textit{thinking twice -- once for learning, once for speed}. WISE trains a model to generate a structured sequence: a concise rationale, the final answer, and then a detailed explanation. By placing the concise rationale first, our method leverages autoregressive conditioning to enforce that the concise rationale acts as a sufficient summary for generating the detailed explanation. This structure is reinforced by a self-distillation objective that jointly rewards semantic fidelity and conciseness, compelling the model to internalize its detailed reasoning into a compact form. At inference, the detailed explanation is omitted. To address the resulting conditional distribution shift, our inference strategy, WISE-S, employs a simple prompting technique that injects a brevity-focused instruction into the user's query. This final adjustment facilitates the robust activation of the learned concise policy, unlocking the full benefits of our framework. Extensive experiments show that WISE-S achieves state-of-the-art zero-shot performance on the ReasonSeg benchmark with 58.3 cIoU, while reducing the average reasoning length by nearly \textbf{5$\times$} -- from 112 to just 23 tokens. Code is available at https://github.com/mrazhou/WISE.

[168] Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models

Issa Sugiura,Keito Sasagawa,Keisuke Nakao,Koki Maeda,Ziqi Yin,Zhishen Yang,Shuhei Kurita,Yusuke Oda,Ryoko Tokuhisa,Daisuke Kawahara,Naoaki Okazaki

Main category: cs.CV

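TL;DR: This paper introduces Jagle, the largest Japanese multimodal post-training dataset to date (about 9.2 million instances), built from heterogeneous source data through several strategies (VLM-based generation, translation, text rendering). It markedly improves Japanese VLM performance while preserving English performance.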

Details Motivation: English vision-language models (VLMs) rely on large-scale VQA datasets, but Japanese and other non-English languages lack VQA resources of sufficient scale and domain coverage, which seriously hinders multilingual VLM development. Method: The Jagle dataset is built by collecting heterogeneous source data (images, image-text pairs, PDFs) and generating Japanese VQA samples through several strategies, including automatic VLM-based QA generation, cross-lingual translation, and text rendering; a 2.2B-parameter model is then post-trained and evaluated. Result: The 2.2B model trained on Jagle surpasses InternVL3.5-2B in average score over ten Japanese evaluation tasks and comes within about five points of Qwen3-VL-2B-Instruct; joint training with FineVision does not harm English performance and in fact improves it. Conclusion: Jagle provides a high-quality, large-scale, diverse post-training resource for Japanese VLMs, shows that multilingual multimodal datasets can be built without relying on existing VQA data, and exhibits positive cross-lingual transfer. Abstract: Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation, translation, and text rendering. Experiments demonstrate that a 2.2B model trained with Jagle achieves strong performance on Japanese tasks, surpassing InternVL3.5-2B in average score across ten Japanese evaluation tasks and approaching within five points of Qwen3-VL-2B-Instruct. Furthermore, combining Jagle with FineVision does not degrade English performance; instead, it improves English performance compared to training with FineVision alone. To facilitate reproducibility and future research, we release the dataset, trained models, and code.

[169] True to Tone? Quantifying Skin Tone Fidelity and Bias in Photographic-to-Virtual Human Pipelines

Gabriel Ferri Schneider,Erick Menezes,Rafael Mecenas,Paulo Knob,Victor Araujo,Soraia Raupp Musse

Main category: cs.CV

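TL;DR: This paper proposes a fully automatic, scalable methodology for systematically evaluating skin tone fidelity across the virtual human (VH) generation pipeline. It finds that skin tone extraction strategies behave in a phenotype-dependent way and that colorimetric errors are consistently higher for darker skin tones.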

Details Motivation: Most avatar creation pipelines rely on photographic inputs without colorimetric calibration, which can introduce inconsistencies and bias and affect the accuracy of skin tone reproduction, identity preservation, and fairness. Method: An end-to-end workflow integrates skin color and illumination extraction, texture recolorization, real-time rendering, and quantitative color analysis; using CFD face images, it compares cheek-region sampling with multidimensional masking derived from full-face analysis, each also tested with lighting isolation via the pre-trained TRUST framework; extracted skin tones are applied to MetaHuman textures, rendered under multiple lighting configurations, and evaluated objectively in CIELAB space with the ΔE and ITA metrics. Result: About 19,848 rendered instances are generated and analyzed; extraction strategies show phenotype-dependent behavior, and darker skin tones consistently show higher colorimetric errors. Conclusion: The methodology needs no manual intervention and no training stage (only a pre-trained illumination compensation module), has low computational cost, suits large-scale evaluation, and exposes systematic bias in skin tone fidelity in current VH pipelines. Abstract: Accurate reproduction of facial skin tone is essential for realism, identity preservation, and fairness in Virtual Human (VH) rendering. However, most accessible avatar creation pipelines rely on photographic inputs that lack colorimetric calibration, which can introduce inconsistencies and bias. We propose a fully automatic and scalable methodology to systematically evaluate skin tone fidelity across the VH generation pipeline. Our approach defines a full workflow that integrates skin color and illumination extraction, texture recolorization, real-time rendering, and quantitative color analysis. Using facial images from the Chicago Face Database (CFD), we compare skin tone extraction strategies based on cheek-region sampling, following the literature, and multidimensional masking derived from full-face analysis. Additionally, we test both strategies with lighting isolation, using the pre-trained TRUST framework, employed without any training or optimization within our pipeline. Extracted skin tones are applied to MetaHuman textures and rendered under multiple lighting configurations. Skin tone consistency is evaluated objectively in the CIELAB color space using the $ΔE$ metric and the Individual Typology Angle (ITA). The proposed methodology operates without manual intervention and, with the exception of pre-trained illumination compensation modules, the pipeline does not include learning or training stages, enabling low computational cost and large-scale evaluation. Using this framework, we generate and analyze approximately 19,848 rendered instances.
Our results show phenotype-dependent behavior of extraction strategies and consistently higher colorimetric errors for darker skin tones.
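
Both colour metrics mentioned in the abstract can be computed directly from CIELAB coordinates. The sketch below assumes the simplest colour-difference formula, CIE76 (Euclidean distance in L\*a\*b\*); the abstract does not say which ΔE variant the authors use. ITA follows its usual definition, arctan((L\* − 50) / b\*) in degrees. The sample values are made up.

```python
import math


def delta_e_cie76(lab_1, lab_2):
    """CIE76 colour difference: Euclidean distance between two (L*, a*, b*) triples."""
    return math.dist(lab_1, lab_2)


def ita_degrees(l_star, b_star):
    """Individual Typology Angle in degrees; atan2 avoids division by zero at b* = 0."""
    return math.degrees(math.atan2(l_star - 50.0, b_star))


# Made-up example: a source skin tone and its rendered counterpart.
source_lab = (50.0, 10.0, 10.0)
rendered_lab = (53.0, 14.0, 10.0)
colour_error = delta_e_cie76(source_lab, rendered_lab)
angle = ita_degrees(60.0, 10.0)
```

A larger ITA corresponds to a lighter skin tone, so comparing ΔE across ITA ranges is one natural way to check whether error grows for darker tones.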

[170] COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing

Hao Wang,Yanyu Qian,Pengcheng Weng,Zixuan Xia,William Dan,Yangxin Xu,Fei Wang

Main category: cs.CV

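TL;DR: This paper proposes COMPASS, a framework that generates a target-specific proxy token for each missing modality so that the fusion head always receives a fixed N-slot multimodal input, improving cross-modal interaction and fusion robustness when modalities are missing.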

Details Motivation: Existing missing-modality fusion methods change the input structure dynamically, which creates a mismatch between training and inference, incomplete fusion, and weakened cross-modal interaction. Method: Built on the principle of fusion completeness, COMPASS synthesizes proxy tokens for each missing modality in a shared latent space using pairwise source-to-target generators, and uses proxy alignment, shared-space regularization, and per-proxy discriminative supervision to make them representation-compatible and task-informative; the proxies are aggregated into a single replacement token so that the N-slot input stays fixed. Result: On the XRF55, MM-Fi, and OctoNet datasets, COMPASS outperforms prior methods in the large majority of single- and multiple-missing scenarios. Conclusion: Preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing. Abstract: Missing modalities remain a major challenge for multimodal sensing, because most existing methods adapt the fusion process to the observed subset by dropping absent branches, using subset-specific fusion, or reconstructing missing features. As a result, the fusion head often receives an input structure different from the one seen during training, leading to incomplete fusion and degraded cross-modal interaction. We propose COMPASS, a missing-modality fusion framework built on the principle of fusion completeness: the fusion head always receives a fixed N-slot multimodal input, with one token per modality slot. For each missing modality, COMPASS synthesizes a target-specific proxy token from the observed modalities using pairwise source-to-target generators in a shared latent space, and aggregates them into a single replacement token. To make these proxies both representation-compatible and task-informative, we combine proxy alignment, shared-space regularization, and per-proxy discriminative supervision. Experiments on XRF55, MM-Fi, and OctoNet under diverse single- and multiple-missing settings show that COMPASS outperforms prior methods on the large majority of scenarios. Our results suggest that preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing.
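
The fixed N-slot interface can be pictured as follows: every modality slot is filled either by its real token or by a replacement token aggregated from per-source proxies. This is only a sketch under stated assumptions: the generators are toy scaling functions standing in for learned networks, mean aggregation is our guess (the abstract does not specify the aggregation), and the modality names are placeholders.

```python
def build_fusion_input(observed, generators, modalities):
    """Return one token per modality slot, in a fixed order.

    observed:   dict modality -> token (list of floats), observed modalities only
    generators: dict (source, target) -> function mapping a source token to a proxy
    modalities: full ordered list of modality names (defines the N slots)
    """
    slots = []
    for target in modalities:
        if target in observed:
            slots.append(observed[target])
            continue
        # One proxy per observed source modality, then average them (assumption).
        proxies = [generators[(src, target)](tok) for src, tok in observed.items()]
        dim = len(proxies[0])
        slots.append([sum(p[i] for p in proxies) / len(proxies) for i in range(dim)])
    return slots


# Toy setup: modality "c" is missing; generators are stand-ins for learned maps.
modalities = ["a", "b", "c"]
observed = {"a": [1.0, 2.0], "b": [3.0, 4.0]}
generators = {
    ("a", "c"): lambda tok: [2.0 * x for x in tok],
    ("b", "c"): lambda tok: list(tok),
}
fusion_input = build_fusion_input(observed, generators, modalities)
```

Whatever subset is observed, the fusion head downstream sees the same number of slots in the same order, which is the property the abstract calls fusion completeness.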

[171] CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects

Jingliang Li,Jindou Jia,Tuo An,Chuhao Zhou,Xiangyu Chen,Shilin Shan,Boyu Ma,Bofan Lyu,Gen Li,Jianfei Yang

Main category: cs.CV

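TL;DR: This paper proposes a new task, intent-driven 3D affordance grounding in multi-object scenes, builds CompassAD, the first benchmark focused on implicit intent and confusing pairs, and designs the CompassNet framework, which uses Instance-bounded Cross Injection (ICI) and Bi-level Contrastive Refinement (BCR) modules to distinguish affordances between confusable objects, achieving state-of-the-art results on the benchmark and transferring to real-robot grasping.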

Details Motivation: Existing 3D affordance methods are mostly evaluated on single objects with explicit category prompts and cannot handle the "confusing pair" challenge of real scenes, where several objects share the same affordance but only one fits the task intent. Method: The CompassNet framework has two core modules: (1) Instance-bounded Cross Injection (ICI), which constrains language-geometry alignment within object instance boundaries to prevent cross-object semantic leakage; and (2) Bi-level Contrastive Refinement (BCR), which applies contrastive learning at both the geometric-group and point levels to sharpen discrimination between target and confusable surfaces. Result: State-of-the-art performance on the new CompassAD benchmark with strong generalization to unseen instructions; successful deployment on a real robotic manipulator confirms effective transfer to grasping in confusing multi-object scenes. Conclusion: Implicit intent-driven multi-object affordance grounding is a key direction closer to real interaction needs; structured modeling of object boundaries together with hierarchical contrastive mechanisms effectively reduces affordance confusion and offers a new paradigm for joint semantic-geometric reasoning in embodied intelligence. Abstract: When told to "cut the apple," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Multi-Object Affordance Grounding under Intent-Driven Instructions, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a cluttered multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 scenes, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces.
Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.

[172] Network Structure in UK Payment Flows: Evidence on Economic Interdependencies and Implications for Real-Time Measurement

Aditya Humnabadkar

Main category: cs.CV

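TL;DR: Analysing about 532,000 UK inter-industry payment records from 2017 to 2024, this paper finds that graph-theoretic network features (such as centrality and clustering coefficients) significantly improve payment flow forecasting, with the largest advantage during economic shocks such as the COVID-19 pandemic. It identifies Financial Services, Wholesale Trade, and Professional Services as structurally central industries and notes that changes in payment network density can serve as a leading indicator of structural economic change.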

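Details Motivation: Traditional bilateral measurement approaches cannot reveal the structural economic relationships implicit between industries, and real-time economic monitoring needs more robust and more interpretable indicators, especially because traditional time-series models deteriorate sharply during economic disruptions. Method: A directed, weighted inter-industry payment network is built from 532,346 UK payment records across 89 industry sectors; graph-theoretic features such as centrality (for example in-degree and betweenness) and clustering coefficients are extracted and added to time-series forecasting models (for example ARIMA or machine-learning regression), and forecasting performance (R² and similar) is compared with pure time-series baselines. Result: Network features improve forecasting by 8.8 percentage points; during the pandemic, when traditional forecasting accuracy collapsed (R² falling from 0.38 to 0.19), the network contribution reached +13.8 percentage points; Financial Services, Wholesale Trade, and Professional Services are identified as structurally central industries; network density rose 12.5% overall, dropping in 2020 and then recovering beyond pre-pandemic levels. Conclusion: Structural features of the industry payment network not only strengthen short-term forecasting but can also complement official statistics by providing leading signals of structural economic change, with particular monitoring value in turbulent periods when traditional temporal patterns fail. Abstract: Network analysis of inter-industry payment flows reveals structural economic relationships invisible to traditional bilateral measurement approaches, with significant implications for real-time economic monitoring. Analysing 532,346 UK payment records (2017--2024) across 89 industry sectors, we demonstrate that graph-theoretic features which include centrality measures and clustering coefficients improve payment flow forecasting by 8.8 percentage points beyond traditional time-series methods. Critically, network features prove most valuable during economic disruptions: during the COVID-19 pandemic, when traditional forecasting accuracy collapsed ($R^2$ falling from 0.38 to 0.19), network-enhanced models maintained substantially better performance, with network contributions reaching +13.8 percentage points. The analysis identifies Financial Services, Wholesale Trade, and Professional Services as structurally central industries whose network positions indicate systemic importance beyond their transaction volumes. Network density increased 12.5% over the sample period, with visible disruption during 2020 followed by recovery exceeding pre-pandemic integration levels. These findings suggest payment network monitoring could enhance official statistics production by providing leading indicators of structural economic change and improving nowcasting accuracy during periods when traditional temporal patterns prove unreliable.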
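
To make the network features concrete, the sketch below computes two standard quantities on a tiny directed payment graph: density, m / (n(n − 1)) for a directed graph with n nodes and m edges, and in-degree centrality, in-degree / (n − 1). The sector names echo the abstract but the edges and amounts are invented toy values, and the paper's own feature pipeline is not described in enough detail to reproduce.

```python
def density(nodes, edges):
    """Share of possible directed links (excluding self-loops) that exist."""
    n = len(nodes)
    return len(edges) / (n * (n - 1))


def in_degree_centrality(nodes, edges):
    """Number of incoming links per node, normalised by n - 1."""
    n = len(nodes)
    counts = {v: 0 for v in nodes}
    for _, dst in edges:          # iterating a dict yields its (src, dst) keys
        counts[dst] += 1
    return {v: c / (n - 1) for v, c in counts.items()}


def in_strength(nodes, edges):
    """Total payment value received by each node."""
    totals = {v: 0.0 for v in nodes}
    for (_, dst), amount in edges.items():
        totals[dst] += amount
    return totals


# Toy directed, weighted network: (payer, payee) -> amount (invented numbers).
nodes = ["finance", "wholesale", "professional"]
edges = {
    ("finance", "wholesale"): 120.0,
    ("wholesale", "professional"): 80.0,
    ("finance", "professional"): 40.0,
}
net_density = density(nodes, edges)
centrality = in_degree_centrality(nodes, edges)
received = in_strength(nodes, edges)
```

Features like these would be recomputed per time window and appended to the forecasting model's inputs; tracking density over time gives the kind of series the abstract describes as dipping in 2020 and recovering afterwards.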

[173] Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection

Soo Won Seo,KyungChae Lee,Hyungchan Cho,Taein Son,Nam Ik Cho,Jun Won Choi

Main category: cs.CV

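TL;DR: This paper proposes InCoM-Net, a framework that fuses semantic knowledge from vision-language models (VLMs) with instance features from a detector to strengthen context modeling in human-object interaction (HOI) detection, reaching state-of-the-art performance on HICO-DET and V-COCO.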

Details Motivation: Existing VLM-based HOI detection methods do not fully exploit the diverse contextual cues spread across the scene, which limits the depth and breadth of interaction reasoning. Method: The Instance-centric Context Mining Network (InCoM-Net) has two core modules: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues; and Progressive Context Aggregation (ProCA), which iteratively fuses the multi-level context features with detector instance features. Result: State-of-the-art performance on both the HICO-DET and V-COCO benchmarks. Conclusion: InCoM-Net improves the understanding and use of complex scene context in HOI detection and validates instance-centric, multi-granularity context modeling. Abstract: Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.

[174] PLUME: Latent Reasoning Based Universal Multimodal Embedding

Chenwei He,Xiangzhao Hao,Tianyu Yang,Yuxiang Ma,Yuheng Jia,Lingxiang Wu,Chaoyang Zhao,Haiyun Guo,Jinqiao Wang

Main category: cs.CV

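TL;DR: PLUME proposes a latent chain-of-reasoning framework that replaces explicit textual chain-of-thought (CoT) with a short autoregressive rollout of continuous latent states. Combined with a semantic-anchor-guided transition adapter and a progressive explicit-to-latent training curriculum, it significantly improves both efficiency and performance on multimodal embedding retrieval tasks.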

Details Motivation: Existing universal multimodal embedding (UME) methods based on explicit chain-of-thought (CoT) incur heavy inference overhead and squeeze rich multimodal evidence through a textual bottleneck, which limits them especially in structurally complex, evidence-dense retrieval settings such as video and visual documents. Method: PLUME has three parts: (1) a short autoregressive rollout of continuous latent states replaces explicit CoT; (2) a semantic-anchor-guided transition adapter supports diverse reasoning trajectories under the same computation budget; (3) a progressive explicit-to-latent training curriculum uses explicit CoT as a supervision signal early on and later removes text generation entirely, relying only on hidden-state computation. Result: On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while cutting reasoning from hundreds of tokens to fewer than 10 latent steps and running over 30x faster; the advantage is most pronounced for dense, complex evidence such as video and visual document retrieval. Conclusion: Structured latent computation can retain the benefits of intermediate reasoning while avoiding the cost of explicit rationale generation, offering a more efficient and more robust paradigm for practical retrieval systems. Abstract: Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval.
These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.

[175] FlowSlider: Training-Free Continuous Image Editing via Fidelity-Steering Decomposition

Taichi Endo,Guoqing Hao,Kazuhiko Sumi

Main category: cs.CV

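TL;DR: This paper proposes FlowSlider, a training-free continuous image editing method based on Rectified Flow. It decomposes the editing update into a fidelity term and a steering term and exploits their approximate orthogonality to achieve smooth, reliable strength control.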

Details Motivation: Existing learning-based slider methods for continuous editing depend on auxiliary modules trained with synthetic or proxy supervision, which adds training overhead and reduces reliability under distribution shift. Method: Within the Rectified Flow framework, FlowSlider decomposes FlowEdit's update into a source-conditioned fidelity term (preserving identity and structure) and a steering term that drives the semantic change; geometric analysis and empirical measurement show the two are approximately orthogonal, so scaling only the steering term gives stable control of edit strength. Result: Without any post-training, FlowSlider delivers smooth, reliable, high-quality continuous editing and improves quality across diverse tasks. Conclusion: FlowSlider is a training-free, broadly applicable approach to continuous image editing that removes existing methods' dependence on the training distribution, extra modules, and supervision. Abstract: Continuous image editing aims to provide slider-style control of edit strength while preserving source-image fidelity and maintaining a consistent edit direction. Existing learning-based slider methods typically rely on auxiliary modules trained with synthetic or proxy supervision. This introduces additional training overhead and couples slider behavior to the training distribution, which can reduce reliability under distribution shifts in edits or domains. We propose \textit{FlowSlider}, a training-free method for continuous editing in Rectified Flow that requires no post-training. \textit{FlowSlider} decomposes FlowEdit's update into (i) a fidelity term, which acts as a source-conditioned stabilizer that preserves identity and structure, and (ii) a steering term that drives semantic transition toward the target edit. Geometric analysis and empirical measurements show that these terms are approximately orthogonal, enabling stable strength control by scaling only the steering term while keeping the fidelity term unchanged. As a result, \textit{FlowSlider} provides smooth and reliable control without post-training, improving continuous editing quality across diverse tasks.
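
The slider mechanism described above amounts to scaling one of two vectors before adding them. The sketch below shows only that arithmetic plus a cosine check for approximate orthogonality; how FlowEdit's update is actually split into the two terms is not given in the abstract, so the vectors here are hypothetical stand-ins.

```python
import math


def slider_update(fidelity, steering, strength):
    """Combine the terms, scaling only the steering term by the slider value."""
    return [f + strength * s for f, s in zip(fidelity, steering)]


def cosine(u, v):
    """Cosine similarity; values near 0 indicate approximately orthogonal terms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical, exactly orthogonal toy vectors.
fidelity_term = [1.0, 0.0]
steering_term = [0.0, 2.0]
half_strength = slider_update(fidelity_term, steering_term, 0.5)
full_strength = slider_update(fidelity_term, steering_term, 1.0)
alignment = cosine(fidelity_term, steering_term)
```

When the two terms are orthogonal, changing the strength moves the result only along the steering direction and leaves the fidelity component untouched, which is why the abstract ties stable control to that property.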

[176] Center-Aware Detection with Swin-based Co-DETR Framework for Cervical Cytology

Yan Kong,Yuan Yin,Hongan Chen,Yuqi Fang,Caifeng Shan

Main category: cs.CV

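TL;DR: This paper presents a cell detection method built on Co-DINO with a Swin-Large backbone. It formulates detection as center-point prediction and adds center-preserving augmentation and geometric box optimization, markedly improving localization of dense cells in Pap smear images and taking 1st place in Track B and 2nd place in Track A of the RIVA challenge.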

Details Motivation: Automated analysis of Pap smear images is critical for cervical cancer screening but is very challenging because cells are densely distributed and morphologically complex. Method: The Co-DINO framework is combined with a Swin-Large backbone for multi-scale feature extraction; detection is formulated as center-point prediction; a center-preserving data augmentation strategy and an analytical geometric box optimization are designed to suppress localization jitter; and loss weights are tuned per task. Result: 1st place in Track B and 2nd place in Track A of the RIVA Cervical Cytology Challenge; experiments confirm that the proposed optimizations improve detection performance. Conclusion: Center-point modeling with targeted optimizations forms an efficient, robust pipeline for cervical cytology image analysis and a practical tool for computer-aided diagnosis. Abstract: Automated analysis of Pap smear images is critical for cervical cancer screening but remains challenging due to dense cell distribution and complex morphology. In this paper, we present our winning solution for the RIVA Cervical Cytology Challenge, achieving 1st place in Track B and 2nd place in Track A. Our approach leverages a powerful baseline, integrating the Co-DINO framework with a Swin-Large backbone for robust multi-scale feature extraction. To address the dataset's unique fixed-size bounding box annotations, we formulate the detection task as a center-point prediction problem. Tailoring our approach to this formulation, we introduce a center-preserving data augmentation strategy and an analytical geometric box optimization to effectively absorb localization jitter. Finally, we apply track-specific loss tuning to adapt the loss weights for each task. Experiments demonstrate that our targeted optimizations improve detection performance, providing an effective pipeline for cytology image analysis. Our code is available at https://github.com/YanKong0408/Center-DETR.
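
Because the annotations are fixed-size boxes, a predicted center fully determines a box, which is what makes the center-point formulation natural. The sketch below shows that conversion and its inverse. The box side length is a placeholder argument, since the abstract does not state the dataset's box size, and the paper's analytical box optimization is not reproduced here.

```python
def center_to_box(cx, cy, size):
    """Fixed-size square box (x1, y1, x2, y2) around a predicted center."""
    half = size / 2.0
    return (cx - half, cy - half, cx + half, cy + half)


def box_center(box):
    """Recover the center of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def center_error(pred_center, true_center):
    """Euclidean distance between predicted and annotated centers, in pixels."""
    dx = pred_center[0] - true_center[0]
    dy = pred_center[1] - true_center[1]
    return (dx * dx + dy * dy) ** 0.5


# Placeholder values: a 20-pixel box side is an assumption for illustration.
box = center_to_box(50.0, 60.0, size=20.0)
recovered = box_center(box)
err = center_error((53.0, 64.0), (50.0, 60.0))
```

Under this view, width and height carry no information, so any regression noise in the box extents can be discarded and only the center needs to be learned well.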

[177] GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding

Rong Fan,Kaiyan Xiao,Minghao Zhu,Liuyi Wang,Kai Dai,Zhao Yang

Main category: cs.CV

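TL;DR: This paper proposes GroundVTS, a new Vid-LLM architecture for video temporal grounding that uses query-guided, fine-grained visual token sampling and a progressive optimization strategy to improve key-frame selection and temporal modeling, substantially outperforming existing methods on multiple benchmarks.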

Details Motivation: Existing video large language models (Vid-LLMs) sample frames uniformly, which leaves key frames sparse and discards important temporal cues, making fine-grained tasks such as video temporal grounding (VTG) hard to support. Method: Proposes Grounded Visual Token Sampling (GroundVTS): 1) a query-guided, fine-grained visual token filtering mechanism that focuses on the most informative temporal segments; 2) a progressive optimization strategy that adapts the LLM to the non-uniform distribution of visual features and strengthens its modeling of temporal dependencies. Result: Leads by a wide margin on three standard VTG benchmarks: mIoU for moment retrieval improves by 7.7 points and mAP for highlight detection by 12.0 points. Conclusion: GroundVTS effectively mitigates the temporal information loss caused by uniform sampling and demonstrates the value of query-driven dynamic visual token sampling for the video grounding ability of Vid-LLMs. Abstract: Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video large language models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Furthermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to model temporal dependencies and achieve precise video localization. We comprehensively evaluate GroundVTS on three standard VTG benchmarks, where it outperforms existing methods, achieving a 7.7-point improvement in mIoU for moment retrieval and a 12.0-point improvement in mAP for highlight detection. Code is available at https://github.com/Florence365/GroundVTS.
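
Query-guided token filtering can be sketched as a top-k selection by similarity to the query, with the kept tokens returned in their original temporal order. This is a generic illustration, not the GroundVTS mechanism: cosine similarity, the `keep` budget, and the name `sample_tokens` are all assumptions, and token and query vectors are assumed to be non-zero.

```python
import numpy as np

def sample_tokens(visual_tokens, query, keep):
    """Keep the `keep` tokens most similar to the query, in original order.

    visual_tokens: (num_tokens, dim) array, ordered in time.
    query: (dim,) array.
    Returns (kept_indices, kept_tokens).
    """
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = v @ q                      # cosine similarity per token
    top = np.argsort(-scores)[:keep]    # indices of the highest scores
    kept = np.sort(top)                 # restore temporal order
    return kept, visual_tokens[kept]

tokens = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
kept, _ = sample_tokens(tokens, query, keep=2)
print(kept)
```

Sorting the selected indices keeps the surviving tokens in temporal order, so the result is a non-uniform but still ordered subset of the video.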

[178] LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

Jiachun Jin,Zetong Zhou,Xiao Yang,Hao Zhang,Pengfei Liu,Jun Zhu,Zhijie Deng

Main category: cs.CV

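TL;DR: This paper proposes LatentUM, a new unified model that represents all modalities in a shared semantic latent space, avoiding pixel-space mediation and improving the efficiency and alignment of cross-modal reasoning and generation.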

Details Motivation: Existing unified models use separate visual representations for understanding and generation and therefore need pixel decoding as a bridge, which is inefficient and ineffective; interleaved cross-modal reasoning (e.g., visual spatial planning, self-reflective image generation, physical world modeling) is the more valuable and promising use. Method: Proposes LatentUM, which maps all modalities into a unified semantic latent space and removes the pixel-space mediation between visual understanding and generation, enabling end-to-end joint cross-modal modeling and interleaved reasoning. Result: LatentUM achieves state-of-the-art results on the Visual Spatial Planning benchmark; markedly improves self-reflection-driven visual generation; supports predicting future visual states in the shared latent space for better world modeling; and also alleviates codec bias, strengthens cross-modal alignment, and improves computational efficiency. Conclusion: A shared semantic latent space is a key route to unified multimodal models that are efficient, well aligned, and capable of interleaved reasoning, and LatentUM offers a new paradigm for next-generation unified models. Abstract: Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.

[179] CASHG: Context-Aware Stylized Online Handwriting Generation

Jinsu Shin,Sungeun Hong,Jin Yeong Bak

Main category: cs.CV

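TL;DR: This paper proposes CASHG, which explicitly models inter-character connectivity and spacing and combines a context-aware encoder-decoder architecture with a three-stage curriculum to generate style-consistent, natural sentence-level online handwriting; it also introduces the boundary-aware Connectivity and Spacing Metrics (CSM) for evaluation.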

Details Motivation: Sentence-level online handwriting generation must handle inter-character stroke continuity, plausible spacing, and style consistency, yet prior methods model these boundary properties only implicitly, which is unreliable at sentence scale and under limited compositional diversity. Method: Proposes CASHG: 1) a Character Context Encoder that extracts character identity and sentence-level context memory; 2) a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor-to-current character transitions; 3) a gated context fusion mechanism; 4) a three-stage curriculum (from isolated characters to full sentences); 5) the Connectivity and Spacing Metrics (CSM) for boundary-aware evaluation. Result: Under benchmark-matched evaluation, CASHG consistently outperforms comparison methods on CSM, stays competitive on DTW-based trajectory similarity, and its quality gains are further corroborated by human evaluation. Conclusion: Explicitly modeling inter-character connectivity and spacing, together with a context-aware architecture and curriculum learning, is an effective way to improve the naturalness and style consistency of sentence-level online handwriting generation; CSM offers a new evaluation approach that better reflects real handwriting characteristics. Abstract: Online handwriting represents strokes as time-ordered trajectories, which makes handwritten content easier to transform and reuse in a wide range of applications. However, generating natural sentence-level online handwriting that faithfully reflects a writer's style remains challenging, since sentence synthesis demands context-dependent characters with stroke continuity and spacing. Prior methods treat these boundary properties as implicit outcomes of sequence modeling, which becomes unreliable at the sentence scale and under limited compositional diversity. We propose CASHG, a context-aware stylized online handwriting generator that explicitly models inter-character connectivity for style-consistent sentence-level trajectory synthesis. CASHG uses a Character Context Encoder to obtain character identity and sentence-dependent context memory and fuses them in a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor--current transitions, complemented by gated context fusion for sentence-level context. Training proceeds through a three-stage curriculum from isolated glyphs to full sentences, improving robustness under sparse transition coverage. We further introduce Connectivity and Spacing Metrics (CSM), a boundary-aware evaluation suite that quantifies cursive connectivity and spacing similarity.
Under benchmark-matched evaluation protocols, CASHG consistently improves CSM over comparison methods while remaining competitive in DTW-based trajectory similarity, with gains corroborated by a human evaluation.

[180] CoRegOVCD: Consistency-Regularized Open-Vocabulary Change Detection

Weidong Tang,Hanbin Sun,Zihan Li,Yikai Wang,Feifan Zhang

Main category: cs.CV

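TL;DR: This paper proposes CoRegOVCD, a training-free open-vocabulary change detection method for remote sensing that uses posterior consistency regularization to make cross-temporal concept responses more comparable and spatially coherent, substantially outperforming previous training-free methods on multiple benchmarks.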

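Details Motivation: Existing remote sensing change detection methods assume a fixed label space and cannot support arbitrary user queries; in the training-free setting, open-vocabulary change detection struggles because cross-temporal concept responses are hard to compare directly: appearance variation, weak cross-concept competition, and the spatial continuity of land-cover categories produce noisy and semantically unreliable evidence. Method: Proposes the CoRegOVCD framework: 1) Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy; 2) the Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) suppress unsupported responses and improve spatial coherence through geometry-aware structural verification and regional consensus. Result: On four benchmarks spanning building-oriented and multi-class settings, CoRegOVCD consistently beats the strongest training-free baseline, improving F1_C by 2.24 to 4.98 points, and reaches a six-class average F1_C of 47.50% on SECOND. Conclusion: Through posterior calibration and structural consistency modeling, CoRegOVCD mitigates the semantic unreliability and spatial fragmentation of training-free open-vocabulary change detection and offers a new paradigm for open-query change analysis of remote sensing imagery. Abstract: Remote sensing change detection (CD) aims to identify where land-cover semantics change across time, but most existing methods still assume a fixed label space and therefore cannot answer arbitrary user-defined queries. Open-vocabulary change detection (OVCD) instead asks for the change mask of a queried concept. In the fully training-free setting, however, dense concept responses are difficult to compare directly across dates: appearance variation, weak cross-concept competition, and the spatial continuity of many land-cover categories often produce noisy, fragmented, and semantically unreliable change evidence. We propose Consistency-Regularized Open-Vocabulary Change Detection (CoRegOVCD), a training-free dense inference framework that reformulates concept-specific change as calibrated posterior discrepancy. Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy, making semantic change evidence more comparable without explicit instance matching. Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) further suppress unsupported responses and improve spatial coherence through geometry-aware structural verification and regional consensus.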
Across four benchmarks spanning building-oriented and multi-class settings, CoRegOVCD consistently improves over the strongest previous training-free baseline by 2.24 to 4.98 F1$_C$ points and reaches a six-class average of 47.50% F1$_C$ on SECOND.
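
The notion of a calibrated posterior discrepancy can be sketched as follows, under strong assumptions: raw per-concept responses at each date are normalized with a softmax across concepts so that concepts compete, and change evidence for the queried concept is the absolute difference of its posterior between the two dates. The abstract does not define CPC or SPD, so the softmax, the absolute difference, and the name `posterior_delta` are stand-ins rather than the paper's formulas.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along `axis`."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def posterior_delta(resp_t1, resp_t2, query_idx):
    """Per-pixel change evidence for one queried concept.

    resp_t1, resp_t2: (num_concepts, H, W) raw responses at two dates.
    Returns an (H, W) map of |p_t1 - p_t2| for the queried concept.
    """
    p1 = softmax(resp_t1, axis=0)[query_idx]
    p2 = softmax(resp_t2, axis=0)[query_idx]
    return np.abs(p1 - p2)

# Two concepts, one pixel: the queried concept goes from 0.5 to 0.75.
t1 = np.zeros((2, 1, 1))
t2 = np.array([[[np.log(3.0)]], [[0.0]]])
print(posterior_delta(t1, t2, query_idx=0))
```

Normalizing across concepts before differencing means a uniform brightening of all responses at one date produces no change evidence, which is one way to make responses comparable across dates.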

[181] Beyond the Fold: Quantifying Split-Level Noise and the Case for Leave-One-Dataset-Out AU Evaluation

Saurabh Hinduja,Gurmeet Kaur,Maneesh Bilalpur,Jeffrey Cohn,Shaun Canavan

Main category: cs.CV

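TL;DR: This paper shows that the subject-exclusive cross-validation commonly used for facial Action Unit (AU) detection carries substantial stochastic variance, making reported performance gains unreliable, and proposes the cross-dataset Leave-One-Dataset-Out (LODO) protocol for more stable and interpretable evaluation.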

Details Motivation: AU detection studies routinely use subject-exclusive cross-validation, but reported gains are often small and unstable; the authors ask whether randomness introduced by the protocol itself may be masking real improvements. Method: Quantifies the stochastic variance of cross-validation by repeating 3-fold subject-exclusive splits on BP4D+; compares the variability of metrics such as F1 and AUC; introduces the Leave-One-Dataset-Out (LODO) protocol and evaluates cross-dataset robustness on five AU datasets. Result: On BP4D+, average F1 has an empirical noise floor of ±0.065, with larger swings for low-prevalence AUs; F1 is more sensitive to the split than AUC, and model rankings change with fold assignment; LODO removes partition randomness and exposes domain-level instability that single-dataset CV cannot reveal. Conclusion: Many gains reported under cross-validation may fall within the protocol's inherent variance; LODO is a more robust and interpretable evaluation paradigm for AU detection. Abstract: Subject-exclusive cross-validation is the standard evaluation protocol for facial Action Unit (AU) detection, yet reported improvements are often small. We show that cross-validation itself introduces measurable stochastic variance. On BP4D+, repeated 3-fold subject-exclusive splits produce an empirical noise floor of $\pm 0.065$ in average F1, with substantially larger variation for low-prevalence AUs. Operating-point metrics such as F1 fluctuate more than threshold-independent measures such as AUC, and model ranking can change under different fold assignments. We further evaluate cross-dataset robustness using a Leave-One-Dataset-Out (LODO) protocol across five AU datasets. LODO removes partition randomness and exposes domain-level instability that is not visible under single-dataset cross-validation. Together, these results suggest that gains often reported in cross-fold validation may fall within protocol variance. Leave-one-dataset-out cross-validation yields more stable and interpretable findings.
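
A Leave-One-Dataset-Out split is deterministic once the set of datasets is fixed, which is why it removes partition randomness. The sketch below enumerates the splits; the dataset names are placeholders, since the abstract names only BP4D+ among the five datasets, and `lodo_splits` is a hypothetical helper.

```python
def lodo_splits(dataset_names):
    """Yield (held_out, training_names) with each dataset held out once."""
    names = sorted(dataset_names)
    for held_out in names:
        train = [name for name in names if name != held_out]
        yield held_out, train

# Placeholder names standing in for the five AU datasets.
datasets = ["dataset_a", "dataset_b", "dataset_c", "dataset_d", "dataset_e"]
for held_out, train in lodo_splits(datasets):
    print(f"test on {held_out}, train on {train}")
```

Unlike k-fold subject-exclusive splits, there is no random fold assignment to repeat, so any variation in the resulting scores reflects differences between datasets rather than the luck of the split.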

[182] Reflection Generation for Composite Image Using Diffusion Model

Haonan Zhao,Qingyang Liu,Jiaxuan Chen,Li Niu

Main category: cs.CV

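TL;DR: This paper proposes a diffusion-based reflection generation method that injects reflection placement and appearance priors and adopts a type-aware design, achieving physically coherent and visually realistic reflection synthesis on DEROBA, a newly built large-scale reflection dataset.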

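Details Motivation: Reflection generation has long been neglected in image composition while shadow generation has been studied extensively; existing methods do not model the physical properties and diversity of reflections. Method: Injects reflection placement and appearance priors into a foundation diffusion model and applies type-aware modeling by reflection type (e.g., specular vs. diffuse); builds DEROBA, the first large-scale object reflection dataset, for training. Result: Experiments on DEROBA show that the generated reflections are physically coherent and visually realistic, outperforming existing methods and establishing a new benchmark for reflection generation. Conclusion: This work is the first to systematically address reflection generation in image composition, shows that combining prior knowledge with type-aware design is effective in diffusion models, and offers a new direction for environment-consistent image editing. Abstract: Image composition involves inserting a foreground object into the background while synthesizing environment-consistent effects such as shadows and reflections. Although shadow generation has been extensively studied, reflection generation remains largely underexplored. In this work, we focus on reflection generation. We inject the prior information of reflection placement and reflection appearance into a foundation diffusion model. We also divide reflections into two types and adopt a type-aware model design. To support training, we construct the first large-scale object reflection dataset DEROBA. Experiments demonstrate that our method generates reflections that are physically coherent and visually realistic, establishing a new benchmark for reflection generation.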

[183] ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline

Juan Manuel Hernandez,Mariana Fernandez-Espinosa,Denis Parra,Diego Gomez-Zara

Main category: cs.CV

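TL;DR: This paper presents ViT-Explainer, an interactive visual analysis system for Vision Transformers that explains the full pipeline from image patching to the classification decision.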

Details Motivation: Existing interpretability tools mostly focus on isolated modules or expert-level analysis and lack guided, integrated support for understanding the end-to-end inference process of Vision Transformers. Method: Designs and implements ViT-Explainer, a web-based interactive system that combines animated pipeline walkthroughs, patch-level attention heatmap overlays, and a vision-adapted Logit Lens, with both guided and free exploration modes. Result: A user study (6 participants) indicates that the system is easy to learn and use and helps users understand and interpret ViT behavior. Conclusion: ViT-Explainer provides practical, intuitive, and user-friendly interpretability support for Vision Transformers, filling a gap in end-to-end analysis tools for vision models. Abstract: Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer, a web-based interactive system that provides an integrated visualization of Vision Transformer inference, from patch tokenization to final classification. The system combines animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens within both guided and free exploration modes. A user study with six participants suggests that ViT-Explainer is easy to learn and use, helping users interpret and understand Vision Transformer behavior.

[184] CXR-LT 2026 Challenge: Projection-Aware Multi-Label and Zero-Shot Chest X-Ray Classification

Juno Cho,Dohui Kim,Mingeon Kim,Hyunseo Jang,Chang Sun Lee,Jong Chul Ye

Main category: cs.CV

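TL;DR: This paper proposes a unified framework for multi-label classification of known lesions and zero-shot classification of unseen lesions in chest X-rays (CXR), improving performance and robustness through an ensemble of projection-specific models, an improved dual-branch CheXzero architecture (combining contrastive learning, Asymmetric Loss, and LLM-generated prompts), and strong data and test-time augmentation.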

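Details Motivation: To address the dual challenge of multi-label classification for known chest X-ray lesions and zero-shot classification for unseen ones, while handling different projection views and long-tailed distributions. Method: Integrates projection-specific models into a unified classification framework; extends CheXzero into a dual-branch architecture that combines contrastive learning, Asymmetric Loss (ASL), and LLM-generated descriptive prompts; adds strong data augmentation and test-time augmentation (TTA). Result: Substantially mitigates long-tail class imbalance, improves zero-shot generalization, and achieves greater robustness on both tasks. Conclusion: The method handles both known and unseen lesion classification within a single framework and demonstrates the effectiveness of multimodal prompts and adaptive loss design for zero-shot learning in medical imaging. Abstract: This challenge tackles multi-label classification for known chest X-ray (CXR) lesions and zero-shot classification for unseen ones. To handle diverse CXR projections, we integrate projection-specific models via a classification network into a unified framework. For zero-shot classification (Task 2), we extend CheXzero with a novel dual-branch architecture that combines contrastive learning, Asymmetric Loss (ASL), and LLM-generated descriptive prompts. This effectively mitigates severe long-tail imbalances and maximizes zero-shot generalization. Additionally, strong data and test-time augmentations (TTA) ensure robustness across both tasks.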

[185] Lightweight Spatiotemporal Highway Lane Detection via 3D-ResNet and PINet with ROI-Aware Attention

Sorna Shanmuga Raja,Abdelhafid Zenati

Main category: cs.CV

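TL;DR: This paper proposes a lightweight, end-to-end highway lane detection architecture that combines a 3D CNN with instance segmentation; two improved models (FPN + self-attention, and an ROI detection head) raise accuracy and efficiency, reaching 93.40% accuracy on TuSimple with fewer parameters and lower latency, making it suitable for ADAS/LAS.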

Details Motivation: Existing lane detection methods model spatial and temporal information jointly only weakly in real driving scenarios and suffer from high computational complexity, false positives, and missed detections, so a lightweight, robust, real-time solution is needed. Method: Proposes two joint models built on a 3D-ResNet encoder and a PINet decoder: the first adds an FPN and a self-attention mechanism to strengthen multi-scale features and spatial dependencies; the second adds an ROI detection head to focus on lane-relevant regions and cut computation. Result: On TuSimple, the second model reaches 93.40% accuracy and significantly reduces the miss rate; compared with 2D/3D baselines it has fewer parameters and lower inference latency, and it was validated through offline training and real-time inference. Conclusion: The lightweight end-to-end architecture balances performance and efficiency, is suitable for embedding in ADAS, and has the potential to extend to full Lane Assist Systems (LAS). Abstract: This paper presents a lightweight, end-to-end highway lane detection architecture that jointly captures spatial and temporal information for robust performance in real-world driving scenarios. Building on the strengths of 3D convolutional neural networks and instance segmentation, we propose two models that integrate a 3D-ResNet encoder with a Point Instance Network (PINet) decoder. The first model enhances multi-scale feature representation using a Feature Pyramid Network (FPN) and a Self-Attention mechanism to refine spatial dependencies. The second model introduces a Region of Interest (ROI) detection head to selectively focus on lane-relevant regions, thereby improving precision and reducing computational complexity. Experiments conducted on the TuSimple dataset (highway driving scenarios) demonstrate that the proposed second model achieves 93.40% accuracy while significantly reducing false negatives. Compared to existing 2D and 3D baselines, our approach achieves improved performance with fewer parameters and reduced latency. The architecture has been validated through offline training and real-time inference in the Autonomous Systems Laboratory at City, St George's University of London. These results suggest that the proposed models are well-suited for integration into Advanced Driver Assistance Systems (ADAS), with potential scalability toward full Lane Assist Systems (LAS).

[186] UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving

Yongkang Li,Lijun Zhou,Sixu Yan,Bencheng Liao,Tianyi Yan,Kaixin Xiong,Long Chen,Hongwei Xie,Bing Wang,Guang Chen,Hangjun Ye,Wenyu Liu,Haiyang Sun,Xinggang Wang

Main category: cs.CV

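TL;DR: This paper proposes UniDriveVLA, which uses a Mixture-of-Transformers to decouple perception and reasoning experts, resolving the conflict between spatial perception and semantic reasoning in vision-language-action models for autonomous driving and achieving state-of-the-art results on multiple tasks.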

Details Motivation: Existing VLA models for autonomous driving struggle to deliver both spatial perception and semantic reasoning, because the two are optimized jointly in shared parameters. Method: Proposes UniDriveVLA, based on a Mixture-of-Transformers with three experts for driving understanding, scene perception, and action planning, coordinated through masked joint attention; combines a sparse perception paradigm with a three-stage progressive training strategy. Result: Achieves state-of-the-art results on nuScenes (open-loop) and Bench2Drive (closed-loop), and performs strongly on 3D detection, online mapping, motion forecasting, driving-oriented VQA, and other tasks. Conclusion: By decoupling experts, UniDriveVLA effectively eases the perception-reasoning conflict and serves as a broadly applicable unified VLA model for autonomous driving. Abstract: Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive.
Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla

[187] SCALE: Semantic- and Confidence-Aware Conditional Variational Autoencoder for Zero-shot Skeleton-based Action Recognition

Soroush Oraki,Feng Ding,Jie Liang

Main category: cs.CV

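TL;DR: This paper proposes SCALE, a framework that tackles zero-shot skeleton-based action recognition with a semantic- and confidence-aware listwise energy model; it uses a conditional variational autoencoder for sampling-free likelihood evaluation and introduces a new loss and a prototype contrast objective to improve classification.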

Details Motivation: Existing zero-shot skeleton-based action recognition methods depend on explicit skeleton-text alignment, which is brittle when action names fail to describe fine-grained dynamics and when unseen classes are semantically confusable. Method: Proposes the SCALE framework: 1) a text-conditioned Conditional Variational Autoencoder (CVAE) in which frozen text representations parameterize the latent prior and the decoder; 2) a semantic- and confidence-aware listwise energy loss that emphasizes semantically similar hard negatives and incorporates posterior uncertainty; 3) a latent prototype contrast objective that aligns posterior means with text-derived latent prototypes. Result: On NTU-60 and NTU-120, SCALE consistently outperforms VAE- and alignment-based baselines and is on par with diffusion-based methods. Conclusion: SCALE offers a lightweight, deterministic energy-based modeling paradigm that avoids generative sampling and explicit alignment, and improves the robustness and discriminability of zero-shot action recognition by jointly optimizing energy ranking, uncertainty modeling, and semantic organization. Abstract: Zero-shot skeleton-based action recognition (ZSAR) aims to recognize action classes without any training skeletons from those classes, relying instead on auxiliary semantics from text. Existing approaches frequently depend on explicit skeleton-text alignment, which can be brittle when action names underspecify fine-grained dynamics and when unseen classes are semantically confusable. We propose SCALE, a lightweight and deterministic Semantic- and Confidence-Aware Listwise Energy-based framework that formulates ZSAR as class-conditional energy ranking. SCALE builds a text-conditioned Conditional Variational Autoencoder where frozen text representations parameterize both the latent prior and the decoder, enabling likelihood-based evaluation for unseen classes without generating samples at test time. To separate competing hypotheses, we introduce a semantic- and confidence-aware listwise energy loss that emphasizes semantically similar hard negatives and incorporates posterior uncertainty to adapt decision margins and reweight ambiguous training instances. Additionally, we utilize a latent prototype contrast objective to align posterior means with text-derived latent prototypes, improving semantic organization and class separability without direct feature matching. Experiments on NTU-60 and NTU-120 datasets show that SCALE consistently improves over prior VAE- and alignment-based baselines while remaining competitive with diffusion-based methods.

[188] UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models

Qiyao Zhang,Shuhua Zheng,Jianli Sun,Chengxiang Li,Xianke Wu,Zihan Song,Zhiyong Cui,Yisheng Lv,Yonglin Tian

Main category: cs.CV

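TL;DR: This paper proposes UAV-Track VLA, a model for embodied visual tracking by UAVs in dynamic urban environments; a temporal compression network and a parallel dual-branch decoder improve performance, and the model significantly outperforms existing methods in CARLA simulation while offering zero-shot generalization and real-time operation.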

Details Motivation: Existing VLA models suffer from temporal feature redundancy and lack spatial geometric priors, so they fall short of the demands of embodied visual tracking in complex, dynamic urban scenes. Method: Builds a large-scale dataset with over 890K frames, 176 tasks, and 85 objects, plus a dedicated evaluation benchmark; proposes UAV-Track VLA, based on the π₀.₅ architecture, which adds a temporal compression network to capture inter-frame dynamics and a parallel dual-branch decoder, comprising a spatial-aware auxiliary grounding head and a flow matching action expert, to decouple cross-modal features and generate fine-grained continuous actions. Result: In CARLA simulation, achieves a 61.76% success rate and 269.65 average tracking frames on long-distance pedestrian tracking; shows strong zero-shot generalization; cuts single-step inference latency by 33.4% to 0.0571 s, supporting efficient real-time control. Conclusion: UAV-Track VLA addresses key bottlenecks of VLA models in embodied visual tracking, markedly improving multimodal tracking performance, generalization, and real-time capability, and offers a new paradigm for UAVs carrying out complex tasks. Abstract: Embodied visual tracking is crucial for Unmanned Aerial Vehicles (UAVs) executing complex real-world tasks. In dynamic urban scenarios with complex semantic requirements, Vision-Language-Action (VLA) models show great promise due to their cross-modal fusion and continuous action generation capabilities. To benchmark multimodal tracking in such environments, we construct a dedicated evaluation benchmark and a large-scale dataset encompassing over 890K frames, 176 tasks, and 85 diverse objects. Furthermore, to address temporal feature redundancy and the lack of spatial geometric priors in existing VLA models, we propose an improved VLA tracking model, UAV-Track VLA. Built upon the $\pi_{0.5}$ architecture, our model introduces a temporal compression net to efficiently capture inter-frame dynamics. Additionally, a parallel dual-branch decoder comprising a spatial-aware auxiliary grounding head and a flow matching action expert is designed to decouple cross-modal features and generate fine-grained continuous actions. Systematic experiments in the CARLA simulator validate the superior end-to-end performance of our method. Notably, in challenging long-distance pedestrian tracking tasks, UAV-Track VLA achieves a 61.76\% success rate and 269.65 average tracking frames, significantly outperforming existing baselines. Furthermore, it demonstrates robust zero-shot generalization in unseen environments and reduces single-step inference latency by 33.4\% (to 0.0571s) compared to the original $\pi_{0.5}$, enabling highly efficient, real-time UAV control.
Data samples and demonstration videos are available at: https://github.com/Hub-Tian/UAV-Track_VLA.

[189] SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation

Naomi Kombol,Ivan Martinović,Siniša Šegvić,Giorgos Tolias

Main category: cs.CV

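TL;DR: This paper proposes SPAR (Single-Pass Any-Resolution ViT), a single-pass ViT feature extractor that needs no sliding window and accepts inputs of any resolution; knowledge distillation transfers the spatial reasoning of a sliding-window teacher to a single-pass student, markedly improving mIoU on open-vocabulary segmentation while reducing compute.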

Details Motivation: Foundational Vision Transformers (ViTs) are limited on tasks that need fine-grained spatial understanding (such as open-vocabulary segmentation) because of their fixed pre-training resolution and coarse patch representations; existing high-resolution approaches (such as sliding windows) improve accuracy but are computationally expensive. Method: Proposes the SPAR framework, which uses a feature regression loss to distill the spatial reasoning of a finely-strided sliding-window teacher into a single-pass student ViT, without changing the network architecture or requiring pixel-level supervision. Result: On open-vocabulary segmentation, SPAR improves over single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, confirming its efficient high-resolution inference. Conclusion: SPAR delivers resolution-agnostic, single-pass, efficient, and accurate dense feature extraction, offering a new paradigm for deploying ViTs in dense prediction tasks. Abstract: Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR

[190] Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models

Yaoteng Tan,Zikui Cai,M. Salman Asif

Main category: cs.CV

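TL;DR: This paper proposes a training-free safety control framework for text-to-image generation based on inference-time gradient feedback; it uses frozen multimodal foundation models as semantic energy estimators to steer sampling on the fly, balancing safety and generation quality.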

Details Motivation: Existing safety controls for text-to-image models (such as fine-tuning or curated datasets) often degrade generation quality or scale poorly, so an efficient, general intervention that does not hurt performance is needed. Method: Proposes an inference-time steering framework in which frozen vision-language foundation models provide gradient feedback at each sampling step through clean latent estimates, formulating safety control as an energy-based sampling problem. Result: Achieves state-of-the-art robustness on NSFW red-teaming benchmarks, supports multi-target safety steering, and preserves high-quality generation on benign, non-targeted prompts. Conclusion: The framework offers a plug-and-play, training-free, cross-model safety control paradigm for text-to-image generation and highlights the potential of foundation models as semantic energy estimators. Abstract: Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that leverages gradient feedback from frozen pretrained foundation models to guide the generation process without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting such feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and can generalize across diverse visual concepts. Experiments demonstrate state-of-the-art robustness against NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Our framework provides a principled approach for utilizing foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.

[191] Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Chongjie Ye,Cheng Cao,Chuanyu Pan,Yiming Hao,Yihao Zhi,Yuanming Hu,Xiaoguang Han

Main category: cs.CV

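TL;DR: Omni123 is a 3D-native autoregressive foundation model that unifies text, images, and 3D data as discrete token sequences and uses 2D data as a geometric prior to achieve high-quality, geometrically consistent text-to-3D generation and editing.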

Details Motivation: Existing methods struggle to generate high-quality, geometrically consistent 3D content directly, because high-quality 3D data is scarce and the prevailing indirect pipelines (such as 2D editing followed by lifting) tend to distort geometry. Method: Proposes Omni123: 1) text, images, and 3D are represented as discrete tokens in a shared sequence space; 2) an interleaved X-to-X training paradigm supports heterogeneous paired data (no fully aligned text-image-3D triplets required); 3) semantic-visual-geometric cycles (e.g., text→image→3D→image) are modeled within autoregressive sequences to jointly enforce semantic alignment, appearance fidelity, and multi-view geometric consistency. Result: Significantly improves text-guided 3D generation and editing, outperforming existing methods in geometric consistency, appearance realism, and multi-view coherence, and supporting a scalable path toward multimodal 3D world models. Conclusion: Omni123 shows that 3D-native generation is feasible when 2D data serves as a geometric prior and cross-modal consistency is modeled explicitly, offering a new paradigm for unified multimodal 3D foundation models. Abstract: Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.

[192] AdamFlow: Adam-based Wasserstein Gradient Flows for Surface Registration in Medical Imaging

Qiang Ma,Qingjie Meng,Xin Hu,Yicheng Wu,Wenjia Bai

Main category: cs.CV

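TL;DR: This paper proposes AdamFlow, a fast surface registration method based on probability measures and the sliced Wasserstein distance that is both efficient and robust and performs well on anatomical structure registration.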

Details Motivation: Existing surface registration methods trade efficiency against robustness: local point matching is fast but sensitive to noise and initialization, while global point set alignment is robust but computationally expensive. Method: Models surface meshes as probability measures and registration as a distributional optimization problem; measures discrepancy with a sliced Wasserstein distance of log-linear complexity; proposes the AdamFlow optimizer, which generalizes Adam to the probability space. Result: A theoretical analysis addresses AdamFlow's asymptotic convergence; experiments show it outperforms existing methods on affine and non-rigid registration across multiple anatomical structures, combining efficiency and robustness. Conclusion: The method eases the efficiency-robustness trade-off and provides a practical, scalable tool for anatomical shape analysis in medical imaging. Abstract: Surface registration plays an important role for anatomical shape analysis in medical imaging. Existing surface registration methods often face a trade-off between efficiency and robustness. Local point matching methods are computationally efficient, but vulnerable to noise and initialisation. Methods designed for global point set alignment tend to incur a high computational cost. To address the challenge, here we present a fast surface registration method, which formulates surface meshes as probability measures and surface registration as a distributional optimisation problem. The discrepancy between two meshes is measured using an efficient sliced Wasserstein distance with log-linear computational complexity. We propose a novel optimisation method, AdamFlow, which generalises the well-known Adam optimisation method from the Euclidean space to the probability space for minimising the sliced Wasserstein distance. We theoretically analyse the asymptotic convergence of AdamFlow and empirically demonstrate its superior performance in both affine and non-rigid surface registration across various anatomical structures.
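
The log-linear cost of the sliced Wasserstein distance comes from reducing the problem to one-dimensional sorts. Below is a minimal Monte Carlo sketch for two point sets of equal size with uniform weights, which is a simplifying assumption; the paper treats meshes as probability measures and its exact discretization is not given in the abstract. It covers only the discrepancy, not the AdamFlow optimizer, and `sliced_wasserstein2`, `num_proj`, and `seed` are illustrative names.

```python
import numpy as np

def sliced_wasserstein2(x, y, num_proj=64, seed=0):
    """Estimate the squared sliced 2-Wasserstein distance.

    x, y: (n, d) point sets with the same n and uniform weights.
    Each random direction costs one projection and one sort, O(n log n).
    """
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(num_proj, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    px = np.sort(x @ dirs.T, axis=0)   # (n, num_proj), sorted per direction
    py = np.sort(y @ dirs.T, axis=0)
    # In 1D, optimal transport matches sorted samples in order.
    return float(np.mean((px - py) ** 2))

points = np.random.default_rng(1).normal(size=(50, 3))
print(sliced_wasserstein2(points, points))        # identical sets
print(sliced_wasserstein2(points, points + 1.0))  # translated copy
```

The estimate does not depend on the order in which points are listed, so no point-to-point correspondence between the two surfaces is needed.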

[193] VOID: Video Object and Interaction Deletion

Saman Motamed,William Harvey,Benjamin Klein,Luc Van Gool,Zhuoning Yuan,Ta-Ying Cheng

Main category: cs.CV

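TL;DR: This paper proposes VOID, a framework for physically plausible inpainting after object removal in video; it generates a counterfactual dataset and combines a vision-language model with a video diffusion model to model and repair complex physical interactions such as collisions.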

Details Motivation: Existing video object removal methods cannot handle significant physical interactions between objects (such as collisions), which leads to unrealistic results; video editing models need a better grasp of physical causality. Method: Builds a paired dataset of counterfactual object removals with Kubric and HUMOTO; uses a vision-language model to locate regions affected by the removed object; conditions a video diffusion model on those regions to generate physically consistent counterfactual video. Result: Experiments on synthetic and real data show that VOID preserves consistent scene dynamics better than existing methods, especially in scenes with strong physical interactions such as collisions. Conclusion: VOID brings high-level causal reasoning to video object removal and shows that combining physical simulation, vision-language understanding, and diffusion modeling improves the physical plausibility of video editing, pointing toward editing models that act as world simulators. Abstract: Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.

[194] A Simple Baseline for Streaming Video Understanding

Yujiao Shen,Shulin Tian,Jingkang Yang,Ziwei Liu

Main category: cs.CV

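TL;DR: This paper proposes SimpleStream, a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf vision-language model. On OVO-Bench and StreamingBench it matches or surpasses 13 offline and online video LLM baselines. The results show that long-term memory is not necessarily better than real-time perception, and the authors call for future benchmarks to separate the two so that improvements can be evaluated more clearly.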

Details Motivation: Challenge the current trend of relying on complex memory mechanisms for streaming video understanding, test whether a simple sliding-window baseline is enough for strong performance, and examine the trade-off between long-term context and real-time perception. Method: SimpleStream feeds a clip containing only the most recent N frames (e.g., 4) to an off-the-shelf vision-language model (VLM), with no extra memory, retrieval, or compression modules; it is compared against 13 offline and online video LLMs on OVO-Bench and StreamingBench, with controlled ablations. Result: With only 4 frames, SimpleStream reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench; ablations show that the benefit of longer context depends on the backbone rather than increasing uniformly with model scale, and reveal a perception-memory trade-off: adding historical frames improves recall but weakens real-time perception. Conclusion: Complex memory modules should not be treated as progress by default unless they clearly outperform SimpleStream under the same protocol; future streaming benchmarks should separate recent-scene perception from long-range memory so that model improvements can be evaluated more fairly. Abstract: Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
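
The sliding-window idea described in the abstract above can be sketched in a few lines. This is an illustration, not the SimpleStream code: the class name `SlidingWindowStream` and the `answer_fn` callable (a stand-in for whatever VLM is used) are hypothetical, and frames can be any Python objects.

```python
from collections import deque

class SlidingWindowStream:
    """Keep only the most recent `window` frames and pass them, with the
    question, to a user-supplied model callable. Hypothetical sketch."""

    def __init__(self, answer_fn, window=4):
        # answer_fn(frames, question) stands in for an off-the-shelf VLM.
        self.answer_fn = answer_fn
        # A bounded deque silently drops the oldest frame once it is full.
        self.frames = deque(maxlen=window)

    def push(self, frame):
        """Register a newly arrived frame from the stream."""
        self.frames.append(frame)

    def ask(self, question):
        """Answer using only the frames currently in the window."""
        return self.answer_fn(list(self.frames), question)

# Stub model that just reports which frames it was shown.
stream = SlidingWindowStream(lambda frames, q: f"{q} -> frames {frames}", window=4)
for t in range(10):
    stream.push(t)
print(stream.ask("What is happening now?"))
```

There is no memory bank, retrieval step, or compression here: anything older than the window is simply discarded, which is the property the abstract contrasts with memory-based streaming models.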

[195] Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

Junxuan Li,Rawal Khirodkar,Chengan He,Zhongshi Jiang,Giljoo Nam,Lingchen Yang,Jihyun Lee,Egor Zakharov,Zhaoen Su,Rinat Abdrashitov,Yuan Dong,Julieta Martinez,Kai Li,Qingyang Tan,Takaaki Shiratori,Matthew Hu,Peihong Guo,Xuhua Huang,Ariyan Zarei,Marco Pesavento,Yichen Xu,He Wen,Teng Deng,Wyatt Borsos,Anjali Thakrar,Jean-Charles Bazin,Carsten Stoll,Ginés Hidalgo,James Booth,Lucy Wang,Xiaowen Ma,Yu Rong,Sairanjith Thalanki,Chen Cao,Christian Häne,Abhishek Kar,Sofien Bouaziz,Jason Saragih,Yaser Sheikh,Shunsuke Saito

Main category: cs.CV

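TL;DR: This paper proposes Large-Scale Codec Avatars (LCA), a 3D avatar modeling method that combines pretraining on large-scale in-the-wild data with post-training on high-quality curated data, generalizing to world-scale populations while maintaining high fidelity.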

Details Motivation: Address the trade-off between fidelity and generalization in 3D avatar modeling: multi-view studio data gives high fidelity but generalizes poorly, while large-scale in-the-wild data generalizes well but yields low-quality avatars. Method: A pre/post-training paradigm: pretrain on 1M in-the-wild videos to learn priors over appearance and geometry, then post-train on high-quality curated data to improve expressivity and fidelity; the model supports full-body avatars, fine-grained facial expressions, and finger-level control. Result: LCA generalizes across hair styles, clothing, and demographics with strong identity preservation, and shows emergent abilities without direct supervision, including relightability, loose garment support, and zero-shot robustness to stylized imagery. Conclusion: LCA is the first to bring the pre/post-training paradigm of large models to 3D avatar modeling, balancing fidelity and generalization while supporting efficient feedforward inference. Abstract: High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.

[196] Beyond Referring Expressions: Scenario Comprehension Visual Grounding

Ruozhen He,Nisarg A. Shah,Qihua Dong,Zilin Xiao,Jaywon Koo,Vicente Ordonez

Main category: cs.CV

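TL;DR: This paper introduces a new visual grounding setting, Referring Scenario Comprehension (RSC), in which the target must be inferred from roles, intentions, and relational context rather than explicit naming. It builds a new benchmark with about 31k training examples and a difficulty tagging scheme, and proposes ScenGround, which uses curriculum-style reinforcement learning to improve performance on challenging scenarios.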

Details Motivation: Existing visual grounding benchmarks mainly test alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent category; this work explores a harder scenario-based setting that requires reasoning over roles, intentions, and relational context. Method: Build the Referring Scenario Comprehension (RSC) benchmark, with paragraph-length queries, fine-grained difficulty tags (uniqueness, clutter, size, overlap, position), and an out-of-distribution test split; propose ScenGround, which combines supervised warm-starting with a difficulty-aware reinforcement learning curriculum. Result: Scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal; curriculum training improves performance on challenging slices and transfers to standard benchmarks. Conclusion: Scenario-based visual grounding is a more demanding test of language-vision alignment; the RSC benchmark and ScenGround provide a direction and a reference point for work on deeper semantic understanding and robust reasoning. Abstract: Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.

[197] Steerable Visual Representations

Jona Ruthardt,Manu Gaur,Deva Ramanan,Makarand Tapaswi,Yuki M. Asano

Main category: cs.CV

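TL;DR: This paper proposes Steerable Visual Representations: text prompts are injected early into the layers of the visual encoder through lightweight cross-attention, so that ViT features can be steered with natural language while remaining useful for generic visual tasks.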

Details Motivation: Features from pretrained ViTs (e.g., DINOv2, MAE) focus on the most salient cues and cannot be directed toward less prominent concepts; Multimodal LLMs can be guided by text, but their representations are language-centric and less effective for generic visual tasks. Method: An early-fusion mechanism injects text directly into the layers of the ViT encoder through lightweight cross-attention, allowing natural language to steer both global and local visual features; new benchmarks are introduced to measure representational steerability. Result: The method focuses on any desired object in an image while preserving the underlying representation quality; it matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, with zero-shot generalization to out-of-distribution tasks. Conclusion: Steerable Visual Representations narrow the gap between generic visual features and text-guided control, offering a route to flexible, task-agnostic visual foundation models. Abstract: Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
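
As a rough illustration of what injecting text into a visual encoder layer via cross-attention can look like, the sketch below lets visual tokens attend to text tokens and adds the result back as a residual. Everything specific is an assumption made for illustration, not the paper's design: single-head attention, the weight names, the shared width `d`, and the residual form.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def text_cross_attention(visual, text, w_q, w_k, w_v, w_o):
    """Single-head cross-attention (hypothetical sketch).
    visual: (n_v, d) visual tokens act as queries.
    text:   (n_t, d) text tokens act as keys and values.
    Returns visual tokens of shape (n_v, d) with a text-dependent update."""
    q = visual @ w_q
    k = text @ w_k
    v = text @ w_v
    # Each visual token distributes attention over the text tokens.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # shape (n_v, n_t)
    # Residual connection: the original visual features are kept and adjusted.
    return visual + (attn @ v) @ w_o

rng = np.random.default_rng(0)
d = 8
visual = rng.normal(size=(5, d))   # e.g., 5 patch tokens
text = rng.normal(size=(3, d))     # e.g., 3 prompt tokens
w_q, w_k, w_v, w_o = (rng.normal(size=(d, d)) for _ in range(4))
steered = text_cross_attention(visual, text, w_q, w_k, w_v, w_o)
print(steered.shape)  # (5, 8)
```

With an all-zero `w_o`, the function returns the visual tokens unchanged, which is one simple way such a module could start from the pretrained features; whether the paper initialises it this way is not stated in the abstract.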

[198] Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection

Alex Costanzino,Pierluigi Zama Ramirez,Giuseppe Lisanti,Luigi Di Stefano

Main category: cs.CV

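TL;DR: ModMap is a natively multiview and multimodal framework for 3D anomaly detection and segmentation. It maps features across modalities and views, models view-dependent relationships through feature-wise modulation, and introduces a cross-view training strategy and a dedicated depth encoder, reaching state-of-the-art performance on the SiM3D benchmark.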

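Details Motivation: Existing methods process each view independently, which makes cross-view and cross-modal relationships hard to model, and no depth encoder is tailored to high-resolution industrial 3D data. Method: The ModMap framework 1) builds on the crossmodal feature mapping paradigm to map features across modalities and views; 2) explicitly models view-dependent relationships through feature-wise modulation; 3) uses a cross-view training strategy over all possible view combinations, with multiview ensembling and aggregation for anomaly scoring; 4) trains and publicly releases a depth encoder tailored to industrial datasets. Result: On SiM3D, a recent multiview and multimodal benchmark for 3D anomaly detection and segmentation, ModMap surpasses previous methods by wide margins and attains state-of-the-art performance. Conclusion: ModMap shows that jointly modelling multiview and multimodal information is effective for 3D anomaly detection and segmentation, and contributes a foundational depth encoder for the task. Abstract: We present ModMap, a natively multiview and multimodal framework for 3D anomaly detection and segmentation. Unlike existing methods that process views independently, our method draws inspiration from the crossmodal feature mapping paradigm to learn to map features across both modalities and views, while explicitly modelling view-dependent relationships through feature-wise modulation. We introduce a cross-view training strategy that leverages all possible view combinations, enabling effective anomaly scoring through multiview ensembling and aggregation. To process high-resolution 3D data, we train and publicly release a foundational depth encoder tailored to industrial datasets. Experiments on SiM3D, a recent benchmark that introduces the first multiview and multimodal setup for 3D anomaly detection and segmentation, demonstrate that ModMap attains state-of-the-art performance by surpassing previous methods by wide margins.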

[199] Generative World Renderer

Zheng-Hui Huang,Zhixiang Wang,Jiaming Tan,Ruihan Yu,Yidan Zhang,Bo Zheng,Yu-Lun Liu,Yung-Yu Chuang,Kaipeng Zhang

Main category: cs.CV

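TL;DR: This paper introduces a large-scale dynamic synthetic dataset curated from AAA games. A dual-screen stitched capture method yields 4M frames of synchronized RGB and G-buffer data, aimed at improving the realism and temporal coherence of generative inverse and forward rendering in real-world scenarios; the paper also proposes a VLM-based evaluation protocol that needs no ground truth.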

Details Motivation: Existing synthetic datasets lack realism and temporal coherence, which makes it hard to scale generative inverse and forward rendering to real-world scenes and leaves a persistent domain gap. Method: A dual-screen stitched capture method builds a large-scale dynamic dataset from AAA games with 4M frames at 720p/30 FPS of synchronized RGB and five G-buffer channels; a protocol based on a vision-language model (VLM) measures semantic, spatial, and temporal consistency; a forward rendering toolkit supports text-driven editing from G-buffers. Result: Inverse renderers fine-tuned on the dataset show better cross-dataset generalization and controllable generation; the VLM evaluation correlates strongly with human judgment; the forward rendering toolkit enables text-driven style editing of AAA game footage. Conclusion: With high-quality synthetic data and a new evaluation protocol, the work narrows the gap between simulation and reality for generative rendering and provides scalable data and methods for bidirectional rendering in real-world scenes. Abstract: Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.

[200] ActionParty: Multi-Subject Action Binding in Generative Video Games

Alexander Pondaven,Ziyi Wu,Igor Gilitschenski,Philip Torr,Sergey Tulyakov,Fabio Pizzati,Aliaksandr Siarohin

Main category: cs.CV

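TL;DR: This paper proposes ActionParty, an action-controllable multi-subject world model. By introducing subject state tokens and a spatial biasing mechanism, it addresses the action binding problem in video diffusion models and enables simultaneous control of multiple subjects.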

Details Motivation: Existing video diffusion models suffer from an action binding problem in multi-subject scenes: they struggle to associate specific actions with the corresponding subjects. Method: ActionParty introduces subject state tokens (latent variables) that persistently capture the state of each subject, and jointly models state tokens and video latents with a spatial biasing mechanism, disentangling global frame rendering from individual action-controlled subject updates. Result: On the Melting Pot benchmark, it is the first video world model to control up to seven players simultaneously across 46 diverse environments, with significant improvements in action-following accuracy and identity consistency and robust autoregressive tracking of subjects through complex interactions. Conclusion: ActionParty tackles action binding in multi-subject video generation and offers a new approach for generative video games and multi-agent world modelling. Abstract: Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.

[201] EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors

Luca Bartolomei,Fabio Tosi,Matteo Poggi,Stefano Mattoccia,Guillermo Gallego

Main category: cs.CV

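TL;DR: EventHub is a framework for training deep event stereo networks without ground-truth annotations. It derives proxy annotations and proxy events from standard color images alone, yielding event stereo models that generalize better; the same data distillation mechanism also improves the accuracy of RGB stereo models in challenging conditions such as nighttime scenes.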

Details Motivation: Training event stereo networks depends on ground-truth annotations from costly active sensors; the goal is to lower data collection cost and improve generalization. Method: The EventHub framework uses state-of-the-art novel view synthesis to generate proxy annotations and proxy events from color images; a data factory produces the training set, and state-of-the-art stereo models from the RGB literature are adapted to process event data. Result: Experiments on widely used event stereo datasets support the effectiveness of EventHub; the same data distillation mechanism also improves the accuracy of RGB stereo foundation models in difficult conditions such as nighttime scenes. Conclusion: EventHub enables training event stereo networks without ground-truth annotations, achieves strong generalization, and can in turn improve RGB foundation models. Abstract: We propose EventHub, a novel framework for training deep-event stereo networks without ground truth annotations from costly active sensors, relying instead on standard color images. From these images, we derive either proxy annotations and proxy events through state-of-the-art novel view synthesis techniques, or simply proxy annotations when images are already paired with event data. Using the training set generated by our data factory, we repurpose state-of-the-art stereo models from RGB literature to process event data, obtaining new event stereo models with unprecedented generalization capabilities. Experiments on widely used event stereo datasets support the effectiveness of EventHub and show how the same data distillation mechanism can improve the accuracy of RGB stereo foundation models in challenging conditions such as nighttime scenes.