Table of Contents
cs.CL
[1] The Overlooked Repetitive Lengthening Form in Sentiment Analysis
Lei Wang,Eduard Dragut
Main category: cs.CL
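TL;DR: This paper examines the importance of the Repetitive Lengthening Form (RLF) in sentiment analysis and how well large language models understand it. It introduces Lengthening, the first multi-domain dataset focused on RLF, and ExpInstruct, an explainable instruction-tuning framework that improves both model performance and explainability.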
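To make the notion of repetitive lengthening concrete, here is a minimal sketch of how lengthened tokens might be spotted in raw text. The three-repeat threshold, the regular expression, and both function names are illustrative assumptions; the entry does not describe how the Lengthening dataset was actually filtered.

```python
import re
from typing import List

# Assumption: a token counts as "lengthened" when any non-space character
# repeats three or more times in a row (e.g. "sooooo", "!!!"). This threshold
# is illustrative and is not taken from the paper.
_RLF_PATTERN = re.compile(r"(\S)\1{2,}")


def find_lengthened_tokens(text: str) -> List[str]:
    """Return whitespace-delimited tokens that contain a repeated run."""
    return [tok for tok in text.split() if _RLF_PATTERN.search(tok)]


def normalize_lengthening(token: str) -> str:
    """Collapse every run of three or more identical characters to one."""
    return _RLF_PATTERN.sub(r"\1", token)
```

A rule this simple over-collapses words whose base form has a double letter ("cooool" becomes "col") and flags strings such as "www", so a practical pipeline would likely need a dictionary lookup on top of it.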
Details
Motivation: The Repetitive Lengthening Form (RLF), a distinctive and emphatic informal style, has long been overlooked in sentiment analysis. This paper investigates its importance and whether large language models can understand it.
Method: The authors build Lengthening, the first multi-domain sentiment analysis dataset focused on RLF (850k samples); propose ExpInstruct, a two-stage explainable instruction-tuning framework; and design a unified approach to quantify how well language models understand informal expressions.
Result: RLF sentences are strongly expressive of sentiment and can serve as signatures of document-level sentiment. Fine-tuned pre-trained language models surpass zero-shot GPT-4 in performance on RLF but not in explanation. With limited samples, ExpInstruct brings open-source LLMs to the level of zero-shot GPT-4 in both performance and explainability.
Conclusion: RLF is an important informal form of expression that sentiment analysis should not ignore. ExpInstruct improves models' understanding of RLF and their explainability, and offers a new direction for online content analysis.
Abstract: Individuals engaging in online communication frequently express personal opinions with informal styles (e.g., memes and emojis). While Language Models (LMs) with informal communications have been widely discussed, a unique and emphatic style, the Repetitive Lengthening Form (RLF), has been overlooked for years. In this paper, we explore answers to two research questions: 1) Is RLF important for sentiment analysis (SA)? 2) Can LMs understand RLF? Inspired by previous linguistic research, we curate Lengthening, the first multi-domain dataset with 850k samples focused on RLF for SA. Moreover, we introduce Explainable Instruction Tuning (ExpInstruct), a two-stage instruction tuning framework that aims to improve both performance and explainability of LLMs for RLF. We further propose a novel unified approach to quantify LMs' understanding of informal expressions. We show that RLF sentences are expressive expressions and can serve as signatures of document-level sentiment. Additionally, RLF has potential value for online content analysis. Our results show that fine-tuned Pre-trained Language Models (PLMs) can surpass zero-shot GPT-4 in performance but not in explanation for RLF. Finally, we show ExpInstruct can improve the open-sourced LLMs to match zero-shot GPT-4 in performance and explainability for RLF with limited samples. Code and sample data are available at https://github.com/Tom-Owl/OverlookedRLF
[2] Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming
Qianfan Zhang,Tianyu Guo,Xuandi Ren,Jiale Chen,Ming Ding,Ran Xin,Xia Xiao
Main category: cs.CL
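TL;DR: This paper scales the reasoning-token budget for competitive programming by combining training-time reinforcement learning (RL) with test-time parallel thinking. Verification RL warmup and randomized clipping improve training efficiency, and a multi-round parallel thinking pipeline distributes the token budget at test time; the resulting system surpasses GPT-5-high on hard AetherCode problems.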
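This entry reports results in terms of pass@1 and oracle pass@16. As a reminder of what those quantities measure, the sketch below implements the commonly used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. Reading "oracle pass@16" as "at least one of 16 samples is correct" is an assumption here, and neither function is taken from the paper's code.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def oracle_pass_rate(correct_counts: Sequence[int]) -> float:
    """Fraction of problems with at least one correct sample (assumed oracle view)."""
    return sum(1 for c in correct_counts if c > 0) / len(correct_counts)
```

With 16 samples and 4 correct, `pass_at_k(16, 4, 1)` is 0.25 while `pass_at_k(16, 4, 16)` is 1.0, which shows the gap that a pipeline matching oracle pass@16 at pass@1 has to close.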
Details
Motivation: Competitive programming requires large numbers of reasoning tokens, but scaling single-generation reasoning is expensive under full attention, so more efficient ways to spend the token budget are needed.
Method: Training-time RL (with verification RL warmup and randomized clipping) is combined with a test-time multi-thread, multi-round parallel thinking pipeline (generation, verification, refinement), and the model is trained end-to-end on this pipeline to match the training objective to the test-time structure.
Result: Starting from Seed-OSS-36B, the system with 16 threads and 16 rounds per thread matches the underlying RL model's oracle pass@16 at pass@1, using 7.6M tokens per problem on average, and surpasses GPT-5-high on 456 hard AetherCode problems.
Conclusion: Training-time RL and structured test-time parallel reasoning work together to substantially improve token efficiency and yield stronger performance on hard programming problems.
Abstract: We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time structure. Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model's oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard competitive programming problems from AetherCode.
[3] M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency
Abolfazl Ansari,Delvin Ce Zhang,Zhuoyang Zou,Wenpeng Yin,Dongwon Lee
Main category: cs.CL
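TL;DR: This paper introduces M2-Verify, a large-scale, multimodal, multi-domain dataset for checking scientific claim consistency, designed to evaluate whether models can judge the strict consistency between a scientific claim and its multimodal evidence. Experiments show that current state-of-the-art models degrade markedly on complex visual challenges and hallucinate.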
Details
Motivation: Existing benchmarks lack the scale, domain diversity, and visual complexity needed to realistically evaluate strict consistency between scientific claims and their multimodal evidence.
Method: The authors construct M2-Verify from PubMed and arXiv, with more than 469K instances across 16 domains, validated through expert audits; they run baseline experiments and expert evaluations to test models' consistency judgments and hallucination in generated explanations.
Result: State-of-the-art models reach up to 85.8% Micro-F1 on low-complexity medical perturbations but drop to 61.6% on high-complexity challenges such as anatomical shifts; expert evaluation finds clear hallucinations in the scientific explanations that models generate.
Conclusion: M2-Verify fills a gap in benchmarks for multimodal scientific claim verification and exposes key weaknesses of current models in strict consistency reasoning and trustworthy explanation generation.
Abstract: Evaluating scientific arguments requires assessing the strict consistency between a claim and its underlying multimodal evidence. However, existing benchmarks lack the scale, domain diversity, and visual complexity needed to evaluate this alignment realistically. To address this gap, we introduce M2-Verify, a large-scale multimodal dataset for checking scientific claim consistency. Sourced from PubMed and arXiv, M2-Verify provides over 469K instances across 16 domains, rigorously validated through expert audits. Extensive baseline experiments show that state-of-the-art models struggle to maintain robust consistency. While top models achieve up to 85.8% Micro-F1 on low-complexity medical perturbations, performance drops to 61.6% on high-complexity challenges like anatomical shifts. Furthermore, expert evaluations expose hallucinations when models generate scientific explanations for their alignment decisions. Finally, we demonstrate our dataset's utility and provide comprehensive usage guidelines.
[4] Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences
Simona-Vasilica Oprea,Adela Bâra
Main category: cs.CL
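TL;DR: This paper examines the challenges of learning human preferences in language models and proposes a feature-augmented reward modeling framework. Adding interpretable auxiliary features (response length, refusal signals, toxicity scores, and semantic similarity) improves preference prediction on the HHRLHF dataset, and SHAP/LIME analyses add interpretability, showing that safety and supportive framing play a key role in decisions.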
Details
Motivation: Current reward modeling relies on subjective, ambiguous preference comparisons and struggles to capture the multidimensional nature of human judgment, which limits performance.
Method: Ten LLMs are evaluated on the HHRLHF dataset under a standard pairwise preference setting. Interpretable features (response length, refusal indicators, toxicity scores, and prompt-response semantic similarity) are added to build a feature-augmented hybrid model. SHAP and LIME provide interpretability analysis, and the effect of feature interactions on bias amplification is examined.
Result: All models improve, reaching up to 0.84 ROC AUC (baselines stay below 0.74), with DeBERTav3Large performing best. Interpretability analysis shows that decisions depend on contextualized safety and supportive framing rather than isolated keywords, and the bias-amplification analysis shows that feature interactions influence preference learning.
Conclusion: Feature augmentation improves both the performance and interpretability of reward models, underscoring the need to model multidimensional, context-sensitive human preferences rather than rely on text representations alone.
Abstract: Learning human preferences in language models remains fundamentally challenging, as reward modeling relies on subtle, subjective comparisons or shades of gray rather than clear-cut labels. This study investigates the limits of current approaches and proposes a feature-augmented framework to better capture the multidimensional nature of human judgment. Using the Anthropic HHRLHF dataset, we evaluate ten diverse large language models (LLMs) under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task. To address this, we enrich textual representations with interpretable signals: response length, refusal indicators, toxicity scores and prompt-response semantic similarity, enabling models to explicitly capture key aspects of helpfulness, safety and relevance. The proposed hybrid approach yields consistent improvements across all models, achieving up to 0.84 ROC AUC and significantly higher pairwise accuracy, with DeBERTav3Large demonstrating the best performance. Beyond accuracy, we integrate SHAP and LIME to provide fine-grained interpretability, revealing that model decisions depend on contextualized safety and supportive framing rather than isolated keywords. We further analyze bias amplification, showing that while individual features have weak marginal effects, their interactions influence preference learning.
[5] Procedural Knowledge at Scale Improves Reasoning
Di Wu,Devendra Singh Sachan,Wen-tau Yih,Mingda Chen
Main category: cs.CL
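TL;DR: This paper proposes Reasoning Memory, a retrieval-augmented generation (RAG) framework that extracts and reuses procedural knowledge (such as problem reframing, approach selection, and verification or backtracking) from large collections of reasoning trajectories, substantially improving language models on complex math, science, and coding reasoning tasks.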
Details
Motivation: Existing test-time scaling methods treat each problem in isolation and do not systematically reuse procedural knowledge from prior reasoning trajectories, so they underutilize knowledge of how to reason.
Method: A procedural knowledge datastore of 32 million subquestion-subroutine pairs is built. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines, and reason under diverse retrieved subroutines as implicit procedural priors.
Result: Across six math, science, and coding benchmarks, the method consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline, with gains of up to 19.2% over no retrieval and 7.9% over the strongest compute-matched baseline.
Conclusion: Broad procedural coverage and the decomposition and retrieval design are the key drivers of the gains, indicating that explicitly modeling and reusing knowledge of how to reason strengthens LLM reasoning.
Abstract: Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.
[6] No Attacker Needed: Unintentional Cross-User Contamination in Shared-State LLM Agents
Tiankai Yang,Jiate Li,Yi Nian,Shen Dong,Ruiyao Xu,Ryan Rossi,Kaize Ding,Yue Zhao
Main category: cs.CL
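TL;DR: This paper introduces and formalizes unintentional cross-user contamination (UCC), a failure mode in which context left behind by benign interactions in a multi-user, shared-state LLM agent is wrongly reused and causes silent errors. Experiments find contamination rates of 57-71% under raw shared state; write-time sanitization is effective for conversational shared state but not for executable artifacts, which points to the need for artifact-level defenses.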
Details
Motivation: When LLM agents are deployed with a knowledge layer shared across users, information that is locally valid for one user can degrade another user's outcome through scope confusion. The problem arises from benign interactions alone and lacks a systematic framework for identification and evaluation.
Method: The authors define and formalize UCC, build a controlled evaluation protocol, introduce a taxonomy of three contamination types, and empirically measure contamination rates and the effect of write-time sanitization under two shared-state mechanisms (conversational text state vs. state that includes executable artifacts).
Result: Under raw shared state, UCC occurs at rates of 57-71%. Write-time sanitization is effective when shared state is purely conversational but leaves substantial residual risk when it includes executable artifacts, and errors often appear silently as wrong answers with no warning.
Conclusion: Shared-state LLM agents must go beyond text-level sanitization and develop fine-grained scope control and defenses for specific artifacts (such as code and tool calls) to prevent silent cross-user failures.
Abstract: LLM-based agents increasingly operate across repeated sessions, maintaining task states to ensure continuity. In many deployments, a single agent serves multiple users within a team or organization, reusing a shared knowledge layer across user identities. This shared persistence expands the failure surface: information that is locally valid for one user can silently degrade another user's outcome when the agent reapplies it without regard for scope. We refer to this failure mode as unintentional cross-user contamination (UCC). Unlike adversarial memory poisoning, UCC requires no attacker; it arises from benign interactions whose scope-bound artifacts persist and are later misapplied. We formalize UCC through a controlled evaluation protocol, introduce a taxonomy of three contamination types, and evaluate the problem in two shared-state mechanisms. Under raw shared state, benign interactions alone produce contamination rates of 57-71%. A write-time sanitization is effective when shared state is conversational, but leaves substantial residual risk when shared state includes executable artifacts, with contamination often manifesting as silent wrong answers. These results indicate that shared-state agents need artifact-level defenses beyond text-level sanitization to prevent silent cross-user failures.
[7] Open-Domain Safety Policy Construction
Di Wu,Siyue Liu,Zixiang Ji,Ya-Liang Chang,Zhe-Yu Liu,Andrew Pleffer,Kai-Wei Chang
Main category: cs.CL
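TL;DR: This paper proposes Deep Policy Research (DPR), a lightweight, task-specific agentic system that needs only a small amount of human-written seed domain information to generate a structured content moderation policy through iterative web search and rule distillation. Experiments show that it outperforms baselines on multiple benchmarks and is competitive with expert-written policies in several domains.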
Details
Motivation: Drafting and maintaining content moderation policies is costly, especially for specific domains, so an automated, low-cost, high-quality way to generate policies is needed.
Method: DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill policy rules from diverse web sources, and organize the rules into an indexed document.
Result: On the OpenAI undesired content benchmark (five domains) and an in-house multimodal advertisement moderation benchmark, DPR consistently outperforms definition-only and in-context learning baselines. In the end-to-end setting it approaches expert-written policy quality in several domains, and it outperforms a general-purpose deep research system.
Conclusion: A task-specific, structured research loop suits policy drafting better than generic web research, and DPR offers an efficient, practical approach to automated safety policy construction.
Abstract: Moderation layers are increasingly a core component of many products built on user- or model-generated content. However, drafting and maintaining domain-specific safety policies remains costly. We present Deep Policy Research (DPR), a minimal agentic system that drafts a full content moderation policy based on only human-written seed domain information. DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill diverse web sources into policy rules, and organize rules into an indexed document. We evaluate DPR on (1) the OpenAI undesired content benchmark across five domains with two compact reader LLMs and (2) an in-house multimodal advertisement moderation benchmark. DPR consistently outperforms definition-only and in-context learning baselines, and in our end-to-end setting it is competitive with expert-written policy sections in several domains. Moreover, under the same seed specification and evaluation protocol, DPR outperforms a general-purpose deep research system, suggesting that a task-specific, structured research loop can be more effective than generic web research for policy drafting. We release our experiment code at https://github.com/xiaowu0162/deep-policy-research.
[8] Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models
Itay Yona,Dan Barzilay,Michael Karasik,Mor Geva
Main category: cs.CL
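TL;DR: This paper studies the internal mechanisms behind entity-centric factual question answering in language models. By localizing entity-selective MLP neurons and applying causal interventions, it finds that these neurons concentrate in early layers and that activating a single neuron can recover entity-consistent predictions, supporting a canonicalization interpretation.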
Details
Motivation: Language models can answer many entity-centric factual questions, but the internal mechanisms involved remain unclear.
Method: Entity-selective MLP neurons are localized with templated prompts about each entity and validated with causal interventions on PopQA-based QA examples. The analysis covers a curated set of 200 entities and uses negative ablation and controlled injection.
Result: Localized neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, and injection at a placeholder token improves answer retrieval. For many entities, activating a single neuron is sufficient to recover entity-consistent predictions. The effect is robust to aliases, acronyms, misspellings, and multilingual forms, and popular entities are more likely to have a reliable single-neuron handle.
Conclusion: Language models contain sparse, causally actionable entity access points that can be used to analyze and modulate entity-conditioned factual behavior.
Abstract: Language models can answer many entity-centric factual questions, but it remains unclear which internal mechanisms are involved in this process. We study this question across multiple language models. We localize entity-selective MLP neurons using templated prompts about each entity, and then validate them with causal interventions on PopQA-based QA examples. On a curated set of 200 entities drawn from PopQA, localized neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, while controlled injection at a placeholder token improves answer retrieval relative to mean-entity and wrong-cell controls. For many entities, activating a single localized neuron is sufficient to recover entity-consistent predictions once the context is initialized, consistent with compact entity retrieval rather than purely gradual enrichment across depth. Robustness to aliases, acronyms, misspellings, and multilingual forms supports a canonicalization interpretation. The effect is strong but not universal: not every entity admits a reliable single-neuron handle, and coverage is higher for popular entities. Overall, these results identify sparse, causally actionable access points for analyzing and modulating entity-conditioned factual behavior.
[9] Assessing Pause Thresholds for empirical Translation Process Research
Devi Sri Bandaru,Michael Carl,Xinyue Ren
Main category: cs.CL
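TL;DR: This paper compares three methods for computing typing pause thresholds in the translation process, and proposes and evaluates a new method for computing Production Unit Breaks.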
Details
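Motivation: Existing research assumes that the length of typing pauses reflects how automated or difficult the translation process is, but how to set the pause threshold that separates automated from reflective translation processes has long been debated.
Method: Three recently proposed approaches for computing pause thresholds are compared, and a new method for identifying Production Unit Breaks is proposed and empirically evaluated.
Result: The paper proposes a more principled, practical method for computing Production Unit Breaks and evaluates it empirically.
Conclusion: Setting pause thresholds calls for finer-grained methods; the proposed Production Unit Break computation gives translation process research a more reliable basis for segmenting micro-level behavior.
Abstract: Text production (and translations) proceeds in the form of stretches of typing, interrupted by keystroke pauses. It is often assumed that fast typing reflects unchallenged/automated translation production while long(er) typing pauses are indicative of translation problems, hurdles or difficulties. Building on a long discussion concerning the determination of pause thresholds that separate automated from presumably reflective translation processes (O'Brien, 2006; Alves and Vale, 2009; Timarova et al., 2011; Dragsted and Carl, 2013; Lacruz et al., 2014; Kumpulainen, 2015; Heilmann and Neumann, 2016), this paper compares three recent approaches for computing these pause thresholds, and suggests and evaluates a novel method for computing Production Unit Breaks.
[10] Adaptive Stopping for Multi-Turn LLM Reasoning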
Xiaofan Zhou,Huy Nguyen,Bo Yu,Chenxi Liu,Lu Cheng
Main category: cs.CL
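TL;DR: This paper proposes MiCP, the first conformal prediction (CP) framework for multi-turn reasoning. By allocating different error budgets across turns, it enables adaptive stopping while preserving an overall coverage guarantee, and it reduces inference cost and prediction set size.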
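The idea of spending an error budget across turns can be illustrated with standard split conformal prediction. The sketch below is not MiCP's procedure: it assumes exchangeable calibration scores, uses the textbook conformal quantile, and justifies the per-turn split with a simple union bound (per-turn miscoverage levels that sum to alpha give total miscoverage of at most alpha). The function names and the weighting scheme are hypothetical.

```python
import math
from typing import List, Sequence


def conformal_threshold(scores: Sequence[float], alpha: float) -> float:
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only the trivial
        # threshold (accept everything) keeps the coverage guarantee.
        return math.inf
    return sorted(scores)[rank - 1]


def split_budget(alpha: float, weights: Sequence[float]) -> List[float]:
    """Divide a total miscoverage budget across turns in proportion to weights."""
    total = sum(weights)
    return [alpha * w / total for w in weights]
```

Giving later turns a larger weight spends less of the budget on early stopping decisions, which is one plausible way to trade earlier exits against tighter prediction sets.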
Details
Motivation: Existing multi-turn reasoning methods for LLMs lack a formal stopping criterion, which leads to stopping too early or reasoning for too long in high-stakes domains and hurts both accuracy and efficiency.
Method: Multi-Turn Language Models with Conformal Prediction (MiCP) allocates error budgets across turns in multi-turn RAG and ReAct pipelines, supporting adaptive stopping while maintaining an overall coverage guarantee.
Result: MiCP achieves the target coverage on single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. A new metric that jointly evaluates coverage validity and answering efficiency is also introduced.
Conclusion: MiCP is the first to extend conformal prediction to multi-turn reasoning, providing a stopping mechanism with both theoretical guarantees and practical efficiency for high-reliability AI systems.
Abstract: Large Language Models (LLMs) increasingly rely on multi-turn reasoning and interaction, such as adaptive retrieval-augmented generation (RAG) and ReAct-style agents, to answer difficult questions. These methods improve accuracy by iteratively retrieving information, reasoning, or acting, but introduce a key challenge: When should the model stop? Existing approaches rely on heuristic stopping rules or fixed turn budgets and provide no formal guarantees that the final prediction still contains the correct answer. This limitation is particularly problematic in high-stakes domains such as finance and healthcare, where unnecessary turns increase cost and latency, while stopping too early risks incorrect decisions. Conformal prediction (CP) provides formal coverage guarantees, but existing LLM-CP methods only apply to a single model output and cannot handle multi-turn pipelines with adaptive stopping. To address this gap, we propose Multi-Turn Language Models with Conformal Prediction (MiCP), the first CP framework for multi-turn reasoning. MiCP allocates different error budgets across turns, enabling the model to stop early while maintaining an overall coverage guarantee. We demonstrate MiCP on adaptive RAG and ReAct, where it achieves the target coverage on both single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. We further introduce a new metric that jointly evaluates coverage validity and answering efficiency.
[11] Cost-Efficient Estimation of General Abilities Across Benchmarks
Michael Krumdick,Adam Wiemerslage,Seth Ebner,Charles Lovering,Chris Tanner
Main category: cs.CL
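TL;DR: This paper proposes an efficient LLM evaluation framework grounded in predictive validity. Using a modified multidimensional item response theory (IRT) model with adaptive item selection, it predicts model performance on 112 unseen tasks with a mean absolute error below 7% after observing only 16 items, and a cost-aware strategy reduces the tokens needed for evaluation by 85%.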
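For readers unfamiliar with IRT-driven adaptive testing, the sketch below shows the general idea with the standard unidimensional two-parameter logistic (2PL) model and Fisher-information item selection. The paper uses a modified multidimensional model and cost-aware discount factors whose exact form is not given in this entry, so the 2PL simplification, the information-per-token score, and every name here are assumptions.

```python
import math
from typing import Dict, Iterable, Tuple

# Hypothetical item record: (discrimination a, difficulty b, token cost).
Item = Tuple[float, float, int]


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability that a model with ability theta answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def pick_next_item(theta: float, items: Dict[str, Item],
                   seen: Iterable[str], cost_aware: bool = False) -> str:
    """Pick the unseen item with the most information, optionally per token."""
    seen_set = set(seen)

    def score(name: str) -> float:
        a, b, cost = items[name]
        info = fisher_information(theta, a, b)
        return info / cost if cost_aware else info

    # Raises ValueError if every item has already been administered.
    return max((n for n in items if n not in seen_set), key=score)
```

Scoring by information per token steers selection toward cheap items that are nearly as informative as expensive ones, which is the kind of trade-off behind the reported drop from 141,000 to 22,000 tokens.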
Details
Motivation: The many existing LLM benchmarks are inefficient, even though model ability can be explained by a small number of latent factors. A better benchmarking framework is needed, judged by how efficiently it predicts performance on unseen tasks.
Method: The authors build WILD, a large-scale item-level dataset (65 models, 109,564 items, 163 tasks), and combine a modified multidimensional IRT model with adaptive item selection driven by optimal experimental design, adding cost-aware token discount factors.
Result: Performance on 112 held-out benchmark tasks is predicted with an MAE below 7% after observing only 16 items. With cost awareness, the tokens needed to reach the same accuracy fall from 141,000 to 22,000, an 85% saving.
Conclusion: An evaluation framework oriented toward predictive validity that combines psychometric models with adaptive sampling substantially improves the efficiency and economy of LLM evaluation and offers a new paradigm for future benchmark design.
Abstract: Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the "Wide-scale Item Level Dataset" (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model's performance on a large, diverse collection of unseen tasks under different budget constraints. We demonstrate that combining a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design can predict performance on 112 held-out benchmark tasks with a mean absolute error (MAE) of less than 7%, and can do so after observing only 16 items. We further demonstrate that incorporating cost-aware discount factors into our selection criteria can reduce the total tokens needed to reach 7% MAE from 141,000 tokens to only 22,000, an 85% reduction in evaluation cost.
[12] The power of context: Random Forest classification of near synonyms. A case study in Modern Hindi
Jacek Bąkowski
Main category: cs.CL
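TL;DR: Using distributional semantics (word embeddings) and a Random Forest classifier, this study finds that Hindi synonyms of Sanskrit and Perso-Arabic origin can be accurately distinguished by their contextual usage patterns even when their meanings are close. This indicates that etymology leaves quantifiable traces in language use and supports the view that synonyms carry different perspectives and cultural associations.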
Details
Motivation: To test whether synonyms are truly identical in meaning or implicitly carry historical, cultural, or cognitive differences stemming from their source languages, and in particular whether etymological information can be detected quantitatively in modern usage (distributional semantics).
Method: Word embeddings are built for paired Sanskrit and Perso-Arabic synonyms in Hindi, and a Random Forest model predicts etymological class from contextual distributional features alone (without semantic labels), with semantic relatedness controlled as a robustness check.
Result: The Random Forest classifies origin (Sanskrit vs. Perso-Arabic) significantly above chance, even for semantically unrelated words, showing that distributional context encodes a stable etymological signal.
Conclusion: Synonyms are not semantically redundant: their distributional differences reflect deep historical etymological structure. Etymology shapes the conceptual subspaces of words, forming a new kind of semantic frame driven by history, and context carries fine-grained distinctions beyond traditional semantic similarity.
Abstract: Synonymy is a widespread yet puzzling linguistic phenomenon. Absolute synonyms theoretically should not exist, as they do not expand language's expressive potential. However, it has been suggested that even if synonyms denote the same concept, they may reflect different perspectives or carry distinct cultural associations, claims that have rarely been tested quantitatively. In Hindi, prolonged contact with Persian produced many Perso-Arabic loanwords coexisting with their Sanskrit counterparts, forming numerous synonym pairs. This study investigates whether centuries after these borrowings appeared in the Subcontinent their origin can still be distinguished using distributional data alone and regardless of their semantic content. A Random Forest trained on word embeddings of Hindi synonyms successfully classified words by Sanskrit or Perso-Arabic origin, even when they were semantically unrelated, suggesting that usage patterns preserve traces of etymology. These findings provide quantitative evidence that context encodes etymological signals and that synonymy may reflect subtle but systematic distinctions linked to origin. They support the idea that synonymous words can offer different perspectives and that etymologically related words may form distinct conceptual subspaces, creating a new type of semantic frame shaped by historical origin. Overall, the results highlight the power of context in capturing nuanced distinctions beyond traditional semantic similarity.
[13] Are Finer Citations Always Better? Rethinking Granularity for Attributed Generation
Hexuan Wang,Jingyu Zhang,Benjamin Van Durme,Daniel Khashabi
Main category: cs.CL
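TL;DR: This paper studies how citation granularity (sentence, paragraph, or document level) affects attributed generation. Intermediate (paragraph-level) granularity gives the best balance of attribution quality and answer correctness, while overly fine or overly coarse granularity hurts performance, and the size of the effect varies non-monotonically with model scale.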
Details
Motivation: Fine-grained citations help human verification, but their effect on model attribution performance is under-explored; the question is how to satisfy human verifiability while respecting the model's own semantic modeling capacity.
Method: Sentence-level, paragraph-level, and multi-paragraph citations are evaluated systematically on language models at four scales (8B-120B), measuring attribution quality and answer correctness and analyzing how the pattern changes with model scale.
Result: Paragraph-level citation gives the best attribution quality across model scales, while enforcing fine-grained citations degrades attribution quality by 16-276% relative to the best-performing granularity. Fine granularity disrupts semantic dependencies and coarse granularity introduces noise. Larger models are penalized more heavily by fine-grained constraints, and the optimal granularity preserves or even improves answer correctness.
Conclusion: Optimizing only for human-verifiable fine-grained citation ignores the model's semantic modeling constraints; citation granularity should be aligned with the model's natural semantic scope to achieve both attribution faithfulness and generation reliability.
Abstract: Citation granularity - whether to cite individual sentences, paragraphs, or documents - is a critical design choice in attributed generation. While fine-grained citations are often preferred for precise human verification, their impact on model performance remains under-explored. We analyze four model scales (8B-120B) and demonstrate that enforcing fine-grained citations degrades attribution quality by 16-276% compared to the best-performing granularity. We observe a consistent performance pattern where attribution quality peaks at intermediate granularities (paragraph-level). Our analysis suggests that fine-grained (sentence-level) citations disrupt necessary semantic dependencies for attributing evidence to answer claims, while excessively coarse citations (multi-paragraph) introduce distracting noise. Importantly, the magnitude of this performance gap varies non-monotonically with model scale: fine-grained constraints disproportionately penalize larger models, suggesting that atomic citation units disrupt the multi-sentence information synthesis at which these models excel. Strikingly, citation-optimal granularity leads to substantial gains in attribution quality while preserving or even improving answer correctness. Overall, our findings demonstrate that optimizing solely for human verification via fine-grained citation disregards model constraints, compromising both attribution faithfulness and generation reliability. Instead, effective attribution requires aligning citation granularity with the model's natural semantic scope.
[14] Wired for Overconfidence: A Mechanistic Perspective on Inflated Verbalized Confidence in LLMs
Tianyi Zhao,Yinhan He,Wendy Zheng,Yujie Zhang,Chen Chen
Main category: cs.CL
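TL;DR: Through circuit-level mechanistic analysis, this paper uncovers the internal mechanism by which large language models (LLMs) remain highly confident while producing wrong answers (being "confidently wrong") and identifies the key MLP blocks and attention heads that inflate confidence. Targeted inference-time interventions on these circuits substantially improve calibration.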
Details
Motivation: LLMs are often "confidently wrong": they verbalize overly high confidence when giving factually incorrect answers. This misleads users and weakens confidence as a reliable uncertainty signal, yet the internal mechanism is poorly understood.
Method: A circuit-level mechanistic analysis in three parts: (1) capture verbalized confidence as a differentiable internal signal; (2) identify the circuits (MLP blocks and attention heads) that causally inflate it; (3) use the findings for targeted inference-time recalibration. Experiments cover two instruction-tuned LLMs and three datasets.
Result: A compact set of MLP blocks and attention heads in middle-to-late layers consistently writes the confidence-inflation signal at the final token position; targeted inference-time interventions on them substantially improve calibration.
Conclusion: Verbalized overconfidence in LLMs is driven by identifiable internal circuits and can be mitigated through targeted intervention.
Abstract: Large language models are often not just wrong, but confidently wrong: when they produce factually incorrect answers, they tend to verbalize overly high confidence rather than signal uncertainty. Such verbalized overconfidence can mislead users and weaken confidence scores as a reliable uncertainty signal, yet its internal mechanisms remain poorly understood. We present a circuit-level mechanistic analysis of this inflated verbalized confidence in LLMs, organized around three axes: capturing verbalized confidence as a differentiable internal signal, identifying the circuits that causally inflate it, and leveraging these insights for targeted inference-time recalibration. Across two instruction-tuned LLMs on three datasets, we find that a compact set of MLP blocks and attention heads, concentrated in middle-to-late layers, consistently writes the confidence-inflation signal at the final token position. We further show that targeted inference-time interventions on these circuits substantially improve calibration. Together, our results suggest that verbalized overconfidence in LLMs is driven by identifiable internal circuits and can be mitigated through targeted intervention.
[15] A Dynamic Atlas of Persian Poetic Symbolism: Families, Fields, and the Historical Rewiring of Meaning
Kourosh Shahnazari,Seyed Moein Ayyoubzadeh,Mohammadali Keshtparvar
Main category: cs.CL
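TL;DR: Using a corpus of 129,451 Persian poems, this paper identifies and traces families of symbols (such as Night, Day, and Earth) and builds a multi-layer graph to analyze their change over time. It finds that Persian poetic symbolism is not a static lexicon but a long-lived system whose internal weights and connections change over time.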
Details
Motivation: Existing computational approaches flatten Persian poetic symbols into isolated words or broad document semantics, missing a core unit of organization in Persian poetics: symbols recur as families and gain force through their relations.
Method: From a corpus of 129,451 Persian poems, recurrent symbols are consolidated into traceable families, imagistic material is separated from sacred and courtly reference, and a multi-layer relational graph is built. The data are binned into 11 Hijri centuries to analyze trends in graph structure (modularity, cross-scope linkage, changes in hub nodes, and so on).
Result: The symbolic core is sparse and the referential component dense, with selective attachment between them. Families such as Night, Day, and Earth remain widely distributed. Wine vessels, gardens, flame, and lyric sound strengthen later, while prestige-coded and heroic-courtly vocabulary is weighted earlier. Modularity rises, cross-scope linkage declines, courtly bridges weaken, and sacred bridges strengthen. Among hubs, Sufi Robe gains late prominence, Blessed and Violet recede, and Wine Cup stays central throughout.
Conclusion: Persian poetic symbolism is a dynamically evolving, long-lived system whose internal weights and relations between symbols keep shifting across historical periods, rather than a fixed set of symbols.
Abstract: Persian poetry is often remembered through recurrent symbols before it is remembered through plot. Wine vessels, gardens, flames, sacred titles, bodily beauty, and courtly names return across centuries, yet computational work still tends to flatten this material into isolated words or broad document semantics. That misses a practical unit of organization in Persian poetics: related forms travel as families and gain force through recurring relations. Using a corpus of 129,451 poems, we consolidate recurrent forms into traceable families, separate imagistic material from sacred and courtly reference, and map their relations in a multi-layer graph. The symbolic core is relatively sparse, the referential component much denser, and the attachment zone between them selective rather than diffuse. Across 11 Hijri-century bins, some families remain widely distributed, especially Shab (Night), Ruz (Day), and Khaak (Earth). Wine vessels, garden space, flame, and lyric sound strengthen later, while prestige-coded and heroic-courtly vocabulary is weighted earlier. Century-specific graphs show change in arrangement as well as membership. Modularity rises, cross-scope linkage declines, courtly bridges weaken, and sacred bridges strengthen. Hub positions shift too: Kherqe (Sufi Robe) gains late prominence, Farkhondeh (Blessed) and Banafsheh (Violet) recede, and Saaghar (Wine Cup) stays central across the chronology. In this corpus, Persian symbolism appears less as a fixed repertory than as a long-lived system whose internal weights and connections change over time.
[16] Magic, Madness, Heaven, Sin: LLM Output Diversity is Everything, Everywhere, All at Once
Harnoor Dhingra
Main category: cs.CL
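TL;DR: This paper proposes the Magic, Madness, Heaven, Sin framework, which starts from the normative objectives of tasks (epistemic, interactional, societal, safety) to analyze the different meanings and trade-offs of output variation in large language models. It argues that variation should be evaluated in context rather than treated as an intrinsic property of the model.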
Details
Motivation: Discussion of LLM output "diversity" uses fragmented terminology, largely because the normative objectives behind tasks are rarely made explicit.
Method: A four-context framework (Magic/factuality, Madness/user utility, Heaven/societal representation, Sin/safety) models output variation along a homogeneity-heterogeneity axis, examines the failure modes and vocabulary in each context, and then analyzes cross-context interactions.
Result: Optimizing for one objective (such as safety) can harm others (such as demographic representation or creative diversity), and the criteria for valuing variation differ fundamentally across normative contexts.
Conclusion: Output variation should be understood not as an intrinsic model trait but as a property shaped by task objectives that requires context-sensitive evaluation; normative objectives should be made explicit and evaluated jointly across contexts.
Abstract: Research on Large Language Models (LLMs) studies output variation across generation, reasoning, alignment, and representational analysis, often under the umbrella of "diversity." Yet the terminology remains fragmented, largely because the normative objectives underlying tasks are rarely made explicit. We introduce the Magic, Madness, Heaven, Sin framework, which models output variation along a homogeneity-heterogeneity axis, where valuation is determined by the task and its normative objective. We organize tasks into four normative contexts: epistemic (factuality), interactional (user utility), societal (representation), and safety (robustness). For each, we examine the failure modes and vocabulary such as hallucination, mode collapse, bias, and erasure through which variation is studied. We apply the framework to analyze all pairwise cross-contextual interactions, revealing that optimizing for one objective, such as improving safety, can inadvertently harm demographic representation or creative diversity. We argue for context-aware evaluation of output variation, reframing it as a property shaped by task objectives rather than a model's intrinsic trait.
[17] Why Instruction-Based Unlearning Fails in Diffusion Models?
Zeliang Zhang,Rui Sun,Jiani Liu,Qi Wu,Chenliang Xu
Main category: cs.CL
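TL;DR: This paper studies the effectiveness of instruction-based unlearning in diffusion models and finds that natural-language instructions alone cannot suppress targeted concepts, revealing a fundamental limitation of prompt-level instruction unlearning in diffusion models.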
Details
Motivation: To determine whether the instruction-based unlearning paradigm extends to other generative models such as diffusion models.
Method: Controlled experiments across multiple concepts and prompt variants, together with analysis of the CLIP text encoder and cross-attention dynamics during denoising, examine how diffusion models respond to natural-language unlearning instructions.
Result: Diffusion models systematically fail to suppress targeted concepts under natural-language unlearning instructions. The instructions do not produce sustained reductions in attention to the targeted concept tokens, so the concept representations persist throughout generation.
Conclusion: Prompt-level instruction has a fundamental limitation in diffusion models; effective unlearning requires interventions beyond inference-time language control.
Abstract: Instruction-based unlearning has proven effective for modifying the behavior of large language models at inference time, but whether this paradigm extends to other generative models remains unclear. In this work, we investigate instruction-based unlearning in diffusion-based image generation models and show, through controlled experiments across multiple concepts and prompt variants, that diffusion models systematically fail to suppress targeted concepts when guided solely by natural-language unlearning instructions. By analyzing both the CLIP text encoder and cross-attention dynamics during the denoising process, we find that unlearning instructions do not induce sustained reductions in attention to the targeted concept tokens, causing the targeted concept representations to persist throughout generation. These results reveal a fundamental limitation of prompt-level instruction in diffusion models and suggest that effective unlearning requires interventions beyond inference-time language control.
[18] Read More, Think More: Revisiting Observation Reduction for Web Agents
Masafumi Enomoto,Ryoma Obara,Haochen Zhang,Masafumi Oyamada
Main category: cs.CL
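TL;DR: This paper revisits the trade-off between compact and detailed HTML observation representations for web agents. It finds that higher-capability LLMs benefit more from detailed HTML, while lower-capability models are better served by compact accessibility trees; increasing thinking tokens and adding a diff-based representation of observation history also improve performance.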
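As a rough illustration of what a diff-based observation history could look like, the sketch below uses Python's standard `difflib` to keep only the lines that changed between two page observations. The helper name `observation_diff` and the choice of a unified diff with zero context lines are assumptions made for this example; the paper's actual representation may differ.

```python
import difflib


def observation_diff(previous: str, current: str, context: int = 0) -> str:
    """Return a unified diff between two page observations.

    Hypothetical helper: the diff format and context size are
    illustrative choices, not taken from the paper.
    """
    diff = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="previous",
        tofile="current",
        n=context,      # number of unchanged context lines to keep
        lineterm="",    # inputs have no trailing newlines
    )
    return "\n".join(diff)


if __name__ == "__main__":
    before = "<button>Buy</button>\n<p>Cart: 0</p>"
    after = "<button>Buy</button>\n<p>Cart: 1</p>"
    # Only the changed cart line appears; the unchanged button is omitted.
    print(observation_diff(before, after))
```

With zero context lines, unchanged markup is dropped entirely, which is where the token savings would come from; a larger `context` trades tokens for more surrounding layout.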
Details
Motivation: Prior work treats the verbosity of HTML as an obstacle to performance and routinely applies observation reduction; this paper questions that practice and asks how the best observation representation depends on model capability. Method: Systematic experiments compare HTML and accessibility trees (AT) as observation inputs across models of differing capability, analyze error types, and evaluate the effect of observation history and diff-based representations. Result: Higher-capability models perform better with HTML input and benefit from more thinking tokens; lower-capability models are more stable with AT; diff-based history representations preserve performance while using fewer tokens. Conclusion: Observation representations should be chosen adaptively according to model capability and thinking token budget, and observation history should be incorporated in diff form to improve efficiency and effectiveness. Abstract: Web agents based on large language models (LLMs) rely on observations of web pages -- commonly represented as HTML -- as the basis for identifying available actions and planning subsequent steps. Prior work has treated the verbosity of HTML as an obstacle to performance and adopted observation reduction as a standard practice. We revisit this trend and demonstrate that the optimal observation representation depends on model capability and thinking token budget: (1) compact observations (accessibility trees) are preferable for lower-capability models, while detailed observations (HTML) are advantageous for higher-capability models; moreover, increasing thinking tokens further amplifies the benefit of HTML. (2) Our error analysis suggests that higher-capability models exploit layout information in HTML for better action grounding, while lower-capability models suffer from increased hallucination under longer inputs. We also find that incorporating observation history improves performance across most models and settings, and a diff-based representation offers a token-efficient alternative. Based on these findings, we suggest practical guidelines: adaptively select observation representations based on model capability and thinking token budget, and incorporate observation history using diff-based representations.
[19] Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging
Mengxian Lyu,Cheng Peng,Ziyi Chen,Mengyuan Zhang,Jieting Li Lu,Yonghui Wu
Main category: cs.CL
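TL;DR: This paper proposes a model merging framework that uses interpolation to merge a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct), improving medical task performance while retaining instruction-following ability, mitigating catastrophic forgetting during LLM fine-tuning, and enabling efficient adaptation in low-resource settings.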
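The simplest member of the interpolation-based merge family is plain linear interpolation of matching parameters. The sketch below shows that idea on toy parameter dictionaries; it is not the paper's exact procedure. The function name `interpolate_weights`, the flat-list parameter format, and the mixing coefficient `alpha` are all assumptions made for illustration.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def interpolate_weights(clinical: Params, instruct: Params, alpha: float) -> Params:
    """Linearly interpolate two models' parameters, tensor by tensor.

    merged = alpha * clinical + (1 - alpha) * instruct

    Hypothetical sketch: real checkpoints hold tensors, not flat lists,
    and the paper may use other interpolation-based merge methods.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if clinical.keys() != instruct.keys():
        raise ValueError("both models must share the same parameter names")

    merged: Params = {}
    for name, clinical_values in clinical.items():
        instruct_values = instruct[name]
        if len(clinical_values) != len(instruct_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            alpha * c + (1.0 - alpha) * i
            for c, i in zip(clinical_values, instruct_values)
        ]
    return merged


if __name__ == "__main__":
    clinical = {"layer.weight": [0.0, 2.0]}
    instruct = {"layer.weight": [4.0, 6.0]}
    print(interpolate_weights(clinical, instruct, alpha=0.25))
```

Both models must share an architecture and parameter names for this to be meaningful. In this sketch `alpha` sets how far the merged model leans toward the clinical model versus the instruct model.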
Details
Motivation: When LLMs are applied in the medical domain, fine-tuning often severely degrades instruction-following ability ("forgetting"), which hinders clinical adoption. Method: Uses interpolation-based weight-space merging to combine the clinical foundation model GatorTronLlama with the general instruct model Llama-3.1-8B-Instruct, yielding a merged model with both clinical expertise and instruction-following ability. Result: Across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization), merged models mitigate catastrophic forgetting while preserving clinical domain performance and instruction-following ability; with very limited supervision (64 samples), performance is on par with fully fine-tuned baselines. Conclusion: Weight-space merging is an efficient, scalable way to adapt open-source LLMs to clinical use, suited to deployment in resource-constrained healthcare environments. Abstract: Large language models have been adopted in the medical domain for clinical documentation to reduce clinician burden. However, studies have reported that LLMs often "forget" a significant amount of instruction-following ability when fine-tuned using a task-specific medical dataset, a critical challenge in adopting general-purpose LLMs for clinical applications. This study presents a model merging framework to efficiently adapt general-purpose LLMs to the medical domain by countering this forgetting issue. By merging a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct) via interpolation-based merge methods, we seek to derive a domain-adapted model with strong performance on clinical tasks while retaining instruction-following ability. Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models can effectively mitigate catastrophic forgetting, preserve clinical domain expertise, and retain instruction-following ability. In addition, our model merging strategies demonstrate training efficiency, achieving performance on par with fully fine-tuned baselines under severely constrained supervision (e.g., 64-shot vs. 256-shot). Consequently, weight-space merging constitutes a highly scalable solution for adapting open-source LLMs to clinical applications, facilitating broader deployment in resource-constrained healthcare environments.
[20] DeltaMem: Towards Agentic Memory Management via Reinforcement Learning
Qi Zhang,Shen Huang,Chu Liu,Shouqing Yang,Junbo Zhao,Haobo Wang,Pengjun Xie
Main category: cs.CL
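TL;DR: This paper proposes DeltaMem, a single-agent persona-centric memory management system. It builds a dialogue dataset and memory-update labels modeled on the evolution of human memory, introduces a Memory-based Levenshtein Distance and a reinforcement learning framework, and significantly outperforms existing methods on several long-term memory benchmarks.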
Details
Motivation: Existing multi-agent frameworks for persona-centric memory management suffer from information loss and fragility across scenarios, which leads to suboptimal performance. Method: Proposes DeltaMem, a single-agent memory management system; synthesizes a user-assistant dialogue dataset with operation-level memory-update labels; designs a Memory-based Levenshtein Distance as the update reward; and optimizes memory management with a tailored reinforcement learning framework. Result: DeltaMem, in both its training-free and RL-trained versions, outperforms product-level baselines on multiple long-term memory benchmarks, including LoCoMo, HaluMem, and PersonaMem. Conclusion: The single-agent, end-to-end, human-memory-inspired DeltaMem framework manages persona memory more robustly and efficiently, offering a new paradigm for long-term memory modeling in dialogue systems. Abstract: Recent advances in persona-centric memory have revealed the powerful capability of multi-agent systems in managing persona memory, especially in conversational scenarios. However, these complex frameworks often suffer from information loss and are fragile across varying scenarios, resulting in suboptimal performance. In this paper, we propose DeltaMem, an agentic memory management system that formulates persona-centric memory management as an end-to-end task within a single-agent setting. To further improve the performance of our agentic memory manager, we draw inspiration from the evolution of human memory and synthesize a user-assistant dialogue dataset along with corresponding operation-level memory updating labels. Building on this, we introduce a novel Memory-based Levenshtein Distance to formalize the memory updating reward, and propose a tailored reinforcement learning framework to further enhance the management capabilities of DeltaMem. Extensive experiments show that both training-free and RL-trained DeltaMem outperform all product-level baselines across diverse long-term memory benchmarks, including LoCoMo, HaluMem, and PersonaMem.
[21] Swift-SVD: Theoretical Optimality Meets Practical Efficiency in Low-Rank LLM Compression
Ruoling Qi,Yirui Liu,Xuaner Wu,Xiangyu Wang,Ming Li,Chen Chen,Jian Chen,Yin Chen,Qizhen Weng
Main category: cs.CL
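TL;DR: This paper proposes Swift-SVD, an activation-aware, closed-form SVD compression framework that combines theoretical optimality, practical efficiency, and numerical stability, significantly improving the speed and accuracy of compressing LLM weights and KV caches.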
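The abstract of this entry describes aggregating the covariance of output activations over batches and then performing a single eigenvalue decomposition. The NumPy sketch below is one way to read that description for a single linear layer `y = x @ W`. It is an illustration under assumptions, not the authors' implementation: the function names, the row-vector convention, and the factorization `W ≈ A @ B` are all choices made for this example.

```python
import numpy as np


def accumulate_covariance(weight: np.ndarray, input_batches) -> np.ndarray:
    """Incrementally sum Y^T Y over batches, where Y = X @ W.

    Assumed convention: inputs are row vectors, weight is (d_in, d_out).
    """
    d_out = weight.shape[1]
    cov = np.zeros((d_out, d_out))
    for x in input_batches:
        y = x @ weight          # output activations for this batch
        cov += y.T @ y          # running (unnormalized) covariance
    return cov


def low_rank_factors(weight: np.ndarray, cov: np.ndarray, rank: int):
    """Return (A, B) with W ~= A @ B using one eigendecomposition.

    The top-`rank` eigenvectors of the output covariance span the best
    rank-`rank` subspace for the outputs (Eckart-Young), so projecting
    W onto that subspace minimizes ||X W - X A B||_F on the seen data.
    """
    _, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top = eigvecs[:, -rank:]            # (d_out, rank) leading directions
    return weight @ top, top.T          # A: (d_in, rank), B: (rank, d_out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(6, 4))
    batches = [rng.normal(size=(10, 6)) for _ in range(3)]
    A, B = low_rank_factors(W, accumulate_covariance(W, batches), rank=2)
    print(A.shape, B.shape)
```

Storing `A` and `B` in place of `W` reduces parameters when `rank` is well below `min(d_in, d_out)`. Rank allocation across layers, which the abstract also mentions, is outside this sketch.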
Details
Motivation: LLM deployment is constrained by the memory and bandwidth demands of static weights and the dynamic KV cache; existing SVD compression methods have clear shortcomings in either reconstruction error or computational efficiency. Method: Proposes Swift-SVD, which incrementally aggregates the covariance of output activations over a batch of inputs and performs a single eigenvalue decomposition, giving training-free, fast, and optimal layer-wise low-rank approximation; it uses effective rank to analyze layer compressibility and designs a dynamic rank allocation strategy that accounts for both local reconstruction error and end-to-end layer importance. Result: Experiments on six LLMs and eight datasets show that Swift-SVD achieves the best compression accuracy, with end-to-end compression 3-70x faster than state-of-the-art methods. Conclusion: Swift-SVD offers a hardware-friendly, theoretically grounded, and practically efficient SVD compression approach for LLM deployment. Abstract: The deployment of Large Language Models is constrained by the memory and bandwidth demands of static weights and dynamic Key-Value cache. SVD-based compression provides a hardware-friendly solution to reduce these costs. However, existing methods suffer from two key limitations: some are suboptimal in reconstruction error, while others are theoretically optimal but practically inefficient. In this paper, we propose Swift-SVD, an activation-aware, closed-form compression framework that simultaneously guarantees theoretical optimum, practical efficiency and numerical stability. Swift-SVD incrementally aggregates covariance of output activations given a batch of inputs and performs a single eigenvalue decomposition after aggregation, enabling training-free, fast, and optimal layer-wise low-rank approximation. We employ effective rank to analyze local layer-wise compressibility and design a dynamic rank allocation strategy that jointly accounts for local reconstruction loss and end-to-end layer importance. Extensive experiments across six LLMs and eight datasets demonstrate that Swift-SVD outperforms state-of-the-art baselines, achieving optimal compression accuracy while delivering 3-70X speedups in end-to-end compression time. Our code will be released upon acceptance.
[22] Grounding AI-in-Education Development in Teachers' Voices: Findings from a National Survey in Indonesia
Nurul Aisyah,Muhammad Dehan Al Kautsar,Arif Hidayat,Fajri Koto
Main category: cs.CL
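TL;DR: Through a nationwide survey of 349 K-12 teachers in Indonesia, this study reveals how AI is actually used in classroom teaching, how usage differs, and what the main barriers are: elementary teachers use AI more consistently and senior high teachers less; mid-career teachers place more importance on AI; teachers in Eastern Indonesia perceive greater value; the main use is reducing workload (e.g., lesson planning, assessment, and material development), but use is limited by generic outputs, infrastructure, and insufficient local adaptation.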
Details
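Motivation: To fill the lack of large-scale, teacher-centred empirical evidence on how AI is used in Indonesian classrooms and what support teachers need, and so to inform context-appropriate AI systems and policies. Method: A questionnaire survey of 349 K-12 teachers across elementary, junior high, and senior high schools throughout Indonesia, analyzing differences by school level, teaching experience, and region in frequency of AI use, purposes, perceived value, and barriers. Result: AI use in pedagogy, content development, and teaching media is rising but uneven; elementary teachers use AI more consistently, while senior high teachers engage less; mid-career teachers place more importance on AI; teachers in Eastern Indonesia perceive greater value; the main use is reducing instructional preparation workload; the main barriers are generic AI outputs, weak infrastructure, and a lack of local contextual adaptation. Conclusion: AI tools should be designed to be more sensitive to educational context, locally supported, and compatible with available infrastructure, alongside differentiated teacher support strategies and policies, to promote equitable and effective integration of AI in Indonesian education. Abstract: Despite emerging use in Indonesian classrooms, there is limited large-scale, teacher-centred evidence on how AI is used in practice and what support teachers need, hindering the development of context-appropriate AI systems and policies. To address this gap, we conduct a nationwide survey of 349 K-12 teachers across elementary, junior high, and senior high schools. We find increasing use of AI for pedagogy, content development, and teaching media, although adoption remains uneven. Elementary teachers report more consistent use, while senior high teachers engage less; mid-career teachers assign higher importance to AI, and teachers in Eastern Indonesia perceive greater value. Across levels, teachers primarily use AI to reduce instructional preparation workload (e.g., assessment, lesson planning, and material development). However, generic outputs, infrastructure constraints, and limited contextual alignment continue to hinder effective classroom integration.
[23] Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations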
Shou-Tzu Han,Rodrigue Rizk,KC Santosh
Main category: cs.CL
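TL;DR: This paper studies the fragility of large language models on mathematical reasoning tasks under meaning-preserving surface perturbations (such as name substitution and number-format paraphrasing), proposes the MPD diagnostic framework to analyze failure mechanisms, and builds a failure taxonomy and repair experiments on its findings.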
Details
Motivation: Although LLMs perform strongly on mathematical reasoning benchmarks, they are surprisingly fragile to meaning-preserving surface perturbations, so the mechanisms behind these failures need systematic diagnosis. Method: Evaluates Mistral-7B, Llama-3-8B, and Qwen2.5-7B on 677 GSM8K problems and their semantically equivalent variants; proposes the Mechanistic Perturbation Diagnostics (MPD) framework, which combines logit lens analysis, activation patching, component ablation, and a new metric, the Cascading Amplification Index (CAI); builds a mechanistic failure taxonomy and runs targeted repair experiments. Result: Answer-flip rates for the three models range from 28.8% to 45.1%, with number paraphrasing more disruptive than name substitution; CAI outperforms first divergence layer as a failure predictor for two of the three architectures (AUC up to 0.679); logit lens shows that flipped samples diverge from correct predictions at earlier layers; activation patching reveals an architectural divide, with Llama-3 failures recoverable by patching specific layers and Mistral and Qwen failures broadly distributed; repair works best on localized failures (12.2% recovered) and is limited for entangled and distributed failures (7.2% and 5.2%). Conclusion: The fragility of LLM mathematical reasoning stems from fundamentally different failure mechanisms across architectures; the MPD framework can identify failure types and inform targeted interventions (such as layer fine-tuning and steering vectors), supporting mechanistic improvement of robust reasoning. Abstract: Large language models demonstrate strong performance on mathematical reasoning benchmarks, yet remain surprisingly fragile to meaning-preserving surface perturbations. We systematically evaluate three open-weight LLMs, Mistral-7B, Llama-3-8B, and Qwen2.5-7B, on 677 GSM8K problems paired with semantically equivalent variants generated through name substitution and number format paraphrasing. All three models exhibit substantial answer-flip rates (28.8%-45.1%), with number paraphrasing consistently more disruptive than name swaps. To trace the mechanistic basis of these failures, we introduce the Mechanistic Perturbation Diagnostics (MPD) framework, combining logit lens analysis, activation patching, component ablation, and the Cascading Amplification Index (CAI) into a unified diagnostic pipeline. CAI, a novel metric quantifying layer-wise divergence amplification, outperforms first divergence layer as a failure predictor for two of three architectures (AUC up to 0.679). Logit lens reveals that flipped samples diverge from correct predictions at significantly earlier layers than stable samples. Activation patching reveals a stark architectural divide in failure localizability: Llama-3 failures are recoverable by patching at specific layers (43/60 samples), while Mistral and Qwen failures are broadly distributed (3/60 and 0/60). Based on these diagnostic signals, we propose a mechanistic failure taxonomy (localized, distributed, and entangled) and validate it through targeted repair experiments: steering vectors and layer fine-tuning recover 12.2% of localized failures (Llama-3) but only 7.2% of entangled (Qwen) and 5.2% of distributed (Mistral) failures.
[24] What Do Claim Verification Datasets Actually Test? A Reasoning Trace Analysis
Delip Rao,Chris Callison-Burch
Main category: cs.CL
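TL;DR: This paper systematically analyzes reasoning patterns in 24K claim verification examples and finds that existing benchmarks mainly test direct evidence extraction and lexical matching, with multi-sentence synthesis and numerical reasoning severely under-represented; error types differ markedly by domain, high scores mostly reflect retrieval-plus-entailment ability rather than deeper reasoning, and the paper offers recommendations for building more challenging evaluation suites.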
Details
Motivation: Despite rapid progress in claim verification, there is no systematic understanding of what reasoning its benchmarks actually test. Method: Uses GPT-4o-mini to generate structured reasoning traces for 24K examples across 9 datasets, and uses a 1B-parameter reasoning verifier to analyze five error types and their distribution by domain. Result: Direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented; datasets show stark biases (some test almost only lexical matching, others require information synthesis in roughly half of cases); error types vary by domain: lexical overlap bias dominates general-domain verification, overcautiousness dominates scientific verification, and arithmetic reasoning failures dominate mathematical verification. Conclusion: High benchmark scores today mainly reflect retrieval-plus-entailment ability rather than genuinely complex reasoning; more balanced and challenging evaluation suites are needed to assess the reasoning capabilities of verification systems. Abstract: Despite rapid progress in claim verification, we lack a systematic understanding of what reasoning these benchmarks actually exercise. We generate structured reasoning traces for 24K claim-verification examples across 9 datasets using GPT-4o-mini and find that direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented. A dataset-level breakdown reveals stark biases: some datasets almost exclusively test lexical matching, while others require information synthesis in roughly half of cases. Using a compact 1B-parameter reasoning verifier, we further characterize five error types and show that error profiles vary dramatically by domain -- general-domain verification is dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures. Our findings suggest that high benchmark scores primarily reflect retrieval-plus-entailment ability. We outline recommendations for building more challenging evaluation suites that better test the reasoning capabilities verification systems need.
[25] PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation
Yanxin Luo,Xiaoyu Zhang,Jing Li,Yan Gao,Donghong Han
Main category: cs.CL
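TL;DR: This paper proposes the PRCCF framework, which uses persona-guided retrieval and causality-aware cognitive filtering to improve contextual understanding and empathetic response generation in emotional support conversation.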
Details
Motivation: Existing methods struggle to support deep contextual understanding in emotional support conversation. Method: Proposes the PRCCF framework, comprising a persona-guided retrieval mechanism (jointly modeling semantic compatibility and persona alignment) and a causality-aware cognitive filtering module (prioritizing causally relevant external knowledge). Result: On the ESConv dataset, PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluation. Conclusion: Combining persona guidance with causal awareness strengthens contextual cognitive understanding for emotional reasoning and improves the quality of empathetic responses. Abstract: Emotional Support Conversation (ESC) aims to alleviate individual emotional distress by generating empathetic responses. However, existing methods face challenges in effectively supporting deep contextual understanding. To address this issue, we propose PRCCF, a Persona-guided Retrieval and Causality-aware Cognitive Filtering framework. Specifically, the framework incorporates a persona-guided retrieval mechanism that jointly models semantic compatibility and persona alignment to enhance response generation. Furthermore, it employs a causality-aware cognitive filtering module to prioritize causally relevant external knowledge, thereby improving contextual cognitive understanding for emotional reasoning. Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations. Our code is publicly available at: https://github.com/YancyLyx/PRCCF.
[26] PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment
Chenning Xu,Mao Zheng,Mingyang Song
Main category: cs.CL
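TL;DR: This paper proposes the PRISM framework, which introduces sentence-level factuality risk labels and inter-sentence dependency annotations to adjust learning differentially at factually weak positions during supervised fine-tuning, reducing hallucination in LLM generation.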
Details
Motivation: Supervised fine-tuning (SFT) with token-level hard labels tends to make models over-imitate targets that lack factual support, which amplifies hallucination in multi-sentence generation; more structured, fine-grained factuality supervision is needed. Method: Proposes PRISM, a differentiable risk-gated framework: on top of standard SFT, it combines sentence-level factuality risk labels and inter-sentence dependency annotations, and uses risk weighting with a model-aware probability reallocation objective to suppress high-confidence erroneous predictions only at fact-critical positions. Result: On hallucination-sensitive factual benchmarks and general evaluations, PRISM improves factuality metrics while keeping overall capability competitive; ablations show that conservative use of the auxiliary signal, knowledge masking, and model-aware reallocation play complementary roles. Conclusion: Structured, lightweight, model-aware risk-guided fine-tuning is an effective way to mitigate LLM hallucination, balancing improved factuality with capability preservation. Abstract: Supervised fine-tuning (SFT) with token-level hard labels can amplify overconfident imitation of factually unsupported targets, causing hallucinations that propagate in multi-sentence generation. We study an augmented SFT setting in which training instances include coarse sentence-level factuality risk labels and inter-sentence dependency annotations, providing structured signals about where factual commitments are weakly supported. We propose PRISM, a differentiable risk-gated framework that modifies learning only at fact-critical positions. PRISM augments standard SFT with a lightweight, model-aware probability reallocation objective that penalizes high-confidence predictions on risky target tokens, with its scope controlled by span-level risk weights and model-aware gating. Experiments on hallucination-sensitive factual benchmarks and general evaluations show that PRISM improves factual aggregates across backbones while maintaining a competitive overall capability profile. Ablations further show that the auxiliary signal is most effective when used conservatively, and that knowledge masking and model-aware reallocation play complementary roles in balancing factual correction and capability preservation.
[27] On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning
Zhaoyi Li,Xiangyu Xi,Zhengyu Chen,Wei Wang,Gangwei Jiang,Ranran Shen,Linqi Song,Ying Wei,Defu Lian
Main category: cs.CL
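TL;DR: This paper studies how verified chain-of-thought (CoT) trajectories from different sources affect the generalization of supervised fine-tuned (SFT) models, and finds that low training loss does not imply good generalization; by analyzing differences in reasoning patterns, it proposes filtering out heavily branching trajectories, which improves performance on several reasoning benchmarks.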
Details
Motivation: Although SFT on long CoT trajectories has become a key step in building large reasoning models, how CoT trajectories from different sources affect generalization remains unclear. Method: Performs SFT on verified CoT trajectories generated by DeepSeek-R1-0528 and gpt-oss-120b for identical problem sets, and analyzes their differences in token-level loss and step-level reasoning behavior; then proposes filtering out frequently branching trajectories. Result: CoT from gpt-oss-120b is more convergent and deductive, while CoT from DeepSeek-R1-0528 is more divergent and branch-heavy; filtering the DeepSeek-R1 data on this basis improves reasoning performance by 3.6% on average across five benchmarks, including AIME25 and BeyondAIME. Conclusion: The quality of CoT trajectories (e.g., their reasoning patterns) matters more than simply lowering training loss; filtering out heavily branching trajectories is a simple, effective way to improve SFT generalization. Abstract: Supervised Fine-Tuning (SFT) on long Chain-of-Thought (CoT) trajectories has become a pivotal phase in building large reasoning models. However, how CoT trajectories from different sources influence the generalization performance of models remains an open question. In this paper, we conduct a comparative study using two sources of verified CoT trajectories generated by two competing models, DeepSeek-R1-0528 and gpt-oss-120b, with their problem sets controlled to be identical. Despite their comparable performance, we uncover a striking paradox: lower training loss does not translate to better generalization. SFT on DeepSeek-R1-0528 data achieves remarkably lower training loss, yet exhibits significantly worse generalization performance on reasoning benchmarks compared to those trained on gpt-oss-120b. To understand this paradox, we perform a multi-faceted analysis probing token-level SFT loss and step-level reasoning behaviors. Our analysis reveals a difference in reasoning patterns. gpt-oss-120b exhibits highly convergent and deductive trajectories, whereas DeepSeek-R1-0528 favors a divergent and branch-heavy exploration pattern. Consequently, models trained with DeepSeek-R1 data inherit inefficient exploration behaviors, often getting trapped in redundant exploratory branches that hinder them from reaching correct solutions. Building upon this insight, we propose a simple yet effective remedy of filtering out frequently branching trajectories to improve the generalization of SFT. Experiments show that training on selected DeepSeek-R1-0528 subsets surprisingly improves reasoning performance by up to 5.1% on AIME25, 5.5% on BeyondAIME, and on average 3.6% on five benchmarks.
[28] Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy
Ruijie Yang,Yan Zhu,Peiyao Fu,Te Luo,Zhihua Wang,Xian Yang,Quanlin Li,Pinghong Zhou,Shuo Wang
Main category: cs.CL
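TL;DR: This paper proposes EndoASR, a domain-adapted speech recognition system for gastrointestinal endoscopy. A two-stage adaptation strategy based on synthetic data improves terminology accuracy and noise robustness; in multi-center real-world clinical settings the system lowers character error rate and raises medical term accuracy, and its low real-time factor and lightweight model support edge deployment and use alongside large language models.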
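The two headline metrics in this entry, character error rate (CER) and real-time factor (RTF), have standard definitions. CER is the character-level edit distance divided by the reference length, and RTF is processing time divided by audio duration. The sketch below computes both. The function names are chosen for this example, and the paper's own evaluation scripts (for instance, any text normalization applied before scoring) are not known here.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings (character level)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    print(character_error_rate("polyp", "polip"))  # one substitution in five characters
    print(real_time_factor(0.5, 100.0))
```

An RTF below 1 means transcription runs faster than the audio plays; the reported 0.005 corresponds to roughly half a second of compute per 100 seconds of speech.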
Details
Motivation: Existing ASR systems are unreliable in real-world gastrointestinal endoscopy because of complex specialist terminology and acoustic interference, so a domain-adapted solution is needed. Method: Proposes EndoASR with a two-stage adaptation strategy based on synthetic endoscopy reports: the first stage targets domain-specific language modeling and the second strengthens noise robustness; generalization is validated in a prospective multi-center study. Result: In retrospective evaluation, CER falls from 20.52% to 14.14% and Med ACC rises from 54.30% to 87.59%; in the prospective multi-center study, relative to the Paraformer baseline, CER falls from 16.20% to 14.97% and Med ACC rises from 61.63% to 84.16%; RTF is 0.005 (faster than Whisper-large-v3) with only 220M parameters. Conclusion: EndoASR delivers accurate, low-latency, lightweight endoscopy speech recognition, validated across multiple centers for practicality and generalization, and can serve as a reliable speech interface for human-AI teaming in gastrointestinal endoscopy. Abstract: Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction. These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.
[29] Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
Yanchen Wu,Tenghui Lin,Yingli Zhou,Fangyuan Zhang,Qintian Guo,Xun Zhou,Sibo Wang,Xilin Liu,Yuchi Ma,Yixiang Fang
Main category: cs.CL
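TL;DR: This paper surveys memory methods for LLM agents and unifies them in a single framework, compares them through comprehensive experiments on two benchmarks, analyzes the effectiveness of each class of method, proposes a new memory method with better performance, and points out future research directions.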
Details
Motivation: Existing memory methods for LLM agents have not been compared systematically and comprehensively under a unified experimental setting, making it hard to judge their real effectiveness and suitable scenarios. Method: 1) proposes a unified framework covering all existing memory methods; 2) runs large-scale experimental comparisons of representative memory methods on two well-known benchmarks; 3) based on the analysis, designs a new memory method by combining existing modules. Result: Reveals performance differences among memory methods and the conditions under which each applies; the new memory method outperforms state-of-the-art methods in the experiments; several promising future research directions are summarized. Conclusion: A deeper understanding of existing memory methods can guide method selection in practice and provide key insights for designing next-generation agent memory mechanisms. Abstract: Memory emerges as the core module in the large language model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative reasoning and self-evolution. A number of memory methods have been proposed in the literature. However, these methods have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first summarize a unified framework that incorporates all the existing agent memory methods from a high-level perspective. We then extensively compare representative agent memory methods on two well-known benchmarks and examine the effectiveness of all methods, providing a thorough analysis of those methods. As a byproduct of our experimental analysis, we also design a new memory method by exploiting modules in the existing methods, which outperforms the state-of-the-art methods. Finally, based on these findings, we offer promising future research opportunities. We believe that a deeper understanding of the behavior of existing methods can provide valuable new insights for future research.
[30] Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition
Truc Nguyen,Then Tran,Binh Truong,Phuoc Nguyen T. H
Main category: cs.CL
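TL;DR: This paper proposes a human-machine collaborative framework that combines large language model (LLM) reasoning with acoustic feature models and uses confidence-based routing and iterative rule refinement to improve Vietnamese speech emotion recognition (SER) on ambiguous samples and in low-resource settings.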
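Confidence-based routing of this kind can be pictured as a simple threshold rule. If the acoustic model's top class probability is high enough, its label is accepted; otherwise the sample is handed to a slower reasoning step. The sketch below is a hypothetical illustration only. The threshold value, the `route` function, and the `llm_reasoner` callback are assumptions for this example, and the paper's actual routing criterion and LLM prompt are not specified here.

```python
from typing import Callable, Sequence, Tuple

# The three emotion classes named in the abstract.
LABELS = ("calm", "angry", "panic")


def route(
    probabilities: Sequence[float],
    llm_reasoner: Callable[[Sequence[float]], str],
    threshold: float = 0.8,  # hypothetical cut-off, not from the paper
) -> Tuple[str, str]:
    """Return (label, source) for one sample.

    `probabilities` are the acoustic model's class scores in LABELS order.
    Confident samples keep the acoustic label; uncertain ones are
    delegated to `llm_reasoner`, a stand-in for rule-guided LLM reasoning.
    """
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] >= threshold:
        return LABELS[best], "acoustic"
    return llm_reasoner(probabilities), "llm"


if __name__ == "__main__":
    # Stub reasoner for demonstration; a real system would call an LLM here.
    stub = lambda probs: "panic"
    print(route([0.90, 0.05, 0.05], stub))  # confident: acoustic label kept
    print(route([0.40, 0.35, 0.25], stub))  # ambiguous: delegated
```

Raising the threshold sends more samples to the reasoning step, trading cost for potentially better handling of ambiguous cases.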
Details
Motivation: Vietnamese SER faces ambiguous acoustic patterns, scarce annotated data, and unclear emotional boundaries in real-world conditions, which purely data-driven methods handle poorly. Method: Builds a human-machine collaborative framework centered on LLM reasoning: 1) acoustic models provide confidence and feature-level evidence; 2) a confidence-based routing mechanism separates easy from hard samples and delegates uncertain samples to an LLM for deeper reasoning guided by structured rules derived from human annotation behavior; 3) error analysis and iterative rule updates continuously refine the system. Result: On a dataset of 2,764 Vietnamese speech samples in three classes (calm, angry, panic), the method reaches 86.59% accuracy and a macro F1 of 0.85-0.86, showing its effectiveness on ambiguous and hard-to-classify samples. Conclusion: Combining human knowledge with data-driven models is an effective paradigm for low-resource SER; the framework is model-agnostic and robust. Abstract: Vietnamese Speech Emotion Recognition (SER) remains challenging due to ambiguous acoustic patterns and the lack of reliable annotated data, especially in real-world conditions where emotional boundaries are not clearly separable. To address this problem, this paper proposes a human-machine collaborative framework that integrates human knowledge into the learning process rather than relying solely on data-driven models. The proposed framework is centered around LLM-based reasoning, where acoustic feature-based models are used to provide auxiliary signals such as confidence and feature-level evidence. A confidence-based routing mechanism is introduced to distinguish between easy and ambiguous samples, allowing uncertain cases to be delegated to LLMs for deeper reasoning guided by structured rules derived from human annotation behavior. In addition, an iterative refinement strategy is employed to continuously improve system performance through error analysis and rule updates. Experiments are conducted on a Vietnamese speech dataset of 2,764 samples across three emotion classes (calm, angry, panic), with high inter-annotator agreement (Fleiss Kappa = 0.8574), ensuring reliable ground truth. The proposed method achieves strong performance, reaching up to 86.59% accuracy and Macro F1 around 0.85-0.86, demonstrating its effectiveness in handling ambiguous and hard-to-classify cases. Overall, this work highlights the importance of combining data-driven models with human reasoning, providing a robust and model-agnostic approach for speech emotion recognition in low-resource settings.
[31] Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text
Melania Berbatova,Tsvetoslav Vasev
Main category: cs.CL
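TL;DR: This paper presents a more nuanced approach to detecting toxic content in Bulgarian text: it builds an ontology of toxic vocabulary and a dataset of 4,384 manually annotated sentences, and trains a BERT model that reaches a macro F1 score of 0.89.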
Details
Motivation: Existing toxic content detection methods often wrongly remove important information (such as medical terms and text related to minority groups); detection accuracy and fairness need to improve for Bulgarian. Method: Builds an ontology of Bulgarian toxic vocabulary; collects and manually annotates a four-category dataset (4,384 sentences in total) covering toxic language, medical terminology, non-toxic language, and terms related to minority communities; trains a BERT-based classifier. Result: The BERT model reaches a macro F1 score of 0.89 on toxicity classification and can be deployed in practice and integrated into content moderation systems. Conclusion: The method maintains detection performance while reducing misclassification of important legitimate information, improving the context sensitivity and social inclusiveness of toxicity detection. Abstract: Toxic content detection in online communication remains a significant challenge, with current solutions often inadvertently blocking valuable information, including medical terms and text related to minority groups. This paper presents a more nuanced approach to identifying toxicity in Bulgarian text while preserving access to essential information. The research explores two distinct methodologies for detecting toxic content. The developed methodologies have potential applications across diverse online platforms and content moderation systems. First, we propose an ontology that models the potentially toxic words in the Bulgarian language. Then, we compose a dataset that comprises 4,384 manually annotated sentences from Bulgarian online forums across four categories: toxic language, medical terminology, non-toxic language, and terms related to minority communities. We then train a BERT-based model for toxic language classification, which reaches a 0.89 F1 macro score. The trained model is directly applicable in a real environment and can be integrated as a component of toxic content detection systems.
[32] LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches
Linyang He,Qiyao Yu,Hanze Dong,Baohao Liao,Xinxing Xu,Micah Goldblum,Jiang Bian,Nima Mesgarani
Main category: cs.CL
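TL;DR: This paper proposes LiveMathematicianBench, a mathematical reasoning benchmark built dynamically from recent arXiv papers, with a thirteen-category logical taxonomy of theorem types, proof-sketch-guided distractor generation, and a substitution-resistant mechanism, which makes it markedly more sensitive to the genuine mathematical reasoning ability of LLMs.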
Details
Motivation: Existing mathematical reasoning benchmarks are limited by synthetic settings and data contamination and cannot realistically assess LLMs on research-level mathematical reasoning; evaluation needs to be closer to current research practice, contamination-resistant, and fine-grained. Method: Builds LiveMathematicianBench: 1) selects theorems from arXiv papers published after model training cutoffs; 2) defines a thirteen-category logical taxonomy of theorem types; 3) uses proof sketches to generate distractors that are plausible but logically invalid; 4) introduces a substitution-resistant mechanism to separate answer recognition from substantive reasoning; 5) designs a dual-mode evaluation protocol (with and without proof sketches). Result: The best model, Gemini-3.1-pro-preview, scores only 43.5% in standard evaluation and drops to 17.6% under substitution-resistant evaluation (below the 20% random baseline); GPT-5.4 scores highest under substitution-resistant evaluation at 30.6%; providing proof sketches yields consistent accuracy gains. Conclusion: LiveMathematicianBench is a scalable, contamination-resistant benchmark for research-level mathematical reasoning; it shows that current LLMs remain far from adequate at substantive mathematical reasoning and that access to high-level proof strategies brings measurable gains. Abstract: Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.
[33] Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens
Hanna Hubarava,Yingqiang Gao
Main category: cs.CL
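TL;DR: This paper proposes a domain-agnostic framework for controllable automatic text simplification (CATS) based on instruction fine-tuning with discrete control tokens. It finds that small models (1-3B) control readability well, but compression-rate control is limited by insufficient variation of the target attribute in the training data; it also argues that conventional evaluation metrics do not adequately measure controllability and that error-based measures of target-output alignment are needed.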
Details
Motivation: In CATS, controllability is often reduced to a decoding problem, and evaluation metrics do not reflect actual control; the authors argue that data and evaluation significantly constrain controllability. Method: Proposes a domain-agnostic CATS framework based on instruction fine-tuning with discrete control tokens, runs experiments with Llama, Mistral, and Qwen models from 1B to 14B across four domains (medicine, public administration, news, and encyclopedic text), and designs error-based controllability evaluation together with stratified sampling strategies. Result: Small models (1-3B) control readability (FKGL, ARI, Dale-Chall) consistently; compression control underperforms, mainly because compression-rate signals vary too little in existing corpora; standard simplification and similarity metrics cannot accurately measure controllability; naive random data splits introduce distributional mismatch that undermines training and evaluation. Conclusion: Controllability depends chiefly on sufficient variation of the target attribute in the training data; corpora with high variation and error-based target-output alignment metrics are needed; the framework applies across models and domains. Abstract: Controllable Automatic Text Simplification (CATS) produces user-tailored outputs, yet controllability is often treated as a decoding problem and evaluated with metrics that do not reflect the degree of control. We observe that controllability in ATS is significantly constrained by data and evaluation. To this end, we introduce a domain-agnostic CATS framework based on instruction fine-tuning with discrete control tokens, steering open-source models to target readability levels and compression rates. Across three model families with different model sizes (Llama, Mistral, Qwen; 1-14B) and four domains (medicine, public administration, news, encyclopedic text), we find that smaller models (1-3B) can be competitive, but reliable controllability strongly depends on whether the training data encodes sufficient variation in the target attribute. Readability control (FKGL, ARI, Dale-Chall) is learned consistently, whereas compression control underperforms due to limited signal variability in the existing corpora. We further show that standard simplification and similarity metrics are insufficient for measuring control, motivating error-based measures for target-output alignment. Finally, our sampling and stratification experiments demonstrate that naive splits can introduce distributional mismatch that undermines both training and evaluation.
[34] DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment
Liang Zhu,Feiteng Fang,Yuelin Bai,Longze Chen,Zhexiang Zhang,Minghuan Tan,Min Yang
Main category: cs.CL
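TL;DR: This paper proposes DEFT, an efficient alignment framework that uses a differential distribution reward for data filtering and distributional guidance, improving the efficiency and generalization of aligning large language models with human values while substantially reducing training time.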
Details
Motivation: RLHF methods such as PPO are costly and unstable; alternatives still need large amounts of preference data and may weaken the generalization ability of LLMs.
Method: Distribution-guided Efficient Fine-Tuning (DEFT) computes a differential distribution reward from the language model's output distribution and the discrepancy distribution of the preference data, uses it to select a small, high-quality subset, and incorporates it into existing alignment methods to guide the model's output distribution.
Result: Methods enhanced with DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.
Conclusion: DEFT is an efficient, stable, low-overhead approach to value alignment of LLMs that improves performance while preserving generalization.
Abstract: Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework incorporating data filtering and distributional guidance by calculating the differential distribution reward based on the output distribution of the language model and the discrepancy distribution of preference data. A small yet high-quality subset is filtered from the raw data using a differential distribution reward, which is then incorporated into existing alignment methods to guide the model's output distribution. Experimental results demonstrate that the methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.
[35] PLOT: Enhancing Preference Learning via Optimal Transport
Liang Zhu,Yuelin Bai,Xiankun Ren,Jiaxi Yang,Lei Zhang,Feiteng Fang,Hamid Alinejad-Rokny,Minghuan Tan,Min Yang
Main category: cs.CL
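TL;DR: This paper proposes PLOT, which builds a token-level loss from optimal transport theory to improve the performance, stability, and robustness of preference learning in large language models.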
Details
Motivation: Existing preference learning methods suffer from modest performance gains, high computational cost, hyperparameter sensitivity, and insufficient modeling of global token-level relationships.
Method: Preference learning is formulated as an optimal transport problem with a token-level loss based on token embeddings, aligning the model with human preferences while preserving the LLM's original distribution and using token embeddings to capture semantic relationships for globally informed optimization.
Result: Experiments on seven subpreferences across two categories (Human Values, and Logic & Problem Solving) show that PLOT consistently improves alignment performance while maintaining fluency and coherence.
Conclusion: Optimal transport offers a principled, theoretically grounded framework for preference learning and new insights for preference learning of LLMs.
Abstract: Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global token-level relationships. We introduce PLOT, which enhances Preference Learning in fine-tuning-based alignment through a token-level loss derived from Optimal Transport. By formulating preference learning as an Optimal Transport Problem, PLOT aligns model outputs with human preferences while preserving the original distribution of LLMs, ensuring stability and robustness. Furthermore, PLOT leverages token embeddings to capture semantic relationships, enabling globally informed optimization. Experiments across two preference categories - Human Values and Logic & Problem Solving - spanning seven subpreferences demonstrate that PLOT consistently improves alignment performance while maintaining fluency and coherence. These results substantiate optimal transport as a principled methodology for preference learning, establishing a theoretically grounded framework that provides new insights for preference learning of LLMs.
[36] From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion
Liang Zhu,Haolin Chen,Lidong Zhao,Xian Wu
Main category: cs.CL
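TL;DR: This paper proposes Adaptive Placeholder Completion (APC), a framework that emits placeholders at high-entropy positions instead of forcing concrete code, in order to lower the user's subsequent editing cost. It proves that APC beats conventional Hard Completion (HC) above a certain entropy threshold, and experiments show a 19%-50% reduction in editing cost without harming standard completion performance.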
Details
Motivation: LLMs for code completion follow the Hard Completion (HC) paradigm and generate fully concrete code even when context is insufficient, so many suggestions end up edited or rejected (61% in an analysis of 3 million real-world interactions), indicating that models err where uncertainty is high.
Method: The Adaptive Placeholder Completion (APC) framework formulates code completion as cost minimization under uncertainty and proves that a critical entropy threshold exists above which a placeholder costs less than correcting an error; training data is built from real-world edit logs, and a cost-based reward function is designed for reinforcement learning.
Result: Across 1.5B-14B parameter models, APC lowers expected editing cost by 19%-50% while keeping standard completion performance (e.g., accuracy and similarity) comparable to HC.
Conclusion: APC provides a theoretical foundation and a practical training framework for uncertainty-aware code completion, showing that adaptive abstention (emitting a placeholder when appropriate) can be learned end-to-end without sacrificing conventional completion quality.
Abstract: While Large Language Models (LLMs) have demonstrated exceptional proficiency in code completion, they typically adhere to a Hard Completion (HC) paradigm, compelling the generation of fully concrete code even amidst insufficient context. Our analysis of 3 million real-world interactions exposes the limitations of this strategy: 61% of the generated suggestions were either edited after acceptance or rejected despite exhibiting over 80% similarity to the user's subsequent code, suggesting that models frequently make erroneous predictions at specific token positions. Motivated by this observation, we propose Adaptive Placeholder Completion (APC), a collaborative framework that extends HC by strategically outputting explicit placeholders at high-entropy positions, allowing users to fill directly via IDE navigation. Theoretically, we formulate code completion as a cost-minimization problem under uncertainty. Premised on the observation that filling placeholders incurs lower cost than correcting errors, we prove the existence of a critical entropy threshold above which APC achieves strictly lower expected cost than HC. We instantiate this framework by constructing training data from filtered real-world edit logs and design a cost-based reward function for reinforcement learning. Extensive evaluations across 1.5B--14B parameter models demonstrate that APC reduces expected editing costs by 19% to 50% while preserving standard HC performance.
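The cost argument in this abstract can be illustrated with a minimal sketch. It is not the authors' formulation: the cost values `fix_cost` and `fill_cost`, the function names, and the assumption that a hard completion is wrong with probability 1 - max(p) are all hypothetical, chosen only to show why a placeholder can win when the next-token distribution is flat.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def expected_costs(probs, fix_cost=3.0, fill_cost=1.0):
    """Return (hard_completion_cost, placeholder_cost) for one position.

    Assumptions (hypothetical, for illustration only):
      * a hard completion emits the most likely token and is wrong with
        probability 1 - max(probs), costing `fix_cost` when wrong;
      * a placeholder always costs `fill_cost`, with fill_cost < fix_cost.
    """
    p_wrong = 1.0 - max(probs)
    return p_wrong * fix_cost, fill_cost


def choose_action(probs, fix_cost=3.0, fill_cost=1.0):
    """Pick whichever action has the lower expected editing cost."""
    hard, placeholder = expected_costs(probs, fix_cost, fill_cost)
    return "placeholder" if placeholder < hard else "complete"


confident = [0.90, 0.05, 0.05]        # low entropy: the model is fairly sure
uncertain = [0.25, 0.25, 0.25, 0.25]  # high entropy: four equally likely tokens

for name, dist in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(entropy(dist), 3), choose_action(dist))
```

Under these made-up costs the decision flips once the error probability exceeds fill_cost / fix_cost (here 1/3), which is the kind of threshold behaviour the abstract describes in terms of entropy.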
Our work provides both a theoretical foundation and a practical training framework for uncertainty-aware code completion, demonstrating that adaptive abstention can be learned end-to-end without sacrificing conventional completion quality.
[37] Beyond Detection: Ethical Foundations for Automated Dyslexic Error Attribution
Samuel Rose,Debarati Chakraborty
Main category: cs.CL
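TL;DR: This paper proposes a binary classification approach for distinguishing spelling errors made by dyslexic writers from those of non-dyslexic writers, combining multi-dimensional linguistic features with a twin-input neural model that reaches 93.01% accuracy under writer-independent conditions. It takes an ethics-first stance, systematically analysing fairness, interpretability, consent, transparency, human oversight, and recourse, and gives concrete guidelines for responsible deployment in educational settings.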
Details
Motivation: Dyslexic spelling errors show systematic phonological and orthographic patterns, but prior work has focused on correction rather than attribution and has largely neglected the ethical risks of automated classification, such as harmful labelling, covert screening, algorithmic bias, and institutional misuse.
Method: Dyslexic error attribution is framed as binary classification (given a misspelt word and its correct form, decide whether the error is characteristic of a dyslexic writer); a feature set covering orthographic, phonological, and morphological properties is built, and a twin-input neural model is compared with traditional machine learning baselines under writer-independent conditions.
Result: The neural model achieves 93.01% accuracy and a 94.01% F1-score under writer-independent conditions; phonetically plausible errors and vowel confusions are the strongest attribution signals; the work also includes a fairness analysis, an interpretability assessment, and an ethical deployment framework.
Conclusion: High-accuracy dyslexic error attribution is technically feasible, but deployment in high-stakes educational settings must rest on sound ethical and legal frameworks; feasibility does not imply deployability.
Abstract: Dyslexic spelling errors exhibit systematic phonological and orthographic patterns that distinguish them from the errors produced by typically developing writers. While this observation has motivated dyslexic-specific spell-checking and assistive writing tools, prior work has focused predominantly on error correction rather than attribution, and has largely neglected the ethical risks. The risk of harmful labelling, covert screening, algorithmic bias, and institutional misuse that automated classification of learners entails requires the development of robust ethical and legal frameworks for research in this area. This paper addresses both gaps. We formulate dyslexic error attribution as a binary classification task: given a misspelt word and its correct target form, determine whether the error pattern is characteristic of a dyslexic or non-dyslexic writer. We develop a comprehensive feature set capturing orthographic, phonological, and morphological properties of each error, and propose a twin-input neural model evaluated against traditional machine learning baselines under writer-independent conditions. The neural model achieves 93.01% accuracy and an F1-score of 94.01%, with phonetically plausible errors and vowel confusions emerging as the strongest attribution signals. We situate these technical results within an explicit ethics-first framework, analysing fairness across subgroups, the interpretability requirements of educational deployment, and the conditions (consent, transparency, human oversight, and recourse) under which a system could be responsibly used.
We provide concrete guidelines for ethical deployment and an open discussion of the system's limitations and misuse potential. Our results demonstrate that dyslexic error attribution is feasible at high accuracy while underscoring that feasibility alone is insufficient for deployment in high-stakes educational contexts.
[38] SURE: Synergistic Uncertainty-aware Reasoning for Multimodal Emotion Recognition in Conversations
Yiqiang Cai,Chengyan Wu,Bolei Ma,Bo Chen,Yun Xue,Julia Hirschberg,Ziwei Gong
Main category: cs.CL
TL;DR: SURE is a new framework for multimodal emotion recognition in conversations that improves robustness to noise and contextual reasoning through uncertainty-aware fusion, iterative reasoning, and transformer-based gating.
Details
Motivation: Existing MERC methods focus on fusion but neglect uncertainty in noisy multimodal features and fine-grained contextual reasoning.
Method: SURE comprises three components: (1) Uncertainty-Aware Mixture-of-Experts for modality-specific noise handling, (2) Iterative Reasoning module for multi-turn contextual reasoning, and (3) Transformer Gate module for intra- and inter-modal interaction modeling.
Result: SURE consistently outperforms state-of-the-art methods on benchmark MERC datasets.
Conclusion: Modeling uncertainty and enabling iterative reasoning are critical for robust and effective multimodal emotion recognition in conversations.
Abstract: Multimodal emotion recognition in conversations (MERC) requires integrating multimodal signals while being robust to noise and modeling contextual reasoning. Existing approaches often emphasize fusion but overlook uncertainty in noisy features and fine-grained reasoning. We propose SURE (Synergistic Uncertainty-aware REasoning) for MERC, a framework that improves robustness and contextual modeling. SURE consists of three components: an Uncertainty-Aware Mixture-of-Experts module to handle modality-specific noise, an Iterative Reasoning module for multi-turn reasoning over context, and a Transformer Gate module to capture intra- and inter-modal interactions. Experiments on benchmark MERC datasets show that SURE consistently outperforms state-of-the-art methods, demonstrating its effectiveness in robust multimodal reasoning. These results highlight the importance of uncertainty modeling and iterative reasoning in advancing emotion recognition in conversational settings.
[39] Is Clinical Text Enough? A Multimodal Study on Mortality Prediction in Heart Failure Patients
Oumaima El Khettari,Virgile Barthet,Guillaume Hocquet,Joconde Weller,Emmanuel Morin,Pierre Zweigenbaum
Main category: cs.CL
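TL;DR: This paper evaluates transformer-based models for short-term mortality prediction on a French heart failure cohort, finding that entity-aware multimodal transformers perform best, while large language models are inconsistent and remain limited for clinical decision support.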
Details
Motivation: Accurately predicting short-term mortality in heart failure patients is difficult, particularly when relying only on structured electronic health record data.
Method: Transformer-based models are compared on a French heart failure cohort across text-only, structured-only, multimodal fusion, and LLM-based approaches; entity-level text representations are introduced and combined with supervised multimodal fusion.
Result: Entity-level text representations outperform CLS embeddings alone; supervised multimodal fusion performs best overall; LLMs are inconsistent, with text-only prompts outperforming structured or multimodal inputs.
Conclusion: Entity-aware multimodal transformers are the most reliable option for short-term HF outcome prediction, while current LLM prompting remains limited for clinical decision support.
Abstract: Accurate short-term mortality prediction in heart failure (HF) remains challenging, particularly when relying on structured electronic health record (EHR) data alone. We evaluate transformer-based models on a French HF cohort, comparing text-only, structured-only, multimodal, and LLM-based approaches. Our results show that enriching clinical text with entity-level representations improves prediction over CLS embeddings alone, and that supervised multimodal fusion of text and structured variables achieves the best overall performance. In contrast, large language models perform inconsistently across modalities and decoding strategies, with text-only prompts outperforming structured or multimodal inputs. These findings highlight that entity-aware multimodal transformers offer the most reliable solution for short-term HF outcome prediction, while current LLM prompting remains limited for clinical decision support.
[40] ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues
Bhaskara Hanuma Vedula,Darshan Anghan,Ishita Goyal,Ponnurangam Kumaraguru,Abhijnan Chakraborty
Main category: cs.CL
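TL;DR: This paper introduces the ImplicitBBQ benchmark, which uses characteristic-based cues to evaluate implicit bias in large language models across age, gender, region, religion, caste, and socioeconomic status, and finds that current alignment and prompting strategies do little to mitigate culturally grounded stereotypes.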
Details
Motivation: Existing benchmarks detect implicit bias through name-based proxies, which are only weakly associated with many social demographics and cannot cover dimensions such as age or socioeconomic status; a more reliable and extensible method is needed.
Method: ImplicitBBQ, a QA benchmark built on culturally associated characteristic cues rather than names, covers six social dimensions; 11 models are evaluated for the gap between explicit and implicit bias, and safety prompting, chain-of-thought, and few-shot prompting are tested as mitigations.
Result: In open-weight models, implicit bias in ambiguous contexts is over six times higher than explicit bias; safety prompting and chain-of-thought help little; few-shot prompting reduces implicit bias by 84% but leaves caste bias at four times the level of any other dimension.
Conclusion: Current alignment and prompting strategies address only surface-level bias and leave deeper culturally grounded stereotypic associations unresolved; ImplicitBBQ offers a new benchmark and open resources for future mitigation research.
Abstract: Large Language Models increasingly suppress biased outputs when demographic identity is stated explicitly, yet may still exhibit implicit biases when identity is conveyed indirectly. Existing benchmarks use name based proxies to detect implicit biases, which carry weak associations with many social demographics and cannot extend to dimensions like age or socioeconomic status. We introduce ImplicitBBQ, a QA benchmark that evaluates implicit bias through characteristic based cues, culturally associated attributes that signal implicitly, across age, gender, region, religion, caste, and socioeconomic status. Evaluating 11 models, we find that implicit bias in ambiguous contexts is over six times higher than explicit bias in open weight models. Safety prompting and chain-of-thought reasoning fail to substantially close this gap; even few-shot prompting, which reduces implicit bias by 84%, leaves caste bias at four times the level of any other dimension. These findings indicate that current alignment and prompting strategies address the surface of bias evaluation while leaving culturally grounded stereotypic associations largely unresolved. We publicly release our code and dataset for model providers and researchers to benchmark potential mitigation techniques.
[41] Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification
Géraud Faye,Benjamin Icard,Morgane Casanova,Guillaume Gadek,Guillaume Gravier,Wassila Ouerdane,Céline Hudelot,Sylvain Gatepaille,Paul Égré
Main category: cs.CL
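TL;DR: This paper proposes a neurosymbolic method that combines non-contextual text embeddings (fastText) with symbolic conceptual features (genre, topic, and persuasion techniques) to make propaganda news detection more robust and better at generalizing; experiments show it outperforms text-only methods.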
Details
Motivation: Propaganda detectors based on language models such as BERT tend to overfit because of biases in data collection and generalize poorly to new sources.
Method: A hybrid neurosymbolic classifier that fuses fastText text embeddings with symbolic conceptual features (genre, topic, persuasion techniques).
Result: The method outperforms equivalent text-only baselines on propaganda news detection; ablation studies and explainability analyses confirm the value of the symbolic features.
Conclusion: Adding symbolic conceptual features improves robustness and generalization, making neurosymbolic fusion an effective way to tackle information disorder.
Abstract: Among news disorders, propagandist news is particularly insidious because it tends to mix oriented messages with factual reports intended to look like reliable news. To detect propaganda, extant approaches based on Language Models such as BERT are promising but often overfit their training datasets, due to biases in data collection. To enhance classification robustness and improve generalization to new sources, we propose a neurosymbolic approach combining non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies as well as explainability analyses confirm the benefits of the added features.
Keywords: Information disorder, Fake news, Propaganda, Classification, Topic modeling, Hybrid method, Neurosymbolic model, Ablation, Robustness
[42] How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization
Ramon Ferrer-i-Cancho
Main category: cs.CL
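TL;DR: This paper proposes a framework for analysing the optimality of word order and gesture order based on minimizing swap distance in the permutohedron, validates it on crosslinguistic gesture data (at least 77% optimal), and introduces the quadratic assignment problem (QAP) into linguistics, postulating a unifying principle of optimal assignment.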
Details
Motivation: To test whether word and gesture orders minimize the adjacent-swap distance in the permutohedron, as a way of explaining the minimization of cognitive or communicative cost.
Method: A swap-distance measure on the permutohedron is used to quantify optimality on crosslinguistic gesture data, and the quadratic assignment problem (QAP) is introduced as a unifying optimization model.
Result: Crosslinguistic gesture orders are at least 77% optimal with respect to swap distance, and the repeated hits of optimality are unlikely to be due to chance; QAP is established as an umbrella for multiple linguistic optimization problems.
Conclusion: Word and gesture orders show a systematic tendency towards optimization; swap distance minimization is presented as an optimization principle of communication systems that can be modelled with QAP and generalized into a principle of optimal assignment.
Abstract: The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least 77% optimal. It is unlikely that the multiple times where crosslinguistic gestures hit optimality are due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.
[43] Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite
Klaudia Thellmann,Bernhard Stadler,Michael Färber
Main category: cs.CL
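TL;DR: This paper proposes a three-step automated quality assurance method for assessing and improving machine-translated benchmark datasets (EU20), finds that lower COMET scores go with higher rates of fine-grained errors, and releases cleaned datasets and code.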
Details
Motivation: Machine-translated benchmark datasets offer scale and low cost but suffer from noise, loss of structure, and uneven quality, so scalable ways to measure and verify translation reliability are needed.
Method: A three-step automated quality assurance pipeline: (i) a structural corpus audit with targeted fixes; (ii) quality profiling with COMET (reference-based and reference-free), including a comparison of translation services (DeepL / ChatGPT / Google); (iii) an LLM-based span-level translation error landscape.
Result: Datasets with lower COMET scores (notably HellaSwag) show a higher share of accuracy/mistranslation errors at span level; reference-based COMET on MMLU against human-edited samples points in the same direction; cleaned versions of the EU20 datasets and the code are released.
Conclusion: Automated quality assurance provides practical, scalable indicators that help prioritize human review, complementing rather than replacing human gold standards.
Abstract: Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets, and code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review -- complementing, not replacing, human gold standards.
[44] SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning
Daeyong Kwon,Soyoung Yoon,Seung-won Hwang
Main category: cs.CL
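TL;DR: This paper proposes SAFE, a framework that uses a knowledge graph (KG)-grounded verification pipeline at train time and at inference time to make the reasoning steps of multi-hop question answering strictly verifiable, addressing the problem that existing benchmarks reward LLMs for spurious correctness.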
Details
Motivation: Multi-hop QA benchmarks often reward LLMs for spurious correctness, masking ungrounded or flawed reasoning steps, so a stricter evaluation of reasoning is needed.
Method: SAFE is a dynamic benchmarking framework. At train time, an atomic error taxonomy and a KG-grounded verification pipeline identify and remove noisy supervision (up to 14% of instances are found unanswerable); at inference time, a feedback model trained on the verified data detects ungrounded reasoning steps in real time.
Result: SAFE exposes critical flaws of existing benchmarks at train time and outperforms standard baselines at inference time, with an average accuracy gain of 8.4 percentage points and verifiable reasoning trajectories.
Conclusion: SAFE offers a more reliable, verifiable evaluation paradigm for multi-hop QA, moving models from superficially correct answers towards grounded reasoning.
Abstract: Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real-time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference-time.
[45] $k$NNProxy: Efficient Training-Free Proxy Alignment for Black-Box Zero-Shot LLM-Generated Text Detection
Kahim Wong,Kemou Li,Haiwei Wu,Jiantao Zhou
Main category: cs.CL
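TL;DR: This paper proposes kNNProxy, a training-free, query-efficient method for zero-shot detection of LLM-generated text that reuses the kNN-LM retrieval mechanism as a domain adapter for a fixed proxy LLM, and extends it to a mixture of proxies (MoP) for better robustness across domains.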
Details
Motivation: Zero-shot detectors assume the proxy LLM is well aligned with the source LLM, which rarely holds in black-box settings; existing alignment methods need supervised fine-tuning or repeated API calls, bringing high cost, fragility, and problems under domain shift.
Method: kNNProxy builds a lightweight datastore from a target-reflective LGT corpus; at inference, k-nearest-neighbor retrieval yields a token-level predictive distribution that is interpolated with the proxy LLM's output to give an aligned prediction; the MoP extension routes each input to a domain-specific datastore.
Result: Experiments on multiple benchmarks show strong detection performance; kNNProxy is reported to outperform existing zero-shot and some supervised methods in detection performance, query efficiency, and cross-domain robustness.
Conclusion: kNNProxy is a training-free, low-query-cost, robust approach to zero-shot LGT detection that offers a more reliable and cheaper option for practical deployment.
Abstract: LLM-generated text (LGT) detection is essential for reliable forensic analysis and for mitigating LLM misuse. Existing LGT detectors can generally be categorized into two broad classes: learning-based approaches and zero-shot methods. Compared with learning-based detectors, zero-shot methods are particularly promising because they eliminate the need to train task-specific classifiers. However, the reliability of zero-shot methods fundamentally relies on the assumption that an off-the-shelf proxy LLM is well aligned with the often unknown source LLM, a premise that rarely holds in real-world black-box scenarios. To address this discrepancy, existing proxy alignment methods typically rely on supervised fine-tuning of the proxy or repeated interactions with commercial APIs, thereby increasing deployment costs, exposing detectors to silent API changes, and limiting robustness under domain shift. Motivated by these limitations, we propose the $k$-nearest neighbor proxy ($k$NNProxy), a training-free and query-efficient proxy alignment framework that repurposes the $k$NN language model ($k$NN-LM) retrieval mechanism as a domain adapter for a fixed proxy LLM. Specifically, a lightweight datastore is constructed once from a target-reflective LGT corpus, either via fixed-budget querying or from existing datasets. During inference, nearest-neighbor evidence induces a token-level predictive distribution that is interpolated with the proxy output, yielding an aligned prediction without proxy fine-tuning or per-token API outputs.
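The interpolation step described above can be sketched in a few lines. This follows the generic kNN-LM recipe (a softmax over negative neighbor distances, mixed with the base model's distribution) and is only an illustration: the toy datastore, the squared-L2 distance, and the parameters `k`, `temperature`, and `lam` are assumptions, not values or design choices taken from the paper.

```python
import math


def knn_distribution(query, keys, values, vocab_size, k=3, temperature=1.0):
    """Token distribution induced by the k nearest datastore entries.

    keys   : list of context vectors stored in the datastore
    values : token id that followed each stored context
    """
    # Squared L2 distance from the query context to every stored key.
    dists = [sum((a - b) ** 2 for a, b in zip(key, query)) for key in keys]
    nearest = sorted(range(len(keys)), key=lambda i: dists[i])[:k]
    # Softmax over negative distances, shifted for numerical stability.
    max_logit = max(-dists[i] / temperature for i in nearest)
    weights = [math.exp(-dists[i] / temperature - max_logit) for i in nearest]
    total = sum(weights)
    # Add each neighbor's normalized weight to the token it stored.
    p_knn = [0.0] * vocab_size
    for i, w in zip(nearest, weights):
        p_knn[values[i]] += w / total
    return p_knn


def interpolate(p_proxy, p_knn, lam=0.3):
    """Mix the retrieval distribution with the proxy LLM distribution."""
    return [lam * pk + (1.0 - lam) * pp for pk, pp in zip(p_knn, p_proxy)]


# Toy datastore (hypothetical): four 2-d context vectors and their next tokens.
keys = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
values = [2, 2, 1, 0]
query = [0.1, 0.0]
p_proxy = [0.25, 0.25, 0.25, 0.25]  # stand-in for the proxy LLM's output

p_knn = knn_distribution(query, keys, values, vocab_size=4)
p_aligned = interpolate(p_proxy, p_knn)
print(p_aligned)
```

Setting `lam=0` recovers the unmodified proxy, so the retrieval term acts purely as an additive correction; a detector would then score text under `p_aligned` instead of `p_proxy`.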
To improve robustness under domain shift, we extend $k$NNProxy into a mixture of proxies (MoP) that routes each input to a domain-specific datastore for domain-consistent retrieval. Extensive experiments demonstrate strong detection performance of our method.
[46] Why Gaussian Diffusion Models Fail on Discrete Data?
Alexander Shabalin,Simon Elistratov,Viacheslav Meshchaninov,Ildus Sadrtdinov,Dmitry Vetrov
Main category: cs.CL
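TL;DR: This paper examines why Gaussian diffusion models with the DDPM solver perform poorly on discrete data: within a critical sampling interval where the density of noisified data is multimodal, sampling can drift into low-density regions and degrade sample quality. The authors combine self-conditioning with switching to a q-sampling solver inside that interval and validate the approach on text, code, and protein tasks.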
Details
Motivation: Gaussian diffusion models are well established for continuous domains but work poorly on discrete data such as text and code; the paper aims to understand the underlying failure and propose a remedy.
Method: Analysis on a toy Random Hierarchy Model identifies a critical sampling interval in which the density of noisified data becomes multimodal; a solver termed q-sampling is combined with self-conditioning, switching from DDPM to q-sampling inside the critical interval.
Result: The approach improves generation quality on real data across conditional and unconditional tasks in text, programming code, and proteins.
Conclusion: DDPM's failure on discrete data stems from sampling in low-density regions within the multimodal interval; combining q-sampling with self-conditioning alleviates it and offers a practical option for discrete diffusion modelling.
Abstract: Diffusion models have become a standard approach for generative modeling in continuous domains, yet their application to discrete data remains challenging. We investigate why Gaussian diffusion models with the DDPM solver struggle to sample from discrete distributions that are represented as a mixture of delta-distributions in the continuous space. Using a toy Random Hierarchy Model, we identify a critical sampling interval in which the density of noisified data becomes multimodal. In this regime, DDPM occasionally enters low-density regions between modes, producing out-of-distribution inputs for the model and degrading sample quality. We show that existing heuristics, including self-conditioning and a solver we term q-sampling, help alleviate this issue. Furthermore, we demonstrate that combining self-conditioning with switching from DDPM to q-sampling within the critical interval improves generation quality on real data. We validate these findings across conditional and unconditional tasks in multiple domains, including text, programming code, and proteins.
[47] Tracking the emergence of linguistic structure in self-supervised models learning from speech
Marianne de Heer Kloots,Martijn Bentum,Hosein Mohebbi,Charlotte Pouw,Gaofei Shen,Willem Zuidema
Main category: cs.CL
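TL;DR: This paper studies how six Wav2Vec2 and HuBERT models pre-trained on Dutch speech encode a range of linguistic structures across layers and training stages, finding that the emergence of linguistic structure follows layerwise and temporal patterns and is affected by the level at which the pre-training objective is defined.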
Details
Motivation: To find out when and how self-supervised speech models acquire linguistic structure at different levels during training.
Method: Layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch are analysed for their encoding of a wide range of linguistic structures, comparing layerwise patterns and learning trajectories.
Result: Different levels of linguistic structure show notably distinct layerwise patterns and learning trajectories, partly explained by their degree of abstraction from the acoustic signal and the timescale of integration; higher-order prediction tasks (iteratively refined pseudo-labels) induce greater parallelism.
Conclusion: The emergence of linguistic structure in self-supervised speech models follows systematic patterns that depend not only on model depth but also strongly on the level at which the pre-training objective is defined.
Abstract: Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).
[48] BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
Nicolas Boizard,Théo Deschamps-Berger,Hippolyte Gisserot-Boukhlef,Céline Hudelot,Pierre Colombo
Main category: cs.CL
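TL;DR: This paper presents an open-source recipe for efficiently converting causal generative language models (such as Gemma3 and Qwen3) into bidirectional encoders (BidirLM). It uses a prior masking phase, linear weight merging, and a lightweight multi-domain data mixture to mitigate catastrophic forgetting, merges in specialized causal models to strengthen multimodal representations, and outperforms existing alternatives on text, vision, and audio benchmarks.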
Details
Motivation: Existing approaches for turning causal generative models into bidirectional encoders lack consensus on training objectives, suffer from catastrophic forgetting at scale, and cannot flexibly integrate the many specialized generative models available.
Method: Systematic ablations on the Gemma3 and Qwen3 families show that a prior masking phase is critical; a dual strategy that needs no original pre-training data combines linear weight merging with a lightweight multi-domain data mixture; the encoders are further augmented by merging them with specialized causal models to transfer modality- and domain-specific capabilities.
Result: The open-source BidirLM family of five encoders outperforms alternatives on text, vision, and audio representation benchmarks.
Conclusion: The recipe offers a general, scalable, and modular way to convert any causal decoder LLM into a bidirectional encoder.
Abstract: Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.
[49] Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
Tao Jin,Phuong Minh Nguyen,Naoya Inoue
Main category: cs.CL
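TL;DR: This paper proposes GOOSE, a framework that builds an adaptive spine tree (an anisotropic tree) to make speculative decoding for large language models more efficient, achieving 1.9-4.3x lossless speedup.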
Details
Motivation: Existing training-free speculative decoding methods do not distinguish candidate quality across token sources (such as n-gram matches versus statistical predictions), which leads to inefficient trees; when high- and low-quality tokens are mixed, balanced trees are bound by the depth-breadth trade-off.
Method: GOOSE builds an adaptive spine tree: high-acceptance context-matched tokens form a deep chain (the spine), and low-acceptance tokens hang off each node as wide branches; the authors prove that, when a quality gap exists, the optimal tree under a fixed verification budget is anisotropic, and that the number of accepted tokens per step is at least that of either source alone.
Result: On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.
Conclusion: Exploiting the quality gap between token sources with an asymmetric tree overcomes the depth limit of balanced trees; GOOSE is a training-free framework that is guaranteed to accept at least as many tokens per step as either single source and outperforms balanced trees in practice.
Abstract: Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from prior forward passes - differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.
[50] Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
Yuhang Wu,Xiangqing Shen,Fanfan Wang,Cangqi Zhou,Zhen Wu,Xinyu Dai,Rui Xia
Main category: cs.CL
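TL;DR: This paper proposes RRPO, a reinforcement learning framework that aligns reranking directly with the generation quality of the large language model without human annotations, significantly improving reranking on knowledge-intensive tasks.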
Details
Motivation: Reranking models are optimized in isolation on static human-annotated relevance labels, decoupled from downstream generation, so documents judged highly relevant do not necessarily help the LLM generate an answer.
Method: ReRanking Preference Optimization (RRPO) is a reinforcement learning framework that formulates reranking as a sequential decision-making process, optimizes context utility using LLM feedback, and adds a reference-anchored deterministic baseline for training stability.
Result: On knowledge-intensive benchmarks, RRPO significantly outperforms strong baselines including RankZephyr; it generalizes across reader LLMs, integrates orthogonally with query expansion modules, and is robust to noisy supervision.
Conclusion: RRPO bridges the utility gap between reranking and generation and offers a stable, general way to optimize rerankers without human annotations.
Abstract: Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.
[51] Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations
Haitong Sun,Stephen McIntosh,Kwanghee Choi,Eunjung Yeo,Daisuke Saito,Nobuaki Minematsu
Main category: cs.CL
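TL;DR: This paper proposes prosodic ABX, a framework for evaluating how sensitive self-supervised speech models (S3Ms) are to prosodic contrasts (stress, tone, pitch accent), and applies it to minimal-pair datasets for English, Japanese, and Mandarin.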
Details
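Motivation: Prior work has examined the sensitivity of S3Ms to phonemic contrasts, but their sensitivity to prosodic contrasts (stress, tone, pitch accent) has not been measured directly.
Method: Extends the ABX discrimination task to prosodic ABX, which needs only a handful of examples and no explicit labels; builds and releases a dataset of English and Japanese minimal pairs and, together with an existing Mandarin dataset, evaluates English stress, Japanese pitch accent, and Mandarin tone.
Result: The evaluation measures S3M sensitivity to several prosodic contrasts; model and layer rankings are often preserved across several experimental conditions, which makes the method practical in low-resource settings.
Conclusion: Prosodic ABX is a lightweight, language-agnostic framework for quantifying how well S3Ms encode prosodic information, and it supports model comparison under low-resource conditions.
Abstract: Speech representations from self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been directly measured. The ABX discrimination task has been used to measure phonemic contrast in S3M representations via minimal pairs. We introduce prosodic ABX, an extension of this framework to evaluate prosodic contrast with only a handful of examples and no explicit labels. Also, we build and release a dataset of English and Japanese minimal pairs and use it along with a Mandarin dataset to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, we show that model and layer rankings are often preserved across several experimental conditions, making it practical for low-resource settings.
[52] Reliable Control-Point Selection for Steering Reasoning in Large Language Models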
Haomin Zhuang,Hojun Yoo,Xiaonan Luo,Kehan Guo,Xiangliang Zhang
Main category: cs.CL
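TL;DR: This paper proposes a stability-filtering approach to steering reasoning behavior: it identifies behavior boundaries that are genuinely stable and reproducible in the model, builds more effective training-free steering vectors from them, improves math reasoning accuracy, and supports transfer across models.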
Details
Motivation: Existing methods detect reasoning behaviors (such as self-reflection) by keyword matching and assume that every detected boundary carries a genuine, stable behavioral signal; empirically, 93.3% of them do not reproduce, which severely dilutes the steering signal.
Method: Develops a probabilistic model that treats intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities; proposes stability filtering, which keeps only boundaries where the behavior is consistently reproduced; and adds a content-subspace projection that removes question-specific noise, yielding robust steering vectors.
Result: Reaches 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline); the steering vectors transfer directly across models in the same architecture family, improving Nemotron-Research-Reasoning-1.5B by +5.0 and DeepScaleR-1.5B-Preview by +6.0.
Conclusion: Stability of behavior boundaries is a key prerequisite for high-quality steering vectors; combining stability filtering with subspace projection offers a training-free, transferable approach to reasoning control.
Abstract: Steering vectors offer a training-free mechanism for controlling reasoning behaviors in large language models, but constructing effective vectors requires identifying genuine behavioral signals in the model's hidden states. For behaviors that can be toggled via prompts, this is straightforward. However, many reasoning behaviors -- such as self-reflection -- emerge spontaneously and resist prompt-level control. Current methods detect these behaviors through keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. We show that this assumption is overwhelmingly wrong: across 541 keyword-detected boundaries, 93.3% are behaviorally unstable, failing to reproduce the detected behavior under re-generation from the same prefix. We develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and show that unstable boundaries dilute the steering signal. Guided by this analysis, we propose stability filtering, which retains only boundaries where the model consistently reproduces the target behavior. Combined with a content-subspace projection that removes residual question-specific noise, our method achieves 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline). The resulting steering vectors transfer across models in the same architecture family without re-extraction, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0).
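The stability-filtering idea in the abstract above can be sketched roughly as follows. This is an illustration only, not the authors' implementation: `regenerate` and `shows_behavior` are hypothetical callables standing in for model sampling and keyword detection, and the threshold of 0.75 and the sample count of 8 are assumed values.

```python
from typing import Callable, List, Sequence


def stability_score(prefix: str,
                    regenerate: Callable[[str], str],
                    shows_behavior: Callable[[str], bool],
                    n_samples: int = 8) -> float:
    """Fraction of re-generations from `prefix` that show the target behavior."""
    hits = sum(1 for _ in range(n_samples) if shows_behavior(regenerate(prefix)))
    return hits / n_samples


def filter_stable(prefixes: Sequence[str],
                  regenerate: Callable[[str], str],
                  shows_behavior: Callable[[str], bool],
                  threshold: float = 0.75,   # assumed cut-off, not from the paper
                  n_samples: int = 8) -> List[str]:
    """Keep only the boundaries whose behavior reproduces often enough."""
    return [p for p in prefixes
            if stability_score(p, regenerate, shows_behavior, n_samples) >= threshold]
```

In practice `regenerate` would sample a continuation from the model at a nonzero temperature, and only the surviving prefixes would contribute hidden states to the steering vector.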
Code is available at https://github.com/zhmzm/stability-steering.
[53] GaelEval: Benchmarking LLM Performance for Scottish Gaelic
Peter Devine,William Lamb,Beatrice Alex,Ignatius Ezeani,Dawn Knight,Mícheál J. Ó Meachair,Paul Rayson,Martin Wynne
Main category: cs.CL
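TL;DR: This paper introduces GaelEval, the first multi-dimensional benchmark for Scottish Gaelic, covering morphosyntactic understanding, culturally grounded translation, and cultural knowledge Q&A. Frontier proprietary models (e.g., Gemini 3 Pro Preview) exceed the fluent-speaker baseline on the linguistic task, Gaelic prompting yields a small but stable gain, and open-weight models lag behind overall.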
Details
Motivation: Existing evaluations struggle to measure the true structural competence of multilingual LLMs in officially unsupported, morphosyntactically rich minority languages such as Scottish Gaelic; a dedicated, multi-dimensional, culturally adapted benchmark is needed.
Method: Builds GaelEval, the first multi-dimensional benchmark for Scottish Gaelic, comprising an expert-authored morphosyntactic multiple-choice (MCQA) task, a culturally grounded translation benchmark, and a large-scale cultural knowledge Q&A task; evaluates 19 LLMs against a human baseline of 30 fluent speakers and compares English and Gaelic prompting.
Result: Gemini 3 Pro Preview reaches 83.3% accuracy on the morphosyntactic task, above the human baseline (78.1%); proprietary models consistently outperform open-weight models; Gaelic prompting yields a small but stable gain (+2.4%). On the cultural task, leading models exceed 90% accuracy, but most systems perform worse under Gaelic prompting, and absolute scores are inflated relative to the manual benchmark.
Conclusion: Frontier multilingual models show above-human performance on several dimensions of Gaelic grammar, supporting the existence of 'shadow' capabilities; GaelEval exposes how model capability is distributed and where it falls short in a minority language, and it underlines the need for evaluations that account for linguistic and cultural context and for caution about bias in existing metrics.
Abstract: Multilingual large language models (LLMs) often exhibit emergent 'shadow' capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is particularly acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark and (iii) a large-scale cultural knowledge Q&A task. Evaluating 19 LLMs against a fluent-speaker human baseline (n = 30), we find that Gemini 3 Pro Preview achieves 83.3% accuracy on the linguistic task, surpassing the human baseline (78.1%). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting yields a small but stable advantage (+2.4%). On the cultural task, leading models exceed 90% accuracy, though most systems perform worse under Gaelic prompting and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval reveals that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting and shows a consistent performance gap favouring proprietary over open-weight models.
[54] Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
Xuan Qi
Main category: cs.CL
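TL;DR: This paper systematically studies how chain-of-thought (CoT) reasoning length affects function-calling language agents. Brief reasoning (32 tokens) markedly improves accuracy, while extended reasoning severely degrades it; the analysis attributes the gain to function routing, which motivates FR-CoT, a structured brief-CoT method that keeps accuracy while reducing function hallucination to 0.0%.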
Details
Motivation: Chain-of-thought (CoT) is widely assumed to improve agent performance, but in structured tool-use settings the relationship between reasoning length and accuracy is poorly understood, and function-calling agents in particular lack a systematic budget analysis.
Method: On 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark, sweeps six token budgets (0-512) on Qwen2.5-1.5B-Instruct; uses a three-way error decomposition to analyze failures; runs an oracle analysis to find the optimal reasoning length; and proposes FR-CoT, a structured brief-CoT method that forces commitment to a valid function name at the start of reasoning.
Result: The accuracy curve is non-monotonic: 32-token CoT improves accuracy by 45% relative (44.0% to 64.0%), whereas 256-token CoT drops it to 25.0%. Brief CoT cuts wrong-function selection from 30.5% to 1.5%; long CoT brings it back up to 28.0% and adds 18.0% hallucinated functions. Oracle analysis shows that 88.6% of solvable tasks need at most 32 tokens, with the optimum at 8-16 tokens. FR-CoT matches free-form 32-token CoT in accuracy and reduces function hallucination to 0.0%.
Conclusion: In function-calling agents, the main value of CoT lies in lightweight function routing rather than deep reasoning; over-reasoning undermines routing reliability and induces hallucination; the structured, minimal FR-CoT method offers a stronger reliability guarantee without budget tuning.
Abstract: How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0-512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8-16 tokens.
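As a quick check on the figures quoted above, the "45% relative" gain is the absolute 20-point improvement divided by the 44.0% baseline. The helper below is a minimal sketch with a hypothetical function name; the accuracy values come from the abstract.

```python
def relative_change(baseline: float, new: float) -> float:
    """Relative change of `new` with respect to `baseline` (0.10 means +10%)."""
    return (new - baseline) / baseline


no_cot = 44.0      # accuracy (%) at d = 0, from the abstract
brief_cot = 64.0   # accuracy (%) at d = 32
long_cot = 25.0    # accuracy (%) at d = 256

# (64.0 - 44.0) / 44.0 is about 0.455, which the abstract reports as 45%.
print(f"brief CoT: {relative_change(no_cot, brief_cot):+.1%}")
# (25.0 - 44.0) / 44.0 is about -0.432, a drop well below the no-CoT baseline.
print(f"long CoT:  {relative_change(no_cot, long_cot):+.1%}")
```

The same budget that helps at 32 tokens therefore costs roughly as much, in relative terms, when stretched to 256 tokens.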
Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]," forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.
[55] AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics
Atilla Kaan Alkan,Felix Grezes,Sergi Blanco-Cuaresma,Jennifer Lynn Bartlett,Daniel Chivvis,Anna Kelbert,Kelly Lockhart,Alberto Accomazzi
Main category: cs.CL
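TL;DR: This paper introduces the AstroConcepts corpus for studying extreme class imbalance in multi-label text classification for astrophysics, and proposes frequency-stratified evaluation to expose how methods differ on rare terms.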
Details
Motivation: Multi-label text classification in scientific domains faces extreme class imbalance, with specialized terminology following severe power-law distributions, while existing corpora lack comprehensive controlled vocabularies and so cannot support systematic study of the problem.
Method: Builds the AstroConcepts corpus (abstracts from 21,702 astrophysics papers, labeled with 2,367 Unified Astronomy Thesaurus concepts) and analyzes its label imbalance; runs experiments with traditional models, neural networks, and vocabulary-constrained LLMs; proposes a frequency-stratified evaluation strategy.
Result: Vocabulary-constrained LLMs are competitive with domain-adapted models on astrophysics classification; domain adaptation helps rare terms relatively more, but absolute performance remains limited; frequency-stratified evaluation reveals performance patterns that aggregate metrics hide.
Conclusion: AstroConcepts provides a new benchmark and resource for studying extreme imbalance in scientific NLP; the proposed evaluation puts robustness at the center, and the results offer practical guidance for scientific text classification.
Abstract: Scientific multi-label text classification suffers from extreme class imbalance, where specialized terminology exhibits severe power-law distributions that challenge standard classification approaches. Existing scientific corpora lack comprehensive controlled vocabularies, focusing instead on broad categories and limiting systematic study of extreme imbalance. We introduce AstroConcepts, a corpus of English abstracts from 21,702 published astrophysics papers, labeled with 2,367 concepts from the Unified Astronomy Thesaurus. The corpus exhibits severe label imbalance, with 76% of concepts having fewer than 50 training examples. By releasing this resource, we enable systematic study of extreme class imbalance in scientific domains and establish strong baselines across traditional, neural, and vocabulary-constrained LLM methods. Our evaluation reveals three key patterns that provide new insights into scientific text classification. First, vocabulary-constrained LLMs achieve competitive performance relative to domain-adapted models in astrophysics classification, suggesting a potential for parameter-efficient approaches. Second, domain adaptation yields relatively larger improvements for rare, specialized terminology, although absolute performance remains limited across all methods. Third, we propose frequency-stratified evaluation to reveal performance patterns that are hidden by aggregate scores, thereby making robustness assessment central to scientific multi-label evaluation.
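One way to picture frequency-stratified evaluation is to compute a per-label F1 and then macro-average it separately for rare and frequent labels. The sketch below is an assumption-laden illustration, not the paper's protocol: the two-bucket split, the function names, and the toy labels are hypothetical, and the cut-off of 50 simply mirrors the "fewer than 50 training examples" figure in the abstract.

```python
from collections import defaultdict
from typing import Dict, List, Set


def per_label_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """F1 for every label that occurs in the gold or predicted label sets."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        for label in g & p:
            tp[label] += 1
        for label in p - g:
            fp[label] += 1
        for label in g - p:
            fn[label] += 1
    labels = set(tp) | set(fp) | set(fn)
    # The denominator is never zero: each label was counted at least once above.
    return {lab: 2 * tp[lab] / (2 * tp[lab] + fp[lab] + fn[lab]) for lab in labels}


def stratified_macro_f1(f1: Dict[str, float],
                        train_freq: Dict[str, int],
                        rare_below: int = 50) -> Dict[str, float]:
    """Macro-F1 computed separately for rare and frequent labels."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for label, score in f1.items():
        name = "rare" if train_freq.get(label, 0) < rare_below else "frequent"
        buckets[name].append(score)
    return {name: sum(scores) / len(scores) for name, scores in buckets.items()}
```

Reporting the two buckets side by side shows whether a strong aggregate score is carried by a handful of frequent concepts.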
These results offer actionable insights for scientific NLP and establish benchmarks for research on extreme imbalance.
[56] Do Lexical and Contextual Coreference Resolution Systems Degrade Differently under Mention Noise? An Empirical Study on Scientific Software Mentions
Atilla Kaan Alkan,Felix Grezes,Jennifer Lynn Bartlett,Anna Kelbert,Kelly Lockhart,Alberto Accomazzi
Main category: cs.CL
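TL;DR: This paper reports a submission to the SOMD 2026 shared task on cross-document software mention coreference resolution. Two fine-tuning-free methods, Fuzzy Matching (FM) and Context Aware Representations (CAR), ranked second on all three subtasks; CAR is slightly more accurate, more robust to boundary noise, and scales better, while FM degrades more gracefully under mention substitution.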
Details
Motivation: Software mention coreference resolution is an important but underexplored task, and existing methods often rely on complex fine-tuning; this work explores lightweight, efficient, fine-tuning-free alternatives and analyzes where each applies under different noise levels and corpus sizes.
Method: Compares two fine-tuning-free methods: (1) Fuzzy Matching (FM), lexical matching based on string similarity; and (2) Context Aware Representations (CAR), which combines mention-level and document-level embeddings. The study also includes noise-injection experiments and an inference-time analysis.
Result: Both systems reach a CoNLL F1 of 0.94-0.96, with CAR ahead of FM by about 1 point on the official test set. CAR is more robust to boundary noise (F1 drops by only 0.07 versus 0.20 for FM) but more sensitive to mention substitution. CAR's inference time scales approximately linearly with corpus size, whereas FM's scales superlinearly. The code is released.
Conclusion: For software coreference resolution, simple but carefully designed fine-tuning-free methods such as CAR can be competitive; system selection should consider both the noise profile of the upstream mention detector and the scale of the target corpus.
Abstract: We present our participation in the SOMD 2026 shared task on cross-document software mention coreference resolution, where our systems ranked second across all three subtasks. We compare two fine-tuning-free approaches: Fuzzy Matching (FM), a lexical string-similarity method, and Context Aware Representations (CAR), which combines mention-level and document-level embeddings. Both achieve competitive performance across all subtasks (CoNLL F1 of 0.94-0.96), with CAR consistently outperforming FM by 1 point on the official test set, consistent with the high surface regularity of software names, which reduces the need for complex semantic reasoning. A controlled noise-injection study reveals complementary failure modes: as boundary noise increases, CAR loses only 0.07 F1 points from clean to fully corrupted input, compared to 0.20 for FM, whereas under mention substitution, FM degrades more gracefully (0.52 vs. 0.63). Our inference-time analysis shows that FM scales superlinearly with corpus size, whereas CAR scales approximately linearly, making CAR the more efficient choice at large scale. These findings suggest that system selection should be informed by both the noise profile of the upstream mention detector and the scale of the target corpus. We release our code to support future work on this underexplored task.
[57] Adam's Law: Textual Frequency Law on Large Language Models
Hongyuan Adam Lu,Z. L.,Victor Wei,Zefan Zhang,Zhao Hong,Qiqi Xiang,Bowen Cao,Wai Lam
Main category: cs.CL
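TL;DR: This paper proposes the Textual Frequency Law (TFL) and an accompanying framework (including TFD and CTFT). It argues for preferring high-frequency text in both prompting and fine-tuning, estimates frequency from online resources and refines it via paraphrasing and distillation, and fine-tunes in a curriculum ordered by increasing frequency; the framework is validated on several tasks.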
Details
Motivation: Textual frequency is known to affect human reading speed, but its effect on large language models (LLMs) has received little study; moreover, LLM training data is often undisclosed, so a practical way to model frequency is needed.
Method: Proposes the Textual Frequency Law (TFL); estimates sentence-level frequency from online resources and uses an input paraphraser to produce more frequent expressions; designs Textual Frequency Distillation (TFD), which extends the corpora to adjust the frequency estimate; builds Curriculum Textual Frequency Training (CTFT), which fine-tunes LLMs in increasing order of frequency; and runs experiments on the curated Textual Frequency Paired Dataset (TFPD).
Result: The framework is reported to be effective on four tasks: math reasoning, machine translation, commonsense reasoning, and agentic tool calling.
Conclusion: Textual frequency is an important factor in LLM performance; TFL and its accompanying methods (TFD, CTFT) offer a frequency-driven approach to model optimization.
Abstract: While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in terms of textual data frequency, which is an understudied topic, to the best of our knowledge. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs are closed-source in their training data, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. Results show the effectiveness of our framework.
[58] The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
Jeremy Herbst,Jae Hee Lee,Stefan Wermter
Main category: cs.CL
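TL;DR: This paper studies whether Mixture-of-Experts (MoE) architectures are more interpretable than dense feed-forward networks (FFNs). It finds that MoE sparsity pushes neurons and experts toward monosemanticity, making models more interpretable at the expert level; k-sparse probing and automatic interpretation of hundreds of experts indicate that experts are fine-grained linguistic or semantic task specialists rather than broad domain specialists or simple token-level processors.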
Details
Motivation: To investigate whether the sparsity of MoE architectures makes them inherently easier to interpret than dense FFNs.
Method: Uses k-sparse probing to compare the polysemanticity of neurons in MoE experts and dense FFNs, and carries out large-scale automatic interpretation at the expert level.
Result: Expert neurons are consistently less polysemantic than dense FFN neurons, and the gap widens as routing becomes sparser; experts behave as fine-grained linguistic or semantic task specialists (e.g., closing brackets in LaTeX) rather than broad domain specialists or token-level processors.
Conclusion: MoE architectures are inherently interpretable at the expert level, which offers a clearer path toward large-scale model interpretability.
Abstract: Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis
[59] Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model
Jaemin Kim,Jae O Lee,Sumyeong Ahn,Seo Yeon Park
Main category: cs.CL
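TL;DR: This paper proposes Neuro-RIT, which uses neuron-level attribution to separate neurons that process relevant versus irrelevant context and applies two-stage instruction tuning for noise suppression and evidence distillation, improving the robustness of retrieval-augmented language models under noisy retrieval.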
Details
Motivation: Retrieval-augmented language models (RALMs) degrade when given irrelevant or noisy retrieved context, and mainstream robustness methods apply coarse-grained parameter updates at the module or layer level, overlooking the neuron-level sparsity inherent in LLMs.
Method: Proposes Neuro-RIT: attribution-based neuron mining first disentangles the neurons responsible for processing relevant versus irrelevant context; a two-stage instruction tuning strategy then functionally deactivates neurons exclusive to irrelevant context to suppress noise directly, and optimizes targeted layers to strengthen evidence distillation.
Result: Across several QA benchmarks, Neuro-RIT consistently outperforms strong baselines and other robustness-enhancing methods.
Conclusion: Fine-grained neuron-level control is more effective than coarse-grained adaptation, and Neuro-RIT offers a new approach to making RALMs robust in noisy settings.
Abstract: Retrieval-Augmented Language Models (RALMs) have demonstrated significant potential in knowledge-intensive tasks; however, they remain vulnerable to performance degradation when presented with irrelevant or noisy retrieved contexts. Existing approaches to enhance robustness typically operate via coarse-grained parameter updates at the layer or module level, often overlooking the inherent neuron-level sparsity of Large Language Models (LLMs). To address this limitation, we propose Neuro-RIT (Neuron-guided Robust Instruction Tuning), a novel framework that shifts the paradigm from dense adaptation to precision-driven neuron alignment. Our method explicitly disentangles neurons that are responsible for processing relevant versus irrelevant contexts using attribution-based neuron mining. Subsequently, we introduce a two-stage instruction tuning strategy that enforces a dual capability for noise robustness: achieving direct noise suppression by functionally deactivating neurons exclusive to irrelevant contexts, while simultaneously optimizing targeted layers for evidence distillation. Extensive experiments across diverse QA benchmarks demonstrate that Neuro-RIT consistently outperforms strong baselines and robustness-enhancing methods.
[60] Towards Position-Robust Talent Recommendation via Large Language Models
Silin Du,Hongyan Liu
Main category: cs.CL
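TL;DR: This paper proposes L3TR, a framework that uses block attention, local positional encoding, and ID sampling to address the high token consumption, position bias, and lost-in-the-middle problems that the pointwise paradigm causes in LLM-based talent recommendation, enabling more efficient and accurate listwise recommendation.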
Details
Motivation: Existing LLM-based talent recommendation systems mostly follow a pointwise paradigm, which consumes many tokens, fails to model relationships among candidates, and suffers from position bias and the lost-in-the-middle problem, limiting recommendation quality.
Method: Proposes L3TR, a listwise talent recommendation framework with a block attention mechanism and local positional encoding to mitigate position bias and concurrent token bias, plus an ID sampling method to resolve the mismatch between candidate set sizes in training and inference; also designs training-free debiasing methods and evaluation methods for detecting bias.
Result: Extensive experiments on two real-world datasets show consistent improvements over existing baselines, with reduced position and token bias, lower token consumption, and better recommendation quality.
Conclusion: L3TR is an efficient and robust listwise LLM framework for talent recommendation; by modeling candidate relationships structurally and correcting specific biases, it improves both performance and practicality.
Abstract: Talent recruitment is a critical, yet costly process for many industries, with high recruitment costs and long hiring cycles. Existing talent recommendation systems increasingly adopt large language models (LLMs) due to their remarkable language understanding capabilities. However, most prior approaches follow a pointwise paradigm, which requires LLMs to repeatedly process some text and fails to capture the relationships among candidates in the list, resulting in higher token consumption and suboptimal recommendations. Besides, LLMs exhibit position bias and the lost-in-the-middle issue when answering multiple-choice questions and processing multiple long documents. To address these issues, we introduce an implicit strategy to utilize LLM's potential output for the recommendation task and propose L3TR, a novel framework for listwise talent recommendation with LLMs. In this framework, we propose a block attention mechanism and a local positional encoding method to enhance inter-document processing and mitigate the position bias and concurrent token bias issues. We also introduce an ID sampling method for resolving the inconsistency between candidate set sizes in the training phase and the inference phase. We design evaluation methods to detect position bias and token bias and training-free debiasing methods. Extensive experiments on two real-world datasets validated the effectiveness of L3TR, showing consistent improvements over existing baselines.
[61] CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech
Youssef Saidi,Haroun Elleuch,Fethi Bougares
Main category: cs.CL
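TL;DR: This paper presents CV-18 NER, the first publicly available dataset for end-to-end named entity recognition (NER) from Arabic speech, and uses it to compare end-to-end models (Whisper, AraBEST-RQ) with cascaded systems (ASR + text NER). End-to-end methods perform substantially better; the paper also analyzes how pretraining strategy and model size affect low-resource Arabic speech NER.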
Details
Motivation: Because of Arabic's morphological complexity, the absence of short vowels, and scarce annotated resources, end-to-end NER from Arabic speech has remained under-explored; a high-quality dataset and benchmark are needed.
Method: Builds CV-18 NER, the first Arabic speech NER dataset, by manually annotating Common Voice 18 with the fine-grained Wojood schema (21 entity types); compares end-to-end models (Whisper-medium, AraBEST-RQ 300M) with cascaded ASR + text NER systems; and analyzes how Arabic-specific self-supervised pretraining and multilingual weak supervision affect joint speech-to-entity modeling.
Result: End-to-end models substantially outperform the best cascaded system: AraBEST-RQ 300M reaches 37.0% CoER and Whisper-medium reaches 38.0% CVER. Arabic-specific self-supervised pretraining benefits ASR, whereas multilingual weak supervision transfers better to end-to-end speech NER; larger models may be harder to adapt in this low-resource setting.
Conclusion: End-to-end approaches suit Arabic speech NER better than cascades; CV-18 NER fills a gap by providing the first open benchmark for the task; the choice of pretraining strategy matters more than model size.
Abstract: End-to-end speech Named Entity Recognition (NER) aims to directly extract entities from speech. Prior work has shown that end-to-end (E2E) approaches can outperform cascaded pipelines for English, French, and Chinese, but Arabic remains under-explored due to its morphological complexity, the absence of short vowels, and limited annotated resources. We introduce CV-18 NER, the first publicly available dataset for NER from Arabic speech, created by augmenting the Arabic Common Voice 18 corpus with manual NER annotations following the fine-grained Wojood schema (21 entity types). We benchmark both pipeline systems (ASR + text NER) and E2E models based on Whisper and AraBEST-RQ. E2E systems substantially outperform the best pipeline configuration on the test set, reaching 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium). Further analysis shows that Arabic-specific self-supervised pretraining yields strong ASR performance, while multilingual weak supervision transfers more effectively to joint speech-to-entity learning, and that larger models may be harder to adapt in this low-resource setting. Our dataset and models are publicly released, providing the first open benchmark for end-to-end named entity recognition from Arabic speech: https://huggingface.co/datasets/Elyadata/CV18-NER.
[62] No Single Best Model for Diversity: Learning a Router for Sample Diversity
Yuhan Liu,Fangyuan Xu,Vishakh Padmakumar,Daphne Ippolito,Eunsol Choi
Main category: cs.CL
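TL;DR: This paper studies how to use multiple large language models (LLMs) to generate diverse, comprehensive answer sets. It proposes the diversity coverage metric and a router that selects the best model for each open-ended query, improving on the best single model.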
Details
Motivation: When a prompt permits many valid answers, generating them comprehensively is needed to satisfy a wide range of users; no single model reliably produces highly diverse answers across all open-ended prompts, so a mechanism for dynamically choosing the best model is needed.
Method: Proposes diversity coverage as an evaluation metric and evaluates 18 LLMs on open-ended prompts; motivated by the finding that, for each prompt, some model significantly outperforms all others, builds a query-level model router and trains and validates it on datasets including NB-Wildchat.
Result: The router reaches 26.3% diversity coverage on NB-Wildchat, above the best single-model baseline (23.8%), and generalizes to NB-Curated and to different prompting strategies.
Conclusion: For comprehensive answer generation, the single-model paradigm should give way to multi-model collaboration with query-adaptive routing; this work lays groundwork for multi-model ensembles and for evaluating open-ended generation.
Abstract: When posed with prompts that permit a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce diversity coverage, a metric that measures the total quality scores assigned to each unique answer in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, for each prompt, there exists a model that outperforms all other models significantly at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-Wildchat, our trained router outperforms the single best model baseline (26.3% vs. 23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as different answer-generation prompting strategies. Our work lays the foundation for studying generating comprehensive answers when we have access to a suite of models.
[63] Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
Daiwei Chen,Zhoutong Fu,Chengming Jiang,Haichao Zhang,Ran Zhou,Tan Wang,Chunnan Yao,Guoyao Li,Rui Cai,Yihan Cao,Ruijie Jiang,Fedor Borisyuk,Jianqiang Shen,Jingwei Wu,Ramya Korlakai Vinayak
Main category: cs.CL
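TL;DR: This paper finds that mean initialization of new vocabulary tokens in language models collapses their embeddings into a degenerate subspace and erases the distinctions between them. It proposes GTI, a linguistically grounded initialization that, before fine-tuning, uses paired linguistic supervision to map new tokens to meaningful locations in the pretrained embedding space, improving generative recommendation performance.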
Details
Motivation: The standard mean-initialization strategy collapses new token embeddings and erases their distinctions, and subsequent fine-tuning struggles to recover them fully, which suggests that token initialization is a key bottleneck when extending a language model's vocabulary.
Method: Proposes the Grounded Token Initialization Hypothesis and a lightweight method, GTI (Grounded Token Initialization): before supervised fine-tuning, new tokens are mapped to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision.
Result: GTI outperforms mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks (industry-scale and public); analysis shows that its embeddings have richer inter-token structure, which persists through fine-tuning.
Conclusion: Token initialization quality is a key bottleneck in vocabulary extension; grounding new tokens linguistically before fine-tuning, as GTI does, lets the model draw on its pretrained knowledge more effectively and improves downstream performance.
Abstract: Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets.
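The collapse described above is easy to see in a toy setting: under mean initialization every new token receives the same vector, so nothing separates them before fine-tuning starts. The sketch below contrasts that with a deliberately simple stand-in for grounding, which averages the embeddings of a few descriptive words per token. This stand-in is an assumption for illustration and is not the GTI procedure, which the abstract says uses paired linguistic supervision; all names and vectors here are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def mean_init(vocab: Dict[str, Vector], new_tokens: List[str]) -> Dict[str, Vector]:
    """Standard practice: every new token starts at the vocabulary mean."""
    center = mean_vector(list(vocab.values()))
    return {token: list(center) for token in new_tokens}


def description_init(vocab: Dict[str, Vector],
                     descriptions: Dict[str, List[str]]) -> Dict[str, Vector]:
    """Toy grounding: average the embeddings of words describing each token."""
    return {token: mean_vector([vocab[word] for word in words])
            for token, words in descriptions.items()}


# Hypothetical 3-dimensional embeddings for a four-word vocabulary.
vocab = {"red": [1.0, 0.0, 0.0], "shoe": [0.0, 1.0, 0.0],
         "blue": [0.0, 0.0, 1.0], "hat": [1.0, 1.0, 0.0]}
collapsed = mean_init(vocab, ["<item_1>", "<item_2>"])
grounded = description_init(vocab, {"<item_1>": ["red", "shoe"],
                                    "<item_2>": ["blue", "hat"]})
print(collapsed["<item_1>"] == collapsed["<item_2>"])  # identical starting points
print(grounded["<item_1>"] == grounded["<item_2>"])    # distinct starting points
```

Even this crude variant gives the optimizer distinct starting points, which is the property the spectral and geometric diagnostics in the abstract are concerned with.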
Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
cs.CV [Back]
[64] DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code Generation
Xinhao Huang,Jinke Yu,Wenhao Xu,Zeyi Wen,Ying Zhou,Junzhuo Liu,Junhao Ji,Zulong Chen
Main category: cs.CV
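TL;DR: This paper proposes DOne, a framework that decouples structure understanding from element rendering to address the holistic bottleneck of vision language models in design-to-code generation, and introduces the HiFi2Code benchmark for evaluation.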
Details
Motivation: Vision Language Models (VLMs) suffer from a "holistic bottleneck" in design-to-code generation: they struggle to reconcile high-level structural hierarchy with fine-grained visual detail, which leads to layout distortions or generic placeholders.
Method: Proposes DOne, an end-to-end framework with (1) a learned layout segmentation module that decomposes complex designs; (2) a specialized hybrid element retriever that handles the extreme aspect ratios and densities of UI components; and (3) a schema-guided generation paradigm that bridges layout and code. Also builds HiFi2Code, a high-complexity benchmark.
Result: On HiFi2Code, DOne outperforms existing methods in both high-level visual similarity (e.g., over 10% in GPT Score) and fine-grained element alignment; human evaluation shows a threefold productivity gain with higher visual fidelity.
Conclusion: DOne eases the difficulty VLMs have in reconciling structure and detail in design-to-code tasks, improving generation quality and practical usefulness.
Abstract: While Vision Language Models (VLMs) have shown promise in Design-to-Code generation, they suffer from a "holistic bottleneck": failing to reconcile high-level structural hierarchy with fine-grained visual details, often resulting in layout distortions or generic placeholders. To bridge this gap, we propose DOne, an end-to-end framework that decouples structure understanding from element rendering. DOne introduces (1) a learned layout segmentation module to decompose complex designs, avoiding the limitations of heuristic cropping; (2) a specialized hybrid element retriever to handle the extreme aspect ratios and densities of UI components; and (3) a schema-guided generation paradigm that bridges layout and code. To rigorously assess performance, we introduce HiFi2Code, a benchmark featuring significantly higher layout complexity than existing datasets. Extensive evaluations on HiFi2Code demonstrate that DOne outperforms existing methods in both high-level visual similarity (e.g., over 10% in GPT Score) and fine-grained element alignment. Human evaluations confirm a 3 times productivity gain with higher visual fidelity.
[65] CLPIPS: A Personalized Metric for AI-Generated Image Similarity
Khoi Trinh,Jay Rothenberger,Scott Seidenberger,Dimitrios Diochnos,Anindya Maiti
Main category: cs.CV
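TL;DR: This paper proposes CLPIPS, a lightweight, human-feedback-driven customization of the LPIPS image similarity metric. It fine-tunes only the LPIPS layer-combination weights to improve agreement with human perceptual judgments, strengthening perceptual alignment in human-in-the-loop text-to-image workflows.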
Details
Motivation: Existing image similarity metrics such as LPIPS and CLIP are objective but often disagree with human judgments, especially in context-specific or user-driven tasks; a customizable, lightweight metric that adapts to human perception is needed.
Method: Proposes Customized Learned Perceptual Image Patch Similarity (CLPIPS): using human rankings of generated image pairs, it fine-tunes only the LPIPS layer-combination weights with a margin ranking loss.
Result: CLPIPS achieves higher Spearman rank correlation and Intraclass Correlation Coefficient (ICC) with human rankings than baseline LPIPS, showing that a small amount of human annotation can improve perceptual alignment.
Conclusion: Lightweight, human-augmented fine-tuning can meaningfully improve how well a similarity metric aligns with human perception; CLPIPS can serve as an adaptive, customizable feedback component in human-in-the-loop text-to-image workflows.
Abstract: Iterative prompt refinement is central to reproducing target images with text-to-image generative models. Previous studies have incorporated image similarity metrics (ISMs) as additional feedback to human users. Existing ISMs such as LPIPS and CLIP provide objective measures of image likeness but often fail to align with human judgments, particularly in context-specific or user-driven tasks. In this paper, we introduce Customized Learned Perceptual Image Patch Similarity (CLPIPS), a customized extension of LPIPS that adapts a metric's notion of similarity directly to human judgments. We aim to explore whether lightweight, human-augmented fine-tuning can meaningfully improve perceptual alignment, positioning similarity metrics as adaptive components for human-in-the-loop workflows with text-to-image tools. We evaluate CLPIPS on a human subject dataset in which participants iteratively regenerate target images and rank generated outputs by perceived similarity. Using margin ranking loss on human-ranked image pairs, we fine-tune only the LPIPS layer combination weights and assess alignment via Spearman rank correlation and Intraclass Correlation Coefficient. Our results show that CLPIPS achieves stronger correlation and agreement with human judgments than baseline LPIPS.
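To make the training signal concrete, the sketch below treats the metric as a weighted sum of per-layer distances and applies a margin ranking loss to one human-ranked pair. It is a simplified illustration under stated assumptions, not the authors' code: the per-layer distances are given as plain lists rather than computed by a network, and the margin, learning rate, and clipping of weights to non-negative values are assumed choices.

```python
from typing import List


def weighted_distance(weights: List[float], layer_dists: List[float]) -> float:
    """Metric value as a weighted sum of per-layer feature distances."""
    return sum(w * d for w, d in zip(weights, layer_dists))


def margin_ranking_loss(weights: List[float], closer: List[float],
                        farther: List[float], margin: float = 0.1) -> float:
    """Zero once the human-preferred image is nearer by at least `margin`."""
    gap = weighted_distance(weights, closer) - weighted_distance(weights, farther)
    return max(0.0, gap + margin)


def sgd_step(weights: List[float], closer: List[float], farther: List[float],
             margin: float = 0.1, lr: float = 0.5) -> List[float]:
    """One subgradient step on the layer weights only."""
    if margin_ranking_loss(weights, closer, farther, margin) == 0.0:
        return list(weights)  # ranking already satisfied, no update
    grad = [c - f for c, f in zip(closer, farther)]
    # Clipping at zero keeps the weights non-negative (an assumption of this sketch).
    return [max(0.0, w - lr * g) for w, g in zip(weights, grad)]
```

With two layers, starting weights of `[1.0, 1.0]`, and a pair where the human-preferred image has layer distances `[0.9, 0.1]` against `[0.2, 0.6]`, a single step shifts weight toward the second layer, after which the metric agrees with the human ranking.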
Rather than optimizing absolute metric performance, our work emphasizes improving alignment consistency between metric predictions and human ranks, demonstrating that even limited human-specific fine-tuning can meaningfully enhance perceptual alignment in human-in-the-loop text-to-image workflows.
[66] Camouflage-aware Image-Text Retrieval via Expert Collaboration
Yao Jiang,Zhongkuan Mao,Xuan Wu,Keren Fu,Qijun Zhao
Main category: cs.CV
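TL;DR: This paper formulates a new task, camouflage-aware image-text retrieval (CA-ITR), builds CamoIT, the first dedicated dataset, and designs a camouflage-expert collaborative network (CECNet) to improve cross-modal alignment, achieving roughly a 29% accuracy boost on CA-ITR.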
Details
Motivation: In camouflaged scene understanding (CSU), robust image-text cross-modal alignment remains under-explored, which hinders deeper understanding of camouflaged scenes and related applications.
Method: Builds the CamoIT dataset (about 10.5K samples with multi-granularity textual annotations); proposes CECNet, whose dual-branch visual encoder has one branch for holistic image representations and another that injects representations of camouflaged objects; and introduces a confidence-conditioned graph attention (C²GA) mechanism to exploit the complementarity of the two branches.
Result: On the CamoIT benchmark, CECNet achieves roughly a 29% overall boost in CA-ITR accuracy and surpasses seven representative retrieval models.
Conclusion: CA-ITR is a challenging new task that merits further study; by explicitly modeling camouflage properties and combining the two branches, CECNet improves image-text retrieval; the dataset and code will be released.
Abstract: Camouflaged scene understanding (CSU) has attracted significant attention due to its broad practical implications. However, in this field, robust image-text cross-modal alignment remains under-explored, hindering deeper understanding of camouflaged scenarios and their related applications. To this end, we focus on the typical image-text retrieval task, and formulate a new task dubbed "camouflage-aware image-text retrieval" (CA-ITR). We first construct a dedicated camouflage image-text retrieval dataset (CamoIT), comprising ~10.5K samples with multi-granularity textual annotations. Benchmark results conducted on CamoIT reveal the underlying challenges of CA-ITR for existing cutting-edge retrieval techniques, which are mainly caused by objects' camouflage properties as well as those complex image contents. As a solution, we propose a camouflage-expert collaborative network (CECNet), which features a dual-branch visual encoder: one branch captures holistic image representations, while the other incorporates a dedicated model to inject representations of camouflaged objects. A novel confidence-conditioned graph attention (C²GA) mechanism is incorporated to exploit the complementarity across branches. Comparative experiments show that CECNet achieves ~29% overall CA-ITR accuracy boost, surpassing seven representative retrieval models. The dataset and code will be available at https://github.com/jiangyao-scu/CA-ITR.
[67] Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models
Marco Morini,Sara Sarto,Marcella Cornia,Lorenzo Baraldi
Main category: cs.CV
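TL;DR: This paper proposes Look Twice (LoT), a training-free inference-time framework that improves the ability of multimodal large language models (MLLMs) to reason jointly over visual input and external knowledge; it uses attention patterns to identify and highlight key visual regions and textual evidence, markedly improving knowledge-intensive visual question answering.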
Details
Motivation: When answering questions that require combining images with external knowledge, existing MLLMs struggle to identify the most relevant visual cues and retrieved textual evidence, especially when the text is noisy or only partially relevant and when fine-grained visual localization is needed.
Method: Propose LoT, a training-free inference-time framework that uses the attention patterns of pretrained MLLMs to estimate query-relevant visual regions and text spans, then highlights this evidence with lightweight prompt-level markers so that the model re-attends to key information while generating the answer.
Result: Consistently outperforms zero-shot MLLMs on multiple knowledge-based VQA benchmarks; on vision-centric and hallucination-oriented benchmarks, highlighting visual evidence alone also improves performance, with no additional training or architectural changes.
Conclusion: LoT is a general, efficient, plug-and-play inference-time enhancement that markedly improves the accuracy and robustness of MLLMs on multimodal evidence integration tasks.
Abstract: Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model's attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.
[68] Sparse Spectral LoRA: Routed Experts for Medical VLMs
Omid Nejati Manzari,Hojat Asgariandehkordi,Taha Koleilat,Yiming Xiao,Hassan Rivaz
Main category: cs.CV
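TL;DR: This paper proposes MedQwen, a parameter-efficient medical vision-language model that uses a spectrally routed Mixture-of-Experts (MoE) structure and a theoretically grounded scaling rule to address cross-dataset interference, data distribution shift, and catastrophic forgetting in continual learning for medical imaging, showing strong and robust performance across 23 medical datasets.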
Details
Motivation: Large vision-language models perform well on general benchmarks but lack robustness in medical imaging, mainly because heterogeneous supervision causes cross-dataset interference and sensitivity to the data regime, and because data and tasks arrive as a stream in clinical practice, which leads to catastrophic forgetting.
Method: Propose MedQwen, which adopts a spectrally routed Mixture-of-Experts (MoE) architecture with a theoretically grounded scaling rule that aligns low-rank updates with a full-rank, fully fine-tuned MoE; each expert is initialized from non-overlapping SVD segments of the pretrained weights, and a residual compensation and scaling scheme keeps expert specialization and routing stable.
Result: Across 23 medical datasets covering visual question answering, report generation, radiology classification, and hallucination mitigation, MedQwen approaches full fine-tuning on zero-shot classification with only 1/339 of the trainable parameters; forgetting in sequential learning falls to about 5%, compared with the 20-50% degradation of baseline models.
Conclusion: Through a parameter-efficient, theoretically grounded MoE design, MedQwen improves the robustness and practicality of medical VLMs under distribution shift and continual learning, offering a feasible route to clinical deployment.
Abstract: Large vision-language models (VLMs) excel on general benchmarks but often lack robustness in medical imaging, where heterogeneous supervision induces cross-dataset interference and sensitivity to data regime (i.e., how the supervisory signals are mixed). In realistic clinical workflows, data and tasks arrive sequentially, so naive continual training further leads to catastrophic forgetting. To address these challenges, we propose MedQwen, a parameter-efficient medical VLM that couples a spectrally routed Mixture-of-Experts (MoE) with a theoretically grounded scaling rule that aligns low-rank updates with a full-rank, fully fine-tuned MoE, without changing the base architecture. Concretely, we initialize each expert from non-overlapping singular value decomposition (SVD) segments of the pretrained weight and introduce a residual compensation and scaling scheme to enable stable expert specialization and consistent routing under distribution shift. Across 23 medical datasets covering visual question answering, report generation, radiology classification, and hallucination mitigation, MedQwen achieves strong, reliable performance: it approaches full fine-tuning on zero-shot classification with 339× fewer trainable parameters, and reduces sequential forgetting to ~5% where strong baselines degrade by >20-50%.
[69] ViTs for Action Classification in Videos: An Approach to Risky Tackle Detection in American Football Practice Videos
Syed Ahsan Masud Zaidi,William Hsu,Scott Dietrich
Main category: cs.CV
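TL;DR: This paper proposes a vision transformer-based video analysis method for detecting risky tackles in American football practice videos and builds a larger dataset of 733 annotated clips, markedly improving detection of rare but safety-critical events.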
Details
Motivation: Early identification of hazardous actions in contact sports enables timely intervention and improves athlete safety.
Method: Use a vision transformer-based model with a training strategy for class imbalance to detect risky tackles on a newly built, larger dataset (733 single-athlete-dummy tackle clips with SATT-3 strike-zone annotations).
Result: Under cross-validation, risky recall is 0.67 and Risky F1 is 0.59; relative to the earlier baseline on a smaller dataset (recall 0.58, F1 0.56), recall improves by more than 8 percentage points.
Conclusion: Vision transformers combined with imbalance-aware learning can reliably detect rare but safety-critical tackling patterns, offering a practical path toward coach-centered injury prevention tools.
Abstract: Early identification of hazardous actions in contact sports enables timely intervention and improves player safety. We present a method for detecting risky tackles in American football practice videos and introduce a substantially expanded dataset for this task. Our work contains 733 single-athlete-dummy tackle clips, each temporally localized around the first point of contact and labeled with a strike zone component of the standardized Assessment for Tackling Technique (SATT-3), extending prior work that reported 178 annotated videos. Using a vision transformer-based model with imbalance-aware training, we obtain risky recall of 0.67 and Risky F1 of 0.59 under cross-validation. Relative to the previous baseline on a smaller subset (risky recall of 0.58; Risky F1 of 0.56), our approach improves risky recall by more than 8 percentage points on a much larger dataset. These results indicate that vision transformer-based video analysis, coupled with careful handling of class imbalance, can reliably detect rare but safety-critical tackling patterns, offering a practical pathway toward coach-centered injury prevention tools.
[70] Human Pose Estimation in Trampoline Gymnastics: Improving Performance Using a New Synthetic Dataset
Léa Drolet-Roy,Victor Nogues,Sylvain Gaudet,Eve Charbonneau,Mickaël Begon,Lama Séoud
Main category: cs.CV
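TL;DR: This paper proposes fine-tuning a ViTPose model on a synthetic dataset (STP) to improve the accuracy of extreme pose estimation in trampoline gymnastics, markedly improving both 2D and 3D pose estimation.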
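As a reading aid for the 3D result reported in this entry, the sketch below shows the usual definition of mean per-joint position error (MPJPE) and the arithmetic implied by a 12.5 mm reduction that corresponds to a 19.6% improvement. The toy joints and the function name `mpjpe` are illustrative assumptions, and the implied baseline is only approximate because the reported percentage is rounded; none of this is taken from the authors' code.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance over joints."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of joints")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy example (hypothetical joints, in mm): one joint is exact, one is 5 mm off.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
toy_error = mpjpe(pred, gt)  # (0 + 5) / 2

# Arithmetic implied by the reported numbers (approximate, since 19.6% is rounded).
reduction_mm = 12.5
relative_improvement = 0.196
implied_baseline_mm = reduction_mm / relative_improvement
implied_finetuned_mm = implied_baseline_mm - reduction_mm
```

Under these assumptions the pretrained model's MPJPE would be roughly 64 mm and the fine-tuned model's roughly 51 mm.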
Details
Motivation: Existing pose estimation models perform poorly in settings such as trampoline gymnastics, which involve extreme human poses and uncommon viewpoints.
Method: Build the synthetic trampoline pose dataset (STP) from motion capture data by fitting noisy motion capture recordings to a parametric human model and rendering realistic multi-view images; fine-tune a ViTPose model on this dataset and test it on real multi-view trampoline images.
Result: State-of-the-art 2D pose estimation on this challenging data; 3D triangulation MPJPE reduced by 12.5 mm (a 19.6% relative improvement).
Conclusion: Fine-tuning on high-quality synthetic data effectively narrows the performance gap between common and extreme pose estimation.
Abstract: Trampoline gymnastics involves extreme human poses and uncommon viewpoints, on which state-of-the-art pose estimation models tend to under-perform. We demonstrate that this problem can be addressed by fine-tuning a pose estimation model on a dataset of synthetic trampoline poses (STP). STP is generated from motion capture recordings of trampoline routines. We develop a pipeline to fit noisy motion capture data to a parametric human model, then generate realistic multi-view images. We use this data to fine-tune a ViTPose model, and test it on real multi-view trampoline images. The resulting model exhibits accuracy improvements in 2D, which translate to improved 3D triangulation. In 2D, we obtain state-of-the-art results on such challenging data, bridging the performance gap between common and extreme poses. In 3D, we reduce the MPJPE by 12.5 mm with our best model, which represents an improvement of 19.6% compared to the pretrained ViTPose model.
[71] Regularizing Attention Scores with Bootstrapping
Neo Christopher Chung,Maxim Laletin
Main category: cs.CV
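TL;DR: This paper proposes a statistically guided attention regularization method (Attention Regularization) that uses bootstrapping to build a baseline distribution for attention scores in ViTs, from which it estimates their significance and posterior probability, removing spurious attention caused by noise and improving the sparsity and interpretability of attention maps.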
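The general idea of a bootstrap baseline for attention scores can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy single-query attention, the choice to resample feature values from a pooled set, the pooled empirical p-value, and the hard threshold `alpha` are all hypothetical, and the posterior probabilities mentioned in the entry are not computed here.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(query, keys):
    """Scaled dot-product attention of one query over a list of key vectors."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(logits)


def bootstrap_baseline(query, keys, n_boot, rng):
    """Pool attention scores computed on keys rebuilt from resampled feature values.

    Resampling with replacement from the pooled values breaks token structure,
    so the resulting scores approximate what noise alone would produce.
    """
    pool = [value for key in keys for value in key]
    dim = len(keys[0])
    baseline = []
    for _ in range(n_boot):
        resampled = [[rng.choice(pool) for _ in range(dim)] for _ in keys]
        baseline.extend(attention_scores(query, resampled))
    return baseline


def empirical_p_values(observed, baseline):
    """Share of baseline scores at least as large as each observed score."""
    n = len(baseline)
    return [(1 + sum(b >= s for b in baseline)) / (1 + n) for s in observed]


def regularize(observed, p_values, alpha=0.05):
    """Keep only scores that are significant against the baseline."""
    return [s if p <= alpha else 0.0 for s, p in zip(observed, p_values)]


# Toy data: the third key is aligned with the query, the rest are small noise.
query = [2.0, 2.0, 2.0, 2.0]
keys = [
    [0.1, -0.2, 0.0, 0.1],
    [-0.1, 0.1, 0.2, -0.2],
    [3.0, 3.0, 3.0, 3.0],
    [0.0, 0.1, -0.1, 0.0],
    [0.2, -0.1, 0.1, -0.1],
    [-0.2, 0.0, 0.1, 0.1],
]
rng = random.Random(0)
observed = attention_scores(query, keys)
baseline = bootstrap_baseline(query, keys, n_boot=200, rng=rng)
p_values = empirical_p_values(observed, baseline)
regularized = regularize(observed, p_values)
```

In this toy setting the noise keys receive small but non-zero softmax scores, which the baseline comparison sets to zero, while the aligned key survives. A softer shrinkage rule could replace the hard threshold.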
Details
Motivation: ViT attention scores are usually non-zero but noisy, which makes attention maps diffuse and hard to interpret; their uncertainty needs to be quantified and the scores regularized.
Method: Model attention scores as statistics subject to independent noise; resample input features via bootstrapping to generate a baseline distribution of attention scores, then compute significance and posterior probabilities from it to regularize the scores statistically.
Result: Markedly improves the sparsity and shrinkage of attention maps on natural and medical images; quantitative experiments on simulated and real data confirm that the method filters out spurious attention caused by noise.
Conclusion: Bootstrapping is a practical and effective attention regularization tool that makes ViT attention more reliable and interpretable as a means of model explanation.
Abstract: Vision transformers (ViT) rely on an attention mechanism to weigh input features, and therefore attention scores have naturally been considered as explanations for their decision-making process. However, attention scores are almost always non-zero, resulting in noisy and diffused attention maps and limiting interpretability. Can we quantify uncertainty measures of attention scores and obtain regularized attention scores? To this end, we consider attention scores of ViT in a statistical framework where independent noise would lead to insignificant yet non-zero scores. Leveraging statistical learning techniques, we introduce bootstrapping for attention scores, which generates a baseline distribution of attention scores by resampling input features. Such a bootstrap distribution is then used to estimate significances and posterior probabilities of attention scores. In natural and medical images, the proposed Attention Regularization approach demonstrates a straightforward removal of spurious attention arising from noise, drastically improving shrinkage and sparsity. Quantitative evaluations are conducted using both simulation and real-world datasets. Our study highlights bootstrapping as a practical regularization tool when using attention scores as explanations for ViT. Code available: https://github.com/ncchung/AttentionRegularization
[72] Perceptual misalignment of texture representations in convolutional neural networks
Ludovica de Paolis,Fabio Anselmi,Alessio Ansuini,Eugenio Piasini
Main category: cs.CV
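TL;DR: This paper examines whether the texture representations of convolutional neural networks (CNNs) agree with human texture perception. It finds no correlation between how well a CNN models the visual system (e.g., as measured by Brain-Score) and how well its texture representation aligns with perception, which suggests that human texture perception may rely on mechanisms, such as contextual integration, that CNNs (especially those trained for object recognition) do not capture.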
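The Gram-matrix texture representation referred to in this entry can be illustrated with a small sketch. It assumes feature maps are given as plain lists (one flat list of activations per channel) and normalizes by the number of spatial positions, which is one common convention; the function name and toy features are hypothetical.

```python
def gram_matrix(feature_maps):
    """Channel-by-channel inner products of activations, averaged over positions.

    feature_maps: list of C channels, each a flat list of N spatial activations.
    Returns a C x C nested list.
    """
    n = len(feature_maps[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in feature_maps]
        for fi in feature_maps
    ]


# Toy features: 3 channels over 4 spatial positions.
features = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
gram = gram_matrix(features)

# Permuting spatial positions (identically in every channel) leaves the Gram
# matrix unchanged: it records which features co-occur, not where they occur.
shuffled = [[channel[i] for i in (2, 0, 3, 1)] for channel in features]
gram_shuffled = gram_matrix(shuffled)
```

This position invariance is why such correlations are treated as a summary of texture rather than of object layout.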
Details
Motivation: Investigate whether, when CNNs serve as models of the visual system, their feature-correlation-based texture representations naturally align with human texture perception, and in particular whether better visual models have more human-like texture representations.
Method: Compare the texture representations formed by linear correlations (Gram matrices) between nonlinear features extracted by a variety of CNNs against the perceptual content of textures; use the CNNs' Brain-Score performance as models of the mammalian visual system as a reference and test whether the two are correlated.
Result: No significant association is found between a CNN's quality on conventional visual-system modeling measures such as Brain-Score and the alignment of its texture representation with human perception.
Conclusion: Human texture perception may rely on mechanisms different from those modeled by mainstream CNNs (particularly CNNs trained for object recognition), such as the integration of contextual information, which current CNN texture representations do not reflect.
Abstract: Mathematical modeling of visual textures traces back to Julesz's intuition that texture perception in humans is based on local correlations between image features. An influential approach for texture analysis and generation generalizes this notion to linear correlations between the nonlinear features computed by convolutional neural networks (CNNs), compiled into Gram matrices. Given that CNNs are often used as models for the visual system, it is natural to ask whether such "texture representations" spontaneously align with the textures' perceptual content, and in particular whether those CNNs that are regarded as better models for the visual system also possess more human-like texture representations. Here we compare the perceptual content captured by feature correlations computed for a diverse pool of CNNs, and we compare it to the models' perceptual alignment with the mammalian visual system as measured by Brain-Score. Surprisingly, we find that there is no connection between conventional measures of CNN quality as a model of the visual system and its alignment with human texture perception. We conclude that texture perception involves mechanisms that are distinct from those that are commonly modeled using approaches based on CNNs trained on object recognition, possibly depending on the integration of contextual information.
[73] IGLOSS: Image Generation for Lidar Open-vocabulary Semantic Segmentation
Nermin Samet,Gilles Puy,Renaud Marlet
Main category: cs.CV
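TL;DR: This paper proposes a new method for zero-shot open-vocabulary semantic segmentation (OVSS) of 3D lidar data: it builds prototypes by generating images from text and segments a point cloud by matching its features, produced by a 3D network distilled from a 2D vision foundation model, against the prototype image features, reaching state of the art on nuScenes and SemanticKITTI.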
Details
Motivation: Address the image-text modality gap inherent in OVSS methods based on vision-language models (VLMs) such as CLIP.
Method: Use text-to-image generation to create class prototype images; obtain a 3D network by distillation from a 2D vision foundation model (VFM); match 3D point cloud features with the 2D features of the generated prototype images to assign semantic labels.
Result: State-of-the-art performance on zero-shot open-vocabulary semantic segmentation on nuScenes and SemanticKITTI.
Conclusion: Text-to-image generation effectively bridges the modality gap and offers a new paradigm for zero-shot semantic segmentation of 3D point clouds without 3D-text alignment supervision.
Abstract: This paper presents a new method for the zero-shot open-vocabulary semantic segmentation (OVSS) of 3D automotive lidar data. To circumvent the recognized image-text modality gap that is intrinsic to approaches based on Vision Language Models (VLMs) such as CLIP, our method relies instead on image generation from text, to create prototype images. Given a 3D network distilled from a 2D Vision Foundation Model (VFM), we then label a point cloud by matching 3D point features with 2D image features of these prototypes. Our method is state-of-the-art for OVSS on nuScenes and SemanticKITTI. Code, pre-trained models, and generated images are available at https://github.com/valeoai/IGLOSS.
[74] AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction
Aiza Maksutova,Lalithkumar Seenivasan,Hao Ding,Jiru Xu,Chenhao Yu,Chenyan Jing,Yiqing Shen,Mathias Unberath
Main category: cs.CV
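TL;DR: This paper proposes AffordTissue, a framework that predicts tool-action-specific tissue affordance regions in laparoscopic cholecystectomy; it produces dense heatmaps through multimodal modeling (a temporal vision encoder, language conditioning, and a DiT decoder), builds the first tissue affordance benchmark (15,638 video clips), and clearly outperforms existing vision-language models.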
Details
Motivation: Existing surgical action automation methods face two challenges in clinical deployment: limited predictability of where instruments will interact with tissue, and no explicit conditioning input to enforce tool-action-specific safe interaction regions.
Method: Propose the multimodal AffordTissue framework, comprising (1) a multi-view temporal vision encoder that captures tool motion and tissue dynamics; (2) a language conditioning module that supports generalization across tool-action pairs; and (3) a DiT-style decoder that predicts dense affordance heatmaps. Also build the first tissue affordance benchmark (103 procedures, six tool-action pairs, 15,638 annotated video clips).
Result: Clearly outperforms vision-language models such as Molmo-VLM on dense surgical affordance prediction (ASSD: 20.6 px vs. 60.2 px), confirming the advantage of a task-specific architecture on this fine-grained spatial prediction task.
Conclusion: AffordTissue provides explicit spatial reasoning that can supply policy guidance (e.g., acting only within predicted safe regions) and an early safe-stop mechanism for surgical automation, supporting the safe deployment of clinical-grade surgical automation.
Abstract: Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these have demonstrated success in table-top experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability on where instruments will interact on tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions. Addressing this gap, we introduce AffordTissue, a multimodal framework for predicting tool-action specific tissue affordance regions as dense heatmaps during cholecystectomy. Our approach combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning enabling generalization across diverse instrument-action pairs, and a DiT-style decoder for dense affordance prediction. We establish the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks: dissection, grasping, clipping, and cutting. Experiments demonstrate substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing that our task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction. By predicting tool-action specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially unlocking explicit policy guidance toward appropriate tissue regions and early safe stop when instruments deviate outside predicted safe zones.
[75] GRAZE: Grounded Refinement and Motion-Aware Zero-Shot Event Localization
Syed Ahsan Masud Zaidi,Lior Shamir,William Hsu,Scott Dietrich,Talha Zaidi
Main category: cs.CV
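TL;DR: This paper proposes GRAZE, a training-free pipeline that, without labeled data, localizes the frame in which a player first touches the tackle dummy (first point of contact, FPOC) in American football practice videos; it combines Grounding DINO, motion-aware temporal reasoning, and SAM2 pixel-level verification, and localizes FPOC within ±10 frames on 77.5% of 738 videos.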
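The reported figures (valid outputs on 97.4% of clips, FPOC within ±10 frames on 77.5% of all clips) are tolerance-based rates over the full set of clips. A minimal sketch of how such rates could be computed is shown below; the function names, the use of `None` for a clip with no valid output, and the toy frame indices are assumptions for illustration only.

```python
def within_tolerance_rate(predicted, ground_truth, tolerance):
    """Fraction of ALL clips whose predicted frame lies within +/- tolerance.

    A prediction of None (no valid output for that clip) counts as a miss.
    """
    hits = sum(
        1
        for p, g in zip(predicted, ground_truth)
        if p is not None and abs(p - g) <= tolerance
    )
    return hits / len(ground_truth)


def valid_output_rate(predicted):
    """Fraction of clips for which the pipeline returned any frame index."""
    return sum(1 for p in predicted if p is not None) / len(predicted)


# Toy example with five clips (hypothetical frame indices).
predicted = [100, 118, None, 250, 40]
ground_truth = [104, 100, 77, 265, 40]

rate_10 = within_tolerance_rate(predicted, ground_truth, tolerance=10)
rate_20 = within_tolerance_rate(predicted, ground_truth, tolerance=20)
valid = valid_output_rate(predicted)
```

Counting invalid outputs as misses matches the phrasing "of all clips"; computing the rate over valid outputs only would give a higher number.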
Details
Motivation: American football practice videos are long and untrimmed, and key interactions such as first contact occupy only a very short window; reliable biomechanical analysis needs robust spatiotemporal localization, especially under real-world challenges such as camera motion, clutter, multiple similarly equipped athletes, and rapid pose changes around impact.
Method: GRAZE is a training-free pipeline: it first uses Grounding DINO to discover candidate player-dummy interaction regions, then refines the temporal localization with motion-aware temporal reasoning, and finally uses SAM2 for pixel-level contact verification (rather than relying on detection confidence), decoupling candidate discovery from contact confirmation for robustness.
Result: On 738 real practice videos, GRAZE produces valid outputs for 97.4% of clips; FPOC is localized within ±10 frames on 77.5% of clips and within ±20 frames on 82.7%.
Conclusion: Frame-level contact onset localization in real practice video is achievable without task-specific training, which supports the feasibility and practicality of unsupervised or weakly supervised approaches in sports biomechanics.
Abstract: American football practice generates video at scale, yet the interaction of interest occupies only a brief window of each long, untrimmed clip. Reliable biomechanical analysis, therefore, depends on spatiotemporal localization that identifies both the interacting entities and the onset of contact. We study First Point of Contact (FPOC), defined as the first frame in which a player physically touches a tackle dummy, in unconstrained practice footage with camera motion, clutter, multiple similarly equipped athletes, and rapid pose changes around impact. We present GRAZE, a training-free pipeline for FPOC localization that requires no labeled tackle-contact examples. GRAZE uses Grounding DINO to discover candidate player-dummy interactions, refines them with motion-aware temporal reasoning, and uses SAM2 as an explicit pixel-level verifier of contact rather than relying on detection confidence alone. This separation between candidate discovery and contact confirmation makes the approach robust to cluttered scenes and unstable grounding near impact. On 738 tackle-practice videos, GRAZE produces valid outputs for 97.4% of clips and localizes FPOC within ±10 frames on 77.5% of all clips and within ±20 frames on 82.7% of all clips. These results show that frame-accurate contact onset localization in real-world practice footage is feasible without task-specific training.
[76] LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding
Fusang Wang,Nathan Piasco,Moussab Bennehar,Luis Roldão,Dzmitry Tsishkou,Fabien Moutarde
Main category: cs.CV
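TL;DR: This paper proposes a new framework based on Sparse Voxel Rasterization (SVRaster) to resolve the spatial and semantic ambiguity of 3D Gaussian Splatting (3DGS) methods in open-vocabulary 3D scene understanding, achieving state-of-the-art performance on fine-grained queries.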
Details
Motivation: Existing 3DGS-based open-vocabulary 3D understanding methods have two shortcomings: unstructured, overlapping Gaussians cause spatial ambiguity and require probabilistic feature registration, and pooling over object-level masks causes multi-level semantic ambiguity that dilutes fine-grained detail.
Method: Use SVRaster as a structured, non-overlapping geometry representation, regularized with monocular depth and normal priors to build a stable geometric foundation; on this basis, perform deterministic, confidence-aware feature registration and exploit the dense alignment properties of the foundation model AM-RADIO to resolve multi-level semantic ambiguity.
Result: State-of-the-art performance on open-vocabulary 3D object retrieval and point cloud understanding benchmarks, with particular gains over registration-based methods on fine-grained queries.
Conclusion: SVRaster provides a more reliable, structured 3D geometry representation that overcomes the ambiguity of 3DGS in open-vocabulary 3D understanding and opens a path to fine-grained, deterministic semantic registration.
Abstract: Recent advancements in open-vocabulary 3D scene understanding heavily rely on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, we identify two critical limitations in these approaches: the spatial ambiguity arising from unstructured, overlapping Gaussians, which necessitates probabilistic feature registration, and the multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we present a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. By regularizing SVRaster with monocular depth and normal priors, we establish a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS. Furthermore, we resolve multi-level ambiguity by exploiting the emerging dense alignment properties of the foundation model AM-RADIO, avoiding the computational overhead of hierarchical training methods. Our approach achieves state-of-the-art performance on Open Vocabulary 3D Object Retrieval and Point Cloud Understanding benchmarks, particularly excelling on fine-grained queries where registration methods typically fail.
[77] EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation
Abhishek Saroha,Huajian Zeng,Xingxing Zuo,Daniel Cremers,Xi Wang
Main category: cs.CV
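TL;DR: EgoFlow is a flow-matching framework that generates physically plausible, realistic 6DoF object motion trajectories from egocentric video; it combines a hybrid Mamba-Transformer-Perceiver architecture with differentiable physical constraints, markedly reducing collision rates and improving generalization on several real-world datasets.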
Details
Motivation: Existing generative models lack explicit physical reasoning and struggle to generate physically consistent 6DoF trajectories under occlusion, fast motion, and other difficult conditions.
Method: Propose EgoFlow, which uses a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, and applies differentiable physical constraints (such as collision avoidance and motion smoothness) through gradient-guided inference.
Result: Outperforms diffusion-based and transformer baselines on HD-EPIC, EgoExo4D, and HOT3D, reduces collision rates by up to 79%, and generalizes well across scenes.
Conclusion: Flow-based generative modeling offers a new path toward scalable, physically grounded egocentric motion understanding.
Abstract: Understanding and predicting object motion from egocentric video is fundamental to embodied perception and interaction. However, generating physically consistent 6DoF trajectories remains challenging due to occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models. We present EgoFlow, a flow-matching framework that synthesizes realistic and physically plausible trajectories conditioned on multimodal egocentric observations. EgoFlow employs a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, while a gradient-guided inference process enforces differentiable physical constraints such as collision avoidance and motion smoothness. This combination yields coherent and controllable motion generation without post-hoc filtering or additional supervision. Experiments on the real-world datasets HD-EPIC, EgoExo4D, and HOT3D show that EgoFlow outperforms diffusion-based and transformer baselines in accuracy, generalization, and physical realism, reducing collision rates by up to 79% and generalizing strongly to unseen scenes. Our results highlight the promise of flow-based generative modeling for scalable and physically grounded egocentric motion understanding.
[78] Better Rigs, Not Bigger Networks: A Body Model Ablation for Gaussian Avatars
Derek Austin
Main category: cs.CV
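TL;DR: This paper replaces SMPL with the Momentum Human Rig (MHR), estimated via SAM-3D-Body, to build a simpler 3D Gaussian splatting pipeline for human modeling that achieves the best results on several metrics, and uses controlled experiments to show that body model expressiveness is the main bottleneck in avatar reconstruction.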
Details
Motivation: Existing SMPL-based 3D Gaussian splatting methods look good but rely on increasingly complex training architectures; the author asks whether the pipeline can be simplified while improving performance.
Method: Replace SMPL with MHR, driven by SAM-3D-Body estimates; introduce no learned deformations or pose-dependent corrections; and use two controlled experiments (mesh conversion and pose transfer) to separate the effects of pose estimation quality and body model expressiveness.
Result: Highest PSNR and competitive or better LPIPS and SSIM on PeopleSnapshot and ZJU-MoCap; the controlled experiments show that both MHR's expressiveness and pose estimation quality drive the gains.
Conclusion: Body model expressiveness, not pose estimation alone, is a key bottleneck in avatar reconstruction; a simpler, more expressive rig such as MHR can replace complex learnable deformation modules, giving better performance with a simpler architecture.
Abstract: Recent 3D Gaussian splatting methods built atop SMPL achieve remarkable visual fidelity while continually increasing the complexity of the overall training architecture. We demonstrate that much of this complexity is unnecessary: by replacing SMPL with the Momentum Human Rig (MHR), estimated via SAM-3D-Body, a minimal pipeline with no learned deformations or pose-dependent corrections achieves the highest reported PSNR and competitive or superior LPIPS and SSIM on PeopleSnapshot and ZJU-MoCap. To disentangle pose estimation quality from body model representational capacity, we perform two controlled ablations: translating SAM-3D-Body meshes to SMPL-X, and translating the original dataset's SMPL poses into MHR, both retrained under identical conditions. These ablations confirm that body model expressiveness has been a primary bottleneck in avatar reconstruction, with both mesh representational capacity and pose estimation quality contributing meaningfully to the full pipeline's gains.
[79] Nonlinear Methods for Analyzing Pose in Behavioral Research
Carter Sale,Margaret C. Macpherson,Gaurav Patil,Kelly Miles,Rachel W. Kallen,Sebastian Wallot,Michael J. Richardson
Main category: cs.CV
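TL;DR: This paper presents a general-purpose analysis pipeline for human pose data that combines preprocessing, dimensionality reduction, and recurrence-based time series analysis to quantify the temporal structure of movement dynamics across a range of experimental settings.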
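Recurrence-based time series analysis, the last stage named in this entry, starts from a recurrence matrix that marks which pairs of time points have similar states. The sketch below shows that basic building block and the recurrence rate derived from it. The fixed `radius`, the Euclidean distance, and the toy trajectory are assumptions for illustration; the entry does not specify the pipeline's actual parameters or embedding steps.

```python
import math


def recurrence_matrix(series, radius):
    """Binary matrix with R[i][j] = 1 when states i and j are within `radius`."""
    n = len(series)
    return [
        [1 if math.dist(series[i], series[j]) <= radius else 0 for j in range(n)]
        for i in range(n)
    ]


def recurrence_rate(matrix):
    """Share of recurrent points, excluding the trivial main diagonal."""
    n = len(matrix)
    off_diagonal = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    return off_diagonal / (n * (n - 1))


# Toy 2D trajectory (e.g., two reduced pose dimensions) that alternates
# between two states, so each state recurs every second sample.
trajectory = [(0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
matrix = recurrence_matrix(trajectory, radius=0.1)
rate = recurrence_rate(matrix)
```

Measures such as determinism or laminarity are computed from diagonal and vertical line structures in the same matrix; they are omitted here to keep the sketch short.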
Details
Motivation: Pose data are high-dimensional, noisy, and temporally complex, which makes it hard to extract meaningful patterns of coordination and behavioral change.
Method: Build a general-purpose analysis pipeline comprising principled preprocessing, dimensionality reduction, and recurrence-based time series analysis.
Result: Three case studies spanning facial and full-body movement, 2D and 3D data, and individual and multi-agent behavior demonstrate the pipeline's flexibility and adaptability.
Conclusion: The same analytic workflow can be adapted to different complex pose time series to extract theoretically meaningful behavioral insights.
Abstract: Advances in markerless pose estimation have made it possible to capture detailed human movement in naturalistic settings using standard video, enabling new forms of behavioral analysis at scale. However, the high dimensionality, noise, and temporal complexity of pose data raise significant challenges for extracting meaningful patterns of coordination and behavioral change. This paper presents a general-purpose analysis pipeline for human pose data, designed to support both linear and nonlinear characterizations of movement across diverse experimental contexts. The pipeline combines principled preprocessing, dimensionality reduction, and recurrence-based time series analysis to quantify the temporal structure of movement dynamics. To illustrate the pipeline's flexibility, we present three case studies spanning facial and full-body movement, 2D and 3D data, and individual versus multi-agent behavior. Together, these examples demonstrate how the same analytic workflow can be adapted to extract theoretically meaningful insights from complex pose time series.
[80] Reinforcing Consistency in Video MLLMs with Structured Rewards
Yihao Quan,Zeru Shi,Jinman Zhao,Ruixiang Tang
Main category: cs.CV
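TL;DR: This paper proposes a structured reward mechanism that improves the visual and temporal grounding of multimodal large language models (MLLMs) in video understanding: it audits consistency by decomposing captions into factual and temporal claims, and designs a fine-grained reward with scene-graph, temporal, and video question answering components, markedly reducing hallucination and improving faithfulness.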
Details
Motivation: Existing MLLMs often show poor visual and temporal grounding in video understanding, such as fabricating objects, assigning wrong attributes, or collapsing repeated events, and standard sentence-level supervision and rewards cannot localize specific grounding failures.
Method: Propose a top-down compositional consistency audit and design a structured reward: (1) an instance-aware scene-graph reward; (2) a temporal reward for event ordering and repetition; (3) a video-grounded VQA reward for hierarchical self-verification; train with reinforcement learning.
Result: Consistent gains across several open-source backbones on temporal understanding, general video understanding, and hallucination benchmarks.
Conclusion: Structured reward shaping is a practical route to more faithful video understanding and improves on coarse sentence-level supervision and rewards.
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment, standard sentence-level rewards often prove too coarse to accurately localize specific grounding failures. To address this, we replace generic sentence-level rewards with a structured reward built from factual and temporal units. Our training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. Across temporal, general video understanding, and hallucination-oriented benchmarks, this objective yields consistent gains on open-source backbones. These results suggest that structured reward shaping is a practical route to more faithful video understanding.
[81] Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation
Yunbei Zhang,Chengyi Cai,Feng Liu,Jihun Hamm
Main category: cs.CV
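TL;DR: This paper proposes AReS, which uses a single-pass interaction with the API to lightly fine-tune a local pre-trained encoder and then performs white-box reprogramming on the local model, avoiding large numbers of API calls and markedly improving the efficiency and effectiveness of adapting modern closed-source models such as GPT-4o.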
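For context on why the Zeroth-Order Optimization (ZOO) baseline discussed in this entry is expensive in API calls, the sketch below shows a standard two-point random-direction gradient estimator for a black-box loss. It illustrates the general ZOO technique, not AReS and not any specific prior method; the quadratic `black_box_loss`, the call counter, and all parameter values are hypothetical stand-ins for a remote service.

```python
import random


def zoo_gradient(loss, x, n_queries, mu, rng):
    """Two-point zeroth-order gradient estimate using Gaussian directions.

    Each direction needs two loss evaluations, so one estimate costs
    2 * n_queries calls to the black box.
    """
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(n_queries):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (loss(x_plus) - loss(x_minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / n_queries
    return grad


# Hypothetical black box: a quadratic loss standing in for a remote API call.
calls = {"n": 0}
target = [1.0, -2.0, 0.5]


def black_box_loss(x):
    calls["n"] += 1
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


rng = random.Random(0)
x0 = [0.0, 0.0, 0.0]
estimate = zoo_gradient(black_box_loss, x0, n_queries=500, mu=0.01, rng=rng)

# Compare against the analytic gradient, which a real black box would not expose.
true_grad = [2.0 * (xi - ti) for xi, ti in zip(x0, target)]
alignment = sum(e * g for e, g in zip(estimate, true_grad))
```

A single gradient estimate here consumes 1,000 loss evaluations, and an optimizer repeats this at every step. The estimator also depends on the loss responding to small input perturbations, which is the sensitivity the entry says modern APIs may lack.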
Details
Motivation: Existing methods that reprogram closed-box service models via Zeroth-Order Optimization (ZOO) suffer from costly API calls and unstable optimization, and work poorly on modern models with low input sensitivity, such as GPT-4o.
Method: AReS uses a two-stage strategy: in the first stage, a single-pass interaction with the API lightly fine-tunes a local pre-trained encoder; in the second stage, white-box reprogramming is performed on this local proxy model, and all later adaptation and inference run locally with no further API calls.
Result: On GPT-4o, a 27.8% gain over the zero-shot baseline, where ZOO methods yield almost no improvement; across 10 datasets, better than state-of-the-art methods on average (+2.5% for VLMs, +15.6% for standard VMs) with API calls reduced by over 99.99%.
Conclusion: AReS is an efficient, robust, and practical new paradigm for adapting closed-box service models that removes the key bottleneck of ZOO methods on modern APIs.
Abstract: Adapting closed-box service models (i.e., APIs) for target tasks typically relies on reprogramming via Zeroth-Order Optimization (ZOO). However, this standard strategy is known for extensive, costly API calls and often suffers from slow, unstable optimization. Furthermore, we observe that this paradigm faces new challenges with modern APIs (e.g., GPT-4o). These models can be less sensitive to the input perturbations ZOO relies on, thereby hindering performance gains. To address these limitations, we propose an Alternative efficient Reprogramming approach for Service models (AReS). Instead of direct, continuous closed-box optimization, AReS initiates a single-pass interaction with the service API to prime an amenable local pre-trained encoder. This priming stage trains only a lightweight layer on top of the local encoder, making it highly receptive to the subsequent glass-box (white-box) reprogramming stage performed directly on the local model. Consequently, all subsequent adaptation and inference rely solely on this local proxy, eliminating all further API costs. Experiments demonstrate AReS's effectiveness where prior ZOO-based methods struggle: on GPT-4o, AReS achieves a +27.8% gain over the zero-shot baseline, a task where ZOO-based methods provide little to no improvement. Broadly, across ten diverse datasets, AReS outperforms state-of-the-art methods (+2.5% for VLMs, +15.6% for standard VMs) while reducing API calls by over 99.99%. AReS thus provides a robust and practical solution for adapting modern closed-box models.
[82] UniRecGen: Unifying Multi-View 3D Reconstruction and Generation
Zhisheng Huang,Jiahao Chen,Cheng Lin,Chenyu Hu,Hanzhuo Huang,Zhengming Yu,Mengfei Li,Yuheng Liu,Zekai Gu,Zibo Zhao,Yuan Liu,Xin Li,Wenping Wang
Main category: cs.CV
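TL;DR: This paper proposes UniRecGen, a framework that unifies sparse-view 3D reconstruction and diffusion-based generation; through a shared canonical space and disentangled cooperative learning, it achieves high-fidelity, structurally complete, multi-view-consistent 3D modeling.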
Details
Motivation: In sparse-view 3D modeling, feed-forward reconstruction is efficient but lacks global priors, leaving structures incomplete, while diffusion-based generation yields rich detail but is inconsistent across views; the strengths of both need to be combined.
Method: Build the unified UniRecGen framework: (1) align the reconstruction module and the diffusion generator in a shared canonical space; (2) adopt a disentangled cooperative learning strategy; (3) have the reconstruction module provide canonical geometric anchors while the diffusion module refines and completes geometry through latent-augmented conditioning.
Result: Produces more complete, consistent, and high-fidelity 3D models from sparse views, outperforming existing methods in experiments.
Conclusion: UniRecGen bridges the reconstruction and generation paradigms, introducing strong geometric priors while staying aligned with the input, and offers a new paradigm for sparse-view 3D modeling.
Abstract: Sparse-view 3D modeling represents a fundamental tension between reconstruction fidelity and generative plausibility. While feed-forward reconstruction excels in efficiency and input alignment, it often lacks the global priors needed for structural completeness. Conversely, diffusion-based generation provides rich geometric details but struggles with multi-view consistency. We present UniRecGen, a unified framework that integrates these two paradigms into a single cooperative system. To overcome inherent conflicts in coordinate spaces, 3D representations, and training objectives, we align both models within a shared canonical space. We employ disentangled cooperative learning, which maintains stable training while enabling seamless collaboration during inference. Specifically, the reconstruction module is adapted to provide canonical geometric anchors, while the diffusion generator leverages latent-augmented conditioning to refine and complete the geometric structure. Experimental results demonstrate that UniRecGen achieves superior fidelity and robustness, outperforming existing methods in creating complete and consistent 3D models from sparse observations.
[83] Universal computational thermal imaging overcoming the ghosting effect
Hongyi Xu,Du Wang,Chenjun Zhao,Jiashuo Chen,Jiale Lin,Liqin Cao,Yanfei Zhong,Yiyuan She,Fanglin Bao
Main category: cs.CV
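TL;DR: This paper proposes TAG (Thermal Anti-Ghosting), a universal computational thermal imaging framework that uses nonparametric texture recovery to overcome the ghosting effect caused by material non-uniformity, and demonstrates for the first time high-fidelity facial texture and expression recovery, 3D topological alignment, and mood detection in thermal images.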
Details
Motivation: Existing thermal imaging is limited by the ghosting effect and loses much detail; HADAR is promising but applies only to scenes with uniform materials, whereas material non-uniformity is ubiquitous in the real world, so a universal anti-ghosting solution is needed.
Method: Propose the TAG framework, which performs nonparametric texture recovery from hyperspectral photon streams, requires no prior material model, and handles scenes with non-uniform materials.
Result: Experiments recover, for the first time with high fidelity, facial texture and expressions from heavily ghosted thermal images; TAG outperforms HADAR, reveals how material non-uniformity limits HADAR, and enables thermal 3D topological alignment and mood detection.
Conclusion: TAG provides a universal foundation for high-fidelity computational night vision, overcomes the long-standing ghosting bottleneck in thermal imaging, and broadens its potential applications in autonomous driving, reconnaissance, healthcare, and wildlife monitoring.
Abstract: Thermal imaging is crucial for night vision but fundamentally hampered by the ghosting effect, a loss of detailed texture in cluttered photon streams. While conventional ghosting mitigation has relied on data post-processing, the recent breakthrough in heat-assisted detection and ranging (HADAR) opens a promising frontier for hyperspectral computational thermal imaging that produces night vision with day-like visibility. However, universal anti-ghosting imaging remains elusive, as state-of-the-art HADAR applies only to limited scenes with uniform materials, whereas material non-uniformity is ubiquitous in the real world. Here, we propose a universal computational thermal imaging framework, TAG (thermal anti-ghosting), to address material non-uniformity and overcome ghosting for high-fidelity night vision. TAG takes hyperspectral photon streams for nonparametric texture recovery, enabling our experimental demonstration of unprecedented expression recovery in thus-far-elusive ghostly human faces, the archetypal, long-recognized ghosting phenomenon. Strikingly, TAG not only universally outperforms HADAR across various scenes, but also reveals the influence of material non-uniformity, shedding light on HADAR's effectiveness boundary. We extensively test facial texture and expression recovery across day and night, and demonstrate, for the first time, thermal 3D topological alignment and mood detection. This work establishes a universal foundation for high-fidelity computational night vision, with potential applications in autonomous navigation, reconnaissance, healthcare, and wildlife monitoring.
[84] Prototype-Based Low Altitude UAV Semantic Segmentation
Da Zhang,Gao Junyu,Zhao Zhiyuan
Main category: cs.CV
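TL;DR: This paper proposes PBSeg, an efficient prototype-based segmentation framework for semantic segmentation of low-altitude UAV imagery. It combines prototype-based cross-attention (PBCA) with a multi-scale feature extraction module (deformable convolutions plus context-aware modulation) to maintain segmentation quality while reducing computational cost.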
Details
Motivation: Existing transformer-based methods incur high computational overhead, while lightweight methods struggle to capture fine-grained details in high-resolution aerial images, and UAV edge devices have limited computational resources. Method: The PBSeg framework is built around prototype-based cross-attention (PBCA), which exploits feature redundancy to reduce computational complexity, and an efficient multi-scale feature extraction module that combines deformable convolutions (DConv) with context-aware modulation (CAM). Result: PBSeg reaches 71.86% mIoU on UAVid and 80.92% mIoU on UDD6, giving competitive performance while remaining computationally efficient. Conclusion: PBSeg strikes a good balance between accuracy and efficiency and suits semantic segmentation on resource-constrained UAV edge devices. Abstract: Semantic segmentation of low-altitude UAV imagery presents unique challenges due to extreme scale variations, complex object boundaries, and limited computational resources on edge devices. Existing transformer-based segmentation methods achieve remarkable performance but incur high computational overhead, while lightweight approaches struggle to capture fine-grained details in high-resolution aerial scenes. To address these limitations, we propose PBSeg, an efficient prototype-based segmentation framework tailored for UAV applications. PBSeg introduces a novel prototype-based cross-attention (PBCA) that exploits feature redundancy to reduce computational complexity while maintaining segmentation quality. The framework incorporates an efficient multi-scale feature extraction module that combines deformable convolutions (DConv) with context-aware modulation (CAM) to capture both local details and global semantics. Experiments on two challenging UAV datasets demonstrate the effectiveness of the proposed approach. PBSeg achieves 71.86% mIoU on UAVid and 80.92% mIoU on UDD6, establishing competitive performance while maintaining computational efficiency. Code is available at https://github.com/zhangda1018/PBSeg.
[85] Cross-Domain Vessel Segmentation via Latent Similarity Mining and Iterative Co-Optimization
Zhanqiang Guo,Jianjiang Feng,Jie Zhou
Main category: cs.CV
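TL;DR: This paper proposes a new cross-domain retinal vessel segmentation framework based on latent vascular similarity and iterative co-optimization of generation and segmentation networks, improving segmentation performance in clinical scenarios with large modality discrepancies.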
Details
Motivation: Existing CNN-based methods degrade significantly when there is a domain shift between training and testing data, so better cross-domain generalization is needed. Method: A domain transfer framework in three steps: 1) pre-train generation networks for the source and target domains; 2) use the source-domain conditional diffusion model to perform deterministic inversion, building domain-agnostic latent prototypes of vascular images for target-domain synthesis; 3) iteratively co-optimize the segmentation network and the generative model through cyclic parameter updating. Result: The framework achieves state-of-the-art performance in cross-domain retinal vessel segmentation, particularly in challenging clinical scenarios with significant modality discrepancies. Conclusion: By jointly optimizing generation and segmentation, the framework mitigates domain shift and offers a new approach to cross-domain medical image analysis. Abstract: Retinal vessel segmentation serves as a critical prerequisite for automated diagnosis of retinal pathologies. While recent advances in Convolutional Neural Networks (CNNs) have demonstrated promising performance in this task, significant performance degradation occurs when domain shifts exist between training and testing data. To address these limitations, we propose a novel domain transfer framework that leverages latent vascular similarity across domains and iterative co-optimization of generation and segmentation networks. Specifically, we first pre-train generation networks for source and target domains. Subsequently, the pretrained source-domain conditional diffusion model performs deterministic inversion to establish intermediate latent representations of vascular images, creating domain-agnostic prototypes for target synthesis. Finally, we develop an iterative refinement strategy where segmentation network and generative model undergo mutual optimization through cyclic parameter updating. This co-evolution process enables simultaneous enhancement of cross-domain image synthesis quality and segmentation accuracy. Experiments demonstrate that our framework achieves state-of-the-art performance in cross-domain retinal vessel segmentation, particularly in challenging clinical scenarios with significant modality discrepancies.
[86] ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction
Yanzhe Liang,Ruijie Zhu,Hanzhi Chang,Zhuoyuan Li,Jiahao Lu,Tianzhu Zhang
Main category: cs.CV
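TL;DR: ReFlow is a new framework for monocular dynamic scene reconstruction that needs no external optical-flow guidance; a self-correction flow matching mechanism enables decoupled modeling of static and dynamic components and robust 4D reconstruction.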
Details
Motivation: Existing monocular dynamic scene reconstruction methods often suffer from incomplete initialization of dynamic regions, which makes reconstruction and motion estimation unstable; relying on external dense motion guidance such as pre-computed optical flow adds complexity and potential error propagation. Method: ReFlow comprises a Complete Canonical Space Construction module (enhanced initialization of static and dynamic regions), a Separation-Based Dynamic Scene Modeling module (decoupling static and dynamic components for targeted motion supervision), and a core self-correction flow matching mechanism (Full Flow Matching and Camera Flow Matching). Result: Experiments across diverse scenarios show that ReFlow outperforms existing methods in reconstruction quality and robustness, establishing a new self-correction paradigm for monocular 4D reconstruction. Conclusion: ReFlow achieves high-quality monocular dynamic scene reconstruction without external motion priors, offering a simpler, more stable, and more scalable approach to 4D reconstruction. Abstract: We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, which often resorts to external dense motion guidance such as pre-computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation. To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision. The core of ReFlow is a novel self-correction flow matching mechanism, consisting of Full Flow Matching to align 3D scene flow with time-varying 2D observations, and Camera Flow Matching to enforce multi-view consistency for static objects. Together, these modules enable robust and accurate dynamic scene reconstruction. Extensive experiments across diverse scenarios demonstrate that ReFlow achieves superior reconstruction quality and robustness, establishing a novel self-correction paradigm for monocular 4D reconstruction.
[87] VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification
Jiahao Meng,Tan Yue,Qi Xu,Haochen Wang,Zhongwei Ren,Weisong Liu,Yuhao Wang,Renrui Zhang,Yunhai Tong,Haodong Duan
Main category: cs.CV
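TL;DR: This paper presents VideoZeroBench, a hierarchical benchmark for long-video question answering that rigorously verifies spatio-temporal evidence. Experiments show that current models perform very poorly when they must both answer correctly and precisely localize the spatio-temporal evidence, revealing a fundamental bottleneck in grounded video understanding.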
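As a rough illustration of what it means to require a correct answer together with accurate temporal and spatial grounding, the sketch below scores one prediction with interval IoU and bounding-box IoU. The function names and the 0.5 thresholds are assumptions made for this example; the benchmark's actual matching criteria are not stated in this digest.

```python
from typing import Tuple

Interval = Tuple[float, float]            # (start_s, end_s)
Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounded_correct(answer_ok: bool,
                     pred_span: Interval, gt_span: Interval,
                     pred_box: Box, gt_box: Box,
                     t_thr: float = 0.5, s_thr: float = 0.5) -> bool:
    """Hypothetical strict check: the answer, the time span, and the box must all pass.

    The thresholds are illustrative assumptions, not the benchmark's values.
    """
    return (answer_ok
            and temporal_iou(pred_span, gt_span) >= t_thr
            and box_iou(pred_box, gt_box) >= s_thr)
```

Under a check like this, a model that answers correctly but points at the wrong moment or region receives no credit, which is the gap between answer correctness and evidence-based reasoning that the summary describes.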
Details
Motivation: Current evaluations of video multimodal large language models have two critical limitations: inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and answer correctness is measured without checking whether the model has located the precise spatio-temporal evidence supporting its answer. Method: VideoZeroBench comprises 500 manually annotated questions across 13 domains, each paired with temporal intervals and spatial bounding boxes as ground-truth evidence; a five-level evaluation protocol progressively tightens the grounding requirements, disentangling answer generation, temporal grounding, and spatial grounding. Result: Gemini-3-Pro answers fewer than 17% of questions correctly under the standard end-to-end QA setting (Level-3); under the strictest setting, Level-5 (a correct answer plus accurate spatio-temporal localization), no model exceeds 1% accuracy and most achieve no correct grounded predictions, exposing a large gap between surface-level answer correctness and genuine evidence-based reasoning. Conclusion: Current video multimodal large language models lack evidence-grounded understanding in long-video QA; VideoZeroBench shows that grounded video understanding remains a bottleneck and provides a quantitative evaluation framework and analysis dimensions for future research. Abstract: Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark designed for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: No model exceeds 1% accuracy when both correct answering and accurate spatio-temporal localization are required (Level-5), with most failing to achieve any correct grounded predictions. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA.
We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.
[88] Harmonized Tabular-Image Fusion via Gradient-Aligned Alternating Learning
Longfei Huang,Yang Yang
Main category: cs.CV
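TL;DR: This paper proposes Gradient-Aligned Alternating Learning (GAAL), a paradigm that combines alternating unimodal learning with a shared classifier and uncertainty-based cross-modal gradient surgery to mitigate gradient conflicts in multimodal fusion and improve tabular-image fusion performance.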
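To make "gradient conflict" and "gradient surgery" concrete, here is a minimal sketch of the generic projection used in the gradient-surgery literature: when one modality's gradient on the shared parameters points against the other's (negative dot product), its conflicting component is removed. This is a stand-in for illustration only. GAAL's uncertainty-based rule for deciding which gradients to align is not specified in this digest and is not reproduced here; the function names are hypothetical.

```python
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_conflict(g: List[float], ref: List[float]) -> List[float]:
    """Return g with its component along ref removed when the two conflict.

    A conflict is a negative dot product. Non-conflicting gradients are
    returned unchanged. Generic projection, not GAAL's selection rule.
    """
    d = dot(g, ref)
    if d >= 0:
        return list(g)
    scale = d / dot(ref, ref)  # ref is non-zero whenever d < 0
    return [gi - scale * ri for gi, ri in zip(g, ref)]


# Example: a tabular-branch gradient that conflicts with the image-branch one.
g_tab = [1.0, 0.0]
g_img = [-1.0, 1.0]
g_tab_aligned = project_conflict(g_tab, g_img)
```

After projection the adjusted gradient is orthogonal to the reference, so a step along it no longer works against the other modality to first order.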
Details
Motivation: Existing tabular-image fusion methods may be hindered by gradient conflicts between modalities, which mislead the optimization of the unimodal learners. Method: GAAL 1) alternates unimodal learning with a shared classifier to decouple the multimodal gradient, and 2) applies uncertainty-based cross-modal gradient surgery that selectively aligns gradients so that the shared parameters benefit all modalities. Result: On widely used datasets, GAAL outperforms various state-of-the-art baselines both for tabular-image fusion and for the setting where tabular data are missing at test time. Conclusion: GAAL's gradient alignment mitigates modality conflict, provides effective unimodal assistance, and improves overall fusion performance. Abstract: Multimodal tabular-image fusion is an emerging task that has received increasing attention in various domains. However, existing methods may be hindered by gradient conflicts between modalities, misleading the optimization of the unimodal learner. In this paper, we propose a novel Gradient-Aligned Alternating Learning (GAAL) paradigm to address this issue by aligning modality gradients. Specifically, GAAL adopts an alternating unimodal learning and shared classifier to decouple the multimodal gradient and facilitate interaction. Furthermore, we design uncertainty-based cross-modal gradient surgery to selectively align cross-modal gradients, thereby steering the shared parameters to benefit all modalities. As a result, GAAL can provide effective unimodal assistance and help boost the overall fusion performance. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SoTA) tabular-image fusion baselines and test-time tabular missing baselines. The source code is available at https://github.com/njustkmg/ICME26-GAAL.
[89] Satellite-Free Training for Drone-View Geo-Localization
Tao Liu,Yingzhi Zhang,Kan Ren,Xiaoqi Zhao
Main category: cs.CV
TL;DR: 本文提出了一种无需卫星图像训练的无人机视角地理定位(DVGL)框架(SFT),通过多视角无人机图像重建3D场景、生成伪正射影像并进行几何引导修复,再利用DINOv3特征与Fisher向量聚合实现跨视角检索,在无卫星数据条件下显著提升定位性能。
Details
Motivation: Existing DVGL methods rely on satellite imagery during training (paired supervision or unsupervised alignment), which limits deployment when satellite data are unavailable or restricted; real applications often provide multi-view UAV sequences, from which a geometry-normalized UAV-side representation can be built. Method: The SFT framework 1) reconstructs dense 3D scenes from multi-view drone images with 3D Gaussian splatting; 2) generates pseudo-orthophotos via PCA-guided orthographic projection, without requiring camera parameters at rendering time; 3) applies lightweight geometry-guided inpainting to obtain texture-complete pseudo-orthophotos; 4) extracts DINOv3 patch features, learns a Fisher vector aggregation model from drone data only, and reuses it at test time to encode satellite tiles. Result: On University-1652 and SUES-200, SFT substantially outperforms satellite-free generalization baselines and narrows the gap to methods trained with satellite imagery. Conclusion: SFT shows that effective cross-view geo-localization representations can be learned from drone imagery alone, offering a practical route to DVGL in GPS-denied environments without depending on satellite data for training. Abstract: Drone-view geo-localization (DVGL) aims to determine the location of drones in GPS-denied environments by retrieving the corresponding geotagged satellite tile from a reference gallery given UAV observations of a location. In many existing formulations, these observations are represented by a single oblique UAV image. In contrast, our satellite-free setting is designed for multi-view UAV sequences, which are used to construct a geometry-normalized UAV-side location representation before cross-view retrieval. Existing approaches rely on satellite imagery during training, either through paired supervision or unsupervised alignment, which limits practical deployment when satellite data are unavailable or restricted. In this paper, we propose a satellite-free training (SFT) framework that converts drone imagery into cross-view compatible representations through three main stages: drone-side 3D scene reconstruction, geometry-based pseudo-orthophoto generation, and satellite-free feature aggregation for retrieval. Specifically, we first reconstruct dense 3D scenes from multi-view drone images using 3D Gaussian splatting and project the reconstructed geometry into pseudo-orthophotos via PCA-guided orthographic projection. This rendering stage operates directly on reconstructed scene geometry without requiring camera parameters at rendering time. Next, we refine these orthophotos with lightweight geometry-guided inpainting to obtain texture-complete drone-side views.
Finally, we extract DINOv3 patch features from the generated orthophotos, learn a Fisher vector aggregation model solely from drone data, and reuse it at test time to encode satellite tiles for cross-view retrieval. Experimental results on University-1652 and SUES-200 show that our SFT framework substantially outperforms satellite-free generalization baselines and narrows the gap to methods trained with satellite imagery.
[90] SHOE: Semantic HOI Open-Vocabulary Evaluation Metric
Maja Noack,Qinqian Lei,Taipeng Tian,Bihan Dong,Robby T. Tan,Yixin Chen,John Young,Saijun Zhang,Bo Wang
Main category: cs.CV
TL;DR: 本文提出SHOE,一种语义感知的开放词汇HOI检测评估框架,通过分解HOI为动词和物体成分并利用多个大语言模型计算其语义相似性,从而超越传统基于精确匹配的mAP指标。
Details
Motivation: Existing HOI evaluation metrics such as mAP rely only on discrete label matching and cannot credit predictions that are semantically close but worded differently (e.g., "lean on couch" vs. "sit on couch"), which makes them ill-suited to open-vocabulary settings. Method: SHOE decomposes each HOI prediction into verb and object components, estimates the semantic similarity of each to the ground truth using the average of multiple large language models, and combines them into an overall similarity score; evaluation uses standard benchmarks such as HICO-DET. Result: SHOE scores reach 85.73% agreement with the average human ratings, aligning more closely with human judgments than existing metrics, including LLM-based and embedding-based baselines; the metric allows flexible, scalable evaluation of existing detectors and open-ended generative models. Conclusion: Semantically grounded HOI evaluation such as SHOE better mirrors human understanding of interactions; the authors will release the evaluation metric publicly to support future research. Abstract: Open-vocabulary human-object interaction (HOI) detection is a step towards building scalable systems that generalize to unseen interactions in real-world scenarios and support grounded multimodal systems that reason about human-object relationships. However, standard evaluation metrics, such as mean Average Precision (mAP), treat HOI classes as discrete categorical labels and fail to credit semantically valid but lexically different predictions (e.g., "lean on couch" vs. "sit on couch"), limiting their applicability for evaluating open-vocabulary predictions that go beyond any predefined set of HOI labels. We introduce SHOE (Semantic HOI Open-Vocabulary Evaluation), a new evaluation framework that incorporates semantic similarity between predicted and ground-truth HOI labels. SHOE decomposes each HOI prediction into its verb and object components, estimates their semantic similarity using the average of multiple large language models (LLMs), and combines them into a similarity score to evaluate alignment beyond exact string match. This enables a flexible and scalable evaluation of both existing HOI detection methods and open-ended generative models using standard benchmarks such as HICO-DET. Experimental results show that SHOE scores align more closely with human judgments than existing metrics, including LLM-based and embedding-based baselines, achieving an agreement of 85.73% with the average human ratings. Our work underscores the need for semantically grounded HOI evaluation that better mirrors human understanding of interactions.
We will release our evaluation metric to the public to facilitate future research.
[91] Mitigating the ID-OOD Tradeoff in Open-Set Test-Time Adaptation
Wenjie Zhao,Jia Li,Xin Dong,Yapeng Tian,Yu Xiang,Yunhui Guo
Main category: cs.CV
TL;DR: 本文提出ROSETTA方法,通过引入角度损失和特征范数损失来解决开放集测试时适应(OSTTA)中熵最小化与最大化之间的固有冲突,从而在保持ID样本分类性能的同时提升csOOD样本检测能力。
Details
Motivation: In OSTTA, a model must handle both covariate-shifted ID (csID) samples and covariate-shifted OOD (csOOD) samples, and the intrinsic conflict between entropy minimization and entropy maximization forces a trade-off between the two. Method: ROSETTA combines an angular loss (to regulate feature norm magnitudes) with a feature-norm loss (to suppress csOOD logits). Result: ROSETTA achieves strong OOD detection and high ID classification performance on CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C, and ImageNet-C; experiments on Cityscapes semantic segmentation and the HAC dataset show that it carries over to other settings. Conclusion: ROSETTA mitigates the conflict between entropy objectives and delivers robust open-set test-time adaptation across multiple benchmarks and tasks. Abstract: Open-set test-time adaptation (OSTTA) addresses the challenge of adapting models to new environments where out-of-distribution (OOD) samples coexist with in-distribution (ID) samples affected by distribution shifts. In such settings, covariate shift (for example, changes in weather conditions such as snow) can alter ID samples, reducing model reliability. Consequently, models must not only correctly classify covariate-shifted ID (csID) samples but also effectively reject covariate-shifted OOD (csOOD) samples. Entropy minimization is a common strategy in test-time adaptation to maintain ID performance under distribution shifts, while entropy maximization is widely applied to enhance OOD detection. Several studies have sought to combine these objectives to tackle the challenges of OSTTA. However, the intrinsic conflict between entropy minimization and maximization inevitably leads to a trade-off between csID classification and csOOD detection. In this paper, we first analyze the limitations of entropy maximization in OSTTA and then introduce an angular loss to regulate feature norm magnitudes, along with a feature-norm loss to suppress csOOD logits, thereby improving OOD detection. These objectives form ROSETTA, a Robust Open-SEt Test-Time Adaptation method. Our method achieves strong OOD detection while maintaining high ID classification performance on CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C and ImageNet-C.
Furthermore, experiments on Cityscapes validate the method's effectiveness in real-world semantic segmentation, and results on the HAC dataset demonstrate its applicability across different open-set TTA setups.
[92] Riemannian and Symplectic Geometry for Hierarchical Text-Driven Place Recognition
Tianyi Shang,Zhenyu Li
Main category: cs.CV
TL;DR: 本文提出SympLoc框架,通过粗到细的多级对齐策略(实例级、关系级和全局级)提升文本到点云定位性能,在KITTI360Pose数据集上Top-1召回率@10m提升19%。
Details
Motivation: Existing methods rely on pooled global descriptors for similarity retrieval, which causes severe information loss and fails to capture discriminative scene structures. Method: The coarse stage of SympLoc has three complementary alignment levels: 1) instance-level alignment uses Riemannian self-attention in hyperbolic space to establish direct correspondence between object instances in the point cloud and textual hints; 2) relation-level alignment uses the Information-Symplectic Relation Encoder (ISRE), which models pairwise spatial relationships between objects through the Fisher-Rao metric and Hamiltonian dynamics; 3) global-level alignment uses the Spectral Manifold Transform (SMT) to extract structural invariants through graph spectral analysis and produce discriminative global descriptors. Result: On KITTI360Pose, SympLoc improves Top-1 recall@10m by 19% over existing state-of-the-art approaches. Conclusion: Multi-level, geometrically consistent alignment improves the robustness and accuracy of text-to-point-cloud localization and offers a new approach to language-driven spatial understanding. Abstract: Text-to-point-cloud localization enables robots to understand spatial positions through natural language descriptions, which is crucial for human-robot collaboration in applications such as autonomous driving and last-mile delivery. However, existing methods employ pooled global descriptors for similarity retrieval, which suffer from severe information loss and fail to capture discriminative scene structures. To address these issues, we propose SympLoc, a novel coarse-to-fine localization framework with multi-level alignment in the coarse stage. Different from previous methods that rely solely on global descriptors, our coarse stage consists of three complementary alignment levels: 1) Instance-level alignment establishes direct correspondence between individual object instances in point clouds and textual hints through Riemannian self-attention in hyperbolic space; 2) Relation-level alignment explicitly models pairwise spatial relationships between objects using the Information-Symplectic Relation Encoder (ISRE), which reformulates relation features through Fisher-Rao metric and Hamiltonian dynamics for uncertainty-aware geometrically consistent propagation; 3) Global-level alignment synthesizes discriminative global descriptors via the Spectral Manifold Transform (SMT) that extracts structural invariants through graph spectral analysis. This hierarchical alignment strategy progressively captures fine-grained to coarse-grained scene semantics, enabling robust cross-modal retrieval.
Extensive experiments on the KITTI360Pose dataset demonstrate that SympLoc achieves a 19% improvement in Top-1 recall@10m compared to existing state-of-the-art approaches.
[93] Towards Minimal Focal Stack in Shape from Focus
Khurram Ashfaq,Muhammad Tariq Mahmood
Main category: cs.CV
TL;DR: 本文提出了一种基于物理的双图焦点堆栈增强方法,结合全焦图像(AiF)和差分能量图(EOD),并设计了多尺度ConvGRU深度网络,使Shape from Focus(SFF)方法仅用两张图像即可实现高精度深度估计。
Details
Motivation: Existing SFF methods rely on large, densely sampled focal stacks, which limits their practical applicability; the number of input images needs to be reduced without losing precision. Method: A physics-based focal stack augmentation estimates an AiF image from the two input images and computes EOD maps as the energy of differences between the AiF and the input images; a deep network then computes a deep focus volume from the augmented stack and iteratively refines depth with ConvGRUs at multiple scales. Result: On synthetic and real-world datasets, the augmentation lets existing state-of-the-art SFF models reach comparable accuracy with only two images, and the proposed approach maintains state-of-the-art performance with a minimal stack size. Conclusion: Two-image focal stack augmentation is an efficient, practical way to lighten SFF and makes SFF methods easier to deploy. Abstract: Shape from Focus (SFF) is a depth reconstruction technique that estimates scene structure from focus variations observed across a focal stack, that is, a sequence of images captured at different focus settings. A key limitation of SFF methods is their reliance on densely sampled, large focal stacks, which limits their practical applicability. In this study, we propose a focal stack augmentation that enables SFF methods to estimate depth using a reduced stack of just two images, without sacrificing precision. We introduce a simple yet effective physics-based focal stack augmentation that enriches the stack with two auxiliary cues: an all-in-focus (AiF) image estimated from two input images, and Energy-of-Difference (EOD) maps, computed as the energy of differences between the AiF and input images. Furthermore, we propose a deep network that computes a deep focus volume from the augmented focal stacks and iteratively refines depth using convolutional Gated Recurrent Units (ConvGRUs) at multiple scales. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed augmentation benefits existing state-of-the-art SFF models, enabling them to achieve comparable accuracy. The results also show that our approach maintains state-of-the-art performance with a minimal stack size.
[94] F3DGS: Federated 3D Gaussian Splatting for Decentralized Multi-Agent World Modeling
Morui Zhu,Mohammad Dehghani Tezerjani,Mátyás Szántó,Márton Vaitkus,Song Fu,Qing Yang
Main category: cs.CV
TL;DR: F3DGS是一种面向去中心化多智能体3D重建的联邦3D高斯泼溅框架,通过共享几何骨架初始化和可见性感知的联邦优化,在保证几何一致性的同时实现分布式训练。
Details
Motivation: Existing 3DGS pipelines assume centralized access to all observations and are hard to apply to distributed robotic systems; directly extending centralized training to multi-agent settings introduces communication overhead and geometric inconsistency. Method: F3DGS first registers locally merged LiDAR point clouds from multiple clients to build a shared geometric scaffold that initializes the global 3DGS model. During federated optimization, Gaussian positions are fixed to preserve geometric alignment and each client updates only appearance-related attributes (covariance, opacity, spherical harmonic coefficients); the server uses visibility-aware aggregation, weighting each client's contribution by how frequently it observed each Gaussian. Result: On a self-collected multi-sequence indoor dataset with synchronized LiDAR, RGB, and IMU measurements, F3DGS achieves reconstruction quality comparable to centralized training while enabling distributed optimization across agents. Conclusion: F3DGS addresses distributed data, partial observability, and geometric consistency in multi-agent 3D reconstruction, and offers a new approach to applying federated learning in 3D vision. Abstract: We present F3DGS, a federated 3D Gaussian Splatting framework for decentralized multi-agent 3D reconstruction. Existing 3DGS pipelines assume centralized access to all observations, which limits their applicability in distributed robotic settings where agents operate independently, and centralized data aggregation may be restricted. Directly extending centralized training to multi-agent systems introduces communication overhead and geometric inconsistency. F3DGS first constructs a shared geometric scaffold by registering locally merged LiDAR point clouds from multiple clients to initialize a global 3DGS model. During federated optimization, Gaussian positions are fixed to preserve geometric alignment, while each client updates only appearance-related attributes, including covariance, opacity, and spherical harmonic coefficients. The server aggregates these updates using visibility-aware aggregation, weighting each client's contribution by how frequently it observed each Gaussian, resolving the partial-observability challenge inherent to multi-agent exploration. To evaluate decentralized reconstruction, we collect a multi-sequence indoor dataset with synchronized LiDAR, RGB, and IMU measurements. Experiments show that F3DGS achieves reconstruction quality comparable to centralized training while enabling distributed optimization across agents. The dataset, development kit, and source code will be publicly released.
[95] NEMESIS: Noise-suppressed Efficient MAE with Enhanced Superpatch Integration Strategy
Kyeonghun Kim,Hyeonseok Jung,Youngung Han,Hyunsu Go,Eunseob Choi,Seongbin Park,Junsu Lim,Jiwon Yang,Sumin Lee,Insung Hwang,Ken Ying-Kai Liao,Nam-Joon Kim
Main category: cs.CV
TL;DR: 本文提出NEMESIS,一种面向3D CT影像的内存高效自监督学习框架,通过局部超块(128x128x128)处理、噪声增强重建、双路径解剖掩码Transformer模块(MATB)及跨尺度NEMESIS Tokens(NT),显著提升小样本与无监督表征性能。
Details
Motivation: Annotating 3D CT volumes is expensive, which motivates self-supervised learning; however, full-volume transformers have a high memory cost, and conventional masking strategies do not capture the anisotropic spatial structure of CT data well. Method: NEMESIS is a masked autoencoder that operates on local 128x128x128 superpatches; it uses noise-enhanced reconstruction as a pretext task, Masked Anatomical Transformer Blocks (MATB) that mask plane-wise and axis-wise in parallel, and NEMESIS Tokens (NT) that aggregate cross-scale context. Result: On the BTCV multi-organ classification benchmark, a frozen backbone with a linear classifier reaches a mean AUROC of 0.9633, surpassing fully fine-tuned SuPreM (0.9493) and VoCo (0.9387); with only 10% of annotations the AUROC is still 0.9075; computational cost falls to 31.0 GFLOPs per forward pass, compared with 985.8 GFLOPs for the full-volume baseline. Conclusion: With structured masking and a lightweight superpatch design, NEMESIS preserves anatomical detail while greatly reducing computation and memory cost, providing a scalable and robust foundation for self-supervised learning on 3D medical images. Abstract: Volumetric CT imaging is essential for clinical diagnosis, yet annotating 3D volumes is expensive and time-consuming, motivating self-supervised learning (SSL) from unlabeled data. However, applying SSL to 3D CT remains challenging due to the high memory cost of full-volume transformers and the anisotropic spatial structure of CT data, which is not well captured by conventional masking strategies. We propose NEMESIS, a masked autoencoder (MAE) framework that operates on local 128x128x128 superpatches, enabling memory-efficient training while preserving anatomical detail. NEMESIS introduces three key components: (i) noise-enhanced reconstruction as a pretext task, (ii) Masked Anatomical Transformer Blocks (MATB) that perform dual-masking through parallel plane-wise and axis-wise token removal, and (iii) NEMESIS Tokens (NT) for cross-scale context aggregation. On the BTCV multi-organ classification benchmark, NEMESIS with a frozen backbone and a linear classifier achieves a mean AUROC of 0.9633, surpassing fully fine-tuned SuPreM (0.9493) and VoCo (0.9387). Under a low-label regime with only 10% of available annotations, it retains an AUROC of 0.9075, demonstrating strong label efficiency. Furthermore, the superpatch-based design reduces computational cost to 31.0 GFLOPs per forward pass, compared to 985.8 GFLOPs for the full-volume baseline, providing a scalable and robust foundation for 3D medical imaging.
[96] Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models
Jiawei Chen,Simin Huang,Jiawei Du,Shuaihang Chen,Yu Tian,Mingjie Wei,Chao Yu,Zhaoxia Yin
Main category: cs.CV
TL;DR: 本文提出Tex3D框架,首次实现面向视觉-语言-动作(VLA)模型的端到端3D对抗纹理优化,通过Foreground-Background Decoupling(FBD)和Trajectory-Aware Adversarial Optimization(TAAO)解决3D渲染不可微与跨视角鲁棒性难题,在仿真与真机实验中使VLA任务失败率高达96.7%,揭示了VLA系统在物理世界中的关键脆弱性。
Details
Motivation: The robustness of VLA models to physically realizable adversarial attacks such as 3D texture attacks remains underexplored; 2D visual and language perturbations have limited physical realism, whereas adversarial 3D textures are closer to real deployment threats but lack an end-to-end differentiable optimization method. Method: Foreground-Background Decoupling (FBD) enables differentiable texture optimization, and Trajectory-Aware Adversarial Optimization (TAAO) stabilizes the attack over long horizons and diverse viewpoints; together they form Tex3D, which optimizes adversarial 3D textures directly within the VLA simulation environment. Result: In simulation and on real robots, Tex3D significantly degrades performance across multiple manipulation tasks, with task failure rates of up to 96.7%, confirming that adversarial 3D textures are both effective against VLA systems and physically feasible. Conclusion: VLA systems are highly vulnerable to physically realizable adversarial 3D textures, which highlights the need for robustness-aware training; Tex3D provides the first end-to-end, physically grounded framework for 3D texture attacks with which to evaluate and improve VLA robustness. Abstract: Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, as they are naturally attached to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable optimization path from the VLA objective function back to object appearance, making it difficult to optimize in an end-to-end manner. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective across long-horizon and diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment.
Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.
[97] Automatic Image-Level Morphological Trait Annotation for Organismal Images
Vardaan Pahuja,Samuel Stevens,Alyson East,Sydne Record,Yu Su
Main category: cs.CV
TL;DR: 本文提出了一种基于稀疏自编码器和视觉-语言提示的自动化形态性状标注方法,构建了包含8万条性状标注的Bioscan-Traits数据集,提升了大规模生态研究中形态性状分析的可扩展性与生物学合理性。
Details
Motivation: Morphological traits are vital to ecological research, but the current expert-driven manual extraction is slow and hard to scale, and high-quality datasets pairing images with trait annotations are lacking. Method: Sparse autoencoders are trained on foundation-model features to obtain monosemantic, spatially grounded neurons; salient-region localization is combined with vision-language prompting to generate interpretable trait descriptions; the Bioscan-Traits dataset is constructed and a systematic ablation study is conducted. Result: Bioscan-Traits contains 80K trait annotations spanning 19K insect images; human evaluation confirms the biological plausibility of the generated descriptions; the ablation study measures how key design choices affect the quality of the trait descriptions. Conclusion: The modular automatic annotation pipeline can replace prohibitively expensive manual annotation, offers a scalable route to large-scale morphological analyses, and bridges the gap between ecological relevance and machine-learning practicality. Abstract: Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. We assess design sensitivity through a comprehensive ablation study, systematically varying key design choices and measuring their impact on the quality of the resulting trait descriptions. By annotating traits with a modular pipeline rather than prohibitively expensive manual efforts, we offer a scalable way to inject biologically meaningful supervision into foundation models, enable large-scale morphological analyses, and bridge the gap between ecological relevance and machine-learning practicality.
[98] LivingWorld: Interactive 4D World Generation with Environmental Dynamics
Hyeongju Mun,In-Hwan Jin,Sohyeong Kim,Kyeongbo Kong
Main category: cs.CV
TL;DR: 本文提出LivingWorld框架,通过单张图像生成具有环境动态(如云、水、烟)的4D世界,采用几何感知对齐模块和哈希运动场实现全局一致、低延迟的交互式动态场景生成。
Details
Motivation: Most existing 3D scene generation methods focus on reconstructing static geometry and leave scene-scale environmental dynamics (such as clouds, water, or smoke) largely unexplored; modeling such dynamics is challenging because motion must stay globally coherent while supporting low-latency user interaction. Method: LivingWorld 1) progressively constructs a globally coherent motion field; 2) introduces a geometry-aware alignment module that resolves directional and scale ambiguities across views; 3) represents motion with a compact hash-based motion field that supports efficient querying, stable propagation of dynamics, and bidirectional motion propagation during rendering. Result: On a single RTX 5090 GPU, each scene expansion step takes 9 seconds, followed by 3 seconds for motion alignment and motion field updates, enabling interactive 4D world generation; the method produces long, temporally coherent 4D sequences without expensive video-based refinement. Conclusion: LivingWorld enables interactive 4D world generation from a single image with globally coherent environmental dynamics, offering a new approach to building dynamic virtual environments. Abstract: We introduce LivingWorld, an interactive framework for generating 4D worlds with environmental dynamics from a single image. While recent advances in 3D scene generation enable large-scale environment creation, most approaches focus primarily on reconstructing static geometry, leaving scene-scale environmental dynamics such as clouds, water, or smoke largely unexplored. Modeling such dynamics is challenging because motion must remain coherent across an expanding scene while supporting low-latency user feedback. LivingWorld addresses this challenge by progressively constructing a globally coherent motion field as the scene expands. To maintain global consistency during expansion, we introduce a geometry-aware alignment module that resolves directional and scale ambiguities across views. We further represent motion using a compact hash-based motion field, enabling efficient querying and stable propagation of dynamics throughout the scene. This representation also supports bidirectional motion propagation during rendering, producing long and temporally coherent 4D sequences without relying on expensive video-based refinement. On a single RTX 5090 GPU, generating each new scene expansion step requires 9 seconds, followed by 3 seconds for motion alignment and motion field updates, enabling interactive 4D world generation with globally coherent environmental dynamics. Video demonstrations are available at cvsp-lab.github.io/LivingWorld.
[99] TOL: Textual Localization with OpenStreetMap
Youqi Liao,Shuhao Kang,Jingyu Xu,Olaf Wysocki,Yan Xia,Jianping Li,Zhen Dong,Bisheng Yang,Xieyuanli Chen
Main category: cs.CV
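TL;DR: This paper introduces text-to-OSM (T2O) global localization, a new task that localizes natural-language scene descriptions on OpenStreetMap (OSM); builds TOL, a large-scale multi-city benchmark for it; and proposes TOLoc, a coarse-to-fine localization framework that substantially outperforms existing methods at several accuracy thresholds.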
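The results for this paper are quoted at 5 m, 10 m, and 25 m thresholds. A common way to compute such numbers is the fraction of queries whose predicted position lies within each distance of the ground truth. The sketch below assumes that definition (the paper may define its metric differently); the function name and the sample coordinates are hypothetical.

```python
import math

def accuracy_at_thresholds(predicted, ground_truth, thresholds=(5.0, 10.0, 25.0)):
    """Fraction of 2D position estimates within each distance threshold (metres)."""
    errors = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return {t: sum(e <= t for e in errors) / len(errors) for t in thresholds}

# Hypothetical 2-DoF (x, y) predictions and ground-truth positions, in metres.
pred = [(0.0, 0.0), (12.0, 0.0), (3.0, 4.0), (40.0, 0.0)]
truth = [(0.0, 3.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
result = accuracy_at_thresholds(pred, truth)
print(result)  # errors are 3 m, 12 m, 5 m, 40 m
```

Under this definition the accuracy can only stay flat or grow as the threshold loosens, which is a quick sanity check on any reported table.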
Details
Motivation: Existing localization methods rely on dense point clouds or high-resolution imagery, whereas OSM is semantically rich, compact, and freely available, yet text-to-OSM (T2O) localization remains largely unexplored. The goal is large-scale, semantics-driven 2-DoF urban localization without geometric observations or a GNSS-based initial location.
Method: The TOLoc framework works in two stages. The coarse stage extracts direction-aware global descriptors from the text and from OSM tiles to retrieve candidate locations. The fine stage fuses text and local map features through a dedicated alignment module to regress the 2-DoF pose. The authors also build TOL, a multi-continent, multi-city benchmark (about 121K text-map pairs covering about 316 km of road trajectories).
Result: TOLoc outperforms the best baseline by 6.53%, 9.93%, and 8.31% at the 5 m, 10 m, and 25 m thresholds, respectively, and generalizes well to unseen environments.
Conclusion: T2O localization is a feasible and effective paradigm; TOLoc shows that explicitly modeling semantics and directional information matters for text-driven map localization; and the TOL benchmark provides a foundation for this line of work.
Abstract: Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O global localization task, which aims to estimate accurate 2 degree-of-freedom (DoF) positions in urban environments from textual scene descriptions without relying on geometric observations or GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and top-1 retrieved tile are jointly processed, where a dedicated alignment module fuses textual descriptor and local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.31% at 5m, 10m, and 25m thresholds, respectively, and shows strong generalization to unseen environments. Dataset, code and models will be publicly available at: https://github.com/WHU-USI3DV/TOL.
[100] MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label
Junyoung Jung,Seokwon Kim,Jun Uk Kim
Main category: cs.CV
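TL;DR: This paper proposes a framework for sparsely annotated monocular 3D object detection with two core modules, Road-Aware Patch Augmentation (RAPA) and Prototype-Based Filtering (PBF), to improve detection when only a small fraction of objects carry 3D labels.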
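The abstract below describes PBF as keeping pseudo-labels that are feature-consistent with learned prototypes and have reliable depth estimates. A minimal sketch of that kind of two-condition filter follows. Everything concrete in it is an assumption for illustration: the cosine-similarity measure, the dictionary layout of a prediction, and both threshold values are hypothetical and not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_pseudo_labels(predictions, prototypes, sim_thresh=0.8, max_depth_unc=0.2):
    """Keep predictions that resemble their class prototype AND have low depth uncertainty."""
    kept = []
    for pred in predictions:
        sim = cosine(pred["feature"], prototypes[pred["cls"]])
        if sim >= sim_thresh and pred["depth_unc"] <= max_depth_unc:
            kept.append(pred)
    return kept

# Hypothetical 2D RoI features; real features would be high-dimensional.
prototypes = {"car": [1.0, 0.0]}
preds = [
    {"cls": "car", "feature": [0.9, 0.1], "depth_unc": 0.05},  # similar, reliable depth
    {"cls": "car", "feature": [0.1, 0.9], "depth_unc": 0.05},  # dissimilar feature
    {"cls": "car", "feature": [1.0, 0.0], "depth_unc": 0.50},  # unreliable depth
]
kept = filter_pseudo_labels(preds, prototypes)
print(len(kept))
```

Requiring both conditions at once is the conservative choice: a prediction that looks like a car but has an uncertain depth would still inject a badly placed 3D box into training.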
Details
Motivation: Monocular 3D object detection performs well on densely annotated datasets, but 3D annotation is costly, so in practice often only sparse annotations are available, which degrades model performance.
Method: Road-Aware Patch Augmentation (RAPA) augments object patches onto road regions in a geometrically consistent way. Prototype-Based Filtering (PBF) selects high-quality pseudo-labels using similarity to 2D RoI feature prototypes together with depth uncertainty.
Result: Experiments in the sparsely annotated setting demonstrate the effectiveness of the method.
Conclusion: Combining geometry-preserving augmentation with prototype-guided pseudo-labeling mitigates the lack of supervision caused by sparse annotation and yields more robust detection.
Abstract: Monocular 3D object detection has achieved impressive performance on densely annotated datasets. However, it struggles when only a fraction of objects are labeled due to the high cost of 3D annotation. This sparsely annotated setting is common in real-world scenarios where annotating every object is impractical. To address this, we propose a novel framework for sparsely annotated monocular 3D object detection with two key modules. First, we propose Road-Aware Patch Augmentation (RAPA), which leverages sparse annotations by augmenting segmented object patches onto road regions while preserving 3D geometric consistency. Second, we propose Prototype-Based Filtering (PBF), which generates high-quality pseudo-labels by filtering predictions through prototype similarity and depth uncertainty. It maintains global 2D RoI feature prototypes and selects pseudo-labels that are both feature-consistent with learned prototypes and have reliable depth estimates. Our training strategy combines geometry-preserving augmentation with prototype-guided pseudo-labeling to achieve robust detection under sparse supervision. Extensive experiments demonstrate the effectiveness of the proposed method. The source code is available at https://github.com/VisualAIKHU/MonoSAOD.
[101] Moiré Video Authentication: A Physical Signature Against AI Video Generation
Yuan Qing,Kunyu Zheng,Lingxiao Li,Boqing Gong,Chang Xiao
Main category: cs.CV
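TL;DR: This paper proposes a physics-based authentication signature built on the Moiré effect. It exploits an optical phenomenon that arises naturally when a real camera films a grating (a linear coupling between Moiré fringe phase and grating image displacement) to distinguish real videos from AI-generated ones. The signature is consistently present in real footage, while current generative models do not reproduce it faithfully.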
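The verifier is described as extracting two signals from the video and testing their correlation. Assuming both signals are already available as per-frame sequences (the extraction itself is the hard part and is not shown), a linear-coupling check can be sketched with a plain Pearson correlation. The decision threshold `min_abs_r` and the sample values are hypothetical, and the phase is assumed to be unwrapped.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def looks_linearly_coupled(phase, displacement, min_abs_r=0.9):
    """True when fringe phase tracks grating displacement almost linearly."""
    return abs(pearson(phase, displacement)) >= min_abs_r

# Hypothetical per-frame measurements.
displacement = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
coupled_phase = [0.5 * d + 0.1 for d in displacement]   # follows the displacement
unrelated_phase = [0.3, -0.2, 0.4, -0.1, 0.2, -0.3]     # does not follow it

print(looks_linearly_coupled(coupled_phase, displacement))
print(looks_linearly_coupled(unrelated_phase, displacement))
```

The absolute value is used because the sign of the coupling depends on conventions for phase and displacement direction; only the strength of the linear relation matters for this check.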
Details
Motivation: As video generation improves, AI-synthesized content is increasingly hard to tell apart from real footage, which calls for an authentication method that is physically verifiable and hard for generative models to fake.
Method: Using the Moiré effect that arises naturally in real camera imaging, the authors derive the Moiré motion invariant: fringe phase and grating image displacement are linearly coupled by optical geometry, independent of viewing distance and grating structure. A verifier extracts both signals from the video and tests their correlation.
Result: On real-captured videos and AI-generated videos from multiple state-of-the-art generators, the two classes produce significantly different correlation signatures.
Conclusion: Deterministic optical phenomena such as the Moiré motion invariant can serve as physically grounded, verifiable signatures of authenticity against AI-generated video.
Abstract: Recent advances in video generation have made AI-synthesized content increasingly difficult to distinguish from real footage. We propose a physics-based authentication signature that real cameras produce naturally, but that generative models cannot faithfully reproduce. Our approach exploits the Moiré effect: the interference fringes formed when a camera views a compact two-layer grating structure. We derive the Moiré motion invariant, showing that fringe phase and grating image displacement are linearly coupled by optical geometry, independent of viewing distance and grating structure. A verifier extracts both signals from video and tests their correlation. We validate the invariant on both real-captured and AI-generated videos from multiple state-of-the-art generators, and find that real and AI-generated videos produce significantly different correlation signatures, suggesting a robust means of differentiating them. Our work demonstrates that deterministic optical phenomena can serve as physically grounded, verifiable signatures against AI-generated video.
[102] DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data
Wonjoon Jin,Jiyun Won,Janghyeok Han,Qi Dai,Chong Luo,Seung-Hwan Baek,Sunghyun Cho
Main category: cs.CV
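TL;DR: DynaVid is a video synthesis framework that trains on synthetic motion data represented as optical flow (rather than full rendered videos), decoupling motion modeling from appearance modeling to improve the realism and motion controllability of dynamic video generation.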
Details
Motivation: Video diffusion models are limited by the scarcity of highly dynamic or finely controllable motion in real training data. Real data rarely offers diverse motion patterns and precise control signals, while training directly on synthetic videos introduces an unnatural appearance.
Method: A two-stage framework: (1) a motion generator synthesizes motion as optical flow, learned from synthetic motion data rendered with computer graphics pipelines; (2) a motion-guided video generator produces realistic frames conditioned on that flow. The key idea is to learn motion only from synthetic optical flow, which carries no appearance information, so that synthetic appearance does not leak into the model.
Result: On two challenging tasks, vigorous human motion generation and extreme camera motion control, experiments show improved realism and controllability.
Conclusion: Synthetic optical flow provides a controllable, diverse, appearance-independent motion supervision signal for video diffusion models; decoupling motion learning from appearance learning preserves visual realism while improving controllability.
Abstract: Despite recent progress, video diffusion models still struggle to synthesize realistic videos involving highly dynamic motions or requiring fine-grained motion controllability. A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data in training, which is represented as optical flow and rendered using computer graphics pipelines. This approach offers two key advantages. First, synthetic motion offers diverse motion patterns and precise control signals that are difficult to obtain from real data. Second, unlike rendered videos with artificial appearances, rendered optical flow encodes only motion and is decoupled from appearance, thereby preventing models from reproducing the unnatural look of synthetic videos. Building on this idea, DynaVid adopts a two-stage generation framework: a motion generator first synthesizes motion, and then a motion-guided video generator produces video frames conditioned on that motion. This decoupled formulation enables the model to learn dynamic motion patterns from synthetic data while preserving visual realism from real-world videos. We validate our framework on two challenging scenarios, vigorous human motion generation and extreme camera motion control, where existing datasets are particularly limited. Extensive experiments demonstrate that DynaVid improves the realism and controllability in dynamic motion generation and camera motion control.
[103] Robust Embodied Perception in Dynamic Environments via Disentangled Weight Fusion
Juncen Guo,Xiaoguang Zhu,Jingyi Wu,Jingyu Zhang,Jingnan Cai,Zhenghao Niu,Liang Song
Main category: cs.CV
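TL;DR: This paper proposes an incremental learning framework that needs neither domain IDs nor stored exemplars, using disentangled representations and a weight fusion strategy to improve the continual adaptation and generalization of embodied multimedia systems in dynamic environments.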
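Weight fusion in parameter space can be pictured, in its simplest form, as a per-parameter blend of the old and new models. The sketch below shows only that simplest form. The paper describes a dynamic integration strategy whose details are not given in this digest, so the fixed coefficient `alpha`, the function name, and the plain-list parameter format are all assumptions.

```python
def fuse_weights(old_params, new_params, alpha=0.5):
    """Blend two parameter sets entry by entry: alpha * old + (1 - alpha) * new."""
    fused = {}
    for name, old_values in old_params.items():
        new_values = new_params[name]
        fused[name] = [alpha * o + (1.0 - alpha) * n
                       for o, n in zip(old_values, new_values)]
    return fused

# Hypothetical flattened parameters before and after adapting to a new environment.
old = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
new = {"layer.weight": [3.0, 0.0], "layer.bias": [1.0]}
fused = fuse_weights(old, new, alpha=0.25)
print(fused)
```

Because the blend needs only the two parameter sets, it requires no stored samples from earlier environments, which matches the exemplar-free setting the paper targets.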
Details
Motivation: Existing domain-incremental perception methods rely on a domain ID known in advance at test time and tend to overfit to scene-specific perceptual noise, leading to poor generalization and catastrophic forgetting.
Method: A disentangled representation mechanism removes environmental style interference and focuses on intrinsic semantic features. A weight fusion strategy dynamically integrates old and new environment knowledge in parameter space, without storing historical data or domain IDs.
Result: On multiple standard benchmark datasets, the method significantly reduces catastrophic forgetting and achieves higher accuracy than existing state-of-the-art methods in a fully exemplar-free, domain-ID-free setting.
Conclusion: The framework enables robust, continual environment adaptation for embodied perception systems in open physical spaces, balancing generalization with retention of discriminative ability on old environments.
Abstract: Embodied perception systems face severe challenges of dynamic environment distribution drift when they continuously interact in open physical spaces. However, the existing domain incremental awareness methods often rely on the domain id obtained in advance during the testing phase, which limits their practicability in unknown interaction scenarios. At the same time, the model often overfits to the context-specific perceptual noise, which leads to insufficient generalization ability and catastrophic forgetting. To address these limitations, we propose a domain-id and exemplar-free incremental learning framework for embodied multimedia systems, which aims to achieve robust continuous environment adaptation. This method designs a disentangled representation mechanism to remove non-essential environmental style interference, and guide the model to focus on extracting semantic intrinsic features shared across scenes, thereby eliminating perceptual uncertainty and improving generalization. We further use the weight fusion strategy to dynamically integrate the old and new environment knowledge in the parameter space, so as to ensure that the model adapts to the new distribution without storing historical data and maximally retains the discrimination ability of the old environment. Extensive experiments on multiple standard benchmark datasets show that the proposed method significantly reduces catastrophic forgetting in a completely exemplar-free and domain-id free setting, and its accuracy is better than the existing state-of-the-art methods.
[104] HOT: Harmonic-Constrained Optimal Transport for Remote Photoplethysmography Domain Adaptation
Ba-Thinh Nguyen,Thi-Duyen Ngo,Thanh-Trung Huynh,Thanh-Ha Le,Huy-Hieu Pham
Main category: cs.CV
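TL;DR: This paper proposes frequency domain adaptation (FDA) and Harmonic-Constrained Optimal Transport (HOT) to improve the robustness and generalization of remote photoplethysmography (rPPG) models under domain shift.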
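The abstract below describes FDA as transferring low-frequency spectral components between domains. One common reading of such a transfer, borrowed from Fourier-based domain adaptation in general rather than confirmed by this paper, is to give the source the low-frequency amplitudes of the target while keeping the source phase. The sketch applies that reading to a short 1-D signal for brevity (the paper works on facial video); the function names and the `band` parameter are hypothetical.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform, fine for short illustrative signals."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def transfer_low_frequencies(source, target, band=1):
    """Give `source` the low-frequency amplitudes of `target`, keeping the source phase."""
    src, tgt = dft(source), dft(target)   # both signals must have the same length
    n = len(src)
    # Bins 0..band plus their mirrored counterparts, so the output stays real-valued.
    low = {k % n for k in range(-band, band + 1)}
    mixed = [cmath.rect(abs(tgt[k]), cmath.phase(src[k])) if k in low else src[k]
             for k in range(n)]
    return idft(mixed)

source = [1.0, 2.0, 3.0, 4.0]
target = [5.0, 5.0, 7.0, 7.0]
# band=0 swaps only the DC term, i.e. the overall brightness level.
shifted = transfer_low_frequencies(source, target, band=0)
print(shifted)
```

With `band=0` the result is the source signal moved to the target's mean level, with every faster variation left untouched, which mirrors the stated aim of changing appearance while retaining the cardiac-induced signal.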
Details
Motivation: Deep learning rPPG methods tend to overfit to appearance-related factors such as illumination and camera characteristics, so performance drops markedly across domains.
Method: Frequency domain adaptation (FDA) models appearance variation, and Harmonic-Constrained Optimal Transport (HOT) uses the harmonic property of cardiac signals to align the original and FDA-transferred representations.
Result: Cross-dataset experiments on multiple datasets show that the FDA and HOT framework improves the robustness and generalization of rPPG models.
Conclusion: Combining FDA and HOT helps separate appearance variation from the physiological signal, making rPPG models more practical in varied real-world conditions.
Abstract: Remote photoplethysmography (rPPG) enables non-contact physiological measurement from facial videos; however, its practical deployment is often hindered by substantial performance degradation under domain shift. While recent deep learning-based rPPG methods have achieved strong performance on individual datasets, they frequently overfit to appearance-related factors, such as illumination, camera characteristics, and color response, that vary significantly across domains. To address this limitation, we introduce frequency domain adaptation (FDA) as a principled strategy for modeling appearance variation in rPPG. By transferring low-frequency spectral components that encode domain-dependent appearance characteristics, FDA encourages rPPG models to learn invariance to appearance variations while retaining cardiac-induced signals. To further support physiologically consistent alignment under such appearance variation, we propose Harmonic-Constrained Optimal Transport (HOT), which leverages the harmonic property of cardiac signals to guide alignment between original and FDA-transferred representations. Extensive cross-dataset experiments demonstrate that the proposed FDA and HOT framework effectively enhances the robustness and generalization of rPPG models across diverse datasets.
[105] GPA: Learning GUI Process Automation from Demonstrations
Zirui Zhao,Jun Hao Liew,Yan Yang,Wenzhuo Yang,Ziyang Luo,Doyen Sahoo,Silvio Savarese,Junnan Li
Main category: cs.CV
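TL;DR: This paper proposes GUI Process Automation (GPA), a lightweight vision-based approach to robotic process automation that combines Sequential Monte Carlo localization, readiness calibration, and fully local execution for robust, deterministic, and privacy-preserving GUI task automation. In a pilot experiment it outperforms Gemini 3 Pro.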
Details
Motivation: To address the fragility of traditional RPA and the non-determinism risks of current vision-language-model GUI agents, and to meet the need of enterprise workflows for adaptability, robustness, and security.
Method: Sequential Monte Carlo-based localization provides robustness; readiness calibration safeguards determinism and reliability; fast, fully local execution protects privacy. GPA can also be used as an MCP/CLI tool by other agents with coding capabilities.
Result: In a pilot experiment on long-horizon GUI tasks, GPA achieves a higher success rate than Gemini 3 Pro (with CUA tools) and executes 10 times faster.
Conclusion: GPA is an efficient, stable, and secure GUI automation approach suited to enterprise use and to cooperation with other agents.
Abstract: GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA), which enables fast and stable process replay with only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision language model-based GUI agents, GPA introduces three core benefits: (1) Robustness via Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty; (2) Determinism and Reliability safeguarded by readiness calibration; and (3) Privacy through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. It can also be used as an MCP/CLI tool by other agents with coding capabilities so that the agent only reasons and orchestrates while GPA handles the GUI execution. We conducted a pilot experiment to compare GPA with Gemini 3 Pro (with CUA tools) and found that GPA achieves a higher success rate with 10 times faster execution speed in finishing long-horizon GUI tasks.
[106] Director: Instance-aware Gaussian Splatting for Dynamic Scene Modeling and Understanding
Yuheng Jiang,Yiwen Cai,Zihao Wang,Yize Wu,Sicheng Li,Zhuo Su,Shaohui Jiao,Lan Xu
Main category: cs.CV
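TL;DR: This paper presents Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics. Language-aligned semantic supervision and optical-flow-guided motion refinement yield stable, interpretable 4D reconstruction of dynamic scenes.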
Details
Motivation: Existing Gaussian-based volumetric video methods emphasize appearance reconstruction and lack instance-level structure, which limits stable tracking and semantic reasoning in dynamic scenes.
Method: A unified spatio-temporal Gaussian representation. Sentence embeddings from multimodal large language models and temporally aligned instance masks supervise each Gaussian's semantic features through two MLP decoders. 2D optical flow is used to refine Gaussian motion for temporal stability. Geometry-aware SDF constraints and surface-continuity regularization are added during training.
Result: Temporally coherent 4D reconstruction with high-fidelity rendering, plus support for instance segmentation and open-vocabulary querying.
Conclusion: Embedding instance-consistent semantics naturally complements 4D modeling; Director offers a dynamic scene representation that combines geometric accuracy, temporal stability, and semantic interpretability.
Abstract: Volumetric video seeks to model dynamic scenes as temporally coherent 4D representations. While recent Gaussian-based approaches achieve impressive rendering fidelity, they primarily emphasize appearance but are largely agnostic to instance-level structure, limiting stable tracking and semantic reasoning in highly dynamic scenarios. In this paper, we present Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics. Our key insight is that embedding instance-consistent semantics naturally complements 4D modeling, enabling more accurate scene decomposition while supporting robust dynamic scene understanding. To this end, we leverage temporally aligned instance masks and sentence embeddings derived from Multimodal Large Language Models to supervise the learnable semantic features of each Gaussian via two MLP decoders, enabling language-aligned 4D representations and enforcing identity consistency over time. To enhance temporal stability, we bridge 2D optical flow with 4D Gaussians and finetune their motions, yielding reliable initialization and reducing drift. For the training, we further introduce geometry-aware SDF constraints, along with regularization terms that enforce surface continuity, enhancing temporal coherence in dynamic foreground modeling. Experiments demonstrate that Director achieves temporally coherent 4D reconstructions while simultaneously enabling instance segmentation and open-vocabulary querying.
[107] BTS-rPPG: Orthogonal Butterfly Temporal Shifting for Remote Photoplethysmography
Ba-Thinh Nguyen,Thi-Duyen Ngo,Thanh-Trung Huynh,Thanh-Ha Le,Huy-Hieu Pham
Main category: cs.CV
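TL;DR: This paper proposes BTS-rPPG, a framework that uses Orthogonal Butterfly Temporal Shifting (BTS) and an orthogonal feature transfer (OFT) mechanism to strengthen long-range temporal modeling and improve contactless rPPG signal estimation.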
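The abstract below mentions an XOR-based butterfly pairing schedule inspired by the FFT. In an FFT-style butterfly, frame `i` at stage `s` is paired with frame `i XOR 2^s`, so the set of frames that can influence any given frame doubles at every stage. The sketch below illustrates only that pairing pattern; the stage order (strides 1, 2, 4, ...) and the power-of-two clip length are assumptions, and the feature shifting itself is not modelled.

```python
def butterfly_schedule(num_frames):
    """For each stage s, list the partner index i XOR 2**s of every frame i."""
    assert num_frames > 0 and num_frames & (num_frames - 1) == 0, "power of two expected"
    stages = num_frames.bit_length() - 1
    return [[i ^ (1 << s) for i in range(num_frames)] for s in range(stages)]

def reachable_after(num_frames, num_stages):
    """Frames whose information can have reached frame 0 after `num_stages` stages."""
    reach = [{i} for i in range(num_frames)]
    for partners in butterfly_schedule(num_frames)[:num_stages]:
        reach = [reach[i] | reach[partners[i]] for i in range(num_frames)]
    return reach[0]

for stage, partners in enumerate(butterfly_schedule(8)):
    print(stage, partners)
print(sorted(reachable_after(8, 2)))
```

For a clip of `T` frames this pattern connects every pair of frames in `log2(T)` stages, whereas a shift between neighbouring frames needs on the order of `T` layers to do the same, which is the receptive-field argument the paper makes.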
Details
Motivation: Deep learning methods for rPPG mostly rely on local temporal operations (temporal shifting or convolution), which limits the temporal receptive field and makes long-range physiological dynamics hard to capture.
Method: Orthogonal Butterfly Temporal Shifting (BTS), inspired by the FFT butterfly pattern, uses an XOR-based pairing schedule to expand frame interactions stage by stage. An orthogonal feature transfer mechanism (OFT) filters the source feature against the target context before shifting, keeping only the orthogonal component.
Result: On multiple benchmark datasets, BTS-rPPG improves long-range temporal modeling and consistently outperforms existing temporal modeling strategies for rPPG estimation.
Conclusion: Structured long-range temporal interaction combined with orthogonal feature transfer eases the local-modeling bottleneck in rPPG and offers a more robust temporal modeling approach for video-based physiological sensing.
Abstract: Remote photoplethysmography (rPPG) enables contactless physiological sensing from facial videos by analyzing subtle appearance variations induced by blood circulation. However, modeling the temporal dynamics of these signals remains challenging, as many deep learning methods rely on temporal shifting or convolutional operators that aggregate information primarily from neighboring frames, resulting in predominantly local temporal modeling and limited temporal receptive fields. To address this limitation, we propose BTS-rPPG, a temporal modeling framework based on Orthogonal Butterfly Temporal Shifting (BTS). Inspired by the butterfly communication pattern in the Fast Fourier Transform (FFT), BTS establishes structured frame interactions via an XOR-based butterfly pairing schedule, progressively expanding the temporal receptive field and enabling efficient propagation of information across distant frames. Furthermore, we introduce an orthogonal feature transfer mechanism (OFT) that filters the source feature with respect to the target context before temporal shifting, retaining only the orthogonal component for cross-frame transmission. This reduces redundant feature propagation and encourages complementary temporal interaction. Extensive experiments on multiple benchmark datasets demonstrate that BTS-rPPG improves long-range temporal modeling of physiological dynamics and consistently outperforms existing temporal modeling strategies for rPPG estimation.
[108] From Understanding to Erasing: Towards Complete and Stable Video Object Removal
Dingming Liu,Wenjing Wang,Chen Li,Jing Lyu
Main category: cs.CV
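TL;DR: This paper proposes a video object removal method that combines external knowledge distillation with an internal context attention mechanism, improving the model's understanding of the target object and its physical effects (such as shadows and reflections) to produce more consistent and realistic results. It also establishes the first real-world benchmark for video object removal.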
Details
Motivation: Diffusion models for video object removal struggle to remove object-induced side effects (shadows, reflections, illumination changes), mainly because they lack sufficient physical and semantic understanding of the target object and its interactions with the scene.
Method: (1) Externally, a distillation scheme transfers the relationships between objects and their induced effects from vision foundation models to the video diffusion model. (2) Internally, a framewise context cross-attention mechanism grounds each denoising block in the unmasked context surrounding the target region.
Result: Experiments show state-of-the-art performance, and the authors build the first real-world benchmark for video object removal.
Conclusion: Combining external and internal understanding improves the realism and spatio-temporal consistency of video object removal and provides a practical benchmark for the task.
Abstract: Video object removal aims to eliminate target objects from videos while plausibly completing missing regions and preserving spatio-temporal consistency. Although diffusion models have recently advanced this task, it remains challenging to remove object-induced side effects (e.g., shadows, reflections, and illumination changes) without compromising overall coherence. This limitation stems from the insufficient physical and semantic understanding of the target object and its interactions with the scene. In this paper, we propose to introduce understanding into erasing from two complementary perspectives. Externally, we introduce a distillation scheme that transfers the relationships between objects and their induced effects from vision foundation models to video diffusion models. Internally, we propose a framewise context cross-attention mechanism that grounds each denoising block in informative, unmasked context surrounding the target region. External and internal guidance jointly enable our model to understand the target object, its induced effects, and the global background context, resulting in clear and coherent object removal. Extensive experiments demonstrate our state-of-the-art performance, and we establish the first real-world benchmark for video object removal to facilitate future research and community progress. Our code, data, and models are available at: https://github.com/WeChatCV/UnderEraser.
[109] Can Video Diffusion Models Predict Past Frames? Bidirectional Cycle Consistency for Reversible Interpolation
Lingyu Liu,Yaxiong Wang,Li Zhu,Zhedong Zheng
Main category: cs.CV
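TL;DR: This paper proposes a bidirectional video frame interpolation framework based on temporal cycle consistency. Learnable directional tokens and a curriculum learning strategy impose a symmetry constraint between forward generation and backward reconstruction during training, improving motion consistency and boundary alignment for long sequences, while inference remains a single efficient forward pass.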
Details
Motivation: Generative video frame interpolation methods are mostly unidirectional and lack a self-verification mechanism, which leads to motion drift, directional ambiguity, and boundary misalignment, especially on long sequences.
Method: A bidirectional cycle-consistent framework. Learnable directional tokens explicitly condition a shared backbone on temporal direction, so that forward synthesis and backward reconstruction are optimized jointly. Curriculum learning moves progressively from short to long sequences. The cycle constraint is applied only during training and adds no inference cost.
Result: State-of-the-art imaging quality, motion smoothness, and dynamic control on 37-frame and 73-frame tasks, outperforming strong baselines with no additional computational overhead.
Conclusion: Temporal cycle consistency is an effective regularizer that improves the logical reversibility and spatio-temporal consistency of interpolation models; the bidirectional framework balances performance and efficiency.
Abstract: Video frame interpolation aims to synthesize realistic intermediate frames between given endpoints while adhering to specific motion semantics. While recent generative models have improved visual fidelity, they predominantly operate in a unidirectional manner, lacking mechanisms to self-verify temporal consistency. This often leads to motion drift, directional ambiguity, and boundary misalignment, especially in long-range sequences. Inspired by the principle of temporal cycle-consistency in self-supervised learning, we propose a novel bidirectional framework that enforces symmetry between forward and backward generation trajectories. Our approach introduces learnable directional tokens to explicitly condition a shared backbone on temporal orientation, enabling the model to jointly optimize forward synthesis and backward reconstruction within a single unified architecture. This cycle-consistent supervision acts as a powerful regularizer, ensuring that generated motion paths are logically reversible. Furthermore, we employ a curriculum learning strategy that progressively trains the model from short to long sequences, stabilizing dynamics across varying durations. Crucially, our cyclic constraints are applied only during training; inference requires a single forward pass, maintaining the high efficiency of the base model. Extensive experiments show that our method achieves state-of-the-art performance in imaging quality, motion smoothness, and dynamic control on both 37-frame and 73-frame tasks, outperforming strong baselines while incurring no additional computational overhead.
[110] Bias mitigation in graph diffusion models
Meng Yu,Kun Zhan
Main category: cs.CV
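TL;DR: This paper proposes a combined approach to improving the generation quality of graph diffusion models: Langevin sampling aligns the reverse starting point with the forward maximum perturbation distribution to mitigate reverse-starting bias, and a score correction mechanism based on a score difference addresses exposure bias.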
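For intuition, generic unadjusted Langevin dynamics can move samples drawn from a standard Gaussian toward a different target distribution using only that target's score (the gradient of its log-density). The sketch below shows this textbook procedure, not the paper's newly designed algorithm. The 1-D Gaussian target with parameters `MU` and `SIGMA` is a hypothetical stand-in for the forward maximum perturbation distribution, and its score is written analytically, whereas in a diffusion model it would come from the trained score network.

```python
import math
import random

def langevin_sample(score, x0, step=0.1, num_steps=300, rng=random):
    """Unadjusted Langevin dynamics: x <- x + (step / 2) * score(x) + sqrt(step) * noise."""
    x = x0
    for _ in range(num_steps):
        x = x + 0.5 * step * score(x) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x

# Hypothetical 1-D target N(MU, SIGMA^2) that differs from the standard Gaussian.
MU, SIGMA = 2.0, 1.5

def score(x):
    # Gradient of log N(x; MU, SIGMA^2) with respect to x.
    return -(x - MU) / SIGMA**2

rng = random.Random(0)
# Start every chain from a standard Gaussian draw, as reverse sampling usually does.
samples = [langevin_sample(score, rng.gauss(0.0, 1.0), rng=rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2))
```

The sample mean and variance end up close to those of the target rather than the starting distribution, which is the effect needed to move the reverse starting point away from a mismatched standard Gaussian.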
Details
Motivation: Graph diffusion models suffer from significant bias. The maximum perturbation distribution of the forward diffusion deviates from the standard Gaussian, yet reverse sampling always starts from a standard Gaussian, which causes reverse-starting bias. Together with the exposure bias inherent to diffusion models, this degrades generation quality.
Method: (1) A newly designed Langevin sampling algorithm aligns the reverse starting point with the forward maximum perturbation distribution, mitigating reverse-starting bias. (2) A score correction mechanism based on a newly defined score difference mitigates exposure bias. The approach requires no network modifications.
Result: Validated across multiple models, datasets, and tasks, achieving state-of-the-art results.
Conclusion: The approach mitigates both reverse-starting bias and exposure bias in graph diffusion models, improves generation quality, and applies to existing models without architectural changes.
Abstract: Most existing graph diffusion models have significant bias problems. We observe that the forward diffusion's maximum perturbation distribution in most models deviates from the standard Gaussian distribution, while reverse sampling consistently starts from a standard Gaussian distribution, which results in a reverse-starting bias. Together with the inherent exposure bias of diffusion models, this results in degraded generation quality. This paper proposes a comprehensive approach to mitigate both biases. To mitigate reverse-starting bias, we employ a newly designed Langevin sampling algorithm to align with the forward maximum perturbation distribution, establishing a new reverse-starting point. To address the exposure bias, we introduce a score correction mechanism based on a newly defined score difference. Our approach, which requires no network modifications, is validated across multiple models, datasets, and tasks, achieving state-of-the-art results. Code is at https://github.com/kunzhan/spp
[111] End-to-End Shared Attention Estimation via Group Detection with Feedback Refinement
Chihiro Nakatani,Norimichi Ukita,Jean-Marc Odobez
Main category: cs.CV
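TL;DR: This paper proposes an end-to-end shared attention (SA) estimation method based on group detection. A two-step process jointly performs group detection and SA heatmap generation and outperforms existing methods.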
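The first step is described as building SA heatmaps from individual gaze heatmaps and group membership scalars. The simplest way to picture that combination is a membership-weighted average of the per-person heatmaps, sketched below. The paper learns this step inside a network, so the weighted average, the function name, and the tiny 2x2 grids are assumptions made for illustration only.

```python
def shared_attention_heatmap(gaze_heatmaps, memberships):
    """Membership-weighted average of per-person gaze heatmaps (each a list of rows)."""
    total = sum(memberships)
    rows, cols = len(gaze_heatmaps[0]), len(gaze_heatmaps[0][0])
    return [[sum(w * h[r][c] for w, h in zip(memberships, gaze_heatmaps)) / total
             for c in range(cols)]
            for r in range(rows)]

# Three people on a 2x2 grid; the first two belong to the group, the third does not.
gaze = [
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.0], [1.0, 0.0]],
]
sa = shared_attention_heatmap(gaze, memberships=[1.0, 1.0, 0.0])
print(sa)
```

Setting the third person's membership to zero removes their gaze target from the map, which shows why estimating memberships per group lets one image contain several shared-attention points instead of a single one.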
Details
Motivation: Previous methods either do not explicitly detect the group of people attending to the same target or assume a single SA point per image, which limits practical applicability and performance.
Method: A two-step process: (i) generate SA heatmaps from individual gaze heatmaps and group membership scalars; (ii) refine the group memberships using the initial SA heatmaps and output the final SA heatmap.
Result: The method outperforms other methods on both group detection and shared attention estimation, and additional analyses validate the proposed components.
Conclusion: Joint modeling improves the practicality and accuracy of SA estimation and offers a new direction for understanding group behavior in real scenes.
Abstract: This paper proposes an end-to-end shared attention estimation method via group detection. Most previous methods estimate shared attention (SA) without detecting the actual group of people focusing on it, or assume that there is a single SA point in a given image. These issues limit the applicability of SA detection in practice and impact performance. To address them, we propose to simultaneously achieve group detection and shared attention estimation using a two-step process: (i) the generation of SA heatmaps relying on individual gaze attention heatmaps and group membership scalars estimated in a group inference; (ii) a refinement of the initial group memberships allowing to account for the initial SA heatmaps, and the final prediction of the SA heatmap. Experiments demonstrate that our method outperforms other methods in group detection and shared attention estimation. Additional analyses validate the effectiveness of the proposed components. Code: https://github.com/chihina/sagd-CVPRW2026.
[112] SteerFlow: Steering Rectified Flows for Faithful Inversion-Based Image Editing
Thinh Dao,Zhen Wang,Kien T. Pham,Long Chen
Main category: cs.CV
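TL;DR: This paper proposes SteerFlow, a model-agnostic framework for text-guided image editing that introduces an Amortized Fixed-Point Solver, Trajectory Interpolation, and an Adaptive Masking mechanism to improve editing quality and background consistency while preserving high fidelity to the source image.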
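Blending a source-reconstruction velocity with a target-editing velocity, and restricting the edit with a spatial mask, can be sketched in a few lines. This is only an illustration of the general idea: SteerFlow chooses its blend adaptively, whereas the fixed `weight`, the way the mask multiplies it, the plain Euler step, and all function names here are assumptions.

```python
def blend_velocity(v_source, v_target, weight, mask=None):
    """Per-element blend (1 - w*m) * v_source + (w*m) * v_target, with mask m in [0, 1]."""
    if mask is None:
        mask = [1.0] * len(v_source)
    return [(1.0 - weight * m) * vs + weight * m * vt
            for vs, vt, m in zip(v_source, v_target, mask)]

def euler_step(x, velocity, dt):
    """One explicit Euler step along the blended flow."""
    return [xi + dt * vi for xi, vi in zip(x, velocity)]

# Two latent elements: the first lies in the region to edit, the second is background.
x = [0.0, 0.0]
v_src, v_tgt = [1.0, 1.0], [3.0, -1.0]
mask = [1.0, 0.0]

v = blend_velocity(v_src, v_tgt, weight=0.5, mask=mask)
x_next = euler_step(x, v, dt=0.5)
print(v, x_next)
```

Where the mask is zero the update follows the source-reconstruction velocity exactly, so background elements retrace the source trajectory, which is the intuition behind anchoring the edit to the source.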
Details
Motivation: Flow-based generative models struggle to balance source fidelity and editing flexibility in text-guided image editing: higher-order solvers add computational cost, truncated inversion limits editability, and feature injection methods lack architectural transferability.
Method: SteerFlow has three components. (1) In the forward process, an Amortized Fixed-Point Solver implicitly straightens the trajectory by enforcing velocity consistency across consecutive timesteps. (2) In the backward process, Trajectory Interpolation adaptively blends target-editing and source-reconstruction velocities. (3) An Adaptive Masking mechanism spatially constrains the edit using concept-guided segmentation and source-target velocity differences.
Result: On FLUX.1-dev and Stable Diffusion 3.5 Medium, SteerFlow consistently achieves better editing quality than existing methods and extends naturally to multi-turn editing without accumulating drift.
Conclusion: SteerFlow is a model-agnostic editing framework with theoretical guarantees on source fidelity that extends to multi-turn editing and addresses the source-fidelity problem in flow-based text-guided editing.
Abstract: Recent advances in flow-based generative models have enabled training-free, text-guided image editing by inverting an image into its latent noise and regenerating it under a new target conditional guidance. However, existing methods struggle to preserve source fidelity: higher-order solvers incur additional model inferences, truncated inversion constrains editability, and feature injection methods lack architectural transferability. To address these limitations, we propose SteerFlow, a model-agnostic editing framework with strong theoretical guarantees on source fidelity. In the forward process, we introduce an Amortized Fixed-Point Solver that implicitly straightens the forward trajectory by enforcing velocity consistency across consecutive timesteps, yielding a high-fidelity inverted latent. In the backward process, we introduce Trajectory Interpolation, which adaptively blends target-editing and source-reconstruction velocities to keep the editing trajectory anchored to the source. To further improve background preservation, we introduce an Adaptive Masking mechanism that spatially constrains the editing signal with concept-guided segmentation and source-target velocity differences. Extensive experiments on FLUX.1-dev and Stable Diffusion 3.5 Medium demonstrate that SteerFlow consistently achieves better editing quality than existing methods. Finally, we show that SteerFlow extends naturally to a complex multi-turn editing paradigm without accumulating drift.
[113] Setup-Independent Full Projector Compensation
Haibo Li,Qingyue Deng,Jijiang Li,Haibin Ling,Bingyao Huang
Main category: cs.CV
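TL;DR: This paper proposes SIComp, the first full projector compensation framework that generalizes to new projection setups without fine-tuning or retraining, enabled by a large-scale real-world dataset and a co-adaptive design that decouples geometric and photometric compensation.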
Details
Motivation: Existing projector compensation methods are highly setup-dependent, large and diverse training data is lacking, and geometric correction models generalize poorly to new geometric configurations.
Method: SIComp (1) builds a large-scale real-world dataset spanning 277 distinct projector-camera setups; (2) adopts a co-adaptive design in which a tailored optical flow module performs online geometric correction and a novel photometric network handles photometric compensation; and (3) integrates intensity-varying surface priors into the network design to improve robustness under varying illumination.
Result: SIComp consistently produces high-quality compensation across diverse unseen setups and substantially outperforms existing methods in generalization ability.
Conclusion: SIComp addresses the long-standing setup dependence of projector compensation and offers a general, plug-and-play solution for projection in complex real scenes.
Abstract: Projector compensation seeks to correct geometric and photometric distortions that occur when images are projected onto nonplanar or textured surfaces. However, most existing methods are highly setup-dependent, requiring fine-tuning or retraining whenever the surface, lighting, or projector-camera pose changes. Progress has been limited by two key challenges: (1) the absence of large, diverse training datasets and (2) existing geometric correction models are typically constrained by specific spatial setups; without further retraining or fine-tuning, they often fail to generalize directly to novel geometric configurations. We introduce SIComp, the first Setup-Independent framework for full projector Compensation, capable of generalizing to unseen setups without fine-tuning or retraining. To enable this, we construct a large-scale real-world dataset spanning 277 distinct projector-camera setups. SIComp adopts a co-adaptive design that decouples geometry and photometry: A carefully tailored optical flow module performs online geometric correction, while a novel photometric network handles photometric compensation. To further enhance robustness under varying illumination, we integrate intensity-varying surface priors into the network design. Extensive experiments demonstrate that SIComp consistently produces high-quality compensation across diverse unseen setups, substantially outperforming existing methods in terms of generalization ability and establishing the first generalizable solution to projector compensation. The code and dataset are available on our project page: https://hai-bo-li.github.io/SIComp/
[114] Dense Point-to-Mask Optimization with Reinforced Point Selection for Crowd Instance Segmentation
Hongru Chen,Jiyang Huang,Jia Wan,Antoni B. Chan
Main category: cs.CV
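TL;DR: This paper proposes Dense Point-to-Mask Optimization (DPMO) and a Reinforced Point Selection (RPS) framework, which use point annotations to generate high-quality instance segmentation masks and improve counting accuracy in dense crowds.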
Details
Motivation: Dense crowd datasets mostly provide point annotations and lack accurate region labels (boxes or masks). Directly applying large models such as SAM gives poor results in dense crowds, so point annotations must be leveraged to produce high-quality masks that improve segmentation and counting.
Method: DPMO combines SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance masks from point annotations. RPS, trained with Group Relative Policy Optimization (GRPO), selects the best predicted point from sampled candidates. New mask-supervised loss functions are also designed.
Result: State-of-the-art crowd instance segmentation on ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd, together with improved counting performance across different models.
Conclusion: Mask annotations play a significant role in both crowd instance segmentation and counting; DPMO and RPS together bridge point annotations and high-quality masks.
Abstract: Crowd instance segmentation is a crucial task with a wide range of applications, including surveillance and transportation. Currently, point labels are common in crowd datasets, while region labels (e.g., boxes) are rare and inaccurate. The masks obtained through segmentation help to improve the accuracy of region labels and resolve the correspondence between individual location coordinates and crowd density maps. However, directly applying currently popular large foundation models such as SAM does not yield ideal results in dense crowds. To this end, we first propose Dense Point-to-Mask Optimization (DPMO), which integrates SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation from point annotations. With DPMO and manual correction, we obtain mask annotations from the existing point annotations for traditional crowd datasets. Then, to predict instance segmentation in dense crowds, we propose a Reinforced Point Selection (RPS) framework trained with Group Relative Policy Optimization (GRPO), which selects the best predicted point from a sampling of the initial point prediction. Through extensive experiments, we achieve state-of-the-art crowd instance segmentation performance on ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd datasets. Furthermore, we design new loss functions supervised by masks that boost counting performance across different models, demonstrating the significant role of mask annotations in enhancing counting accuracy.
[115] Unifying UAV Cross-View Geo-Localization via 3D Geometric Perception
Haoyuan Li,Wen Yang,Fang Xu,Hong Tan,Haijian Zhang,Shengyang Li,Gui-Song Xia
Main category: cs.CV
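TL;DR: This paper proposes a geometry-aware UAV geo-localization framework that reconstructs a local 3D scene and renders an orthorectified Bird's-Eye View (BEV), unifying coarse cross-view place retrieval and fine-grained 3-DoF pose estimation and significantly improving UAV localization accuracy on satellite maps in GNSS-denied environments.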
Details
Motivation: Addresses cross-view geo-localization for UAVs in GNSS-denied environments, which is difficult because of the severe geometric discrepancy between oblique UAV imagery and orthographic satellite maps; existing methods treat perspective distortion as appearance noise and do not model the geometry explicitly.
Method: Uses a Visual Geometry Grounded Transformer (VGGT) to reconstruct a local 3D scene from multi-view UAV image sequences and renders a geometrically consistent virtual bird's-eye view (BEV) as a cross-view intermediary; introduces a Satellite-wise Attention Block that isolates the interaction between each candidate satellite region and the UAV scene; releases a version of the University-1652 dataset with recalibrated coordinates and spatial overlap analysis.
Result: Significantly outperforms SOTA methods on the recalibrated University-1652 and SUES-200 datasets, achieving robust meter-level localization accuracy and better generalization in complex urban environments.
Conclusion: Explicitly modeling 3D geometry and introducing the BEV as a geometric intermediary effectively bridges the cross-view representation gap; the proposed end-to-end framework and new benchmark dataset offer a new paradigm for UAV visual geo-localization.
Abstract: Cross-view geo-localization for Unmanned Aerial Vehicles (UAVs) operating in GNSS-denied environments remains challenging due to the severe geometric discrepancy between oblique UAV imagery and orthogonal satellite maps. Most existing methods address this problem through a decoupled pipeline of place retrieval and pose estimation, implicitly treating perspective distortion as appearance noise rather than an explicit geometric transformation. In this work, we propose a geometry-aware UAV geo-localization framework that explicitly models the 3D scene geometry to unify coarse place recognition and fine-grained pose estimation within a single inference pipeline. Our approach reconstructs a local 3D scene from multi-view UAV image sequences using a Visual Geometry Grounded Transformer (VGGT), and renders a virtual Bird's-Eye View (BEV) representation that orthorectifies the UAV perspective to align with satellite imagery. This BEV serves as a geometric intermediary that enables robust cross-view retrieval and provides spatial priors for accurate 3 Degrees of Freedom (3-DoF) pose regression. To efficiently handle multiple location hypotheses, we introduce a Satellite-wise Attention Block that isolates the interaction between each satellite candidate and the reconstructed UAV scene, preventing inter-candidate interference while maintaining linear computational complexity. In addition, we release a recalibrated version of the University-1652 dataset with precise coordinate annotations and spatial overlap analysis, enabling rigorous evaluation of end-to-end localization accuracy. Extensive experiments on the refined University-1652 benchmark and SUES-200 demonstrate that our method significantly outperforms state-of-the-art baselines, achieving robust meter-level localization accuracy and improved generalization in complex urban environments.
[116] Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding
Jiayun Jin,Haolong Chai,Xueying Huang,Xiaoqing Guo,Zengwei Zheng,Zhan Zhou,Junmei Wang,Xinyu Wang,Jie Liu,Binbin Zhou
Main category: cs.CV
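TL;DR: This paper presents the US-365K dataset and the Ultrasonographic Diagnostic Taxonomy (UDT) knowledge framework, and designs Ultrasound-CLIP, which aligns ultrasound images and text semantically through semantic soft labels, a semantic loss, and a heterogeneous graph modality, achieving SOTA performance on multiple downstream tasks.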
Details
Motivation: Existing vision-language pre-training models such as CLIP are hard to apply directly to ultrasound imaging, which exhibits highly heterogeneous anatomical structures and diverse diagnostic attributes; a dedicated dataset and knowledge framework are needed to support semantic understanding of the ultrasound modality.
Method: Builds US-365K, a large-scale ultrasound image-text dataset (365k samples, 52 anatomical categories); proposes the Ultrasonographic Diagnostic Taxonomy (UDT), comprising a hierarchical anatomical taxonomy and a nine-dimensional diagnostic attribute framework (UDAF); on this basis designs the Ultrasound-CLIP framework, which introduces semantic soft labels and a semantic loss and constructs a heterogeneous graph modality from UDAF textual representations to support reasoning over lesion-attribute relations.
Result: Achieves SOTA on classification and retrieval benchmarks with patient-level splits, and shows strong generalization in zero-shot, linear-probing, and fine-tuning settings.
Conclusion: The work provides a systematic data foundation, a structured knowledge representation, and a semantic-aware modeling approach for multimodal understanding of ultrasound images, significantly improving ultrasound image-text cross-modal learning performance and clinical interpretability.
Abstract: Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models, such as CLIP, are primarily designed for other modalities, and are difficult to directly apply to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To bridge this gap, we construct US-365K, a large-scale ultrasound image-text dataset containing 365k paired samples across 52 anatomical categories. We establish Ultrasonographic Diagnostic Taxonomy (UDT) containing two hierarchical knowledge frameworks. Ultrasonographic Hierarchical Anatomical Taxonomy standardizes anatomical organization, and Ultrasonographic Diagnostic Attribute Framework (UDAF) formalizes nine diagnostic dimensions, including body system, organ, diagnosis, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity. Building upon these foundations, we propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that introduces semantic soft labels and semantic loss to refine sample discrimination. Moreover, we construct a heterogeneous graph modality derived from UDAF's textual representations, enabling structured reasoning over lesion-attribute relations. Extensive experiments with patient-level data splitting demonstrate that our approach achieves state-of-the-art performance on classification and retrieval benchmarks, while also delivering strong generalization to zero-shot, linear probing, and fine-tuning tasks.
[117] Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion
Edoardo A. Dominici,Thomas Deixelberger,Konstantinos Vardis,Markus Steinberger
Main category: cs.CV
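TL;DR: This paper proposes a lightweight architecture and training strategy that uses high-dimensional features from self-supervised learning (e.g., DINO) and decouples appearance from the other features to be preserved, enabling robust control over appearance changes such as stylization and relighting in tasks like video domain transfer and video generation.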
Details
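Motivation: Existing video diffusion models are mostly conditioned on perceptual, geometric, or simple semantic signals. Self-supervised visual features such as DINO carry rich scene information, but style, lighting, and semantics are heavily entangled in them, which limits their controllability in generative tasks. This paper explores how to use such high-dimensional features as a general conditioning signal to improve the controllability of pretrained video diffusion models.
Method: Proposes a lightweight network architecture and training strategy that explicitly decouples appearance features from other structural and semantic features; relies on features with high dimensionality but low spatial resolution, where the higher dimensionality compensates for lost spatial detail, to improve controllability when generating video from explicit spatial representations (e.g., 3D).
Result: Achieves effective appearance editing (e.g., style transfer, relighting) on video domain transfer and video-from-3D generation, validating the effectiveness and robustness of the decoupling strategy and the high-dimensional feature compensation.
Conclusion: High-dimensional self-supervised features can serve as a general conditioning interface for pretrained video diffusion models; feature decoupling and a dimensionality-resolution trade-off markedly improve appearance controllability during generation, offering a new paradigm for video generation and world simulation.
Abstract: Video models have recently been applied with success to problems in content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer rely on conditioning these models, typically through perceptual, geometric, or simple semantic signals, fundamentally using them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models. The connection between the two has been explored for subject-specific editing, aligning and training video diffusion models, but not in the role of a more general conditioning signal for pretrained video diffusion models. Features obtained through self-supervised learning, like DINO, contain a lot of entangled information about style, lighting and semantics of the scene. This makes them great at reconstruction tasks but limits their generative capabilities. In this paper, we show how we can use the features for tasks such as video domain transfer and video-from-3D generation. We introduce a lightweight architecture and training strategy that decouples appearance from other features that we wish to preserve, enabling robust control for appearance changes such as stylization and relighting. Furthermore, we show that low spatial resolution can be compensated by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.
[118] Cosine-Normalized Attention for Hyperspectral Image Classification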
Muhammad Ahmad,Manuel Mazzara
Main category: cs.CV
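TL;DR: This paper proposes a cosine-normalized attention mechanism that emphasizes the angular relationships in hyperspectral data and reduces sensitivity to magnitude variations, significantly improving hyperspectral image classification under extremely limited supervision.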
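The squared-cosine scoring summarized in this entry can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the softmax over scores, the `temperature` parameter, and all function names are assumptions made for illustration; the entry only states that queries and keys are projected onto the unit hypersphere and scored with a squared cosine similarity.

```python
import math

def l2_normalize(v, eps=1e-12):
    # Project a vector onto the unit hypersphere.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def squared_cosine_attention(queries, keys, values, temperature=1.0):
    # Scores depend only on the angle between query and key, not on
    # their magnitudes. `temperature` and the softmax are assumptions.
    q_unit = [l2_normalize(q) for q in queries]
    k_unit = [l2_normalize(k) for k in keys]
    outputs = []
    for q in q_unit:
        scores = [sum(a * b for a, b in zip(q, k)) ** 2 / temperature
                  for k in k_unit]
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return outputs
```

Rescaling a query or a key leaves the output unchanged up to floating-point error, which is the magnitude insensitivity the entry describes.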
Details
Motivation: Dot-product attention in standard Transformers mixes feature magnitude and orientation, which may be suboptimal for hyperspectral data; attention scoring needs to be redesigned from a geometric perspective.
Method: Introduces cosine-normalized attention: query and key embeddings are projected onto the unit hypersphere and attention scores are computed with a squared cosine similarity, emphasizing angular relationships and suppressing magnitude effects; the formulation is integrated into a spatial-spectral Transformer.
Result: On three benchmark datasets, the method consistently outperforms several recent Transformer- and Mamba-based models under extremely limited supervision while using a lightweight backbone; controlled experiments show that cosine-based scoring provides a reliable inductive bias.
Conclusion: Cosine-normalized attention better matches the geometry of hyperspectral signals and is an attention design better suited to hyperspectral representation learning.
Abstract: Transformer-based methods have improved hyperspectral image classification (HSIC) by modeling long-range spatial-spectral dependencies; however, their attention mechanisms typically rely on dot-product similarity, which mixes feature magnitude and orientation and may be suboptimal for hyperspectral data. This work revisits attention scoring from a geometric perspective and introduces a cosine-normalized attention formulation that aligns similarity computation with the angular structure of hyperspectral signatures. By projecting query and key embeddings onto a unit hypersphere and applying a squared cosine similarity, the proposed method emphasizes angular relationships while reducing sensitivity to magnitude variations. The formulation is integrated into a spatial-spectral Transformer and evaluated under extremely limited supervision. Experiments on three benchmark datasets demonstrate that the proposed approach consistently achieves higher performance, outperforming several recent Transformer- and Mamba-based models despite using a lightweight backbone. In addition, a controlled analysis of multiple attention score functions shows that cosine-based scoring provides a reliable inductive bias for hyperspectral representation learning.
[119] Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning
Seyed Amir Kasaei,Arash Marioriyad,Mahbod Khaleti,MohammadAmin Fazli,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban
Main category: cs.CV
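TL;DR: This paper introduces RebusBench, a benchmark for evaluating the neurosymbolic capability of large vision-language models on tasks that require multi-step reasoning (e.g., rebus puzzles); current SOTA models perform very poorly (<10% Exact Match), revealing that they lack a cognitive reasoning mechanism connecting perception and knowledge.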
Details
Motivation: Existing large vision-language models are good at explicit visual recognition but show a marked cognitive gap on tasks, such as rebus puzzles, where visual clues must pass through multi-step reasoning (extracting attributes, recalling linguistic priors, abstract mapping) to reach the answer.
Method: Builds the RebusBench benchmark (1,164 rebus puzzles), systematically evaluates SOTA models such as Qwen, InternVL, and LLaVA on Exact Match and semantic accuracy, and analyzes the effect of model scaling and in-context learning.
Result: All tested models score below 10% Exact Match and below 20% semantic accuracy, and neither larger models nor In-Context Learning brings a significant improvement.
Conclusion: Current LVLMs have the visual and linguistic components but lack the "cognitive reasoning glue" that integrates them; stronger neurosymbolic reasoning is needed.
Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.
[120] DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning
Yang Zhou,Xiaofeng Wang,Hao Shao,Letian Wang,Guosheng Zhao,Jiangnan Shao,Jiagang Zhu,Tingdong Yu,Zheng Zhu,Guan Huang,Steven L. Waslander
Main category: cs.CV
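TL;DR: DriveDreamer-Policy is a unified driving world-action model that integrates depth generation, future video generation, and motion planning, improving predictive coherence and driving decision quality on top of a geometry-aware world representation.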
Details
Motivation: Existing world-action models (WAMs) mostly model 2D appearance or latent representations and lack the geometric grounding that is essential in the physical world, which limits embodied systems in real driving scenarios.
Method: Proposes DriveDreamer-Policy: a large language model processes language instructions, multi-view images, and action inputs, followed by three lightweight generators that output depth maps, future video frames, and driving actions; a geometry-aware world representation is learned explicitly and used for both future prediction and action planning.
Result: Reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based methods; generated future video and depth maps are of higher quality; ablations show that explicit depth learning improves video imagination quality and planning robustness.
Conclusion: A geometry-aware world representation and a modular, unified architecture are key to improving driving world-action models, combining high generation quality, strong planning, and controllable latency.
Abstract: Recently, world-action models (WAM) have emerged to bridge vision-language-action (VLA) models and world models, unifying their reasoning and instruction-following capabilities and spatio-temporal world modeling. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.
[121] FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation
Taimur Khan,Hannes Feilhauer,Muhammad Jazib Zafar
Main category: cs.CV
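TL;DR: This paper proposes FSKD, a framework that uses knowledge distillation to transfer forest structure information from LiDAR data (e.g., CHM, PAI, FHD) to a lightweight model that takes only RGBI imagery, enabling low-cost, accurate, joint multi-metric prediction and supporting inputs acquired in different seasons.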
Details
Motivation: Airborne LiDAR is the gold standard for monitoring forest structure but is costly and acquired infrequently; more readily available RGBI imagery is needed to replace or complement LiDAR for large-scale, frequent, high-resolution forest structure mapping.
Method: Proposes FSKD, a LiDAR-to-RGBI knowledge distillation framework; the teacher is multi-modal (LiDAR + RGBI) and fuses LiDAR-derived planar metrics and vertical profiles via cross-attention; the student is a SegFormer with RGBI-only input; the model is trained on 384 km² of data from Saxony, Germany, and tested on eight geographically distinct regions.
Result: The student achieves SOTA zero-shot CHM prediction (MedAE = 4.17 m, R² = 0.51, IoU = 0.87), with MAE 29–46% lower than the HRCHM/DAC baselines; it is the first to jointly predict CHM, PAI, and FHD; it is robust to winter-summer temporal mismatch, supporting non-synchronous acquisition.
Conclusion: FSKD distills LiDAR prior knowledge into an RGBI model efficiently, keeping 20 cm GSD precision while substantially lowering cost and deployment barriers, and supports scaling high-resolution forest structure products in national programs such as Digital Twin Germany.
Abstract: Very High Resolution (VHR) forest structure data at individual-tree scale is essential for carbon, biodiversity, and ecosystem monitoring. Still, airborne LiDAR remains costly and infrequent despite being the reference for forest structure metrics like Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD). We propose FSKD: a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework in which a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. Trained on 384 km$^2$ of forests in Saxony, Germany (20 cm ground sampling distance (GSD)) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM performance (MedAE 4.17 m, $R^2$=0.51, IoU 0.87), outperforming HRCHM/DAC baselines by 29–46% in MAE (5.81 m vs. 8.14–10.84 m) with stronger correlation coefficients (0.713 vs. 0.166–0.652). Ablations show that multi-modal fusion improves performance by 10–26% over RGBI-only training, and that asymmetric distillation with appropriate model capacity is critical. The method jointly predicts CHM, PAI, and FHD, a multi-metric capability not provided by current monocular CHM estimators, although PAI/FHD transfer remains region-dependent and benefits from local calibration. The framework also remains effective under temporal mismatch (winter LiDAR, summer RGBI), removing strict co-acquisition constraints and enabling scalable 20 cm operational monitoring for workflows such as Digital Twin Germany and national Digital Orthophoto programs.
[122] A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection
Arezoo Borji,Gernot Kronreif,Bernhard Angermayr,Francisco Mario Calisto,Wolfgang Birkfellner,Inna Servetnyk,Yinyin Yuan,Sepideh Hatamikia
Main category: cs.CV
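TL;DR: This paper proposes a deep learning framework that combines NSGA-II optimization with Monte Carlo dropout uncertainty estimation to predict PAM50 subtypes directly from H&E-stained whole-slide images, reducing reliance on costly molecular assays. It is validated on the TCGA-BRCA and CPTAC-BRCA datasets with high F1 and AUC, showing potential for clinical translation.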
Details
Motivation: Reduce the dependence of breast cancer PAM50 molecular subtyping on costly, time-consuming molecular assays by exploring a scalable alternative based on routine H&E-stained images.
Method: Proposes an optimization-driven deep learning framework that jointly optimizes patch informativeness, spatial diversity, uncertainty, and patch count; combines the NSGA-II multi-objective evolutionary algorithm with Monte Carlo dropout uncertainty estimation; uses ResNet18 for feature extraction and a custom CNN head for classification; predicts from only a small number of highly informative WSI patches.
Result: Reaches F1 = 0.8812 and AUC = 0.9841 on the internal TCGA-BRCA dataset (627 WSIs) and F1 = 0.7952 and AUC = 0.9512 on the external CPTAC-BRCA dataset; improves both computational efficiency and classification performance.
Conclusion: The uncertainty-aware, optimization-guided patch selection strategy enables efficient and accurate PAM50 subtyping from histopathology images and may become a clinically viable alternative to molecular assays.
Abstract: Breast cancer is a highly heterogeneous disease with diverse molecular profiles. The PAM50 gene signature is widely recognized as a standard for classifying breast cancer into intrinsic subtypes, enabling more personalized treatment strategies. In this study, we introduce a novel optimization-driven deep learning framework that aims to reduce reliance on costly molecular assays by directly predicting PAM50 subtypes from H&E-stained whole-slide images (WSIs). Our method jointly optimizes patch informativeness, spatial diversity, uncertainty, and patch count by combining the non-dominated sorting genetic algorithm II (NSGA-II) with Monte Carlo dropout-based uncertainty estimation. The proposed method can identify a small but highly informative patch subset for classification. We used a ResNet18 backbone for feature extraction and a custom CNN head for classification. For evaluation, we used the internal TCGA-BRCA dataset as the training cohort and the external CPTAC-BRCA dataset as the test cohort. On the internal dataset (627 WSIs from the TCGA-BRCA cohort), the method achieved an F1-score of 0.8812 and an AUC of 0.9841. On the external validation dataset, it achieved an F1-score of 0.7952 and an AUC of 0.9512. These findings indicate that the proposed optimization-guided, uncertainty-aware patch selection can achieve high performance and improve the computational efficiency of histopathology-based PAM50 classification compared to existing methods, suggesting a scalable imaging-based replacement that has the potential to support clinical decision-making.
[123] STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
Emad Bahrami,Olga Zatsarynna,Parth Pathak,Sunando Sengupta,Juergen Gall,Mohsen Fayyaz
Main category: cs.CV
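TL;DR: STRIVE is a spatiotemporal reinforcement learning framework for video question answering that constructs multiple spatiotemporal variants of each video and jointly normalizes across textual generations and visual variants, improving reward signal quality and the stability of policy updates; an importance-aware sampling mechanism ensures that semantically relevant frames are explored first, and experiments show consistent gains over existing reinforcement learning baselines on multiple video reasoning benchmarks.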
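The effect of normalizing rewards jointly across textual generations and visual variants can be sketched as follows. This is a hedged illustration, not STRIVE's actual procedure: it assumes a standard group-relative advantage of the form (reward − mean) / std, and the function names and the nested-list layout `rewards_by_variant[variant][generation]` are hypothetical.

```python
import math

def normalize(rewards, eps=1e-6):
    # Group-relative advantage: centre and scale rewards within one group.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def per_variant_advantages(rewards_by_variant):
    # Baseline: each visual variant forms its own group.
    return [normalize(variant) for variant in rewards_by_variant]

def joint_advantages(rewards_by_variant):
    # Assumed joint scheme: pool every (variant, generation) reward into
    # one group, normalize once, then restore the nested layout.
    flat = [r for variant in rewards_by_variant for r in variant]
    flat_adv = normalize(flat)
    out, start = [], 0
    for variant in rewards_by_variant:
        out.append(flat_adv[start:start + len(variant)])
        start += len(variant)
    return out
```

With `rewards_by_variant = [[1, 1, 1], [0, 0, 0]]`, per-variant normalization yields all-zero advantages because each group has zero variance, whereas the pooled version separates the two variants. That is the low-variance failure mode the entry attributes to group-based methods.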
Details
Motivation: Group-based policy optimization methods in large multimodal models often suffer from low reward variance when responses have similar correctness, which leads to weak or unstable advantage estimates.
Method: Proposes the STRIVE framework: 1) construct multiple spatiotemporal variants of the input video; 2) jointly normalize across textual generations and visual variants; 3) use an importance-aware sampling mechanism that prioritizes the frames most relevant to the question while preserving temporal coverage.
Result: On six video reasoning benchmarks (VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest), STRIVE consistently outperforms strong reinforcement learning baselines across multiple large multimodal models.
Conclusion: Structured spatiotemporal exploration is a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
Abstract: We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks including VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
[124] SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers
Xiang Yang,Feifei Li,Mi Zhang,Geng Hong,Xiaoyu You,Min Yang
Main category: cs.CV
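TL;DR: This paper proposes SafeRoPE, a lightweight, fine-grained safe generation framework that analyzes safety-critical subspaces in the attention mechanism of MMDiT and uses Rotary Position Embedding (RoPE) perturbations to precisely suppress harmful semantics while balancing safety and generation fidelity.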
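Scoring a vector by its projection onto a low-dimensional subspace, as this entry describes for the Latent Risk Score (LRS), can be sketched as below. The exact LRS definition is not given here, so the ratio of projected norm to full norm is an assumption, as are the function name and the requirement that `unsafe_basis` be an orthonormal list of basis vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def latent_risk_score(vector, unsafe_basis, eps=1e-12):
    # ASSUMPTION: the score is the fraction of the vector's norm that lies
    # inside the span of `unsafe_basis` (orthonormal vectors). It ranges
    # from 0 (orthogonal to the subspace) to about 1 (fully inside it).
    coeffs = [dot(vector, b) for b in unsafe_basis]
    projected_norm = math.sqrt(sum(c * c for c in coeffs))
    full_norm = math.sqrt(dot(vector, vector))
    return projected_norm / (full_norm + eps)
```

A score like this could gate how strongly a per-head perturbation is applied, so that vectors with little mass in the subspace are left almost untouched; that gating is likewise an illustrative assumption rather than the published design.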
Details
Motivation: Existing T2I models (e.g., SD3, FLUX) generate high-quality images but remain vulnerable to unsafe semantics triggered by multi-token interactions; mainstream mitigation methods rely on fine-tuning or attention modulation, which are computationally expensive and hard to adapt to Transformer architectures such as MMDiT.
Method: Based on an in-depth analysis of the attention mechanism in MMDiT, identifies the safety-critical attention heads responsible for unsafe feature extraction and builds head-wise unsafe subspaces; proposes a Latent Risk Score (LRS) to quantify input risk and designs head-wise RoPE perturbations that selectively suppress unsafe semantics, yielding a risk-specific rotation of query and key vectors.
Result: SafeRoPE achieves the best balance between harmful content mitigation and preservation of generation utility on MMDiT, reaching SOTA performance on safe generation.
Conclusion: SafeRoPE is an efficient, interpretable safe generation approach that requires no fine-tuning and offers a practical, scalable safeguard for Transformer-based diffusion models.
Abstract: Recent Text-to-Image (T2I) models based on rectified-flow transformers (e.g., SD3, FLUX) achieve high generative fidelity but remain vulnerable to unsafe semantics, especially when triggered by multi-token interactions. Existing mitigation methods largely rely on fine-tuning or attention modulation for concept unlearning; however, their expensive computational overhead and design tailored to U-Net-based denoisers hinder direct adaptation to transformer-based diffusion models (e.g., MMDiT). In this paper, we conduct an in-depth analysis of the attention mechanism in MMDiT and find that unsafe semantics concentrate within interpretable, low-dimensional subspaces at head level, where a finite set of safety-critical heads is responsible for unsafe feature extraction. We further observe that perturbing the Rotary Positional Embedding (RoPE) applied to the query and key vectors can effectively modify some specific concepts in the generated images. Motivated by these insights, we propose SafeRoPE, a lightweight and fine-grained safe generation framework for MMDiT. Specifically, SafeRoPE first constructs head-wise unsafe subspaces by decomposing unsafe embeddings within safety-critical heads, and computes a Latent Risk Score (LRS) for each input vector via projection onto these subspaces. We then introduce head-wise RoPE perturbations that can suppress unsafe semantics without degrading benign content or image quality. SafeRoPE combines both head-wise LRS and RoPE perturbations to perform risk-specific head-wise rotation on query and key vector embeddings, enabling precise suppression of unsafe outputs while maintaining generation fidelity. Extensive experiments demonstrate that SafeRoPE achieves SOTA performance in balancing effective harmful content mitigation and utility preservation for safe generation of MMDiT. Codes are available at https://github.com/deng12yx/SafeRoPE.
[125] Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
Yaxin Luo,Zhiqiang Shen
Main category: cs.CV
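TL;DR: This paper proposes random label bridge training, which requires no manual annotation and lets large language model (LLM) parameters adapt effectively to vision tasks; it finds that some LLM layers have strong foundational properties and can be used directly for vision tasks without fine-tuning, offering a new path for cross-modality (language-vision) adaptation.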
Details
Motivation: Prior work generally assumes that language pre-trained models are unsuitable for downstream vision tasks because their parameter space differs greatly from that of vision models, and has therefore focused on cross-domain transfer rather than cross-modality bridging; this paper challenges that assumption and explores the direct potential of LLMs for vision tasks.
Method: Proposes "random label bridge training" as a modality adaptation learner, adding a lightweight bridge training stage between the language model and the vision task; also explores partial-layer bridging, identifying and keeping the LLM layers that are naturally suited to vision tasks.
Result: Experiments show that random label bridge training effectively aligns LLM parameters with vision tasks; partial bridging outperforms full bridging, and certain LLM layers generalize well to vision even without fine-tuning.
Conclusion: The gap between language and vision is not insurmountable; bridge training, especially partial bridging, makes efficient use of LLM parameters for vision tasks and suggests a new paradigm for cross-modality model design.
Abstract: The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) adaptation inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, that requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.
[126] Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images
Jamie S. J. Stirling,Noura Al-Moubayed,Hubert P. H. Shum
Main category: cs.CV
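TL;DR: This paper proposes PI-VQ, a position-free discrete image representation that constrains latent codes to carry no positional information, encouraging the model to learn global semantic features and supporting direct interpolation without a learned prior; it introduces a matching quantization algorithm based on optimal bipartite matching that increases bottleneck capacity by 3.5×, enables interpolation-based sampling in a single forward pass, and achieves competitive generation quality metrics on CelebA and other datasets.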
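The contrast between nearest-neighbour quantization and a quantizer based on optimal bipartite matching can be sketched on a toy example. This is an illustration under stated assumptions, not the PI-VQ algorithm: it assumes that matching assigns each latent vector a distinct codebook entry while minimizing total squared distance, and it finds the optimum by brute force over permutations, which is only practical for very small inputs. All names are hypothetical.

```python
from itertools import permutations

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbour_assign(latents, codebook):
    # Each latent independently picks its closest code; codes may repeat.
    return [min(range(len(codebook)), key=lambda j: sq_dist(z, codebook[j]))
            for z in latents]

def matching_assign(latents, codebook):
    # ASSUMPTION: one-to-one assignment of latents to distinct codes with
    # minimal total cost. Exhaustive search, for illustration only.
    best_cost, best = float("inf"), None
    for perm in permutations(range(len(codebook)), len(latents)):
        cost = sum(sq_dist(z, codebook[j]) for z, j in zip(latents, perm))
        if cost < best_cost:
            best_cost, best = cost, list(perm)
    return best
```

For latents `[[0.0, 0.0], [0.1, 0.0]]` and codebook `[[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]`, nearest-neighbour assignment maps both latents to code 0, while the matching assigns distinct codes. Forcing distinct codes is one plausible way a permutation-invariant set of codes could carry more information. At realistic sizes a polynomial-time solver such as the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) would replace the brute-force loop.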
Details
Motivation: Investigates whether discrete representations of spatially aligned data must depend on positional information, challenging the modeling complexity caused by position dependence and contextual entanglement in existing VQ-VAE/VQ-GAN approaches.
Method: Proposes the permutation-invariant vector-quantized autoencoder (PI-VQ), which forces latent codes to carry no positional information; designs matching quantization, which replaces nearest-neighbour quantization with optimal bipartite matching to increase effective bottleneck capacity; exploits the compositional structure of the codes for interpolation-based, single-step image synthesis.
Result: Generated images on CelebA, CelebA-HQ, and FFHQ reach competitive precision, density, and coverage; the work verifies that direct interpolation is possible without a prior and that the codes are more separable and interpretable.
Conclusion: Positional information is not necessary for discrete image representations; PI-VQ shows that permutation-invariant representations capture global semantics, and matching quantization substantially increases capacity, opening a path toward disentangled, interpretable, and efficient generative modeling.
Abstract: Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by $3.5\times$ relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
[127] FaCT-GS: Fast and Scalable CT Reconstruction with Gaussian Splatting
Pawel Tomasz Pieta,Rasmus Juul Pedersen,Sina Borgi,Jakob Sauer Jørgensen,Jens Wenzel Andreasen,Vedrana Andersen Dahl
Main category: cs.CV
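TL;DR: This paper proposes FaCT-GS, a framework that optimizes the voxelization and rasterization pipelines to significantly improve the speed and flexibility of Gaussian Splatting (GS) for CT reconstruction, running 4–13× faster than the best existing GS method and supporting fast Gaussian fitting to prior volumes.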
Details
Motivation: Gaussian Splatting (GS) performs well for CT reconstruction, but its advantages are not yet large enough to motivate a move away from established algorithms; key limitations of GS methods in computational efficiency, scalability, and use of priors need to be resolved.
Method: Proposes the FaCT-GS framework, which deeply optimizes the voxelization and rasterization pipelines in GS and supports rapid fitting of Gaussians to pre-existing volumes, usable for warm-starting reconstruction or as a compressed representation.
Result: Over 4× faster than the current best GS CT method on 512×512 projections and over 13× faster on 2k projections; scales well to large projections and output volumes.
Conclusion: FaCT-GS overcomes the speed and flexibility bottlenecks of GS in CT reconstruction, providing a more practical GS approach for clinical and real-time applications.
Abstract: Gaussian Splatting (GS) has emerged as a dominant technique for image rendering and has quickly been adapted for the X-ray Computed Tomography (CT) reconstruction task. However, despite being on par with or better than many of its predecessors, the benefits of GS are typically not substantial enough to motivate a transition from well-established reconstruction algorithms. This paper addresses the most significant remaining limitations of the GS-based approach by introducing FaCT-GS, a framework for fast and flexible CT reconstruction. Enabled by an in-depth optimization of the voxelization and rasterization pipelines, our new method is significantly faster than its predecessors and scales well with projection and output volume size. Furthermore, the improved voxelization enables rapid fitting of Gaussians to pre-existing volumes, which can serve as a prior for warm-starting the reconstruction, or simply as an alternative, compressed representation. FaCT-GS is over 4X faster than the state-of-the-art GS CT reconstruction on standard 512x512 projections, and over 13X faster on 2k projections. Implementation available at: https://github.com/PaPieta/fact-gs.
[128] Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance
Jason Qiu,Zachary Meurer,Xavier Thomas,Deepti Ghadiyaram
Main category: cs.CV
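TL;DR: This paper shows that current vision-language models (VLMs) lack spatial invariance and equivariance under basic geometric transformations (e.g., rotation, scaling), so their ability to recognize objects drops sharply on semantically sparse content (e.g., sketches, abstract art), exposing a systematic gap between semantic understanding and spatial reasoning.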
Details
Motivation: Modern VLMs perform well on semantic tasks but are not robust to basic spatial transformations; the deficiencies in their spatial reasoning need to be examined.
Method: Systematically evaluates VLMs across several visual domains (symbolic sketches, natural photographs, and abstract art), testing how different architectures, model sizes, and prompting strategies respond to geometric transformations such as rotation and scaling.
Result: VLM performance drops sharply under geometric transformations, especially on semantically sparse images; the effect holds across architectures, capacities, and prompting strategies.
Conclusion: Current VLMs show a fundamental disconnect between semantic understanding and spatial reasoning; future multimodal systems need stronger geometric perception and modeling.
Abstract: This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: lack of robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
[129] HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models
Yansong Guo,Chaoyang Zhu,Jiayi Ji,Jianghang Lin,Liujuan Cao
Main category: cs.CV
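TL;DR: This paper proposes HieraVid, a hierarchical, dynamic video token pruning framework that significantly reduces the computational cost of VideoLLMs while maintaining high performance.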
Details
Motivation: Existing methods prune only at the input level and ignore the information structure inside videos and LLMs, leaving high redundancy and low efficiency.
Method: Building on the segment-frame structure of videos and the unidirectional propagation of multimodal information inside LLMs, designs three levels of pruning: segment-level (temporal segmentation and spatial merging), frame-level (similar frames within a segment are pruned jointly to preserve diversity), and layer-level (redundancy is progressively reduced as LLM depth increases).
Result: Validated on four mainstream video understanding benchmarks; with only 30% of tokens retained, it surpasses the SOTA and keeps 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
Conclusion: Through structure-aware hierarchical pruning, HieraVid sharply compresses video tokens while maintaining or even improving model performance, offering a new approach to efficient VideoLLM deployment.
Abstract: Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Based on two observations that videos possess the segment-frame structure and LLMs internally propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; 3) layer-level, where redundancy is progressively reduced as LLM depth increases, without compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
[130] Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation
Hinako Mitsuoka,Kazuhiro Hotta
Main category: cs.CV
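TL;DR: This paper proposes a lightweight dual-loss training framework that improves fine-grained quality in temporal action segmentation (TAS) with only one additional output channel and two auxiliary loss terms, without major changes to the model architecture.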
Details
Motivation: Existing TAS methods rely on complex architectures that hinder practical deployment, so lighter and more efficient improvements are needed.
Method: Combine a boundary-regression loss (single-channel boundary prediction) with a segment-level regularization loss based on the cumulative distribution function (CDF), giving an architecture-agnostic training enhancement.
Result: On three benchmark datasets, segment-level consistency and boundary quality improve, with higher F1 and Edit scores and largely unchanged frame-wise accuracy.
Conclusion: High-precision temporal action segmentation can be achieved through simple loss design, without heavier models or inference-time refinement.
Abstract: Recent progress in Temporal Action Segmentation (TAS) has increasingly relied on complex architectures, which can hinder practical deployment. We present a lightweight dual-loss training framework that improves fine-grained segmentation quality with only one additional output channel and two auxiliary loss terms, requiring minimal architectural modification. Our approach combines a boundary-regression loss that promotes accurate temporal localization via a single-channel boundary prediction and a CDF-based segment-level regularization loss that encourages coherent within-segment structure by matching cumulative distributions over predicted and ground-truth segments. The framework is architecture-agnostic and can be integrated into existing TAS models (e.g., MS-TCN, C2F-TCN, FACT) as a training-time loss function. Across three benchmark datasets, the proposed method improves segment-level consistency and boundary quality, yielding higher F1 and Edit scores across three different models. Frame-wise accuracy remains largely unchanged, highlighting that precise segmentation can be achieved through simple loss design rather than heavier architectures or inference-time refinements.
[131] MAR-MAER: Metric-Aware and Ambiguity-Adaptive Autoregressive Image Generation
Kai Dong,Tingting Bai
Main category: cs.CV
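TL;DR: This paper proposes MAR-MAER, a hierarchical autoregressive text-to-image generation framework that uses metric-aware embedding regularization and a conditional variational module to improve the quality consistency of generated images and the semantic flexibility on ambiguous prompts.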
Details
Motivation: Address two problems of existing autoregressive text-to-image models: generation quality that falls short of expectations and difficulty handling ambiguous prompts.
Method: MAR-MAER comprises 1) metric-aware embedding regularization, built on a lightweight projection head and an adaptive kernel regression loss, which aligns representations with human-preference metrics such as CLIPScore and HPSv2; and 2) a conditional variational module that introduces controlled randomness to better model ambiguous semantics during hierarchical token generation.
Result: Experiments on COCO and a newly built Ambiguous-Prompt Benchmark show gains of +1.6 in CLIPScore and +5.3 in HPSv2 over the Hi-MAR baseline; output diversity on ambiguous prompts increases markedly, as confirmed by both human evaluation and automated metrics.
Conclusion: MAR-MAER improves the balance between consistency with human evaluation and semantic openness in autoregressive image generation, offering a new approach to the ambiguous, polysemous prompts found in real-world use.
Abstract: Autoregressive (AR) models have demonstrated significant success in text-to-image generation. However, they usually face two major challenges. First, the generated images may not always meet the quality standards expected by humans. Second, these models have difficulty with ambiguous prompts that could be interpreted in several valid ways. To address these issues, we introduce MAR-MAER, a hierarchical autoregressive framework that combines two main components. The first is a metric-aware embedding regularization method. The second is a probabilistic latent model used for handling ambiguous semantics. Our method utilizes a lightweight projection head, which is trained with an adaptive kernel regression loss function. This aligns the model's internal representations with human-preferred quality metrics, such as CLIPScore and HPSv2. As a result, the learned embedding space more accurately reflects human judgment. We also introduce a conditional variational module, which incorporates controlled randomness within the hierarchical token generation process. This allows the model to produce a diverse array of coherent images from ambiguous or open-ended prompts. We conducted extensive experiments using COCO and a newly developed Ambiguous-Prompt Benchmark. The results show that MAR-MAER achieves excellent performance in both metric consistency and semantic flexibility. It exceeds the baseline Hi-MAR model's performance, showing an improvement of +1.6 in CLIPScore and +5.3 in HPSv2.
For unclear inputs, it produces a notably wider range of outputs. These findings have been confirmed through both human evaluation and automated metrics.
[132] GeoAI Agency Primitives
Akram Zaytar,Rohan Sawahn,Caleb Robinson,Gilles Q. Hacheme,Girmaw A. Tadesse,Inbal Becker-Reshef,Rahul Dodhia,Juan Lavista Ferres
Main category: cs.CV
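TL;DR: This paper proposes a set of nine "agency primitives" for geospatial AI (GeoAI) assistants, intended to bridge the gap between foundation model capabilities and the actual workflows of GIS practitioners (such as cartography and vector/raster production), emphasizing human-centered iterative collaboration rather than model performance alone.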
Details
Motivation: Existing GeoAI capabilities (such as satellite image captioning, visual question answering, and promptable segmentation) have not yet translated into productivity gains for GIS practitioners; the core problem is the lack of an "agency layer" that supports iterative human-machine collaboration.
Method: Propose an agency vocabulary of nine core primitives (such as navigation, perception, geo-referenced memory, and dual modeling), together with a benchmark that uses human productivity as its metric.
Result: A structured framework of agency primitives and a corresponding human-productivity benchmark, making agentic assistance in GeoAI assistants implementable, testable, and comparable.
Conclusion: Agency primitives are the key intermediate layer connecting large models to real GIS workflows; standardizing them would advance human-centered, deployable GeoAI assistants.
Abstract: We present ongoing research on agency primitives for GeoAI assistants: core capabilities that connect foundation models to the artifact-centric, human-in-the-loop workflows where GIS practitioners actually work. Despite advances in satellite image captioning, visual question answering, and promptable segmentation, these capabilities have not translated into productivity gains for practitioners who spend most of their time producing vector layers, raster maps, and cartographic products. The gap is not model capability alone but the absence of an agency layer that supports iterative collaboration. We propose a vocabulary of 9 primitives for such a layer (including navigation, perception, geo-referenced memory, and dual modeling), along with a benchmark that measures human productivity. Our goal is a vocabulary that makes agentic assistance in GIS implementable, testable, and comparable.
[133] A3R: Agentic Affordance Reasoning via Cross-Dimensional Evidence in 3D Gaussian Scenes
Di Li,Jie Feng,Guanbin Li,Ronghua Shang,Yuhui Zheng,Weisheng Dong,Guangming Shi
Main category: cs.CV
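TL;DR: This paper proposes A3R, a framework that recasts fine-grained affordance reasoning as a sequential evidence acquisition process, progressively resolves ambiguity using cross-dimensional evidence (3D geometry plus 2D semantics), and applies GRPO-based policy learning; it significantly outperforms static one-shot prediction methods in complex 3D Gaussian scenes.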
Details
Motivation: Existing methods treat affordance reasoning as one-shot prediction from static observations, but in complex 3D scenes they often fail because task-relevant evidence is incomplete under a fixed viewpoint, not because the model's predictive capacity is insufficient.
Method: Model affordance reasoning as a sequential evidence acquisition process; propose the A3R framework, in which an MLLM-based policy iteratively selects evidence acquisition actions and updates the affordance belief; introduce a GRPO-based policy learning method to improve evidence acquisition efficiency and reasoning accuracy.
Result: On scene-level benchmarks, A3R consistently surpasses static one-shot baselines.
Conclusion: Agent-based cross-dimensional evidence acquisition effectively improves fine-grained affordance reasoning in complex 3D Gaussian scenes.
Abstract: Affordance reasoning in 3D Gaussian scenes aims to identify the region that supports the action specified by a given text instruction in complex environments. Existing methods typically cast this problem as one-shot prediction from static scene observations, assuming sufficient evidence is already available for reasoning. However, in complex 3D scenes, many failure cases arise not from weak prediction capacity, but from incomplete task-relevant evidence under fixed observations. To address this limitation, we reformulate fine-grained affordance reasoning as a sequential evidence acquisition process, where ambiguity is progressively reduced through complementary 3D geometric and 2D semantic evidence. Building on this formulation, we propose A3R, an agentic affordance reasoning framework that enables an MLLM-based policy to iteratively select evidence acquisition actions and update the affordance belief through cross-dimensional evidence acquisition. To optimize such sequential decision making, we further introduce a GRPO-based policy learning strategy that improves evidence acquisition efficiency and reasoning accuracy. Extensive experiments on scene-level benchmarks show that A3R consistently surpasses static one-shot baselines, demonstrating the advantage of agentic cross-dimensional evidence acquisition for fine-grained affordance reasoning in complex 3D Gaussian scenes.
[134] GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting
Xianben Yang,Tao Wang,Yuxuan Li,Yi Jin,Haibin Ling
Main category: cs.CV
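TL;DR: This paper proposes GS², which optimizes the spatial distribution of 3D Gaussian points through graph-based feature encoding, ELBO-based adaptive densification, and opacity-aware progressive pruning, improving rendering quality while using far fewer Gaussian points (only about 12.5%).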
Details
Motivation: 3D Gaussian Splatting (3DGS) performs well in novel view synthesis and real-time rendering, but its huge number of Gaussian points causes excessive memory cost; existing pruning-based compression methods often harm spatial consistency and introduce rendering artifacts.
Method: Propose GS², a graph-guided spatial distribution optimization framework with three parts: 1) an adaptive densification strategy based on the evidence lower bound (ELBO); 2) an opacity-aware progressive pruning strategy; 3) a graph-based feature encoding module that adjusts point positions through feature guidance.
Result: Using only about 12.5% of the Gaussian points, GS² achieves higher PSNR than the original 3DGS and outperforms all compared baselines in both rendering quality and memory efficiency.
Conclusion: By jointly optimizing the spatial distribution and density of Gaussian points, GS² balances model compactness against reconstruction quality, offering a new paradigm for efficient, high-quality 3D scene representation.
Abstract: 3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS^2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing low-opacity Gaussian points. Furthermore, we propose a graph-based feature encoding module to adjust the spatial distribution via feature-guided point shifting. Extensive experiments validate that GS^2 achieves a compact Gaussian representation while delivering superior rendering quality. Compared with 3DGS, it achieves higher PSNR with only about 12.5% of the Gaussian points. Furthermore, it outperforms all compared baselines in both rendering quality and memory efficiency.
[135] Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters
Ahmed B Mustafa,Zihan Ye,Yang Lu,Michael P Pound,Shreyank N Gowda
Main category: cs.CV
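TL;DR: This paper shows that current text-to-image generation models have serious weaknesses in safety filtering, studies natural-language prompt attacks that require no model access or optimization, and builds a taxonomy of visual jailbreak techniques; experiments report attack success rates of up to 74.47%.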
Details
Motivation: To counter misuse of text-to-image generation models, existing systems rely on safety filters and moderation pipelines, but their effectiveness is unclear; the authors aim to systematically expose how fragile these filters are against low-effort natural-language attacks.
Method: Systematically study several prompt-based visual jailbreak strategies (artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution), organize them into a taxonomy, and run the attacks without model access, optimization, or adversarial training.
Result: The strategies are effective across several state-of-the-art text-to-image systems, with an attack success rate (ASR) of up to 74.47%, indicating that current prompt filtering and visual safety mechanisms struggle to recognize adversarial intent at the semantic level.
Conclusion: There is a critical gap between surface-level prompt filtering and deeper semantic understanding; more robust safety mechanisms are needed against natural-language-driven jailbreak attacks.
Abstract: Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy-violating content. In this work we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks across several state-of-the-art text-to-image systems and demonstrate that simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery. Our findings highlight a critical gap between surface-level prompt filtering and the semantic understanding required to detect adversarial intent in generative media systems. Across all tested models and attack categories we observe an attack success rate (ASR) of up to 74.47%.
[136] ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery
Ke Li,Ting Wang,Di Wang,Yongshan Zhu,Yiming Zhang,Tao Lei,Quan Wang
Main category: cs.CV
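TL;DR: This paper proposes ProVG, a framework that decouples language expressions into global context, spatial relations, and object attributes and uses a progressive cross-modal modulator for coarse-to-fine vision-language alignment, significantly improving the accuracy of localizing objects in remote sensing imagery from natural language expressions.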
Details
Motivation: Existing methods rely on sentence-level vision-language alignment and struggle to exploit fine-grained linguistic cues (such as spatial relations and object attributes); these cues play different roles at different grounding stages and should be used accordingly.
Method: The ProVG framework 1) decouples language expressions into global context, spatial relations, and object attributes; 2) uses a progressive cross-modal modulator that dynamically adjusts visual attention following a "survey-locate-verify" scheme; 3) adds a cross-scale fusion module to mitigate scale variation in remote sensing imagery; 4) uses a language-guided calibration decoder to refine cross-modal alignment during prediction; 5) includes a unified multi-task head supporting both referring expression comprehension and segmentation.
Result: On the RRSIS-D and RISBench benchmarks, ProVG consistently outperforms existing methods and sets a new state of the art.
Conclusion: Through fine-grained language decoupling and progressive cross-modal alignment, ProVG improves the accuracy and robustness of remote sensing visual grounding and offers a new paradigm for RSVG.
Abstract: Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit fine-grained linguistic cues, such as spatial relations and object attributes, that are crucial for distinguishing objects with similar characteristics. Importantly, these cues play distinct roles across different grounding stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose ProVG, a novel RSVG framework that improves localization accuracy by decoupling language expressions into global context, spatial relations, and object attributes. To integrate these linguistic cues, ProVG employs a simple yet effective progressive cross-modal modulator, which dynamically modulates visual attention through a survey-locate-verify scheme, enabling coarse-to-fine vision-language alignment. In addition, ProVG incorporates a cross-scale fusion module to mitigate the large-scale variations in remote sensing imagery, along with a language-guided calibration decoder to refine cross-modal alignment during prediction. A unified multi-task head further enables ProVG to support both referring expression comprehension and segmentation tasks.
Extensive experiments on two benchmarks, i.e., RRSIS-D and RISBench, demonstrate that ProVG consistently outperforms existing methods, achieving new state-of-the-art performance.
[137] SHARC: Reference point driven Spherical Harmonic Representation for Complex Shapes
Panagiotis Sapoutzoglou,George Terzakis,Maria Pateraki
Main category: cs.CV
TL;DR: SHARC是一种基于球谐函数(SH)距离场表示的新框架,通过在物体内部最优位置放置参考点并进行可见性采样与球谐变换,实现高保真、高效且简洁的任意拓扑形状合成。
Details
Motivation: 现有方法在重建精度、计算效率和模型简洁性之间难以兼顾,且对复杂拓扑形状建模能力有限;需一种能自适应处理任意 genus 形状、同时保持细节表现力与计算效率的新表示方法。 Method: 提出SHARC框架:在物体内部自动选取兼顾稀疏性、中心性与表面可见性的参考点;对各点通过光线投射采样可见距离场,并用快速球谐变换(FSHT)计算SH系数;引入可配置低通滤波和基于邻近关系的局部一致性优化以提升几何保真度。 Result: 在重建精度和时间效率上均优于当前最先进方法,同时保持模型参数精简(parsimony);开源代码已发布。 Conclusion: SHARC为任意拓扑形状提供了高效、紧凑且高保真的隐式表示与合成方案,验证了球谐距离场在三维几何学习中的潜力。 Abstract: We propose SHARC, a novel framework that synthesizes arbitrary, genus-agnostic shapes by means of a collection of Spherical Harmonic (SH) representations of distance fields. These distance fields are anchored at optimally placed reference points in the interior volume of the surface in a way that maximizes learning of the finer details of the surface. To achieve this, we employ a cost function that jointly maximizes sparsity and centrality in terms of positioning, as well as visibility of the surface from their location. For each selected reference point, we sample the visible distance field to the surface geometry via ray-casting and compute the SH coefficients using the Fast Spherical Harmonic Transform (FSHT). To enhance geometric fidelity, we apply a configurable low-pass filter to the coefficients and refine the output using a local consistency constraint based on proximity. Evaluation of SHARC against state-of-the-art methods demonstrates that the proposed method outperforms existing approaches in both reconstruction accuracy and time efficiency without sacrificing model parsimony. The source code is available at https://github.com/POSE-Lab/SHARC.[138] FTPFusion: Frequency-Aware Infrared and Visible Video Fusion with Temporal Perturbation
Xilai Li,Chusheng Fang,Xiaosong Li
Main category: cs.CV
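TL;DR: This paper proposes FTPFusion, an infrared and visible video fusion method based on frequency awareness, temporal perturbation, and sparse cross-modal interaction, aiming to improve both spatial detail preservation and temporal stability.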
Details
Motivation: Existing methods struggle to combine temporal stability with spatial detail preservation: frame-wise enhancement methods lack temporal modeling, while heavy spatio-temporal aggregation tends to lose high-frequency detail.
Method: FTPFusion decomposes features into high- and low-frequency components: the high-frequency branch uses sparse cross-modal spatio-temporal interaction to capture motion context and complementary details; the low-frequency branch introduces a temporal perturbation strategy to increase robustness to disturbances such as flickering, jitter, and local misalignment; an offset-aware temporal consistency constraint explicitly stabilizes cross-frame representations.
Result: On multiple public benchmarks, FTPFusion significantly outperforms current state-of-the-art methods across metrics of spatial fidelity and temporal consistency.
Conclusion: Through frequency decoupling, sparse interaction, and perturbation modeling, FTPFusion addresses the core difficulty of joint spatio-temporal modeling in infrared-visible video fusion, providing more robust and clearer fusion for intelligent surveillance and low-light monitoring.
Abstract: Infrared and visible video fusion plays a critical role in intelligent surveillance and low-light monitoring. However, maintaining temporal stability while preserving spatial detail remains a fundamental challenge. Existing methods either focus on frame-wise enhancement with limited temporal modeling or rely on heavy spatio-temporal aggregation that often sacrifices high-frequency details. In this paper, we propose FTPFusion, a frequency-aware infrared and visible video fusion method based on temporal perturbation and sparse cross-modal interaction. Specifically, FTPFusion decomposes the feature representations into high-frequency and low-frequency components for collaborative modeling. The high-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion-related context and complementary details. The low-frequency branch introduces a temporal perturbation strategy to enhance robustness against complex video variations, such as flickering, jitter, and local misalignment. Furthermore, we design an offset-aware temporal consistency constraint to explicitly stabilize cross-frame representations under temporal disturbances. Extensive experiments on multiple public benchmarks demonstrate that FTPFusion consistently outperforms state-of-the-art methods across multiple metrics in both spatial fidelity and temporal consistency. The source code will be available at https://github.com/ixilai/FTPFusion.
[139] Light-ResKAN: A Parameter-Sharing Lightweight KAN with Gram Polynomials for Efficient SAR Image Recognition
Pan Yi,Weijie Li,Xiaodong Chen,Jiehua Zhang,Li Liu,Yongxiang Liu
Main category: cs.CV
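TL;DR: This paper proposes Light-ResKAN, a lightweight model based on the Kolmogorov-Arnold Network (KAN) for SAR image recognition on resource-constrained edge devices; by replacing the convolutions in ResNet with KAN convolutions, using Gram polynomial activations, and sharing parameters within channels, it achieves a better balance between accuracy and efficiency.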
Details
Motivation: Large SAR image sizes limit the deployment of deep learning on edge devices, and existing lightweight models struggle to combine high-precision feature extraction with low computational cost.
Method: 1) Replace the standard convolutions in ResNet with KAN convolutions that have learnable activations; 2) use Gram polynomials as activation functions to fit the non-linear characteristics of SAR data; 3) share parameters per channel within each convolution kernel to reduce parameter count and FLOPs.
Result: Accuracy of 99.09%, 93.01%, and 97.26% on the MSTAR, FUSAR-Ship, and SAR-ACD datasets, respectively; on MSTAR resized to 1024×1024, FLOPs are reduced by 82.90× and parameters by 163.78× relative to VGG16.
Conclusion: Light-ResKAN provides an efficient, accurate solution for SAR image recognition at the edge and demonstrates the potential of KAN architectures for remote sensing image tasks.
Abstract: Synthetic Aperture Radar (SAR) image recognition is vital for disaster monitoring, military reconnaissance, and ocean observation. However, large SAR image sizes hinder deep learning deployment on resource-constrained edge devices, and existing lightweight models struggle to balance high-precision feature extraction with low computational requirements. The emerging Kolmogorov-Arnold Network (KAN) enhances fitting by replacing fixed activations with learnable ones, reducing parameters and computation. Inspired by KAN, we propose Light-ResKAN to achieve a better balance between precision and efficiency. First, Light-ResKAN modifies ResNet by replacing convolutions with KAN convolutions, enabling adaptive feature extraction for SAR images. Second, we use Gram Polynomials as activations, which are well-suited for SAR data to capture complex non-linear relationships. Third, we employ a parameter-sharing strategy: each kernel shares parameters per channel, preserving unique features while reducing parameters and FLOPs. Our model achieves 99.09%, 93.01%, and 97.26% accuracy on MSTAR, FUSAR-Ship, and SAR-ACD datasets, respectively. Experiments on MSTAR resized to 1024×1024 show that compared to VGG16, our model reduces FLOPs by 82.90× and parameters by 163.78×. This work establishes an efficient solution for edge SAR image recognition.
[140] Lifting Unlabeled Internet-level Data for 3D Scene Understanding
Yixin Chen,Yaowei Zhang,Huangyue Yu,Junchao He,Yan Wang,Jiangyong Huang,Hongyu Shen,Junfeng Ni,Shaofei Wang,Baoxiong Jia,Song-Chun Zhu,Siyuan Huang
Main category: cs.CV
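TL;DR: This paper proposes using the large volume of unlabeled video on the web, processed by carefully designed data engines, to automatically generate training data for 3D scene understanding, compensating for scarce and expensive human annotation, and validates the approach on multiple 3D perception and reasoning tasks.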
Details
Motivation: Annotated 3D scene data is scarce and expensive, while large amounts of unlabeled video exist on the internet; a method is needed to use these free resources to improve 3D scene understanding models.
Method: Design and analyze data engines for automated data generation, identify and resolve their key bottlenecks, and combine the generated data with human-annotated data to train end-to-end 3D understanding models across perception granularities (from low-level detection and segmentation to high-level VQA and VLN).
Result: On 3D object detection, instance segmentation, spatial visual question answering (VQA), and vision-language navigation (VLN), models trained only on generated data show strong zero-shot performance, which improves further after finetuning.
Conclusion: Unlabeled web video can be efficiently converted into high-quality training data, offering a viable path toward more capable 3D scene understanding systems.
Abstract: Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.
[141] Night Eyes: A Reproducible Framework for Constellation-Based Corneal Reflection Matching
Virmarie Maquiling,Yasmeen Abdrabou,Enkelejda Kasneci
Main category: cs.CV
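TL;DR: This paper proposes a multi-glint detection and matching framework based on 2D geometry and inspired by star identification, emphasizing reproducibility and clear evaluation; its SLA procedure yields stable, identity-preserving correspondence, and the code and tools are released.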
Details
Motivation: Corneal reflection (glint) detection in existing eye tracking mostly relies on hardware-specific heuristics and lacks reproducibility; a general, structured, and evaluable approach to multi-glint processing is needed.
Method: Propose a constellation-based, 2D geometry-driven pipeline that borrows ideas from star identification; design the Similarity-Layout Alignment (SLA) procedure, which integrates over-detection, adaptive candidate fallback, appearance-aware scoring, and optional semantic layout priors, and explicitly separates detection from matching.
Result: On a public multi-LED dataset, the system provides stable, identity-preserving glint correspondence under noisy conditions; all code, presets, and evaluation scripts are released.
Conclusion: The framework improves the reproducibility and robustness of multi-glint detection and offers a modular, evaluable, and easily reproduced paradigm for P-CR eye tracking.
Abstract: Corneal reflection (glint) detection plays an important role in pupil-corneal reflection (P-CR) eye tracking, but in practice it is often handled as heuristics embedded within larger systems, making reproducibility difficult across hardware setups. We introduce a 2D geometry-driven, constellation-based pipeline for multi-glint detection and matching, focusing on reproducibility and clear evaluation. Inspired by lost-in-space star identification, we treat glints as structured constellations rather than independent blobs. We propose a Similarity-Layout Alignment (SLA) procedure which adapts constellation matching to the specific constraints of multi-LED eye tracking. The framework brings together controlled over-detection, adaptive candidate fallback, appearance-aware scoring, and optional semantic layout priors while keeping detection and correspondence explicitly separated. Evaluated on a public multi-LED dataset, the system provides stable identity-preserving correspondence under noisy conditions. We release code, presets, and evaluation scripts to enable transparent replication, comparison, and dataset annotation.
[142] Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts
Yifan Gao,Tao Zhou,Yi Zhou,Ke Zou,Yizhe Zhang,Huazhu Fu
Main category: cs.CV
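TL;DR: This paper proposes KnowMVG, a framework that improves the spatial precision of medical visual grounding (MVG) through knowledge-enhanced prompting and a global-local attention mechanism, significantly outperforming existing methods.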
Details
Motivation: Existing vision-language models (VLMs) lack explicit localization priors in medical visual grounding; relying only on latent embeddings leads to insufficient spatial precision.
Method: The KnowMVG framework combines 1) a knowledge-enhanced prompting strategy that encodes medical knowledge into compact embeddings and 2) a global-local attention mechanism that jointly uses coarse global information and fine-grained local cues to improve localization.
Result: On four MVG benchmarks, AP50 improves by 3.0% and mIoU by 2.6%, significantly surpassing the current best methods; ablation and qualitative experiments validate each module.
Conclusion: By introducing medical knowledge priors and a global-local attention design, KnowMVG bridges semantic understanding and fine-grained visual perception without adding textual reasoning overhead, improving the interpretability and clinical utility of MVG.
Abstract: Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks sufficient spatial precision, largely due to a lack of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.
[143] Learning Spatial Structure from Pre-Beamforming Per-Antenna Range-Doppler Radar Data via Visibility-Aware Cross-Modal Supervision
George Sebastian,Philipp Berthold,Bianca Forkel,Leon Pohl,Mirko Maehlisch
Main category: cs.CV
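TL;DR: This paper asks whether meaningful spatial structure can be learned directly from pre-beamforming per-antenna range-Doppler (RD) data, without the angle-domain representation that conventional beamforming builds. Experiments use a commodity automotive radar in an end-to-end, data-driven setup with a dual-chirp shared-weight encoder, and use BEV occupancy reconstruction under LiDAR-guided, visibility-aware cross-modal supervision as a probe of geometric recoverability. The results indicate that spatial structure can be recovered without explicit angle-domain construction or hand-crafted signal processing.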
Details
Motivation: Conventional automotive radar perception pipelines build angle-domain representations via beamforming; this work explores whether that step can be bypassed by learning spatial structure directly from raw per-antenna RD data, simplifying the pipeline and improving robustness and generalization.
Method: Use a 6-TX, 8-RX (48 virtual channels) CS-FMCW radar whose A/B chirp sequence gives controlled variation of the transmit aperture; train a dual-chirp shared-weight encoder end to end on pre-beamforming per-antenna RD tensors; use visibility-aware, ray-modeled LiDAR BEV occupancy as the supervision signal to evaluate geometric recoverability.
Result: Bird's-eye-view spatial structure can be recovered from pre-beamforming RD data alone; analysis across chirp configurations (A-only, B-only, A+B) and range bands shows that the transmit configuration significantly affects geometric recoverability; performance exceeds the physics-aligned baselines.
Conclusion: Spatial structure can be learned directly from pre-beamforming per-antenna RD data without explicit angle-domain construction or hand-designed signal-processing modules, offering a simpler, data-driven paradigm for radar perception.
Abstract: Automotive radar perception pipelines commonly construct angle-domain representations via beamforming before applying learning-based models. This work instead investigates a representational question: can meaningful spatial structure be learned directly from pre-beamforming per-antenna range-Doppler (RD) measurements? Experiments are conducted on a 6-TX x 8-RX (48 virtual antennas) commodity automotive radar employing an A/B chirp-sequence frequency-modulated continuous-wave (CS-FMCW) transmit scheme, in which the effective transmit aperture varies between chirps (single-TX vs. multi-TX), enabling controlled analysis of chirp-dependent transmit configurations. We operate on pre-beamforming per-antenna RD tensors using a dual-chirp shared-weight encoder trained in an end-to-end, fully data-driven manner, and evaluate spatial recoverability using bird's-eye-view (BEV) occupancy as a geometric probe rather than a performance-driven objective. Supervision is visibility-aware and cross-modal, derived from LiDAR with explicit modeling of the radar field-of-view and occlusion-aware LiDAR observability via ray-based visibility. Through chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines, we assess how transmit configurations affect geometric recoverability.
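The visibility-aware supervision described above can be illustrated with a small sketch. The abstract does not give the loss, so everything here is an assumption for illustration: a per-cell binary cross-entropy on a BEV occupancy grid, averaged only over cells that a (hypothetical) ray-based visibility mask marks as observable. The function name `visibility_masked_bce` and its arguments are made up for this example.

```python
import numpy as np

def visibility_masked_bce(pred_prob, lidar_occ, visible, eps=1e-7):
    """Mean binary cross-entropy over BEV cells marked visible.

    pred_prob: predicted occupancy probabilities in [0, 1], shape (H, W)
    lidar_occ: LiDAR-derived occupancy targets (0 or 1), shape (H, W)
    visible:   1 where the cell is observable, 0 where occluded or outside
               the field of view (assumed to come from ray casting)
    """
    p = np.clip(pred_prob, eps, 1.0 - eps)  # avoid log(0)
    bce = -(lidar_occ * np.log(p) + (1.0 - lidar_occ) * np.log(1.0 - p))
    n_visible = visible.sum()
    if n_visible == 0:
        return 0.0  # nothing observable, so no supervision signal
    # Occluded cells contribute nothing, whatever the model predicts there.
    return float((bce * visible).sum() / n_visible)
```

Under this design, predictions in occluded or out-of-view cells are neither rewarded nor penalized, which is one plausible reading of "visibility-aware" supervision.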
The results indicate that spatial structure can be learned directly from pre-beamforming per-antenna RD tensors without explicit angle-domain construction or hand-crafted signal-processing stages.
[144] Rethinking Representations for Cross-Domain Infrared Small Target Detection: A Generalizable Perspective from the Frequency Domain
Yimin Fu,Songbo Wang,Feiyan Wu,Jialin Lyu,Zhunga Liu,Michael K. Ng
Main category: cs.CV
TL;DR: 本文提出了一种面向跨域红外小目标检测的时空谱协同感知网络(S²CPNet),通过频域视角分析域差异,设计相位校正模块(PRM)、正交注意力机制(OAM)和选择性风格重组(SSR)以提升模型在未见域上的泛化能力,并在多个数据集上达到SOTA性能。
Details
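Motivation: Most existing infrared small target detection methods are limited to same-domain settings and cope poorly with distribution shifts between training and test data caused by changes in observation conditions and environment; in addition, infrared small targets have low signal-to-noise ratio and indistinct features, so models tend to overfit to source-domain patterns and degrade severely when deployed across domains.
Method: The spatial-spectral collaborative perception network S²CPNet 1) shows from a frequency-domain perspective that domain discrepancies mainly appear as spectral phase inconsistencies, and designs a phase rectification module (PRM) to make target awareness more generalizable; 2) adds an orthogonal attention mechanism (OAM) in the skip connections to preserve positional information while refining representations; 3) uses selective style recomposition (SSR) to reduce bias toward domain-specific patterns.
Result: In extensive cross-domain experiments on three infrared small target detection datasets, the method consistently achieves the best performance across cross-domain settings, clearly outperforming existing methods.
Conclusion: Frequency-domain modeling exposes the nature of cross-domain discrepancies; combining phase rectification, orthogonal attention, and style recomposition markedly improves the domain generalization of infrared small target detection models and provides a more robust solution for deployment.
Abstract: The accurate target-background separation in infrared small target detection (IRSTD) highly depends on the discriminability of extracted representations. However, most existing methods are confined to domain-consistent settings, while overlooking whether such discriminability can generalize to unseen domains. In practice, distribution shifts between training and testing data are inevitable due to variations in observational conditions and environmental factors. Meanwhile, the intrinsic indistinctiveness of infrared small targets aggravates overfitting to domain-specific patterns. Consequently, the detection performance of models trained on source domains can be severely degraded when deployed in unseen domains. To address this challenge, we propose a spatial-spectral collaborative perception network (S²CPNet) for cross-domain IRSTD. Moving beyond conventional spatial learning pipelines, we rethink IRSTD representations from a frequency perspective and reveal inconsistencies in spectral phase as the primary manifestation of domain discrepancies. Based on this insight, we develop a phase rectification module (PRM) to derive generalizable target awareness. Then, we employ an orthogonal attention mechanism (OAM) in skip connections to preserve positional information while refining informative representations. Moreover, the bias toward domain-specific patterns is further mitigated through selective style recomposition (SSR).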
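The frequency-domain view above rests on separating an image's spectrum into amplitude and phase. The sketch below shows only that standard decomposition with NumPy's 2D FFT; it is not the PRM or SSR, whose internals the abstract does not describe, and the helper names `split_amplitude_phase` and `recombine` are hypothetical.

```python
import numpy as np

def split_amplitude_phase(image):
    """Return the amplitude and phase of the 2D Fourier spectrum."""
    spectrum = np.fft.fft2(image)
    return np.abs(spectrum), np.angle(spectrum)

def recombine(amplitude, phase):
    """Rebuild a real image from an amplitude and a phase spectrum."""
    spectrum = amplitude * np.exp(1j * phase)
    # The imaginary part is numerical noise when both inputs come from the
    # same real image; for mixed inputs, taking the real part is a choice.
    return np.real(np.fft.ifft2(spectrum))
```

A common way to probe this kind of claim is to recombine the phase of one image with the amplitude of another, e.g. `recombine(amp_b, phase_a)`, and inspect which source the result resembles. That experiment is offered as an assumption about how one might explore the idea, not as the paper's procedure.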
Extensive experiments have been conducted on three IRSTD datasets, and the proposed method consistently achieves state-of-the-art performance under diverse cross-domain settings.
[145] Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm
Sixing Li,Zhibin Gu,Ziqi Zhang,Weiguo Pan,Bing Li,Ying Wang,Hongzhe Liu
Main category: cs.CV
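TL;DR: This paper introduces the ECAC benchmark dataset and the RSRS hybrid training framework to improve the accuracy and professionalism of image captioning for early childhood education, and develops the KinderMM-Cap-3B model, which significantly outperforms existing methods on metrics such as teaching toy recognition.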
Details
Motivation: Existing image captioning methods face two challenges in early childhood education (ECE): the lack of large-scale, domain-specific datasets leads to generic, imprecise descriptions; and conventional supervised or reinforcement learning paradigms do little to improve fine-grained naming of professional objects (such as teaching toys).
Method: Build ECAC, a large-scale ECE image captioning benchmark (about 256,000 real-world images with expert annotations), and design TTS (Teaching Toy Recognition Score), an education-oriented evaluation metric; propose the RSRS hybrid training framework, which switches dynamically between reinforcement learning and supervised fine-tuning and reroutes hard samples with zero reward to supervised optimization to mitigate advantage collapse; train KinderMM-Cap-3B, a domain-adapted multimodal large language model, using ECAC and RSRS.
Result: KinderMM-Cap-3B reaches a TTS of 51.06, significantly better than existing state-of-the-art methods, while maintaining high-quality natural language descriptions.
Conclusion: The ECAC dataset and the RSRS training framework improve the professionalism and fine-grained recognition of ECE image captioning, and KinderMM-Cap-3B shows potential for practical uses such as automated educational assessment.
Abstract: Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model's ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, as supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC is further equipped with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), to explicitly measure professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model.
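The reward-conditional routing described above can be sketched in a few lines. The abstract does not name the RL algorithm or define the advantage, so this example assumes a group-normalized advantage (in the style of GRPO) and non-negative rewards; `group_advantages` and `route_samples` are hypothetical names, not the authors' API.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for several rollouts of one sample.

    If every rollout earns the same reward, all advantages are zero and the
    sample gives no learning signal (the assumed "advantage collapse").
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def route_samples(reward_groups):
    """Split samples into an RL batch and a supervised fine-tuning batch.

    reward_groups maps a sample id to the rewards of its rollouts. Samples
    whose rollouts all scored zero go to SFT; the rest stay in RL together
    with their advantages.
    """
    rl_batch, sft_batch = [], []
    for sample_id, rewards in reward_groups.items():
        if max(rewards) == 0:
            sft_batch.append(sample_id)
        else:
            rl_batch.append((sample_id, group_advantages(rewards)))
    return rl_batch, sft_batch
```

For example, a sample whose rollouts scored `[0.0, 0.0, 0.0]` is sent to the supervised batch, where the expert caption can still provide a gradient, while a sample scored `[0.0, 1.0]` stays in the RL batch.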
Extensive experiments demonstrate that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications.
[146] A Self supervised learning framework for imbalanced medical imaging datasets
Yash Kumar Sharma,Charan Ramtej Kodi,Vineet Padmanabhan
Main category: cs.CV
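TL;DR: This paper proposes AMIMV, which builds asymmetric multi-image, multi-view pairs and combines them with self-supervised learning to address data scarcity and class imbalance in medical image classification, and validates the approach on the MedMNIST datasets.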
Details
Motivation: Medical image analysis often faces two challenges, scarce labeled data and class imbalance; existing self-supervised learning methods ease data scarcity, but their robustness to class imbalance is under-studied. Method: Extend the previously proposed MIMV method with a new augmentation strategy that constructs asymmetric multi-image, multi-view (AMIMV) pairs; analyze robustness under varying degrees of class imbalance; evaluate eight representative self-supervised methods on 11 MedMNIST datasets under long-tailed distributions and limited supervision. Result: On MedMNIST, AMIMV yields improvements of 4.25% on retinaMNIST, 1.88% on tissueMNIST, and 3.1% on DermaMNIST. Conclusion: AMIMV effectively mitigates data scarcity and class imbalance in medical image classification and shows good robustness and generalization across multiple SSL methods and long-tailed settings. Abstract: Two problems often plague medical imaging analysis: 1) Non-availability of large quantities of labeled training data, and 2) Dealing with imbalanced data, i.e., abundant data are available for frequent classes, whereas data are highly limited for the rare class. Self supervised learning (SSL) methods have been proposed to deal with the first problem to a certain extent, but the issue of investigating the robustness of SSL to imbalanced data has rarely been addressed in the domain of medical image classification. In this work, we make the following contributions: 1) The MIMV method proposed by us in an earlier work is extended with a new augmentation strategy to construct asymmetric multi-image, multi-view (AMIMV) pairs to address both data scarcity and dataset imbalance in medical image classification. 2) We carry out a data analysis to evaluate the robustness of AMIMV under varying degrees of class imbalance in medical imaging. 3) We evaluate eight representative SSL methods on 11 medical imaging datasets (MedMNIST) under long-tailed distributions and limited supervision. Our experimental results on the MedMNIST dataset show an improvement of 4.25% on retinaMNIST, 1.88% on tissueMNIST, and 3.1% on DermaMNIST.
[147] MAVFusion: Efficient Infrared and Visible Video Fusion via Motion-Aware Sparse Interaction
Xilai Li,Weijun Jiang,Xiaosong Li,Yang Liu,Hongbin Wang,Tao Ye,Huafeng Li,Haishu Tan
Main category: cs.CV
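TL;DR: This paper proposes MAVFusion, an end-to-end infrared and visible video fusion framework whose motion-aware sparse interaction mechanism substantially improves computational efficiency while preserving high fusion quality.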
Details
Motivation: Most existing methods are designed for static images and cannot handle frame-to-frame motion in videos; current video fusion methods improve temporal consistency but at high computational cost. Method: Use optical flow to identify dynamic regions in multi-modal sequences and adaptively apply costly cross-modal attention only to these sparse regions; use a lightweight weak interaction module for static background regions; decouple the processing of dynamic and static regions. Result: Achieves state-of-the-art performance on multiple infrared and visible video benchmarks, with an inference speed of 14.16 FPS at 640×480 resolution. Conclusion: MAVFusion greatly improves inference efficiency while maintaining strong fusion quality and temporal consistency, validating the motion-aware sparse interaction strategy. Abstract: Infrared and visible video fusion combines the object saliency from infrared images with the texture details from visible images to produce semantically rich fusion results. However, most existing methods are designed for static image fusion and cannot effectively handle frame-to-frame motion in videos. Current video fusion methods improve temporal consistency by introducing interactions across frames, but they often incur high computational cost. To mitigate these challenges, we propose MAVFusion, an end-to-end video fusion framework featuring a motion-aware sparse interaction mechanism that enhances efficiency while maintaining superior fusion quality. Specifically, we leverage optical flow to identify dynamic regions in multi-modal sequences, adaptively allocating computationally intensive cross-modal attention to these sparse areas to capture salient transitions and facilitate inter-modal information exchange. For static background regions, a lightweight weak interaction module is employed to maintain structural and appearance integrity. By decoupling the processing of dynamic and static regions, MAVFusion simultaneously preserves temporal consistency and fine-grained details while significantly accelerating inference. Extensive experiments demonstrate that MAVFusion achieves state-of-the-art performance on multiple infrared and visible video benchmarks, achieving a speed of 14.16\,FPS at $640 \times 480$ resolution. The source code will be available at https://github.com/ixilai/MAVFusion.
[148] Automated Prostate Gland Segmentation in MRI Using nnU-Net
Pablo Rodriguez-Belenguer,Gloria Ribas,Javier Aquerreta Escribano,Rafael Moreno-Calatayud,Leonor Cerda-Alberich,Luis Marti-Bonmati
Main category: cs.CV
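TL;DR: This paper proposes a dedicated deep learning method based on the nnU-Net v2 framework that uses multimodal mpMRI (T2, DWI, ADC) to segment the prostate gland automatically. It achieves strong performance in both internal cross-validation and external validation (Dice of 0.96 and 0.82, respectively), far above the general-purpose tool TotalSegmentator (Dice = 0.15).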
Details
Motivation: Manual delineation of the prostate is time-consuming and subject to inter-observer variability, and general-purpose segmentation tools are not accurate enough for prostate-specific tasks. Method: Train with the nnU-Net v2 framework on multimodal mpMRI data combining T2WI, DWI, and ADC; use 981 cases with whole-gland annotations from the PI-CAI dataset, and validate with 5-fold cross-validation and an external dataset of 54 cases from Hospital La Fe. Result: The mean Dice is 0.96 ± 0.00 in cross-validation and 0.82 on the external test set; compared with TotalSegmentator (Dice = 0.15), the method is substantially better, in particular avoiding under-segmentation. Conclusion: Task-specific, multimodal deep learning strategies are crucial for prostate segmentation; the model has been containerized and released as a ready-to-use inference tool, with potential for adoption in clinical research. Abstract: Accurate segmentation of the prostate gland in multiparametric MRI (mpMRI) is a fundamental step for a wide range of clinical and research applications, including image registration, volume estimation, and radiomic analysis. However, manual delineation is time-consuming and subject to inter-observer variability, while general-purpose segmentation tools often fail to provide sufficient accuracy for prostate-specific tasks. In this work, we propose a dedicated deep learning-based approach for automatic prostate gland segmentation using the nnU-Net v2 framework. The model leverages multimodal mpMRI data, including T2-weighted imaging, diffusion-weighted imaging (DWI), and apparent diffusion coefficient (ADC) maps, to exploit complementary tissue information. Training was performed on 981 cases from the PI-CAI dataset using whole-gland annotations, and model performance was assessed through 5-fold cross-validation and external validation on an independent cohort of 54 patients from Hospital La Fe. The proposed model achieved a mean Dice score of 0.96 ± 0.00 in cross-validation and 0.82 on the external test set, demonstrating strong generalization despite domain shift. In comparison, a general-purpose approach (TotalSegmentator) showed substantially lower performance, with a Dice score of 0.15, primarily due to under-segmentation of the gland. These results highlight the importance of task-specific, multimodal segmentation strategies and demonstrate the potential of the proposed approach for reliable integration into clinical research workflows.
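For readers unfamiliar with the metric, the Dice score reported above is twice the overlap between the predicted and reference masks divided by the sum of their sizes. The snippet below is a minimal sketch on flattened binary masks, not the evaluation code used in the paper; the function name `dice_score` and the toy masks are illustrative assumptions. It also shows how under-segmentation, the failure mode attributed to the general-purpose tool, drives the score down.

```python
def dice_score(pred, target):
    """Dice coefficient for two flattened binary masks of equal length.

    Dice = 2 * |pred AND target| / (|pred| + |target|).
    Two empty masks are treated as a perfect match (a common convention,
    assumed here).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    pred = [bool(p) for p in pred]
    target = [bool(t) for t in target]
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    reference = [1] * 10 + [0] * 10            # toy gland: 10 voxels
    under_segmented = [1] * 1 + [0] * 19       # only 1 voxel found
    print(dice_score(reference, reference))        # perfect overlap
    print(dice_score(under_segmented, reference))  # 2*1/(1+10)
```

Because the denominator includes the full reference volume, a prediction that finds only a small part of the gland scores low even when every predicted voxel is correct.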
To facilitate reproducibility and deployment, the model has been fully containerized and is available as a ready-to-use inference tool.
[149] Ego-Grounding for Personalized Question-Answering in Egocentric Videos
Junbin Xiao,Shenglang Zhang,Pengxiang Zhu,Angela Yao
Main category: cs.CV
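TL;DR: This paper presents the first systematic analysis of multimodal large language models (MLLMs) on personalized question answering in egocentric videos (ego-grounding) and builds MyEgo, the first benchmark dataset for the task. It finds that existing MLLMs fall well short in understanding, remembering, and reasoning about "me", lagging far behind humans especially in long-term memory and self-localization.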
Details
Motivation: Existing MLLMs lack the ability to understand the camera wearer ("me") in egocentric videos, so their ego-grounding ability in personalized question answering needs systematic evaluation. Method: Build MyEgo, the first egocentric VideoQA dataset of this kind (541 long videos, 5K personalized questions), covering three question types: "my things", "my activities", and "my past"; benchmark and ablate a range of mainstream MLLMs (open-source and proprietary, with and without reasoning, at different scales). Result: All mainstream MLLMs perform far below humans (at most 46% versus roughly 86%), and neither model scaling nor explicit reasoning brings consistent gains; providing explicit evidence improves performance briefly, but the gain fades quickly over time, revealing weak long-term self-tracking and memory. Conclusion: Ego-grounding and long-range memory are the key bottlenecks for personalized question answering in egocentric videos, and MyEgo provides an important benchmark and resource for this direction. Abstract: We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only about 46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo
[150] SDesc3D: Towards Layout-Aware 3D Indoor Scene Generation from Short Descriptions
Jie Feng,Jiawei Shen,Junjia Huang,Junpeng Zhang,Mingtao Feng,Weisheng Dong,Guanbin Li
Main category: cs.CV
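TL;DR: This paper proposes the SDesc3D framework, which generates physically plausible, richly detailed 3D indoor scenes from short text through multi-view structural prior augmentation, functionality-aware layout grounding, and an iterative reflection-rectification mechanism.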
Details
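Motivation: Existing text-driven 3D scene generation methods show poor physical plausibility and insufficient detail under short-text conditions, mainly because they rely heavily on explicit semantic relation cues and lack effective 3D reasoning (such as prior integration and spatial anchoring). Method: Propose the SDesc3D framework: 1) multi-view scene prior augmentation, which maps sparse text to multi-view relational priors; 2) functionality-aware layout grounding, which uses regional functionality to define spatial anchors implicitly and reasons about layout hierarchically; 3) an iterative reflection-rectification mechanism for progressive refinement of structural plausibility. Result: Substantially outperforms existing methods on short-text-driven 3D indoor scene generation, improving physical plausibility and detail richness. Conclusion: A 3D reasoning mechanism that fuses multi-view priors with functional semantics can effectively mitigate semantic condensation under short-text guidance, offering a more robust and finer-grained generation scheme for interactive 3D environment construction. Abstract: 3D indoor scene generation conditioned on short textual descriptions provides a promising avenue for interactive 3D environment construction without the need for labor-intensive layout specification. Despite recent progress in text-conditioned 3D scene generation, existing works suffer from poor physical plausibility and insufficient detail richness in such semantic condensation cases, largely due to their reliance on explicit semantic cues about compositional objects and their spatial relationships. This limitation highlights the need for enhanced 3D reasoning capabilities, particularly in terms of prior integration and spatial anchoring. Motivated by this, we propose SDesc3D, a short-text conditioned 3D indoor scene generation framework that leverages multi-view structural priors and regional functionality implications to enable 3D layout reasoning under sparse textual guidance. Specifically, we introduce a Multi-view scene prior augmentation that enriches underspecified textual inputs with aggregated multi-view structural knowledge, shifting from inaccessible semantic relation cues to multi-view relational prior aggregation.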
Building on this, we design a Functionality-aware layout grounding, employing regional functionality grounding for implicit spatial anchors and conducting hierarchical layout reasoning to enhance scene organization and semantic plausibility. Furthermore, an Iterative reflection-rectification scheme is employed for progressive structural plausibility refinement via self-rectification. Extensive experiments show that our method outperforms existing approaches on short-text conditioned 3D indoor scene generation. Code will be publicly available.
[151] NearID: Identity Representation Learning via Near-identity Distractors
Aleksandar Cvejic,Rameen Abdal,Abdelrahman Eldesokey,Bernard Ghanem,Peter Wonka
Main category: cs.CV
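TL;DR: This paper proposes the NearID framework, which constructs Near-identity distractors to disentangle identity from background context and make vision encoders more robust on identity-related tasks. It builds the NearID dataset with a strict evaluation protocol and proposes a two-tier contrastive objective that learns identity-aware representations on a frozen backbone, markedly improving identity discrimination and agreement with human judgments.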
Details
Motivation: In identity-focused tasks, existing vision encoders entangle object identity with background context, making representations and evaluation metrics unreliable. Method: Propose the Near-identity (NearID) distractor construction paradigm, build the NearID dataset with 19K identities and 316K matched-context distractors, and design a strict margin-based evaluation protocol; on this basis, learn identity-aware representations on a frozen backbone with a two-tier contrastive objective (same identity > NearID distractor > random negative). Result: Pre-trained encoders reach a Sample Success Rate (SSR) as low as 30.7% under the NearID protocol; the proposed method raises SSR to 99.2%, improves part-level discrimination by 28.0%, and aligns better with human judgments on DreamBench++. Conclusion: NearID provides the first principled benchmark and methodological framework for evaluating identity representations, exposing and correcting a fundamental weakness of mainstream encoders in identity discrimination and supporting trustworthy evaluation and modeling for personalized generation and editing. Abstract: When evaluating identity-focused tasks such as personalized generation and image editing, existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics. We introduce the first principled framework to address this vulnerability using Near-identity (NearID) distractors, where semantically similar but distinct instances are placed on the exact same background as a reference image, eliminating contextual shortcuts and isolating identity as the sole discriminative signal. Based on this principle, we present the NearID dataset (19K identities, 316K matched-context distractors) together with a strict margin-based evaluation protocol. Under this setting, pre-trained encoders perform poorly, achieving Sample Success Rates (SSR), a strict margin-based identity discrimination metric, as low as 30.7% and often ranking distractors above true cross-view matches. We address this by learning identity-aware representations on a frozen backbone using a two-tier contrastive objective enforcing the hierarchy: same identity > NearID distractor > random negative. This improves SSR to 99.2%, enhances part-level discrimination by 28.0%, and yields stronger alignment with human judgments on DreamBench++, a human-aligned benchmark for personalization. Project page: https://gorluxor.github.io/NearID/
[152] Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation
Yuqing Huang,Guotian Zeng,Zhenqiao Yuan,Zhenyu He,Xin Li,Yaowei Wang,Ming-Hsuan Yang
Main category: cs.CV
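TL;DR: This paper proposes Interactive Tracking, a new paradigm in which humans intervene in real time through natural language commands, and builds InteractTrack, the first large-scale benchmark, together with an evaluation protocol and the baseline model IMAT.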
Details
Motivation: Existing visual trackers are non-interactive and ill-suited to real-world scenarios that need human intervention, so adaptive tracking methods that support a human in the loop are needed. Method: Build InteractTrack, a large-scale benchmark of 150 videos with timestamped language instructions; design a comprehensive evaluation protocol; propose IMAT, a baseline model with a dynamic memory mechanism that learns from and responds to user feedback. Result: The 25 mainstream trackers evaluated degrade markedly in interactive scenarios, and conventional state-of-the-art methods do not transfer; IMAT shows stronger adaptability and ability to learn from feedback. Conclusion: Interactive tracking is a key direction toward intelligent, collaborative perception systems, and the benchmark, protocol, and baseline presented here lay a foundation for the field. Abstract: Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. To support research in this direction, we make three main contributions. First, we present InteractTrack, the first large-scale benchmark for interactive tracking, containing 150 videos with dense bounding box annotations and timestamped language instructions. Second, we propose a comprehensive evaluation protocol and evaluate 25 representative trackers, showing that state-of-the-art methods fail in interactive scenarios; strong performance on conventional benchmarks does not transfer. Third, we introduce Interactive Memory-Augmented Tracking (IMAT), a new baseline that employs a dynamic memory mechanism to learn from user feedback and update tracking behavior accordingly. Our benchmark, protocol, and baseline establish a foundation for developing more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance. The full benchmark, tracking results, and analysis are available at https://github.com/NorahGreen/InteractTrack.git.
[153] Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models
Antoine Saporta,Baptiste Callard,Corentin Dancette,Julien Khlaut,Charles Corbière,Leo Butsanets,Amaury Prat,Pierre Manceron
Main category: cs.CV
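TL;DR: This paper proposes Curia-2, a multi-modal foundation model optimized for CT and MRI that scales to billion-parameter size, improves the pre-training strategy and representation quality, and restructures the CuriaBench evaluation benchmark into separate 2D and 3D tracks.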
Details
Motivation: The rapid growth of medical imaging is increasing radiologists' workload, and existing foundation models still leave room for improvement on complex radiological images such as CT and MRI. Method: Building on the Curia framework, propose Curia-2: improve the pre-training strategy and support scaling to billion-parameter Vision Transformers; restructure CuriaBench into a 2D (slice-level) track and a 3D (volumetric) track. Result: Curia-2 outperforms existing foundation models across vision-focused tasks and is competitive with vision-language models on clinically complex tasks such as finding detection; the model weights will be released. Conclusion: Through architecture scaling, pre-training improvements, and standardized evaluation, Curia-2 improves the performance and practicality of medical imaging foundation models and moves the field toward more reliable and reproducible work. Abstract: The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training for CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively against vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.
[154] Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
Boyang Gong,Yu Zheng,Fanye Kong,Jie Zhou,Jiwen Lu
Main category: cs.CV
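TL;DR: This paper proposes Inertia-aware Visual Excitation (IVE), a training-free method that counters the "inertia" of visual attention in multimodal large language models (MLLMs), improving their dynamic reasoning about inter-object relations and reducing cognitive hallucinations.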
Details
Motivation: Existing hallucination mitigation methods mainly target perceptual hallucinations (errors about object existence or attributes) and struggle with cognitive hallucinations that require reasoning about inter-object relations; the authors find that visual attention in MLLMs becomes static early in decoding ("visual inertia"), which hinders compositional cognitive reasoning. Method: Identify visual inertia through token-wise attention analysis; propose the training-free IVE method, which 1) dynamically selects visual tokens that are newly emerging relative to historical attention trends, and 2) introduces an inertia-aware penalty that discourages over-concentration of attention and its persistence in localized regions. Result: IVE improves performance across multiple base MLLMs and hallucination benchmarks, particularly on cognitive hallucination tasks, without additional training. Conclusion: Dynamic responsiveness of visual attention is key to cognitive reasoning; by modeling and breaking attention inertia, IVE offers a new approach to improving the interpretability and reliability of MLLMs. Abstract: Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
[155] Resonance4D: Frequency-Domain Motion Supervision for Preset-Free Physical Parameter Learning in 4D Dynamic Physical Scene Simulation
Changshe Zhang,Jie Feng,Siyu Chen,Guanbin Li,Ronghua Shang,Junpeng Zhang
Main category: cs.CV
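TL;DR: Resonance4D is a lightweight, physics-driven 4D dynamic simulation framework that couples 3D Gaussian Splatting with the Material Point Method. With Dual-domain Motion Supervision (DMS) and a full-parameter physical recovery strategy, it substantially reduces compute and memory cost while preserving physical realism and motion consistency.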
Details
Motivation: Existing methods rely on costly video diffusion or optical-flow supervision and optimize only some material parameters, which makes complex materials and dynamic scenes hard to handle. Method: Propose the Resonance4D framework, which couples 3D Gaussian Splatting with the Material Point Method; introduce Dual-domain Motion Supervision (DMS), which jointly constrains spatial structural consistency and frequency-domain spectral consistency; combine zero-shot text-prompted segmentation with simulation-guided initialization to decompose Gaussians into object-part-level regions and jointly optimize the full set of material parameters. Result: In experiments on synthetic and real scenes, Resonance4D achieves high physical fidelity and motion consistency, reducing peak GPU memory from over 35 GB to around 20 GB so that it runs on a single consumer-grade GPU. Conclusion: With lightweight yet physically expressive supervision and full-parameter modeling, Resonance4D eases the trade-off between computational efficiency and physical realism in 4D dynamic simulation. Abstract: Physics-driven 4D dynamic simulation from static 3D scenes remains constrained by an overlooked contradiction: reliable motion supervision often relies on online video diffusion or optical-flow pipelines whose computational cost exceeds that of the simulator itself. Existing methods further simplify inverse physical modeling by optimizing only partial material parameters, limiting realism in scenes with complex materials and dynamics. We present Resonance4D, a physics-driven 4D dynamic simulation framework that couples 3D Gaussian Splatting with the Material Point Method through lightweight yet physically expressive supervision. Our key insight is that dynamic consistency can be enforced without dense temporal generation by jointly constraining motion in complementary domains. To this end, we introduce Dual-domain Motion Supervision (DMS), which combines spatial structural consistency for local deformation with frequency-domain spectral consistency for oscillatory and global dynamic patterns, substantially reducing training cost and memory overhead while preserving physically meaningful motion cues. To enable stable full-parameter physical recovery, we further combine zero-shot text-prompted segmentation with simulation-guided initialization to automatically decompose Gaussians into object-part-level regions and support joint optimization of full material parameters.
Experiments on both synthetic and real scenes show that Resonance4D achieves strong physical fidelity and motion consistency while reducing peak GPU memory from over 35\,GB to around 20\,GB, enabling high-fidelity physics-driven 4D simulation on a single consumer-grade GPU.
[156] MTLSI-Net: A Linear Semantic Interaction Network for Parameter-Efficient Multi-Task Dense Prediction
Chen Liu,Hengyu Man,Xiaopeng Fan,Debin Zhao
Main category: cs.CV
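TL;DR: This paper proposes MTLSI-Net, which uses linear attention to achieve low-complexity cross-task semantic interaction in multi-task dense prediction and reaches state-of-the-art results on NYUDv2 and PASCAL-Context.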
Details
Motivation: Standard self-attention has quadratic complexity on high-resolution features, making it hard to model global cross-task interactions efficiently in multi-task dense prediction. Method: Propose MTLSI-Net with three core modules: a Multi-Task Multi-scale Query Linear Fusion Block (a shared global context matrix enables cross-task modeling with linear complexity), a Semantic Token Distiller (compresses redundant features and distills essential cross-task knowledge), and a Cross-Window Integrated attention Block (a dual-branch design injects global semantics, balancing consistency and spatial precision). Result: Achieves state-of-the-art performance on the NYUDv2 and PASCAL-Context datasets, confirming the method's advantages in both effectiveness and efficiency. Conclusion: MTLSI-Net models comprehensive cross-task interactions efficiently with linear complexity and fewer parameters, offering a new approach to multi-task dense prediction. Abstract: Multi-task dense prediction aims to perform multiple pixel-level tasks simultaneously. However, capturing global cross-task interactions remains non-trivial due to the quadratic complexity of standard self-attention on high-resolution features. To address this limitation, we propose a Multi-Task Linear Semantic Interaction Network (MTLSI-Net), which facilitates cross-task interaction through linear attention. Specifically, MTLSI-Net incorporates three key components: a Multi-Task Multi-scale Query Linear Fusion Block, which captures cross-task dependencies across multiple scales with linear complexity using a shared global context matrix; a Semantic Token Distiller that compresses redundant features into compact semantic tokens, distilling essential cross-task knowledge; and a Cross-Window Integrated attention Block that injects global semantics into local features via a dual-branch architecture, preserving both global consistency and spatial precision. These components collectively enable the network to capture comprehensive cross-task interactions at linear complexity with reduced parameters. Extensive experiments on NYUDv2 and PASCAL-Context demonstrate that MTLSI-Net achieves state-of-the-art performance, validating its effectiveness and efficiency in multi-task learning.
[157] ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction
Sirshapan Mitra,Yogesh S. Rawat
Main category: cs.CV
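TL;DR: This paper proposes the ProDiG framework, which uses progressive Gaussian Splatting refinement with diffusion guidance to produce high-quality ground-level views and a coherent 3D scene model from aerial-only imagery, without multi-altitude ground-truth data.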
Details
Motivation: Under extreme viewpoint changes, missing intermediate observations, and large scale differences, existing methods struggle to produce geometrically consistent ground-level views and 3D models; post-hoc refinement tends to introduce distortion, and multi-altitude ground truth is very hard to obtain. Method: Propose ProDiG (Progressive Altitude Gaussian Splatting): 1) progressively synthesize intermediate-altitude views; 2) use a geometry-aware causal attention module to inject epipolar structure into reference-view diffusion; 3) use a distance-adaptive Gaussian module to adjust Gaussian scale and opacity dynamically. Result: Substantially outperforms existing methods on synthetic and real-world datasets in visual quality, geometric consistency, and robustness to extreme viewpoint changes. Conclusion: ProDiG achieves progressive, geometrically grounded reconstruction without additional ground-truth viewpoints, offering a new paradigm for cross-scale 3D modeling from aerial to ground views. Abstract: Generating ground-level views and coherent 3D site models from aerial-only imagery is challenging due to extreme viewpoint changes, missing intermediate observations, and large scale variations. Existing methods either refine renderings post-hoc, often producing geometrically inconsistent results, or rely on multi-altitude ground-truth, which is rarely available. Gaussian Splatting and diffusion-based refinements improve fidelity under small variations but fail under wide aerial-to-ground gaps. To address these limitations, we introduce ProDiG (Progressive Altitude Gaussian Splatting), a diffusion-guided framework that progressively transforms aerial 3D representations toward ground-level fidelity. ProDiG synthesizes intermediate-altitude views and refines the Gaussian representation at each stage using a geometry-aware causal attention module that injects epipolar structure into reference-view diffusion. A distance-adaptive Gaussian module dynamically adjusts Gaussian scale and opacity based on camera distance, ensuring stable reconstruction across large viewpoint gaps. Together, these components enable progressive, geometrically grounded refinement without requiring additional ground-truth viewpoints. Extensive experiments on synthetic and real-world datasets demonstrate that ProDiG produces visually realistic ground-level renderings and coherent 3D geometry, significantly outperforming existing approaches in terms of visual quality, geometric consistency, and robustness to extreme viewpoint changes.
[158] Test-Time Adaptation for Height Completion via Self-Supervised ViT Features and Monocular Foundation Models
Osher Rafaeli,Tal Svoray,Ariel Nahlieli
Main category: cs.CV
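TL;DR: This paper proposes Prior2DSM, a training-free framework for digital surface model (DSM) completion. It uses DINOv3 visual features and a monocular depth foundation model, and at test time combines semantic feature-space matching with lightweight LoRA + MLP adaptive calibration to produce accurate, metrically consistent DSM completion.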
Details
Motivation: Existing DSMs often contain missing or outdated regions; traditional interpolation relies on spatial continuity and fails when it does not hold, while learning-based methods are limited by supervised data and generalization. Method: Combine self-supervised ViT features from DINOv3 with a monocular depth foundation model and propagate height information at test time through semantic feature-space correspondence; perform test-time adaptation with LoRA and a lightweight MLP that predicts spatially varying scale and shift parameters to convert relative depth into metric height. Result: Compared with interpolation, prior-based rescaling, and state-of-the-art monocular depth models, the method markedly reduces reconstruction error (RMSE reduced by up to 46%), preserves structural fidelity, and supports DSM updating and coupled RGB-DSM generation. Conclusion: Prior2DSM is a general, training-free, test-time adaptive paradigm for DSM completion that removes the dependence on supervised data and sensor-specific training and improves cross-domain robustness and practicality. Abstract: Accurate digital surface models (DSMs) are essential for many geospatial applications, including urban monitoring, environmental analyses, infrastructure management, and change detection. However, large-scale DSMs frequently contain incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or changes in the built environment. Traditional height completion approaches primarily rely on spatial interpolation, which assumes spatial continuity and therefore fails when objects are missing. Recent learning-based approaches improve reconstruction quality but typically require supervised training on sensor-specific datasets, limiting their generalization across domains and sensing conditions. We propose Prior2DSM, a training-free framework for metric DSM completion that operates entirely at test time by leveraging foundation models. Unlike previous height completion approaches that require task-specific training, the proposed method combines self-supervised Vision Transformer (ViT) features from DINOv3 with monocular depth foundation models to propagate metric information from incomplete height priors through semantic feature-space correspondence. Test-time adaptation (TTA) is performed using parameter-efficient low-rank adaptation (LoRA) together with a lightweight multilayer perceptron (MLP), which predicts spatially varying scale and shift parameters to convert relative depth estimates into metric heights. Experiments demonstrate consistent improvements over interpolation-based methods, prior-based height rescaling approaches, and state-of-the-art monocular depth estimation models.
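To make the scale-and-shift idea concrete, the sketch below fits a single global scale `s` and shift `t` by least squares so that `height ≈ s * depth + t` on pixels where a height prior exists, then fills the missing pixels. This corresponds to a plain global linear fit of relative depth, not to Prior2DSM itself, which predicts spatially varying parameters with an adapted network; the function names and the toy values are illustrative assumptions.

```python
def fit_scale_shift(depth, height):
    """Closed-form least squares for height ~ s * depth + t.

    `depth` and `height` are equal-length lists of floats taken only
    from pixels where the height prior is known.
    """
    n = len(depth)
    if n < 2 or n != len(height):
        raise ValueError("need at least two matched depth/height pairs")
    mean_d = sum(depth) / n
    mean_h = sum(height) / n
    var_d = sum((d - mean_d) ** 2 for d in depth)
    if var_d == 0:
        raise ValueError("depth values are constant; scale is undefined")
    cov = sum((d - mean_d) * (h - mean_h) for d, h in zip(depth, height))
    s = cov / var_d
    t = mean_h - s * mean_d
    return s, t


def complete_heights(depth, prior):
    """Fill missing heights (None) using the fitted global scale/shift."""
    known = [(d, h) for d, h in zip(depth, prior) if h is not None]
    s, t = fit_scale_shift([d for d, _ in known], [h for _, h in known])
    return [h if h is not None else s * d + t for d, h in zip(depth, prior)]


if __name__ == "__main__":
    relative_depth = [0.1, 0.2, 0.3, 0.4]    # unitless model output
    height_prior = [12.0, None, 16.0, None]  # metres; None = missing
    print(complete_heights(relative_depth, height_prior))
```

A single global pair `(s, t)` cannot account for errors that differ from region to region, which is the motivation the abstract gives for predicting spatially varying scale and shift instead.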
Prior2DSM reduces reconstruction error while preserving structural fidelity, achieving up to a 46% reduction in RMSE compared to linear fitting of MDE, and further enables DSM updating and coupled RGB-DSM generation.
[159] Decouple and Rectify: Semantics-Preserving Structural Enhancement for Open-Vocabulary Remote Sensing Segmentation
Jie Feng,Fengze Li,Junpeng Zhang,Siyu Chen,Yuping Liang,Junying Chen,Ronghua Shang
Main category: cs.CV
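TL;DR: This paper proposes the DR-Seg framework, which decouples CLIP features into semantics-dominated and structure-dominated subspaces and uses DINO features for targeted structural enhancement. Combined with graph rectification and adaptive fusion modules, it sets a new state of the art in open-vocabulary semantic segmentation for remote sensing.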
Details
Motivation: CLIP's globally aligned visual representations struggle to capture structural detail, and existing methods that add DINO features do not account for the functional heterogeneity of CLIP feature channels, which leads to inaccurate boundaries and can damage semantic integrity. Method: Propose the DR-Seg decouple-and-rectify framework: 1) decouple CLIP features by functional heterogeneity into semantics-dominated and structure-dominated subspaces; 2) build a prior-driven graph rectification module under DINO guidance to inject high-fidelity structural priors; 3) design an uncertainty-guided adaptive fusion module that dynamically fuses the refined branch with the original CLIP branch. Result: Comprehensive experiments on eight remote sensing benchmarks show that DR-Seg reaches a new state of the art. Conclusion: Through functional decoupling and targeted structural enhancement, DR-Seg balances language-aligned semantics with fine-grained spatial delineation, offering a new paradigm for open-vocabulary remote sensing semantic segmentation. Abstract: Open-vocabulary semantic segmentation in the remote sensing (RS) field requires both language-aligned recognition and fine-grained spatial delineation. Although CLIP offers robust semantic generalization, its global-aligned visual representations inherently struggle to capture structural details. Recent methods attempt to compensate for this by introducing RS-pretrained DINO features. However, these methods treat CLIP representations as a monolithic semantic space and cannot localize where structural enhancement is required, failing to effectively delineate boundaries while risking the disruption of CLIP's semantic integrity. To address this limitation, we propose DR-Seg, a novel decouple-and-rectify framework. Our method is motivated by the key observation that CLIP feature channels exhibit distinct functional heterogeneity rather than forming a uniform semantic space. Building on this insight, DR-Seg decouples CLIP features into semantics-dominated and structure-dominated subspaces, enabling targeted structural enhancement by DINO without distorting language-aligned semantics. Subsequently, a prior-driven graph rectification module injects high-fidelity structural priors under DINO guidance to form a refined branch, while an uncertainty-guided adaptive fusion module dynamically integrates this refined branch with the original CLIP branch for final prediction. Comprehensive experiments across eight benchmarks demonstrate that DR-Seg establishes a new state-of-the-art.
[160] Are VLMs Lost Between Sky and Space? LinkS$^2$Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence
Dian Liu,Jie Feng,Di Li,Yuhui Zheng,Guanbin Li,Weisheng Dong,Guangming Shi
Main category: cs.CV
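TL;DR: This paper proposes LinkS²Bench, the first benchmark for evaluating dynamic cross-view spatial intelligence across UAVs and satellites. It pairs 1,022 minutes of dynamic UAV video with 200 km² of high-resolution satellite imagery and provides 17.9k high-quality QA pairs covering four dimensions: perception, localization, relation, and reasoning. Experiments show that current VLMs have a significant bottleneck in dynamic cross-view alignment, and a proposed Cross-View Alignment Adapter improves performance.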
Details
Motivation: There is no way to evaluate the synergistic UAV-satellite spatial intelligence of existing Vision-Language Models (VLMs), because existing benchmarks cover only isolated UAV videos or static satellite images and cannot test dynamic local-to-global spatial mapping or cross-view reasoning. Method: Build LinkS²Bench, the first benchmark for dynamic cross-view spatial intelligence, using an LMM-assisted pipeline and careful human annotation to align 1,022 minutes of dynamic UAV video with high-resolution satellite imagery covering 200 km², yielding 17.9k QA pairs across 4 dimensions and 12 fine-grained tasks; design a Cross-View Alignment Adapter that models cross-view alignment explicitly, and run a systematic evaluation of 18 mainstream VLMs together with fine-tuning experiments. Result: The 18 representative VLMs fall substantially behind the human baseline on LinkS²Bench, confirming that dynamic cross-view alignment is the core bottleneck; the proposed adapter significantly improves performance; fine-tuning experiments show that LinkS²Bench helps VLMs adapt to complex spatial reasoning tasks. Conclusion: LinkS²Bench fills the gap in evaluating the cross-view spatial intelligence of VLMs, exposes the key challenge of dynamic alignment, and provides a new benchmark and technical path for multi-source spatial intelligence models aimed at wide-area emergency response and security tasks. Abstract: Synergistic spatial intelligence between unmanned aerial vehicles (UAVs) and satellites is indispensable for emergency response and security operations, as it uniquely integrates macro-scale global coverage with dynamic, real-time local perception. However, the capacity of Vision-Language Models (VLMs) to master this complex interplay remains largely unexplored. This gap persists primarily because existing benchmarks are confined to isolated UAV videos or static satellite imagery, failing to evaluate the dynamic local-to-global spatial mapping essential for comprehensive cross-view reasoning. To bridge this gap, we introduce LinkS$^2$Bench, the first comprehensive benchmark designed to evaluate VLMs' wide-area, dynamic cross-view spatial intelligence. LinkS$^2$Bench links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$. Through an LMM-assisted pipeline and rigorous human annotation, we constructed 17.9k high-quality question-answer pairs comprising 12 fine-grained tasks across four dimensions: perception, localization, relation, and reasoning. Evaluations of 18 representative VLMs reveal a substantial gap compared to human baselines, identifying accurate cross-view dynamic alignment as the critical bottleneck. To alleviate this, we design a Cross-View Alignment Adapter, demonstrating that explicit alignment significantly improves model performance.
Furthermore, fine-tuning experiments underscore the potential of LinkS²Bench in advancing VLM adaptation for complex spatial reasoning.
[161] Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data
Alejandro Castañeda Garcia,Jan van Gemert,Daan Brinks,Nergis Tömen
Main category: cs.CV
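TL;DR: This paper proposes an autoencoder improvement for spatially imbalanced data. A self-entropy loss and a Sample Propagation mechanism improve reconstruction at rare spatial locations, and the method is validated on several real and simulated datasets.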
Details
Motivation: Autoencoders are biased when image content is sampled non-uniformly in space (as in medical, biological, and physical imaging): they tend to reconstruct the dominant background, so rare but important details are lost and reconstructions blur. Method: Two complementary strategies are proposed: (i) a self-entropy-based loss that upweights statistically rare spatial locations; (ii) Sample Propagation, a replay mechanism that selectively re-exposes hard-to-reconstruct samples across batches during training. Result: Validated on a controlled simulated dataset and three real-world datasets (physical, biological, astronomical), the method outperforms baselines on multiple reconstruction metrics, particularly under spatially imbalanced distributions. Conclusion: Spatial imbalance is a key factor affecting unsupervised image reconstruction quality; by focusing on rare spatial locations, the proposed method markedly improves reconstruction consistency and detail preservation, highlighting the importance of batch data representation and rare-sample modeling. Abstract: Autoencoders can be challenged by spatially non-uniform sampling of image content. This is common in medical imaging, biology, and physics, where informative patterns occur rarely at specific image coordinates, as background dominates these locations in most samples, biasing reconstructions toward the majority appearance. In practice, autoencoders are biased toward dominant patterns, resulting in the loss of fine-grained detail and causing blurred reconstructions for rare spatial inputs, especially under spatial data imbalance. We address spatial imbalance by two complementary components: (i) a self-entropy-based loss that upweights statistically uncommon spatial locations and (ii) Sample Propagation, a replay mechanism that selectively re-exposes the model to hard-to-reconstruct samples across batches during training. We benchmark existing data balancing strategies, originally developed for supervised classification, in the unsupervised reconstruction setting. Drawing on the limitations of these approaches, our method specifically targets spatial imbalance by encouraging models to focus on statistically rare locations, improving reconstruction consistency compared to existing baselines. We validate in a simulated dataset with controlled spatial imbalance conditions, and in three uncontrolled, diverse real-world datasets spanning physical, biological, and astronomical domains. Our approach outperforms baselines on various reconstruction metrics, particularly under spatial imbalance distributions. 
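To make the idea of upweighting statistically uncommon spatial locations concrete, here is a minimal sketch. It assumes binarized foreground/background pixels, a per-pixel weight of -log p (self-information), and a weighted mean squared error; the threshold, the function names, and the exact loss form are illustrative assumptions and not the authors' published loss.

```python
import numpy as np

def self_information_weights(images, threshold=0.5, eps=1e-6):
    """Return per-pixel weights -log p for a batch of shape (N, H, W).

    p is the empirical probability, over the batch, of the state
    (foreground or background) that each pixel actually takes at its
    location. Rare states therefore get large weights. (Assumed form.)
    """
    fg = (images > threshold).astype(np.float64)      # 1 where foreground
    p_fg = fg.mean(axis=0)                            # (H, W) foreground rate
    p_state = np.where(fg == 1.0, p_fg, 1.0 - p_fg)   # probability of observed state
    return -np.log(np.clip(p_state, eps, 1.0))

def weighted_mse(recon, target, weights, eps=1e-6):
    """Reconstruction error where each pixel counts in proportion to its weight."""
    return float((weights * (recon - target) ** 2).sum() / max(weights.sum(), eps))
```

With four samples in which a location is foreground only once, that foreground pixel receives weight log 4, while a location that is always background receives weight 0, so errors on the rare pattern dominate the loss.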
These results highlight the importance of data representation in a batch and emphasize rare samples in unsupervised image reconstruction. We will make all code and related data available.
[162] IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline
Sebastian-Ion Nae,Radu Moldoveanu,Alexandra Stefania Ghita,Adina Magda Florea
Main category: cs.CV
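TL;DR: This paper introduces IndoorCrowd, a large-scale multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking. It contains 31 videos (9,913 frames) and comes with benchmark results comparing several foundation models and tracking algorithms.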
Details
Motivation: Existing datasets do not adequately capture the complexity and diversity of crowd behaviour in real indoor environments, and large-scale, high-quality, multi-scene indoor crowd data are lacking. Method: The authors build the IndoorCrowd dataset, covering four indoor campus scenes with human-verified per-instance segmentation masks; design a control subset to evaluate the auto-annotators SAM3, GroundingSAM, and EfficientGroundingSAM; establish detection, segmentation, and tracking baselines by pairing YOLOv8n/YOLOv26n/RT-DETR-L with ByteTrack/BoT-SORT/OC-SORT; and evaluate quantitatively with Cohen's κ, AP, precision, recall, and mask IoU. Result: ACS-EC is the most challenging scene (79.3% of frames are dense, with a mean instance scale of only 60.8 px); the auto-annotators perform differently on the control subset; and algorithm performance varies markedly across scenes, confirming the diversity of scene difficulty. Conclusion: IndoorCrowd fills a data gap for understanding crowds in real, complex indoor settings, provides a reliable benchmark for detection, segmentation, and tracking research, and reveals how strongly scene characteristics affect model performance. Abstract: Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises 31 videos (9,913 frames at 5 fps) with human-verified, per-instance segmentation masks. A 620-frame control subset benchmarks three foundation-model auto-annotators (SAM3, GroundingSAM, and EfficientGroundingSAM) against human labels using Cohen's κ, AP, precision, recall, and mask IoU. A further 2,552-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with 79.3% dense frames and a mean instance scale of 60.8 px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.
[163] Efficient Reasoning via Thought Compression for Language Segmentation
Qing Zhou,Shiyu Zhang,Yuyu Jia,Junyu Gao,Weiping Ni,Junzheng Wu,Qi Wang
Main category: cs.CV
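TL;DR: WISE is a new efficient-reasoning paradigm that follows a "think twice" strategy (once for learning, once for speed), substantially shortening reasoning while preserving performance.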
Details
Motivation: Chain-of-thought (CoT) reasoning improves multimodal models on language-guided segmentation, but generating verbose rationales makes it too computationally expensive for practical use. Method: WISE generates a structured sequence: a concise rationale first, then the final answer, then a detailed explanation. Autoregressive conditioning ensures that the concise rationale is sufficient to support generating the detailed explanation, and a self-distillation objective jointly optimizes semantic fidelity and conciseness. At inference only the concise rationale is used, and the WISE-S strategy injects a brevity instruction into the user query to mitigate distribution shift. Result: WISE-S achieves state-of-the-art zero-shot performance on the ReasonSeg benchmark (58.3 cIoU) while cutting reasoning length from 112 to 23 tokens, a nearly 5× compression. Conclusion: WISE shows that detailed reasoning can be internalized into a compact form, combining efficiency with strong performance and offering a new path to practical deployment. Abstract: Chain-of-thought (CoT) reasoning has significantly improved the performance of large multimodal models in language-guided segmentation, yet its prohibitive computational cost, stemming from generating verbose rationales, limits real-world applicability. We introduce WISE (Wisdom from Internal Self-Exploration), a novel paradigm for efficient reasoning guided by the principle of "thinking twice: once for learning, once for speed". WISE trains a model to generate a structured sequence: a concise rationale, the final answer, and then a detailed explanation. By placing the concise rationale first, our method leverages autoregressive conditioning to enforce that the concise rationale acts as a sufficient summary for generating the detailed explanation. This structure is reinforced by a self-distillation objective that jointly rewards semantic fidelity and conciseness, compelling the model to internalize its detailed reasoning into a compact form. At inference, the detailed explanation is omitted. To address the resulting conditional distribution shift, our inference strategy, WISE-S, employs a simple prompting technique that injects a brevity-focused instruction into the user's query. This final adjustment facilitates the robust activation of the learned concise policy, unlocking the full benefits of our framework. Extensive experiments show that WISE-S achieves state-of-the-art zero-shot performance on the ReasonSeg benchmark with 58.3 cIoU, while reducing the average reasoning length by nearly 5×, from 112 to just 23 tokens. 
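The ordering of the training sequence and the reported length reduction can be sketched as follows. The tag names, the brevity hint text, and all function names are hypothetical placeholders chosen for illustration; the abstract specifies only the order (concise rationale, answer, detailed explanation) and the token counts.

```python
def build_training_target(concise, answer, detailed):
    """Concatenate the three segments in the order described in the abstract.

    The <concise>/<answer>/<detailed> tags are hypothetical markers.
    """
    return (
        f"<concise>{concise}</concise>"
        f"<answer>{answer}</answer>"
        f"<detailed>{detailed}</detailed>"
    )

def build_inference_prompt(query, brevity_hint="Reason briefly."):
    """Append a brevity-focused instruction to the user query (wording assumed)."""
    return f"{query} {brevity_hint}"

def compression_ratio(tokens_before, tokens_after):
    """How many times shorter the reasoning became."""
    return tokens_before / tokens_after

# Reported average reasoning lengths: 112 tokens before, 23 after.
ratio = compression_ratio(112, 23)
```

Here `ratio` evaluates to about 4.87, which matches the "nearly 5×" figure quoted in the abstract.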
Code is available at https://github.com/mrazhou/WISE.
[164] Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models
Issa Sugiura,Keito Sasagawa,Keisuke Nakao,Koki Maeda,Ziqi Yin,Zhishen Yang,Shuhei Kurita,Yusuke Oda,Ryoko Tokuhisa,Daisuke Kawahara,Naoaki Okazaki
Main category: cs.CV
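TL;DR: This paper presents Jagle, the largest Japanese multimodal post-training dataset to date (9.2 million samples), built from heterogeneous sources through multiple strategies (VLM-based generation, translation, text rendering). It significantly improves Japanese VLM performance without sacrificing English performance.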
Details
Motivation: Training non-English VLMs is constrained by Japanese VQA datasets that are small and narrow in domain coverage, which makes it hard to support high-quality multilingual VLM development. Method: The Jagle dataset is built by integrating heterogeneous sources such as images, image-text pairs, and PDFs, and generating Japanese VQA samples through VLM-based QA generation, cross-lingual translation, text rendering, and other strategies; a 2.2B model is post-trained on it and evaluated for multilingual performance. Result: The 2.2B model trained on Jagle surpasses InternVL3.5-2B in average score across 10 Japanese evaluation tasks and comes within about five points of Qwen3-VL-2B-Instruct; joint training with FineVision does not harm English performance and in fact improves it. Conclusion: Jagle provides a high-quality, scalable data foundation for Japanese and multilingual VLMs, shows that multimodal data can be built effectively through diverse synthesis strategies rather than reliance on existing VQA datasets, and all resources are released to support follow-up research. Abstract: Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation, translation, and text rendering. Experiments demonstrate that a 2.2B model trained with Jagle achieves strong performance on Japanese tasks, surpassing InternVL3.5-2B in average score across ten Japanese evaluation tasks and approaching within five points of Qwen3-VL-2B-Instruct. Furthermore, combining Jagle with FineVision does not degrade English performance; instead, it improves English performance compared to training with FineVision alone. To facilitate reproducibility and future research, we release the dataset, trained models, and code.
[165] True to Tone? Quantifying Skin Tone Fidelity and Bias in Photographic-to-Virtual Human Pipelines
Gabriel Ferri Schneider,Erick Menezes,Rafael Mecenas,Paulo Knob,Victor Araujo,Soraia Raupp Musse
Main category: cs.CV
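TL;DR: This paper proposes a fully automatic, scalable method for systematically evaluating skin tone fidelity in virtual human (VH) generation pipelines, covering skin color and illumination extraction, texture recolorization, real-time rendering, and quantitative color analysis. Experiments on the CFD dataset show that skin tone extraction strategies behave in a phenotype-dependent way and that colorimetric errors are consistently higher for darker skin tones.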
Details
Motivation: Existing virtual human creation pipelines mostly rely on photographic inputs without colorimetric calibration, which can produce inconsistent and biased skin tone reproduction and affects realism, identity preservation, and fairness. Method: An end-to-end automatic workflow is built: it combines two skin tone extraction strategies (cheek-region sampling and full-face multidimensional masking), uses the TRUST framework for lighting isolation, applies the extracted skin tones to MetaHuman textures rendered under multiple lighting conditions, and finally evaluates skin tone consistency quantitatively in CIELAB space with the ΔE and ITA metrics. Result: About 19,848 rendered instances are generated and analyzed; extraction strategy performance is phenotype-dependent, and darker skin tones consistently show higher colorimetric error (ΔE). Conclusion: The method requires no manual intervention or training, has low computational cost, and can be deployed at scale; it exposes a systematic reproduction bias against darker skin tones in current virtual human pipelines and provides a quantifiable evaluation baseline for improving fairness and fidelity. Abstract: Accurate reproduction of facial skin tone is essential for realism, identity preservation, and fairness in Virtual Human (VH) rendering. However, most accessible avatar creation pipelines rely on photographic inputs that lack colorimetric calibration, which can introduce inconsistencies and bias. We propose a fully automatic and scalable methodology to systematically evaluate skin tone fidelity across the VH generation pipeline. Our approach defines a full workflow that integrates skin color and illumination extraction, texture recolorization, real-time rendering, and quantitative color analysis. Using facial images from the Chicago Face Database (CFD), we compare skin tone extraction strategies based on cheek-region sampling, following the literature, and multidimensional masking derived from full-face analysis. Additionally, we test both strategies with lighting isolation, using the pre-trained TRUST framework, employed without any training or optimization within our pipeline. Extracted skin tones are applied to MetaHuman textures and rendered under multiple lighting configurations. Skin tone consistency is evaluated objectively in the CIELAB color space using the ΔE metric and the Individual Typology Angle (ITA). The proposed methodology operates without manual intervention and, with the exception of pre-trained illumination compensation modules, the pipeline does not include learning or training stages, enabling low computational cost and large-scale evaluation. Using this framework, we generate and analyze approximately 19,848 rendered instances. 
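For readers unfamiliar with the two CIELAB metrics named above, the sketch below computes them from (L*, a*, b*) triples. It assumes the CIE76 form of ΔE (plain Euclidean distance), since the abstract does not say which ΔE variant is used, and the standard ITA definition arctan((L* - 50) / b*) in degrees; the function names are illustrative.

```python
import math

def delta_e_cie76(lab_ref, lab_test):
    """CIE76 color difference: Euclidean distance between two (L*, a*, b*) triples.

    Assumption: the paper may use a different Delta-E variant (e.g. CIEDE2000).
    """
    return math.sqrt(sum((r - t) ** 2 for r, t in zip(lab_ref, lab_test)))

def ita_degrees(l_star, b_star):
    """Individual Typology Angle in degrees.

    atan2 equals arctan((L* - 50) / b*) whenever b* > 0, which holds for
    typical skin colors, and it avoids a division by zero at b* = 0.
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))

# Example: a source skin tone and its rendered counterpart (made-up values).
source = (50.0, 0.0, 0.0)
rendered = (53.0, 4.0, 0.0)
error = delta_e_cie76(source, rendered)
```

A higher ITA corresponds to lighter skin, so grouping `error` values by ITA range is one way to check whether darker tones are reproduced less accurately.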
Our results show phenotype-dependent behavior of extraction strategies and consistently higher colorimetric errors for darker skin tones.
[166] COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing
Hao Wang,Yanyu Qian,Pengcheng Weng,Zixuan Xia,William Dan,Yangxin Xu,Fei Wang
Main category: cs.CV
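TL;DR: This paper proposes COMPASS, a framework that generates a target-specific proxy token for every missing modality so that the fusion head always receives a fixed N-slot multimodal input, improving the robustness of multimodal sensing.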
Details
Motivation: When modalities are missing, existing methods give the fusion head an input structure that differs from the one seen during training, causing incomplete fusion and weaker cross-modal interaction. Method: Built on the principle of fusion completeness, COMPASS synthesizes a target-specific proxy token for each missing modality using pairwise source-to-target generators in a shared latent space, and combines proxy alignment, shared-space regularization, and per-proxy discriminative supervision to keep the proxies representation-compatible and task-informative. Result: Across diverse single- and multiple-missing-modality settings on the XRF55, MM-Fi, and OctoNet datasets, COMPASS outperforms prior methods in most cases. Conclusion: Preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing. Abstract: Missing modalities remain a major challenge for multimodal sensing, because most existing methods adapt the fusion process to the observed subset by dropping absent branches, using subset-specific fusion, or reconstructing missing features. As a result, the fusion head often receives an input structure different from the one seen during training, leading to incomplete fusion and degraded cross-modal interaction. We propose COMPASS, a missing-modality fusion framework built on the principle of fusion completeness: the fusion head always receives a fixed N-slot multimodal input, with one token per modality slot. For each missing modality, COMPASS synthesizes a target-specific proxy token from the observed modalities using pairwise source-to-target generators in a shared latent space, and aggregates them into a single replacement token. To make these proxies both representation-compatible and task-informative, we combine proxy alignment, shared-space regularization, and per-proxy discriminative supervision. Experiments on XRF55, MM-Fi, and OctoNet under diverse single- and multiple-missing settings show that COMPASS outperforms prior methods on the large majority of scenarios. Our results suggest that preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing.
[167] CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects
Jingliang Li,Jindou Jia,Tuo An,Chuhao Zhou,Xiangyu Chen,Shilin Shan,Boyu Ma,Bofan Lyu,Gen Li,Jianfei Yang
Main category: cs.CV
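TL;DR: This paper proposes a new task, intent-driven 3D affordance grounding in multi-object scenes, builds CompassAD, the first benchmark focused on implicit intent and confusing object pairs, and designs the CompassNet framework, which improves affordance localization in confusing scenes through instance-bounded cross-modal injection and a bi-level contrastive refinement module. Experiments show state-of-the-art performance in both simulation and real-robot grasping tasks.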
Details
Motivation: Existing 3D affordance methods mostly assume a single object and explicit category prompts, so they cannot handle real cluttered scenes in which several objects share the same function ("confusing pairs") but only one matches the current task intent. Method: The CompassNet framework has two core modules: 1) Instance-bounded Cross Injection (ICI), which constrains language-geometry alignment within object instance boundaries to prevent cross-object semantic leakage; 2) Bi-level Contrastive Refinement (BCR), which applies contrastive learning at both the geometric-group and point levels to sharpen discrimination between target and confusable surfaces. Result: CompassNet achieves state-of-the-art performance on the newly built CompassAD benchmark and generalizes well to unseen instructions; deployment on a real robotic platform confirms its effectiveness for grasping in confusing multi-object scenes. Conclusion: Implicit-intent-driven multi-object 3D affordance grounding is a key step toward practical embodied intelligence; the CompassAD benchmark and CompassNet method offer a systematic solution and technical foundation for resolving affordance ambiguity in real scenes. Abstract: When told to "cut the apple," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Multi-Object Affordance Grounding under Intent-Driven Instructions, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a cluttered multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 scenes, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. 
Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.
[168] Network Structure in UK Payment Flows: Evidence on Economic Interdependencies and Implications for Real-Time Measurement
Aditya Humnabadkar
Main category: cs.CV
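TL;DR: Applying graph-theoretic methods to about 530,000 UK inter-industry payment records, this paper finds that network features (such as centrality and clustering coefficients) significantly improve payment flow forecasting, especially during economic disruptions such as the COVID-19 pandemic, offering a new tool for real-time economic monitoring and official statistics.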
Details
Motivation: Traditional bilateral measurement approaches struggle to reveal the implicit structural economic relationships between industries, whereas network analysis offers a new perspective for real-time economic monitoring; an alternative is especially needed during economic disruptions, when traditional time-series methods break down. Method: An industry payment network is built from 532,346 payment records across 89 UK industry sectors for 2017-2024; graph-theoretic features such as centrality and clustering coefficients are extracted, incorporated into forecasting models, and compared against traditional time-series methods. Result: Network features raise forecasting accuracy by 8.8 percentage points; during the pandemic the network contribution reaches +13.8 percentage points (with R² recovering from 0.19); Financial Services, Wholesale Trade, and Professional Services are identified as the most structurally central industries; network density rises 12.5% overall, recovering after the pandemic to exceed pre-pandemic levels. Conclusion: Structural features of the payment network can serve as leading indicators of structural economic change and significantly strengthen nowcasting, particularly in turbulent periods when traditional temporal patterns fail, with potential to improve official statistics. Abstract: Network analysis of inter-industry payment flows reveals structural economic relationships invisible to traditional bilateral measurement approaches, with significant implications for real-time economic monitoring. Analysing 532,346 UK payment records (2017-2024) across 89 industry sectors, we demonstrate that graph-theoretic features, including centrality measures and clustering coefficients, improve payment flow forecasting by 8.8 percentage points beyond traditional time-series methods. Critically, network features prove most valuable during economic disruptions: during the COVID-19 pandemic, when traditional forecasting accuracy collapsed (R² falling from 0.38 to 0.19), network-enhanced models maintained substantially better performance, with network contributions reaching +13.8 percentage points. The analysis identifies Financial Services, Wholesale Trade, and Professional Services as structurally central industries whose network positions indicate systemic importance beyond their transaction volumes. Network density increased 12.5% over the sample period, with visible disruption during 2020 followed by recovery exceeding pre-pandemic integration levels. These findings suggest payment network monitoring could enhance official statistics production by providing leading indicators of structural economic change and improving nowcasting accuracy during periods when traditional temporal patterns prove unreliable.
[169] Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection
Soo Won Seo,KyungChae Lee,Hyungchan Cho,Taein Son,Nam Ik Cho,Jun Won Choi
Main category: cs.CV
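TL;DR: This paper proposes InCoM-Net, a framework that fuses semantic knowledge from vision-language models (VLMs) with instance features from a detector to strengthen context modeling in human-object interaction (HOI) detection, reaching state-of-the-art performance on HICO-DET and V-COCO.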
Details
Motivation: Existing VLM-based HOI detection methods do not fully exploit the diverse contextual cues spread across the scene, which limits the depth and breadth of interaction reasoning. Method: The Instance-centric Context Mining Network (InCoM-Net) has two core modules: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues; and Progressive Context Aggregation (ProCA), which iteratively fuses the multi-level context features with the detector's instance features. Result: State-of-the-art performance on both the HICO-DET and V-COCO benchmarks. Conclusion: InCoM-Net improves how HOI detection understands and uses complex scene context, supporting the value of instance-centric, multi-granularity context modeling. Abstract: Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.
[170] PLUME: Latent Reasoning Based Universal Multimodal Embedding
Chenwei He,Xiangzhao Hao,Tianyu Yang,Yuxiang Ma,Yuheng Jia,Lingxiang Wu,Chaoyang Zhao,Haiyun Guo,Jinqiao Wang
Main category: cs.CV
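TL;DR: PLUME proposes a latent chain-of-thought reasoning framework that replaces explicit textual chain-of-thought with an autoregressive rollout of continuous latent states. With a semantic-anchor-guided transition adapter and a progressive explicit-to-latent training curriculum, it preserves reasoning ability while greatly reducing computational overhead.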
Details
Motivation: Existing universal multimodal embedding (UME) methods based on explicit chain-of-thought (CoT) incur high inference overhead and compress multimodal information into a textual bottleneck. Method: PLUME replaces explicit CoT with a latent autoregressive rollout; introduces a semantic-anchor-guided transition adapter to support diverse queries; and uses a progressive explicit-to-latent curriculum in which explicit CoT serves only as a temporary scaffold during training. Result: On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT baselines while reducing reasoning from hundreds of tokens to fewer than 10 latent steps, for a speedup of over 30x; it is especially suited to retrieval scenarios such as video and visual documents, where evidence is dense, structurally complex, and hard to organize as text. Conclusion: Structured latent computation avoids the overhead of explicit rationale generation without giving up the benefits of intermediate reasoning, providing a stronger and more efficient paradigm for practical retrieval systems. Abstract: Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. 
These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.
[171] FlowSlider: Training-Free Continuous Image Editing via Fidelity-Steering Decomposition
Taichi Endo,Guoqing Hao,Kazuhiko Sumi
Main category: cs.CV
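TL;DR: This paper proposes FlowSlider, a training-free method for continuous image editing that decomposes the FlowEdit update into a fidelity term and a steering term to achieve smooth, reliable control of edit strength.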
Details
Motivation: Existing learning-based slider methods for continuous editing depend on auxiliary modules and synthetic supervision, which adds training overhead and reduces reliability under distribution shift. Method: Within the Rectified Flow framework, FlowSlider decomposes the FlowEdit update into a source-conditioned fidelity term (preserving identity and structure) and a steering term that drives the semantic change; exploiting their approximate orthogonality, it scales only the steering term to adjust edit strength. Result: FlowSlider delivers stable, smooth, and reliable continuous editing without post-training and improves editing quality across a variety of tasks. Conclusion: FlowSlider is an efficient, general, training-free approach to continuous image editing that addresses the strong dependence on the training distribution and the extra overhead of existing methods. Abstract: Continuous image editing aims to provide slider-style control of edit strength while preserving source-image fidelity and maintaining a consistent edit direction. Existing learning-based slider methods typically rely on auxiliary modules trained with synthetic or proxy supervision. This introduces additional training overhead and couples slider behavior to the training distribution, which can reduce reliability under distribution shifts in edits or domains. We propose FlowSlider, a training-free method for continuous editing in Rectified Flow that requires no post-training. FlowSlider decomposes FlowEdit's update into (i) a fidelity term, which acts as a source-conditioned stabilizer that preserves identity and structure, and (ii) a steering term that drives semantic transition toward the target edit. Geometric analysis and empirical measurements show that these terms are approximately orthogonal, enabling stable strength control by scaling only the steering term while keeping the fidelity term unchanged. As a result, FlowSlider provides smooth and reliable control without post-training, improving continuous editing quality across diverse tasks.
[172] Center-Aware Detection with Swin-based Co-DETR Framework for Cervical Cytology
Yan Kong,Yuan Yin,Hongan Chen,Yuqi Fang,Caifeng Shan
Main category: cs.CV
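TL;DR: This paper presents a center-point detection method for cervical cytology image analysis that performed strongly in the RIVA challenge. It improves detection accuracy by combining Co-DINO with Swin-Large, modeling detection as center-point prediction, using center-preserving augmentation, and applying geometric box optimization.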
Details
Motivation: Automated analysis of Pap smear images for cervical cancer screening is challenging because cells are densely packed and morphologically complex, and the dataset's fixed-size bounding box annotations call for a task formulation adapted to them. Method: Multi-scale features are extracted with the Co-DINO framework and a Swin-Large backbone; detection is modeled as center-point prediction; center-preserving data augmentation and analytical geometric box optimization are designed to suppress localization jitter; and loss weights are tuned per track. Result: 1st place in Track B and 2nd place in Track A of the RIVA Cervical Cytology Challenge; experiments show that the proposed optimizations significantly improve detection performance. Conclusion: The center-point detection paradigm and its accompanying optimizations provide an efficient, robust solution for cytology image analysis, and the code is open-sourced to support follow-up research. Abstract: Automated analysis of Pap smear images is critical for cervical cancer screening but remains challenging due to dense cell distribution and complex morphology. In this paper, we present our winning solution for the RIVA Cervical Cytology Challenge, achieving 1st place in Track B and 2nd place in Track A. Our approach leverages a powerful baseline, integrating the Co-DINO framework with a Swin-Large backbone for robust multi-scale feature extraction. To address the dataset's unique fixed-size bounding box annotations, we formulate the detection task as a center-point prediction problem. Tailoring our approach to this formulation, we introduce a center-preserving data augmentation strategy and an analytical geometric box optimization to effectively absorb localization jitter. Finally, we apply track-specific loss tuning to adapt the loss weights for each task. Experiments demonstrate that our targeted optimizations improve detection performance, providing an effective pipeline for cytology image analysis. Our code is available at https://github.com/YanKong0408/Center-DETR.
[173] GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding
Rong Fan,Kaiyan Xiao,Minghao Zhu,Liuyi Wang,Kai Dai,Zhao Yang
Main category: cs.CV
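TL;DR: This paper proposes GroundVTS, a new Vid-LLM architecture for video temporal grounding. Query-guided fine-grained visual token sampling and a progressive optimization strategy improve temporal modeling, and the method significantly outperforms existing approaches on multiple benchmarks.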
Details
Motivation: Existing video large language models (Vid-LLMs) use uniform frame sampling, which leaves key frames sparse and discards important temporal cues, making fine-grained tasks such as video temporal grounding (VTG) hard to support. Method: Grounded Visual Token Sampling (GroundVTS) is proposed: 1) a query-guided, fine-grained visual token filtering mechanism that focuses on the most informative temporal segments; 2) a progressive optimization strategy that adapts the LLM to non-uniform visual feature distributions and strengthens its modeling of temporal dependencies. Result: In a comprehensive evaluation on three standard VTG benchmarks, mIoU (moment retrieval) improves by 7.7 points and mAP (highlight detection) by 12.0 points. Conclusion: GroundVTS mitigates the temporal information loss caused by uniform sampling, improves the accuracy and robustness of Vid-LLMs on video temporal grounding, and offers a new direction for extending their applications. Abstract: Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video large language models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Furthermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to model temporal dependencies and achieve precise video localization. We comprehensively evaluate GroundVTS on three standard VTG benchmarks, where it outperforms existing methods, achieving a 7.7-point improvement in mIoU for moment retrieval and a 12.0-point improvement in mAP for highlight detection. Code is available at https://github.com/Florence365/GroundVTS.
[174] LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
Jiachun Jin,Zetong Zhou,Xiao Yang,Hao Zhang,Pengfei Liu,Jun Zhu,Zhijie Deng
Main category: cs.CV
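TL;DR: This paper proposes LatentUM, a new unified model that represents all modalities in a shared semantic latent space, removing the dependence on pixel decoding between visual understanding and generation. This enables efficient, flexible cross-modal reasoning and generation and achieves state-of-the-art performance on multiple tasks.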
Details
Motivation: Existing unified models rely on pixel decoding as the bridge between visual understanding and generation, which is inefficient, introduces codec bias, and makes efficient interleaved cross-modal reasoning hard to support. Method: LatentUM maps all modalities into one shared semantic latent space, avoiding pixel-space mediation and enabling end-to-end cross-modal understanding and generation. Result: State-of-the-art performance on the Visual Spatial Planning benchmark; stronger self-reflection in visual generation; and support for world modeling and future visual state prediction within the semantic latent space. Conclusion: The shared semantic latent space design significantly improves cross-modal alignment and efficiency, offering a better architectural paradigm for unified multimodal models. Abstract: Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.
[175] CASHG: Context-Aware Stylized Online Handwriting Generation
Jinsu Shin,Sungeun Hong,Jin Yeong Bak
Main category: cs.CV
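TL;DR: This paper proposes CASHG, a context-aware stylized online handwriting generator that explicitly models inter-character connectivity to synthesize style-consistent sentence-level trajectories, and introduces CSM, a boundary-aware evaluation metric suite.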
Details
Motivation: Sentence-level online handwriting generation must handle context-dependent characters, stroke continuity, and spacing control; existing methods model these boundary properties only implicitly, which becomes unreliable at sentence scale and under limited compositional diversity. Method: The CASHG model includes a Character Context Encoder that obtains character identity and sentence-level context memory; a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor-to-current character transitions; a gated context fusion mechanism; and a three-stage curriculum (from isolated glyphs to full sentences) for training. Connectivity and Spacing Metrics (CSM) are also designed for boundary-aware evaluation. Result: Under benchmark-matched evaluation, CASHG consistently outperforms comparison methods on CSM, remains competitive on DTW trajectory similarity, and its gains are confirmed by human evaluation. Conclusion: Explicitly modeling inter-character connectivity and fusing context, combined with curriculum learning and dedicated evaluation metrics, effectively improves the style consistency and naturalness of sentence-level online handwriting generation. Abstract: Online handwriting represents strokes as time-ordered trajectories, which makes handwritten content easier to transform and reuse in a wide range of applications. However, generating natural sentence-level online handwriting that faithfully reflects a writer's style remains challenging, since sentence synthesis demands context-dependent characters with stroke continuity and spacing. Prior methods treat these boundary properties as implicit outcomes of sequence modeling, which becomes unreliable at the sentence scale and under limited compositional diversity. We propose CASHG, a context-aware stylized online handwriting generator that explicitly models inter-character connectivity for style-consistent sentence-level trajectory synthesis. CASHG uses a Character Context Encoder to obtain character identity and sentence-dependent context memory and fuses them in a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor--current transitions, complemented by gated context fusion for sentence-level context. Training proceeds through a three-stage curriculum from isolated glyphs to full sentences, improving robustness under sparse transition coverage. We further introduce Connectivity and Spacing Metrics (CSM), a boundary-aware evaluation suite that quantifies cursive connectivity and spacing similarity. 
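The DTW-based trajectory similarity used in the evaluation can be illustrated with the classic dynamic-programming formulation. This is a generic textbook sketch over 2D pen positions; the paper's exact preprocessing, normalization, and step pattern are not stated in the abstract, so those details are assumptions.

```python
import math

def dtw_distance(traj_a, traj_b):
    """Dynamic time warping distance between two sequences of (x, y) points.

    cost[i][j] holds the cheapest cumulative alignment cost of the first
    i points of traj_a with the first j points of traj_b.
    """
    n, m = len(traj_a), len(traj_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(traj_a[i - 1], traj_b[j - 1])  # Euclidean point distance
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance only in traj_a
                cost[i][j - 1],      # advance only in traj_b
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]
```

Because the alignment may repeat points, two strokes that trace the same path at different speeds get a distance of zero, which is why DTW suits time-ordered handwriting trajectories better than a point-by-point comparison.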
Under benchmark-matched evaluation protocols, CASHG consistently improves CSM over comparison methods while remaining competitive in DTW-based trajectory similarity, with gains corroborated by a human evaluation.
[176] CoRegOVCD: Consistency-Regularized Open-Vocabulary Change Detection
Weidong Tang,Hanbin Sun,Zihan Li,Yikai Wang,Feifan Zhang
Main category: cs.CV
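TL;DR: This paper proposes CoRegOVCD, a training-free open-vocabulary change detection framework that uses posterior consistency regularization to improve the accuracy and spatial coherence of semantic change detection.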
Details
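Motivation: Existing remote sensing change detection methods assume a fixed label space and cannot respond to arbitrary user-defined queries. Open-vocabulary change detection in the training-free setting struggles because concept responses are hard to compare directly across dates: appearance variation, weak cross-concept competition, and the spatial continuity of land cover yield noisy, fragmented, and semantically unreliable evidence.
Method: The CoRegOVCD framework has two parts: 1) Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy; 2) the Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) suppress unsupported responses and improve spatial consistency through geometry-aware structural verification and regional consensus.
Result: Across four benchmarks covering building-oriented and multi-class settings, CoRegOVCD improves over the strongest training-free baseline by 2.24 to 4.98 F1$_C$ points and reaches a six-class average of 47.50% F1$_C$ on SECOND.
Conclusion: CoRegOVCD effectively mitigates semantic unreliability and spatial fragmentation in training-free open-vocabulary change detection, offering a robust and interpretable new paradigm for open semantic change analysis of remote sensing imagery.
Abstract: Remote sensing change detection (CD) aims to identify where land-cover semantics change across time, but most existing methods still assume a fixed label space and therefore cannot answer arbitrary user-defined queries. Open-vocabulary change detection (OVCD) instead asks for the change mask of a queried concept. In the fully training-free setting, however, dense concept responses are difficult to compare directly across dates: appearance variation, weak cross-concept competition, and the spatial continuity of many land-cover categories often produce noisy, fragmented, and semantically unreliable change evidence. We propose Consistency-Regularized Open-Vocabulary Change Detection (CoRegOVCD), a training-free dense inference framework that reformulates concept-specific change as calibrated posterior discrepancy. Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy, making semantic change evidence more comparable without explicit instance matching. Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) further suppress unsupported responses and improve spatial coherence through geometry-aware structural verification and regional consensus.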
Across four benchmarks spanning building-oriented and multi-class settings, CoRegOVCD consistently improves over the strongest previous training-free baseline by 2.24 to 4.98 F1$_C$ points and reaches a six-class average of 47.50% F1$_C$ on SECOND.
[177] Beyond the Fold: Quantifying Split-Level Noise and the Case for Leave-One-Dataset-Out AU Evaluation
Saurabh Hinduja,Gurmeet Kaur,Maneesh Bilalpur,Jeffrey Cohn,Shaun Canavan
Main category: cs.CV
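TL;DR: This paper shows that the subject-exclusive cross-validation commonly used for facial Action Unit (AU) detection carries substantial stochastic noise, which affects low-prevalence AUs and operating-point metrics such as F1 most strongly. It proposes a cross-dataset Leave-One-Dataset-Out (LODO) evaluation protocol that removes partition randomness and exposes domain-level instability, yielding more stable and interpretable model evaluation.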
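As a rough illustration of the Leave-One-Dataset-Out idea summarized above, the sketch below enumerates one fold per dataset: each dataset is held out once for testing while all others form the training pool. The function name `leave_one_dataset_out` and the toy dataset names are hypothetical placeholders, not the authors' code; the point is only that the fold assignment is fully determined by the dataset list, so no random partitioning is involved.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_dataset_out(
    datasets: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_name, train_items, test_items), one fold per dataset."""
    names = sorted(datasets)  # fixed order, so folds are deterministic
    for held_out in names:
        # Train on every dataset except the held-out one.
        train = [item for name in names if name != held_out
                 for item in datasets[name]]
        # Test on the held-out dataset in full.
        test = list(datasets[held_out])
        yield held_out, train, test


# Hypothetical toy corpora standing in for real AU datasets.
toy = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1", "c2", "c3"]}
folds = list(leave_one_dataset_out(toy))
for name, train, test in folds:
    print(name, len(train), len(test))
```

Because each fold is defined by which dataset is excluded, rerunning the split cannot change the assignment, in contrast to repeated random k-fold subject splits.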
Details
Motivation: AU detection research relies on subject-exclusive cross-validation, yet reported improvements are often small; the authors suspect that such gains may be masked by the stochastic variance of the evaluation protocol itself.
Method: Repeat 3-fold subject-exclusive splits on BP4D+ to quantify the random fluctuation of metrics such as F1 and AUC, then design a Leave-One-Dataset-Out (LODO) protocol that evaluates cross-dataset robustness on five AU datasets.
Result: Average F1 on BP4D+ shows an empirical noise floor of ±0.065, with larger variation for low-prevalence AUs. F1 is more sensitive to the split than AUC, and model ranking changes across fold assignments. LODO reveals domain-level instability that single-dataset cross-validation cannot detect.
Conclusion: Many gains reported under cross-validation may fall within the protocol's inherent variance; LODO is a more robust and interpretable evaluation paradigm for AU detection.
Abstract: Subject-exclusive cross-validation is the standard evaluation protocol for facial Action Unit (AU) detection, yet reported improvements are often small. We show that cross-validation itself introduces measurable stochastic variance. On BP4D+, repeated 3-fold subject-exclusive splits produce an empirical noise floor of $\pm 0.065$ in average F1, with substantially larger variation for low-prevalence AUs. Operating-point metrics such as F1 fluctuate more than threshold-independent measures such as AUC, and model ranking can change under different fold assignments. We further evaluate cross-dataset robustness using a Leave-One-Dataset-Out (LODO) protocol across five AU datasets. LODO removes partition randomness and exposes domain-level instability that is not visible under single-dataset cross-validation. Together, these results suggest that gains often reported in cross-fold validation may fall within protocol variance. Leave-one-dataset-out cross-validation yields more stable and interpretable findings.
[178] Reflection Generation for Composite Image Using Diffusion Model
Haonan Zhao,Qingyang Liu,Jiaxuan Chen,Li Niu
Main category: cs.CV
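TL;DR: This paper proposes a diffusion-based reflection generation method that injects priors on reflection placement and appearance and uses a type-aware design, achieving physically coherent and visually realistic reflection synthesis on DEROBA, a large-scale reflection dataset built by the authors.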
Details
Motivation: Shadow generation for image composition has been studied extensively, whereas reflection generation has received little attention and calls for dedicated investigation.
Method: Inject prior information about reflection placement and appearance into a foundation diffusion model, divide reflections into two types with a type-aware model design, and construct DEROBA, the first large-scale object reflection dataset, for training.
Result: Experiments show that the generated reflections are physically coherent and visually realistic, establishing a new benchmark for reflection generation.
Conclusion: The work advances reflection generation for image composition and provides an effective method and a high-quality dataset for follow-up research.
Abstract: Image composition involves inserting a foreground object into the background while synthesizing environment-consistent effects such as shadows and reflections. Although shadow generation has been extensively studied, reflection generation remains largely underexplored. In this work, we focus on reflection generation. We inject the prior information of reflection placement and reflection appearance into a foundation diffusion model. We also divide reflections into two types and adopt a type-aware model design. To support training, we construct the first large-scale object reflection dataset DEROBA. Experiments demonstrate that our method generates reflections that are physically coherent and visually realistic, establishing a new benchmark for reflection generation.
[179] ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline
Juan Manuel Hernandez,Mariana Fernandez-Espinosa,Denis Parra,Diego Gomez-Zara
Main category: cs.CV
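TL;DR: This paper presents ViT-Explainer, an interactive visualization system for Vision Transformers that supports interpretability analysis of the whole pipeline, from image patching to the classification decision.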
Details
Motivation: Existing interpretability tools mostly focus on isolated components or expert-oriented analysis and lack guided, integrated understanding of the end-to-end ViT inference pipeline.
Method: Design and implement ViT-Explainer, a web-based interactive system that integrates animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens, with both guided and free exploration modes.
Result: A user study with six participants suggests that the system is easy to learn and use and helps users understand ViT behavior.
Conclusion: ViT-Explainer fills a gap in full-pipeline ViT interpretability and gives non-expert users an intuitive, hands-on path to understanding.
Abstract: Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer, a web-based interactive system that provides an integrated visualization of Vision Transformer inference, from patch tokenization to final classification. The system combines animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens within both guided and free exploration modes. A user study with six participants suggests that ViT-Explainer is easy to learn and use, helping users interpret and understand Vision Transformer behavior.
[180] CXR-LT 2026 Challenge: Projection-Aware Multi-Label and Zero-Shot Chest X-Ray Classification
Juno Cho,Dohui Kim,Mingeon Kim,Hyunseo Jang,Chang Sun Lee,Jong Chul Ye
Main category: cs.CV
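TL;DR: This paper proposes a unified framework for chest X-ray (CXR) multi-label classification of known lesions and zero-shot classification of unseen lesions. It addresses long-tail distributions and varying projection views through an ensemble of projection-specific models, an improved dual-branch CheXzero architecture (combining contrastive learning, Asymmetric Loss, and LLM-generated prompts), and strong data and test-time augmentation.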
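To make the role of Asymmetric Loss (ASL) in long-tailed multi-label training more concrete, here is a minimal NumPy sketch of the commonly published ASL formulation: positives use a focusing exponent `gamma_pos`, while negatives use a larger exponent `gamma_neg` and a probability margin that zeroes out very easy negatives. This is an assumption about which ASL variant is meant; the function name and the default values (`gamma_neg=4.0`, `margin=0.05`) are illustrative choices, not settings reported in this entry.

```python
import numpy as np


def asymmetric_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Sum of per-label asymmetric losses for multi-label classification.

    logits, targets: arrays of the same shape; targets hold 0/1 labels.
    """
    p = 1.0 / (1.0 + np.exp(-logits))            # sigmoid probabilities
    p_neg = np.clip(p - margin, 0.0, None)       # shifted probability for negatives
    # Positive labels: (1 - p)^gamma_pos * log(p)
    loss_pos = targets * (1.0 - p) ** gamma_pos * np.log(np.clip(p, eps, None))
    # Negative labels: p_neg^gamma_neg * log(1 - p_neg); easy negatives vanish
    loss_neg = (1.0 - targets) * p_neg ** gamma_neg * np.log(
        np.clip(1.0 - p_neg, eps, None))
    return float(-(loss_pos + loss_neg).sum())


logits = np.array([2.0, -1.0, 0.5])
targets = np.array([1.0, 0.0, 0.0])
print(asymmetric_loss(logits, targets))
```

With `gamma_pos=0`, `gamma_neg=0`, and `margin=0` the expression reduces to ordinary binary cross-entropy, which is a convenient sanity check for any implementation.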
Details
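Motivation: Address the dual challenge of multi-label classification for known chest X-ray lesions and zero-shot classification for unseen ones, in particular the variation introduced by different projection views and severe long-tail class imbalance.
Method: 1) Integrate projection-specific models into a unified framework via a classification network; 2) extend CheXzero into a dual-branch architecture that combines contrastive learning, Asymmetric Loss (ASL), and LLM-generated descriptive prompts; 3) apply strong data augmentation and test-time augmentation (TTA).
Result: The approach improves multi-label classification performance and zero-shot generalization, mitigates the long-tail problem, and remains robust across CXR projections.
Conclusion: The unified framework handles both known and unseen lesions and, through model integration, a new zero-shot architecture, and augmentation strategies, offers a feasible path toward clinically scalable automated chest imaging diagnosis.
Abstract: This challenge tackles multi-label classification for known chest X-ray (CXR) lesions and zero-shot classification for unseen ones. To handle diverse CXR projections, we integrate projection-specific models via a classification network into a unified framework. For zero-shot classification (Task 2), we extend CheXzero with a novel dual-branch architecture that combines contrastive learning, Asymmetric Loss (ASL), and LLM-generated descriptive prompts. This effectively mitigates severe long-tail imbalances and maximizes zero-shot generalization. Additionally, strong data and test-time augmentations (TTA) ensure robustness across both tasks.
[181] Lightweight Spatiotemporal Highway Lane Detection via 3D-ResNet and PINet with ROI-Aware Attention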
Sorna Shanmuga Raja,Abdelhafid Zenati
Main category: cs.CV
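TL;DR: This paper proposes a lightweight, end-to-end highway lane detection architecture that combines 3D CNNs with instance segmentation. Two improved models (FPN plus self-attention, and an ROI detection head) raise accuracy and efficiency, reaching 93.40% accuracy on TuSimple with fewer parameters and lower latency, which makes the approach suitable for ADAS and LAS.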
Details
Motivation: Lane detection in real driving scenarios requires joint modeling of spatial and temporal information, with better robustness and accuracy at lower computational cost.
Method: Propose two models built on a 3D-ResNet encoder and a PINet decoder: the first adds an FPN and a self-attention mechanism to strengthen multi-scale features and spatial dependencies; the second adds an ROI detection head to focus on lane-relevant regions.
Result: On TuSimple, the second model reaches 93.40% accuracy and significantly reduces false negatives, with fewer parameters and lower latency than 2D and 3D baselines; it was validated through offline training and real-time inference.
Conclusion: The proposed models are lightweight, efficient, and robust, well suited for integration into ADAS, and have the potential to scale toward full Lane Assist Systems (LAS).
Abstract: This paper presents a lightweight, end-to-end highway lane detection architecture that jointly captures spatial and temporal information for robust performance in real-world driving scenarios. Building on the strengths of 3D convolutional neural networks and instance segmentation, we propose two models that integrate a 3D-ResNet encoder with a Point Instance Network (PINet) decoder. The first model enhances multi-scale feature representation using a Feature Pyramid Network (FPN) and Self-Attention mechanism to refine spatial dependencies. The second model introduces a Region of Interest (ROI) detection head to selectively focus on lane-relevant regions, thereby improving precision and reducing computational complexity. Experiments conducted on the TuSimple dataset (highway driving scenarios) demonstrate that the proposed second model achieves 93.40% accuracy while significantly reducing false negatives. Compared to existing 2D and 3D baselines, our approach achieves improved performance with fewer parameters and reduced latency. The architecture has been validated through offline training and real-time inference in the Autonomous Systems Laboratory at City, St George's University of London. These results suggest that the proposed models are well-suited for integration into Advanced Driver Assistance Systems (ADAS), with potential scalability toward full Lane Assist Systems (LAS).
[182] UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
Yongkang Li,Lijun Zhou,Sixu Yan,Bencheng Liao,Tianyi Yan,Kaixin Xiong,Long Chen,Hongwei Xie,Bing Wang,Guang Chen,Hangjun Ye,Wenyu Liu,Haiyang Sun,Xinggang Wang
Main category: cs.CV
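TL;DR: This paper proposes UniDriveVLA, which uses a Mixture-of-Transformers to decouple perception and reasoning experts, resolving the conflict between spatial perception and semantic reasoning in vision-language-action models for autonomous driving and reaching state-of-the-art results on multiple tasks.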
Details
Motivation: Existing VLA models for autonomous driving face a trade-off between spatial perception and semantic reasoning, which stems from optimizing the two jointly within shared parameters.
Method: Propose UniDriveVLA, a Mixture-of-Transformers model with three experts for driving understanding, scene perception, and action planning, coordinated through masked joint attention, and combine a sparse perception paradigm with a three-stage progressive training strategy.
Result: State-of-the-art performance on nuScenes (open-loop) and Bench2Drive (closed-loop), along with strong results on 3D detection, online mapping, motion forecasting, and driving-oriented VQA.
Conclusion: By decoupling experts, UniDriveVLA effectively alleviates the perception-reasoning conflict and serves as a unified autonomous driving VLA model with strong generalization.
Abstract: Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive.
Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla
[183] SCALE: Semantic- and Confidence-Aware Conditional Variational Autoencoder for Zero-shot Skeleton-based Action Recognition
Soroush Oraki,Feng Ding,Jie Liang
Main category: cs.CV
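TL;DR: This paper proposes SCALE, a framework that performs zero-shot skeleton-based action recognition with a semantic- and confidence-aware listwise energy model. It avoids explicit skeleton-text alignment and uses a conditional variational autoencoder with new loss functions to improve recognition of unseen classes.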
Details
Motivation: Existing zero-shot skeleton-based action recognition methods depend on explicit skeleton-text alignment, which is brittle when action names fail to describe fine-grained dynamics and when unseen classes are semantically confusable.
Method: SCALE is a lightweight, deterministic, semantic- and confidence-aware listwise energy framework: 1) it builds a text-conditioned Conditional Variational Autoencoder (CVAE) in which frozen text representations parameterize the latent prior and the decoder; 2) it designs a semantic- and confidence-aware listwise energy loss that emphasizes semantically similar hard negatives and incorporates posterior uncertainty; 3) it adds a latent prototype contrast objective that aligns posterior means with text-derived latent prototypes.
Result: On NTU-60 and NTU-120, SCALE consistently outperforms prior VAE- and alignment-based methods and is competitive with diffusion-based methods.
Conclusion: Through energy modeling, uncertainty-aware optimization, and latent semantic alignment, SCALE improves the robustness and generalization of zero-shot skeleton-based action recognition without generating samples or relying on explicit alignment.
Abstract: Zero-shot skeleton-based action recognition (ZSAR) aims to recognize action classes without any training skeletons from those classes, relying instead on auxiliary semantics from text. Existing approaches frequently depend on explicit skeleton-text alignment, which can be brittle when action names underspecify fine-grained dynamics and when unseen classes are semantically confusable. We propose SCALE, a lightweight and deterministic Semantic- and Confidence-Aware Listwise Energy-based framework that formulates ZSAR as class-conditional energy ranking. SCALE builds a text-conditioned Conditional Variational Autoencoder where frozen text representations parameterize both the latent prior and the decoder, enabling likelihood-based evaluation for unseen classes without generating samples at test time. To separate competing hypotheses, we introduce a semantic- and confidence-aware listwise energy loss that emphasizes semantically similar hard negatives and incorporates posterior uncertainty to adapt decision margins and reweight ambiguous training instances. Additionally, we utilize a latent prototype contrast objective to align posterior means with text-derived latent prototypes, improving semantic organization and class separability without direct feature matching. Experiments on NTU-60 and NTU-120 datasets show that SCALE consistently improves over prior VAE- and alignment-based baselines while remaining competitive with diffusion-based methods.
[184] UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models
Qiyao Zhang,Shuhua Zheng,Jianli Sun,Chengxiang Li,Xianke Wu,Zihan Song,Zhiyong Cui,Yisheng Lv,Yonglin Tian
Main category: cs.CV
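TL;DR: This paper proposes UAV-Track VLA, a new embodied visual tracking method for Unmanned Aerial Vehicles (UAVs), builds a large-scale multimodal tracking benchmark and dataset, and validates clear advantages in long-distance pedestrian tracking, zero-shot generalization, and real-time performance in the CARLA simulator.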
Details
Motivation: Existing VLA models suffer from temporal feature redundancy and lack spatial geometric priors, and there is no dedicated benchmark or model for embodied visual tracking in dynamic urban environments.
Method: Build a large-scale dataset and evaluation benchmark with over 890K frames, 176 tasks, and 85 object types. Based on the $\pi_{0.5}$ architecture, propose UAV-Track VLA, which introduces a temporal compression net to model inter-frame dynamics efficiently and a parallel dual-branch decoder (a spatial-aware auxiliary grounding head plus a flow matching action expert) to decouple cross-modal features and generate fine-grained continuous actions.
Result: In CARLA, long-distance pedestrian tracking reaches a 61.76% success rate and 269.65 average tracking frames; the model shows strong zero-shot generalization; single-step inference latency drops by 33.4% to 0.0571 s.
Conclusion: UAV-Track VLA improves the accuracy, robustness, and real-time performance of multimodal embodied tracking and offers a new paradigm for autonomous UAV navigation in complex urban environments.
Abstract: Embodied visual tracking is crucial for Unmanned Aerial Vehicles (UAVs) executing complex real-world tasks. In dynamic urban scenarios with complex semantic requirements, Vision-Language-Action (VLA) models show great promise due to their cross-modal fusion and continuous action generation capabilities. To benchmark multimodal tracking in such environments, we construct a dedicated evaluation benchmark and a large-scale dataset encompassing over 890K frames, 176 tasks, and 85 diverse objects. Furthermore, to address temporal feature redundancy and the lack of spatial geometric priors in existing VLA models, we propose an improved VLA tracking model, UAV-Track VLA. Built upon the $\pi_{0.5}$ architecture, our model introduces a temporal compression net to efficiently capture inter-frame dynamics. Additionally, a parallel dual-branch decoder comprising a spatial-aware auxiliary grounding head and a flow matching action expert is designed to decouple cross-modal features and generate fine-grained continuous actions. Systematic experiments in the CARLA simulator validate the superior end-to-end performance of our method. Notably, in challenging long-distance pedestrian tracking tasks, UAV-Track VLA achieves a 61.76% success rate and 269.65 average tracking frames, significantly outperforming existing baselines. Furthermore, it demonstrates robust zero-shot generalization in unseen environments and reduces single-step inference latency by 33.4% (to 0.0571s) compared to the original $\pi_{0.5}$, enabling highly efficient, real-time UAV control.
Data samples and demonstration videos are available at: https://github.com/Hub-Tian/UAV-Track_VLA
[185] SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation
Naomi Kombol,Ivan Martinović,Siniša Šegvić,Giorgos Tolias
Main category: cs.CV
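TL;DR: This paper proposes SPAR, a ViT dense feature extractor that handles images of any resolution in a single forward pass without architectural changes. It uses knowledge distillation to transfer the spatial reasoning of a sliding-window teacher to the student, substantially improving mIoU on open-vocabulary segmentation.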
Details
Motivation: Foundational Vision Transformers (ViTs) are limited on tasks that need fine-grained spatial understanding, such as open-vocabulary segmentation, because of their fixed pre-training resolution and coarse patch-level representations; existing high-resolution approaches such as sliding windows are computationally expensive.
Method: Propose SPAR, a single-pass, any-resolution ViT. A feature regression loss distills the spatial reasoning of a finely-strided sliding-window teacher into a single-pass student, with no architectural changes or pixel-level supervision.
Result: On open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, confirming its efficient high-resolution reasoning.
Conclusion: SPAR delivers resolution-agnostic, efficient, and accurate dense prediction, offering a practical new paradigm for ViTs on tasks that require fine-grained spatial understanding.
Abstract: Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR
[186] Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models
Yaoteng Tan,Zikui Cai,M. Salman Asif
Main category: cs.CV
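TL;DR: This paper proposes a training-free safety control framework for text-to-image generation based on inference-time gradient feedback. It uses frozen multimodal foundation models as semantic energy estimators to steer generation dynamically during sampling, balancing safety and generation quality.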
Details
Motivation: Existing safety approaches rely on model fine-tuning or manually curated datasets, which can degrade generation quality or limit scalability; an efficient, general safety control mechanism that leaves the original model unmodified is needed.
Method: Propose an inference-time steering framework built on frozen vision-language foundation models. The foundation models' semantic representations serve as off-the-shelf supervisory signals, and feedback is injected through clean latent estimates at each sampling step, which formulates safety steering as an energy-based sampling problem.
Result: The method achieves state-of-the-art robustness on NSFW red-teaming benchmarks, supports multi-target steering, and preserves high generation quality on benign non-targeted prompts; it is compatible with both diffusion and flow-matching models and generalizes across visual concepts.
Conclusion: The framework offers a principled, modular, training-free, and scalable safety control paradigm for text-to-image generation and shows that foundation models can serve as reliable semantic energy estimators.
Abstract: Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that leverages gradient feedback from frozen pretrained foundation models to guide the generation process without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting such feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and can generalize across diverse visual concepts. Experiments demonstrate state-of-the-art robustness against NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Our framework provides a principled approach for utilizing foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.
[187] Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation
Chongjie Ye,Cheng Cao,Chuanyu Pan,Yiming Hao,Yihao Zhi,Yuanming Hu,Xiaoguang Han
Main category: cs.CV
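TL;DR: Omni123 is a 3D-native multimodal foundation model that unifies text, images, and 3D as discrete token sequences and jointly models cross-modal consistency in an autoregressive framework. It uses abundant 2D data as a geometric prior to improve 3D generation quality, without requiring strictly aligned triplet data.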
Details
Motivation: Existing 3D generation methods are constrained by the scarcity of high-quality 3D data and often rely on indirect pipelines that edit in 2D and lift to 3D through optimization, which causes geometric inconsistency. A native 3D model that generates 3D directly and consistently while exploiting massive 2D data is needed.
Method: Omni123 1) encodes text, images, and 3D as a shared discrete token sequence; 2) uses an interleaved X-to-X training paradigm that trains jointly on heterogeneous paired data (such as text-image and image-3D pairs); 3) builds semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences to jointly enforce semantic alignment, appearance fidelity, and multi-view geometric consistency.
Result: The model significantly outperforms existing methods on text-guided 3D generation and editing, confirms that 2D data used as a geometric prior improves 3D representations, and shows a scalable path toward multimodal 3D world models.
Conclusion: Omni123 demonstrates that cross-modal tokenization and cycle-consistency modeling can drive 3D-native generation from 2D data efficiently without large-scale 3D annotation, providing a new paradigm for unified multimodal 3D foundation models.
Abstract: Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.
[188] AdamFlow: Adam-based Wasserstein Gradient Flows for Surface Registration in Medical Imaging
Qiang Ma,Qingjie Meng,Xin Hu,Yicheng Wu,Wenjia Bai
Main category: cs.CV
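TL;DR: This paper proposes AdamFlow, a fast surface registration method based on probability measures and the sliced Wasserstein distance, which balances efficiency and robustness and performs well on anatomical structure registration.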
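For readers unfamiliar with the sliced Wasserstein distance mentioned above, the following NumPy sketch gives a Monte Carlo estimate between two point sets: project both sets onto random unit directions, sort the 1D projections, and average the squared differences. It assumes equal-size point sets with uniform weights, which is a simplification of treating meshes as probability measures; the function name, the number of projections, and the seed are illustrative choices, and this is not the AdamFlow optimiser itself.

```python
import numpy as np


def sliced_wasserstein_sq(x, y, n_projections=128, seed=0):
    """Estimate the squared sliced 2-Wasserstein distance.

    x, y: arrays of shape (n, d) with the same n (uniform point weights).
    """
    if x.shape != y.shape:
        raise ValueError("this sketch assumes point sets of identical shape")
    rng = np.random.default_rng(seed)
    # Random directions on the unit sphere in R^d.
    dirs = rng.normal(size=(n_projections, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # 1D optimal transport reduces to matching sorted projections,
    # so each slice costs O(n log n) because of the sort.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.mean((px - py) ** 2))


rng = np.random.default_rng(1)
points = rng.normal(size=(200, 3))   # hypothetical surface samples
shifted = points + 1.0               # the same shape, translated
print(sliced_wasserstein_sq(points, points))
print(sliced_wasserstein_sq(points, shifted))
```

The per-slice sort is where the log-linear cost comes from; the estimate is zero for identical point sets and grows as one set is moved away from the other.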
Details
Motivation: Existing surface registration methods trade efficiency against robustness: local point matching is fast but sensitive to noise and initialisation, while global point set registration is robust but computationally expensive.
Method: Model surface meshes as probability measures and cast registration as a distributional optimisation problem; measure mesh discrepancy with a sliced Wasserstein distance of log-linear complexity; propose the AdamFlow optimiser, which generalises Adam from Euclidean space to the probability space.
Result: Theoretical analysis establishes the asymptotic convergence of AdamFlow; experiments show that it outperforms existing methods on affine and non-rigid registration across various anatomical structures, combining efficiency with robustness.
Conclusion: The method eases the efficiency-robustness trade-off and provides a practical, reliable surface registration approach for anatomical shape analysis in medical imaging.
Abstract: Surface registration plays an important role for anatomical shape analysis in medical imaging. Existing surface registration methods often face a trade-off between efficiency and robustness. Local point matching methods are computationally efficient, but vulnerable to noise and initialisation. Methods designed for global point set alignment tend to incur a high computational cost. To address the challenge, here we present a fast surface registration method, which formulates surface meshes as probability measures and surface registration as a distributional optimisation problem. The discrepancy between two meshes is measured using an efficient sliced Wasserstein distance with log-linear computational complexity. We propose a novel optimisation method, AdamFlow, which generalises the well-known Adam optimisation method from the Euclidean space to the probability space for minimising the sliced Wasserstein distance. We theoretically analyse the asymptotic convergence of AdamFlow and empirically demonstrate its superior performance in both affine and non-rigid surface registration across various anatomical structures.
[189] VOID: Video Object and Interaction Deletion
Saman Motamed,William Harvey,Benjamin Klein,Luc Van Gool,Zhuoning Yuan,Ta-Ying Cheng
Main category: cs.CV
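TL;DR: This paper proposes VOID, a video object removal framework aimed at complex scenes with physical interactions such as collisions. It generates a counterfactual dataset and combines a vision-language model with a video diffusion model to produce physically plausible video inpainting.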
Details
Motivation: Existing video object removal methods fail when the removed object has significant physical interactions with others (such as collisions), producing unrealistic results; video editing models need a better grasp of physical causality.
Method: Build a paired dataset of counterfactual object removals using Kubric and HUMOTO; use a vision-language model to locate regions affected by the removed object; condition a video diffusion model on those regions to generate physically consistent counterfactual outcomes.
Result: Experiments on synthetic and real data show that VOID preserves consistent scene dynamics better than existing methods, especially for removals that involve physical interactions.
Conclusion: VOID shows that high-level causal reasoning can be incorporated into video editing models and points toward video generation systems that are better simulators of the world.
Abstract: Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
[190] A Simple Baseline for Streaming Video Understanding
Yujiao Shen,Shulin Tian,Jingkang Yang,Ziwei Liu
Main category: cs.CV
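TL;DR: This paper proposes SimpleStream, a sliding-window baseline that uses only the most recent N frames yet performs strongly on streaming video understanding, even surpassing models with complex memory mechanisms. Experiments show that the value of long context depends on the backbone rather than on model scale and reveal a trade-off between perception and memory; the authors recommend that future benchmarks separate recent-scene perception from long-range memory.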
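The sliding-window baseline summarized above can be pictured with a short sketch: keep a bounded buffer of the latest frames and hand only that buffer to the model whenever a question arrives. Everything named here (`RecentFrameWindow`, `answer_query`, `stub_vlm`) is a hypothetical stand-in; in particular `stub_vlm` replaces a real vision-language model call, whose interface is not specified in this entry.

```python
from collections import deque
from typing import Callable, Deque, List


class RecentFrameWindow:
    """Keep only the most recent `num_frames` frames of a stream."""

    def __init__(self, num_frames: int = 4) -> None:
        if num_frames < 1:
            raise ValueError("num_frames must be at least 1")
        # deque with maxlen drops the oldest frame automatically.
        self._frames: Deque[str] = deque(maxlen=num_frames)

    def push(self, frame: str) -> None:
        self._frames.append(frame)

    def context(self) -> List[str]:
        return list(self._frames)


def answer_query(window: RecentFrameWindow, question: str,
                 vlm: Callable[[List[str], str], str]) -> str:
    """Feed only the buffered recent frames to the model."""
    return vlm(window.context(), question)


def stub_vlm(frames: List[str], question: str) -> str:
    # Placeholder for an off-the-shelf VLM; it just reports what it was given.
    return f"{question} -> saw {len(frames)} frames ending at {frames[-1]}"


window = RecentFrameWindow(num_frames=4)
for t in range(10):
    window.push(f"frame{t}")  # frames here are string ids for illustration
reply = answer_query(window, "What is happening now?", stub_vlm)
print(reply)
```

The buffer has no retrieval or compression step, so memory cost stays constant in stream length; anything older than the window is simply unavailable to the model.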
Details
Motivation: Challenge the current reliance on complex memory mechanisms in streaming video understanding, test how effective a simple sliding window is, and re-examine the need for long-context modeling and how it is evaluated.
Method: Propose SimpleStream, which feeds only the most recent N frames to an off-the-shelf vision-language model (VLM) for streaming inference; compare it with 13 offline and online video LLMs on OVO-Bench and StreamingBench; run controlled ablations on context length, backbone, and the perception-memory trade-off.
Result: With only 4 frames, SimpleStream reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. The benefit of longer context depends on the backbone, and adding historical frames improves recall but weakens real-time perception.
Conclusion: Complex memory modules should count as progress only if they clearly outperform SimpleStream; future streaming video benchmarks should explicitly separate recent-scene perception from long-range memory so that improvements can be assessed more clearly.
Abstract: Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
[191] Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
Junxuan Li,Rawal Khirodkar,Chengan He,Zhongshi Jiang,Giljoo Nam,Lingchen Yang,Jihyun Lee,Egor Zakharov,Zhaoen Su,Rinat Abdrashitov,Yuan Dong,Julieta Martinez,Kai Li,Qingyang Tan,Takaaki Shiratori,Matthew Hu,Peihong Guo,Xuhua Huang,Ariyan Zarei,Marco Pesavento,Yichen Xu,He Wen,Teng Deng,Wyatt Borsos,Anjali Thakrar,Jean-Charles Bazin,Carsten Stoll,Ginés Hidalgo,James Booth,Lucy Wang,Xiaowen Ma,Yu Rong,Sairanjith Thalanki,Chen Cao,Christian Häne,Abhishek Kar,Sofien Bouaziz,Jason Saragih,Yaser Sheikh,Shunsuke Saito
Main category: cs.CV
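TL;DR: This paper proposes Large-Scale Codec Avatars (LCA), which uses a pre-training (1M in-the-wild videos) and post-training (high-quality multi-view data) paradigm to achieve both fidelity and generalization in 3D avatar modeling. It advances identity preservation, expression and finger control, and diversity across hairstyles, clothing, and demographics, and shows emergent abilities such as relighting, loose garment modeling, and zero-shot robustness to stylized imagery.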
Details
Motivation: High-fidelity 3D avatar modeling faces a trade-off between fidelity and generalization: multi-view studio data gives high fidelity but generalizes poorly, while large-scale in-the-wild data generalizes well but is low in quality and ambiguous in 3D.
Method: Propose LCA, a large-scale codec model for 3D avatars that follows a two-stage paradigm: pre-training on 1M in-the-wild videos to learn appearance and geometry priors, then post-training on high-quality multi-view data to improve expressivity and fidelity, with efficient feedforward inference.
Result: LCA generalizes across hairstyles, clothing, and demographics, supports fine-grained facial expressions and finger-level articulation control, and preserves identity well; it also exhibits emergent relighting and loose garment modeling without direct supervision, and zero-shot robustness to stylized imagery.
Conclusion: LCA brings the pre/post-training paradigm of large models to 3D avatar modeling for the first time and bridges the gap between fidelity and generalization, providing a new paradigm for high-quality 3D modeling of world-scale populations.
Abstract: High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.
[192] Beyond Referring Expressions: Scenario Comprehension Visual Grounding
Ruozhen He,Nisarg A. Shah,Qihua Dong,Zilin Xiao,Jaywon Koo,Vicente Ordonez
Main category: cs.CV
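TL;DR: This paper introduces a new visual grounding task, Referring Scenario Comprehension (RSC), in which the target must be inferred from roles, intentions, and relational context rather than explicit naming. It builds a new benchmark with about 31k training examples and a companion model, ScenGround, showing that the benchmark exposes systematic weaknesses in existing models and that curriculum learning is effective.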
Details
Motivation: Existing visual grounding benchmarks mainly evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category and need no understanding of deeper semantics such as roles, intentions, and relations. A more challenging, scenario-based grounding paradigm is therefore needed.
Method: Introduce the Referring Scenario Comprehension (RSC) benchmark, with paragraph-length queries, fine-grained difficulty tags (uniqueness, clutter, size, overlap, position), and an out-of-distribution test split; design ScenGround, which combines supervised warm-starting with a difficulty-aware reinforcement learning curriculum.
Result: Scenario-based queries systematically expose failure modes in current models that standard benchmarks do not reveal; curriculum training improves performance on difficult subsets and transfers positively to standard benchmarks.
Conclusion: Scenario-based visual grounding is a key direction for moving models from surface matching toward deeper semantic reasoning; the RSC benchmark and ScenGround provide an interpretable and extensible basis for evaluation and modeling in this direction.
Abstract: Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
[193] Steerable Visual Representations
Jona Ruthardt,Manu Gaur,Deva Ramanan,Makarand Tapaswi,Yuki M. Asano
Main category: cs.CV
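TL;DR: This paper proposes Steerable Visual Representations, visual representations that can be steered with natural language. Text is injected early into the ViT visual encoder through lightweight cross-modal attention, which gives semantic steering of both global and local features while balancing generic vision-task performance with text controllability.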
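The early-fusion idea summarized above can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the function name `text_cross_attention`, the single-head layout, the residual connection, and all shapes and weights are assumptions made for illustration. Visual tokens act as queries and text tokens as keys and values, so the text can shift the visual features inside an encoder layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_cross_attention(visual, text, w_q, w_k, w_v, w_o):
    """Hypothetical single-head residual cross-attention.

    visual: (n_patches, d) visual tokens, used as queries.
    text:   (n_words, d) text tokens, used as keys and values.
    Returns visual tokens with a text-conditioned update added.
    """
    q = visual @ w_q
    k = text @ w_k
    v = text @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_patches, n_words)
    return visual + (attn @ v) @ w_o  # residual keeps the original features

rng = np.random.default_rng(0)
d = 8
visual = rng.normal(size=(16, d))  # stand-in for ViT patch tokens
text = rng.normal(size=(4, d))     # stand-in for text embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
w_o = rng.normal(size=(d, d)) * 0.1

steered = text_cross_attention(visual, text, w_q, w_k, w_v, w_o)
# With a zero output projection the layer is an exact identity,
# so the unsteered representation is left untouched.
unchanged = text_cross_attention(visual, text, w_q, w_k, w_v, np.zeros((d, d)))
```

The zero-initialized output projection is one common way to add a new branch to a pretrained encoder without disturbing its features at the start of training; whether the paper uses this choice is not stated in the abstract.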
Details
Motivation: Pretrained ViTs (e.g., DINOv2, MAE) are generic but cannot be directed toward non-salient concepts. Multimodal LLMs can be guided by text, but their representations become language-centric and lose effectiveness on generic visual tasks, and most vision-language models (e.g., CLIP) fuse text only after encoding. A representation that is both steerable and visually faithful is needed. Method: Proposes an early-fusion mechanism: text embeddings are injected directly into the layers of the ViT visual encoder through lightweight cross-attention modules, producing global and local visual features that can be steered by natural language; also introduces new benchmarks for measuring representational steerability. Result: The method can focus on any specified object in an image while preserving the underlying representation quality; it matches or outperforms dedicated methods on anomaly detection and personalized object discrimination and shows zero-shot generalization to out-of-distribution tasks. Conclusion: Steerable Visual Representations bridge the gap between generic visual representations and text controllability, offering a new path toward flexible, robust, and semantically interpretable vision systems. Abstract: Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
[194] Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection
Alex Costanzino,Pierluigi Zama Ramirez,Giuseppe Lisanti,Luigi Di Stefano
Main category: cs.CV
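TL;DR: ModMap is a natively multiview, multimodal framework for 3D anomaly detection and segmentation. It maps features across views and modalities, models view-dependent relationships through feature-wise modulation, introduces a cross-view training strategy and a dedicated depth encoder, and achieves state-of-the-art performance on the SiM3D benchmark.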
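As a rough illustration of what feature-wise modulation can look like, the sketch below applies a per-channel scale and shift predicted from a view embedding, in the style of FiLM. It is an assumption that ModMap's modulation takes this form; the function name `film_modulate`, the one-hot view embeddings, and the weight matrices are hypothetical.

```python
import numpy as np

def film_modulate(features, view_embedding, w_gamma, w_beta):
    """Hypothetical FiLM-style modulation conditioned on a view.

    features:       (n, d) feature vectors from one view.
    view_embedding: (e,) vector identifying the view.
    w_gamma/w_beta: (e, d) projections to per-channel scale and shift.
    """
    gamma = 1.0 + view_embedding @ w_gamma  # scale, centered on identity
    beta = view_embedding @ w_beta          # shift
    return gamma * features + beta          # broadcast over the n rows

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4))
view_a = np.array([1.0, 0.0, 0.0])  # one-hot view identifiers (assumed)
view_b = np.array([0.0, 1.0, 0.0])
w_gamma = rng.normal(size=(3, 4)) * 0.1
w_beta = rng.normal(size=(3, 4)) * 0.1

out_a = film_modulate(features, view_a, w_gamma, w_beta)
out_b = film_modulate(features, view_b, w_gamma, w_beta)
# A zero view embedding gives gamma = 1 and beta = 0, i.e. no modulation.
identity = film_modulate(features, np.zeros(3), w_gamma, w_beta)
```

The same features are transformed differently depending on the conditioning view, which is the basic property a view-dependent feature mapping needs.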
Details
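Motivation: Existing methods process each view independently, do not effectively model the relationships across views and modalities, and lack a depth encoder suited to high-resolution 3D data from industrial settings. Method: Proposes the ModMap framework, which adopts the crossmodal feature mapping paradigm and explicitly models view-dependent relationships through feature-wise modulation; designs a cross-view training strategy that uses all view combinations for multiview ensembling and aggregation; trains and publicly releases a depth encoder for industrial datasets. Result: Surpasses previous methods by wide margins on the SiM3D benchmark, reaching state-of-the-art performance in 3D anomaly detection and segmentation. Conclusion: By jointly modeling multiview and multimodal information, ModMap demonstrates the importance of cross-view feature interaction and a dedicated encoder for 3D anomaly analysis, and offers a new paradigm for industrial visual inspection. Abstract: We present ModMap, a natively multiview and multimodal framework for 3D anomaly detection and segmentation. Unlike existing methods that process views independently, our method draws inspiration from the crossmodal feature mapping paradigm to learn to map features across both modalities and views, while explicitly modelling view-dependent relationships through feature-wise modulation. We introduce a cross-view training strategy that leverages all possible view combinations, enabling effective anomaly scoring through multiview ensembling and aggregation. To process high-resolution 3D data, we train and publicly release a foundational depth encoder tailored to industrial datasets. Experiments on SiM3D, a recent benchmark that introduces the first multiview and multimodal setup for 3D anomaly detection and segmentation, demonstrate that ModMap attains state-of-the-art performance by surpassing previous methods by wide margins.
[195] Generative World Renderer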
Zheng-Hui Huang,Zhixiang Wang,Jiaming Tan,Ruihan Yu,Yidan Zhang,Bo Zheng,Yu-Lun Liu,Yung-Yu Chuang,Kaipeng Zhang
Main category: cs.CV
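TL;DR: This paper presents a large-scale dynamic game dataset for improving generative inverse and forward rendering in real-world scenarios. A dual-screen stitched capture method collects 4 million frames of synchronized RGB and G-buffer data, and a VLM-based evaluation protocol that needs no ground truth is proposed.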
Details
Motivation: Existing synthetic datasets lack realism and temporal coherence, which makes it hard to scale generative inverse and forward rendering to real-world scenarios and leaves a significant domain gap. Method: Proposes a dual-screen stitched capture method to build a large-scale dynamic dataset from AAA games with 4M continuous frames of synchronized RGB and five G-buffer channels; designs a VLM-based evaluation protocol that measures semantic, spatial, and temporal consistency; develops a toolkit for G-buffer-guided, text-driven style editing. Result: Inverse renderers fine-tuned on the dataset show better cross-dataset generalization and controllable generation; the VLM evaluation correlates strongly with human judgment; the forward-rendering toolkit supports G-buffer-driven style editing of AAA game footage from text prompts. Conclusion: The dataset and evaluation protocol effectively narrow the domain gap between synthetic data and real-world scenes, and provide a new benchmark and practical tools for bidirectional rendering. Abstract: Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.
[196] ActionParty: Multi-Subject Action Binding in Generative Video Games
Alexander Pondaven,Ziyi Wu,Igor Gilitschenski,Philip Torr,Sergey Tulyakov,Fabio Pizzati,Aliaksandr Siarohin
Main category: cs.CV
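TL;DR: This paper proposes ActionParty, a generative video world model that supports multi-subject action control. By introducing subject state tokens and a spatial biasing mechanism, it addresses the action binding problem in existing video diffusion models and demonstrates simultaneous control of up to seven players on the Melting Pot benchmark.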
Details
Motivation: Existing video diffusion models are largely restricted to single-agent settings, cannot control multiple agents at once, and bind actions to subjects ambiguously. Method: Proposes the ActionParty model, which introduces subject state tokens that persistently represent the state of each subject and jointly models these state tokens and the video latents with a spatial biasing mechanism, disentangling global frame rendering from individual action-driven subject updates. Result: Across 46 diverse environments of the Melting Pot benchmark, it is the first to control up to seven players simultaneously; it significantly improves action-following accuracy and identity consistency, and supports robust autoregressive subject tracking through complex interactions. Conclusion: ActionParty addresses the action binding problem in multi-subject video world models and offers a scalable, controllable approach for generative video games and interactive environment simulation. Abstract: Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
[197] EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors
Luca Bartolomei,Fabio Tosi,Matteo Poggi,Stefano Mattoccia,Guillermo Gallego
Main category: cs.CV
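TL;DR: EventHub is a framework for training deep event-based stereo networks without ground-truth annotations. It uses only standard color images to generate proxy annotations and proxy events, which improves generalization in challenging scenarios such as nighttime.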